pipeline.src.flows.regulations_checkup
======================================

.. py:module:: pipeline.src.flows.regulations_checkup


Functions
---------

.. autoapisummary::

   pipeline.src.flows.regulations_checkup.make_html_hyperlinks
   pipeline.src.flows.regulations_checkup.extract_monitorfish_regulations
   pipeline.src.flows.regulations_checkup.extract_legipeche_regulations
   pipeline.src.flows.regulations_checkup.add_article_id
   pipeline.src.flows.regulations_checkup.get_extraction_datetimes
   pipeline.src.flows.regulations_checkup.get_modified_regulations
   pipeline.src.flows.regulations_checkup.transform_modified_regulations
   pipeline.src.flows.regulations_checkup.get_missing_references
   pipeline.src.flows.regulations_checkup.get_unknown_links
   pipeline.src.flows.regulations_checkup.get_dead_links
   pipeline.src.flows.regulations_checkup.format_dead_links
   pipeline.src.flows.regulations_checkup.get_outdated_references
   pipeline.src.flows.regulations_checkup.format_outdated_references
   pipeline.src.flows.regulations_checkup.get_main_template
   pipeline.src.flows.regulations_checkup.get_body_template
   pipeline.src.flows.regulations_checkup.get_style
   pipeline.src.flows.regulations_checkup.render_body
   pipeline.src.flows.regulations_checkup.render_main
   pipeline.src.flows.regulations_checkup.get_recipients
   pipeline.src.flows.regulations_checkup.create_message
   pipeline.src.flows.regulations_checkup.send_message
   pipeline.src.flows.regulations_checkup.regulations_checkup_flow


Module Contents
---------------

.. py:function:: make_html_hyperlinks(urls: Iterable, link_texts: Iterable, logger: logging.Logger = None) -> List[str]

   Returns a list of HTML hyperlink strings (anchor elements displaying
   `link_text` and pointing to the corresponding url) for the input `urls`
   and `link_texts`.

   :param urls: Iterable of urls
   :type urls: Iterable
   :param link_texts: Iterable of link texts
   :type link_texts: Iterable

   :returns: `list` of html links
   :rtype: List[str]


.. py:function:: extract_monitorfish_regulations() -> pandas.DataFrame

   Extracts regulation references from the monitorfish `regulations` table.

   The output DataFrame contains one line per regulatory reference, which means
   there can be multiple lines for one regulated zone, if the zone has several
   regulatory references.

   Output columns are `law_type`, `topic`, `zone`, `url` and `reference`.

   Regulatory zones without any regulatory reference are present in the output
   as a line with `None` as `url` and `reference` values.

   :returns: DataFrame of regulatory references
   :rtype: pd.DataFrame


.. py:function:: extract_legipeche_regulations() -> pandas.DataFrame

   Extracts legipeche regulations from the monitorfish `legipeche` table (which is
   scraped from legipeche by the `Scrape Legipeche` flow).

   The output has one line per document - there can be multiple documents for the
   same Legipeche page.

   Output columns are `extraction_datetime_utc`, `extraction_occurence`,
   `page_title`, `page_url`, `document_title`, and `document_url`.

   :returns: DataFrame of Legipeche regulations.
   :rtype: pd.DataFrame


.. py:function:: add_article_id(regulations: pandas.DataFrame, url_column: str) -> pandas.DataFrame

   Adds an `article_id` column to the `regulations` DataFrame, extracting the
   article_id from the `url_column` according to the Legipeche URL schema.

   Rows for which the URL does not match the Legipeche URL schema will have an
   article_id of `None`.

   :param regulations: DataFrame of regulations
   :type regulations: pd.DataFrame
   :param url_column: Name of the column containing URLs of regulation pages
   :type url_column: str

   :returns: copy of input `regulations` with an added `article_id` column
   :rtype: pd.DataFrame


.. py:function:: get_extraction_datetimes(legipeche_regulations: pandas.DataFrame) -> Tuple[str, str]

   Returns the extraction datetimes of the `previous` and `latest` legipeche
   extraction occurrences from the `legipeche_regulations` DataFrame.

   The input must have `extraction_occurence` and `extraction_datetime_utc` columns.

   :param legipeche_regulations: DataFrame of legipeche extractions.
   :type legipeche_regulations: pd.DataFrame

   :returns: extraction datetimes of `previous` and `latest` legipeche extractions
   :rtype: Tuple[str, str]


.. py:function:: get_modified_regulations(legipeche_regulations: pandas.DataFrame, monitorfish_regulations: pandas.DataFrame) -> pandas.DataFrame

   Filters the input `legipeche_regulations` and returns the legipeche regulations
   (documents) that:

   - have been either added to or removed from an existing Legipeche page between
     the `previous` and `latest` Legipeche scraping occurrences
   - belong to a Legipeche page referenced by at least one `monitorfish_regulation`

   :param legipeche_regulations:
   :type legipeche_regulations: pd.DataFrame
   :param monitorfish_regulations:
   :type monitorfish_regulations: pd.DataFrame

   :returns: filtered DataFrame of Legipeche regulations
   :rtype: pd.DataFrame


.. py:function:: transform_modified_regulations(modified_regulations: pandas.DataFrame, monitorfish_regulations: pandas.DataFrame) -> pandas.DataFrame

   Formats `modified_regulations` into a DataFrame suitable for printing in an email.

   :param modified_regulations: DataFrame with columns:

      - `extraction_occurence`, having values 'previous' and 'latest'
      - `page_url`
      - `document_title`
      - `document_url`
   :type modified_regulations: pd.DataFrame
   :param monitorfish_regulations: DataFrame with columns:

      - `url` (url of the regulatory reference in Monitorfish)
      - `reference` (name of the regulatory reference in Monitorfish)
      - `law_type`
      - `topic`
      - `zone`
   :type monitorfish_regulations: pd.DataFrame

   :returns: formatted DataFrame of regulation modifications
   :rtype: pd.DataFrame


.. py:function:: get_missing_references(monitorfish_regulations: pandas.DataFrame) -> pandas.DataFrame

   Returns the `monitorfish_regulations` whose `reference` is null.

   :param monitorfish_regulations: Monitorfish regulations. Must have columns:

      - `reference`
      - `law_type`
      - `topic`
      - `zone`
   :type monitorfish_regulations: pd.DataFrame

   :returns: Filtered and formatted version of input.
   :rtype: pd.DataFrame


.. py:function:: get_unknown_links(monitorfish_regulations: pandas.DataFrame, legipeche_regulations: pandas.DataFrame) -> set

   Returns the urls of `monitorfish_regulations` whose `article_id` is either:

   - not present in `legipeche_regulations` (i.e. referencing Legipeche articles
     that might not exist), or
   - null (which corresponds to urls that do not match the legipeche url pattern
     and which usually point to external websites).

   :param monitorfish_regulations:
   :type monitorfish_regulations: pd.DataFrame
   :param legipeche_regulations:
   :type legipeche_regulations: pd.DataFrame

   :returns: subset of `monitorfish_regulations.url`
   :rtype: set


.. py:function:: get_dead_links(monitorfish_regulations: pandas.DataFrame, unknown_links: set, proxies: dict) -> pandas.DataFrame

   Performs GET requests to check whether `unknown_links` are dead links, then
   returns the `monitorfish_regulations` whose regulatory reference is a dead link.

   :param monitorfish_regulations:
   :type monitorfish_regulations: pd.DataFrame
   :param unknown_links: `set` of unknown urls (i.e. urls not found when scraping
                         Legipeche)
   :type unknown_links: set
   :param proxies: proxies to use when requests time out without proxies
   :type proxies: dict

   :returns: filtered `monitorfish_regulations` with only those that reference a
             dead link
   :rtype: pd.DataFrame


.. py:function:: format_dead_links(dead_links: pandas.DataFrame) -> pandas.DataFrame

   Format input for printing.


.. py:function:: get_outdated_references(monitorfish_regulations: pandas.DataFrame, now: datetime.datetime) -> pandas.DataFrame

   Returns the `monitorfish_regulations` whose `end_date` is before `now`.

   :param monitorfish_regulations: DataFrame of Monitorfish regulations. Must have
                                   at least an `end_date` column.
   :type monitorfish_regulations: pd.DataFrame
   :param now: reference datetime against which `end_date` is compared
   :type now: datetime.datetime

   :returns: Subset of `monitorfish_regulations`
   :rtype: pd.DataFrame


.. py:function:: format_outdated_references(outdated_references: pandas.DataFrame) -> pandas.DataFrame

   Format input for printing.


.. py:function:: get_main_template() -> jinja2.environment.Template

.. py:function:: get_body_template() -> jinja2.environment.Template

.. py:function:: get_style() -> str

.. py:function:: render_body(body_template: jinja2.environment.Template, previous_extraction_datetime_utc: datetime.datetime, latest_extraction_datetime_utc: datetime.datetime, missing_references: pandas.DataFrame, modified_regulations: pandas.DataFrame, dead_links: pandas.DataFrame, outdated_references: pandas.DataFrame, backoffice_regulation_url: str, utcnow: datetime.datetime) -> str

   Renders email body as html string.


.. py:function:: render_main(main_template: jinja2.environment.Template, style: str, body: str) -> str

.. py:function:: get_recipients() -> List[str]

.. py:function:: create_message(html: str, recipients: List[str]) -> email.message.EmailMessage

.. py:function:: send_message(msg: email.message.EmailMessage)

.. py:function:: regulations_checkup_flow(proxies: dict = PROXIES, backoffice_regulation_url: str = BACKOFFICE_REGULATION_URL, get_utcnow_fn=get_utcnow, get_dead_links_fn: Callable = get_dead_links, send_message_fn: Callable = send_message)
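
The contract documented for `make_html_hyperlinks` (two parallel iterables in, one list of HTML link strings out) can be illustrated with a minimal standard-library sketch. This is not the pipeline's implementation: the name `make_html_hyperlinks_sketch`, the HTML escaping, the truncation to the shorter input, and the `example.org` URLs are all assumptions made for illustration, and the optional `logger` argument is left out.

```python
from html import escape
from typing import Iterable, List


def make_html_hyperlinks_sketch(urls: Iterable, link_texts: Iterable) -> List[str]:
    """Hypothetical stand-in for `make_html_hyperlinks`, for illustration only."""
    # Pair each url with its link text. zip() stops at the shorter input,
    # which is an assumption of this sketch, not documented behaviour.
    return [
        # Escaping the url and the text is also an assumption of this sketch.
        f'<a href="{escape(str(url), quote=True)}">{escape(str(text))}</a>'
        for url, text in zip(urls, link_texts)
    ]


links = make_html_hyperlinks_sketch(
    ["https://example.org/a", "https://example.org/b"],
    ["First document", "Second document"],
)
print(links[0])  # <a href="https://example.org/a">First document</a>
```

Strings of this shape can be placed directly in the DataFrame columns that the formatting functions prepare for the email body, which is presumably why the function returns plain `str` values rather than a richer type.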