pipeline.src.flows.regulations_checkup

Functions

make_html_hyperlinks(→ List[str])

Returns a list of html strings of links like <a href=url>link_text</a> for the input urls and link_texts.

extract_monitorfish_regulations(→ pandas.DataFrame)

Extracts regulation references from the monitorfish regulations table.

extract_legipeche_regulations(→ pandas.DataFrame)

Extracts legipeche regulations from the monitorfish legipeche table (which is scraped from legipeche by the Scrape Legipeche flow).

add_article_id(→ pandas.DataFrame)

Adds an article_id column to the regulations DataFrame, extracting the article_id from the url_column according to the Legipeche URL schema.

get_extraction_datetimes(→ Tuple[str, str])

Returns the extraction datetimes of previous and latest legipeche extraction occurrences from the legipeche_regulations DataFrame.

get_modified_regulations(→ pandas.DataFrame)

Filters the input legipeche_regulations and returns legipeche regulations (documents) that were added to or removed from a Legipeche page referenced in Monitorfish.

transform_modified_regulations(→ pandas.DataFrame)

Formats modified_regulations into a DataFrame suitable for printing in an email.

get_missing_references(→ pandas.DataFrame)

Returns monitorfish_regulations with null values as reference.

get_unknown_links(→ set)

Returns the urls of monitorfish_regulations whose article_id is either not present in legipeche_regulations or null.

get_dead_links(→ pandas.DataFrame)

Performs get requests to check whether unknown_links are dead links, then returns monitorfish_regulations that reference a dead link as regulatory reference.

format_dead_links(→ pandas.DataFrame)

Format input for printing.

get_outdated_references(→ pandas.DataFrame)

Returns monitorfish_regulations that have an end_date which is before now.

format_outdated_references(→ pandas.DataFrame)

Format input for printing.

get_main_template(→ jinja2.environment.Template)

get_body_template(→ jinja2.environment.Template)

get_style(→ str)

render_body(→ str)

Renders email body as html string.

render_main(→ str)

get_recipients(→ List[str])

create_message(→ email.message.EmailMessage)

send_message(msg)

regulations_checkup_flow([proxies, ...])

Module Contents

pipeline.src.flows.regulations_checkup.make_html_hyperlinks(urls, link_texts) → List[str]

Returns a list of html strings of links like <a href=url>link_text</a> for the input urls and link_texts.

Parameters:
  • urls (Iterable) – Iterable of urls

  • link_texts (Iterable) – Iterable of link texts

Returns:

list of html links

Return type:

List[str]
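
The pairing of urls and link texts described above can be sketched as follows. This is a minimal illustration, not the module's actual implementation: quoting the href attribute and escaping the inputs are assumptions added here for safety, and the example urls are placeholders.

```python
from html import escape
from typing import Iterable, List


def make_html_hyperlinks(urls: Iterable[str], link_texts: Iterable[str]) -> List[str]:
    # Pair each url with its link text; zip stops at the shorter of the two inputs.
    return [
        f'<a href="{escape(url, quote=True)}">{escape(text)}</a>'
        for url, text in zip(urls, link_texts)
    ]


# Placeholder inputs, for illustration only.
links = make_html_hyperlinks(
    ["https://example.org/a", "https://example.org/b"],
    ["Article A", "Article B"],
)
```

Escaping keeps a link text such as "A & B" from producing invalid html in the rendered email.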

pipeline.src.flows.regulations_checkup.extract_monitorfish_regulations() → pandas.DataFrame

Extracts regulation references from the monitorfish regulations table.

The output DataFrame contains one line per regulatory reference, so a regulated zone with several regulatory references appears on several lines.

Output columns are law_type, topic, zone, url and reference.

Regulatory zones without any regulatory reference are present in the output as a line with None as url and reference values.

Returns:

DataFrame of regulatory references

Return type:

pd.DataFrame

pipeline.src.flows.regulations_checkup.extract_legipeche_regulations() → pandas.DataFrame

Extracts legipeche regulations from the monitorfish legipeche table (which is scraped from legipeche by the Scrape Legipeche flow).

The output has one line per document. There can be multiple documents for the same Legipeche page.

Output columns are extraction_datetime_utc, extraction_occurence, page_title, page_url, document_title, and document_url.

Returns:

DataFrame of Legipeche regulations.

Return type:

pd.DataFrame

pipeline.src.flows.regulations_checkup.add_article_id(regulations: pandas.DataFrame, url_column: str) → pandas.DataFrame

Adds an article_id column to the regulations DataFrame, extracting the article_id from the url_column according to the Legipeche URL schema.

Rows for which the URL does not match the Legipeche URL schema will have an article_id of None.

Parameters:
  • regulations (pd.DataFrame) – DataFrame of regulations

  • url_column (str) – Name of the column containing URLs of regulation pages

Returns:

copy of input regulations with an added article_id column

Return type:

pd.DataFrame
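
A rough sketch of this extraction is given below. The regular expression is hypothetical: it assumes article URLs end in "-a<digits>.html", which may not match the real Legipeche URL schema. The helper extract_article_id and the example URLs are also introduced only for illustration.

```python
import re
from typing import Optional

import pandas as pd

# Hypothetical pattern, for illustration only: assumes article URLs end in
# "-a<digits>.html". The real Legipeche URL schema may differ.
ARTICLE_ID_PATTERN = re.compile(r"-a(\d+)\.html$")


def extract_article_id(url: object) -> Optional[str]:
    """Return the article id found in url, or None when there is no match."""
    if not isinstance(url, str):
        return None
    match = ARTICLE_ID_PATTERN.search(url)
    return match.group(1) if match else None


def add_article_id(regulations: pd.DataFrame, url_column: str) -> pd.DataFrame:
    # Work on a copy so that the input DataFrame is left untouched.
    result = regulations.copy()
    article_ids = [extract_article_id(url) for url in result[url_column]]
    # dtype=object keeps non-matching rows as None rather than NaN.
    result["article_id"] = pd.Series(article_ids, index=result.index, dtype=object)
    return result


# Placeholder URLs, for illustration only.
regulations = pd.DataFrame(
    {
        "url": [
            "http://legipeche.example/some-decree-a12345.html",
            "https://external.example/regulation.pdf",
            None,
        ]
    }
)
with_ids = add_article_id(regulations, url_column="url")
```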

pipeline.src.flows.regulations_checkup.get_extraction_datetimes(legipeche_regulations: pandas.DataFrame) → Tuple[str, str]

Returns the extraction datetimes of previous and latest legipeche extraction occurrences from the legipeche_regulations DataFrame.

The input must have extraction_occurence and extraction_datetime_utc columns.

Parameters:

legipeche_regulations (pd.DataFrame) – DataFrame of legipeche extractions.

Returns:

extraction datetimes of previous and latest legipeche extractions

Return type:

Tuple[str, str]
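
One way to read the two datetimes off the input is sketched below. It assumes that extraction_occurence takes the values 'previous' and 'latest' and that every row of an occurrence carries the same extraction_datetime_utc. The sample data is made up.

```python
from typing import Tuple

import pandas as pd


def get_extraction_datetimes(legipeche_regulations: pd.DataFrame) -> Tuple[str, str]:
    # Assumption: all rows of one occurrence share the same datetime,
    # so taking the first value per occurrence is enough.
    datetimes = legipeche_regulations.groupby("extraction_occurence")[
        "extraction_datetime_utc"
    ].first()
    return datetimes["previous"], datetimes["latest"]


# Made-up sample data, for illustration only.
legipeche_regulations = pd.DataFrame(
    {
        "extraction_occurence": ["previous", "previous", "latest"],
        "extraction_datetime_utc": [
            "2024-01-01 06:00:00",
            "2024-01-01 06:00:00",
            "2024-01-08 06:00:00",
        ],
        "document_url": ["doc-1", "doc-2", "doc-1"],
    }
)
previous_datetime, latest_datetime = get_extraction_datetimes(legipeche_regulations)
```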

pipeline.src.flows.regulations_checkup.get_modified_regulations(legipeche_regulations: pandas.DataFrame, monitorfish_regulations: pandas.DataFrame) → pandas.DataFrame

Filters the input legipeche_regulations and returns legipeche regulations (documents) that:

  • have been either added to or removed from an existing Legipeche page between the previous and latest Legipeche scraping occurrences

  • belong to a Legipeche page referenced by at least one monitorfish_regulation

Parameters:
  • legipeche_regulations (pd.DataFrame)

  • monitorfish_regulations (pd.DataFrame)

Returns:

filtered DataFrame of Legipeche regulations

Return type:

pd.DataFrame
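
The two filtering conditions above can be sketched with an outer merge. This is a simplified illustration under stated assumptions: pages are matched directly on page_url against the Monitorfish url column (the real flow may instead compare article ids), and an "existing" page is taken to mean a page present in both occurrences. The sample data is made up.

```python
import pandas as pd


def get_modified_regulations(
    legipeche_regulations: pd.DataFrame, monitorfish_regulations: pd.DataFrame
) -> pd.DataFrame:
    key = ["page_url", "document_url"]
    occurrence = legipeche_regulations["extraction_occurence"]
    previous = legipeche_regulations[occurrence == "previous"]
    latest = legipeche_regulations[occurrence == "latest"]

    # Documents found in only one of the two occurrences were added or removed.
    compared = previous[key].drop_duplicates().merge(
        latest[key].drop_duplicates(), on=key, how="outer", indicator=True
    )
    changed = compared.loc[compared["_merge"] != "both", key]

    # Assumption: an "existing" page is one present in both occurrences.
    existing_pages = set(previous["page_url"]) & set(latest["page_url"])
    referenced_pages = set(monitorfish_regulations["url"].dropna())
    pages = existing_pages & referenced_pages

    result = legipeche_regulations.merge(changed, on=key, how="inner")
    return result[result["page_url"].isin(pages)].reset_index(drop=True)


# Made-up sample data, for illustration only.
legipeche_regulations = pd.DataFrame(
    {
        "extraction_occurence": [
            "previous", "previous", "previous", "latest", "latest", "latest", "latest",
        ],
        "page_url": ["P1", "P1", "P2", "P1", "P1", "P2", "P3"],
        "document_url": ["d1", "d2", "d5", "d1", "d3", "d5", "d9"],
    }
)
monitorfish_regulations = pd.DataFrame({"url": ["P1", "P3", None]})
modified = get_modified_regulations(legipeche_regulations, monitorfish_regulations)
```

In this sample, d2 was removed from page P1 and d3 was added to it, so both are returned. Page P3 is new in the latest occurrence and page P2 did not change, so neither contributes any row.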

pipeline.src.flows.regulations_checkup.transform_modified_regulations(modified_regulations: pandas.DataFrame, monitorfish_regulations: pandas.DataFrame) → pandas.DataFrame

Formats modified_regulations into a DataFrame suitable for printing in an email.

Parameters:
  • modified_regulations (pd.DataFrame) –

    DataFrame with columns:

    • extraction_occurence, having values ‘previous’ and ‘latest’

    • page_url

    • document_title

    • document_url

  • monitorfish_regulations (pd.DataFrame) –

    DataFrame with columns:

    • url (url of the regulatory reference in Monitorfish)

    • reference (name of the regulatory reference in Monitorfish)

    • law_type

    • topic

    • zone

Returns:

formatted DataFrame of regulation modifications

Return type:

pd.DataFrame

pipeline.src.flows.regulations_checkup.get_missing_references(monitorfish_regulations: pandas.DataFrame) → pandas.DataFrame

Returns monitorfish_regulations with null values as reference.

Parameters:
  • monitorfish_regulations (pd.DataFrame) – Monitorfish regulations. Must have the following columns:

    • reference

    • law_type

    • topic

    • zone

Returns:

Filtered and formatted version of input.

Return type:

pd.DataFrame
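
A minimal sketch of this filter follows. Which columns are kept in the "formatted" output is an assumption made here (law_type, topic and zone); the sample rows are made up.

```python
import pandas as pd


def get_missing_references(monitorfish_regulations: pd.DataFrame) -> pd.DataFrame:
    # Keep only the rows whose reference is null.
    missing = monitorfish_regulations[monitorfish_regulations["reference"].isna()]
    # Assumption: the formatted output identifies the zone without the empty reference.
    return missing[["law_type", "topic", "zone"]].reset_index(drop=True)


# Made-up sample data, for illustration only.
monitorfish_regulations = pd.DataFrame(
    {
        "reference": ["Decree 1", None, "Decree 2"],
        "law_type": ["Type A", "Type A", "Type B"],
        "topic": ["Topic 1", "Topic 2", "Topic 3"],
        "zone": ["Zone 1", "Zone 2", "Zone 3"],
    }
)
missing_references = get_missing_references(monitorfish_regulations)
```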

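pipeline.src.flows.regulations_checkup.get_unknown_links(monitorfish_regulations: pandas.DataFrame, legipeche_regulations: pandas.DataFrame) → set

Returns the urls of monitorfish_regulations whose article_id is either not present in legipeche_regulations (i.e. referencing Legipeche articles that might not exist) or null (which corresponds to urls that do not match the legipeche url pattern and which usually point to external websites).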

Parameters:
  • monitorfish_regulations (pd.DataFrame)

  • legipeche_regulations (pd.DataFrame)

Returns:

subset of monitorfish_regulations.url

Return type:

set
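
The selection can be sketched as below, assuming both inputs already carry an article_id column (for instance as produced by add_article_id). Skipping rows that have no url at all is an assumption of this sketch, and the sample data is made up.

```python
import pandas as pd


def get_unknown_links(
    monitorfish_regulations: pd.DataFrame, legipeche_regulations: pd.DataFrame
) -> set:
    known_ids = set(legipeche_regulations["article_id"].dropna())
    article_ids = monitorfish_regulations["article_id"]
    # Unknown: no article_id at all, or an article_id never seen on Legipeche.
    unknown = article_ids.isna() | ~article_ids.isin(known_ids)
    # Assumption: rows without any url have nothing to check and are skipped.
    has_url = monitorfish_regulations["url"].notna()
    return set(monitorfish_regulations.loc[unknown & has_url, "url"])


# Made-up sample data, for illustration only.
monitorfish_regulations = pd.DataFrame(
    {
        "url": ["http://known", "http://missing", "http://external", None],
        "article_id": pd.Series(["1", "2", None, None], dtype=object),
    }
)
legipeche_regulations = pd.DataFrame(
    {"article_id": pd.Series(["1", "3"], dtype=object)}
)
unknown_links = get_unknown_links(monitorfish_regulations, legipeche_regulations)
```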

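pipeline.src.flows.regulations_checkup.get_dead_links(monitorfish_regulations: pandas.DataFrame, unknown_links: set, proxies: dict) → pandas.DataFrame

Performs get requests to check whether unknown_links are dead links, then returns monitorfish_regulations that reference a dead link as regulatory reference.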

Parameters:
  • monitorfish_regulations (pd.DataFrame)

  • unknown_links (set) – set of unknown urls (i.e. urls not found when scraping Legipeche)

  • proxies (dict) – proxies to use when requests time out without proxies

Returns:

filtered monitorfish_regulations with only those that reference a dead link

Return type:

pd.DataFrame
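
The retry-through-proxies behaviour can be sketched as below. To keep the example runnable without network access, the HTTP call is passed in as a hypothetical fetch_status callable, which is not part of the documented signature. In a real run it could wrap requests.get(url, proxies=proxies, timeout=...) and translate request timeouts into TimeoutError. Treating status codes of 400 and above as dead is also an assumption.

```python
from typing import Callable, Optional

import pandas as pd


def get_dead_links(
    monitorfish_regulations: pd.DataFrame,
    unknown_links: set,
    proxies: dict,
    fetch_status: Callable[[str, Optional[dict]], int],  # hypothetical parameter
) -> pd.DataFrame:
    dead = set()
    for url in unknown_links:
        try:
            status = fetch_status(url, None)
        except TimeoutError:
            # Retry through the proxies only when the direct request timed out.
            try:
                status = fetch_status(url, proxies)
            except TimeoutError:
                status = None
        # Assumption: no response at all, or a status >= 400, means a dead link.
        if status is None or status >= 400:
            dead.add(url)
    is_dead = monitorfish_regulations["url"].isin(dead)
    return monitorfish_regulations[is_dead].reset_index(drop=True)


def fake_fetch(url: str, proxies: Optional[dict]) -> int:
    # Stand-in for a real HTTP client, so that the sketch runs offline.
    if url == "http://down" or (url == "http://slow" and proxies is None):
        raise TimeoutError
    return {"http://ok": 200, "http://gone": 404, "http://slow": 200}[url]


# Made-up sample data, for illustration only.
regulations = pd.DataFrame(
    {
        "zone": ["A", "B", "C", "D"],
        "url": ["http://ok", "http://gone", "http://slow", "http://down"],
    }
)
dead_links = get_dead_links(
    regulations,
    unknown_links={"http://ok", "http://gone", "http://slow", "http://down"},
    proxies={"http": "http://proxy.example:8080"},
    fetch_status=fake_fetch,
)
```

Here "http://slow" times out on the direct attempt but answers through the proxies, so it is not reported as dead.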

pipeline.src.flows.regulations_checkup.format_dead_links(→ pandas.DataFrame)

Format input for printing.

pipeline.src.flows.regulations_checkup.get_outdated_references(monitorfish_regulations: pandas.DataFrame, now: datetime.datetime) → pandas.DataFrame

Returns monitorfish_regulations that have an end_date which is before now.

Parameters:
  • monitorfish_regulations (pd.DataFrame) – DataFrame of Monitorfish regulations. Must have at least an end_date column.

  • now (datetime.datetime) – the current datetime, against which end_date is compared

Returns:

Subset of monitorfish_regulations

Return type:

pd.DataFrame
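
A minimal sketch of the comparison follows. It assumes that end_date holds timezone-naive dates (or values that pandas can parse as such) and that regulations without an end date have a null end_date, which is never considered outdated. The sample data is made up.

```python
from datetime import datetime

import pandas as pd


def get_outdated_references(
    monitorfish_regulations: pd.DataFrame, now: datetime
) -> pd.DataFrame:
    # Null end dates become NaT, and NaT < now is False, so they are kept out.
    end_dates = pd.to_datetime(monitorfish_regulations["end_date"])
    return monitorfish_regulations[end_dates < now].reset_index(drop=True)


# Made-up sample data, for illustration only.
monitorfish_regulations = pd.DataFrame(
    {
        "zone": ["Zone A", "Zone B", "Zone C"],
        "end_date": ["2020-01-01", None, "2100-01-01"],
    }
)
outdated = get_outdated_references(monitorfish_regulations, now=datetime(2024, 1, 1))
```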

pipeline.src.flows.regulations_checkup.format_outdated_references(outdated_references: pandas.DataFrame) → pandas.DataFrame

Format input for printing.

pipeline.src.flows.regulations_checkup.get_main_template() → jinja2.environment.Template
pipeline.src.flows.regulations_checkup.get_body_template() → jinja2.environment.Template
pipeline.src.flows.regulations_checkup.get_style() → str
pipeline.src.flows.regulations_checkup.render_body(body_template: jinja2.environment.Template, previous_extraction_datetime_utc: datetime.datetime, latest_extraction_datetime_utc: datetime.datetime, missing_references: pandas.DataFrame, modified_regulations: pandas.DataFrame, dead_links: pandas.DataFrame, outdated_references: pandas.DataFrame, backoffice_regulation_url: str, utcnow: datetime.datetime) → str

Renders email body as html string.

pipeline.src.flows.regulations_checkup.render_main(main_template: jinja2.environment.Template, style: str, body: str) → str
pipeline.src.flows.regulations_checkup.get_recipients() → List[str]
pipeline.src.flows.regulations_checkup.create_message(html: str, recipients: List[str]) → email.message.EmailMessage
pipeline.src.flows.regulations_checkup.send_message(msg: email.message.EmailMessage)
pipeline.src.flows.regulations_checkup.regulations_checkup_flow(proxies: dict = PROXIES, backoffice_regulation_url: str = BACKOFFICE_REGULATION_URL, get_utcnow_fn=get_utcnow, get_dead_links_fn: Callable = get_dead_links, send_message_fn: Callable = send_message)
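
create_message is listed above without a description. Based on its signature alone, it might build an html email along the following lines. The subject and the sender address are hypothetical placeholders and not values taken from the module.

```python
from email.message import EmailMessage
from typing import List


def create_message(html: str, recipients: List[str]) -> EmailMessage:
    msg = EmailMessage()
    msg["Subject"] = "Regulations checkup"  # hypothetical subject
    msg["From"] = "monitorfish@example.org"  # hypothetical sender
    msg["To"] = ", ".join(recipients)
    # Send the rendered html as the body, with a text/html content type.
    msg.set_content(html, subtype="html")
    return msg


# Placeholder inputs, for illustration only.
message = create_message(
    "<p>No change detected.</p>", ["a@example.org", "b@example.org"]
)
```

Because the flow receives send_message_fn and get_dead_links_fn as parameters, tests can presumably swap in stubs for both and avoid any real email or network traffic.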