pipeline.src.flows.regulations_checkup
Functions
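make_html_hyperlinks | Returns a list of html strings of links like <a href=url>link_text</a> for the input urls and link_texts.
extract_monitorfish_regulations | Extracts regulation references from the monitorfish regulations table.
extract_legipeche_regulations | Extracts legipeche regulations from the monitorfish legipeche table (which is scraped from legipeche by the Scrape Legipeche flow).
add_article_id | Adds an article_id column to the regulations DataFrame, extracting the article_id from the url_column according to the Legipeche URL schema.
get_extraction_datetimes | Returns the extraction datetimes of previous and latest legipeche extraction occurrences from the legipeche_regulations DataFrame.
get_modified_regulations | Filters the input legipeche_regulations and returns legipeche regulations (documents) that were added to or removed from a Legipeche page referenced in Monitorfish.
transform_modified_regulations | Formats modified_regulations into a DataFrame suitable for printing in an email.
get_missing_references | Returns monitorfish_regulations with null values as reference.
get_unknown_links | Returns the urls of monitorfish_regulations whose article_id is either not present in legipeche_regulations or null.
get_dead_links | Performs GET requests to check whether unknown_links are dead links, then returns monitorfish_regulations that reference a dead link as regulatory reference.
format_dead_links | Format input for printing.
get_outdated_references | Returns monitorfish_regulations that have an end_date which is before now.
format_outdated_references | Format input for printing.
render_body | Renders email body as html string.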
Module Contents
- pipeline.src.flows.regulations_checkup.make_html_hyperlinks(urls: Iterable, link_texts: Iterable, logger: logging.Logger = None) → List[str]
Returns a list of html strings of links like <a href=url>link_text</a> for the input urls and link_texts.
- Parameters:
urls (Iterable) – Iterable of urls
link_texts (Iterable) – Iterable of link texts
- Returns:
list of html links
- Return type:
List[str]
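As an illustration of the behaviour described above, here is a minimal sketch that pairs each url with its link text. The name make_html_hyperlinks_sketch is hypothetical, the logger parameter is left out, and the handling of inputs of unequal length is an assumption, not the module's documented behaviour.

```python
from typing import Iterable, List


def make_html_hyperlinks_sketch(urls: Iterable, link_texts: Iterable) -> List[str]:
    """Illustrative stand-in, not the module's implementation."""
    # zip stops at the shorter input, so surplus urls or texts are dropped
    # (assumption: the real function may handle this case differently).
    return [f"<a href={url}>{text}</a>" for url, text in zip(urls, link_texts)]


links = make_html_hyperlinks_sketch(
    ["https://example.org/a", "https://example.org/b"],
    ["Article A", "Article B"],
)
print(links[0])  # <a href=https://example.org/a>Article A</a>
```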
- pipeline.src.flows.regulations_checkup.extract_monitorfish_regulations() → pandas.DataFrame
Extracts regulation references from the monitorfish regulations table.
The output DataFrame contains one line per regulatory reference, so a regulated zone with several regulatory references appears on several lines.
Output columns are law_type, topic, zone, url and reference.
Regulatory zones without any regulatory reference are present in the output as a line with None as url and reference values.
- Returns:
DataFrame of regulatory references
- Return type:
pd.DataFrame
- pipeline.src.flows.regulations_checkup.extract_legipeche_regulations() → pandas.DataFrame
Extracts legipeche regulations from the monitorfish legipeche table (which is scraped from legipeche by the Scrape Legipeche flow).
The output has one line per document. There can be multiple documents for the same Legipeche page.
Output columns are extraction_datetime_utc, extraction_occurence, page_title, page_url, document_title, and document_url.
- Returns:
DataFrame of Legipeche regulations.
- Return type:
pd.DataFrame
- pipeline.src.flows.regulations_checkup.add_article_id(regulations: pandas.DataFrame, url_column: str) → pandas.DataFrame
Adds an article_id column to the regulations DataFrame, extracting the article_id from the url_column according to the Legipeche URL schema.
Rows for which the URL does not match the Legipeche URL schema will have an article_id of None.
- Parameters:
regulations (pd.DataFrame) – DataFrame of regulations
url_column (str) – Name of the column containing URLs of regulation pages
- Returns:
copy of input regulations with an added article_id column
- Return type:
pd.DataFrame
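The extraction step can be sketched with a regular expression applied to the URL column. The Legipeche URL schema is not given in this page, so the pattern below (an id following "-a" just before ".html") is purely an assumption for illustration, as are the function name and the example URLs. Note that pandas reports non-matching rows as NaN in this sketch, where the documentation above mentions None.

```python
import pandas as pd

# Assumption: hypothetical URL schema where the article id follows "-a"
# right before ".html". The real Legipeche schema may differ.
ARTICLE_ID_PATTERN = r"-a(\d+)\.html"


def add_article_id_sketch(regulations: pd.DataFrame, url_column: str) -> pd.DataFrame:
    """Return a copy of the input with an added article_id column."""
    out = regulations.copy()
    # expand=False returns a Series; rows that do not match get a missing value.
    out["article_id"] = out[url_column].str.extract(ARTICLE_ID_PATTERN, expand=False)
    return out


regulations = pd.DataFrame(
    {"url": ["https://example.org/some-regulation-a123.html", "https://example.org/other", None]}
)
with_ids = add_article_id_sketch(regulations, "url")
print(with_ids)
```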
- pipeline.src.flows.regulations_checkup.get_extraction_datetimes(legipeche_regulations: pandas.DataFrame) → Tuple[str, str]
Returns the extraction datetimes of the previous and latest legipeche extraction occurrences from the legipeche_regulations DataFrame.
The input must have extraction_occurence and extraction_datetime_utc columns.
- Parameters:
legipeche_regulations (pd.DataFrame) – DataFrame of legipeche extractions.
- Returns:
extraction datetimes of previous and latest legipeche extractions
- Return type:
Tuple[str, str]
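One way to picture this lookup is a group-by on the occurrence column. The sketch below assumes that extraction_occurence takes the values 'previous' and 'latest' (as listed for transform_modified_regulations) and that all rows of an occurrence share one datetime; the function name is hypothetical.

```python
from typing import Tuple

import pandas as pd


def get_extraction_datetimes_sketch(legipeche_regulations: pd.DataFrame) -> Tuple[str, str]:
    """Return (previous, latest) extraction datetimes."""
    # One datetime per occurrence; max() simply picks it if rows repeat it.
    by_occurence = legipeche_regulations.groupby("extraction_occurence")[
        "extraction_datetime_utc"
    ].max()
    return by_occurence["previous"], by_occurence["latest"]


extractions = pd.DataFrame(
    {
        "extraction_occurence": ["previous", "previous", "latest"],
        "extraction_datetime_utc": [
            "2021-03-01 10:00:00",
            "2021-03-01 10:00:00",
            "2021-03-08 10:00:00",
        ],
    }
)
previous_dt, latest_dt = get_extraction_datetimes_sketch(extractions)
```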
- pipeline.src.flows.regulations_checkup.get_modified_regulations(legipeche_regulations: pandas.DataFrame, monitorfish_regulations: pandas.DataFrame) → pandas.DataFrame
Filters the input legipeche_regulations and returns the legipeche regulations (documents) that meet both conditions:
- they have been either added to or removed from an existing Legipeche page between the previous and latest Legipeche scraping occurrences
- they belong to a Legipeche page referenced by at least one monitorfish_regulation
- Parameters:
legipeche_regulations (pd.DataFrame)
monitorfish_regulations (pd.DataFrame)
- Returns:
filtered DataFrame of Legipeche regulations
- Return type:
pd.DataFrame
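The two filtering conditions can be sketched as set operations on (page, document) pairs. This is a simplified illustration under assumptions: pages are matched to Monitorfish references by comparing page_url with url directly (the real flow may match on article_id), and an "existing" page is taken to be one present in both extractions. The function name is hypothetical.

```python
import pandas as pd


def get_modified_regulations_sketch(
    legipeche_regulations: pd.DataFrame, monitorfish_regulations: pd.DataFrame
) -> pd.DataFrame:
    """Keep documents added to or removed from existing, referenced pages."""
    occurence = legipeche_regulations["extraction_occurence"]
    previous = legipeche_regulations[occurence == "previous"]
    latest = legipeche_regulations[occurence == "latest"]

    # Documents present in exactly one of the two extractions.
    previous_docs = set(zip(previous["page_url"], previous["document_url"]))
    latest_docs = set(zip(latest["page_url"], latest["document_url"]))
    changed_docs = previous_docs ^ latest_docs

    # Pages that exist in both extractions and are referenced in Monitorfish.
    existing_pages = set(previous["page_url"]) & set(latest["page_url"])
    referenced_pages = set(monitorfish_regulations["url"].dropna())
    pages_to_keep = existing_pages & referenced_pages

    keep = [
        (page, doc) in changed_docs and page in pages_to_keep
        for page, doc in zip(
            legipeche_regulations["page_url"], legipeche_regulations["document_url"]
        )
    ]
    return legipeche_regulations[keep].reset_index(drop=True)


legipeche = pd.DataFrame(
    {
        "extraction_occurence": ["previous"] * 3 + ["latest"] * 4,
        "page_url": ["P1", "P1", "P2", "P1", "P1", "P2", "P3"],
        "document_url": ["D1", "D2", "D3", "D1", "D4", "D3", "D5"],
    }
)
monitorfish = pd.DataFrame({"url": ["P1", "P3", None]})
modified = get_modified_regulations_sketch(legipeche, monitorfish)
```

In this example D2 was removed from page P1 and D4 was added to it, so both are kept; D5 sits on a page that did not exist in the previous extraction and is left out.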
- pipeline.src.flows.regulations_checkup.transform_modified_regulations(modified_regulations: pandas.DataFrame, monitorfish_regulations: pandas.DataFrame) → pandas.DataFrame
Formats modified_regulations into a DataFrame suitable for printing in an email.
- Parameters:
modified_regulations (pd.DataFrame) –
DataFrame with columns:
extraction_occurence, having values ‘previous’ and ‘latest’
page_url
document_title
document_url
monitorfish_regulations (pd.DataFrame) –
DataFrame with columns:
url (url of the regulatory reference in Monitorfish)
reference (name of the regulatory reference in Monitorfish)
law_type
topic
zone
- Returns:
formatted DataFrame of regulation modifications
- Return type:
pd.DataFrame
- pipeline.src.flows.regulations_checkup.get_missing_references(monitorfish_regulations: pandas.DataFrame) → pandas.DataFrame
Returns the rows of monitorfish_regulations whose reference is null.
- Parameters:
monitorfish_regulations (pd.DataFrame) – DataFrame of Monitorfish regulations. Must have columns:
reference
law_type
topic
zone
- Returns:
Filtered and formatted version of input.
- Return type:
pd.DataFrame
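A minimal sketch of this filter, assuming that "formatted" means keeping the law_type, topic and zone columns with duplicates removed (the exact formatting is not specified here, and the function name is hypothetical):

```python
import pandas as pd


def get_missing_references_sketch(monitorfish_regulations: pd.DataFrame) -> pd.DataFrame:
    """Return the zones that have no regulatory reference."""
    missing = monitorfish_regulations[monitorfish_regulations["reference"].isna()]
    # Assumption: the formatted output identifies zones, not references.
    return missing[["law_type", "topic", "zone"]].drop_duplicates().reset_index(drop=True)


zones = pd.DataFrame(
    {
        "reference": ["Some decree", None],
        "law_type": ["Type 1", "Type 2"],
        "topic": ["Topic A", "Topic B"],
        "zone": ["Zone 1", "Zone 2"],
    }
)
missing_references = get_missing_references_sketch(zones)
```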
- pipeline.src.flows.regulations_checkup.get_unknown_links(monitorfish_regulations: pandas.DataFrame, legipeche_regulations: pandas.DataFrame) → set
Returns the urls of monitorfish_regulations whose article_id is either not present in legipeche_regulations (i.e. referencing Legipeche articles that might not exist) or null (which corresponds to urls that do not match the legipeche url pattern and which usually point to external websites).
- Parameters:
monitorfish_regulations (pd.DataFrame)
legipeche_regulations (pd.DataFrame)
- Returns:
subset of monitorfish_regulations.url
- Return type:
set
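The selection described above might look like the following sketch. It assumes both DataFrames already carry an article_id column (for instance added by add_article_id) and that rows without any url are ignored; the function name is hypothetical.

```python
import pandas as pd


def get_unknown_links_sketch(
    monitorfish_regulations: pd.DataFrame, legipeche_regulations: pd.DataFrame
) -> set:
    """Return urls whose article_id is null or absent from the Legipeche data."""
    known_ids = set(legipeche_regulations["article_id"].dropna())
    article_id = monitorfish_regulations["article_id"]
    unknown = monitorfish_regulations[article_id.isna() | ~article_id.isin(known_ids)]
    # Assumption: zones without a url have nothing to check and are skipped.
    return set(unknown["url"].dropna())


monitorfish_refs = pd.DataFrame(
    {
        "url": ["https://example.org/known", "https://example.org/gone", "https://example.com/external", None],
        "article_id": ["1", "2", None, None],
    }
)
legipeche_docs = pd.DataFrame({"article_id": ["1", "3"]})
unknown_links = get_unknown_links_sketch(monitorfish_refs, legipeche_docs)
```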
- pipeline.src.flows.regulations_checkup.get_dead_links(monitorfish_regulations: pandas.DataFrame, unknown_links: set, proxies: dict) → pandas.DataFrame
Performs GET requests to check whether unknown_links are dead links, then returns the monitorfish_regulations that reference a dead link as regulatory reference.
- Parameters:
monitorfish_regulations (pd.DataFrame)
unknown_links (set) – set of unknown urls (i.e. urls not found when scraping Legipeche)
proxies (dict) – proxies to use when requests time out without proxies
- Returns:
filtered monitorfish_regulations with only those that reference a dead link
- Return type:
pd.DataFrame
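The retry-with-proxies check can be sketched independently of any HTTP library by injecting the request function. Everything here is an assumption for illustration: the name find_dead_links_sketch, the fetch_status callable, and the rule that a link is dead when its status code is 400 or above or when it times out both without and with proxies. In practice fetch_status could wrap a GET call from an HTTP client and translate its timeout exception into TimeoutError.

```python
from typing import Callable, Optional, Set


def find_dead_links_sketch(
    unknown_links: Set[str],
    fetch_status: Callable[[str, Optional[dict]], int],
    proxies: dict,
) -> Set[str]:
    """Return the subset of unknown_links considered dead."""
    dead = set()
    for url in unknown_links:
        try:
            # First attempt without proxies.
            status = fetch_status(url, None)
        except TimeoutError:
            try:
                # Second attempt through the proxies.
                status = fetch_status(url, proxies)
            except TimeoutError:
                status = None
        # Assumption: error status codes and double timeouts both count as dead.
        if status is None or status >= 400:
            dead.add(url)
    return dead


def fake_fetch_status(url: str, proxies: Optional[dict]) -> int:
    """Stand-in for a real GET request, used only for this example."""
    if url.endswith("/needs-proxy") and proxies is None:
        raise TimeoutError
    if url.endswith("/unreachable"):
        raise TimeoutError
    return 404 if url.endswith("/gone") else 200


dead_urls = find_dead_links_sketch(
    {
        "https://example.org/ok",
        "https://example.org/gone",
        "https://example.org/needs-proxy",
        "https://example.org/unreachable",
    },
    fake_fetch_status,
    {"https": "http://proxy.example:8080"},
)
```

The returned set could then be used to keep only the rows of monitorfish_regulations whose url is in it.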
- pipeline.src.flows.regulations_checkup.format_dead_links(dead_links: pandas.DataFrame) → pandas.DataFrame
Formats the input for printing.
- pipeline.src.flows.regulations_checkup.get_outdated_references(monitorfish_regulations: pandas.DataFrame, now: datetime.datetime) → pandas.DataFrame
Returns monitorfish_regulations that have an end_date which is before now.
- Parameters:
monitorfish_regulations (pd.DataFrame) – DataFrame of Monitorfish regulations. Must have at least an end_date column.
now (datetime.datetime) – datetime against which each end_date is compared
- Returns:
Subset of monitorfish_regulations
- Return type:
pd.DataFrame
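A minimal sketch of the date filter, assuming end_date holds naive datetimes and that regulations without an end date are never considered outdated (the function name is hypothetical):

```python
from datetime import datetime

import pandas as pd


def get_outdated_references_sketch(
    monitorfish_regulations: pd.DataFrame, now: datetime
) -> pd.DataFrame:
    """Return the rows whose end_date is strictly before now."""
    # Comparisons with a missing end_date (NaT) evaluate to False,
    # so open-ended regulations are left out.
    return monitorfish_regulations[monitorfish_regulations["end_date"] < now]


dated_regulations = pd.DataFrame(
    {
        "zone": ["A", "B", "C"],
        "end_date": pd.to_datetime(["2020-01-01", None, "2100-01-01"]),
    }
)
outdated = get_outdated_references_sketch(dated_regulations, datetime(2024, 1, 1))
```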
- pipeline.src.flows.regulations_checkup.format_outdated_references(outdated_references: pandas.DataFrame) → pandas.DataFrame
Formats the input for printing.
- pipeline.src.flows.regulations_checkup.render_body(body_template: jinja2.environment.Template, previous_extraction_datetime_utc: datetime.datetime, latest_extraction_datetime_utc: datetime.datetime, missing_references: pandas.DataFrame, modified_regulations: pandas.DataFrame, dead_links: pandas.DataFrame, outdated_references: pandas.DataFrame, backoffice_regulation_url: str, utcnow: datetime.datetime) → str
Renders email body as html string.
- pipeline.src.flows.regulations_checkup.render_main(main_template: jinja2.environment.Template, style: str, body: str) → str
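render_main has no description, but its signature suggests that it injects a stylesheet and an already rendered body into a Jinja2 template. The sketch below illustrates that idea only: the template text, the variable names style and body inside it, and the function name are all assumptions.

```python
from jinja2 import Template

# Hypothetical main template; the real one is loaded elsewhere in the flow.
main_template = Template(
    "<html><head><style>{{ style }}</style></head><body>{{ body }}</body></html>"
)


def render_main_sketch(main_template: Template, style: str, body: str) -> str:
    """Insert the stylesheet and the rendered body into the main template."""
    # A plain Template does not autoescape, so the html body is inserted as-is.
    return main_template.render(style=style, body=body)


email_html = render_main_sketch(
    main_template, "p { color: black; }", "<p>No missing references.</p>"
)
```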