pipeline.src.helpers.vessels

Functions

make_add_vessels_columns_query(→ sqlalchemy.sql.Select)

Creates a sqlalchemy.select statement representing a query to fetch the designated columns from the vessels and / or districts tables for the indicated vessel_ids.

make_find_vessels_query(→ sqlalchemy.sql.Select)

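Creates a sqlalchemy.select object representing a query to find vessels in the vessels table that match any of the lines in the input DataFrame on any of cfr, ircs or external_immatriculation.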

merge_vessel_id(→ pandas.DataFrame)

Adds a vessel_id column to a DataFrame of vessels by matching its lines with a DataFrame of vessels found in the vessels table.

Module Contents

pipeline.src.helpers.vessels.make_add_vessels_columns_query(vessel_ids: list, vessels_table: sqlalchemy.Table, vessels_columns_to_add: list = None, districts_table: sqlalchemy.Table = None, districts_columns_to_add: list = None) sqlalchemy.sql.Select[source]

Creates a sqlalchemy.select statement representing a query to fetch the designated columns from the vessels and / or districts tables for the indicated vessel_ids.

Parameters:
  • vessel_ids (list) – List of vessel ids to fetch data for.

  • vessels_table (Table) – vessels table.

  • vessels_columns_to_add (list, optional) – List of columns to get from the vessels table. Defaults to None.

  • districts_table (Table, optional) – districts table. Must be supplied if districts_columns_to_add is given. Defaults to None.

  • districts_columns_to_add (list, optional) – List of columns to get from the districts table. Defaults to None.

Returns:

select statement to execute to get the indicated data.

Return type:

Select
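As an illustration of the kind of statement described above, here is a minimal sketch, not the module's actual implementation. It assumes SQLAlchemy 1.4 or later. The table layouts, the id primary key labelled as vessel_id, and the district_code join column are all hypothetical; the real join condition is not documented here.

```python
from sqlalchemy import Column, Integer, MetaData, String, Table, select

metadata = MetaData()

# Hypothetical table layouts, reduced to what the sketch needs.
vessels_table = Table(
    "vessels",
    metadata,
    Column("id", Integer, primary_key=True),
    Column("vessel_name", String),
    Column("district_code", String),
)
districts_table = Table(
    "districts",
    metadata,
    Column("district_code", String, primary_key=True),
    Column("district", String),
)


def sketch_add_vessels_columns_query(
    vessel_ids,
    vessels_table,
    vessels_columns_to_add=None,
    districts_table=None,
    districts_columns_to_add=None,
):
    """Hypothetical stand-in for make_add_vessels_columns_query."""
    vessels_columns_to_add = vessels_columns_to_add or []
    districts_columns_to_add = districts_columns_to_add or []
    if districts_columns_to_add and districts_table is None:
        raise ValueError(
            "districts_table is required when districts_columns_to_add is given"
        )

    # Assumption: the vessels primary key is `id`, exposed as `vessel_id`.
    columns = [vessels_table.c.id.label("vessel_id")]
    columns += [vessels_table.c[name] for name in vessels_columns_to_add]

    source = vessels_table
    if districts_columns_to_add:
        columns += [districts_table.c[name] for name in districts_columns_to_add]
        # Assumption: both tables share a `district_code` column.
        source = vessels_table.join(
            districts_table,
            vessels_table.c.district_code == districts_table.c.district_code,
            isouter=True,
        )

    return (
        select(*columns)
        .select_from(source)
        .where(vessels_table.c.id.in_(vessel_ids))
    )


query = sketch_add_vessels_columns_query(
    [1, 2], vessels_table, ["vessel_name"], districts_table, ["district"]
)
sql = str(query)
```

The outer join keeps vessels that have no matching district, so asking for district columns does not silently drop vessels from the result.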

pipeline.src.helpers.vessels.make_find_vessels_query(vessels: pandas.DataFrame, vessels_table: sqlalchemy.Table) sqlalchemy.sql.Select[source]

Creates a sqlalchemy.select object representing a query to find vessels in the vessels table that match any of the lines in the input DataFrame on any of cfr, ircs or external_immatriculation.

Parameters:
  • vessels (pd.DataFrame) – DataFrame. Must have columns cfr, ircs and external_immatriculation. If any other columns are present they are ignored.

  • vessels_table (Table) – sqlalchemy.Table object representing the vessels table. Must have columns cfr, ircs and external_immatriculation. If any other columns are present they are ignored.

Returns:

query object with columns vessel_id, cfr, ircs and external_immatriculation.

Return type:

Select
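The matching rule above (a row is selected if any one of its three identifiers appears in the input DataFrame) could be expressed along the following lines. This is a sketch under assumptions, not the module's code: it assumes SQLAlchemy 1.4 or later, a hypothetical id primary key exposed as vessel_id, and an in-memory SQLite database used only to demonstrate the result.

```python
import pandas as pd
from sqlalchemy import (
    Column, Integer, MetaData, String, Table, create_engine, false, or_, select,
)

IDENTIFIERS = ["cfr", "ircs", "external_immatriculation"]


def sketch_find_vessels_query(vessels, vessels_table):
    """Hypothetical stand-in for make_find_vessels_query."""
    conditions = []
    for name in IDENTIFIERS:
        # Null identifiers cannot match anything, so they are dropped.
        values = vessels[name].dropna().unique().tolist()
        if values:
            conditions.append(vessels_table.c[name].in_(values))
    # With no usable identifier at all, the query must return no rows.
    where_clause = or_(*conditions) if conditions else false()
    return select(
        vessels_table.c.id.label("vessel_id"),  # assumption: primary key is `id`
        *[vessels_table.c[name] for name in IDENTIFIERS],
    ).where(where_clause)


# Demonstration on a throwaway in-memory database.
metadata = MetaData()
vessels_table = Table(
    "vessels",
    metadata,
    Column("id", Integer, primary_key=True),
    Column("cfr", String),
    Column("ircs", String),
    Column("external_immatriculation", String),
    Column("vessel_name", String),  # extra column, ignored by the query
)
engine = create_engine("sqlite://")
metadata.create_all(engine)
with engine.begin() as conn:
    conn.execute(
        vessels_table.insert(),
        [
            {"id": 1, "cfr": "FRA001", "ircs": "AAA",
             "external_immatriculation": "X1", "vessel_name": "a"},
            {"id": 2, "cfr": "FRA002", "ircs": "BBB",
             "external_immatriculation": "X2", "vessel_name": "b"},
            {"id": 3, "cfr": "FRA003", "ircs": "CCC",
             "external_immatriculation": "X3", "vessel_name": "c"},
        ],
    )

vessels = pd.DataFrame(
    {
        "cfr": ["FRA001", None],
        "ircs": [None, "BBB"],
        "external_immatriculation": [None, None],
    }
)
with engine.connect() as conn:
    rows = conn.execute(sketch_find_vessels_query(vessels, vessels_table)).fetchall()
found_ids = sorted(row.vessel_id for row in rows)
```

Here vessel 1 is found through its cfr and vessel 2 through its ircs, while vessel 3 shares no identifier with the input and is left out.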

pipeline.src.helpers.vessels.merge_vessel_id(vessels: pandas.DataFrame, found_vessels: pandas.DataFrame, logger: logging.Logger) pandas.DataFrame[source]

The two input DataFrames are assumed to be:

  • a list of vessels with cfr, ircs and external_immatriculation identifiers (and possibly other columns), but without a vessel_id column

  • a list of vessels with cfr, ircs, external_immatriculation and vessel_id columns (and no other columns). Typically these are the vessels found in the vessels table by the make_find_vessels_query query, i.e. the vessels that match one of the identifiers of the first DataFrame.

The idea is to add the vessel_id from the second DataFrame as a new column in the first DataFrame, by matching the corresponding lines in both DataFrames.

This is done by performing a left join of the input DataFrames using join_on_multiple_keys on ["cfr", "ircs", "external_immatriculation"].

Additionally, the returned vessel_id for a line in the first DataFrame is None unless both of the following conditions are met:

  • there is no ambiguity: only one vessel in the second DataFrame can be matched to a given line in the first DataFrame

  • there is no conflict: at most one vessel in the first DataFrame can be matched to a given line in the second DataFrame

Lines in the second DataFrame that do not match a line in the first DataFrame are absent from the result.

Lines in the first DataFrame that do not match a line in the second DataFrame are present in the result with a vessel_id of None.

The result always has exactly the same lines as the first input DataFrame.

Parameters:
  • vessels (pd.DataFrame) – Vessels to match to a found_vessel

  • found_vessels (pd.DataFrame) – found_vessels to match to a vessel

  • logger (Logger) – Logger instance

Returns:

Same as vessels with an added vessel_id column.

Return type:

pd.DataFrame
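The ambiguity and conflict rules described above can be illustrated with a simplified sketch. It is not the module's implementation: the real function relies on join_on_multiple_keys, whose exact matching behaviour is not documented here, so this sketch assumes that two lines match as soon as any one non-null identifier is equal. The function name and sample data are hypothetical, and the logger argument is omitted.

```python
from collections import Counter

import pandas as pd

KEYS = ["cfr", "ircs", "external_immatriculation"]


def sketch_merge_vessel_id(vessels, found_vessels):
    """Simplified, hypothetical stand-in for merge_vessel_id (no logger)."""
    left = vessels.reset_index(drop=True)

    # Collect every (row position, vessel_id) pair sharing an identifier.
    pairs = set()
    for key in KEYS:
        l = left[[key]].dropna().reset_index().rename(columns={"index": "row"})
        r = found_vessels[[key, "vessel_id"]].dropna(subset=[key])
        if l.empty or r.empty:
            continue
        matched = l.merge(r, on=key, how="inner")
        pairs.update(zip(matched["row"].tolist(), matched["vessel_id"].tolist()))

    per_row = Counter(row for row, _ in pairs)  # > 1 means ambiguity
    per_id = Counter(vid for _, vid in pairs)   # > 1 means conflict
    kept = {
        row: vid
        for row, vid in pairs
        if per_row[row] == 1 and per_id[vid] == 1
    }

    result = vessels.copy()
    # dtype=object keeps None as None instead of converting it to NaN.
    result["vessel_id"] = pd.Series(
        [kept.get(row) for row in range(len(result))],
        index=result.index,
        dtype=object,
    )
    return result


vessels = pd.DataFrame(
    {
        "cfr": ["FRA001", "FRA002", None, "FRA999"],
        "ircs": ["AAA", None, "CCC", None],
        "external_immatriculation": ["X1", None, None, None],
        "vessel_name": ["a", "b", "c", "d"],
    }
)
found_vessels = pd.DataFrame(
    {
        "cfr": ["FRA001", "FRA002", "FRA002"],
        "ircs": ["AAA", None, "CCC"],
        "external_immatriculation": ["X1", None, None],
        "vessel_id": [1, 2, 3],
    }
)
result = sketch_merge_vessel_id(vessels, found_vessels)
```

In this sample, the first line gets vessel_id 1. The second line matches vessels 2 and 3 (ambiguity), the third line matches vessel 3, which also matches the second line (conflict), and the fourth line matches nothing, so all three get None.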