pipeline.src.helpers.vessels
============================

.. py:module:: pipeline.src.helpers.vessels


Functions
---------

.. autoapisummary::

   pipeline.src.helpers.vessels.make_add_vessels_columns_query
   pipeline.src.helpers.vessels.make_find_vessels_query
   pipeline.src.helpers.vessels.merge_vessel_id


Module Contents
---------------

.. py:function:: make_add_vessels_columns_query(vessel_ids: list, vessels_table: sqlalchemy.Table, vessels_columns_to_add: list = None, districts_table: sqlalchemy.Table = None, districts_columns_to_add: list = None) -> sqlalchemy.sql.Select

   Creates a `sqlalchemy.select` statement representing a query to fetch the
   designated columns from the `vessels` and/or `districts` tables for the
   indicated `vessel_ids`.

   :param vessel_ids: List of vessel `id` values to fetch data for.
   :type vessel_ids: list
   :param vessels_table: `sqlalchemy.Table` object representing the `vessels` table.
   :type vessels_table: Table
   :param vessels_columns_to_add: List of columns to get from the `vessels` table.
                                  Defaults to None.
   :type vessels_columns_to_add: list, optional
   :param districts_table: `sqlalchemy.Table` object representing the `districts` table.
                           Must be supplied if `districts_columns_to_add` is given.
                           Defaults to None.
   :type districts_table: Table, optional
   :param districts_columns_to_add: List of columns to get from the `districts` table.
                                    Defaults to None.
   :type districts_columns_to_add: list, optional

   :returns: Select statement to execute to get the indicated data.
   :rtype: Select


.. py:function:: make_find_vessels_query(vessels: pandas.DataFrame, vessels_table: sqlalchemy.Table) -> sqlalchemy.sql.Select

   Creates a `sqlalchemy.select` object representing a query to find vessels in
   the `vessels` table that match any row of the input `DataFrame` on any of
   `cfr`, `ircs` or `external_immatriculation`.

   :param vessels: `DataFrame`. Must have columns `cfr`, `ircs` and
                   `external_immatriculation`. Any other columns are ignored.
   :type vessels: pd.DataFrame
   :param vessels_table: `sqlalchemy.Table` object representing the `vessels` table.
                         Must have columns `cfr`, `ircs` and
                         `external_immatriculation`. Any other columns are ignored.
   :type vessels_table: Table

   :returns: Query object with columns `vessel_id`, `cfr`, `ircs` and
             `external_immatriculation`.
   :rtype: Select


.. py:function:: merge_vessel_id(vessels: pandas.DataFrame, found_vessels: pandas.DataFrame, logger: logging.Logger) -> pandas.DataFrame

   The two input DataFrames are assumed to be:

   - a list of vessels with `cfr`, `ircs` and `external_immatriculation`
     identifiers (plus possibly other columns) and no `vessel_id` column
   - a list of vessels with `cfr`, `ircs`, `external_immatriculation` and
     `vessel_id` columns (and no other columns). Typically these are the vessels
     found in the `vessels` table by the `make_find_vessels_query` query, that is,
     vessels that match one of the identifiers of the `vessels` DataFrame.

   The idea is to add the `vessel_id` from the second DataFrame as a new column
   in the first DataFrame, by matching the corresponding rows in both DataFrames.

   This is done by performing a left join of the input DataFrames using
   `join_on_multiple_keys` on `["cfr", "ircs", "external_immatriculation"]`.

   Additionally, the returned `vessel_id` for a row of the first DataFrame is
   `None` unless both of the following conditions are met:

   - there is no ambiguity: only one vessel in the second DataFrame can be
     matched to a given row in the first DataFrame
   - there is no conflict: at most one vessel in the first DataFrame can be
     matched to a given row in the second DataFrame

   Rows in the second DataFrame that do not match a row in the first DataFrame
   are absent from the result.

   Rows in the first DataFrame that do not match a row in the second DataFrame
   are present in the result with a `vessel_id` of `None`.

   The result always has exactly the same rows as the first input DataFrame.

   :param vessels: Vessels to match to a found_vessel.
   :type vessels: pd.DataFrame
   :param found_vessels: found_vessels to match to a vessel.
   :type found_vessels: pd.DataFrame
   :param logger: Logger instance.
   :type logger: Logger

   :returns: Same as vessels with an added `vessel_id` column.
   :rtype: pd.DataFrame
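
The ambiguity and conflict rules described for `merge_vessel_id` can be illustrated with a small pandas sketch. This is not the project's implementation: the function name `merge_vessel_id_sketch` and the sample data are hypothetical. It does not use `join_on_multiple_keys`, whose exact behaviour is not documented here, and it assumes that two rows "match" when they share at least one non-null identifier. Missing ids come out as `pd.NA` rather than `None`.

.. code-block:: python

   import pandas as pd

   KEYS = ["cfr", "ircs", "external_immatriculation"]


   def merge_vessel_id_sketch(vessels: pd.DataFrame, found_vessels: pd.DataFrame) -> pd.DataFrame:
       """Illustrative stand-in for merge_vessel_id (assumption, see text)."""
       # Give every input row a temporary positional label named "row".
       left = vessels.reset_index(drop=True).rename_axis("row").reset_index()

       # Collect every (row, vessel_id) pair that agrees on at least one non-null key.
       pairs = pd.concat(
           [
               left.dropna(subset=[key])[["row", key]].merge(
                   found_vessels.dropna(subset=[key])[[key, "vessel_id"]], on=key
               )[["row", "vessel_id"]]
               for key in KEYS
           ],
           ignore_index=True,
       ).drop_duplicates()

       # Ambiguity: one input row matches several found vessels.
       ids_per_row = pairs.groupby("row")["vessel_id"].transform("nunique")
       # Conflict: one found vessel matches several input rows.
       rows_per_id = pairs.groupby("vessel_id")["row"].transform("nunique")
       safe = pairs[(ids_per_row == 1) & (rows_per_id == 1)]

       # The left join keeps every input row, in its original order.
       result = left.merge(safe, on="row", how="left").drop(columns="row")
       result["vessel_id"] = result["vessel_id"].astype("Int64")
       return result


   vessels = pd.DataFrame(
       {
           "cfr": ["FRA001", None, "FRA004", None, "FRA999"],
           "ircs": [None, "ABCD", None, "WXYZ", None],
           "external_immatriculation": [None, "EXT9", None, None, None],
       }
   )
   found_vessels = pd.DataFrame(
       {
           "cfr": ["FRA001", "FRA002", "FRA003", "FRA004"],
           "ircs": ["AAAA", "ABCD", "CCCC", "WXYZ"],
           "external_immatriculation": ["EXT1", "EXT2", "EXT9", "EXT4"],
           "vessel_id": [1, 2, 3, 4],
       }
   )

   result = merge_vessel_id_sketch(vessels, found_vessels)
   # Row 0 matches vessel 1 only, so it keeps that id.
   # Row 1 matches vessels 2 (ircs) and 3 (external_immatriculation): ambiguous.
   # Rows 2 and 3 both match vessel 4: conflict, so neither gets an id.
   # Row 4 matches nothing.
   print(result)

In this sketch only the first row ends up with a `vessel_id`. The other four rows are still present in the result, which mirrors the guarantee that the output has the same rows as the first input DataFrame.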