pipeline.src.helpers.vessels
============================

.. py:module:: pipeline.src.helpers.vessels


Functions
---------

.. autoapisummary::

   pipeline.src.helpers.vessels.make_add_vessels_columns_query
   pipeline.src.helpers.vessels.make_find_vessels_query
   pipeline.src.helpers.vessels.merge_vessel_id


Module Contents
---------------

.. py:function:: make_add_vessels_columns_query(vessel_ids: list, vessels_table: sqlalchemy.Table, vessels_columns_to_add: list = None, districts_table: sqlalchemy.Table = None, districts_columns_to_add: list = None) -> sqlalchemy.sql.Select

   Creates a `sqlalchemy.select` statement representing a query to fetch the
   designated columns from the `vessels` and / or `districts` tables for the
   indicated `vessel_ids`.

   :param vessel_ids: List of vessel `id` values to fetch data for.
   :type vessel_ids: list
   :param vessels_table: vessels table.
   :type vessels_table: Table
   :param vessels_columns_to_add: List of columns to get from the
                                  `vessels` table. Defaults to None.
   :type vessels_columns_to_add: list, optional
   :param districts_table: districts table. Must be supplied if
                           `districts_columns_to_add` is given. Defaults to None.
   :type districts_table: Table, optional
   :param districts_columns_to_add: List of columns to get from the
                                    `districts` table. Defaults to None.
   :type districts_columns_to_add: list, optional

   :returns: select statement to execute to get the indicated data.
   :rtype: Select


.. py:function:: make_find_vessels_query(vessels: pandas.DataFrame, vessels_table: sqlalchemy.Table) -> sqlalchemy.sql.Select

   Creates a `sqlalchemy.select` object representing a query to find `vessels` in
   the `vessels` table that match any of the lines in the input `DataFrame` on any of
   `cfr`, `ircs` or `external_immatriculation`.

   :param vessels: `DataFrame` of vessels to look up. Must have columns `cfr`,
                   `ircs` and `external_immatriculation`. If any other columns are
                   present they are ignored.
   :type vessels: pd.DataFrame
   :param vessels_table: `sqlalchemy.Table` object representing the `vessels`
                         table. Must have columns `cfr`, `ircs` and `external_immatriculation`. If any
                         other columns are present they are ignored.
   :type vessels_table: Table

   :returns: query object with columns `vessel_id`, `cfr`, `ircs` and
             `external_immatriculation`.
   :rtype: Select
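
   A rough sketch of a query that matches on any of the three identifiers,
   assuming SQLAlchemy 1.4 or later. The table definition, the use of the `id`
   column as the source of `vessel_id`, and the `sketch_find_vessels_query` helper
   are assumptions made for illustration; the actual implementation may differ.

   .. code-block:: python

      import pandas as pd
      import sqlalchemy as sa

      IDENTIFIERS = ("cfr", "ircs", "external_immatriculation")

      # Hypothetical table definition, for illustration only.
      vessels_table = sa.Table(
          "vessels",
          sa.MetaData(),
          sa.Column("id", sa.Integer, primary_key=True),
          sa.Column("cfr", sa.String),
          sa.Column("ircs", sa.String),
          sa.Column("external_immatriculation", sa.String),
      )


      def sketch_find_vessels_query(vessels, vessels_table):
          """Illustrative stand-in, not the pipeline's implementation."""
          conditions = []
          for name in IDENTIFIERS:
              # Null identifiers cannot match anything, so drop them.
              values = vessels[name].dropna().unique().tolist()
              if values:
                  conditions.append(vessels_table.c[name].in_(values))

          # With no usable identifier at all, return a query that matches nothing.
          where_clause = sa.or_(*conditions) if conditions else sa.false()

          return sa.select(
              vessels_table.c.id.label("vessel_id"),
              vessels_table.c.cfr,
              vessels_table.c.ircs,
              vessels_table.c.external_immatriculation,
          ).where(where_clause)


      vessels = pd.DataFrame(
          {
              "cfr": ["FRA000123", None],
              "ircs": [None, "FQAB"],
              "external_immatriculation": ["AB123", None],
          }
      )
      query = sketch_find_vessels_query(vessels, vessels_table)
      print(query)  # Inspect the generated SQL without connecting to a database.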


.. py:function:: merge_vessel_id(vessels: pandas.DataFrame, found_vessels: pandas.DataFrame, logger: logging.Logger) -> pandas.DataFrame

   The two input DataFrames are assumed to be:

     - a list of vessels with `cfr`, `ircs` and `external_immatriculation` identifiers
       (plus potential other columns) without a `vessel_id` column
     - a list of vessels with `cfr`, `ircs`, `external_immatriculation` and
       `vessel_id` columns (and no other columns). Typically these are the vessels
       of the `vessels` table that the `make_find_vessels_query` query found to
       match one of the identifiers of the `vessels` DataFrame.

   The idea is to add the `vessel_id` from the second DataFrame as a new column in the
   first DataFrame, by matching the corresponding lines of the two DataFrames.

   This is done by performing a left join of the input DataFrames using
   `join_on_multiple_keys` on `["cfr", "ircs", "external_immatriculation"]`.

   Additionally, the returned `vessel_id` for a line in the first DataFrame is
   `None` unless both of the following conditions are met:

     - there is no ambiguity: only one vessel in the second DataFrame can be matched
       to a given line in the first DataFrame
     - there is no conflict: at most one vessel in the first DataFrame can be matched
       to a given line in the second DataFrame

   Lines in the second DataFrame that do not match a line in the first DataFrame are
   absent from the result.

   Lines in the first DataFrame that do not match a line in the second DataFrame are
   present in the result with a `vessel_id` of `None`.

   The result always has exactly the same lines as the first input DataFrame.

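   :param vessels: Vessels to which a `vessel_id` should be added. Must have
                   columns `cfr`, `ircs` and `external_immatriculation`.
   :type vessels: pd.DataFrame
   :param found_vessels: Vessels found in the `vessels` table, with columns `cfr`,
                         `ircs`, `external_immatriculation` and `vessel_id`.
   :type found_vessels: pd.DataFrame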
   :param logger: Logger instance
   :type logger: Logger

   :returns: Same as `vessels`, with an added `vessel_id` column.
   :rtype: pd.DataFrame
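
   The ambiguity and conflict rules can be illustrated with a small pure-Python
   sketch in which lists of dicts stand in for the two DataFrames. It assumes that
   two lines match when they share a non-null value on at least one identifier;
   the exact matching semantics of `join_on_multiple_keys` may differ, and
   `sketch_merge_vessel_id` is a hypothetical name, not the function documented
   here.

   .. code-block:: python

      KEYS = ("cfr", "ircs", "external_immatriculation")


      def rows_match(a, b):
          """True if both rows share a non-null value on at least one identifier."""
          return any(a.get(k) is not None and a.get(k) == b.get(k) for k in KEYS)


      def sketch_merge_vessel_id(vessels, found_vessels):
          """Illustrative stand-in working on lists of dicts, not DataFrames."""
          # For each vessel, the indices of the found vessels it matches.
          matches = [
              [j for j, f in enumerate(found_vessels) if rows_match(v, f)]
              for v in vessels
          ]
          # For each found vessel, the indices of the vessels that match it.
          reverse = [
              [i for i, v in enumerate(vessels) if rows_match(v, f)]
              for f in found_vessels
          ]

          result = []
          for v, candidates in zip(vessels, matches):
              vessel_id = None
              # No ambiguity: exactly one candidate.
              # No conflict: that candidate is matched by this vessel only.
              if len(candidates) == 1 and len(reverse[candidates[0]]) == 1:
                  vessel_id = found_vessels[candidates[0]]["vessel_id"]
              result.append({**v, "vessel_id": vessel_id})
          return result


      found_vessels = [
          {"cfr": "A", "ircs": "AA", "external_immatriculation": None, "vessel_id": 1},
          {"cfr": "B", "ircs": None, "external_immatriculation": None, "vessel_id": 2},
          {"cfr": None, "ircs": "XX", "external_immatriculation": None, "vessel_id": 3},
          {"cfr": "D", "ircs": "DD", "external_immatriculation": None, "vessel_id": 4},
      ]
      vessels = [
          {"cfr": "A", "ircs": None, "external_immatriculation": None},  # single match
          {"cfr": "B", "ircs": "XX", "external_immatriculation": None},  # ambiguous
          {"cfr": "D", "ircs": None, "external_immatriculation": None},  # conflict
          {"cfr": None, "ircs": "DD", "external_immatriculation": None},  # conflict
          {"cfr": "Z", "ircs": None, "external_immatriculation": None},  # no match
      ]

      result = sketch_merge_vessel_id(vessels, found_vessels)
      print([row["vessel_id"] for row in result])  # [1, None, None, None, None]

   Only the first vessel receives a `vessel_id`: the second matches two found
   vessels, the third and fourth compete for the same found vessel, and the last
   matches nothing. The result keeps one line per input vessel, in the same order.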