pipeline.src.read_query

Functions

`read_saved_query`(→ Union[pandas.DataFrame, ...)	Run saved SQLquery on a database. Supported databases :
`read_query`(→ Union[pandas.DataFrame, ...)	Run SQLquery on a database. Supported databases :
`read_table`(db, schema, table_name)	Loads database table into pandas Dataframe. Supported databases :

Module Contents

pipeline.src.read_query.read_saved_query(sql_filepath: str | pathlib.Path, *, db: str = None, con: sqlalchemy.engine.Connection | sqlalchemy.engine.Engine = None, chunksize: None | str = None, params: dict | None = None, backend: str = 'pandas', geom_col: str = 'geom', crs: int | None = None, parse_dates: list | dict | None = None, return_pyarrow_dtypes: bool = False, **kwargs) → pandas.DataFrame | geopandas.GeoDataFrame[source]

Run saved SQLquery on a database. Supported databases :

‘ocan’ : OCAN database
‘fmc’: FMC database
‘monitorfish_remote’: Monitorfish database
‘monitorfish_local’: Monitorfish PostGIS database hosted in CNSP
‘monitorenv_remote’: Monitorfish database
‘cacem_local’ : CACEM PostGIS database hosted in CNSP
‘data_warehouse’ : Monitorfish/Monitorenv/RapportNav Data Warehouse

Database credentials must be present in the environement.

Parameters:

sql_filepath (str) – path to .sql file, starting from the saved queries folder. example : ‘ocan/nav_fr_peche.sql’
db (str, optional) – Database name. Possible values : ‘ocan’, ‘fmc’, ‘monitorfish_remote’, ‘monitorfish_local’. If db is None, con must be passed.
con (Union[Connection, Engine], optional) – sqlalchemy.engine.Connection or sqlalchemy.engine.Engine object. Mandatory if no db is given. Ignored if db is given.
chunksize (Union[None, str], optional) – If specified, return an iterator where chunksize is the number of rows to include in each chunk. Defaults to None.
params (Union[dict, None], optional) – Parameters to pass to execute method. Defaults to None.
backend (str, optional) – ‘pandas’ to run a SQL query and return a pandas.DataFrame or ‘geopandas’ to run a PostGIS query and return a geopandas.GeoDataFrame. Defaults to ‘pandas’.
geom_col (str, optional) – column name to convert to shapely geometries when backend is ‘geopandas’. Ignored when backend is ‘pandas’. Defaults to ‘geom’.
crs (Union[None, str], optional) – CRS to use for the returned GeoDataFrame; if not set, tries to determine CRS from the SRID associated with the first geometry in the database, and assigns that to all geometries. Ignored when backend is ‘pandas’. Defaults to None.
parse_dates (Union[list, dict, None], optional) –
- List of column names to parse as dates.
- Dict of {column_name: format string} where format string is strftime compatible in case of parsing string times or is one of (D, s, ns, ms, us) in case of parsing integer timestamps.
- Dict of {column_name: arg dict}, where the arg dict corresponds to the keyword arguments of pandas.to_datetime()
return_pyarrow_dtypes (bool, optional) – If True, and db is “data_warehouse”, results are returned as a pandas DataFrame of pyarrow dtypes. Ignored if db is not “data_warehouse”. Defaults to False.
kwargs – passed to pd.read_sql or gpd.read_postgis

Returns:

Query results

Return type:

Union[pd.DataFrame, gpd.DataFrame]

pipeline.src.read_query.read_query(query: str | sqlalchemy.Selectable | sqlalchemy.TextClause, *, db: str = None, con: sqlalchemy.engine.Connection | sqlalchemy.engine.Engine = None, chunksize: None | str = None, params: dict | None = None, backend: str = 'pandas', geom_col: str = 'geom', crs: int | None = None, parse_dates: list | dict | None = None, return_pyarrow_dtypes: bool = False, **kwargs) → pandas.DataFrame | geopandas.GeoDataFrame[source]

Run SQLquery on a database. Supported databases :

‘ocan’ : OCAN database

‘fmc’: FMC database

‘monitorfish_remote’: Monitorfish database

‘monitorfish_local’: Monitorfish PostGIS database hosted in CNSP

‘monitorenv_remote’: Monitorenv database

‘cacem_local’ : CACEM PostGIS database hosted in CNSP

‘data_warehouse’ : Monitorfish/Monitorenv/RapportNav Data Warehouse

Database credentials must be present in the environement.

Parameters:

query (str | Selectable | TextClause) – Query to execute (must be a string if querying data warehouse).
db (str, optional) – Database name. Possible values : ‘ocan’, ‘fmc’, ‘monitorfish_remote’, ‘monitorfish_local’, ‘monitorenv_remote’, ‘cacem_local’. If db is None, con must be passed.
con (Union[Connection, Engine], optional) – sqlalchemy.engine.Connection or sqlalchemy.engine.Engine object. Mandatory if no db is given. Ignored if db is given.
chunksize (Union[None, str], optional) – If specified, return an iterator where chunksize is the number of rows to include in each chunk. Defaults to None.
params (Union[dict, None], optional) – Parameters to pass to execute method. Defaults to None.
backend (str, optional) – ‘pandas’ to run a SQL query and return a pandas.DataFrame or ‘geopandas’ to run a PostGIS query and return a geopandas.GeoDataFrame. Defaults to ‘pandas’.
geom_col (str, optional) – column name to convert to shapely geometries when backend is ‘geopandas’. Ignored when backend is ‘pandas’. Defaults to ‘geom’.
crs (Union[None, str], optional) – CRS to use for the returned GeoDataFrame; if not set, tries to determine CRS from the SRID associated with the first geometry in the database, and assigns that to all geometries. Ignored when backend is ‘pandas’. Defaults to None.
parse_dates (Union[list, dict, None], optional) –
- List of column names to parse as dates.
- Dict of {column_name: format string} where format string is strftime compatible in case of parsing string times or is one of (D, s, ns, ms, us) in case of parsing integer timestamps.
- Dict of {column_name: arg dict}, where the arg dict corresponds to the keyword arguments of pandas.to_datetime()
return_pyarrow_dtypes (bool, optional) – If True, and db is “data_warehouse”, results are returned as a pandas DataFrame of pyarrow dtypes. Ignored if db is not “data_warehouse”. Defaults to False.
kwargs – passed to pd.read_sql or gpd.read_postgis

Returns:

Query results

Return type:

Union[pd.DataFrame, gpd.DataFrame]

pipeline.src.read_query.read_table(db: str, schema: str, table_name: str)[source]

Loads database table into pandas Dataframe. Supported databases :

‘ocan’ : OCAN database

‘fmc’: FMC database

‘monitorfish_remote’: Monitorfish database

‘monitorfish_local’: Monitorfish PostGIS database hosted in CNSP

‘monitorenv_remote’: Monitorenv database

‘cacem_local’ : CACEM PostGIS database hosted in CNSP

Parameters:

db (str) – Database name. Possible values : ‘ocan’, ‘fmc’, ‘monitorfish_remote’, ‘monitorfish_local’, ‘monitorenv_remote’, ‘cacem_local’.
schema (str) – Schema name
table_name (str) – Table name

Returns:

Dataframe containing the entire table

Return type:

pd.DataFrame