.. warning:: The connectors are experimental and subject to change without prior notice. They are not covered by semantic versioning guarantees, and backward incompatible modifications to these classes will not result in a major version increment. Use them with caution in production environments or critical processes. Data discovery and retrieval ============================ One of the core use cases of SDMX is enabling data consumers to **discover and retrieve available datasets from a source**. This functionality is formalized in the SDMX-REST "Data Discovery and Retrieval" profile, implemented by the pysdmx connector. Below, we describe the features supported by this connector. .. important:: To use the pysdmx connectors, you need to install the `pysdmx[data]` extra. Check the :ref:`installation guide ` for more information. Setup: Initializing the pysdmx Connector ---------------------------------------- ``pysdmx`` offers two connectors compliant with the SDMX-REST "Data Discovery and Retrieval" profile: - ``pysdmx.api.dc.rest.SdmxConnector``: A basic connector for SDMX-REST services. - ``pysdmx.api.dc.pd.PandasConnector``: A connector, based on the one above, that returns data as a Pandas DataFrame. For this demo, we will use the **Pandas Connector**, which can be initialized as follows: .. code-block:: python from pysdmx.api.dc import Endpoints from pysdmx.api.dc.pd import PandasConnector conn = PandasConnector(Endpoints.BIS) The connector requires the service's entry point URL to retrieve data and metadata, and we can easily get it using the ``Endpoints`` enum. Now that the connector is set up, let's explore its capabilities! Data Discovery, step 1: Discover Available Datasets --------------------------------------------------- To begin, we need to **list available datasets** in the source to **identify those of interest**. This can be achieved using the ``dataflows`` method provided by the connector: .. code-block:: python flows = conn.dataflows() print(f"Found {len(flows)} dataflows.") for f in flows: print(f.short_urn) # Output: # Found 32 dataflows. # Dataflow=BIS:BIS_REL_CAL(1.0) # Dataflow=BIS:WS_CBPOL(1.0) # Dataflow=BIS:WS_CBS_PUB(1.0) # Dataflow=BIS:WS_CBTA(1.0) # Dataflow=BIS:WS_CPMI_CASHLESS(1.0) # ... more lines You can also filter datasets by providing a search term. The method will return dataflows where the term matches the ID, name, or description: .. code-block:: python flows = conn.dataflows("banking") print(f"Found {len(flows)} dataflows.") for f in flows: print(f.short_urn) # Output: # Found 5 dataflows. # Dataflow=BIS:WS_CBS_PUB(1.0) # Dataflow=BIS:WS_LBS_D_PUB(1.0) # Dataflow=BIS.CBS:CBS(1.0) # Dataflow=BIS.LBS:LBSN(1.0) # Dataflow=BIS.LBS:LBSR(1.0) Data Discovery, step 2: Gather Information about a Dataset ---------------------------------------------------------- After identifying a dataflow of interest, the next step is to gather detailed **information about it** to learn how to **query it effectively**. This can be achieved using the ``dataflow`` method provided by the connector: .. code-block:: python # Fetch information about the WS_CBS_PUB dataflow cbs = conn.dataflow(flows[0]) Basic information ^^^^^^^^^^^^^^^^^ You can print details about the dataflow, such as its name and the number of series it contains: .. code-block:: python print(f"Name: {cbs.name}") print(f"Number of series: {cbs.series_count}") # Output: # Name: Consolidated banking # Number of series: 227004 Querying dimensions ^^^^^^^^^^^^^^^^^^^ To query data effectively, examine the dimensions and their possible values: .. code-block:: python for d in cbs.components.dimensions: dv = [c.id for c in d.enumeration] print(f"{d.id}: {','.join(dv)}.") # Output: # FREQ: Q. # L_MEASURE: S,B. # ... more lines Rather than returning all theoretically possible values from the codelist for each dimension, the connector only provides values for which data actually exist, reducing the risk of executing queries that yield no results. Data Retrieval: Download Data ----------------------------- Once you know how to query the data, the next step is to **download it** as a **Pandas Data Frame**. This can be achieved using the ``data`` method provided by the connector. The ``data`` method has one mandatory parameter, ``dataflow``. Specifying the Dataflow ^^^^^^^^^^^^^^^^^^^^^^^ The `dataflow` parameter specifies the dataflow from which to retrieve data. It can accept the following types of input: - **String**: A string representation of the dataflow, which can be an **SDMX URN** (e.g. ``urn:sdmx:org.sdmx.infomodel.datastructure.Dataflow=BIS:CBS(1.0)``) or a short notation for the dataflow, such as ``BIS:CBS(1.0)``. - **Python Object**: A Python object that has the following three properties: - ``id``: The unique identifier of the dataflow. - ``agency``: The agency responsible for the dataflow. - ``version``: The version of the dataflow. These three properties (``id``, ``agency``, and ``version``) are the standard attributes used in SDMX to uniquely identify artefacts of a certain type, such as a Dataflow. Examples of Python objects that can be passed include ``pysdmx.Dataflow``, ``pysdmx.DataflowInfo``, or ``pysdmx.Reference``. Applying Query Filters ^^^^^^^^^^^^^^^^^^^^^^ To filter the data, you can use `pysdmx` query filters. For example, to retrieve cross-border claims (`L_POSITION=D`) for Switzerland (`L_REP_CTY=CH`): .. code-block:: python from pysdmx.api.dc.query import MultiFilter, Operator, TextFilter f1 = TextFilter("L_POSITION", Operator.EQUALS, "D") f2 = TextFilter("L_REP_CTY", Operator.EQUALS, "CH") mf = MultiFilter([f1, f2]) df = conn.data(cbs, mf) print(df) # Output: # FREQ L_MEASURE # SERIES_KEY TIME_PERIOD # Q.S.CH.4R.U.D.A.A.TO1.A.5J 2005-Q2 Q S # 2005-Q3 Q S # 2005-Q4 Q S # 2006-Q1 Q S # 2006-Q2 Q S # ... more lines # [164 rows x 24 columns] By default, the connector infers the index using the series keys and the time period. This behavior can be disabled by setting ``infer_index=False``. Similarly, the generation of series keys is enabled by default and can be turned off by setting ``infer_series_keys=False``. Alternative Query Formats ^^^^^^^^^^^^^^^^^^^^^^^^^ The same query can also be expressed as: - **SQL string**: "L_POSITION = 'D' AND L_REP_CTY = 'CH'" - **Python boolean expression**: "L_POSITION == 'D' and L_REP_CTY == 'CH'" .. code-block:: python df = conn.data(cbs, "L_POSITION = 'D' AND L_REP_CTY = 'CH'") Data Types in the Pandas DataFrame ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ By default, the Pandas DataFrame applies data types based on the Data Structure Definition (DSD). You can disable this feature by setting ``apply_schema=False`` when calling the ``data`` method. .. code-block:: python print(df.dtypes) # Output: # DATAFLOW str # FREQ category # L_MEASURE category # L_REP_CTY category # CBS_BANK_TYPE category # CBS_BASIS category # L_POSITION category # L_INSTR category # REM_MATURITY category # CURR_TYPE_BOOK category # L_CP_SECTOR category # L_CP_COUNTRY category # OBS_VALUE Float64 # DECIMALS category # UNIT_MEASURE category # UNIT_MULT category # TIME_FORMAT category # COLLECTION category # ORG_VISIBILITY category # AVAILABILITY category # TITLE_GRP string # OBS_STATUS category # OBS_CONF category # OBS_PRE_BREAK Float64 # dtype: object Coded components will be defined as categorical data, making them efficient for analysis. Label Options for Category Fields ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The ``labels`` parameter allows you to control how category fields are represented in the DataFrame. You can choose from the following options: - **id** (Default): The DataFrame will include only the code IDs for category fields. This is the most compact representation. - **name**: The code IDs are replaced with their corresponding names, as defined in the codelists. - **both**: Both the code IDs and their names are included in the category fields, formatted as ``ID: Name``. You can specify the ``labels`` parameter when calling the connector: .. code-block:: python df = conn.data(cbs, "L_POSITION = 'D' AND L_REP_CTY = 'CH'", labels="name") print(df) # Output: # FREQ L_MEASURE # SERIES_KEY TIME_PERIOD # Q.S.CH.4R.U.D.A.A.TO1.A.5J 2005-Q2 Quarterly Amounts outstanding / Stocks # 2005-Q3 Quarterly Amounts outstanding / Stocks # 2005-Q4 Quarterly Amounts outstanding / Stocks # 2006-Q1 Quarterly Amounts outstanding / Stocks # 2006-Q2 Quarterly Amounts outstanding / Stocks # ... more lines # [164 rows x 24 columns] Selecting Specific Columns ^^^^^^^^^^^^^^^^^^^^^^^^^^ The `columns` parameter allows you to specify which columns should be included in the resulting DataFrame. This is useful when you are only interested in a subset of the components, as it can reduce the size of the DataFrame and make the output more focused. .. code-block:: python df = conn.data(cbs, "L_POSITION = 'D' AND L_REP_CTY = 'CH'", columns=["OBS_VALUE", "OBS_STATUS"]) print(df) # Output: # OBS_STATUS OBS_VALUE # SERIES_KEY TIME_PERIOD # Q.S.CH.4R.U.D.A.A.TO1.A.5J 2005-Q2 A 805455.0 # 2005-Q3 A 847949.0 # 2005-Q4 A 794753.0 # 2006-Q1 A 919947.0 # 2006-Q2 A 989307.0 # [164 rows x 2 columns] If the `columns` parameter is not specified, all components defined in the Data Structure Definition (DSD) are included in the DataFrame. We hope this demo has provided a helpful introduction to the capabilities of the pysdmx connectors.