Warning

The connectors are experimental and subject to change without prior notice. They are not covered by semantic versioning guarantees, and backward incompatible modifications to these classes will not result in a major version increment. Use them with caution in production environments or critical processes.

Data discovery and retrieval

One of the core use cases of SDMX is enabling data consumers to discover and retrieve available datasets from a source. This functionality is formalized in the SDMX-REST “Data Discovery and Retrieval” profile, implemented by the pysdmx connector. Below, we describe the features supported by this connector.

Important

To use the pysdmx connectors, you need to install the pysdmx[data] extra.

Check the installation guide for more information.

Setup: Initializing the pysdmx Connector

pysdmx offers two connectors compliant with the SDMX-REST “Data Discovery and Retrieval” profile:

  • pysdmx.api.dc.rest.SdmxConnector: A basic connector for SDMX-REST services.

  • pysdmx.api.dc.pd.PandasConnector: A connector, based on the one above, that returns data as a Pandas DataFrame.

For this demo, we will use the Pandas Connector, which can be initialized as follows:

from pysdmx.api.dc import Endpoints
from pysdmx.api.dc.pd import PandasConnector

conn = PandasConnector(Endpoints.BIS)

The connector requires the service’s entry point URL to retrieve data and metadata, and we can easily get it using the Endpoints enum. Now that the connector is set up, let’s explore its capabilities!

Data Discovery, step 1: Discover Available Datasets

To begin, we need to list available datasets in the source to identify those of interest. This can be achieved using the dataflows method provided by the connector:

flows = conn.dataflows()
print(f"Found {len(flows)} dataflows.")
for f in flows:
    print(f.short_urn)

# Output:
# Found 32 dataflows.
# Dataflow=BIS:BIS_REL_CAL(1.0)
# Dataflow=BIS:WS_CBPOL(1.0)
# Dataflow=BIS:WS_CBS_PUB(1.0)
# Dataflow=BIS:WS_CBTA(1.0)
# Dataflow=BIS:WS_CPMI_CASHLESS(1.0)
# ... more lines

You can also filter datasets by providing a search term. The method will return dataflows where the term matches the ID, name, or description:

flows = conn.dataflows("banking")
print(f"Found {len(flows)} dataflows.")
for f in flows:
    print(f.short_urn)

# Output:
# Found 5 dataflows.
# Dataflow=BIS:WS_CBS_PUB(1.0)
# Dataflow=BIS:WS_LBS_D_PUB(1.0)
# Dataflow=BIS.CBS:CBS(1.0)
# Dataflow=BIS.LBS:LBSN(1.0)
# Dataflow=BIS.LBS:LBSR(1.0)

Data Discovery, step 2: Gather Information about a Dataset

After identifying a dataflow of interest, the next step is to gather detailed information about it to learn how to query it effectively. This can be achieved using the dataflow method provided by the connector:

# Fetch information about the WS_CBS_PUB dataflow
cbs = conn.dataflow(flows[0])

Basic information

You can print details about the dataflow, such as its name and the number of series it contains:

print(f"Name: {cbs.name}")
print(f"Number of series: {cbs.series_count}")

# Output:
# Name: Consolidated banking
# Number of series: 227004

Querying dimensions

To query data effectively, examine the dimensions and their possible values:

for d in cbs.components.dimensions:
    dv = [c.id for c in d.enumeration]
    print(f"{d.id}: {','.join(dv)}.")

# Output:
# FREQ: Q.
# L_MEASURE: S,B.
# ...  more lines

Rather than returning all theoretically possible values from the codelist for each dimension, the connector only provides values for which data actually exist, reducing the risk of executing queries that yield no results.

Data Retrieval: Download Data

Once you know how to query the data, the next step is to download it as a Pandas Data Frame. This can be achieved using the data method provided by the connector. The data method has one mandatory parameter, dataflow.

Specifying the Dataflow

The dataflow parameter specifies the dataflow from which to retrieve data. It can accept the following types of input:

  • String: A string representation of the dataflow, which can be an SDMX URN (e.g. urn:sdmx:org.sdmx.infomodel.datastructure.Dataflow=BIS:CBS(1.0)) or a short notation for the dataflow, such as BIS:CBS(1.0).

  • Python Object: A Python object that has the following three properties:

    • id: The unique identifier of the dataflow.

    • agency: The agency responsible for the dataflow.

    • version: The version of the dataflow.

    These three properties (id, agency, and version) are the standard attributes used in SDMX to uniquely identify artefacts of a certain type, such as a Dataflow. Examples of Python objects that can be passed include pysdmx.Dataflow, pysdmx.DataflowInfo, or pysdmx.Reference.

Applying Query Filters

To filter the data, you can use pysdmx query filters. For example, to retrieve cross-border claims (L_POSITION=D) for Switzerland (L_REP_CTY=CH):

from pysdmx.api.dc.query import MultiFilter, Operator, TextFilter

f1 = TextFilter("L_POSITION", Operator.EQUALS, "D")
f2 = TextFilter("L_REP_CTY", Operator.EQUALS, "CH")
mf = MultiFilter([f1, f2])

df = conn.data(cbs, mf)

print(df)

# Output:
#                                         FREQ      L_MEASURE
# SERIES_KEY                 TIME_PERIOD
# Q.S.CH.4R.U.D.A.A.TO1.A.5J 2005-Q2      Q         S
#                            2005-Q3      Q         S
#                            2005-Q4      Q         S
#                            2006-Q1      Q         S
#                            2006-Q2      Q         S
# ... more lines
# [164 rows x 24 columns]

By default, the connector infers the index using the series keys and the time period. This behavior can be disabled by setting infer_index=False. Similarly, the generation of series keys is enabled by default and can be turned off by setting infer_series_keys=False.

Alternative Query Formats

The same query can also be expressed as:

  • SQL string: “L_POSITION = ‘D’ AND L_REP_CTY = ‘CH’”

  • Python boolean expression: “L_POSITION == ‘D’ and L_REP_CTY == ‘CH’”

df = conn.data(cbs, "L_POSITION = 'D' AND L_REP_CTY = 'CH'")

Data Types in the Pandas DataFrame

By default, the Pandas DataFrame applies data types based on the Data Structure Definition (DSD). You can disable this feature by setting apply_schema=False when calling the data method.

print(df.dtypes)

# Output:
# DATAFLOW               str
# FREQ              category
# L_MEASURE         category
# L_REP_CTY         category
# CBS_BANK_TYPE     category
# CBS_BASIS         category
# L_POSITION        category
# L_INSTR           category
# REM_MATURITY      category
# CURR_TYPE_BOOK    category
# L_CP_SECTOR       category
# L_CP_COUNTRY      category
# OBS_VALUE          Float64
# DECIMALS          category
# UNIT_MEASURE      category
# UNIT_MULT         category
# TIME_FORMAT       category
# COLLECTION        category
# ORG_VISIBILITY    category
# AVAILABILITY      category
# TITLE_GRP           string
# OBS_STATUS        category
# OBS_CONF          category
# OBS_PRE_BREAK      Float64
# dtype: object

Coded components will be defined as categorical data, making them efficient for analysis.

Label Options for Category Fields

The labels parameter allows you to control how category fields are represented in the DataFrame. You can choose from the following options:

  • id (Default): The DataFrame will include only the code IDs for category fields. This is the most compact representation.

  • name: The code IDs are replaced with their corresponding names, as defined in the codelists.

  • both: Both the code IDs and their names are included in the category fields, formatted as ID: Name.

You can specify the labels parameter when calling the connector:

df = conn.data(cbs, "L_POSITION = 'D' AND L_REP_CTY = 'CH'", labels="name")

print(df)

# Output:
#                                         FREQ          L_MEASURE
# SERIES_KEY                 TIME_PERIOD
# Q.S.CH.4R.U.D.A.A.TO1.A.5J 2005-Q2      Quarterly     Amounts outstanding / Stocks
#                            2005-Q3      Quarterly     Amounts outstanding / Stocks
#                            2005-Q4      Quarterly     Amounts outstanding / Stocks
#                            2006-Q1      Quarterly     Amounts outstanding / Stocks
#                            2006-Q2      Quarterly     Amounts outstanding / Stocks
# ... more lines
# [164 rows x 24 columns]

Selecting Specific Columns

The columns parameter allows you to specify which columns should be included in the resulting DataFrame. This is useful when you are only interested in a subset of the components, as it can reduce the size of the DataFrame and make the output more focused.

df = conn.data(cbs, "L_POSITION = 'D' AND L_REP_CTY = 'CH'", columns=["OBS_VALUE", "OBS_STATUS"])

print(df)

# Output:
#                                         OBS_STATUS   OBS_VALUE
# SERIES_KEY                 TIME_PERIOD
# Q.S.CH.4R.U.D.A.A.TO1.A.5J 2005-Q2      A            805455.0
#                            2005-Q3      A            847949.0
#                            2005-Q4      A            794753.0
#                            2006-Q1      A            919947.0
#                            2006-Q2      A            989307.0
# [164 rows x 2 columns]

If the columns parameter is not specified, all components defined in the Data Structure Definition (DSD) are included in the DataFrame.

We hope this demo has provided a helpful introduction to the capabilities of the pysdmx connectors.