Warning
The connectors are experimental and subject to change without prior notice. They are not covered by semantic versioning guarantees, and backward incompatible modifications to these classes will not result in a major version increment. Use them with caution in production environments or critical processes.
Data discovery and retrieval
One of the core use cases of SDMX is enabling data consumers to discover and retrieve available datasets from a source. This functionality is formalized in the SDMX-REST “Data Discovery and Retrieval” profile, implemented by the pysdmx connector. Below, we describe the features supported by this connector.
Important
To use the pysdmx connectors, you need to install the pysdmx[data] extra.
Check the installation guide for more information.
Setup: Initializing the pysdmx Connector
pysdmx offers two connectors compliant with the SDMX-REST
“Data Discovery and Retrieval” profile:
pysdmx.api.dc.rest.SdmxConnector: A basic connector for SDMX-REST services.pysdmx.api.dc.pd.PandasConnector: A connector, based on the one above, that returns data as a Pandas DataFrame.
For this demo, we will use the Pandas Connector, which can be initialized as follows:
from pysdmx.api.dc import Endpoints
from pysdmx.api.dc.pd import PandasConnector
conn = PandasConnector(Endpoints.BIS)
The connector requires the service’s entry point URL to retrieve data
and metadata, and we can easily get it using the Endpoints enum.
Now that the connector is set up, let’s explore its capabilities!
Data Discovery, step 1: Discover Available Datasets
To begin, we need to list available datasets in the source to
identify those of interest. This can be achieved using the
dataflows method provided by the connector:
flows = conn.dataflows()
print(f"Found {len(flows)} dataflows.")
for f in flows:
print(f.short_urn)
# Output:
# Found 32 dataflows.
# Dataflow=BIS:BIS_REL_CAL(1.0)
# Dataflow=BIS:WS_CBPOL(1.0)
# Dataflow=BIS:WS_CBS_PUB(1.0)
# Dataflow=BIS:WS_CBTA(1.0)
# Dataflow=BIS:WS_CPMI_CASHLESS(1.0)
# ... more lines
You can also filter datasets by providing a search term. The method will return dataflows where the term matches the ID, name, or description:
flows = conn.dataflows("banking")
print(f"Found {len(flows)} dataflows.")
for f in flows:
print(f.short_urn)
# Output:
# Found 5 dataflows.
# Dataflow=BIS:WS_CBS_PUB(1.0)
# Dataflow=BIS:WS_LBS_D_PUB(1.0)
# Dataflow=BIS.CBS:CBS(1.0)
# Dataflow=BIS.LBS:LBSN(1.0)
# Dataflow=BIS.LBS:LBSR(1.0)
Data Discovery, step 2: Gather Information about a Dataset
After identifying a dataflow of interest, the next step is to
gather detailed information about it to learn how to
query it effectively. This can be achieved using the
dataflow method provided by the connector:
# Fetch information about the WS_CBS_PUB dataflow
cbs = conn.dataflow(flows[0])
Basic information
You can print details about the dataflow, such as its name and the number of series it contains:
print(f"Name: {cbs.name}")
print(f"Number of series: {cbs.series_count}")
# Output:
# Name: Consolidated banking
# Number of series: 227004
Querying dimensions
To query data effectively, examine the dimensions and their possible values:
for d in cbs.components.dimensions:
dv = [c.id for c in d.enumeration]
print(f"{d.id}: {','.join(dv)}.")
# Output:
# FREQ: Q.
# L_MEASURE: S,B.
# ... more lines
Rather than returning all theoretically possible values from the codelist for each dimension, the connector only provides values for which data actually exist, reducing the risk of executing queries that yield no results.
Data Retrieval: Download Data
Once you know how to query the data, the next step is to download it
as a Pandas Data Frame. This can be achieved using the data method
provided by the connector. The data method has one mandatory parameter,
dataflow.
Specifying the Dataflow
The dataflow parameter specifies the dataflow from which to retrieve data. It can accept the following types of input:
String: A string representation of the dataflow, which can be an SDMX URN (e.g.
urn:sdmx:org.sdmx.infomodel.datastructure.Dataflow=BIS:CBS(1.0)) or a short notation for the dataflow, such asBIS:CBS(1.0).Python Object: A Python object that has the following three properties:
id: The unique identifier of the dataflow.agency: The agency responsible for the dataflow.version: The version of the dataflow.
These three properties (
id,agency, andversion) are the standard attributes used in SDMX to uniquely identify artefacts of a certain type, such as a Dataflow. Examples of Python objects that can be passed includepysdmx.Dataflow,pysdmx.DataflowInfo, orpysdmx.Reference.
Applying Query Filters
To filter the data, you can use pysdmx query filters. For example, to retrieve cross-border claims (L_POSITION=D) for Switzerland (L_REP_CTY=CH):
from pysdmx.api.dc.query import MultiFilter, Operator, TextFilter
f1 = TextFilter("L_POSITION", Operator.EQUALS, "D")
f2 = TextFilter("L_REP_CTY", Operator.EQUALS, "CH")
mf = MultiFilter([f1, f2])
df = conn.data(cbs, mf)
print(df)
# Output:
# FREQ L_MEASURE
# SERIES_KEY TIME_PERIOD
# Q.S.CH.4R.U.D.A.A.TO1.A.5J 2005-Q2 Q S
# 2005-Q3 Q S
# 2005-Q4 Q S
# 2006-Q1 Q S
# 2006-Q2 Q S
# ... more lines
# [164 rows x 24 columns]
By default, the connector infers the index using the series keys and the time period.
This behavior can be disabled by setting infer_index=False. Similarly, the
generation of series keys is enabled by default and can be turned off by setting
infer_series_keys=False.
Alternative Query Formats
The same query can also be expressed as:
SQL string: “L_POSITION = ‘D’ AND L_REP_CTY = ‘CH’”
Python boolean expression: “L_POSITION == ‘D’ and L_REP_CTY == ‘CH’”
df = conn.data(cbs, "L_POSITION = 'D' AND L_REP_CTY = 'CH'")
Data Types in the Pandas DataFrame
By default, the Pandas DataFrame applies data types based on the Data Structure
Definition (DSD). You can disable this feature by setting apply_schema=False
when calling the data method.
print(df.dtypes)
# Output:
# DATAFLOW str
# FREQ category
# L_MEASURE category
# L_REP_CTY category
# CBS_BANK_TYPE category
# CBS_BASIS category
# L_POSITION category
# L_INSTR category
# REM_MATURITY category
# CURR_TYPE_BOOK category
# L_CP_SECTOR category
# L_CP_COUNTRY category
# OBS_VALUE Float64
# DECIMALS category
# UNIT_MEASURE category
# UNIT_MULT category
# TIME_FORMAT category
# COLLECTION category
# ORG_VISIBILITY category
# AVAILABILITY category
# TITLE_GRP string
# OBS_STATUS category
# OBS_CONF category
# OBS_PRE_BREAK Float64
# dtype: object
Coded components will be defined as categorical data, making them efficient for analysis.
Label Options for Category Fields
The labels parameter allows you to control how category fields are represented in the
DataFrame. You can choose from the following options:
id (Default): The DataFrame will include only the code IDs for category fields. This is the most compact representation.
name: The code IDs are replaced with their corresponding names, as defined in the codelists.
both: Both the code IDs and their names are included in the category fields, formatted as
ID: Name.
You can specify the labels parameter when calling the connector:
df = conn.data(cbs, "L_POSITION = 'D' AND L_REP_CTY = 'CH'", labels="name")
print(df)
# Output:
# FREQ L_MEASURE
# SERIES_KEY TIME_PERIOD
# Q.S.CH.4R.U.D.A.A.TO1.A.5J 2005-Q2 Quarterly Amounts outstanding / Stocks
# 2005-Q3 Quarterly Amounts outstanding / Stocks
# 2005-Q4 Quarterly Amounts outstanding / Stocks
# 2006-Q1 Quarterly Amounts outstanding / Stocks
# 2006-Q2 Quarterly Amounts outstanding / Stocks
# ... more lines
# [164 rows x 24 columns]
Selecting Specific Columns
The columns parameter allows you to specify which columns should be included in the resulting DataFrame. This is useful when you are only interested in a subset of the components, as it can reduce the size of the DataFrame and make the output more focused.
df = conn.data(cbs, "L_POSITION = 'D' AND L_REP_CTY = 'CH'", columns=["OBS_VALUE", "OBS_STATUS"])
print(df)
# Output:
# OBS_STATUS OBS_VALUE
# SERIES_KEY TIME_PERIOD
# Q.S.CH.4R.U.D.A.A.TO1.A.5J 2005-Q2 A 805455.0
# 2005-Q3 A 847949.0
# 2005-Q4 A 794753.0
# 2006-Q1 A 919947.0
# 2006-Q2 A 989307.0
# [164 rows x 2 columns]
If the columns parameter is not specified, all components defined in the Data Structure Definition (DSD) are included in the DataFrame.
We hope this demo has provided a helpful introduction to the capabilities of the pysdmx connectors.