Reading and writing SDMX datasets

Note

This tutorial shows how to read and write SDMX datasets using pysdmx.

Warning

Reading and writing data requires the “data” extra, which you can install with the following command:

pip install pysdmx[data]

For the SDMX-ML format, you also need to install the “xml” extra:

pip install pysdmx[data,xml]

pysdmx can read and write SDMX datasets in the following formats:

SDMX-CSV 1.0 (located in pysdmx.io.csv.sdmx10)
SDMX-CSV 2.0 (located in pysdmx.io.csv.sdmx20)
SDMX-ML 2.1 (located in pysdmx.io.xml.sdmx21)
- SDMX-ML 2.1 Generic
- SDMX-ML 2.1 Structure Specific

Currently, all data-related readers and writers are based on the PandasDataset class.

class pysdmx.io.pd.PandasDataset(*, structure: str | ~pysdmx.model.dataflow.Schema, attributes: ~typing.Dict[str, ~typing.Any] = <factory>, action: ~pysdmx.model.dataset.ActionType = ActionType.Information, reporting_begin: ~datetime.date | None = None, reporting_end: ~datetime.date | None = None, data_extraction_date: ~datetime.date | None = None, valid_from: ~datetime.date | None = None, valid_to: ~datetime.date | None = None, publication_year: ~datetime.date | None = None, publication_period: ~datetime.date | None = None, set_id: str | None = None, data: ~pandas.core.frame.DataFrame)

Bases: Dataset

A Dataset backed by a pandas DataFrame.

It extends the SDMX Dataset and uses a pandas DataFrame to hold the data.

Parameters:

attributes – Attributes at dataset level.
data – Pandas Dataframe.
structure – URN or Schema related to this Dataset (DSD, Dataflow, ProvisionAgreement)

Reading data

To read data, we recommend using the read_sdmx function or the get_datasets function:

pysdmx.io.read_sdmx(sdmx_document, validate=True)

Reads any SDMX message and returns a Message object.

Supported structures formats are:
- SDMX-ML 2.1 Structures

Supported web service submissions are:
- SDMX-ML 2.1 RegistryInterface (Submission)
- SDMX-ML 2.1 Error (raises an exception with the error content)

Supported data formats are:
- SDMX-ML 2.1
- SDMX-CSV 1.0
- SDMX-CSV 2.0

Parameters:

sdmx_document (Union[str, Path, BytesIO]) – Path to file (pathlib.Path), URL, or string.
validate (bool) – Validate the input file (only for SDMX-ML).

Return type:

Message

Returns:

A Message containing the parsed SDMX data or metadata.

Raises:

Invalid – If the file is empty or the format is not supported.
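
The three kinds of input listed in the parameters above (path, URL or string, and in-memory buffer) can be prepared as in the following sketch. It uses only the standard library. `describe_input` is a hypothetical helper written for this illustration. It is not part of pysdmx and does not reproduce how read_sdmx inspects its argument. The sample CSV content is made up.

```python
from io import BytesIO
from pathlib import Path
from typing import Union

# The input types accepted according to the signature above
SdmxInput = Union[str, Path, BytesIO]


def describe_input(source: SdmxInput) -> str:
    """Return a label for the kind of input (illustrative helper only)."""
    if isinstance(source, BytesIO):
        return "buffer"
    if isinstance(source, Path):
        return "path"
    if source.startswith(("http://", "https://")):
        return "url"
    return "string"


# Made-up SDMX-CSV content held in memory
raw = "DATAFLOW,FREQ,TIME_PERIOD,OBS_VALUE\nMD:TEST(1.0),A,2023,1.5\n"

inputs = [
    Path("sample.csv"),                # file on disk
    "https://example.com/sample.csv",  # URL
    raw,                               # message content as a string
    BytesIO(raw.encode("utf-8")),      # in-memory buffer
]
labels = [describe_input(item) for item in inputs]
print(labels)
```

Any of the four values in `inputs` has a type that the signature above accepts as `sdmx_document`.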

A typical example of reading data from a file or a URL with read_sdmx:

from pathlib import Path

from pysdmx.io import read_sdmx

# Path to a data file in the same folder as this code
file_path = Path(__file__).parent / "sample.csv"

# Read from file
data_msg = read_sdmx(file_path)

# Read from URL
data_msg = read_sdmx("https://example.com/sample.csv")

# Extract the datasets (list of Dataset)
datasets = data_msg.data

# Access the data of the test dataset by its short URN
df = data_msg.get_dataset("DataStructure=TEST_AGENCY:TEST_ID(1.0)").data

# Access the data of the test dataset by its position in the SDMX message
df = data_msg.data[0].data
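
The short URN passed to get_dataset follows the pattern `Type=AGENCY:ID(VERSION)`. The sketch below splits such a string into its parts with a regular expression. `ShortUrn` and `parse_short_urn` are hypothetical names introduced for this illustration and are not part of pysdmx. The pattern only covers simple identifiers like the ones used on this page.

```python
import re
from typing import NamedTuple


class ShortUrn(NamedTuple):
    artefact_type: str
    agency: str
    id: str
    version: str


# Type=AGENCY:ID(VERSION), e.g. DataStructure=TEST_AGENCY:TEST_ID(1.0)
_SHORT_URN = re.compile(
    r"^(?P<type>\w+)=(?P<agency>[\w.]+):(?P<id>[\w\-@$]+)\((?P<version>[^)]+)\)$"
)


def parse_short_urn(value: str) -> ShortUrn:
    """Split a short URN into type, agency, id and version."""
    match = _SHORT_URN.match(value)
    if match is None:
        raise ValueError(f"Not a short URN: {value!r}")
    return ShortUrn(match["type"], match["agency"], match["id"], match["version"])


urn = parse_short_urn("DataStructure=TEST_AGENCY:TEST_ID(1.0)")
print(urn.artefact_type, urn.agency, urn.id, urn.version)
```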

By default, the read_sdmx function automatically detects the format of the file and uses the appropriate reader. Alternatively, use the get_datasets function to associate each dataset with its Schema:

pysdmx.io.get_datasets(data, structure=None, validate=True)

Reads a data message and, optionally, a structure message, and returns a sequence of datasets.

Parameters:

data (Union[str, Path, BytesIO]) – Path to file (pathlib.Path), URL, or string for the data message.
structure (Union[str, Path, BytesIO, None]) – Path to file (pathlib.Path), URL, or string for the structure message, if needed.
validate (bool) – Validate the input file (only for SDMX-ML).

Return type:

Sequence[Dataset]

Returns:

A sequence of Datasets

Raises:

Invalid – If the data message is empty or the related data structure (or dataflow with its children) is not found.
NotFound – If the related data structure (or dataflow with its children) is not found.

Important

If a structure message is provided, the get_datasets function associates each dataset with its Schema. Otherwise, it returns a list of datasets without any Schema association.

If a dataset references a dataflow, the structure message must also contain the dataflow's children (or all descendants), i.e. the DataStructureDefinition associated with this Dataflow, in the same SDMX message (with or without referenced artefacts such as Codelists and ConceptSchemes).

from pathlib import Path

from pysdmx.io import get_datasets

# Path to a data file in the same folder as this code (SDMX-CSV 2.0)
data_path = Path(__file__).parent / "sample.csv"

# Data contains a reference to the dataflow ``Dataflow=MD:TEST(1.0)``
datasets = get_datasets(data_path)

# Outputs a string with the Schema short URN -> "Dataflow=MD:TEST(1.0)"
print(datasets[0].structure)

# Read the datasets and associate the schema
datasets = get_datasets(data_path, "https://example.com/dataflow/MD/TEST/1.0?references=descendants")

# Outputs a Schema object with the associated components
print(datasets[0].structure)
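
As the two print calls show, `structure` is a plain string when no Schema was associated and an object otherwise. A minimal sketch of a check built on that observation follows. `unresolved_references` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for real datasets so that the snippet runs without pysdmx.

```python
from types import SimpleNamespace
from typing import Iterable, List


def unresolved_references(datasets: Iterable[object]) -> List[str]:
    """Return the short URNs of datasets whose structure is still a string."""
    return [ds.structure for ds in datasets if isinstance(ds.structure, str)]


# Stand-ins for datasets: one without and one with an associated structure
datasets = [
    SimpleNamespace(structure="Dataflow=MD:TEST(1.0)"),
    SimpleNamespace(structure=object()),  # placeholder for a Schema object
]
pending = unresolved_references(datasets)
print(pending)
```

A non-empty result would indicate which structure messages still need to be supplied to get_datasets.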

Both functions rely on the individual readers for each supported format, which are described below. Every individual reader takes a string as input.

SDMX-CSV 1.0

SDMX-CSV 1.0 specification

Warning

The SDMX-CSV 1.0 format is deprecated and should not be used for new implementations. It only allows a dataflow to be represented, which is not enough for most use cases.

pysdmx.io.csv.sdmx10.reader.read(input_str)

Reads csv data and returns a sequence of Datasets.

Parameters: input_str (str) – CSV data to read.
Return type: Sequence[PandasDataset]
Returns: A sequence of PandasDatasets.
Raises: Invalid – If it is an invalid CSV file.

from pathlib import Path

from pysdmx.io.csv.sdmx10.reader import read
from pysdmx.io.input_processor import process_string_to_read

# Read file sample10.csv from the same folder as this code
file_path = Path(__file__).parent / "sample10.csv"
input_str, read_format = process_string_to_read(file_path)

# The reader returns a list of datasets
datasets = read(input_str)

# Access the data of the first dataset
df = datasets[0].data
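
The limitation mentioned in the warning above is visible in the column layout. To the best of our understanding of the specifications, an SDMX-CSV 1.0 file starts with a single DATAFLOW column, while SDMX-CSV 2.0 starts with STRUCTURE, STRUCTURE_ID and ACTION columns, so it can also reference a data structure or a provision agreement. The sketch below reads both layouts with the standard library. The sample rows and the helper names are made up for this illustration.

```python
import csv
import io

# Made-up sample content for both versions
CSV_10 = (
    "DATAFLOW,FREQ,REF_AREA,TIME_PERIOD,OBS_VALUE\n"
    "MD:TEST(1.0),A,ES,2023,1.5\n"
)
CSV_20 = (
    "STRUCTURE,STRUCTURE_ID,ACTION,FREQ,REF_AREA,TIME_PERIOD,OBS_VALUE\n"
    "datastructure,MD:TEST(1.0),I,A,ES,2023,1.5\n"
)


def first_row(csv_text: str) -> dict:
    """Return the first data row as a dict keyed by column name."""
    return next(csv.DictReader(io.StringIO(csv_text)))


def structure_reference(row: dict) -> tuple:
    """Return (structure type, structure id) for a 1.0 or a 2.0 row."""
    if "DATAFLOW" in row:
        # SDMX-CSV 1.0 can only point to a dataflow
        return ("dataflow", row["DATAFLOW"])
    return (row["STRUCTURE"], row["STRUCTURE_ID"])


ref_10 = structure_reference(first_row(CSV_10))
ref_20 = structure_reference(first_row(CSV_20))
print(ref_10)
print(ref_20)
```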

SDMX-CSV 2.0

SDMX-CSV 2.0 specification

pysdmx.io.csv.sdmx20.reader.read(input_str)

Reads csv data and returns a sequence of Datasets.

Parameters: input_str (str) – CSV data to read.
Return type: Sequence[PandasDataset]
Returns: A sequence of PandasDatasets.
Raises: Invalid – If it is an invalid CSV file.

Only the comma delimiter and the ordinary case are currently supported.

For the remaining use cases you may use a custom script. If you are interested in them, please raise an issue on GitHub.

from pathlib import Path

from pysdmx.io.csv.sdmx20.reader import read
from pysdmx.io.input_processor import process_string_to_read

# Read file sample20.csv from the same folder as this code
file_path = Path(__file__).parent / "sample20.csv"
input_str, read_format = process_string_to_read(file_path)

# The reader returns a list of datasets
datasets = read(input_str)

# Access the data of the first dataset
df = datasets[0].data
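
Since only the comma delimiter is supported, a file that uses another delimiter has to be converted first. One possible custom script is sketched below, using only the standard library. `to_comma_delimited` is a hypothetical helper, not a pysdmx function, and the sample content is made up.

```python
import csv
import io


def to_comma_delimited(text: str, delimiter: str = ";") -> str:
    """Rewrite delimited text as comma-delimited CSV, quoting where needed."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


# Made-up semicolon-delimited content; note the comma inside the last value
semicolon_csv = (
    "STRUCTURE;STRUCTURE_ID;ACTION;OBS_VALUE\n"
    "dataflow;MD:TEST(1.0);I;1,5\n"
)
converted = to_comma_delimited(semicolon_csv)
print(converted)
```

The converted string can then be passed to the reader as `input_str`. Values that contain a comma are quoted by the csv module, so they stay in a single column.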

SDMX-ML 2.1 Data Readers

SDMX-ML 2.1 format is described here (in Source Code, check documentation folder)

pysdmx supports both the Generic and the Structure Specific flavours of SDMX-ML 2.1 data, in either All Dimensions or Series format.

pysdmx.io.xml.sdmx21.reader.generic.read(input_str, validate=True)

Reads SDMX-ML 2.1 Generic data and returns a sequence of Datasets.

Parameters:

input_str (str) – SDMX-ML data to read.
validate (bool) – If True, the XML data will be validated against the XSD.

Return type:

Sequence[PandasDataset]

pysdmx.io.xml.sdmx21.reader.structure_specific.read(input_str, validate=True)

Reads SDMX-ML 2.1 Structure Specific data and returns a sequence of Datasets.

Parameters:

input_str (str) – SDMX-ML data to read.
validate (bool) – If True, the XML data will be validated against the XSD.

Return type:

Sequence[PandasDataset]

We do not support the following elements:

Dimension Group
Reference to Provision Agreement

The reader supports both Generic and Structure Specific SDMX-ML 2.1. It will automatically detect any structural validation errors (if validate=True) and raise an exception.

Warning

The SDMX-ML 2.1 Generic format is deprecated and should not be used for new implementations. SDMX-ML 3.0 only uses the Structure Specific format, which is more efficient and easier to use.

from pathlib import Path

from pysdmx.io.input_processor import process_string_to_read
from pysdmx.io.xml.sdmx21.reader.generic import read as read_generic  # For Generic format
from pysdmx.io.xml.sdmx21.reader.structure_specific import read  # For Structure Specific format

# Read file sample21.xml from the same folder as this code
file_path = Path(__file__).parent / "sample21.xml"
input_str, read_format = process_string_to_read(file_path)

# The reader returns a list of datasets
# (use read_generic instead if the file is in Generic format)
datasets = read(input_str, validate=True)

# Access the data of the first dataset
df = datasets[0].data
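
When calling the individual readers directly, you need to know which of the two flavours a file uses. The sketch below inspects the root element with the standard library, assuming the usual SDMX-ML 2.1 root element names (GenericData, StructureSpecificData and their TimeSeries variants). `sdmx_ml_flavour` is a hypothetical helper for illustration only. The read_sdmx function already performs format detection on its own.

```python
import xml.etree.ElementTree as ET

# Namespace of SDMX-ML 2.1 messages
MESSAGE_NS = "http://www.sdmx.org/resources/sdmxml/schemas/v2_1/message"


def sdmx_ml_flavour(xml_text: str) -> str:
    """Return 'generic' or 'structure_specific' based on the root element."""
    root = ET.fromstring(xml_text)
    # Tags look like "{namespace}LocalName"; keep only the local name
    local_name = root.tag.rsplit("}", 1)[-1]
    if local_name.startswith("StructureSpecific"):
        return "structure_specific"
    if local_name.startswith("Generic"):
        return "generic"
    raise ValueError(f"Not an SDMX-ML data message: {local_name}")


generic_doc = f'<mes:GenericData xmlns:mes="{MESSAGE_NS}"/>'
specific_doc = f'<mes:StructureSpecificData xmlns:mes="{MESSAGE_NS}"/>'
print(sdmx_ml_flavour(generic_doc))
print(sdmx_ml_flavour(specific_doc))
```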

Writing data

pysdmx can return the written data as a string or write it to a file. SDMX-CSV writers only allow one dataset to be written at a time, while SDMX-ML writers allow multiple datasets to be written at once.

SDMX-CSV 1.0

SDMX-CSV 1.0 specification

Warning

The SDMX-CSV 1.0 format is deprecated and should not be used for new implementations. It only allows a dataflow to be represented, which is not enough for most use cases.

pysdmx.io.csv.sdmx10.writer.write(datasets, output_path=None)

Write data to SDMX-CSV 1.0 format.

Parameters:

datasets (Sequence[PandasDataset]) – List of datasets to write. Must have the same components.
output_path (Optional[str]) – Path to write the data to. If None, the data is returned as a string.

Return type:

Optional[str]

Returns:

SDMX CSV data as a string, if output_path is None.

from pysdmx.io.csv.sdmx10.writer import write
from pathlib import Path

# Write to file sample.csv in the same folder as this code
file_path = Path(__file__).parent / "sample.csv"

# Write the datasets (list of Dataset or PandasDataset) to the file
write(datasets, file_path)

SDMX-CSV 2.0

SDMX-CSV 2.0 specification

Note

The SDMX-CSV 2.0 writer will write the data as the ordinary case. If you need to write data in other cases, you may need to write a custom script.

Warning

We use only comma as the delimiter.

pysdmx.io.csv.sdmx20.writer.write(datasets, output_path=None)

Write data to SDMX-CSV 2.0 format.

Parameters:

datasets (Sequence[PandasDataset]) – List of datasets to write. Must have the same components.
output_path (Optional[str]) – Path to write the data to. If None, the data is returned as a string.

Return type:

Optional[str]

Returns:

SDMX CSV data as a string, if output_path is None.

from pathlib import Path

from pysdmx.io.csv.sdmx20.writer import write

# Write to file sample.csv in the same folder as this code
file_path = Path(__file__).parent / "sample.csv"

# Write the datasets (list of PandasDataset) to the file
write(datasets, file_path)

SDMX-ML 2.1 Data Writers

SDMX-ML 2.1 format is described here (in Source Code, check documentation folder)

The SDMX-ML 2.1 format can hold multiple datasets in a single message. To use the Series format, pass the dimension at observation dictionary, where each key is a dataset short URN and each value is the ID of the dimension at observation.

Important

For each dataset, if dataset.structure is not a Schema, the writer can only write in the Structure Specific All Dimensions format. We perform a check to ensure that the dataset has a Schema structure for the remaining formats as we need to know the roles for each component. This check also ensures that the dataset.structure has at least one dimension and one measure defined.

pysdmx.io.xml.sdmx21.writer.generic.write(datasets, output_path='', prettyprint=True, header=None, dimension_at_observation=None)

Write data to SDMX-ML 2.1 Generic format.

Parameters:

datasets (Sequence[PandasDataset]) – The datasets to be written.
output_path (str) – The path to save the file.
prettyprint (bool) – Prettyprint or not.
header (Optional[Header]) – The header to be used (generated if None).
dimension_at_observation (Optional[Dict[str, str]]) – The mapping between the dataset and the dimension at observation.

Return type:

Optional[str]

Returns:

The XML string if output_path is empty, None otherwise.

pysdmx.io.xml.sdmx21.writer.structure_specific.write(datasets, output_path='', prettyprint=True, header=None, dimension_at_observation=None)

Write data to SDMX-ML 2.1 Structure Specific format.

Parameters:

datasets (Sequence[PandasDataset]) – The datasets to be written.
output_path (str) – The path to save the file.
prettyprint (bool) – Prettyprint or not.
header (Optional[Header]) – The header to be used (generated if None).
dimension_at_observation (Optional[Dict[str, str]]) – The mapping between the dataset and the dimension at observation.

Return type:

Optional[str]

Returns:

The XML string if output_path is empty, None otherwise.

from pathlib import Path

from pysdmx.io.xml.sdmx21.writer.generic import write as write_generic  # For Generic format
from pysdmx.io.xml.sdmx21.writer.structure_specific import write  # For Structure Specific format

# List of datasets to write
datasets = [dataset1, dataset2]

# Dimension at observation mapping (you do not need to include every dataset)
dim_mapping = {
    "DataStructure=TEST_AGENCY:TEST_ID(1.0)": "TIME_PERIOD"
}

# Write to file sample.xml in the same folder as this code.
# This writes one dataset in Series format and the other in All Dimensions format.
file_path = Path(__file__).parent / "sample.xml"
write(datasets, file_path, dimension_at_observation=dim_mapping)
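
To make the role of the dimension at observation more concrete, the sketch below groups flat observations into series using plain Python. All dimensions except the dimension at observation form the series key, and the remaining dimension identifies each observation within a series. This only illustrates the concept and is not the writer's implementation. The rows and the `group_into_series` helper are made up for this example.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def group_into_series(
    rows: Sequence[Dict[str, str]],
    dimensions: Sequence[str],
    dimension_at_observation: str,
) -> Dict[Tuple[str, ...], List[Tuple[str, str]]]:
    """Group rows by every dimension except the dimension at observation."""
    series_dims = [d for d in dimensions if d != dimension_at_observation]
    series = defaultdict(list)
    for row in rows:
        key = tuple(row[d] for d in series_dims)
        series[key].append((row[dimension_at_observation], row["OBS_VALUE"]))
    return dict(series)


# Made-up observations in a flat (All Dimensions) layout
rows = [
    {"FREQ": "A", "REF_AREA": "ES", "TIME_PERIOD": "2022", "OBS_VALUE": "1.0"},
    {"FREQ": "A", "REF_AREA": "ES", "TIME_PERIOD": "2023", "OBS_VALUE": "1.5"},
    {"FREQ": "A", "REF_AREA": "FR", "TIME_PERIOD": "2022", "OBS_VALUE": "2.0"},
]
series = group_into_series(rows, ["FREQ", "REF_AREA", "TIME_PERIOD"], "TIME_PERIOD")
for key, observations in series.items():
    print(key, observations)
```

With TIME_PERIOD as the dimension at observation, the three rows collapse into two series, one per combination of FREQ and REF_AREA.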