Validate your data

In this tutorial, we’ll explore how pysdmx facilitates data validation in a metadata-driven approach, relying solely on the metadata stored in an SDMX Registry.

There are various types of validation; this tutorial focuses on structural validation, which checks that the data match the structure described in the metadata (expected components, data types, and allowed values).

Required metadata

For this scenario, the necessary metadata depends on the desired thoroughness of validation. At a minimum, we need the data structure information. However, for more comprehensive validation, we may consider additional constraints from the dataflow or provision agreement.

Data Structure

A data structure describes the expected structure of data, including component types, data types, and whether components are mandatory. If components are coded, the allowed values are also specified. This is the minimum required for structural validation.

Dataflows

Dataflows allow defining one or more sets of data sharing the same data structure. For example, if we have a data structure about locational banking statistics, we might want to define a dataflow representing the locational banking statistics by country (residence) and another dataflow representing the locational banking statistics by nationality. If we have a data structure representing bilateral foreign exchange reference rates, we might want to create a dataflow for the subset of exchange rates published on a website on a daily basis.

Expanding on this last example, we could define this subset of data using constraints, for instance by setting the frequency dimension to “daily” and restricting the currency codes to those published on a daily basis (e.g. CHF, CNY, EUR, JPY, USD). We would then “attach” these constraints to the dataflow. Taking these additional constraints into account makes the validation stricter.
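To make the effect of a constraint concrete, here is a minimal sketch in plain Python. The code sets and the effective_codes helper are illustrative assumptions, not pysdmx objects, and the sketch only models constraints that list the allowed values. In practice, the Registry performs this computation when it builds a schema.

```python
# Codes allowed by the data structure (illustrative values, not a real codelist).
dsd_codes = {
    "FREQ": {"A", "M", "D"},
    "CURRENCY": {"CHF", "CNY", "EUR", "JPY", "USD", "XAU"},
}

# Constraint attached to a hypothetical "daily rates" dataflow.
dataflow_constraint = {
    "FREQ": {"D"},
    "CURRENCY": {"CHF", "CNY", "EUR", "JPY", "USD"},
}


def effective_codes(structure, constraint):
    """Intersect the structure's codes with the constraint, per component."""
    return {
        comp: codes & constraint.get(comp, codes)
        for comp, codes in structure.items()
    }


allowed = effective_codes(dsd_codes, dataflow_constraint)
print(sorted(allowed["FREQ"]))       # ['D']
print("XAU" in allowed["CURRENCY"])  # False
```

Components without a constraint keep all the codes of the data structure, which is why validating against the data structure alone is the most permissive option.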

Provisioning metadata

Provision agreements and data providers indicate which providers supply data for a dataflow. Constraints, such as expecting a provider to supply data only for its own country, can be applied.

In summary, the following SDMX artifacts need to be in the SDMX Registry: AgencyScheme, Codelist, ConceptScheme, and DataStructure. For more thorough validation, additional metadata such as DataConstraint, Dataflow, DataProviderScheme, and ProvisionAgreement is needed.

For additional information about the various types of SDMX artifacts, please refer to the SDMX documentation.

Step-by-step solution

pysdmx allows retrieving metadata from an SDMX Registry in either a synchronous (via pysdmx.fmr.RegistryClient) or asynchronous fashion (via pysdmx.fmr.AsyncRegistryClient). Which one to use depends on the use case (and taste), but we tend to use the asynchronous client by default, as it is non-blocking.

Connecting to a Registry

First, we will need an instance of the client, so that we can connect to our target Registry. When instantiating the client, we need to pass the SDMX-REST endpoint of the Registry. If we use the FMR, i.e. the reference implementation of the SDMX Registry specification, the endpoint will be the URL at which the FMR is available, followed by /sdmx/v2/.

from pysdmx.fmr import AsyncRegistryClient
gr = AsyncRegistryClient("https://registry.sdmx.org/sdmx/v2/")

Retrieving the schema information

For this tutorial, we want to validate data received for the EDUCAT_CLASS_A dataflow maintained by the UNESCO Institute for Statistics (UIS), as published on the SDMX Global Registry.

This dataflow is based on the UOE_NON_FINANCE data structure. If we view the data structure using the Global Registry user interface, we will see that many values are allowed for most coded components. For example, at the time of writing, more than 80 codes are allowed for the AGE component. However, if we look at the dataflow, we will see that only one value is allowed (_T). This is because constraints (CR_EDUCAT_ALL and CR_EDUCAT_CLASS_A in this particular case) have been applied to the dataflow.

An SDMX-REST schema query can be used to retrieve what is allowed within the context of a data structure, a dataflow, or a provision agreement. The SDMX Registry then uses all available information (e.g., constraints) to return a “schema” describing the “data validity rules” for the selected context. We will use this to retrieve the metadata we need to validate our data.

As we want to make the validation as strict as possible, we want to consider all available constraints. No information about data providers or provision agreements is available for the selected dataflow in the Global Registry at the time of writing, and so we will use the next available context, i.e. dataflow.

schema = await gr.get_schema("dataflow", "UIS", "EDUCAT_CLASS_A", "1.0")
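The await keyword is only valid inside a coroutine, or in an environment that supports top-level await, such as a notebook. In a regular script, the call could be wrapped in a coroutine and run with asyncio.run. The sketch below shows the pattern with a stand-in client, StubRegistryClient, which is purely hypothetical and returns a canned dictionary so that the example runs without network access. With pysdmx, the AsyncRegistryClient instance created earlier would take its place.

```python
import asyncio


class StubRegistryClient:
    """Hypothetical stand-in for an asynchronous registry client."""

    async def get_schema(self, context, agency, resource_id, version):
        await asyncio.sleep(0)  # pretend to wait for the network
        return {"context": context, "agency": agency,
                "id": resource_id, "version": version}


async def fetch_schema(client):
    # Same arguments as in the tutorial's call.
    return await client.get_schema("dataflow", "UIS", "EDUCAT_CLASS_A", "1.0")


# asyncio.run creates an event loop, runs the coroutine, and closes the loop.
schema_info = asyncio.run(fetch_schema(StubRegistryClient()))
print(schema_info["id"])  # EDUCAT_CLASS_A
```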

Validating data

Many different types of checks can be implemented, and covering them all goes beyond the scope of this tutorial. However, some common validation checks are described below. For this tutorial, we will assume that the data was provided in SDMX-CSV, and that the data file was read using Python's csv.DictReader. This means that it is possible to iterate over the content of the file one row at a time, and that every row is represented as a dictionary, with the column header as key and the cell content as value.

For example, to get the value for the AGE dimension:

for row in reader:
    age = row["AGE"]
    print(age)
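The loop above assumes that reader already exists. A minimal sketch of how it could be created is shown below. The sample rows are invented for illustration and use far fewer columns than the real data structure, and an in-memory string stands in for the data file.

```python
import csv
import io

# Invented SDMX-CSV content; a real file would be opened with open(path, newline="").
sample = (
    "STRUCTURE,STRUCTURE_ID,ACTION,REF_AREA,AGE,OBS_VALUE\n"
    "dataflow,UIS:EDUCAT_CLASS_A(1.0),I,CH,_T,19.2\n"
    "dataflow,UIS:EDUCAT_CLASS_A(1.0),I,FR,_T,\n"
)

reader = csv.DictReader(io.StringIO(sample))
rows = list(reader)  # materialise the rows so they can be checked several times

print(reader.fieldnames)
print(rows[0]["AGE"])              # _T
print(repr(rows[1]["OBS_VALUE"]))  # '' (an empty cell is an empty string, not None)
```

A DictReader is an iterator and can only be consumed once, so running several checks one after the other requires either storing the rows in a list, as done here, or combining the checks in a single loop.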

Validating the components

The first thing we might want to do is to check whether we find the expected columns in SDMX-CSV. Each column in the SDMX-CSV input should be either the ID of a component defined in the data structure or one of the special SDMX-CSV columns (STRUCTURE, STRUCTURE_ID, or ACTION).

sdmx_cols = ["STRUCTURE", "STRUCTURE_ID", "ACTION"]
components = [c.id for c in schema.components]
for col in reader.fieldnames:
    if col not in sdmx_cols and col not in components:
        raise ValueError(f"Found unexpected column: {col}")
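A possible variation on the check above reports every unexpected column in one pass instead of stopping at the first one. In this sketch, the component IDs and the header are hard-coded for illustration; with pysdmx they would come from schema.components and reader.fieldnames. The unexpected_columns helper is a hypothetical name.

```python
SDMX_COLS = {"STRUCTURE", "STRUCTURE_ID", "ACTION"}


def unexpected_columns(fieldnames, component_ids):
    """Return the columns that are neither components nor SDMX-CSV columns."""
    known = SDMX_COLS | set(component_ids)
    return [col for col in fieldnames if col not in known]


# Illustrative values only.
header = ["STRUCTURE", "STRUCTURE_ID", "REF_AREA", "AGE", "OBS_VALUE", "NOTES"]
component_ids = ["REF_AREA", "AGE", "SEX", "OBS_VALUE"]

print(unexpected_columns(header, component_ids))  # ['NOTES']
```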

Validating the data type

pysdmx returns the expected data type for each of the components in a data structure. CSV treats everything as a string, but the information provided by pysdmx can be used to attempt a type cast (or a similar check) and to catch any error raised in the process.

The exact code will depend on the library used. While the Python interpreter only supports a few generic types, other Python libraries (like numpy, pandas, or pyarrow) offer more options. Covering them all goes beyond the scope of this tutorial, but the code below should be sufficient to give an idea.

from pysdmx.model import DataType

for row in reader:
    for comp, value in row.items():
        if comp in sdmx_cols:
            continue  # STRUCTURE, STRUCTURE_ID and ACTION are not components
        data_type = schema.components[comp].dtype
        if data_type in [DataType.DOUBLE, DataType.FLOAT]:
            try:
                float(value)
            except ValueError:
                raise TypeError(f"{value} for component {comp} is not a valid {data_type}")
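One way to extend this idea to several data types is a lookup table that maps each type to a parsing function. The sketch below is self-contained: the type names are plain strings chosen for illustration rather than DataType members, and the set of supported types and the accepted boolean spellings are assumptions.

```python
from datetime import date


def parse_boolean(value):
    if value not in ("true", "false"):
        raise ValueError(f"{value!r} is not a boolean")
    return value == "true"


# Hypothetical type names mapped to the function used to parse a CSV cell.
PARSERS = {
    "double": float,
    "integer": int,
    "boolean": parse_boolean,
    "date": date.fromisoformat,
}


def is_valid(value, type_name):
    """Return True if the cell content can be parsed as the given type."""
    parser = PARSERS.get(type_name, str)  # unknown types are left as strings
    try:
        parser(value)
    except ValueError:
        return False
    return True


print(is_valid("19.2", "double"))      # True
print(is_valid("19.2", "integer"))     # False
print(is_valid("2024-02-30", "date"))  # False
```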

Validating with facets

SDMX allows defining so-called facets, which constrain the allowed values beyond the data type. For example, we can say that a component is a string, with a minimum length of 3 characters and a maximum length of 10. This information is available via the facets property.

print(schema.components["COMMENT_DSET"].facets)
max_length=1050

This information can of course be used for validation purposes:

for row in reader:
    for comp, value in row.items():
        if comp in sdmx_cols:
            continue  # STRUCTURE, STRUCTURE_ID and ACTION are not components
        facets = schema.components[comp].facets
        if facets and facets.max_length:
            if len(value) > facets.max_length:
                raise ValueError(f"The value for {comp} is longer than {facets.max_length} characters")
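The same pattern extends to other facets, such as the minimum length mentioned above. The sketch below uses a small stand-in class, LengthFacets, so that it runs on its own. The min_length attribute is assumed to mirror max_length, which should be checked against the pysdmx documentation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LengthFacets:
    """Stand-in for the length-related facets of a component."""
    min_length: Optional[int] = None
    max_length: Optional[int] = None


def length_errors(comp, value, facets):
    """Return a list of messages describing length violations."""
    errors = []
    if facets.min_length is not None and len(value) < facets.min_length:
        errors.append(f"{comp}: shorter than {facets.min_length} characters")
    if facets.max_length is not None and len(value) > facets.max_length:
        errors.append(f"{comp}: longer than {facets.max_length} characters")
    return errors


facets = LengthFacets(min_length=3, max_length=10)
print(length_errors("COMMENT", "ok", facets))    # one "shorter than" message
print(length_errors("COMMENT", "fine", facets))  # []
```

Returning a list of messages rather than raising makes it possible to report all problems found in a file at once.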

Validating coded components

SDMX distinguishes between coded and uncoded components. The list of codes (defined either in a codelist or a valuelist) is available via the codes property:

coded_comp = {
    comp.id: [code.id for code in comp.codes]
    for comp in schema.components
    if comp.codes
}

for row in reader:
    for comp, value in row.items():
        if comp in coded_comp and value not in coded_comp[comp]:
            raise ValueError(f"{value} is not one of the expected codes for {comp}")
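When a file contains many rows, it can be more helpful to collect every invalid code together with its row number than to stop at the first error. The sketch below illustrates this with invented rows and invented allowed codes; with pysdmx, the allowed codes would be derived from the schema as shown above. The code_errors helper is a hypothetical name.

```python
# Illustrative allowed codes, stored as sets for fast membership tests.
allowed_codes = {"AGE": {"_T"}, "REF_AREA": {"CH", "FR"}}

rows = [
    {"REF_AREA": "CH", "AGE": "_T", "OBS_VALUE": "19.2"},
    {"REF_AREA": "XX", "AGE": "Y15", "OBS_VALUE": "20.1"},
]


def code_errors(rows, allowed_codes):
    """Collect (row number, component, value) for every invalid code."""
    errors = []
    for number, row in enumerate(rows, start=1):
        for comp, value in row.items():
            if comp in allowed_codes and value not in allowed_codes[comp]:
                errors.append((number, comp, value))
    return errors


print(code_errors(rows, allowed_codes))
# [(2, 'REF_AREA', 'XX'), (2, 'AGE', 'Y15')]
```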

Validating mandatory components

The data structure indicates whether a component is required. However, this check also requires taking the message action into account. After all, if the message only contains updates and revisions to previously provided data, and if the value of a mandatory component hasn’t changed, then, in principle, the value does not need to be sent again. However, assuming the check for mandatory components needs to run, the required property can be used:

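for row in reader:
    for comp, value in row.items():
        if comp in sdmx_cols:
            continue  # STRUCTURE, STRUCTURE_ID and ACTION are not components
        # DictReader yields "" for an empty cell (None only if the row is too short)
        if schema.components[comp].required and not value:
            raise ValueError(f"Value is missing for {comp}")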
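How the action affects this check is a policy decision. As a sketch, the function below only enforces mandatory components for rows whose action is in a configurable set. The default of {"R"}, the treatment of rows without an action as requiring a full check, and the action codes used in the sample rows are assumptions made for this example, not rules taken from the SDMX-CSV specification. The required component IDs are also hard-coded for illustration.

```python
def missing_mandatory(row, required_ids, full_check_actions=frozenset({"R"})):
    """Return the required components that have no value in this row.

    Rows whose ACTION is not in full_check_actions are skipped (assumed policy).
    """
    action = row.get("ACTION") or "R"  # assumption: no action means full check
    if action not in full_check_actions:
        return []
    # row.get returns None for an absent column and "" for an empty cell.
    return [comp for comp in required_ids if not row.get(comp)]


required_ids = ["REF_AREA", "AGE", "OBS_VALUE"]  # illustrative
complete = {"ACTION": "R", "REF_AREA": "CH", "AGE": "_T", "OBS_VALUE": "19.2"}
partial = {"ACTION": "R", "REF_AREA": "CH", "AGE": "", "OBS_VALUE": "19.2"}
update = {"ACTION": "A", "REF_AREA": "CH", "OBS_VALUE": "19.5"}

print(missing_mandatory(complete, required_ids))  # []
print(missing_mandatory(partial, required_ids))   # ['AGE']
print(missing_mandatory(update, required_ids))    # []
```

Because the function iterates over the required components rather than over the columns of the row, it also catches a mandatory component whose column is absent from the file.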

Summary

In this tutorial, we created a client to retrieve metadata from the SDMX Global Registry. We used the get_schema method to obtain the metadata necessary to validate data for the “EDUCAT_CLASS_A” dataflow by the UNESCO Institute for Statistics.

While this tutorial covers fundamental validation checks, there are many more aspects to consider when validating SDMX messages. Nonetheless, it provides a solid foundation for using pysdmx to write Python validation code for SDMX messages.