.. _validate:
Validate your data
==================
In this tutorial, we'll explore how ``pysdmx`` facilitates **data
validation** in a metadata-driven approach, relying solely on the metadata
stored in an SDMX Registry.
There are various types of validation, and we'll focus on **structural
validation** in this scenario. Structural validation ensures that the
structure of data meets the expectations.
Required metadata
-----------------
For this scenario, the necessary metadata depends on the desired
thoroughness of validation. At a minimum, we need the **data structure**
information. However, for more comprehensive validation, we may consider
additional constraints from the **dataflow** or **provision agreement**.
Data Structure
^^^^^^^^^^^^^^
A data structure describes the expected structure of data, including component
types, data types, and whether components are mandatory. If components are
**coded**, the allowed values are also specified. This is the minimum
required for structural validation.
Dataflows
^^^^^^^^^
Dataflows allow defining one or more sets of data sharing the same data
structure. For example, if we have a data structure about locational banking
statistics, we might want to define a dataflow representing the locational
banking statistics by country (residence) and another dataflow representing
the locational banking statistics by nationality. If we have a data structure
representing bilateral foreign exchange reference rates, we might want to
create a dataflow for the subset of exchange rates published on a website on
a daily basis.
Expanding on this last example, we could define this subset of data using
constraints, i.e. setting the frequency dimension to “daily” and the currency
codes to the subset of codes that are published on a daily basis (e.g. CHF,
CNY, EUR, JPY, USD, etc.) and we would “attach” these constraints to the
dataflow. Taking these additional constraints into account makes the
validation more strict.
Provisioning metadata
^^^^^^^^^^^^^^^^^^^^^
Provision agreements and data providers indicate which providers supply data
for a dataflow. Constraints, such as expecting a provider to supply data only
for its own country, can be applied.
In summary, the following SDMX artifacts need to be in the SDMX Registry:
**AgencyScheme**, **Codelist**, **ConceptScheme**, and **Data Structure**.
For more thorough validation, additional metadata such as **Data Constraint**,
**Dataflow**, **DataProviderScheme**, and **ProvisionAgreement** is needed.
For additional information about the various types of SDMX artifacts, please
refer to the `SDMX documentation `_.
Step-by-step solution
---------------------
``pysdmx`` allows retrieving metadata from an SDMX Registry in either a
synchronous (via ``pysdmx.api.fmr.RegistryClient``) or asynchronous fashion
(via ``pysdmx.api.fmr.AsyncRegistryClient``). Which one to use depends on the
use case (and taste), but we tend to use the asynchronous client by default,
as it is non-blocking.
Connecting to a Registry
^^^^^^^^^^^^^^^^^^^^^^^^
First, we will need an instance of the client, so that we can connect to our
target Registry. When instantiating the client, we need to pass the SDMX-REST
endpoint of the Registry. If we use the
`FMR `_, i.e. the
reference implementation of the SDMX Registry specification, the endpoint
will be the URL at which the FMR is available, followed by ``/sdmx/v2/``.
.. code-block:: python

    from pysdmx.api.fmr import AsyncRegistryClient

    gr = AsyncRegistryClient("https://registry.sdmx.org/sdmx/v2/")
Retrieving the schema information
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
For this tutorial, we want to validate data received for the ``EDUCAT_CLASS_A``
dataflow maintained by UNESCO Institute for Statistics (``UIS``), as published
on the `SDMX Global Registry `_.
This dataflow is based on the ``UOE_NON_FINANCE`` data structure. If we view
the data structure using the Global Registry user interface, we will see that
many values are allowed for most coded components. For example, at the time
of writing, more than 80 codes are allowed for the ``AGE`` component. However,
if we look at the dataflow, we will see that only one value is allowed (``_T``).
This is because constraints (``CR_EDUCAT_ALL`` and ``CR_EDUCAT_CLASS_A`` in
this particular case) have been applied to the dataflow.
An SDMX-REST ``schema`` query can be used to retrieve what is allowed within
the context of a data structure, a dataflow, or a provision agreement. The
SDMX Registry then uses all available information (e.g., constraints) to return
a "schema" describing the "data validity rules" for the selected context. We
will use this to retrieve the metadata we need to validate our data.
As we want to make the validation as strict as possible, we want to consider
all available constraints. No information about data providers or provision
agreements is available for the selected dataflow in the Global Registry at the
time of writing, and so we will use the next available context, i.e. **dataflow**.
.. code-block:: python

    schema = await gr.get_schema("dataflow", "UIS", "EDUCAT_CLASS_A", "1.0")
Validating data
^^^^^^^^^^^^^^^
Many different types of checks can be implemented, and covering them all goes
beyond the scope of this tutorial. However, some common validation checks are
described below. For this tutorial, we will assume that the data was provided
in SDMX-CSV, and that the data file was read using Python CSV ``DictReader``.
This means we can iterate over the content of the file one row at a time, with
every row represented as a dictionary mapping each column header to the
corresponding cell value.
For example, to get the value of the ``AGE`` dimension:

.. code-block:: python

    for row in reader:
        age = row["AGE"]
        print(age)
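If no data file is at hand, the same iteration pattern can be tried against an
in-memory sample using only the standard library. The snippet below is a
minimal sketch; the values in the fragment are made up for illustration:

.. code-block:: python

    import csv
    import io

    # A made-up SDMX-CSV fragment: the first three columns are the special
    # SDMX-CSV columns, the remaining ones are component IDs.
    sample = (
        "STRUCTURE,STRUCTURE_ID,ACTION,AGE,OBS_VALUE\n"
        "dataflow,UIS:EDUCAT_CLASS_A(1.0),I,_T,42.0\n"
    )

    reader = csv.DictReader(io.StringIO(sample))
    for row in reader:
        print(row["AGE"], row["OBS_VALUE"])

Note that ``DictReader`` treats every cell as a string; type checks come later.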
Validating the components
"""""""""""""""""""""""""
The first thing we might want to do is to check whether we find the expected
columns in SDMX-CSV. Each column in the SDMX-CSV input should be either the
ID of a component defined in the data structure or one of the special SDMX-CSV
columns (``STRUCTURE``, ``STRUCTURE_ID``, or ``ACTION``).
.. code-block:: python

    sdmx_cols = ["STRUCTURE", "STRUCTURE_ID", "ACTION"]
    components = [c.id for c in schema.components]
    for col in reader.fieldnames:
        if col not in sdmx_cols and col not in components:
            raise ValueError(f"Found unexpected column: {col}")
Validating the data type
""""""""""""""""""""""""
``pysdmx`` returns the expected data type for each of the components in a data
structure. CSV treats everything as a string but the information provided by
``pysdmx`` may be used to attempt a type casting (or similar checks) and check
for errors reported in the process.
The exact code will depend on the library used. While the Python interpreter
only supports a few generic types, other Python libraries (like numpy, pandas,
or pyarrow) offer more options. Covering them all goes beyond the scope
of this tutorial, but the code below should be sufficient to give an idea.
.. code-block:: python

    from pysdmx.model import DataType

    sdmx_cols = ["STRUCTURE", "STRUCTURE_ID", "ACTION"]
    for row in reader:
        for comp, value in row.items():
            # Skip the special SDMX-CSV columns, which are not components.
            if comp in sdmx_cols:
                continue
            data_type = schema.components[comp].dtype
            if data_type in [DataType.DOUBLE, DataType.FLOAT]:
                try:
                    float(value)
                except ValueError:
                    raise TypeError(
                        f"{value} for component {comp} "
                        f"is not a valid {data_type}"
                    )
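The same pattern can be extended to other types. The helper below is an
illustrative, stand-alone sketch that does not depend on ``pysdmx``: the type
names loosely mirror SDMX data types but are stand-ins, not the ``DataType``
enumeration.

.. code-block:: python

    from datetime import date

    # Illustrative casting checks keyed by a stand-in type name.
    _CASTS = {
        "Double": float,
        "Integer": int,
        "Date": date.fromisoformat,
    }

    def is_valid(value: str, type_name: str) -> bool:
        """Return True if value can be cast to the given (stand-in) type."""
        cast = _CASTS.get(type_name)
        if cast is None:
            return True  # No check available: accept the value as-is.
        try:
            cast(value)
            return True
        except ValueError:
            return False

    print(is_valid("3.14", "Double"))   # True
    print(is_valid("abc", "Integer"))   # False

A real implementation would map each ``DataType`` member to such a check and
report all casting failures, rather than accepting unknown types silently.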
Validating with facets
""""""""""""""""""""""
SDMX allows defining so-called **facets**, which provide additional
constraints on top of the data type. For example, we can say that a
component is a string, with a minimum length of 3 characters and a maximum
length of 10. This information is available via the ``facets`` property.
.. code-block:: python

    print(schema.components["COMMENT_DSET"].facets)
    # max_length=1050
This information can of course be used for validation purposes:
.. code-block:: python

    sdmx_cols = ["STRUCTURE", "STRUCTURE_ID", "ACTION"]
    for row in reader:
        for comp, value in row.items():
            # Skip the special SDMX-CSV columns, which are not components.
            if comp in sdmx_cols:
                continue
            facets = schema.components[comp].facets
            if facets and facets.max_length and len(value) > facets.max_length:
                raise ValueError(
                    f"The value for {comp} is longer than "
                    f"{facets.max_length} characters"
                )
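Facets may carry other constraints as well, such as a minimum length. The
snippet below sketches a reusable length check using a local stand-in for the
facets object (``Facets`` here is a small dataclass defined for illustration,
not the ``pysdmx`` class):

.. code-block:: python

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Facets:
        """Local stand-in mirroring a subset of facet attributes."""
        min_length: Optional[int] = None
        max_length: Optional[int] = None

    def check_length(comp: str, value: str, facets: Optional[Facets]) -> None:
        """Raise ValueError if the value violates the length facets."""
        if facets is None:
            return
        if facets.min_length is not None and len(value) < facets.min_length:
            raise ValueError(
                f"The value for {comp} is shorter than "
                f"{facets.min_length} characters"
            )
        if facets.max_length is not None and len(value) > facets.max_length:
            raise ValueError(
                f"The value for {comp} is longer than "
                f"{facets.max_length} characters"
            )

    check_length("COMMENT_DSET", "A short comment", Facets(max_length=1050))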
Validating coded components
""""""""""""""""""""""""""""
SDMX distinguishes between **coded** and **uncoded** components. The list of
codes (defined either in a codelist or a valuelist) is available via the
``codes`` property:
.. code-block:: python
coded_comp = {
comp.id: [code.id for code in comp.codes]
for comp in schema.components
if comp.codes
}
for row in reader:
for comp, value in row.items():
if comp in coded_comp and value not in coded_comp[comp]:
raise ValueError(f"{value} is not one of the expected codes for {comp}")
Validating mandatory components
""""""""""""""""""""""""""""""""
The data structure indicates whether a component is required. However, this
check also requires taking the message action into account. After all, if the
message only contains updates and revisions to previously provided data, and
if the value of a mandatory component hasn't changed, then, in principle, the
value does not need to be sent again. However, assuming the check for
mandatory components needs to run, the ``required`` property can be used:
.. code-block:: python

    sdmx_cols = ["STRUCTURE", "STRUCTURE_ID", "ACTION"]
    for row in reader:
        for comp, value in row.items():
            # Skip the special SDMX-CSV columns, which are not components.
            if comp in sdmx_cols:
                continue
            # DictReader yields empty strings (not None) for empty cells.
            if schema.components[comp].required and not value:
                raise ValueError(f"Value is missing for {comp}")
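In practice, it is often preferable to collect all issues found in a message
rather than stop at the first one. The sketch below shows that pattern,
independently of ``pysdmx``; the ``check_obs_value`` function is a
hypothetical check introduced only for illustration, and checks are assumed
to signal problems by raising ``ValueError``:

.. code-block:: python

    from typing import Callable, Dict, List

    def collect_errors(
        rows: List[Dict[str, str]],
        checks: List[Callable[[Dict[str, str]], None]],
    ) -> List[str]:
        """Run every check on every row, collecting error messages."""
        errors = []
        for i, row in enumerate(rows, start=1):
            for check in checks:
                try:
                    check(row)
                except ValueError as err:
                    errors.append(f"Row {i}: {err}")
        return errors

    # Hypothetical check: OBS_VALUE must be present.
    def check_obs_value(row: Dict[str, str]) -> None:
        if not row.get("OBS_VALUE"):
            raise ValueError("Value is missing for OBS_VALUE")

    errors = collect_errors(
        [{"OBS_VALUE": "42"}, {"OBS_VALUE": ""}],
        [check_obs_value],
    )
    print(errors)  # ["Row 2: Value is missing for OBS_VALUE"]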
Summary
-------
In this tutorial, we created a client to retrieve metadata from the SDMX
Global Registry. We used the ``get_schema`` method to obtain the metadata
necessary to validate data for the "EDUCAT_CLASS_A" dataflow by the UNESCO
Institute for Statistics.
While this tutorial covers fundamental validation checks, there are many more
aspects to consider when validating SDMX messages. Nonetheless, it provides
a solid foundation for using ``pysdmx`` to write Python validation code for
SDMX messages.