Validate data using VTL
In this tutorial, we use pysdmx to read data and metadata and build a dataset and a VTL script, and the vtlengine library to execute that script.
Note
This tutorial assumes a basic understanding of SDMX and VTL concepts. If you are new to these topics, please refer to the VTL documentation and the SDMX-VTL documentation.
Important
To use the VTL functionalities, you need to install the pysdmx[vtl] extra.
This tutorial requires the pysdmx[data] extra to handle SDMX datasets as Pandas DataFrames, and the pysdmx[xml] extra to read and write SDMX-ML messages.
Check the installation guide for more information.
Many kinds of operations are possible; this tutorial covers only the basic ones.
Step-by-Step Solution
Using pysdmx, we will read the datasets, their structures and the VTL objects. This tutorial uses three XML files: structures.xml (data structure), data.xml (data) and vtl_ts.xml (Transformation and VTLMapping).
Files used in the example can be found here:
Reading Data and Structures messages
The first step is to read the data structure and the data from the SDMX files. The following snippet sets up the paths to both files:
from pathlib import Path
# Path to the structures file (same directory as this script)
path_to_structures = Path(__file__).parent / "structures.xml"
# Path to the data file (same directory as this script)
path_to_data = Path(__file__).parent / "data.xml"
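Note that `__file__` is only defined when the code runs as a script or module, so the paths above raise a NameError in a notebook or interactive session. A minimal sketch of a fallback, assuming the XML files sit in the current working directory in that case (the helper name `tutorial_dir` is hypothetical and not part of pysdmx):

```python
from pathlib import Path


def tutorial_dir() -> Path:
    """Return the directory expected to hold the tutorial XML files."""
    try:
        # Defined when the code runs as a script or an imported module.
        return Path(__file__).resolve().parent
    except NameError:
        # Notebooks and the interactive interpreter do not define __file__.
        return Path.cwd()


base_dir = tutorial_dir()
path_to_structures = base_dir / "structures.xml"
path_to_data = base_dir / "data.xml"

# Report missing inputs early instead of hitting a parser error later on.
missing = [p.name for p in (path_to_structures, path_to_data) if not p.exists()]
if missing:
    print(f"Missing input files in {base_dir}: {', '.join(missing)}")
```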
Now that we have the paths to the files, we can read the data together with its structure and extract the datasets:
from pysdmx.io import get_datasets
# With the data and metadata path we extract the datasets with their related structures
datasets = get_datasets(path_to_data, path_to_structures)
Important
Check the Get Datasets method docs for more information on how to generate a PandasDataset with both data and related structures.
This method is the recommended way to read SDMX data and structures, as it combines them in a single Pandas Dataset, allowing you to work with the data and its structure seamlessly.
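Before running any VTL, it can help to check what was actually loaded. The sketch below summarises each dataset. It assumes that every item exposes a `data` attribute (the Pandas DataFrame) and a `structure` attribute that is either an object with an `id` or a plain string reference; the helper name `summarize_datasets` is hypothetical, and stand-in objects are used so that the snippet runs without the tutorial files.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List, Tuple


def summarize_datasets(datasets: Iterable[Any]) -> List[Tuple[str, int]]:
    """Return (structure label, number of rows) for each dataset."""
    summary = []
    for ds in datasets:
        structure = getattr(ds, "structure", None)
        # Assumption: a resolved structure has an `id`; a string reference is used as-is.
        label = str(getattr(structure, "id", structure))
        # Assumption: `data` supports len(), as a pandas DataFrame does.
        summary.append((label, len(ds.data)))
    return summary


# Stand-in for the list returned by get_datasets (illustrative values only).
demo_datasets = [
    SimpleNamespace(
        structure=SimpleNamespace(id="TEST_DF"),
        data=[{"OBS_VALUE": 1.0}, {"OBS_VALUE": 2.0}],
    )
]
print(summarize_datasets(demo_datasets))
# [('TEST_DF', 2)]
```

With the real objects, the call would be `summarize_datasets(datasets)`.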
Getting the Transformation Scheme and VTL Mapping
For the next step, there are three options: read the transformation scheme and VTL mapping from a file, read them from a Fusion Registry URL, or create the pysdmx model objects directly. The snippet below reads them from a file:
from pysdmx.io import read_sdmx
from pathlib import Path
# Path to the transformation file
path_to_vtl_ts = Path(__file__).parent / "vtl_ts.xml"
# Read the transformation file with read_sdmx
message = read_sdmx(path_to_vtl_ts)
# Get the Transformation Schemes
ts = message.get_transformation_schemes()[0]
# Get the VTL Mapping Scheme
mapping_scheme = message.get_vtl_mapping_schemes()[0]
# Get the VTL Dataflow Mapping from the items, assuming the first item is the one we want
dataflow_mapping = mapping_scheme.items[0]
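Taking the first item works here because the example file is assumed to hold a single mapping. When a scheme contains several, selecting by alias is less fragile. A minimal sketch, assuming each dataflow mapping exposes a `dataflow_alias` attribute as VtlDataflowMapping does (`find_mapping` is a hypothetical helper, not part of pysdmx, and the stand-in scheme exists only so the snippet runs on its own):

```python
from types import SimpleNamespace
from typing import Any


def find_mapping(mapping_scheme: Any, alias: str) -> Any:
    """Return the first item in the scheme whose dataflow_alias matches."""
    for item in mapping_scheme.items:
        # Other mapping types may lack the attribute, hence getattr with a default.
        if getattr(item, "dataflow_alias", None) == alias:
            return item
    raise LookupError(f"No dataflow mapping with alias {alias!r}")


# Stand-in for a VtlMappingScheme with two items (illustrative values only).
demo_scheme = SimpleNamespace(
    items=[
        SimpleNamespace(id="VTL_MAP_0", dataflow_alias="DS_0"),
        SimpleNamespace(id="VTL_MAP_1", dataflow_alias="DS_1"),
    ]
)
print(find_mapping(demo_scheme, "DS_1").id)
# VTL_MAP_1
```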
Alternatively, we can create the Transformation Scheme and VTL Mapping objects directly in code:
from pysdmx.model import (
    DataflowRef,
    Transformation,
    TransformationScheme,
    VtlDataflowMapping,
    VtlMappingScheme,
)
# Mapping using VtlDataflowMapping object:
dataflow_mapping = VtlDataflowMapping(
    dataflow=DataflowRef(agency="MD", id="TEST_DF", version="1.0"),
    dataflow_alias="DS_1",
    id="VTL_MAP_1",
    name="VTL Mapping 1",
)
mapping_scheme = VtlMappingScheme(
    id="VTL_MAP_SCHEME_1",
    name="VTL Mapping Scheme 1",
    version="1.0",
    agency="MD",
    items=[dataflow_mapping],
)
# Transformation Scheme object
ts = TransformationScheme(
    id="TS1",
    version="1.0",
    agency="MD",
    vtl_version="2.1",
    name="Transformation Scheme 1",
    items=[
        Transformation(
            id="T1",
            uri=None,
            urn=None,
            name="Transformation 1",
            description=None,
            expression="DS_1 [calc Me_4 := OBS_VALUE]",
            is_persistent=True,
            result="DS_r",
            annotations=(),
        ),
    ],
    vtl_mapping_scheme=mapping_scheme,
)
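Each Transformation carries one VTL statement: `result` is the left-hand side, `expression` is the right-hand side, and `is_persistent` selects between VTL's persistent assignment (`<-`) and its temporary assignment (`:=`). The sketch below renders that text for inspection. It is an illustration only, not how pysdmx or vtlengine build the script, and `render_vtl` is a hypothetical helper; a stand-in object is used so that the snippet runs without pysdmx installed.

```python
from types import SimpleNamespace
from typing import Any, Iterable


def render_vtl(transformations: Iterable[Any]) -> str:
    """Render transformations as VTL statements, one per line."""
    lines = []
    for t in transformations:
        # Persistent results use "<-"; temporary ones use ":=".
        operator = "<-" if t.is_persistent else ":="
        lines.append(f"{t.result} {operator} {t.expression};")
    return "\n".join(lines)


# Stand-in mirroring the fields of the Transformation defined in this tutorial.
demo_transformations = [
    SimpleNamespace(
        result="DS_r",
        expression="DS_1 [calc Me_4 := OBS_VALUE]",
        is_persistent=True,
    )
]
print(render_vtl(demo_transformations))
# DS_r <- DS_1 [calc Me_4 := OBS_VALUE];
```

With the scheme object from this tutorial, `render_vtl(ts.items)` would be expected to give the same line.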
You can also download the structures directly from the FMR or an SDMX API.
At this point you may use the VTL Toolkit Model validations to validate the Transformation Scheme.
Running the VTL Script
Now that we have the VTL script, we can run it with the vtlengine.run_sdmx function.
from vtlengine import run_sdmx
# Run the VTL script with the datasets and the dataflow mapping
run_sdmx(script=ts, datasets=datasets, mappings=dataflow_mapping)
The run_sdmx function executes the Transformation Scheme (VTL script) using the provided datasets and dataflow mapping.
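To inspect what the run produced, its return value has to be captured. A hedged sketch, assuming run_sdmx returns a dictionary keyed by result name whose dataset values expose a `data` attribute holding a Pandas DataFrame (`describe_results` is a hypothetical helper, and the stand-in result exists only so the snippet runs without vtlengine installed):

```python
from types import SimpleNamespace
from typing import Any, Dict, Mapping, Optional


def describe_results(results: Mapping[str, Any]) -> Dict[str, Optional[int]]:
    """Map each result name to its row count, or None when no data is attached."""
    summary: Dict[str, Optional[int]] = {}
    for name, result in results.items():
        # Assumption: dataset results carry their rows in a `data` attribute.
        data = getattr(result, "data", None)
        summary[name] = None if data is None else len(data)
    return summary


# Stand-in for the mapping returned by run_sdmx (illustrative values only).
demo_results = {"DS_r": SimpleNamespace(data=[{"OBS_VALUE": 1.0, "Me_4": 1.0}])}
print(describe_results(demo_results))
# {'DS_r': 1}
```

In the tutorial, this would be called as `describe_results(run_sdmx(script=ts, datasets=datasets, mappings=dataflow_mapping))`.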
Summary
In this tutorial, we have learned how to read SDMX data and metadata using pysdmx, extract the Pandas Datasets, and run a VTL script using the vtlengine.run_sdmx method.
Useful additional links: