Generate the Filesystem Layout
In this tutorial, we learn how pysdmx aids in generating the filesystem
structure in a metadata-driven fashion, relying solely on metadata stored in
an SDMX Registry.
What we want is to store data in folders organized by dataflows. In each dataflow folder, we create sub-folders by data providers. Access to folders should be granted via appropriate roles with access requests approved by the manager of the organizational unit owning the dataflow.
Required Metadata
For this use case, we need the following metadata in our SDMX Registry:
- Dataflows
Define the first level of the filesystem. Dataflows, related artifacts, and provisioning metadata are needed to create roles for data access.
- Provisioning Metadata
Provision agreements and data providers indicate which providers supply data for a dataflow, defining the second level of the filesystem.
- Agencies
Define the organizational unit owning the data. Contacts associated with agencies define the person in charge of approving (or denying) requests to access the data.
- Category Schemes
Define the dataflows to be considered when creating the filesystem structure. Dataflows are attached to categories of the category scheme via categorizations.
Step-by-step Solution
pysdmx allows retrieving metadata from an SDMX Registry either synchronously
(via pysdmx.fmr.RegistryClient) or asynchronously
(via pysdmx.fmr.AsyncRegistryClient).
Connecting to a Registry
First, we need a client instance to connect to our target Registry.
When instantiating the client, we pass the SDMX-REST endpoint of our Registry.
If using the FMR, the endpoint is the URL at which the FMR is available,
followed by /sdmx/v2/.
from pysdmx.fmr import AsyncRegistryClient
client = AsyncRegistryClient("[endpoint_comes_here]")
Creating the Dataflow Folders
Once we have a client, we use it to get the list of dataflows to consider when creating the filesystem. This information is captured in a category scheme and related categorizations.
cs = await client.get_categories("MY_AGENCY", "MY_DATAFLOWS")
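The calls in this tutorial use await at the top level, which works in a notebook or an asyncio-aware REPL. In a plain script, they would need to run inside a coroutine. The following is a minimal sketch of that pattern; StubClient is a hypothetical stand-in (not part of pysdmx) so that the example runs without a Registry, and the dictionary it returns is made up for illustration.

```python
import asyncio


class StubClient:
    """Hypothetical stand-in for the Registry client; NOT part of pysdmx."""

    async def get_categories(self, agency, scheme_id):
        await asyncio.sleep(0)  # pretend to wait for the network
        return {"agency": agency, "id": scheme_id}


async def main(client):
    # Inside a coroutine, await works exactly as in the tutorial snippets
    cs = await client.get_categories("MY_AGENCY", "MY_DATAFLOWS")
    return cs


# asyncio.run starts an event loop, runs main to completion, then closes the loop
result = asyncio.run(main(StubClient()))
```

With the real client, the same main coroutine could hold all the steps of this tutorial, with the AsyncRegistryClient instance passed in place of the stub.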
We could now iterate over the categories (and their sub-categories, if any) to
find the dataflows attached to them. However, there is a convenience property,
dataflows, that we can use to get a set with the dataflows attached at any
level. As this is a set, we do not need to worry about duplicates: each
dataflow appears only once, regardless of how many categories it is attached
to. We can then iterate over the set and use each dataflow ID to create a
folder, using os.mkdir.
import os

for flow in cs.dataflows:
    os.mkdir(flow.id)
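Note that os.mkdir raises FileExistsError when the folder already exists, so running the loop above a second time would fail. A minimal sketch of a rerunnable variant follows, assuming the folders live under a dedicated root directory. The create_flow_folders helper, the root argument, and the sample IDs are illustrative names, not part of pysdmx.

```python
import os
import tempfile


def create_flow_folders(root, flow_ids):
    """Create one folder per dataflow ID under root; safe to run repeatedly."""
    created = []
    for flow_id in sorted(set(flow_ids)):
        path = os.path.join(root, flow_id)
        # exist_ok=True turns "already there" into a no-op instead of an error
        os.makedirs(path, exist_ok=True)
        created.append(path)
    return created


# Hypothetical dataflow IDs standing in for [flow.id for flow in cs.dataflows]
with tempfile.TemporaryDirectory() as root:
    create_flow_folders(root, ["FLOW_A", "FLOW_B"])
    create_flow_folders(root, ["FLOW_A", "FLOW_B"])  # second run does not fail
    flow_folders = sorted(os.listdir(root))
```

Being rerunnable matters here because the category scheme may change over time, and the script would typically be executed again to pick up new dataflows.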
Creating the Providers Folders
Next, we create the second level, i.e., one folder per provider of data for a
dataflow. We use the get_providers method to receive the providers along with
the dataflows each of them supplies data for. We then reorganize this
information to have the dataflow ID as the key and the set of providers for
that dataflow as the value.
from collections import defaultdict

flow_provs = defaultdict(set)
# The client is asynchronous, so the call must be awaited
providers = await client.get_providers("MY_AGENCY", True)
for prov in providers:
    for flow in prov.dataflows:
        flow_provs[flow.id].add(prov.id)
Now iterate over the items of the flow_provs dictionary to create one folder
per provider in the set associated with each key. Check that the dataflow
folder exists before creating the provider folders.
for flow, providers in flow_provs.items():
    if os.path.exists(flow):
        for provider in providers:
            os.mkdir(f"{flow}/{provider}")
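The existence check above silently skips dataflows that have providers but no folder, for example because they are not attached to the category scheme. It can be useful to report those. Below is a sketch using pathlib, under the assumption that all folders live under a root directory. The create_provider_folders helper and the sample IDs are hypothetical and only serve the illustration.

```python
import tempfile
from pathlib import Path


def create_provider_folders(root, flow_provs):
    """Create <root>/<flow>/<provider> for every dataflow folder that exists.

    Returns the sorted IDs of dataflows skipped because they have no folder.
    """
    skipped = []
    for flow_id, provider_ids in flow_provs.items():
        flow_dir = Path(root) / flow_id
        if not flow_dir.is_dir():
            skipped.append(flow_id)
            continue
        for provider_id in sorted(provider_ids):
            # exist_ok=True keeps the function safe to run repeatedly
            (flow_dir / provider_id).mkdir(exist_ok=True)
    return sorted(skipped)


# Hypothetical IDs; in the tutorial, flow_provs is built from get_providers
sample = {"FLOW_A": {"PROV_1", "PROV_2"}, "FLOW_X": {"PROV_1"}}
with tempfile.TemporaryDirectory() as root:
    (Path(root) / "FLOW_A").mkdir()  # only FLOW_A exists from the first step
    skipped = create_provider_folders(root, sample)
    provider_folders = sorted(p.name for p in (Path(root) / "FLOW_A").iterdir())
```

Returning the skipped IDs instead of ignoring them makes it easier to spot provisioning metadata that refers to dataflows outside the category scheme.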
Creating the Roles
Creating roles in the target directory service is a crucial step, although details of this process depend on the specific service being used (e.g., OpenLDAP, Active Directory, etc.). The key information needed for role creation includes the role ID and the ID of the person (or group) responsible for granting access to the role.
The role ID and name can be constructed using information from the dataflow.
For example, the role ID might follow a convention like starting with an “R”,
followed by the system name, dataflow ID, and access type (e.g., RO for read-only
access vs. RW for read and write access). Let’s assume our application is
called MYAPP.
Another critical aspect is linking the role to its approver. To achieve this,
we leverage contacts associated with SDMX agencies. Agencies might have multiple
contacts, so we use the contact role to identify the person tasked with
approving access requests. While the contact information may include various
details (such as name, address, unit, telephone, email, etc.), we specifically
use the id property to capture the username of the user responsible for
approving requests.
Now, let’s dive into the implementation steps:
# Get extended information about the sub-agencies
agencies = await client.get_agencies("MY_AGENCY")

# Organize the agencies as a map for quick lookup
agency_map = {a.id: a for a in agencies}

# Assume that the role of the person approving access requests is "APPROVER"
for flow in cs.dataflows:
    for access in ["RO", "RW"]:
        # Fetch the contact responsible for approving access requests
        contact = [c for c in agency_map[flow.agency].contacts if c.role == "APPROVER"][0]
        # Construct role information
        role = {
            "id": f"R_MYAPP_{flow.id}_{access}",
            "name": f"{access} access to {flow.id} ({flow.name})",
            "approver": contact.id,
        }
        # Print the role information (actual implementation will involve creating roles in the directory service)
        print(role)
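Taking the first matching contact with [0] raises an IndexError when an agency has no contact with the expected role. The sketch below shows one way to make that case explicit. StubContact is a hypothetical stand-in carrying only the two fields used here, find_approver and role_id are illustrative helpers (not part of pysdmx), and "APPROVER" remains an assumption about how the approver is marked in the Registry.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class StubContact:
    """Hypothetical stand-in for a contact; only the fields used here."""

    id: str
    role: str


def find_approver(
    contacts: Sequence[StubContact], role: str = "APPROVER"
) -> Optional[str]:
    """Return the id of the first contact with the given role, or None."""
    return next((c.id for c in contacts if c.role == role), None)


def role_id(app: str, flow_id: str, access: str) -> str:
    """Build a role ID following the R_<app>_<flow>_<access> convention."""
    if access not in ("RO", "RW"):
        raise ValueError(f"Unknown access type: {access}")
    return f"R_{app}_{flow_id}_{access}"


# Made-up usernames and roles, for illustration only
contacts = [StubContact("jdoe", "OWNER"), StubContact("asmith", "APPROVER")]
approver = find_approver(contacts)
missing = find_approver([])
example_id = role_id("MYAPP", "FLOW_A", "RO")
```

A None result can then be handled deliberately, for instance by skipping the role or by falling back to a default approver, rather than failing halfway through the loop.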
The roles, once created, are used to define access permissions to the folders we created previously. The details of setting these permissions are specific to the operating system and the chosen directory service.
Summary
In this tutorial, we have created a client to retrieve metadata from an SDMX
Registry and used its get_categories, get_providers, and get_agencies methods
to create a filesystem layout, organize dataflows, and grant access via
dedicated roles.