Dataflows and data structures

Note

Additional information about how dataflows (and related structures) can be used to drive statistical processes is available in the tutorials.

Model for SDMX dataflows and related structures (like schemas).

pysdmx is dataflow-centric, another area where pysdmx is opinionated. As such, when retrieving information about a dataflow, information typically provided via the data structure (and related structures like concept schemes and codelists) is already provided as part of the response.

class pysdmx.model.dataflow.Component(id: str, required: bool, role: Role, dtype: DataType = DataType.STRING, facets: Facets | None = None, name: str | None = None, description: str | None = None, codes: Codelist | Hierarchy | None = None, attachment_level: str | None = None, array_def: ArrayBoundaries | None = None)

A component of a dataset (aka variable), such as the frequency.

Concepts are used to describe the relevant characteristics of a statistical domain. For example, exchange rates might be described with components such as the numerator currency, the denominator currency, the type of exchange rates, etc.

Some of these components are expected to be useful across statistical domains. Examples of such components include the frequency, the observation status, the confidentiality, etc.

When using components to describe the expected structure of a statistical domain, data stewards distinguish between the components that represent what is being captured (i.e. the measures), the components that help uniquely identify the measures (i.e. the dimensions) and the components that provide additional descriptive information about the measures (i.e. the attributes). This is the component role. The role can be D (for Dimension), A (for Attribute) or M (for Measure).

While dimensions and measures are typically mandatory, attributes may be either mandatory or optional. This is captured in the required property using a boolean value (true for mandatory components, false otherwise). This may vary with the statistical domain, i.e. a mandatory component within a particular domain may be optional in another.

While the value of some attributes is expected to potentially vary with each measurement (aka observation or data point), others must be unique across all observations sharing the same (sub)set of dimension values. This is captured in the attachment_level property, which can be one of: D (for Dataset), O (for Observation), a string identifying a component ID (FREQ) or a comma-separated list of component IDs (FREQ,REF_AREA). The last two forms identify the dimension, group or series to which the attribute is attached. The attachment level of a component may vary with the statistical domain, i.e. a component attached to a series in a particular domain may be attached to, say, the dataset in another domain.

The codes field indicates the expected (i.e. allowed) set of values a component can take within a particular domain. In addition to (or instead of) a set of codes, additional details about the expected format may be found in the facets and dtype fields.

id

A unique identifier for the component (e.g. FREQ).

required

Whether the component must have a value.

role

The role played by the component.

dtype

The component’s data type (string, number, etc.).

facets

Additional details such as the component’s minimum length.

name

The component’s name.

description

Additional descriptive information about the component.

codes

The expected values for the component (e.g. currency codes).

attachment_level

The attachment level (only if role = A). Attributes can be attached at different levels such as D (for dataset-level attributes), O (for observation-level attributes) or a combination of dimension IDs, separated by commas (for series- and group-level attributes).

array_def

Any additional constraints for array types.
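
The attachment_level values described above can be interpreted with a small helper. The following is a minimal sketch in plain Python and is not part of pysdmx: the function name describe_attachment and the labels it returns are hypothetical, chosen only for illustration. It assumes that no component is itself named D or O, since such an ID would be indistinguishable from the dataset and observation markers.

```python
from typing import Optional, Tuple


def describe_attachment(level: Optional[str]) -> Tuple[str, Tuple[str, ...]]:
    """Classify an attachment_level string (hypothetical helper, not pysdmx API).

    Returns a (kind, component_ids) pair, where kind is one of
    "none", "dataset", "observation" or "components".
    """
    if level is None:
        # Typically the case for components that are not attributes.
        return ("none", ())
    if level == "D":
        return ("dataset", ())
    if level == "O":
        return ("observation", ())
    # Anything else is a component ID or a comma-separated list of IDs.
    ids = tuple(part.strip() for part in level.split(",") if part.strip())
    return ("components", ids)


print(describe_attachment("D"))              # ('dataset', ())
print(describe_attachment("FREQ,REF_AREA"))  # ('components', ('FREQ', 'REF_AREA'))
```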

class pysdmx.model.dataflow.Components(iterable)

A collection of components describing the data.

append(item)

Add a component to the existing list of components.

Return type:

None

property attributes: Sequence[Component]

Return the list of attributes.

Attributes are components that provide descriptive information about some piece of data (aka an observation or data point).

Returns:

The list of attributes

property dimensions: Sequence[Component]

Return the list of dimensions.

Dimensions are components that contribute to the unique identification of a piece of data (aka an observation or data point). The combination of the values for all dimensions of an observation can therefore be seen as the observation’s primary key.

Returns:

The list of dimensions

extend(other)

Add the components to the existing list of components.

Return type:

None

insert(i, item)

Add a component at the requested index.

Return type:

None

property measures: Sequence[Component]

Return the list of measures.

Measures are components that hold the measured values.

Returns:

The list of measures
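
To illustrate how roles partition a collection of components, and how the dimension values of an observation can act as its key, here is a minimal sketch using stand-in classes. It does not use or reproduce the pysdmx implementation: SimpleComponent, by_role and observation_key are hypothetical names, and the stand-in component carries only the id and role fields documented above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Tuple


class Role(Enum):
    """Stand-in mirroring the documented role values (A, D, M)."""

    ATTRIBUTE = "A"
    DIMENSION = "D"
    MEASURE = "M"


@dataclass(frozen=True)
class SimpleComponent:
    """Hypothetical, reduced stand-in for a component (id and role only)."""

    id: str
    role: Role


def by_role(components: List[SimpleComponent], role: Role) -> List[SimpleComponent]:
    """Return the components playing the given role, preserving their order."""
    return [c for c in components if c.role is role]


def observation_key(
    components: List[SimpleComponent], obs: Dict[str, str]
) -> Tuple[str, ...]:
    """Build an observation's key from its dimension values, in component order."""
    return tuple(obs[c.id] for c in by_role(components, Role.DIMENSION))


components = [
    SimpleComponent("FREQ", Role.DIMENSION),
    SimpleComponent("REF_AREA", Role.DIMENSION),
    SimpleComponent("OBS_VALUE", Role.MEASURE),
    SimpleComponent("OBS_STATUS", Role.ATTRIBUTE),
]
obs = {"FREQ": "M", "REF_AREA": "CH", "OBS_VALUE": "1.5", "OBS_STATUS": "A"}

print([c.id for c in by_role(components, Role.MEASURE)])  # ['OBS_VALUE']
print(observation_key(components, obs))                   # ('M', 'CH')
```

Two observations that share the same key would, under this reading, describe the same data point.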

class pysdmx.model.dataflow.DataflowInfo(id: str, components: Components, agency: Organisation, name: str | None = None, description: str | None = None, version: str = '1.0', providers: Sequence[Organisation] = (), series_count: int | None = None, obs_count: int | None = None, start_period: str | None = None, end_period: str | None = None, last_updated: datetime | None = None, dsd_ref: str | None = None)

Extended information about a dataflow.

The information includes:

  • Some basic metadata about the dataflow (such as its ID and name).

  • Some useful metrics such as the number of observations.

  • The expected structure of data (i.e. the data schema), including the expected components, their types, etc.

id

The identifier of the dataflow (e.g. CBS).

components

The data structure, i.e. the components, their types, etc.

agency

The organization responsible for the data (e.g. BIS).

name

The dataflow’s name (e.g. Consolidated Banking Statistics).

description

Additional descriptive information about the dataflow.

version

The dataflow version.

providers

The organizations providing the data.

series_count

The number of series available in the dataflow.

obs_count

The number of observations available in the dataflow.

start_period

The oldest period for which data are available.

end_period

The most recent period for which data are available.

last_updated

When the dataflow was last updated.

dsd_ref

The URN of the data structure used by the dataflow.

class pysdmx.model.dataflow.Role(value)

The various roles a component can play.

ATTRIBUTE = 'A'

The component provides descriptive information about the data.

DIMENSION = 'D'

The component helps identify data (e.g. primary key).

MEASURE = 'M'

The component holds a value we measure or collect.

class pysdmx.model.dataflow.Schema(context: str, agency: str, id: str, components: Components, version: str = '1.0', artefacts: Sequence[str] = (), generated: datetime = datetime.datetime(2024, 2, 23, 8, 32, 44, 237058))

The allowed content within a certain context.

This is equivalent to the result of a schema query in the SDMX-REST API.

The response contains the list of allowed values for the selected context (one of data structure, dataflow or provision agreement), and is typically used for validation purposes.
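
As an illustration of the validation use case, the sketch below checks a record against a list of rules carrying the kind of information a schema provides (whether a component is required, and which codes it allows). It is a simplified, standalone example and not the pysdmx API: ComponentRule and validate are hypothetical names, and allowed codes are reduced to a plain set of strings.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional


@dataclass(frozen=True)
class ComponentRule:
    """Hypothetical, reduced view of a schema component used for validation."""

    id: str
    required: bool
    allowed: Optional[FrozenSet[str]] = None  # None means "no code restriction"


def validate(record: Dict[str, str], rules: List[ComponentRule]) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for rule in rules:
        value = record.get(rule.id)
        if value is None or value == "":
            # A missing value is only a problem for required components.
            if rule.required:
                problems.append(f"{rule.id}: missing required value")
            continue
        if rule.allowed is not None and value not in rule.allowed:
            problems.append(f"{rule.id}: '{value}' is not an allowed code")
    return problems


rules = [
    ComponentRule("FREQ", required=True, allowed=frozenset({"A", "M", "Q"})),
    ComponentRule("OBS_VALUE", required=True),
    ComponentRule("OBS_STATUS", required=False, allowed=frozenset({"A", "E"})),
]

print(validate({"FREQ": "M", "OBS_VALUE": "1.5"}, rules))  # []
print(validate({"FREQ": "W"}, rules))
```

A fuller check would also take the dtype and facets information into account; that is left out here to keep the sketch short.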

context

The context for which the schema is provided. One of datastructure, dataflow or provisionagreement.

agency

The agency maintaining the context (e.g. BIS).

id

The ID of the context (e.g. BIS_MACRO).

components

The list of components along with their allowed values, types, etc.

version

The context version (e.g. 1.0).

artefacts

The URNs of the artefacts used to generate the schema. This will typically include the URNs of data structures, codelists, concept schemes, content constraints, etc.

generated

When the schema was generated. This is useful for metadata synchronization purposes. For example, if any of the artefacts listed under the artefacts property has been updated after the schema was generated, you might want to regenerate the schema.
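
The synchronization check described above can be sketched as follows. This is an assumption-laden example rather than a pysdmx feature: the function stale_artefacts is hypothetical, the URNs are placeholders, and the mapping of each artefact URN to its last-update timestamp is assumed to have been obtained by some other means.

```python
from datetime import datetime, timezone
from typing import Dict, List


def stale_artefacts(
    generated: datetime, last_updated: Dict[str, datetime]
) -> List[str]:
    """Return the URNs of artefacts updated after the schema was generated."""
    return sorted(
        urn for urn, updated in last_updated.items() if updated > generated
    )


# When the schema was generated (illustrative value).
generated = datetime(2024, 2, 23, 8, 0, tzinfo=timezone.utc)

# Placeholder URNs mapped to assumed last-update timestamps.
last_updated = {
    "urn:example:codelist:CL_FREQ": datetime(2024, 1, 10, tzinfo=timezone.utc),
    "urn:example:codelist:CL_AREA": datetime(2024, 3, 1, tzinfo=timezone.utc),
}

outdated = stale_artefacts(generated, last_updated)
print(outdated)  # ['urn:example:codelist:CL_AREA']
if outdated:
    print("Schema may be out of date; consider regenerating it.")
```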