.. _data-rw:

Reading and writing SDMX datasets
=================================

.. note::

        This tutorial shows how to read and write SDMX datasets using ``pysdmx``.

    - :ref:`data-rw`.

.. warning::
    To read and write data, you must use the extra "data". You may need to install it using the following command:

    .. code-block:: bash

        pip install pysdmx[data]

    For SDMX-ML format, you need to install the extra "xml" as well:

    .. code-block:: bash

        pip install pysdmx[data,xml]


``pysdmx`` allows to read and write SDMX datasets in the following formats:

- SDMX-CSV 1.0 (located in ``pysdmx.io.csv.sdmx10``)
- SDMX-CSV 2.0 (located in ``pysdmx.io.csv.sdmx20``)
- SDMX-ML 2.1 (located in ``pysdmx.io.xml.sdmx21``)
    - SDMX-ML 2.1 Generic
    - SDMX-ML 2.1 Structure Specific

Currently, all data-related readers and writers are based on PandasDataset class.

.. autoclass:: pysdmx.io.pd.PandasDataset
    :show-inheritance:
    :undoc-members:

Reading data
------------

To read data, we recommend using the read_sdmx function or the get_datasets function:

.. autofunction:: pysdmx.io.read_sdmx

A typical example to read data from a file, a string or a buffer, using read_sdmx:

.. code-block:: python

   from pysdmx.io import read_sdmx

    # Read file from the same folder as this code
    file_path = Path(__file__).parent / "sample.csv"

    # Read from file
    data_msg = read_sdmx(file_path)

    # Read from URL
    data_msg = read_sdmx("https://example.com/sample.csv")

    # Extracting the datasets (list of Dataset)
    datasets = data_msg.data

    # Accessing the data of the test dataset by its Short URN
    df = data_msg.get_dataset("DataStructure=TEST_AGENCY:TEST_ID(1.0)").data

    # Accessing the data of the test dataset by its position in the SDMX Message
    df = data_msg.data[0].data


By default, the read_sdmx function will automatically detect the format of the file and use the appropriate reader. We may as well use the get_datasets to associate a dataset to its Schema:

.. autofunction:: pysdmx.io.get_datasets

.. important::

        If the structures message is used, the get_datasets function will associate the dataset to its Schema. If the structures message is not used, the get_datasets function will return a list of datasets without any Schema association.
        If a dataset references a dataflow, the structure message requires to have the dataflow children (or all descendants), i.e. the DataStructureDefinitions associated to this Dataflow in the same SDMX Message (with or without referenced artefacts like Codelists, ConceptSchemes, etc).

.. code-block:: python

    from pysdmx.io import get_datasets

    # Read file from the same folder as this code (SDMX-CSV 2.0)
    data_path = Path(__file__).parent / "sample.csv"

    # Data contains a reference to the dataflow ``Dataflow=MD:TEST(1.0)``
    datasets = get_datasets(data_path)

    print(datasets[0].structure)  # Outputs a string with the Schema Short URN -> "Dataflow=MD:TEST(1.0)"

    # Reading the datasets and associating the schema
    datasets = get_datasets(data_path, "https://example.com/dataflow/MD/TEST/1.0?references=descendants")

    print(datasets[0].structure)  # Outputs a Schema object with the associated components


Both methods are based on the individual readers for each format supported, which are described below.
All individual readers will have a string as input.

SDMX-CSV 1.0
^^^^^^^^^^^^

`SDMX-CSV 1.0 specification <https://github.com/sdmx-twg/sdmx-csv/blob/v1.0/data-message/docs/sdmx-csv-field-guide.md>`_

.. warning::

        The SDMX-CSV 1.0 format is deprecated and should not be used for new implementations.
        It only allows a dataflow to be represented, which is not enough for most use cases.

.. autofunction:: pysdmx.io.csv.sdmx10.reader.read

.. code-block:: python

   from pysdmx.io.input_processor import process_string_to_read
   from pysdmx.io.csv.sdmx10.reader import read

   from pathlib import Path

   # Read file sample.csv from the same folder as this code
   file_path = Path(__file__).parent / "sample10.csv"
   input_str, format = process_string_to_read(file_path)

   # Using reader, result will be a list of datasets
   datasets = read(input_str)
   # Accessing the data of the test dataset
   df = dataset[0].data

SDMX-CSV 2.0
^^^^^^^^^^^^

`SDMX-CSV 2.0 specification <https://github.com/sdmx-twg/sdmx-csv/blob/v2.0.0/data-message/docs/sdmx-csv-field-guide.md>`_

.. autofunction:: pysdmx.io.csv.sdmx20.reader.read

We currently support only comma as the delimiter.
Only the `ordinary case <https://github.com/sdmx-twg/sdmx-csv/blob/v2.0.0/data-message/docs/
sdmx-csv-field-guide.md#1-ordinary-case>`_ is supported.

You may use any custom script for the remaining use cases, if anyone is interested in them, please
raise an issue in `GitHub <https://github.com/bis-med-it/pysdmx>`_.

.. code-block:: python

   from pysdmx.io.input_processor import process_string_to_read
   from pysdmx.io.csv.sdmx20.reader import read
   from pathlib import Path

   # Read file from the same folder as this code
   file_path = Path(__file__).parent / "sample20.csv"
   input_str, format = process_string_to_read(file_path)

   # Using reader, result will be a list of datasets
   datasets = read(input_str)
   # Accessing the data of the test dataset
   df = dataset[0].data

SDMX-ML 2.1 Data Readers
^^^^^^^^^^^^^^^^^^^^^^^^

SDMX-ML 2.1 format is described
`here (in Source Code, check documentation folder) <https://github.com/sdmx-twg/sdmx-ml/releases/tag/v2.1>`_

```pysdmx`` supports both Generic and Structure Specific SDMX-ML 2.1 to handle data on SDMX-ML, both as All Dimensions or Series format.

.. autofunction:: pysdmx.io.xml.sdmx21.reader.generic.read

.. autofunction:: pysdmx.io.xml.sdmx21.reader.structure_specific.read

We do not support the following elements:

- Dimension Group
- Reference to Provision Agreement

The reader supports both Generic and Structure Specific SDMX-ML 2.1.
It will automatically detect any structural validation errors (if validate=True) and raise an exception.

.. warning::

    The SDMX-ML 2.1 Generic format is deprecated and should not be used for new implementations. SDMX-ML 3.0 only uses the Structure Specific format, which is more efficient and easier to use.

.. code-block:: python

   from pysdmx.io.input_processor import process_string_to_read
   from pysdmx.io.xml.sdmx21.reader.generic import read as read_generic  # For Generic format
   from pysdmx.io.xml.sdmx21.reader.structure_specific import read # For Structure Specific format
   from pathlib import Path

   # Read file from the same folder as this code
   file_path = Path(__file__).parent / "sample21.xml"
   input_str, format = process_string_to_read(file_path)

   # Using reader, result will be a list of datasets
   datasets = read(input_str, validate=True)

   # Accessing the data of the test dataset
   df = dataset[0].data


Writing data
------------

``pysdmx`` allows to return the written data as a string or write it to a file. SDMX-CSV writers only allow one dataset to be written at a time, while SDMX-ML writers allow multiple datasets to be written at once.

SDMX-CSV 1.0
^^^^^^^^^^^^

`SDMX-CSV 1.0 specification <https://github.com/sdmx-twg/sdmx-csv/blob/v1.0/data-message/docs/sdmx-csv-field-guide.md>`_

.. warning::

        The SDMX-CSV 1.0 format is deprecated and should not be used for new implementations.
        It only allows a dataflow to be represented, which is not enough for most use cases.

.. autofunction:: pysdmx.io.csv.sdmx10.writer.write

.. code-block:: python

   from pysdmx.io.csv.sdmx10.writer import write
   from pathlib import Path

   # Write to file sample.csv in the same folder as this code
   file_path = Path(__file__).parent / "sample.csv"

   # Write the datasets (list of Dataset or PandasDataset) to the file
   write(datasets, file_path)


SDMX-CSV 2.0
^^^^^^^^^^^^

`SDMX-CSV 2.0 specification <https://github.com/sdmx-twg/sdmx-csv/blob/v2.0.0/data-message/docs/sdmx-csv-field-guide.md>`_

.. note::

        The SDMX-CSV 2.0 writer will write the data as the `ordinary case <https://github.com/sdmx-twg/sdmx-csv/blob/v2.0.0/data-message/docs/sdmx-csv-field-guide.md#1-ordinary-case>`_. If you need to write data in other cases, you may need to write a custom script.

.. warning::

        We use only comma as the delimiter.

.. autofunction:: pysdmx.io.csv.sdmx20.writer.write

.. code-block:: python

   from pysdmx.io.csv.sdmx20.writer import write
   from pathlib import Path

   # Write to file sample.csv in the same folder as this code
   file_path = Path(__file__).parent / "sample.csv"
   write(dataset, file_path)

SDMX-ML 2.1 Data Writers
^^^^^^^^^^^^^^^^^^^^^^^^

SDMX-ML 2.1 format is described
`here (in Source Code, check documentation folder) <https://github.com/sdmx-twg/sdmx-ml/releases/tag/v2.1>`_

SDMX-ML 2.1 format allows to write multiple datasets at once. To use the Series format, you need to pass the dimension at observation dictionary, where the key is the dataset short urn and the value is the dimension id to be observed.

.. important::

    For each dataset, if dataset.structure is not a Schema, the writer can only write in the Structure Specific All Dimensions format.
    We perform a check to ensure that the dataset has a Schema structure for the remaining formats as we need to know the roles for each component.
    This check also ensures that the dataset.structure has at least one dimension and one measure defined.

.. autofunction:: pysdmx.io.xml.sdmx21.writer.generic.write
.. autofunction:: pysdmx.io.xml.sdmx21.writer.structure_specific.write

.. code-block:: python

    from pysdmx.io.xml.sdmx21.writer.generic import write as write_generic  # For Generic format
    from pysdmx.io.xml.sdmx21.writer.structure_specific import write  # For StructureSpecific format
    from pathlib import Path

    # List of datasets to write
    datasets = [dataset1, dataset2]

    # Dimension at observation mapping (do not need to set them all if not needed
    dim_mapping = {
        "DataStructure=TEST_AGENCY:TEST_ID(1.0)": "TIME_PERIOD"
    }

    # Write to file sample.xml in the same folder as this code
    file_path = Path(__file__).parent / "sample.xml"
    write(datasets, file_path, dimension_at_observation=dim_mapping)  # This will write a Dataset in Series and another in AllDimensions format