Introduction to efsa-tools

Overview

The efsa-tools package brings together all the functions developed for EFSA's ad hoc data collections, providing tools for dataset operations as well as utilities designed to preserve data history.

The package is intended for researchers, analysts, and practitioners who require convenient programmatic access to data collection utilities.

During installation, the following packages developed by EFSA are also installed:

- eppoPynder - Website | PyPI.
- pystiller - Website | PyPI.

These packages are not required to use efsa-tools, but are included for convenience and can be imported and used directly in the code if needed:

import eppopynder
# and/or
import pystiller

Installation

From PyPI

pip install efsa-tools

Development version

To install the latest development version:

pip install git+https://github.com/openefsa/efsa_tools.git

Basic usage

The main purpose of efsa-tools is to provide tools for managing datasets and tracking data history within the context of data collections.

Below are examples demonstrating how to use the functions in this package. First, load the efsa-tools package:

from efsa_tools import *

To explore the arguments and usage of a specific function, you can run:

help(function_name)

This will show the full documentation for the function, including its arguments, return values, and usage examples.

For example, if you are working with the SCD2() function, you can check its documentation with:

help(SCD2)

Dropping empty rows and columns from a data frame

If a data frame contains empty rows or columns, you can remove them using the drop_empty() function, as follows:

# `iris` is assumed to be a pandas DataFrame already loaded in the session.
iris_dropped = drop_empty(iris)

print(iris_dropped.head())
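
To make the effect concrete, here is a minimal sketch of a comparable operation written in plain pandas. It is not the implementation of drop_empty(); it assumes that "empty" means "all values missing", and the package function may apply a different rule (for instance, for empty strings). The data frame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data: row 1 and column "c" contain only missing values.
df = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0],
        "b": ["x", None, "z"],
        "c": [np.nan, np.nan, np.nan],
    }
)

# Drop rows where every value is missing, then columns where every value is missing.
cleaned = df.dropna(axis=0, how="all").dropna(axis=1, how="all")

print(cleaned)
```

Rows or columns that are only partially filled are kept, since how="all" removes only those that are entirely missing.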

Enriching a data frame with an EFSA catalogue

The enrich() function augments a data frame with information stored in an EFSA catalogue. It requires the column used to join the two datasets and the name of the new column that will hold the enriched information (namely, the 'NAME' field of the EFSA catalogue).

enriched_data_frame = enrich(
    dataframe=dataframe_,
    catalogue=CV_MTX_,
    join_by="CODE",
    enriched_column_name="enrichedColumn"
)

print(enriched_data_frame.head())
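
Conceptually, this kind of enrichment resembles a left join against a lookup table. The sketch below shows that idea in plain pandas. It is an assumption about the general technique, not the code behind enrich(). The codes, names, and the "value" column are hypothetical and do not come from a real EFSA catalogue.

```python
import pandas as pd

# Hypothetical input data: each record carries a catalogue code.
data = pd.DataFrame({"CODE": ["A01", "A02", "A99"], "value": [10, 20, 30]})

# Hypothetical catalogue exposing the 'CODE' and 'NAME' fields.
catalogue = pd.DataFrame({"CODE": ["A01", "A02"], "NAME": ["Wheat", "Barley"]})

# Keep only the join key and the NAME field, renamed to the output column.
lookup = catalogue[["CODE", "NAME"]].rename(columns={"NAME": "enrichedColumn"})

# A left join keeps every input row; codes missing from the catalogue get NaN.
enriched = data.merge(lookup, on="CODE", how="left")

print(enriched)
```

In this sketch, a code that is absent from the catalogue ("A99") is kept with a missing value in the enriched column rather than being dropped.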

Removing replicated columns from a data frame

The remove_replicated_columns() function merges all the replicated columns in a data frame into a single column whose name includes the "_deduplicated" suffix. After the merge, the original replicated columns are removed from the data frame.

Consider, for example, a data frame containing the columns region_1, region_2, ..., region_n with n > 100. Using the remove_replicated_columns() function, these columns can be consolidated into a single region_deduplicated column, assuming that for each row only one of the n columns contains a meaningful (non-NA) value. The snippet below illustrates the call on a smaller case, in which the Species column of the iris data frame is replicated twice:

iris["Species_1"] = iris["Species"]
iris["Species_2"] = iris["Species"]
iris.drop(columns=["Species"], inplace=True)

iris_deduplicated = remove_replicated_columns(
    dataframe=iris,
    prefix="Species_"
)

print(iris_deduplicated.head())
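
The region_1, ..., region_n scenario described above can be sketched in plain pandas as follows. This is an illustration of the row-wise "take the first non-missing value" idea under the stated assumption that each row has at most one filled column. It is not the implementation of remove_replicated_columns(), and the data is hypothetical.

```python
import pandas as pd

# Hypothetical data: for each row, only one region_* column holds a value.
df = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "region_1": ["North", None, None],
        "region_2": [None, "South", None],
        "region_3": [None, None, "East"],
    }
)

prefix = "region_"
replicated = [col for col in df.columns if col.startswith(prefix)]

# Back-fill across the replicated columns so that the first of them
# ends up holding the first non-missing value found in each row.
merged = df[replicated].bfill(axis=1).iloc[:, 0]

# Drop the original replicated columns and add the consolidated one.
result = df.drop(columns=replicated).assign(region_deduplicated=merged)

print(result)
```

If more than one replicated column were filled in the same row, this sketch would silently keep the leftmost value, which is why the single-value assumption matters.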

Implementing a "Simple" Slowly Changing Dimension Type 2 (SSCD2)

The SSCD2() function makes it possible to preserve data history when new data becomes available by implementing a simplified version of Slowly Changing Dimension Type 2. It marks all records in the current data frame as inactive and appends the new data, flagging each newly added record as active.

Unlike the SCD2() function, SSCD2() does not check which records have actually changed: every existing record is set to inactive, even if it still appears in the updated dataset, and every incoming record is treated as new.

An example of how to use the function is provided below:

sscd2_dataframe = SSCD2(new_data=new_dataframe, current_data=old_dataframe)

print(sscd2_dataframe.head())
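
The behaviour described in this section can be sketched in a few lines of pandas. This is only an illustration of the "deactivate everything, then append" logic. The flag column name "is_active" and the sample data are assumptions, and SSCD2() may track additional fields (such as validity dates) that are not shown here.

```python
import pandas as pd

# Hypothetical current and incoming data; "is_active" is an assumed flag name.
current = pd.DataFrame(
    {"id": [1, 2], "value": ["a", "b"], "is_active": [True, True]}
)
incoming = pd.DataFrame({"id": [2, 3], "value": ["b", "c"]})

# Mark every existing record as inactive, whether or not it changed.
history = current.assign(is_active=False)

# Append all incoming records, flagged as active.
updated = pd.concat(
    [history, incoming.assign(is_active=True)], ignore_index=True
)

print(updated)
```

Note that the record with id 2 appears twice in the result: once as inactive history and once as the new active record, even though its content did not change.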

Implementing a Slowly Changing Dimension Type 2 (SCD2)

The SCD2() function makes it possible to preserve data history when new data becomes available by implementing a Slowly Changing Dimension Type 2. It compares the current records with the new ones, marking as inactive any existing records that no longer appear in the updated dataset. Then, it flags as active any new records that are not present among the currently active data.

Unlike the SSCD2() function, SCD2() checks which records have actually changed, so records that appear unchanged in the updated dataset keep their active status and are not duplicated.

An example of how to use the function is provided below:

scd2_dataframe = SCD2(new_data=new_dataframe, current_data=old_dataframe)

print(scd2_dataframe.head())
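
As a rough illustration of the comparison step, the sketch below applies the same rules in plain pandas: deactivate active records that are absent from the incoming data, and append incoming records that are not already active. It is a simplified sketch, not the implementation of SCD2(). The "is_active" flag, the choice of comparison columns, and the sample data are all assumptions.

```python
import pandas as pd

# Columns used to decide whether two records are "the same" (assumed).
key_cols = ["id", "value"]

# Hypothetical current and incoming data; "is_active" is an assumed flag name.
current = pd.DataFrame(
    {"id": [1, 2], "value": ["a", "b"], "is_active": [True, True]}
)
incoming = pd.DataFrame({"id": [2, 3], "value": ["b", "c"]})

# Represent each record as a tuple-like key for membership tests.
current_keys = pd.MultiIndex.from_frame(current[key_cols])
incoming_keys = pd.MultiIndex.from_frame(incoming[key_cols])
active_keys = current_keys[current["is_active"].to_numpy()]

# Which current records still appear in the incoming data?
still_present = current_keys.isin(incoming_keys)
# Which incoming records already exist among the active ones?
already_active = incoming_keys.isin(active_keys)

# Deactivate active records that no longer appear in the incoming data.
updated = current.copy()
updated.loc[updated["is_active"] & ~still_present, "is_active"] = False

# Append only the genuinely new records, flagged as active.
new_rows = incoming.loc[~already_active].assign(is_active=True)
updated = pd.concat([updated, new_rows], ignore_index=True)

print(updated)
```

With the same inputs as a "simple" update, the unchanged record (id 2) stays active and appears only once, so the result has three rows instead of four.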