Introduction to frictionless

Peter Desmet

2022-01-03

What is a Data Package?

A Data Package is a simple container format to describe and package a collection of (tabular) data. It is typically used to publish FAIR and open datasets. In this tutorial we will show you how to read, create, edit and write Data Packages with the R pkg frictionless.

package vs pkg: in the Frictionless Data community package refers to a Data Package, while in R that is a software package. In this tutorial we will use pkg to refer to the latter.

Read a Data Package

Load the frictionless pkg with:

library(frictionless)

To read a Data Package, you need to know the path or URL to its descriptor file, named datapackage.json. That file describes the Data Package, provides access points to its Data Resources and can contain dataset-level metadata. Let’s read a Data Package descriptor file published on Zenodo:

package <- read_package("https://zenodo.org/record/5879096/files/datapackage.json")
#> Please make sure you have the right to access data from this Data Package for your intended use.
#> Follow applicable norms or requirements to credit the dataset and its authors.
#> For more information, see https://doi.org/10.5281/zenodo.5879096

read_package() returns the content of datapackage.json as a list, printed here with str() to improve readability:

str(package, list.len = 3)
#> List of 4
#>  $ id       : chr "https://doi.org/10.5281/zenodo.5879096"
#>  $ profile  : chr "tabular-data-package"
#>  $ resources:List of 3
#>   ..$ :List of 7
#>   .. ..$ name     : chr "reference-data"
#>   .. ..$ path     : chr "O_WESTERSCHELDE-reference-data.csv"
#>   .. ..$ profile  : chr "tabular-data-resource"
#>   .. .. [list output truncated]
#>   ..$ :List of 7
#>   .. ..$ name     : chr "gps"
#>   .. ..$ path     : chr [1:3] "O_WESTERSCHELDE-gps-2018.csv.gz" "O_WESTERSCHELDE-gps-2019.csv.gz" "O_WESTERSCHELDE-gps-2020.csv.gz"
#>   .. ..$ profile  : chr "tabular-data-resource"
#>   .. .. [list output truncated]
#>   ..$ :List of 7
#>   .. ..$ name     : chr "acceleration"
#>   .. ..$ path     : chr [1:3] "O_WESTERSCHELDE-acceleration-2018.csv.gz" "O_WESTERSCHELDE-acceleration-2019.csv.gz" "O_WESTERSCHELDE-acceleration-2020.csv.gz"
#>   .. ..$ profile  : chr "tabular-data-resource"
#>   .. .. [list output truncated]
#>   [list output truncated]

The most important aspect of a Data Package are its Data Resources, which describe and point to the data. You can list all included resources with resources():

resources(package)
#> [1] "reference-data" "gps"            "acceleration"

This Data Package has 3 resources. Let’s read the data from the gps resource into a data frame:

gps <- read_resource(package, "gps")
gps
#> # A tibble: 73,047 × 21
#>    event-i…¹ visible timestamp           locat…² locat…³ bar:b…⁴ exter…⁵ gps:d…⁶
#>        <dbl> <lgl>   <dttm>                <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
#>  1   1.43e10 TRUE    2018-05-25 16:11:37    4.25    51.3      NA    32.5     2  
#>  2   1.43e10 TRUE    2018-05-25 16:16:41    4.25    51.3      NA    32.8     2.1
#>  3   1.43e10 TRUE    2018-05-25 16:21:29    4.25    51.3      NA    34.1     2.1
#>  4   1.43e10 TRUE    2018-05-25 16:26:28    4.25    51.3      NA    34.5     2.2
#>  5   1.43e10 TRUE    2018-05-25 16:31:21    4.25    51.3      NA    34.1     2.2
#>  6   1.43e10 TRUE    2018-05-25 16:36:09    4.25    51.3      NA    32.5     2.2
#>  7   1.43e10 TRUE    2018-05-25 16:40:57    4.25    51.3      NA    32.1     2.2
#>  8   1.43e10 TRUE    2018-05-25 16:45:55    4.25    51.3      NA    33.3     2.1
#>  9   1.43e10 TRUE    2018-05-25 16:50:49    4.25    51.3      NA    32.6     2.1
#> 10   1.43e10 TRUE    2018-05-25 16:55:36    4.25    51.3      NA    31.7     2  
#> # … with 73,037 more rows, 13 more variables: `gps:satellite-count` <dbl>,
#> #   `gps-time-to-fix` <dbl>, `ground-speed` <dbl>, heading <dbl>,
#> #   `height-above-msl` <dbl>, `location-error-numerical` <dbl>,
#> #   `manually-marked-outlier` <lgl>, `vertical-error-numerical` <dbl>,
#> #   `sensor-type` <chr>, `individual-taxon-canonical-name` <chr>,
#> #   `tag-local-identifier` <chr>, `individual-local-identifier` <chr>,
#> #   `study-name` <chr>, and abbreviated variable names ¹​`event-id`, …

The data frame contains all GPS records, even though the actual data were split over multiple CSV zipped files. read_resource() assigned the column names and types based on the Table Schema that was defined for that resource, not the headers of the CSV file.

You can also read data from a local (e.g. downloaded) Data Package. In fact, there is one included in the frictionless pkg, let’s read that one from disk:

local_package <- read_package(
  system.file("extdata", "datapackage.json", package = "frictionless")
)
#> Please make sure you have the right to access data from this Data Package for your intended use.
#> Follow applicable norms or requirements to credit the dataset and its authors.
read_resource(local_package, "media")
#> # A tibble: 3 × 5
#>   media_id                             deployment_id observati…¹ times…² file_…³
#>   <chr>                                <chr>         <chr>       <chr>   <chr>  
#> 1 aed5fa71-3ed4-4284-a6ba-3550d1a4de8d 1             1-1         2020-0… https:…
#> 2 da81a501-8236-4cbd-aa95-4bc4b10a05df 1             1-1         2020-0… https:…
#> 3 0ba57608-3cf1-49d6-a5a2-fe680851024d 1             1-1         2020-0… https:…
#> # … with abbreviated variable names ¹​observation_id, ²​timestamp, ³​file_path

Data from the media was not stored in a CSV file, but directly in the data property of that resource in datapackage.json. read_resource() will automatically detect where to read data from.

Create and edit a Data Package

Data Package is a good format to technically describe your data, e.g. if you are planning to deposit it on research repository. It also goes a long way meeting FAIR principles.

The frictionless pkg assumes your data are stored as a data frame or CSV files. Let’s use the built-in dataset iris as your data frame:

# Show content of the data frame "iris"
dplyr::as_tibble(iris)
#> # A tibble: 150 × 5
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#>  1          5.1         3.5          1.4         0.2 setosa 
#>  2          4.9         3            1.4         0.2 setosa 
#>  3          4.7         3.2          1.3         0.2 setosa 
#>  4          4.6         3.1          1.5         0.2 setosa 
#>  5          5           3.6          1.4         0.2 setosa 
#>  6          5.4         3.9          1.7         0.4 setosa 
#>  7          4.6         3.4          1.4         0.3 setosa 
#>  8          5           3.4          1.5         0.2 setosa 
#>  9          4.4         2.9          1.4         0.2 setosa 
#> 10          4.9         3.1          1.5         0.1 setosa 
#> # ℹ 140 more rows

Create a Data Package with create_package() and add your data frame as a resource with the name iris:

# Load dplyr or magrittr to support %>% pipes
library(dplyr) # or library(magrittr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

my_package <-
  create_package() %>%
  add_resource(resource_name = "iris", data = iris)

Note that you can chain most frictionless functions together using pipes (%>% or |>), which improves readability.

my_package now contains one resource:

resources(my_package)
#> [1] "iris"

By default, add_resource() will create a Table Schema for your data frame, describing its field names, field types and (for factors) constraints. You can retrieve the schema of a resource with get_schema(). It is a list, which we print here using str():

iris_schema <-
  my_package %>%
  get_schema("iris")

str(iris_schema)
#> List of 1
#>  $ fields:List of 5
#>   ..$ :List of 2
#>   .. ..$ name: chr "Sepal.Length"
#>   .. ..$ type: chr "number"
#>   ..$ :List of 2
#>   .. ..$ name: chr "Sepal.Width"
#>   .. ..$ type: chr "number"
#>   ..$ :List of 2
#>   .. ..$ name: chr "Petal.Length"
#>   .. ..$ type: chr "number"
#>   ..$ :List of 2
#>   .. ..$ name: chr "Petal.Width"
#>   .. ..$ type: chr "number"
#>   ..$ :List of 3
#>   .. ..$ name       : chr "Species"
#>   .. ..$ type       : chr "string"
#>   .. ..$ constraints:List of 1
#>   .. .. ..$ enum: chr [1:3] "setosa" "versicolor" "virginica"

You can also create a schema from a data frame, using create_schema():

iris_schema <- create_schema(iris)

str(iris_schema)
#> List of 1
#>  $ fields:List of 5
#>   ..$ :List of 2
#>   .. ..$ name: chr "Sepal.Length"
#>   .. ..$ type: chr "number"
#>   ..$ :List of 2
#>   .. ..$ name: chr "Sepal.Width"
#>   .. ..$ type: chr "number"
#>   ..$ :List of 2
#>   .. ..$ name: chr "Petal.Length"
#>   .. ..$ type: chr "number"
#>   ..$ :List of 2
#>   .. ..$ name: chr "Petal.Width"
#>   .. ..$ type: chr "number"
#>   ..$ :List of 3
#>   .. ..$ name       : chr "Species"
#>   .. ..$ type       : chr "string"
#>   .. ..$ constraints:List of 1
#>   .. .. ..$ enum: chr [1:3] "setosa" "versicolor" "virginica"

Doing so allows you to customize the schema before adding the resource. E.g. let’s add a description for Sepal.Length:

iris_schema$fields[[1]]$description <- "Sepal length in cm."
# Show result
str(iris_schema$fields[[1]])
#> List of 3
#>  $ name       : chr "Sepal.Length"
#>  $ type       : chr "number"
#>  $ description: chr "Sepal length in cm."

Since schema is a list, you can use the purrr pkg to edit multiple elements at once:

# Remove description for first field
iris_schema$fields[[1]]$description <- NULL

# Set descriptions for all fields
descriptions <- c(
  "Sepal length in cm.",
  "Sepal width in cm.",
  "Pedal length in cm.",
  "Pedal width in cm.",
  "Iris species."
)
iris_schema$fields <- purrr::imap(
  iris_schema$fields,
  ~ c(.x, description = descriptions[.y])
)

str(iris_schema)
#> List of 1
#>  $ fields:List of 5
#>   ..$ :List of 3
#>   .. ..$ name       : chr "Sepal.Length"
#>   .. ..$ type       : chr "number"
#>   .. ..$ description: chr "Sepal length in cm."
#>   ..$ :List of 3
#>   .. ..$ name       : chr "Sepal.Width"
#>   .. ..$ type       : chr "number"
#>   .. ..$ description: chr "Sepal width in cm."
#>   ..$ :List of 3
#>   .. ..$ name       : chr "Petal.Length"
#>   .. ..$ type       : chr "number"
#>   .. ..$ description: chr "Pedal length in cm."
#>   ..$ :List of 3
#>   .. ..$ name       : chr "Petal.Width"
#>   .. ..$ type       : chr "number"
#>   .. ..$ description: chr "Pedal width in cm."
#>   ..$ :List of 4
#>   .. ..$ name       : chr "Species"
#>   .. ..$ type       : chr "string"
#>   .. ..$ constraints:List of 1
#>   .. .. ..$ enum: chr [1:3] "setosa" "versicolor" "virginica"
#>   .. ..$ description: chr "Iris species."

Let’s add iris as a resource to your Data Package again, but this time with the customized schema. Note that you have to remove the originally added resource iris with remove_resource() first, since Data Packages can only contain uniquely named resources:

my_package <-
  my_package %>%
  remove_resource("iris") %>% # Remove originally added resource
  add_resource(
    resource_name = "iris",
    data = iris,
    schema = iris_schema # Your customized schema
  )

If you already have your data stored as CSV files and you want to include them as is as a Data Resource, you can do so as well. As with data frames, you can opt to create a Table Schema automatically or provide your own.

# Two CSV files with the same structure
path_1 <- system.file("extdata", "observations_1.csv", package = "frictionless")
path_2 <- system.file("extdata", "observations_2.csv", package = "frictionless")

# Add both CSV files as a single resource
my_package <-
  my_package %>%
  add_resource(resource_name = "observations", data = c(path_1, path_2))

Your Data Package now contains 2 resources, but you can add metadata properties as well (see the Data Package documentation for an overview). Since it is a list, you can use append() to insert properties at the desired place. Let’s add name as first and title as second property:

my_package <- append(my_package, c(name = "my_package"), after = 0)
# Or with purrr::prepend(): prepend(my_package, c(name = "my_package))
my_package <- append(my_package, c(title = "My package"), after = 1)

Note that in the above steps you started a Data Package from scratch with create_package(), but you can use the same functionality to edit an existing Data Package read with read_package().

Write a Data Package

Now that you have created your Data Package, you can write it to a directory of your choice with write_package():

write_package(my_package, "my_directory")

The directory will contain four files: the descriptor datapackage.json, one CSV file containing the data for the resource iris and two CSV files containing the data for the resource observations.

list.files("my_directory")
#> [1] "datapackage.json"   "iris.csv"           "observations_1.csv"
#> [4] "observations_2.csv"

The frictionless pkg does not provide functionality to deposit your Data Package on a research repository such as Zenodo, but here are some tips:

  1. Validate your Data Package before depositing. You can do this in Python with the Frictionless Framework using frictionless validate datapackage.json.
  2. Zip the individual CSV files (and update their paths in datapackage.json) to reduce size, not the entire Data Package. That way, users still have direct access to the datapackage.json file. See this example.
  3. Only describe the technical aspects of your dataset in datapackage.json (fields, units, the dataset identifier in id). Authors, methodology, license, etc. are better described in the metadata fields the research repository provides.