Tplyr Metadata

Tplyr has a bit of a unique design, which might feel a bit weird as you get used to the package. The process flow of building a tplyr_table() object first, and then using build() to construct the data frame is different than programming in the tidyverse, or creating a ggplot. Why create the tplyr_table() object first? Why is the tplyr_table() object different than the resulting data frame?

The purpose of the tplyr_table() object is to let Tplyr do more than just summarize data. As you build the table, all of the metadata around the table being built is maintained - the target variables being summarized, the grouped variables by row and column, the filter conditions necessary applied to the table and each layer. As a user, you provide this information to create the summary. But what about after the results are produced? Summarizing data inevitably leads to new questions. Within clinical summaries, you may want to know which subjects experienced an adverse event, or why the lab summaries of a particular visit’s descriptive statistics are abnormal. Normally, you’d write a query to recreate the data that lead to that particular summary. Tplyr now allows you to immediately extract the input data or metadata that created an output result, thus providing traceability from the result back to the source.

Generating the Metadata

Consider the following example:

t <- tplyr_table(tplyr_adsl, TRT01P, where = SAFFL == "Y") %>% 
  add_layer(
    group_count(RACE)
  ) %>% 
  add_layer(
    group_desc(AGE, where = EFFFL == "Y")
  )

dat <- t %>% build(metadata=TRUE)

kable(dat)

row_id	row_label1	var1_Placebo	var1_Xanomeline High Dose	var1_Xanomeline Low Dose	ord_layer_index	ord_layer_1
c1_1	AMERICAN INDIAN OR ALASKA NATIVE	0 ( 0.0%)	1 ( 1.2%)	0 ( 0.0%)	1	1
c2_1	BLACK OR AFRICAN AMERICAN	8 ( 9.3%)	9 ( 10.7%)	6 ( 7.1%)	1	2
c3_1	WHITE	78 ( 90.7%)	74 ( 88.1%)	78 ( 92.9%)	1	3
d1_2	n	79	74	81	2	1
d2_2	Mean (SD)	75.9 ( 8.43)	75.4 ( 7.87)	77.8 ( 8.02)	2	2
d3_2	Median	76.0	75.5	78.0	2	3
d4_2	Q1, Q3	69.5, 81.0	70.2, 79.0	71.0, 82.0	2	4
d5_2	Min, Max	52, 88	56, 88	51, 88	2	5
d6_2	Missing	0	0	0	2	6

To trigger the creation of metadata, the build() function has a new argument metadata. By specifying TRUE, the underlying metadata within Tplyr are prepared in an extractable format. This is the only action a user needs to specify for this action to take place.

When the metadata argument is used, a new column will be produced in the output dataframe called row_id. The row_id variable provides a persistent reference to a row of interest, even if the output dataframe is sorted. If you review vignette("styled-table"), note that we expect a certain amount of post processing and styling of the built data frame from Tplyr, to let you use whatever other packages you prefer. As such, this reference ID is necessary.

Extracting The Input Source

So, let’s cut to the chase. The most likely way you would use this metadata is to pull out the source data that created a cell. For this, we’ve provided the function get_meta_subset(). The only information that you need is the row_id and column name of the result cell of interest. For example, looking at the result above, what if we want to know who the 8 subjects in the Placebo group who where Black or African American:

get_meta_subset(t, 'c2_1', 'var1_Placebo') %>% 
  kable()

USUBJID	TRT01P	SAFFL	RACE
01-701-1203	Placebo	Y	BLACK OR AFRICAN AMERICAN
01-701-1363	Placebo	Y	BLACK OR AFRICAN AMERICAN
01-705-1282	Placebo	Y	BLACK OR AFRICAN AMERICAN
01-706-1041	Placebo	Y	BLACK OR AFRICAN AMERICAN
01-708-1286	Placebo	Y	BLACK OR AFRICAN AMERICAN
01-708-1296	Placebo	Y	BLACK OR AFRICAN AMERICAN
01-708-1378	Placebo	Y	BLACK OR AFRICAN AMERICAN
01-711-1036	Placebo	Y	BLACK OR AFRICAN AMERICAN

By using the row_id and column, the dataframe is pulled right out for us. Notice that USUBJID was included by default, even though Tplyr there’s no reference anywhere in the tplyr_table() to the variable USUBJID. This is because get_meta_subset() has an additional argument add_cols that allows you to specify additional columns you want included in the resulting dataframe, and has a default of USUBJID. So let’s say we want additionally include the variable SEX.

get_meta_subset(t, 'c2_1', 'var1_Placebo', add_cols = vars(USUBJID, SEX)) %>% 
  kable()

USUBJID	SEX	TRT01P	SAFFL	RACE
01-701-1203	F	Placebo	Y	BLACK OR AFRICAN AMERICAN
01-701-1363	F	Placebo	Y	BLACK OR AFRICAN AMERICAN
01-705-1282	F	Placebo	Y	BLACK OR AFRICAN AMERICAN
01-706-1041	F	Placebo	Y	BLACK OR AFRICAN AMERICAN
01-708-1286	F	Placebo	Y	BLACK OR AFRICAN AMERICAN
01-708-1296	M	Placebo	Y	BLACK OR AFRICAN AMERICAN
01-708-1378	M	Placebo	Y	BLACK OR AFRICAN AMERICAN
01-711-1036	M	Placebo	Y	BLACK OR AFRICAN AMERICAN

Variables should be provided using dplyr::vars(), just like the cols argument on tplyr_table() and the by arguments in each layer type.

As mentioned, the input source data can be extracted for any result cell created by Tplyr. So let’s say we want to know the subjects relevant for the descriptive statistics around age in the Xanomeline High Dose group:

get_meta_subset(t, 'd1_2', 'var1_Xanomeline High Dose') %>% 
  head(10) %>% 
  kable()

USUBJID	TRT01P	EFFFL	SAFFL	AGE
01-701-1028	Xanomeline High Dose	Y	Y	71
01-701-1034	Xanomeline High Dose	Y	Y	77
01-701-1133	Xanomeline High Dose	Y	Y	81
01-701-1146	Xanomeline High Dose	Y	Y	75
01-701-1148	Xanomeline High Dose	Y	Y	57
01-701-1180	Xanomeline High Dose	Y	Y	56
01-701-1181	Xanomeline High Dose	Y	Y	79
01-701-1239	Xanomeline High Dose	Y	Y	56
01-701-1275	Xanomeline High Dose	Y	Y	61
01-701-1287	Xanomeline High Dose	Y	Y	56

Note: Trimmed for space

Notice how the columns returned are different. First off, within the summary above, we pulled results from the descriptive statistics layer. The target variable for this layer was AGE, and as such AGE is returned in the resulting output. Additionally, a layer level where argument was used to subset to EFFFL == "Y", which leads to EFFFL being included in the output as well.

Extracting a Result Cell’s Metadata

To extract the dataframe in get_meta_subset(), the metadata of the result cell needs to first be extracted. This metadata can be directly accessed using the function get_meta_result(). Using the last example of get_meta_subset() above:

get_meta_result(t, 'd1_2', 'var1_Xanomeline High Dose')
#> tplyr_meta: 4 names, 3 filters
#> Names:
#>      TRT01P, EFFFL, SAFFL, AGE 
#> Filters:
#>      TRT01P == c("Xanomeline High Dose"), EFFFL == "Y", SAFFL == "Y"

The resulting output is a new object Tplyr called tplyr_meta(). This is a container of a relevent metadata for a specific result. The object itself is a list with two elements: names and filters.

The names element contains quosures for each variable relevant to a specific result. This will include the target variable, the by variables used on the layer, the cols variables used on the table, and all variables included in any filter condition relevant to create the result.

The filters element contains each filter condition (provided as calls) necessary to create a particular cell. This will include the table level where argument, the layer level where argument, the filter condition for the specific value of any by variable or cols variable necessary to create the cell, and similarly the filter for the treatment group of interest.

The results are provided this was so that they can be unpacked directly into dplyr syntax when necessary, which is exactly what happens in get_meta_subset(). For example:

m <- get_meta_result(t, 'd1_2', 'var1_Xanomeline High Dose')

tplyr_adsl %>% 
  filter(!!!m$filters) %>% 
  select(!!!m$names) %>% 
  head(10) %>% 
  kable()

TRT01P	EFFFL	SAFFL	AGE
Xanomeline High Dose	Y	Y	71
Xanomeline High Dose	Y	Y	77
Xanomeline High Dose	Y	Y	81
Xanomeline High Dose	Y	Y	75
Xanomeline High Dose	Y	Y	57
Xanomeline High Dose	Y	Y	56
Xanomeline High Dose	Y	Y	79
Xanomeline High Dose	Y	Y	56
Xanomeline High Dose	Y	Y	61
Xanomeline High Dose	Y	Y	56

Note: Trimmed for space

But - who says you can’t let your imagination run wild?

cat(c("tplyr_adsl %>%\n",
  "   filter(\n      ",
  paste(purrr::map_chr(m$filters, ~ rlang::as_label(.)), collpase=",\n      "),
  ") %>%\n",
  paste("   select(", paste(purrr::map_chr(m$names, rlang::as_label), collapse=", "), ")", sep="")
))

Anti Joins

Most data presented within a table refers back to the target dataset from which data are being summarized. In some cases, data presented may refer to information excluded from the summary. This is the case when you use the Tplyr function add_missing_subjects_row(). In this case, the counts presented refer to data excluded from the target which are present in the population data. The metadata thus needs to refer to that excluded data. To handle this, there’s an additional field called an ‘Anti Join’. Consider this example:

t <- tplyr_table(tplyr_adae, TRTA) %>%
  set_pop_data(tplyr_adsl) %>%
  set_pop_treat_var(TRT01A) %>%
  add_layer(
    group_count(vars(AEBODSYS, AEDECOD)) %>%
      set_distinct_by(USUBJID) %>%
      add_missing_subjects_row(f_str("xx (XX.x%)", distinct_n, distinct_pct), sort_value = Inf)
  )

x <- build(t, metadata=TRUE)

tail(x) %>% 
  select(starts_with('row'), var1_Placebo) %>% 
  kable()

row_id	row_label1	row_label2	var1_Placebo
c18_1	SKIN AND SUBCUTANEOUS TISSUE DISORDERS	SKIN EXFOLIATION	0 ( 0.0%)
c19_1	SKIN AND SUBCUTANEOUS TISSUE DISORDERS	SKIN IRRITATION	3 ( 3.5%)
c20_1	SKIN AND SUBCUTANEOUS TISSUE DISORDERS	SKIN ODOUR ABNORMAL	0 ( 0.0%)
c21_1	SKIN AND SUBCUTANEOUS TISSUE DISORDERS	SKIN ULCER	1 ( 1.2%)
c22_1	SKIN AND SUBCUTANEOUS TISSUE DISORDERS	URTICARIA	0 ( 0.0%)
c23_1	SKIN AND SUBCUTANEOUS TISSUE DISORDERS	Missing	65 ( 75.6%)

The missing row in this example counts the subjects within their respective treatment groups who do not have any adverse events for the body system “SKIN AND SUBCUTANEOUS TISSUE DISORDERS”. Here’s what the metadata for the result for the Placebo treatment group looks like.

m <- get_meta_result(t, 'c23_1', 'var1_Placebo')
m
#> tplyr_meta: 3 names, 4 filters
#> Names:
#>      TRTA, AEBODSYS, AEDECOD 
#> Filters:
#>      TRTA == c("Placebo"), AEBODSYS == c("SKIN AND SUBCUTANEOUS TISSUE DISORDERS"), TRUE, TRUE 
#> Anti-join:
#>     Join Meta:
#>         tplyr_meta: 1 names, 3 filters
#>         Names:
#>              TRT01A 
#>         Filters:
#>              TRT01A == c("Placebo"), TRUE, TRUE 
#>     On:
#>         USUBJID

This result has the addition field of ‘Anti-join’. This element has two fields, which are the join metadata, and the “on” field, which specifies a merging variable to be used when “anti-joining” with the target data. The join metadata here refers to the data of interest from the population data. Note that while the metadata for the target data has variable names and filter conditions referring to AEBODSYS and AEDECOD, these variables are not present within the join metadata, because that information is not present within the population data.

While the usual joins we work with focus on the overlap between two sets, an anti-join looks at the non-overlap. The metadata provided here will specifically give us “The subjects within the Placebo treatment group who do not have an adverse event within the body system ‘SKIN AND SUBCUTANEOUS TISSUE DISORDERS’”.

Extracting this metadata works very much the same way as extracting other results.

head(get_meta_subset(t, 'c23_1', 'var1_Placebo'))
#> # A tibble: 6 × 2
#>   USUBJID     TRT01A 
#>   <chr>       <chr>  
#> 1 01-701-1015 Placebo
#> 2 01-701-1047 Placebo
#> 3 01-701-1118 Placebo
#> 4 01-701-1153 Placebo
#> 5 01-701-1203 Placebo
#> 6 01-701-1234 Placebo

If you’re not working with the tplyr_table object, then there’s some additional information you need to provide to the function.

head(get_meta_subset(t$metadata, 'c23_1', 'var1_Placebo', 
                     target=t$target, pop_data=t$pop_data))
#> # A tibble: 6 × 2
#>   USUBJID     TRT01A 
#>   <chr>       <chr>  
#> 1 01-701-1015 Placebo
#> 2 01-701-1047 Placebo
#> 3 01-701-1118 Placebo
#> 4 01-701-1153 Placebo
#> 5 01-701-1203 Placebo
#> 6 01-701-1234 Placebo

tplyr_adsl %>%
    filter(
       TRTA == c("Placebo") ,
       AEBODSYS == c("SKIN AND SUBCUTANEOUS TISSUE DISORDERS") ,
       TRUE ,
       TRUE ,
       ) %>%
    select(TRTA, AEBODSYS, AEDECOD)

Tplyr Metadata

Generating the Metadata

Extracting The Input Source

Extracting a Result Cell’s Metadata

Anti Joins

So, What Does This Get Me?