qualitycontrol

The goal of qualitycontrol is to provide a framework for data quality control.

Installation

You can install qualitycontrol from GitHub with:

# install.packages("devtools")
devtools::install_github("luisgarcez11/qualitycontrol")

Data

The als_data dataset is used to guide you through the package functionality. The data are not real, but are based on data collected from Amyotrophic Lateral Sclerosis patients.

library(qualitycontrol)
als_data
##    subjid p1 p2 p3 p4 p5 p6 p7 p8 p9 x1r x2r x3r age_at_baseline age_at_onset
## 1       1  4  1  1  3  4  3  4  3  4   2   2   1              51           46
## 2       2  4  4  4  1  1  3  3  1  4   1   2   4              82           77
## 3       3  2  3  1  4  3  1  3  1  1   4   3   1              85           80
## 4       4  3  2  1  1  4  1  3  2  4   4   3   3              77           72
## 5       5  3  2  1  3  3  4  4  3  4   1   4   2              85           80
## 6       6  2  2  1  4  1  4  4  3  1   3   5   2              73           68
## 7       7  1  4  2  4  3  3  2  3  4   1   2   2              65           60
## 8       8  2  2  4  4  3  2  1  2  3   3   1   1              50           62
## 9       9  3  1  1  4  4  2  4  1  1   2   2   4              65           46
## 10     10  3  4  1  4  3  2  3  2  1   4   3   1              81           76
## 11     11  1  3  1  3  3  4  1 NA  3   3   2   4              51           46
## 12     12  1  4  3  2  3  2  2 NA  1   3   2   3              50           45
## 13     13  1  1  4  1  1  3  4 NA  2   2   3   1              82           77
## 14     14  3  2  2  4  3  3  3  3  2   3   4   1              76           71
## 15     15  3  4  2  2  2  3  1  3  4   4   1   4              87          376
## 16     16  3  3  2  4  3  3  1  1  2   2   4   1              50           45
## 17     17  3  2  3  1  4  1  3  2  1   4   4   2              85           80
## 18     18  4  1  3  1  3  1  3  2  2   4   3   4              57           52
## 19     19  1  3  3  2  2  2  3  2  3   2   3   2              74           69
## 20     20  2  2  4  2  3  4  2  4  1   4   1   3              59           54
## 21     21  2  3  3  2  3  2  4  4  1   1   3   3              79           74
## 22     22  4  3  1  1  3  4  2  1  4   1   2   3              53           48
## 23     23  3  3  4  3  4  1  3  4  3   2   2   2              45           40
## 24     24  4  1  1  2  4  2  4  4  4   4   2   1              72           67
## 25     25  4  3  1  3  3  4  3  2  3   3   4   2              77           72
## 26     26  2  1  1  2  4  2  4  1  2   3   2   4              65           60
## 27     27  1  1  1  1  1  1  3  3  2   2   1   1              54           49
## 28     28  3  1  1  3  1  4  1  2  2   2   3   4              50          -23
## 29     29  2  3  1  3  1  4  4  1  3   2   4   1              85           80
## 30     30  3  1  2  1  3  1  2  4  1   1   2   4              85           80
## 31     30  3  3  1  4  2  2  1  4  3   3   1   3              53           48
##          onset baseline_date death_date
## 1       bulbar    2003-03-26 2010-10-18
## 2        bulba    2003-07-03 2019-06-24
## 3       spinal    2007-01-27 9999-12-30
## 4       bulbar    2010-11-27 2018-01-04
## 5       bulbar    2006-10-25 2017-10-13
## 6       spinal    2007-04-30 2010-05-08
## 7       spinal    2002-11-15 2019-04-06
## 8       spinal    2002-12-13 2018-05-04
## 9       spinal    2005-06-02 2013-08-11
## 10      bulbar    2004-06-02 2016-05-20
## 11      bulbar    2007-03-09 2016-09-26
## 12      bulbar    2005-01-11 2010-06-20
## 13      bulbar    2010-12-22 2019-07-05
## 14      bulbar    2008-10-14 2013-08-14
## 15      spinal    2005-09-15 2010-07-20
## 16      spinal    2007-07-05 2010-08-28
## 17 respiratory    2002-08-19 2011-10-17
## 18      spinal    2002-06-30 2020-12-17
## 19 respiratory    2010-07-18 2016-05-15
## 20      spinal    2004-08-15 2015-03-15
## 21      bulbar    2006-04-07 2013-03-16
## 22      bulbar    2002-06-01 2016-06-21
## 23      bulbar    2007-08-12 2017-04-01
## 24      bulbar    2006-08-12 2002-12-02
## 25 respiratory    2006-08-11 2016-03-03
## 26      spinal    2005-01-04 2011-10-05
## 27 respiratory    2009-08-25 2015-03-11
## 28      bulbar    2002-05-11 2017-11-09
## 29      bulbar    2004-07-27 2014-03-27
## 30      bulbar    2005-11-11 2015-05-30
## 31      bulbar    2008-02-27 2014-07-05

QC mapping

als_data_qc_mapping is an R list containing three tables that specify the tests used for quality control. You can specify your own tests by creating an Excel file and reading it with the function read_qc_mapping.

Missing

als_data_qc_mapping$missing
## # A tibble: 13 × 3
##    qc_type    variable type   
##    <chr>      <chr>    <chr>  
##  1 duplicated subjid   text   
##  2 missing    p1       numeric
##  3 missing    p2       numeric
##  4 missing    p3       numeric
##  5 missing    p4       numeric
##  6 missing    p5       numeric
##  7 missing    p6       numeric
##  8 missing    p7       numeric
##  9 missing    p8       numeric
## 10 missing    p9       numeric
## 11 missing    x1r      numeric
## 12 missing    x2r      numeric
## 13 missing    x3r      numeric
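
To make the two test types in this table concrete, here is a minimal base R sketch of what a "duplicated" and a "missing" check amount to. It is not the package's implementation. The toy data frame and the names missing_rows and dup_rows are made up for illustration.

```r
# Toy data with the same kind of problems as als_data (values are illustrative only)
toy <- data.frame(
  subjid = c(1, 2, 3, 3),
  p8     = c(3, NA, 2, 4)
)

# "missing" test: rows where p8 has no value
missing_rows <- toy[is.na(toy$p8), ]

# "duplicated" test: rows whose subjid occurs more than once.
# Both directions are combined so that every copy is flagged, not only the repeats.
dup_rows <- toy[duplicated(toy$subjid) | duplicated(toy$subjid, fromLast = TRUE), ]

print(missing_rows)
print(dup_rows)
```

Flagging every copy of a duplicated identifier matches the qc_data output shown later, where both rows with subjid 30 are reported.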

Inconsistencies

als_data_qc_mapping$inconsistencies
## # A tibble: 2 × 6
##   qc_type             variable1       type1   relation     variable2    type2  
##   <chr>               <chr>           <chr>   <chr>        <chr>        <chr>  
## 1 inconsistent_values age_at_baseline numeric greater_than age_at_onset numeric
## 2 inconsistent_values baseline_date   date    lower_than   death_date   date
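
Each row of this table states the relation that is expected to hold between two variables. Rows of the data that violate it are reported. The base R sketch below illustrates the idea on toy data. It is not the package's implementation, and it assumes that greater_than and lower_than are strict comparisons, which the output does not confirm.

```r
# Toy data (illustrative values only)
toy <- data.frame(
  subjid          = c(1, 2, 3),
  age_at_baseline = c(51, 50, 72),
  age_at_onset    = c(46, 62, 67),
  baseline_date   = as.Date(c("2003-03-26", "2002-12-13", "2006-08-12")),
  death_date      = as.Date(c("2010-10-18", "2018-05-04", "2002-12-02"))
)

# Expected relation: age_at_baseline greater_than age_at_onset.
# which() drops comparisons that evaluate to NA.
age_violations <- toy[which(!(toy$age_at_baseline > toy$age_at_onset)), ]

# Expected relation: baseline_date lower_than death_date
date_violations <- toy[which(!(toy$baseline_date < toy$death_date)), ]

print(age_violations$subjid)
print(date_violations$subjid)
```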

Out of range values

als_data_qc_mapping$range
## # A tibble: 16 × 6
##    qc_type variable        type        lower_value upper_value categories       
##    <chr>   <chr>           <chr>       <chr>       <chr>       <chr>            
##  1 range   p1              numeric     1           4           <NA>             
##  2 range   p2              numeric     1           4           <NA>             
##  3 range   p3              numeric     1           4           <NA>             
##  4 range   p4              numeric     1           4           <NA>             
##  5 range   p5              numeric     1           4           <NA>             
##  6 range   p6              numeric     1           4           <NA>             
##  7 range   p7              numeric     1           4           <NA>             
##  8 range   p8              numeric     1           4           <NA>             
##  9 range   p9              numeric     1           4           <NA>             
## 10 range   x1r             numeric     1           4           <NA>             
## 11 range   x2r             numeric     1           4           <NA>             
## 12 range   x3r             numeric     1           4           <NA>             
## 13 range   age_at_baseline numeric     20          100         <NA>             
## 14 range   age_at_onset    numeric     20          100         <NA>             
## 15 range   death_date      date        2000-01-01  2022-01-01  <NA>             
## 16 range   onset           categorical <NA>        <NA>        bulbar, respirat…
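
The range tests can be pictured as follows: numeric and date variables are compared against lower_value and upper_value, and categorical variables against the list in categories. The sketch below is a base R illustration on toy data, not the package's implementation. The helper out_of_range is hypothetical, the bounds are treated as inclusive, and because the category list is truncated in the printed table, the three categories used here are an assumption based on the values seen in als_data.

```r
# Toy data (illustrative values only)
toy <- data.frame(
  subjid       = c(1, 2, 3, 4),
  x2r          = c(2, 5, 3, 1),
  age_at_onset = c(46, 77, 376, -23),
  onset        = c("bulbar", "bulba", "spinal", "respiratory")
)

# Hypothetical helper: TRUE where a non-missing value lies outside [lower, upper]
out_of_range <- function(x, lower, upper) {
  !is.na(x) & (x < lower | x > upper)
}

# Numeric ranges taken from the mapping table above
x2r_bad <- toy$subjid[out_of_range(toy$x2r, 1, 4)]
age_bad <- toy$subjid[out_of_range(toy$age_at_onset, 20, 100)]

# Categorical check; this category set is an assumption (see the note above)
allowed   <- c("bulbar", "respiratory", "spinal")
onset_bad <- toy$subjid[!is.na(toy$onset) & !(toy$onset %in% allowed)]

print(x2r_bad)
print(age_bad)
print(onset_bad)
```

Missing values are deliberately not counted as out of range here, since the mapping has a separate "missing" test type.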

qc_data function

qc_data takes as arguments the data to be quality controlled and the QC mapping containing the tests to be applied.

qc_data(als_data, als_data_qc_mapping)[,c("subjid","age_at_onset","onset","baseline_date","death_date","finding")]
## # A tibble: 13 × 6
##    subjid age_at_onset onset  baseline_date death_date finding                  
##    <chr>  <chr>        <chr>  <chr>         <chr>      <chr>                    
##  1 30     80           bulbar 2005-11-11    2015-05-30 subjid variable is dupli…
##  2 30     48           bulbar 2008-02-27    2014-07-05 subjid variable is dupli…
##  3 11     46           bulbar 2007-03-09    2016-09-26 variable p8 is missing   
##  4 12     45           bulbar 2005-01-11    2010-06-20 variable p8 is missing   
##  5 13     77           bulbar 2010-12-22    2019-07-05 variable p8 is missing   
##  6 6      68           spinal 2007-04-30    2010-05-08 variable x2r is out of r…
##  7 15     376          spinal 2005-09-15    2010-07-20 variable age_at_onset is…
##  8 28     -23          bulbar 2002-05-11    2017-11-09 variable age_at_onset is…
##  9 3      80           spinal 2007-01-27    9999-12-30 variable death_date is o…
## 10 2      77           bulba  2003-07-03    2019-06-24 variable onset is not a …
## 11 8      62           spinal 2002-12-13    2018-05-04 variables age_at_baselin…
## 12 15     376          spinal 2005-09-15    2010-07-20 variables age_at_baselin…
## 13 24     67           bulbar 2006-08-12    2002-12-02 variables baseline_date …

This returns a table with all the findings. To save it, pass the destination path in the output_file argument.
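
A short usage sketch: store the findings in an object so they can be summarised, and optionally write them to disk. The file name below is hypothetical, and the .xlsx extension is an assumption, since the expected output format is not stated here.

```r
library(qualitycontrol)

# Keep the full findings table for further inspection
findings <- qc_data(als_data, als_data_qc_mapping)

# Number of findings per subject
table(findings$subjid)

# Save the findings; the file name and extension are assumptions
qc_data(als_data, als_data_qc_mapping, output_file = "als_qc_findings.xlsx")
```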