Application example

Andreas Schulz

Introduction

The purpose of the package is to detect variable types automatically in data that has not been quality controlled. The prediction is based on a pre-trained random forest model, trained on more than 5000 medical variables, with an out-of-bag (OOB) accuracy of 99%. In practice, the accuracy depends heavily on the type and coding style of the data. For example, categorical variables are often coded as integers 1 to x; if the number of categories is very large, such a variable cannot be distinguished from a continuous integer variable. Some types are by definition very sensitive to errors in the data, such as ID, missing, or constant: a single deviating non-missing value means the variable is no longer constant or no longer missing. The data is assumed to be cross-sectional, with a unique ID (no multiple entries per ID).
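
The sensitivity of the constant and ID types can be illustrated with plain base R, independent of the package. The helpers `is_constant` and `is_unique_id` below are hypothetical and only mirror the definitions given in the text; they are not how the random forest model makes its prediction.

```r
# Hypothetical helpers that mirror the definitions in the text.
# They are NOT part of the package or of its model.
is_constant  <- function(x) length(unique(x[!is.na(x)])) == 1
is_unique_id <- function(x) !anyNA(x) && anyDuplicated(x) == 0

visit_clean <- rep(1, 100)
visit_dirty <- replace(visit_clean, 42, 2)  # one stray value

is_constant(visit_clean)  # TRUE
is_constant(visit_dirty)  # FALSE

id_clean <- 1:100
id_dirty <- replace(id_clean, 10, 9L)       # the value 9 now occurs twice

is_unique_id(id_clean)    # TRUE
is_unique_id(id_dirty)    # FALSE
```

A single erroneous entry is enough to flip a strict rule, which is why these types are the most fragile ones in uncontrolled data.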

It can be used as a first step in data quality control to pre-sort the variables and obtain some information about their possible formats.

Example data set

The data set ‘sim_nqc_data’ contains 100 observations and 14 artificial variables with some poorly formatted or missing values. The data is completely artificial and was not used for training or validation of the random forest model.

knitr::kable(head(sim_nqc_data, 20), caption='Artificial not quality controlled data')
Artificial not quality controlled data
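| id | visit | sex | age | byear | decades | bmi | med | date | loct | group | crp | bnp | comms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | Men | 55 | 1966 | 55-64 | 28 | 0 | 2016/12/03 | NA | 4 | 1.78 | >300 | |
| 2 | NULL | Women | NULL | 1979 | NULL | NULL | 0 | 2016-07-06 | NA | NULL | kM | kM | no material available |
| 3 | 1 | Men | 49 | 1972 | 45-54 | 29 | 0 | 2015-09-16 | NA | 2 | 2.002 | <1.5 | |
| 4 | 1 | Women | 73 | 1948 | 65-75 | 25.5 | 1 | 2016-xx-xx | NA | 1 | 1.332 | 3.4 | |
| 5 | 1 | Women | 49 | 1972 | 45-54 | 24 | 0 | 2017-08-03 | NA | 2 | <0.2 | 2.1 | . |
| 6 | 1 | Men | 70 | 9999 | 65-75 | 28 | 0 | 2018-03-30 | NA | 2 | 3.157 | 5.6 | |
| 7 | 1 | | 71 | 1950 | 65-75 | 32 | 0 | | NA | 4 | 4.203 | <1.5 | nd |
| 8 | 1 | Women | 55 | 1966 | 55-64 | 27 | 0 | 2016/12/04 | NA | 2 | 0.619 | 3.1 | |
| 9 | 1 | Men | 46 | 1975 | 45-54 | 31 | 0 | 2016-06-26 | NA | 2 | 9.866 | 7.2 | |
| 10 | 1 | Men | 32 | 1989 | 25-34 | 24 | NA | 2018-05-08 | NA | 3 | NA | <1.5 | Lab problems |
| 11 | 1 | Women | 72 | 1949 | 65-75 | 29 | 0 | 2017-08-12 | NA | 4 | 0.352 | <1.5 | |
| 12 | 1 | Men | 40 | 1981 | 35-44 | 31 | 1 | 2016-11-27 | NA | 2 | 1.28 | <1.5 | |
| 13 | 1 | Men | 28 | 1993 | 25-34 | 0 | 0 | 2017-12-28 | NA | 3 | 1.073 | <1.5 | |
| 14 | 1 | Men | 72 | 1949 | 65-75 | 29 | 0 | 2015-09-18 | NA | 1 | 0.227 | <1.5 | |
| 17 | 1 | Men | 61 | 1960 | 55-64 | 27 | 0 | 2017-10-06 | NA | 2 | 5.113 | 5.5 | |
| 18 | 1 | Women | NA | NA | NA | 26 | 0 | | NA | 5 | 0.508 | <1.5 | na |
| 19 | 1 | Men | 52 | 1969 | 45-54 | 26 | 1 | 2018-01-10 | NA | 3 | 0.231 | 1.5 | |
| 20 | 1 | Men | 73 | 1948 | 65-75 | 29 | 1 | 2017-03-02 | NA | 4 | 0.975 | <1.5 | |
| 21 | 1 | Men | 54 | 1967 | 45-54 | 26 | 0 | 2016-05-27 | NA | 2 | 3.38 | <1.5 | |
| 22 | 1 | Women | NA | 9999 | NULL | 28 | 0 | 2017-12-11 | NA | 3 | 0.437 | <1.5 | no birth date |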

Application

An example on error afflicted data

The application is straightforward; it requires the data in data.frame format. It is important that all unusual missing-value codes in the data, e.g. 9999, are declared (here via miss_values). The values NA, NaN, Inf, NULL and blanks are automatically treated as invalid (missing) values. In the result, the second column, type, is the estimated type of the variable, and the column probability indicates how certain the prediction is. The column format gives additional information about the possible format of the variable, which is especially useful for date variables. The column class is simply a translation of type into broader categories.
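
What declaring a custom missing code means can be shown with a small base-R sketch. The helper `recode_missing` is hypothetical and is not the package's internal implementation; it only illustrates the effect of treating 9999 as missing, using a few byear values from the example data.

```r
# Hypothetical helper: replace user-declared missing codes with NA.
# It only illustrates the idea behind `miss_values`; the package's
# internal handling may differ.
recode_missing <- function(x, miss_values) {
  x[as.character(x) %in% miss_values] <- NA
  x
}

byear <- c(1966, 1979, 1972, 1948, 1972, 9999)  # values from the example data
byear_clean <- recode_missing(byear, miss_values = "9999")

range(byear)                      # the code 9999 distorts the range
range(byear_clean, na.rm = TRUE)  # only plausible birth years remain
```

If such a code is left undeclared, it is analysed as a regular value and can push the prediction towards the wrong type.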

tab <- vtype(sim_nqc_data, miss_values='9999')
knitr::kable(tab, caption='Application example of vtype')
Application example of vtype
| variable | type | probability | format | class | alternative | n | missings |
|---|---|---|---|---|---|---|---|
| id | ID | 0.906 | | supportive | continuous (4.8%) | 100 | 0 |
| visit | constant | 0.989 | 1 | uninformative | | 98 | 2 |
| sex | binary | 0.961 | men/women | qualitative | text (3.2%) | 98 | 2 |
| age | continuous | 1.000 | integer | quantitative | | 92 | 8 |
| byear | date | 0.974 | %Y | supportive | continuous (2.6%) | 93 | 7 |
| decades | categorical | 0.565 | labels | qualitative | date (39.8%) | 92 | 8 |
| bmi | continuous | 0.999 | integer | quantitative | | 92 | 8 |
| med | binary | 0.999 | 0/1 | qualitative | | 95 | 5 |
| date | date | 1.000 | %Y-%m-%d | supportive | | 97 | 3 |
| loct | missing | 1.000 | | uninformative | | 0 | 100 |
| group | categorical | 0.978 | 1-7 | qualitative | continuous (2.1%) | 99 | 1 |
| crp | continuous | 1.000 | floating | quantitative | | 96 | 4 |
| bnp | continuous | 0.997 | floating | quantitative | | 95 | 5 |
| comms | text | 0.952 | | supportive | categorical (4.3%) | 10 | 90 |
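
One way to use the result is to flag variables whose predicted type is uncertain. The sketch below copies four rows of the output above into a data frame so that it runs without the package; the name `tab_example` and the cutoff of 0.95 are assumptions made for illustration, not recommendations from the package.

```r
# Four rows copied from the vtype output shown above.
tab_example <- data.frame(
  variable    = c("id", "age", "decades", "date"),
  type        = c("ID", "continuous", "categorical", "date"),
  probability = c(0.906, 1.000, 0.565, 1.000),
  stringsAsFactors = FALSE
)

cutoff <- 0.95  # arbitrary threshold chosen for illustration

# Variables that deserve a manual check, least certain first
to_review <- tab_example[tab_example$probability < cutoff, ]
to_review <- to_review[order(to_review$probability), ]
to_review$variable
```

With the real output, the same filter could be applied to `tab` directly, assuming its columns are named as in the printed table.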

An example with very small sample size

A very small sample size can reduce the prediction performance significantly. With only the first 10 rows, the id variable is now detected as continuous (integer format), age as categorical, and decades as a date variable.

knitr::kable(vtype(sim_nqc_data[1:10,]), caption='Application example with small sample size')
Application example with small sample size
| variable | type | probability | format | class | alternative | n | missings |
|---|---|---|---|---|---|---|---|
| id | continuous | 0.517 | integer | quantitative | date (16.6%) | 10 | 0 |
| visit | constant | 0.962 | 1 | uninformative | continuous (3.8%) | 9 | 1 |
| sex | binary | 0.754 | men/women | qualitative | text (21.8%) | 9 | 1 |
| age | categorical | 0.441 | 32-73 | qualitative | continuous (42.3%) | 9 | 1 |
| byear | date | 0.444 | ? | supportive | continuous (32.4%) | 10 | 0 |
| decades | date | 0.493 | ? | supportive | categorical (40.9%) | 9 | 1 |
| bmi | continuous | 0.595 | integer | quantitative | categorical (27.8%) | 9 | 1 |
| med | binary | 0.965 | 0/1 | qualitative | continuous (3.5%) | 9 | 1 |
| date | date | 0.999 | %Y-%m-%d | supportive | | 9 | 1 |
| loct | missing | 1.000 | | uninformative | | 0 | 10 |
| group | categorical | 0.909 | 1-4 | qualitative | continuous (7.2%) | 9 | 1 |
| crp | continuous | 0.640 | floating | quantitative | text (17%) | 9 | 1 |
| bnp | continuous | 0.479 | floating | quantitative | text (31.3%) | 10 | 0 |
| comms | text | 0.999 | | supportive | | 4 | 6 |
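
The effect of the smaller sample can be made explicit by placing the two results side by side. The sketch below copies the type and probability of id, age and decades from the two outputs in this document into small data frames (the names `full`, `small` and `cmp` are illustrative), so it runs without the package.

```r
# Values copied from the two vtype outputs in this document.
full <- data.frame(
  variable    = c("id", "age", "decades"),
  type        = c("ID", "continuous", "categorical"),
  probability = c(0.906, 1.000, 0.565),
  stringsAsFactors = FALSE
)
small <- data.frame(
  variable    = c("id", "age", "decades"),
  type        = c("continuous", "categorical", "date"),
  probability = c(0.517, 0.441, 0.493),
  stringsAsFactors = FALSE
)

# Join on the variable name and mark predictions that changed
cmp <- merge(full, small, by = "variable", suffixes = c("_full", "_small"))
cmp$changed <- cmp$type_full != cmp$type_small
cmp
```

For all three variables the predicted type changes and the probability drops, which matches the warning about small samples.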

An example on data without errors

knitr::kable(vtype(mtcars), caption='Application example on data without errors')
Application example on data without errors
| variable | type | probability | format | class | alternative | n | missings |
|---|---|---|---|---|---|---|---|
| mpg | continuous | 1.000 | floating | quantitative | | 32 | 0 |
| cyl | categorical | 0.966 | 4-8 | qualitative | continuous (3%) | 32 | 0 |
| disp | continuous | 0.999 | floating | quantitative | | 32 | 0 |
| hp | continuous | 1.000 | integer | quantitative | | 32 | 0 |
| drat | continuous | 1.000 | floating | quantitative | | 32 | 0 |
| wt | continuous | 0.998 | floating | quantitative | | 32 | 0 |
| qsec | continuous | 0.999 | floating | quantitative | | 32 | 0 |
| vs | binary | 0.974 | 0/1 | qualitative | continuous (2.6%) | 32 | 0 |
| am | binary | 0.974 | 0/1 | qualitative | continuous (2.6%) | 32 | 0 |
| gear | categorical | 0.966 | 3-5 | qualitative | continuous (3%) | 32 | 0 |
| carb | categorical | 0.967 | 1-8 | qualitative | continuous (3.1%) | 32 | 0 |