Application example

Introduction

The package automatically detects the type of variables in data that have not been quality controlled. The prediction is based on a pre-trained random forest model, trained on over 5000 medical variables, with an out-of-bag (OOB) accuracy of 99%. In practice, the accuracy depends heavily on the type and coding style of the data. For example, categorical variables are often coded as integers 1 to x; if the number of categories is very large, such a variable cannot be distinguished from a continuous integer variable. Some types are by definition very sensitive to errors in the data, such as ID, missing or constant, where a single deviating non-missing value means the variable is no longer constant or missing. The data are assumed to be cross-sectional, with a unique ID (no multiple entries per ID).

The package can be used as a first step in data quality control to help sort the variables in advance and to get some information about their possible formats.
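
The sensitivity of the types ID, constant and missing to single errors can be illustrated with a minimal base R sketch. The helper functions is_id(), is_constant() and is_missing() are hypothetical and written only for this illustration; they are not part of the package and do not reflect how the random forest model decides.

```r
# Hypothetical helpers (base R only), not functions of the package
is_id       <- function(x) anyDuplicated(x) == 0           # all values unique
is_constant <- function(x) length(unique(x[!is.na(x)])) == 1
is_missing  <- function(x) all(is.na(x))

id    <- 1:100
visit <- rep(1, 100)
loct  <- rep(NA, 100)

is_id(id)            # TRUE
is_constant(visit)   # TRUE
is_missing(loct)     # TRUE

# One erroneous entry per variable is enough to change each answer
id_err    <- id;    id_err[2]    <- 1     # duplicated ID
visit_err <- visit; visit_err[2] <- 2     # second distinct value
loct_err  <- loct;  loct_err[7]  <- "x"   # one non-missing value

is_id(id_err)            # FALSE
is_constant(visit_err)   # FALSE
is_missing(loct_err)     # FALSE
```

This is why unusual missing-value codes and obvious entry errors matter more for these three types than for, say, a continuous variable.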

Example data set

The data set ‘sim_nqc_data’ contains 100 observations and 14 artificial variables, some with badly formatted or missing values. The data are completely artificial and were not used for training or validation of the random forest model.

knitr::kable(head(sim_nqc_data, 20), caption='Artificial not quality controlled data')

Artificial not quality controlled data
id	visit	sex	age	byear	decades	bmi	med	date	loct	group	crp	bnp	comms
1	1	Men	55	1966	55-64	28	0	2016/12/03	NA	4	1.78	>300
2	NULL	Women	NULL	1979	NULL	NULL	0	2016-07-06	NA	NULL	kM	kM	no material available
3	1	Men	49	1972	45-54	29	0	2015-09-16	NA	2	2.002	<1.5
4	1	Women	73	1948	65-75	25.5	1	2016-xx-xx	NA	1	1.332	3.4
5	1	Women	49	1972	45-54	24	0	2017-08-03	NA	2	<0.2	2.1	.
6	1	Men	70	9999	65-75	28	0	2018-03-30	NA	2	3.157	5.6
7	1		71	1950	65-75	32	0		NA	4	4.203	<1.5	nd
8	1	Women	55	1966	55-64	27	0	2016/12/04	NA	2	0.619	3.1
9	1	Men	46	1975	45-54	31	0	2016-06-26	NA	2	9.866	7.2
10	1	Men	32	1989	25-34	24	NA	2018-05-08	NA	3	NA	<1.5	Lab problems
11	1	Women	72	1949	65-75	29	0	2017-08-12	NA	4	0.352	<1.5
12	1	Men	40	1981	35-44	31	1	2016-11-27	NA	2	1.28	<1.5
13	1	Men	28	1993	25-34	0	0	2017-12-28	NA	3	1.073	<1.5
14	1	Men	72	1949	65-75	29	0	2015-09-18	NA	1	0.227	<1.5
17	1	Men	61	1960	55-64	27	0	2017-10-06	NA	2	5.113	5.5
18	1	Women	NA	NA	NA	26	0		NA	5	0.508	<1.5	na
19	1	Men	52	1969	45-54	26	1	2018-01-10	NA	3	0.231	1.5
20	1	Men	73	1948	65-75	29	1	2017-03-02	NA	4	0.975	<1.5
21	1	Men	54	1967	45-54	26	0	2016-05-27	NA	2	3.38	<1.5
22	1	Women	NA	9999	NULL	28	0	2017-12-11	NA	3	0.437	<1.5	no birth date
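
Before running the detection, it can help to check which unusual codes occur in which columns. The following is a minimal base R sketch on a small excerpt typed in by hand from the first six rows above. Treating kM as a candidate code is an assumption made only for illustration; whether it counts as missing is a judgment call for the analyst.

```r
# Excerpt of two columns from the first six rows shown above, kept as text
excerpt <- data.frame(
  byear = c("1966", "1979", "1972", "1948", "1972", "9999"),
  crp   = c("1.78", "kM", "2.002", "1.332", "<0.2", "3.157"),
  stringsAsFactors = FALSE
)

# Candidate missing-value codes to look for (assumed for this illustration)
candidates <- c("9999", "kM")

# Count how often each candidate code occurs in each column
counts <- sapply(excerpt, function(col) {
  sapply(candidates, function(code) sum(col == code))
})
counts

# Recode the candidates to NA in one column
byear_clean <- replace(excerpt$byear, excerpt$byear %in% candidates, NA)
```

The resulting matrix has one row per code and one column per variable, which makes it easy to see that 9999 occurs once in byear and kM once in crp within this excerpt.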

Application

An example with error-afflicted data

The application is straightforward: it requires data in data.frame format. It is important that all unusual missing-value codes in the data, e.g. the code 9999, are specified (here via miss_values). Values such as NA, NaN, Inf, NULL and spaces are automatically considered invalid (missing) values. In the result, the second column, type, is the estimated type of the variable, and the column probability indicates how certain this estimate is. The column format gives additional information about the possible format of the variable, which is especially useful for date variables. The column class is simply a translation of type into broader categories.

tab <- vtype(sim_nqc_data, miss_values='9999')
knitr::kable(tab, caption='Application example of vtype')

Application example of vtype
variable	type	probability	format	class	alternative	n	missings
id	ID	0.906		supportive	continuous (4.8%)	100	0
visit	constant	0.989	1	uninformative	–	98	2
sex	binary	0.961	men/women	qualitative	text (3.2%)	98	2
age	continuous	1.000	integer	quantitative	–	92	8
byear	date	0.974	%Y	supportive	continuous (2.6%)	93	7
decades	categorical	0.565	labels	qualitative	date (39.8%)	92	8
bmi	continuous	0.999	integer	quantitative	–	92	8
med	binary	0.999	0/1	qualitative	–	95	5
date	date	1.000	%Y-%m-%d	supportive	–	97	3
loct	missing	1.000		uninformative	–	0	100
group	categorical	0.978	1-7	qualitative	continuous (2.1%)	99	1
crp	continuous	1.000	floating	quantitative	–	96	4
bnp	continuous	0.997	floating	quantitative	–	95	5
comms	text	0.952		supportive	categorical (4.3%)	10	90
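
Since the probability column expresses how certain each prediction is, one possible follow-up is to list the variables that deserve a manual check. The sketch below is only an illustration: it copies five rows from the table above into a data frame by hand instead of calling vtype, and the threshold of 0.9 is an arbitrary assumption, not a recommendation of the package.

```r
# Five rows copied by hand from the result table above
res <- data.frame(
  variable    = c("id", "byear", "decades", "group", "comms"),
  type        = c("ID", "date", "categorical", "categorical", "text"),
  probability = c(0.906, 0.974, 0.565, 0.978, 0.952),
  alternative = c("continuous (4.8%)", "continuous (2.6%)", "date (39.8%)",
                  "continuous (2.1%)", "categorical (4.3%)"),
  stringsAsFactors = FALSE
)

threshold <- 0.9  # assumed cut-off for "needs manual review"

# Keep the uncertain predictions, least certain first
review <- res[res$probability < threshold, ]
review <- review[order(review$probability), ]
review
```

With these values only decades falls below the cut-off, and its alternative (date, 39.8%) shows which competing type to consider.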

An example with very small sample size

A very small sample size can reduce the prediction performance significantly. With only the first 10 observations, the id variable is now detected as continuous (integer format), age as categorical and decades as a date variable.

knitr::kable(vtype(sim_nqc_data[1:10,]), caption='Application example with small sample size')

Application example with small sample size
variable	type	probability	format	class	alternative	n	missings
id	continuous	0.517	integer	quantitative	date (16.6%)	10	0
visit	constant	0.962	1	uninformative	continuous (3.8%)	9	1
sex	binary	0.754	men/women	qualitative	text (21.8%)	9	1
age	categorical	0.441	32-73	qualitative	continuous (42.3%)	9	1
byear	date	0.444	?	supportive	continuous (32.4%)	10	0
decades	date	0.493	?	supportive	categorical (40.9%)	9	1
bmi	continuous	0.595	integer	quantitative	categorical (27.8%)	9	1
med	binary	0.965	0/1	qualitative	continuous (3.5%)	9	1
date	date	0.999	%Y-%m-%d	supportive	–	9	1
loct	missing	1.000		uninformative	–	0	10
group	categorical	0.909	1-4	qualitative	continuous (7.2%)	9	1
crp	continuous	0.640	floating	quantitative	text (17%)	9	1
bnp	continuous	0.479	floating	quantitative	text (31.3%)	10	0
comms	text	0.999		supportive	–	4	6
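
To see at a glance which predictions changed with the smaller sample, the two results can be compared by variable. The sketch below copies the type columns of both tables by hand, so it runs without the package; with the package loaded, the same comparison could be applied to the two data frames returned by vtype.

```r
variables <- c("id", "visit", "sex", "age", "byear", "decades", "bmi",
               "med", "date", "loct", "group", "crp", "bnp", "comms")

# Predicted types on all 100 observations (copied from the first result table)
type_full <- c("ID", "constant", "binary", "continuous", "date", "categorical",
               "continuous", "binary", "date", "missing", "categorical",
               "continuous", "continuous", "text")

# Predicted types on the first 10 observations (copied from the table above)
type_small <- c("continuous", "constant", "binary", "categorical", "date",
                "date", "continuous", "binary", "date", "missing",
                "categorical", "continuous", "continuous", "text")

# Variables whose predicted type differs between the two runs
changed <- variables[type_full != type_small]
changed  # "id" "age" "decades"

data.frame(variable = changed,
           full     = type_full[type_full != type_small],
           small    = type_small[type_full != type_small])
```

Note that an unchanged type does not imply an unchanged certainty: byear, for example, is still predicted as date, but its probability drops from 0.974 to 0.444.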

An example on data without errors

knitr::kable(vtype(mtcars), caption='Application example on data without errors')

Application example on data without errors
variable	type	probability	format	class	alternative	n
mpg	continuous	1.000	floating	quantitative	–	32
cyl	categorical	0.966	4-8	qualitative	continuous (3%)	32
disp	continuous	0.999	floating	quantitative	–	32
hp	continuous	1.000	integer	quantitative	–	32
drat	continuous	1.000	floating	quantitative	–	32
wt	continuous	0.998	floating	quantitative	–	32
qsec	continuous	0.999	floating	quantitative	–	32
vs	binary	0.974	0/1	qualitative	continuous (2.6%)	32
am	binary	0.974	0/1	qualitative	continuous (2.6%)	32
gear	categorical	0.966	3-5	qualitative	continuous (3%)	32
carb	categorical	0.967	1-8	qualitative	continuous (3.1%)	32
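
One way to act on such a result is to convert the variables predicted as qualitative into factors. The following base R sketch hard-codes the predicted types from the table above in a named vector (pred is a hypothetical helper object, not output of the package) and applies them to a copy of mtcars.

```r
# Predicted types copied by hand from the table above
pred <- c(mpg = "continuous", cyl = "categorical", disp = "continuous",
          hp = "continuous", drat = "continuous", wt = "continuous",
          qsec = "continuous", vs = "binary", am = "binary",
          gear = "categorical", carb = "categorical")

cars <- mtcars  # work on a copy, leave the original data set untouched

# Names of the variables predicted as binary or categorical
qual <- names(pred)[pred %in% c("binary", "categorical")]

# Convert these columns to factors; all other columns stay numeric
cars[qual] <- lapply(cars[qual], factor)

str(cars[qual])
```

Whether a variable such as carb is better modelled as a factor or as a count remains a subject-matter decision; the predicted type is only a starting point.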