Generate a data dictionnary and search for variables with look_for()

Joseph Larmarange

Showing a summary of a data frame

Default printing of tibbles

It is a common need to easily get a description of all variables in a data frame.

When a data frame is converted into a tibble (e.g. with dplyr::as_tibble()), it as a nice printing showing the first rows of the data frame as well as the type of column.

library(dplyr)
iris %>% as_tibble()
## # A tibble: 150 × 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          5.4         3.9          1.7         0.4 setosa 
##  7          4.6         3.4          1.4         0.3 setosa 
##  8          5           3.4          1.5         0.2 setosa 
##  9          4.4         2.9          1.4         0.2 setosa 
## 10          4.9         3.1          1.5         0.1 setosa 
## # ℹ 140 more rows

However, when you have too many variables, all of them cannot be printed and their are just listed.

data(fertility, package = "questionr")
women
## # A tibble: 2,000 × 17
##    id_woman id_household weight interview_date date_of_birth   age residency
##       <dbl>        <dbl>  <dbl> <date>         <date>        <dbl> <dbl+lbl>
##  1      391          381  1.80  2012-05-05     1997-03-07       15 2 [rural]
##  2     1643         1515  1.80  2012-01-23     1982-01-06       30 2 [rural]
##  3       85           85  1.80  2012-01-21     1979-01-01       33 2 [rural]
##  4      881          844  1.80  2012-01-06     1968-03-29       43 2 [rural]
##  5     1981         1797  1.80  2012-05-11     1986-05-25       25 2 [rural]
##  6     1072         1015  0.998 2012-02-20     1993-07-03       18 2 [rural]
##  7     1978         1794  0.998 2012-02-23     1967-01-28       45 2 [rural]
##  8     1607         1486  0.998 2012-02-20     1989-01-21       23 2 [rural]
##  9      738          711  0.192 2012-03-09     1962-07-24       49 2 [rural]
## 10     1656         1525  0.192 2012-03-15     1980-12-25       31 2 [rural]
## # ℹ 1,990 more rows
## # ℹ 10 more variables: region <dbl+lbl>, instruction <dbl+lbl>,
## #   employed <dbl+lbl>, matri <dbl+lbl>, religion <dbl+lbl>,
## #   newspaper <dbl+lbl>, radio <dbl+lbl>, tv <dbl+lbl>,
## #   ideal_nb_children <dbl+lbl>, test <dbl+lbl>

Note: in R console, value labels (if defined) are usually printed but they do not appear in a R markdown document like this vignette.

dplyr::glimpse()

The function dplyr::glimpse() allows you to have a quick look at all the variables in a data frame.

glimpse(iris)
## Rows: 150
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
## $ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
## $ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
## $ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…
glimpse(women)
## Rows: 2,000
## Columns: 17
## $ id_woman          <dbl> 391, 1643, 85, 881, 1981, 1072, 1978, 1607, 738, 165…
## $ id_household      <dbl> 381, 1515, 85, 844, 1797, 1015, 1794, 1486, 711, 152…
## $ weight            <dbl> 1.803150, 1.803150, 1.803150, 1.803150, 1.803150, 0.…
## $ interview_date    <date> 2012-05-05, 2012-01-23, 2012-01-21, 2012-01-06, 201…
## $ date_of_birth     <date> 1997-03-07, 1982-01-06, 1979-01-01, 1968-03-29, 198…
## $ age               <dbl> 15, 30, 33, 43, 25, 18, 45, 23, 49, 31, 26, 45, 25, …
## $ residency         <dbl+lbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ region            <dbl+lbl> 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, …
## $ instruction       <dbl+lbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 2, 1, 0, …
## $ employed          <dbl+lbl> 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ matri             <dbl+lbl> 0, 2, 2, 2, 1, 0, 1, 1, 2, 5, 2, 3, 0, 2, 1, 2, …
## $ religion          <dbl+lbl> 1, 3, 2, 3, 2, 2, 3, 1, 3, 3, 2, 3, 2, 2, 2, 2, …
## $ newspaper         <dbl+lbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ radio             <dbl+lbl> 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, …
## $ tv                <dbl+lbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, …
## $ ideal_nb_children <dbl+lbl>  4,  4,  4,  4,  4,  5, 10,  5,  4,  5,  6, 10, …
## $ test              <dbl+lbl> 0, 9, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, …

It will show you the first values of each variable as well as the type of each variable. However, some important informations are not displayed:

labelled::look_for()

look_for() provided by the labelled package will print in the console a data dictionary of all variables, showing variable labels when available, the type of variable and a list of values corresponding to:

library(labelled)
look_for(iris)
##  pos variable     label           col_type missing values    
##  1   Sepal.Length —               dbl      0                 
##  2   Sepal.Width  —               dbl      0                 
##  3   Petal.Length Length of petal dbl      0                 
##  4   Petal.Width  Width of Petal  dbl      0                 
##  5   Species      —               fct      0       setosa    
##                                                    versicolor
##                                                    virginica
look_for(women)
##  pos variable          label              col_type missing values            
##  1   id_woman          Woman Id           dbl      0                         
##  2   id_household      Household Id       dbl      0                         
##  3   weight            Sample weight      dbl      0                         
##  4   interview_date    Interview date     date     0                         
##  5   date_of_birth     Date of birth      date     0                         
##  6   age               Age at last anniv~ dbl      0                         
##  7   residency         Urban / rural res~ dbl+lbl  0       [1] urban         
##                                                            [2] rural         
##  8   region            Region             dbl+lbl  0       [1] North         
##                                                            [2] East          
##                                                            [3] South         
##                                                            [4] West          
##  9   instruction       Level of instruct~ dbl+lbl  0       [0] none          
##                                                            [1] primary       
##                                                            [2] secondary     
##                                                            [3] higher        
##  10  employed          Employed?          dbl+lbl  7       [0] no            
##                                                            [1] yes           
##                                                            [9] missing       
##  11  matri             Matrimonial status dbl+lbl  0       [0] single        
##                                                            [1] married       
##                                                            [2] living togeth~
##                                                            [3] windowed      
##                                                            [4] divorced      
##                                                            [5] separated     
##  12  religion          Religion           dbl+lbl  4       [1] Muslim        
##                                                            [2] Christian     
##                                                            [3] Protestant    
##                                                            [4] no religion   
##                                                            [5] other         
##  13  newspaper         Read newspaper?    dbl+lbl  0       [0] no            
##                                                            [1] yes           
##  14  radio             Listen to radio?   dbl+lbl  0       [0] no            
##                                                            [1] yes           
##  15  tv                Watch TV?          dbl+lbl  0       [0] no            
##                                                            [1] yes           
##  16  ideal_nb_children Ideal number of c~ dbl+lbl  0       [96] don't know   
##                                                            [99] missing      
##  17  test              Ever tested for H~ dbl+lbl  29      [0] no            
##                                                            [1] yes           
##                                                            [9] missing

Note that lookfor() and generate_dictionary() are synonyms of look_for() and works exactly in the same way.

If there is not enough space to print full labels in the console, they will be truncated (truncation is indicated by a ~).

Searching variables by key

When a data frame has dozens or even hundreds of variables, it could become difficult to find a specific variable. In such case, you can provide an optional list of keywords, which can be simple character strings or regular expression, to search for specific variables.

# Look for a single keyword.
look_for(iris, "petal")
##  pos variable     label           col_type missing values
##  3   Petal.Length Length of petal dbl      0             
##  4   Petal.Width  Width of Petal  dbl      0
look_for(iris, "s")
##  pos variable     label col_type missing values    
##  1   Sepal.Length —     dbl      0                 
##  2   Sepal.Width  —     dbl      0                 
##  5   Species      —     fct      0       setosa    
##                                          versicolor
##                                          virginica
# Look for with a regular expression
look_for(iris, "petal|species")
##  pos variable     label           col_type missing values    
##  3   Petal.Length Length of petal dbl      0                 
##  4   Petal.Width  Width of Petal  dbl      0                 
##  5   Species      —               fct      0       setosa    
##                                                    versicolor
##                                                    virginica
look_for(iris, "s$")
##  pos variable label col_type missing values    
##  5   Species  —     fct      0       setosa    
##                                      versicolor
##                                      virginica
# Look for with several keywords
look_for(iris, "pet", "sp")
##  pos variable     label           col_type missing values    
##  3   Petal.Length Length of petal dbl      0                 
##  4   Petal.Width  Width of Petal  dbl      0                 
##  5   Species      —               fct      0       setosa    
##                                                    versicolor
##                                                    virginica
# Look_for will take variable labels into account
look_for(women, "read", "level")
##  pos variable    label                col_type missing values       
##  9   instruction Level of instruction dbl+lbl  0       [0] none     
##                                                        [1] primary  
##                                                        [2] secondary
##                                                        [3] higher   
##  13  newspaper   Read newspaper?      dbl+lbl  0       [0] no       
##                                                        [1] yes

By default, look_for() will look through both variable names and variables labels. Use labels = FALSE to look only through variable names.

look_for(women, "read")
##  pos variable  label           col_type missing values 
##  13  newspaper Read newspaper? dbl+lbl  0       [0] no 
##                                                 [1] yes
look_for(women, "read", labels = FALSE)
## Nothing found. Sorry.

Similarly, the search is by default case insensitive. To make the search case sensitive, use ignore.case = FALSE.

look_for(iris, "sepal")
##  pos variable     label col_type missing values
##  1   Sepal.Length —     dbl      0             
##  2   Sepal.Width  —     dbl      0
look_for(iris, "sepal", ignore.case = FALSE)
## Nothing found. Sorry.

Level of details

If you just want to use the search feature of look_for() without computing the details of each variable, simply indicate details = "none" or details = FALSE.

look_for(women, "id", details = "none")
##  pos variable          label                   
##   1  id_woman          Woman Id                
##   2  id_household      Household Id            
##   7  residency         Urban / rural residency 
##  16  ideal_nb_children Ideal number of children

If you want more details (but can be time consuming for big data frames), indicate details = "full" or details = TRUE.

look_for(women, details = "full")
##  pos variable          label              col_type missing unique_values
##  1   id_woman          Woman Id           dbl      0       2000         
##  2   id_household      Household Id       dbl      0       1814         
##  3   weight            Sample weight      dbl      0       351          
##  4   interview_date    Interview date     date     0       165          
##  5   date_of_birth     Date of birth      date     0       1740         
##  6   age               Age at last anniv~ dbl      0       36           
##  7   residency         Urban / rural res~ dbl+lbl  0       2            
##                                                                         
##  8   region            Region             dbl+lbl  0       4            
##                                                                         
##                                                                         
##                                                                         
##  9   instruction       Level of instruct~ dbl+lbl  0       4            
##                                                                         
##                                                                         
##                                                                         
##  10  employed          Employed?          dbl+lbl  7       3            
##                                                                         
##                                                                         
##  11  matri             Matrimonial status dbl+lbl  0       6            
##                                                                         
##                                                                         
##                                                                         
##                                                                         
##                                                                         
##  12  religion          Religion           dbl+lbl  4       6            
##                                                                         
##                                                                         
##                                                                         
##                                                                         
##  13  newspaper         Read newspaper?    dbl+lbl  0       2            
##                                                                         
##  14  radio             Listen to radio?   dbl+lbl  0       2            
##                                                                         
##  15  tv                Watch TV?          dbl+lbl  0       2            
##                                                                         
##  16  ideal_nb_children Ideal number of c~ dbl+lbl  0       18           
##                                                                         
##  17  test              Ever tested for H~ dbl+lbl  29      3            
##                                                                         
##                                                                         
##  values             na_values na_range
##  range: 1 - 2000                      
##  range: 1 - 1814                      
##  range: 0.044629 -~                   
##  range: 2011-12-01~                   
##  range: 1962-02-07~                   
##  range: 14 - 49                       
##  [1] urban                            
##  [2] rural                            
##  [1] North                            
##  [2] East                             
##  [3] South                            
##  [4] West                             
##  [0] none                             
##  [1] primary                          
##  [2] secondary                        
##  [3] higher                           
##  [0] no             9                 
##  [1] yes                              
##  [9] missing                          
##  [0] single                           
##  [1] married                          
##  [2] living togeth~                   
##  [3] windowed                         
##  [4] divorced                         
##  [5] separated                        
##  [1] Muslim                           
##  [2] Christian                        
##  [3] Protestant                       
##  [4] no religion                      
##  [5] other                            
##  [0] no                               
##  [1] yes                              
##  [0] no                               
##  [1] yes                              
##  [0] no                               
##  [1] yes                              
##  [96] don't know                      
##  [99] missing                         
##  [0] no             9                 
##  [1] yes                              
##  [9] missing
look_for(women, details = "full") %>%
  dplyr::glimpse()
## Rows: 17
## Columns: 14
## $ pos           <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17
## $ variable      <chr> "id_woman", "id_household", "weight", "interview_date", …
## $ label         <chr> "Woman Id", "Household Id", "Sample weight", "Interview …
## $ col_type      <chr> "dbl", "dbl", "dbl", "date", "date", "dbl", "dbl+lbl", "…
## $ missing       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 0, 4, 0, 0, 0, 0, 29
## $ levels        <named list> <NULL>, <NULL>, <NULL>, <NULL>, <NULL>, <NULL>, <NULL>, …
## $ value_labels  <named list> <NULL>, <NULL>, <NULL>, <NULL>, <NULL>, <NULL>, <1, 2>, …
## $ class         <named list> "numeric", "numeric", "numeric", "Date", "Date", …
## $ type          <chr> "double", "double", "double", "double", "double",…
## $ na_values     <named list> <NULL>, <NULL>, <NULL>, <NULL>, <NULL>, <NULL>, <…
## $ na_range      <named list> <NULL>, <NULL>, <NULL>, <NULL>, <NULL>, <NULL>, <NULL>, …
## $ n_na          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 0, 4, 0, 0, 0, 0, 29
## $ unique_values <int> 2000, 1814, 351, 165, 1740, 36, 2, 4, 4, 3, 6, 6,…
## $ range         <named list> <1, 2000>, <1, 1814>, <0.044629, 4.396831>, <2011…

Advanced usages of look_for()

look_for() returns a detailed tibble which is summarized before printing. To deactivate default printing and see full results, simply use dplyr::as_tibble(), dplyr::glimpse() or even utils::View().

look_for(women) %>% View()
look_for(women) %>% as_tibble()
## # A tibble: 17 × 7
##      pos variable          label            col_type missing levels value_labels
##    <int> <chr>             <chr>            <chr>      <int> <name> <named list>
##  1     1 id_woman          Woman Id         dbl            0 <NULL> <NULL>      
##  2     2 id_household      Household Id     dbl            0 <NULL> <NULL>      
##  3     3 weight            Sample weight    dbl            0 <NULL> <NULL>      
##  4     4 interview_date    Interview date   date           0 <NULL> <NULL>      
##  5     5 date_of_birth     Date of birth    date           0 <NULL> <NULL>      
##  6     6 age               Age at last ann… dbl            0 <NULL> <NULL>      
##  7     7 residency         Urban / rural r… dbl+lbl        0 <NULL> <dbl [2]>   
##  8     8 region            Region           dbl+lbl        0 <NULL> <dbl [4]>   
##  9     9 instruction       Level of instru… dbl+lbl        0 <NULL> <dbl [4]>   
## 10    10 employed          Employed?        dbl+lbl        7 <NULL> <dbl [3]>   
## 11    11 matri             Matrimonial sta… dbl+lbl        0 <NULL> <dbl [6]>   
## 12    12 religion          Religion         dbl+lbl        4 <NULL> <dbl [5]>   
## 13    13 newspaper         Read newspaper?  dbl+lbl        0 <NULL> <dbl [2]>   
## 14    14 radio             Listen to radio? dbl+lbl        0 <NULL> <dbl [2]>   
## 15    15 tv                Watch TV?        dbl+lbl        0 <NULL> <dbl [2]>   
## 16    16 ideal_nb_children Ideal number of… dbl+lbl        0 <NULL> <dbl [2]>   
## 17    17 test              Ever tested for… dbl+lbl       29 <NULL> <dbl [3]>
glimpse(look_for(women))
## Rows: 17
## Columns: 7
## $ pos          <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17
## $ variable     <chr> "id_woman", "id_household", "weight", "interview_date", "…
## $ label        <chr> "Woman Id", "Household Id", "Sample weight", "Interview d…
## $ col_type     <chr> "dbl", "dbl", "dbl", "date", "date", "dbl", "dbl+lbl", "d…
## $ missing      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 0, 4, 0, 0, 0, 0, 29
## $ levels       <named list> <NULL>, <NULL>, <NULL>, <NULL>, <NULL>, <NULL>, <NULL>, <…
## $ value_labels <named list> <NULL>, <NULL>, <NULL>, <NULL>, <NULL>, <NULL>, <1, 2>, <…

The tibble returned by look_for() could be easily manipulated for advanced programming.

When a column has several values for one variable (e.g. levels or value_labels), results as stored with nested named list. You can convert named lists into simpler character vectors, you can use convert_list_columns_to_character().

look_for(women) %>% convert_list_columns_to_character()
## # A tibble: 17 × 7
##      pos variable          label            col_type missing levels value_labels
##    <int> <chr>             <chr>            <chr>      <int> <chr>  <chr>       
##  1     1 id_woman          Woman Id         dbl            0 ""     ""          
##  2     2 id_household      Household Id     dbl            0 ""     ""          
##  3     3 weight            Sample weight    dbl            0 ""     ""          
##  4     4 interview_date    Interview date   date           0 ""     ""          
##  5     5 date_of_birth     Date of birth    date           0 ""     ""          
##  6     6 age               Age at last ann… dbl            0 ""     ""          
##  7     7 residency         Urban / rural r… dbl+lbl        0 ""     "[1] urban;…
##  8     8 region            Region           dbl+lbl        0 ""     "[1] North;…
##  9     9 instruction       Level of instru… dbl+lbl        0 ""     "[0] none; …
## 10    10 employed          Employed?        dbl+lbl        7 ""     "[0] no; [1…
## 11    11 matri             Matrimonial sta… dbl+lbl        0 ""     "[0] single…
## 12    12 religion          Religion         dbl+lbl        4 ""     "[1] Muslim…
## 13    13 newspaper         Read newspaper?  dbl+lbl        0 ""     "[0] no; [1…
## 14    14 radio             Listen to radio? dbl+lbl        0 ""     "[0] no; [1…
## 15    15 tv                Watch TV?        dbl+lbl        0 ""     "[0] no; [1…
## 16    16 ideal_nb_children Ideal number of… dbl+lbl        0 ""     "[96] don't…
## 17    17 test              Ever tested for… dbl+lbl       29 ""     "[0] no; [1…

Alternatively, you can use lookfor_to_long_format() to transform results into a long format with one row per factor level and per value label.

look_for(women) %>% lookfor_to_long_format()
## # A tibble: 41 × 7
##      pos variable       label               col_type missing levels value_labels
##    <int> <chr>          <chr>               <chr>      <int> <chr>  <chr>       
##  1     1 id_woman       Woman Id            dbl            0 <NA>   <NA>        
##  2     2 id_household   Household Id        dbl            0 <NA>   <NA>        
##  3     3 weight         Sample weight       dbl            0 <NA>   <NA>        
##  4     4 interview_date Interview date      date           0 <NA>   <NA>        
##  5     5 date_of_birth  Date of birth       date           0 <NA>   <NA>        
##  6     6 age            Age at last annive… dbl            0 <NA>   <NA>        
##  7     7 residency      Urban / rural resi… dbl+lbl        0 <NA>   [1] urban   
##  8     7 residency      Urban / rural resi… dbl+lbl        0 <NA>   [2] rural   
##  9     8 region         Region              dbl+lbl        0 <NA>   [1] North   
## 10     8 region         Region              dbl+lbl        0 <NA>   [2] East    
## # ℹ 31 more rows

Both can be combined:

look_for(women) %>%
  lookfor_to_long_format() %>%
  convert_list_columns_to_character()
## # A tibble: 41 × 7
##      pos variable       label               col_type missing levels value_labels
##    <int> <chr>          <chr>               <chr>      <int> <chr>  <chr>       
##  1     1 id_woman       Woman Id            dbl            0 <NA>   <NA>        
##  2     2 id_household   Household Id        dbl            0 <NA>   <NA>        
##  3     3 weight         Sample weight       dbl            0 <NA>   <NA>        
##  4     4 interview_date Interview date      date           0 <NA>   <NA>        
##  5     5 date_of_birth  Date of birth       date           0 <NA>   <NA>        
##  6     6 age            Age at last annive… dbl            0 <NA>   <NA>        
##  7     7 residency      Urban / rural resi… dbl+lbl        0 <NA>   [1] urban   
##  8     7 residency      Urban / rural resi… dbl+lbl        0 <NA>   [2] rural   
##  9     8 region         Region              dbl+lbl        0 <NA>   [1] North   
## 10     8 region         Region              dbl+lbl        0 <NA>   [2] East    
## # ℹ 31 more rows