This vignette can be referred to by citing the following:
Patil et al., (2022). datawizard: An R Package for Easy Data Preparation and Statistical Transformations. Journal of Open Source Software, 7(78), 4684, https://doi.org/10.21105/joss.04684
{datawizard}
package aims to make basic data wrangling
easier than with base R. The data wrangling workflow it supports is
similar to the one supported by the tidyverse package combination of
{dplyr}
and {tidyr}
. However, one of its main
features is that it has a very few dependencies: {stats}
and {utils}
(included in base R) and
{insight}
, which is the core package of the
easystats ecosystem. This package grew organically to
simultaneously satisfy the “0 non-base hard dependency” principle of
easystats and the data wrangling needs of the constituent
packages in this ecosystem.
One drawback of this genesis is that not all features of the
{tidyverse}
packages are supported since only features that
were necessary for easystats ecosystem have been implemented.
Some of these missing features (such as summarize
or the
pipe operator %>%
) are made available in other
dependency-free packages, such as {poorman}
.
It is also important to note that {datawizard}
was designed
to avoid namespace collisions with {tidyverse}
packages.
In this article, we will see how to go through basic data wrangling
steps with {datawizard}
. We will also compare it to the
{tidyverse}
syntax for achieving the same. This way, if you
decide to make the switch, you can easily find the translations here.
This vignette is largely inspired from {dplyr}
’s Getting started
vignette.
Before we look at their tidyverse equivalents, we can first
have a look at {datawizard}
’s key functions for data
wrangling:
Function | Operation |
---|---|
data_filter() |
to select only certain observations |
data_select() |
to select only a few variables |
data_modify() |
to create variables or modify existing ones |
data_arrange() |
to sort observations |
data_extract() |
to extract a single variable |
data_rename() |
to rename variables |
data_relocate() |
to reorder a data frame |
data_to_long() |
to convert data from wide to long |
data_to_wide() |
to convert data from long to wide |
data_join() |
to join two data frames |
data_unite() |
to concatenate several columns into a single one |
data_separate() |
to separate a single column into multiple columns |
Note that there are a few functions in {datawizard}
that
have no strict equivalent in {dplyr}
or
{tidyr}
(e.g data_rotate()
), and so we won’t
discuss them in the next section.
{dplyr}
/ {tidyr}
Before we look at them individually, let’s first have a look at the summary table of this equivalence.
Function | Tidyverse equivalent(s) |
---|---|
data_filter() |
dplyr::filter() ,
dplyr::slice() |
data_select() |
dplyr::select() |
data_modify() |
dplyr::mutate() |
data_arrange() |
dplyr::arrange() |
data_extract() |
dplyr::pull() |
data_rename() |
dplyr::rename() |
data_relocate() |
dplyr::relocate() |
data_to_long() |
tidyr::pivot_longer() |
data_to_wide() |
tidyr::pivot_wider() |
data_join() |
dplyr::inner_join() ,
dplyr::left_join() , dplyr::right_join() , |
dplyr::full_join() ,
dplyr::anti_join() , dplyr::semi_join() |
|
data_peek() |
dplyr::glimpse() |
data_unite() |
tidyr::unite() |
data_separate() |
tidyr::separate() |
data_filter()
is a wrapper around subset()
.
However, if you want to have several filtering conditions, you can
either use &
(as in subset()
) or
,
(as in dplyr::filter()
).
## # A tibble: 7 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Leia Org… 150 49 brown light brown 19 fema… femin…
## 2 Biggs Da… 183 84 black light brown 24 male mascu…
## 3 Cordé 157 NA brown light brown NA fema… femin…
## 4 Dormé 165 NA brown light brown NA fema… femin…
## 5 Raymus A… 188 79 brown light brown NA male mascu…
## 6 Poe Dame… NA NA brown light brown NA male mascu…
## 7 Padmé Am… 165 45 brown light brown 46 fema… femin…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
## # A tibble: 7 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Leia Org… 150 49 brown light brown 19 fema… femin…
## 2 Biggs Da… 183 84 black light brown 24 male mascu…
## 3 Cordé 157 NA brown light brown NA fema… femin…
## 4 Dormé 165 NA brown light brown NA fema… femin…
## 5 Raymus A… 188 79 brown light brown NA male mascu…
## 6 Poe Dame… NA NA brown light brown NA male mascu…
## 7 Padmé Am… 165 45 brown light brown 46 fema… femin…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
data_select()
is the equivalent of
dplyr::select()
. The main difference between these two
functions is that data_select()
uses two arguments
(select
and exclude
) and requires quoted
column names if we want to select several variables, while
dplyr::select()
accepts any unquoted column names.
## # A tibble: 6 × 3
## hair_color skin_color eye_color
## <chr> <chr> <chr>
## 1 blond fair blue
## 2 <NA> gold yellow
## 3 <NA> white, blue red
## 4 none white yellow
## 5 brown light brown
## 6 brown, grey light blue
## # A tibble: 6 × 11
## name height mass birth_year sex gender homeworld species films vehicles
## <chr> <int> <dbl> <dbl> <chr> <chr> <chr> <chr> <lis> <list>
## 1 Luke Sk… 172 77 19 male mascu… Tatooine Human <chr> <chr>
## 2 C-3PO 167 75 112 none mascu… Tatooine Droid <chr> <chr>
## 3 R2-D2 96 32 33 none mascu… Naboo Droid <chr> <chr>
## 4 Darth V… 202 136 41.9 male mascu… Tatooine Human <chr> <chr>
## 5 Leia Or… 150 49 19 fema… femin… Alderaan Human <chr> <chr>
## 6 Owen La… 178 120 52 male mascu… Tatooine Human <chr> <chr>
## # ℹ 1 more variable: starships <list>
## # A tibble: 6 × 11
## name height mass birth_year sex gender homeworld species films vehicles
## <chr> <int> <dbl> <dbl> <chr> <chr> <chr> <chr> <lis> <list>
## 1 Luke Sk… 172 77 19 male mascu… Tatooine Human <chr> <chr>
## 2 C-3PO 167 75 112 none mascu… Tatooine Droid <chr> <chr>
## 3 R2-D2 96 32 33 none mascu… Naboo Droid <chr> <chr>
## 4 Darth V… 202 136 41.9 male mascu… Tatooine Human <chr> <chr>
## 5 Leia Or… 150 49 19 fema… femin… Alderaan Human <chr> <chr>
## 6 Owen La… 178 120 52 male mascu… Tatooine Human <chr> <chr>
## # ℹ 1 more variable: starships <list>
## # A tibble: 6 × 11
## name height mass birth_year sex gender homeworld species films vehicles
## <chr> <int> <dbl> <dbl> <chr> <chr> <chr> <chr> <lis> <list>
## 1 Luke Sk… 172 77 19 male mascu… Tatooine Human <chr> <chr>
## 2 C-3PO 167 75 112 none mascu… Tatooine Droid <chr> <chr>
## 3 R2-D2 96 32 33 none mascu… Naboo Droid <chr> <chr>
## 4 Darth V… 202 136 41.9 male mascu… Tatooine Human <chr> <chr>
## 5 Leia Or… 150 49 19 fema… femin… Alderaan Human <chr> <chr>
## 6 Owen La… 178 120 52 male mascu… Tatooine Human <chr> <chr>
## # ℹ 1 more variable: starships <list>
## # A tibble: 6 × 3
## height mass birth_year
## <int> <dbl> <dbl>
## 1 172 77 19
## 2 167 75 112
## 3 96 32 33
## 4 202 136 41.9
## 5 150 49 19
## 6 178 120 52
You can find a list of all the select helpers with
?data_select
.
data_modify()
is a wrapper around
base::transform()
but has several additional benefits:
This last point is also the main difference between
data_modify()
and dplyr::mutate()
.
## c12hour e16sex e42dep c172code neg_c_7 c12hour_c c12hour_z c12hour_z2
## 1 16 2 3 2 12 -67.6 -0.9420928 -0.9420928
## 2 148 2 3 2 20 64.4 0.8974967 0.8974967
## 3 70 2 3 1 11 -13.6 -0.1895335 -0.1895335
## 4 NA 2 <NA> 2 10 NA NA NA
## 5 168 2 4 2 12 84.4 1.1762224 1.1762224
## 6 16 2 4 2 19 -67.6 -0.9420928 -0.9420928
data_modify()
accepts expressions as strings:
new_exp <- c(
"c12hour_c = center(c12hour)",
"c12hour_z = c12hour_c / sd(c12hour, na.rm = TRUE)"
)
data_modify(efc, new_exp)
## c12hour e16sex e42dep c172code neg_c_7 c12hour_c c12hour_z
## 1 16 2 3 2 12 -67.6 -0.9420928
## 2 148 2 3 2 20 64.4 0.8974967
## 3 70 2 3 1 11 -13.6 -0.1895335
## 4 NA 2 <NA> 2 10 NA NA
## 5 168 2 4 2 12 84.4 1.1762224
## 6 16 2 4 2 19 -67.6 -0.9420928
This makes it easy to use it in custom functions:
miles_to_km <- function(data, var) {
data_modify(
data,
paste0("km = ", var, "* 1.609344")
)
}
distance <- data.frame(miles = c(1, 8, 233, 88, 9))
distance
## miles
## 1 1
## 2 8
## 3 233
## 4 88
## 5 9
## miles km
## 1 1 1.609344
## 2 8 12.874752
## 3 233 374.977152
## 4 88 141.622272
## 5 9 14.484096
data_arrange()
is the equivalent of
dplyr::arrange()
. It takes two arguments: a data frame, and
a vector of column names used to sort the rows. Note that contrary to
most other functions in {datawizard}
, it is not possible to
use select helpers such as starts_with()
in
data_arrange()
.
## # A tibble: 6 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sky… 172 77 blond fair blue 19 male mascu…
## 2 Leia Org… 150 49 brown light brown 19 fema… femin…
## 3 Owen Lars 178 120 brown, gr… light blue 52 male mascu…
## 4 Darth Va… 202 136 none white yellow 41.9 male mascu…
## 5 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
## 6 C-3PO 167 75 <NA> gold yellow 112 none mascu…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
You can also sort variables in descending order by putting a
"-"
in front of their name, like below:
## # A tibble: 6 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Darth Va… 202 136 none white yellow 41.9 male mascu…
## 2 Owen Lars 178 120 brown, gr… light blue 52 male mascu…
## 3 Leia Org… 150 49 brown light brown 19 fema… femin…
## 4 Luke Sky… 172 77 blond fair blue 19 male mascu…
## 5 C-3PO 167 75 <NA> gold yellow 112 none mascu…
## 6 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
Although we mostly work on data frames, it is sometimes useful to
extract a single column as a vector. This can be done with
data_extract()
, which reproduces the behavior of
dplyr::pull()
:
## [1] "masculine" "masculine" "masculine" "masculine" "feminine" "masculine"
We can also specify several variables in select
. In this
case, data_extract()
is equivalent to
data_select()
:
## # A tibble: 6 × 3
## hair_color skin_color eye_color
## <chr> <chr> <chr>
## 1 blond fair blue
## 2 <NA> gold yellow
## 3 <NA> white, blue red
## 4 none white yellow
## 5 brown light brown
## 6 brown, grey light blue
data_rename()
is the equivalent of
dplyr::rename()
but the syntax between the two is
different. While dplyr::rename()
takes new-old pairs of
column names, data_rename()
requires a vector of column
names to rename, and then a vector of new names for these columns that
must be of the same length.
## # A tibble: 6 × 14
## name height mass `Hair Color` skin_color eye_color birth_year Sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke S… 172 77 blond fair blue 19 male mascu…
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
## 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
## 4 Darth … 202 136 none white yellow 41.9 male mascu…
## 5 Leia O… 150 49 brown light brown 19 fema… femin…
## 6 Owen L… 178 120 brown, grey light blue 52 male mascu…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
The way data_rename()
is designed makes it easy to apply
the same modifications to a vector of column names. For example, we can
remove underscores and use TitleCase with the following code:
to_rename <- names(starwars)
starwars %>%
data_rename(
pattern = to_rename,
replacement = tools::toTitleCase(gsub("_", " ", to_rename, fixed = TRUE))
)
## # A tibble: 6 × 14
## Name Height Mass `Hair Color` `Skin Color` `Eye Color` `Birth Year` Sex
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Luke Sk… 172 77 blond fair blue 19 male
## 2 C-3PO 167 75 <NA> gold yellow 112 none
## 3 R2-D2 96 32 <NA> white, blue red 33 none
## 4 Darth V… 202 136 none white yellow 41.9 male
## 5 Leia Or… 150 49 brown light brown 19 fema…
## 6 Owen La… 178 120 brown, grey light blue 52 male
## # ℹ 6 more variables: Gender <chr>, Homeworld <chr>, Species <chr>,
## # Films <list>, Vehicles <list>, Starships <list>
It is also possible to add a prefix or a suffix to all or a subset of
variables with data_addprefix()
and
data_addsuffix()
. The argument select
accepts
all select helpers that we saw above with
data_select()
:
starwars %>%
data_addprefix(
pattern = "OLD.",
select = contains("color")
) %>%
data_addsuffix(
pattern = ".NEW",
select = -contains("color")
)
## # A tibble: 6 × 14
## name.NEW height.NEW mass.NEW OLD.hair_color OLD.skin_color OLD.eye_color
## <chr> <int> <dbl> <chr> <chr> <chr>
## 1 Luke Skywalker 172 77 blond fair blue
## 2 C-3PO 167 75 <NA> gold yellow
## 3 R2-D2 96 32 <NA> white, blue red
## 4 Darth Vader 202 136 none white yellow
## 5 Leia Organa 150 49 brown light brown
## 6 Owen Lars 178 120 brown, grey light blue
## # ℹ 8 more variables: birth_year.NEW <dbl>, sex.NEW <chr>, gender.NEW <chr>,
## # homeworld.NEW <chr>, species.NEW <chr>, films.NEW <list>,
## # vehicles.NEW <list>, starships.NEW <list>
Sometimes, we want to relocate one or a small subset of columns in
the dataset. Rather than typing many names in
data_select()
, we can use data_relocate()
,
which is the equivalent of dplyr::relocate()
. Just like
data_select()
, we can specify a list of variables we want
to relocate with select
and exclude
. Then, the
arguments before
and after
1 specify where the
selected columns should be relocated:
## # A tibble: 6 × 14
## name sex gender homeworld height mass hair_color skin_color eye_color
## <chr> <chr> <chr> <chr> <int> <dbl> <chr> <chr> <chr>
## 1 Luke Skyw… male mascu… Tatooine 172 77 blond fair blue
## 2 C-3PO none mascu… Tatooine 167 75 <NA> gold yellow
## 3 R2-D2 none mascu… Naboo 96 32 <NA> white, bl… red
## 4 Darth Vad… male mascu… Tatooine 202 136 none white yellow
## 5 Leia Orga… fema… femin… Alderaan 150 49 brown light brown
## 6 Owen Lars male mascu… Tatooine 178 120 brown, gr… light blue
## # ℹ 5 more variables: birth_year <dbl>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
In addition to column names, before
and
after
accept column indices. Finally, one can use
before = -1
to relocate the selected columns just before
the last column, or after = -1
to relocate them after the
last column.
## # A tibble: 6 × 14
## name height mass hair_color skin_color eye_color birth_year species films
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <lis>
## 1 Luke Sk… 172 77 blond fair blue 19 Human <chr>
## 2 C-3PO 167 75 <NA> gold yellow 112 Droid <chr>
## 3 R2-D2 96 32 <NA> white, bl… red 33 Droid <chr>
## 4 Darth V… 202 136 none white yellow 41.9 Human <chr>
## 5 Leia Or… 150 49 brown light brown 19 Human <chr>
## 6 Owen La… 178 120 brown, gr… light blue 52 Human <chr>
## # ℹ 5 more variables: vehicles <list>, starships <list>, sex <chr>,
## # gender <chr>, homeworld <chr>
Reshaping data from wide to long or from long to wide format can be
done with data_to_long()
and data_to_wide()
.
These functions were designed to match
tidyr::pivot_longer()
and tidyr::pivot_wider()
arguments, so that the only thing to do is to change the function name.
However, not all of tidyr::pivot_longer()
and
tidyr::pivot_wider()
features are available yet.
We will use the relig_income
dataset, as in the {tidyr}
vignette.
## # A tibble: 18 × 11
## religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Agnostic 27 34 60 81 76 137 122
## 2 Atheist 12 27 37 52 35 70 73
## 3 Buddhist 27 21 30 34 33 58 62
## 4 Catholic 418 617 732 670 638 1116 949
## 5 Don’t k… 15 14 15 11 10 35 21
## 6 Evangel… 575 869 1064 982 881 1486 949
## 7 Hindu 1 9 7 9 11 34 47
## 8 Histori… 228 244 236 238 197 223 131
## 9 Jehovah… 20 27 24 24 21 30 15
## 10 Jewish 19 19 25 25 30 95 69
## 11 Mainlin… 289 495 619 655 651 1107 939
## 12 Mormon 29 40 48 51 56 112 85
## 13 Muslim 6 7 9 10 9 23 16
## 14 Orthodox 13 17 23 32 32 47 38
## 15 Other C… 9 7 11 13 13 14 18
## 16 Other F… 20 33 40 46 49 63 46
## 17 Other W… 5 2 3 4 2 7 3
## 18 Unaffil… 217 299 374 365 341 528 407
## # ℹ 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>,
## # `Don't know/refused` <dbl>
We would like to reshape this dataset to have 3 columns: religion,
count, and income. The column “religion” doesn’t need to change, so we
exclude it with -religion
. Then, each remaining column
corresponds to an income category. Therefore, we want to move all these
column names to a single column called “income”. Finally, the values
corresponding to each of these columns will be reshaped to be in a
single new column, called “count”.
## # A tibble: 180 × 3
## religion income count
## <chr> <chr> <dbl>
## 1 Agnostic <$10k 27
## 2 Agnostic $10-20k 34
## 3 Agnostic $20-30k 60
## 4 Agnostic $30-40k 81
## 5 Agnostic $40-50k 76
## 6 Agnostic $50-75k 137
## 7 Agnostic $75-100k 122
## 8 Agnostic $100-150k 109
## 9 Agnostic >150k 84
## 10 Agnostic Don't know/refused 96
## # ℹ 170 more rows
To explore a bit more the arguments of data_to_long()
,
we will use another dataset: the billboard
dataset.
## # A tibble: 317 × 79
## artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
## <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA
## 2 2Ge+her The … 2000-09-02 91 87 92 NA NA NA NA NA
## 3 3 Doors D… Kryp… 2000-04-08 81 70 68 67 66 57 54 53
## 4 3 Doors D… Loser 2000-10-21 76 76 72 69 67 65 55 59
## 5 504 Boyz Wobb… 2000-04-15 57 34 25 17 17 31 36 49
## 6 98^0 Give… 2000-08-19 51 39 34 26 26 19 2 2
## 7 A*Teens Danc… 2000-07-08 97 97 96 95 100 NA NA NA
## 8 Aaliyah I Do… 2000-01-29 84 62 51 41 38 35 35 38
## 9 Aaliyah Try … 2000-03-18 59 53 38 28 21 18 16 14
## 10 Adams, Yo… Open… 2000-08-26 76 76 74 69 68 67 61 58
## # ℹ 307 more rows
## # ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
## # wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
## # wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
## # wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
## # wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
## # wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>, …
## # A tibble: 5,307 × 5
## artist track date.entered week rank
## <chr> <chr> <date> <chr> <dbl>
## 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87
## 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk2 82
## 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk3 72
## 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk4 77
## 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk5 87
## 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk6 94
## 7 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk7 99
## 8 2Ge+her The Hardest Part Of ... 2000-09-02 wk1 91
## 9 2Ge+her The Hardest Part Of ... 2000-09-02 wk2 87
## 10 2Ge+her The Hardest Part Of ... 2000-09-02 wk3 92
## # ℹ 5,297 more rows
Once again, we use an example in the {tidyr}
vignette to
show how close data_to_wide()
and
pivot_wider()
are:
## # A tibble: 114 × 3
## fish station seen
## <fct> <fct> <int>
## 1 4842 Release 1
## 2 4842 I80_1 1
## 3 4842 Lisbon 1
## 4 4842 Rstr 1
## 5 4842 Base_TD 1
## 6 4842 BCE 1
## 7 4842 BCW 1
## 8 4842 BCE2 1
## 9 4842 BCW2 1
## 10 4842 MAE 1
## # ℹ 104 more rows
## # A tibble: 19 × 12
## fish Release I80_1 Lisbon Rstr Base_TD BCE BCW BCE2 BCW2 MAE MAW
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4842 1 1 1 1 1 1 1 1 1 1 1
## 2 4843 1 1 1 1 1 1 1 1 1 1 1
## 3 4844 1 1 1 1 1 1 1 1 1 1 1
## 4 4845 1 1 1 1 1 0 0 0 0 0 0
## 5 4847 1 1 1 0 0 0 0 0 0 0 0
## 6 4848 1 1 1 1 0 0 0 0 0 0 0
## 7 4849 1 1 0 0 0 0 0 0 0 0 0
## 8 4850 1 1 0 1 1 1 1 0 0 0 0
## 9 4851 1 1 0 0 0 0 0 0 0 0 0
## 10 4854 1 1 0 0 0 0 0 0 0 0 0
## 11 4855 1 1 1 1 1 0 0 0 0 0 0
## 12 4857 1 1 1 1 1 1 1 1 1 0 0
## 13 4858 1 1 1 1 1 1 1 1 1 1 1
## 14 4859 1 1 1 1 1 0 0 0 0 0 0
## 15 4861 1 1 1 1 1 1 1 1 1 1 1
## 16 4862 1 1 1 1 1 1 1 1 1 0 0
## 17 4863 1 1 0 0 0 0 0 0 0 0 0
## 18 4864 1 1 0 0 0 0 0 0 0 0 0
## 19 4865 1 1 1 0 0 0 0 0 0 0 0
In {datawizard}
, joining datasets is done with
data_join()
(or its alias data_merge()
).
Contrary to {dplyr}
, this unique function takes care of all
types of join, which are then specified inside the function with the
argument join
(by default, join = "left"
).
Below, we show how to perform the four most common joins: full, left,
right and inner. We will use the datasets band_members
and
band_instruments
provided by {dplyr}
:
## # A tibble: 3 × 2
## name band
## <chr> <chr>
## 1 Mick Stones
## 2 John Beatles
## 3 Paul Beatles
## # A tibble: 4 × 3
## name band plays
## * <chr> <chr> <chr>
## 1 Mick Stones <NA>
## 2 John Beatles guitar
## 3 Paul Beatles bass
## 4 Keith <NA> guitar
## # A tibble: 3 × 3
## name band plays
## * <chr> <chr> <chr>
## 1 Mick Stones <NA>
## 2 John Beatles guitar
## 3 Paul Beatles bass
## # A tibble: 3 × 3
## name band plays
## * <chr> <chr> <chr>
## 1 John Beatles guitar
## 2 Paul Beatles bass
## 3 Keith <NA> guitar
Uniting variables is useful e.g to create unique indices by combining
several variables or to gather years, months, and days into a single
date. data_unite()
offers an interface very close to
tidyr::unite()
:
test <- data.frame(
year = 2002:2004,
month = c("02", "03", "09"),
day = c("11", "22", "28"),
stringsAsFactors = FALSE
)
test
## year month day
## 1 2002 02 11
## 2 2003 03 22
## 3 2004 09 28
## date
## 1 2002-02-11
## 2 2003-03-22
## 3 2004-09-28
## year month day date
## 1 2002 02 11 2002-02-11
## 2 2003 03 22 2003-03-22
## 3 2004 09 28 2004-09-28
Separating variables is the counterpart to uniting variables and is
useful to split values into multiple columns, e.g. when splitting a date
into values for years, months and days. data_separate()
offers an interface very close to tidyr::separate()
:
test <- data.frame(
date_arrival = c("2002-02-11", "2003-03-22", "2004-09-28"),
date_departure = c("2002-03-15", "2003-03-28", "2004-09-30"),
stringsAsFactors = FALSE
)
test
## date_arrival date_departure
## 1 2002-02-11 2002-03-15
## 2 2003-03-22 2003-03-28
## 3 2004-09-28 2004-09-30
## date_departure Year Month Day
## 1 2002-03-15 2002 02 11
## 2 2003-03-28 2003 03 22
## 3 2004-09-30 2004 09 28
Unlike tidyr::separate()
, you can separate multiple
columns in one step with data_separate()
.
test %>%
data_separate(
new_columns = list(
date_arrival = c("Arr_Year", "Arr_Month", "Arr_Day"),
date_departure = c("Dep_Year", "Dep_Month", "Dep_Day")
)
)
## Arr_Year Arr_Month Arr_Day Dep_Year Dep_Month Dep_Day
## 1 2002 02 11 2002 03 15
## 2 2003 03 22 2003 03 28
## 3 2004 09 28 2004 09 30
{datawizard}
contains other functions that are not
necessarily included in {dplyr}
or {tidyr}
or
do not directly modify the data. Some of them are inspired from the
package janitor
.
We can convert a column in rownames and move rownames to a new column
with rownames_as_column()
and
column_as_rownames()
:
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
## model mpg cyl disp hp drat wt qsec vs am gear carb
## 1 Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## 2 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## 3 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## 4 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## 5 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## 6 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
rowid_as_column()
is close but not identical to
tibble::rowid_to_column()
. The main difference is when we
use it with grouped data. While tibble::rowid_to_column()
uses one distinct rowid for every row in the dataset,
rowid_as_column()
creates one id for every row in each
group. Therefore, two rows in different groups can have the same
row id.
This means that rowid_as_column()
is closer to using
n()
in mutate()
, like the following:
test <- data.frame(
group = c("A", "A", "B", "B"),
value = c(3, 5, 8, 1),
stringsAsFactors = FALSE
)
test
## group value
## 1 A 3
## 2 A 5
## 3 B 8
## 4 B 1
## rowid group value
## 1 1 A 3
## 2 2 A 5
## 3 3 B 8
## 4 4 B 1
## # A tibble: 4 × 3
## # Groups: group [2]
## rowid group value
## <int> <chr> <dbl>
## 1 1 A 3
## 2 2 A 5
## 3 1 B 8
## 4 2 B 1
## # A tibble: 4 × 3
## # Groups: group [2]
## group value id
## <chr> <dbl> <int>
## 1 A 3 1
## 2 A 5 2
## 3 B 8 1
## 4 B 1 2
When dealing with messy data, it is sometimes useful to use a row as
column names, and vice versa. This can be done with
row_to_colnames()
and colnames_to_row()
.
## X_1 X_2
## 1 <NA> <NA>
## 2 Title Title2
## 3 1 4
## 4 2 5
## 5 3 6
## Title Title2
## 1 <NA> <NA>
## 3 1 4
## 4 2 5
## 5 3 6
## x1 x2
## 1 Title Title2
## 11 <NA> <NA>
## 3 1 4
## 4 2 5
## 5 3 6
## Data frame with 10 rows and 5 variables
##
## Variable | Type | Values
## -----------------------------------------------------------------------
## Sepal.Length | numeric | 4.3, 5, 7.7, 4.4, 5.9, 6.5, 5.5, 5.5, 5.8, ...
## Sepal.Width | numeric | 3, 3.3, 3.8, 3.2, 3, 3, 2.5, 2.6, 2.7, 3, ...
## Petal.Length | numeric | 1.1, 1.4, 6.7, 1.3, 5.1, 5.2, 4, 4.4, 5.1, ...
## Petal.Width | numeric | 0.1, 0.2, 2.2, 0.2, 1.8, 2, 1.3, 1.2, 1.9, ...
## Species | factor | setosa, setosa, virginica, setosa, ...
Note that we use before
and
after
whereas dplyr::relocate()
uses
.before
and .after
.↩︎