brief-introduction-to-select-syntax

Version Note: Up-to-date with v0.3.0

library(psycModel)
library(dplyr)

This article briefly introduces the usage of dplyr::select, and how it is applied to this package. In the first section, I will briefly describe dplyr::select syntax for R-beginners. If you are already familiar with dplyr::select syntax, then you can skip to the next section where I describe how to apply the syntax in this pacakge.

Introduction to the select syntax

dplyr::select (abbreviated as select hereafter) is an extremely power function for R. It allows you to subset columns with the a set of syntax that is also known as the select syntax / semantics in the R community. A side note here. With the new introduction of dplyr::across function, the select syntax can be applied to dplyr::mutate and dplyr::filter where make these two already powerful function even more powerful. I will first introduce the usage of :, c() and -. Then, I will discuss how to use everything, starts_with, end_with, contains, and where. This is not an exhaustive list of the select syntax, but there are the most relevant one. If you want to learn more, I encourage you to check the vignette of dplyr or just google it. There are tons of article that discuss this in detail.

Basic select syntax

I am going to use the iris dataset for the demonstration. Let’s take a quick peek of the dataset.

iris %>% head() # head() show the first 5 rows of the data frame 
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

If I want to select the first 3 columns, you can use : to do that

iris %>% select(1:3) %>% head(1)
  Sepal.Length Sepal.Width Petal.Length
1          5.1         3.5          1.4
iris %>% select(Sepal.Length:Petal.Length) %>% head(1)
  Sepal.Length Sepal.Width Petal.Length
1          5.1         3.5          1.4

Next, if you want to combine selection then you can use c(). For example, I want the 1st, 3rd and 4th columns. Then, you can do it like this

iris %>% select(c(1, 3:4)) %>% head(1)
  Sepal.Length Petal.Length Petal.Width
1          5.1          1.4         0.2
iris %>% select(Sepal.Length, Petal.Length:Petal.Width) %>% head(1)
  Sepal.Length Petal.Length Petal.Width
1          5.1          1.4         0.2

Finally, if you want to delete a column from selection, then you can use -. For example, you want to select all columns except the 3rd column, then you can do it like this

iris %>% select(1:5, -3) %>% head(1)
  Sepal.Length Sepal.Width Petal.Width Species
1          5.1         3.5         0.2  setosa
iris %>% select(Sepal.Length:Species, -Petal.Length) %>% head(1)
  Sepal.Length Sepal.Width Petal.Width Species
1          5.1         3.5         0.2  setosa

Tidyselect helpers

Ok. Now you understand the basic usage. Let’s get to something a little bit more advanced. First, let’s talk about my favorite which is everything. As the name entails, it select all the variables in the data frame. It is usually used in combination with c() if you are using in select function. However, it is very powerful in other use cases like the one in this package. For example, you want to fit a linear regression with all the variables, then you can use everything (a more detailed discussion is presented in the next section).

# select all columns
iris %>% select(everything()) %>% head(1)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
# select everything except Sepal.Width
iris %>% select(c(everything(),-Sepal.Width)) %>% head(1)
  Sepal.Length Petal.Length Petal.Width Species
1          5.1          1.4         0.2  setosa

Next, we can talk about starts_with. starts_with select all columns that is starts with a certain specified string. For example, we want to select all columns start with Sepal, then we can do something like this

iris %>% select(starts_with('Sepal')) %>% head(1)
  Sepal.Length Sepal.Width
1          5.1         3.5

Similar to starts_with, ends_with select all columns that is ends with a certain specified string. For example, we want to select all columns ends with Width.

iris %>% select(ends_with('Width')) %>% head(1)
  Sepal.Width Petal.Width
1         3.5         0.2

Next, we are going talk about contains. As the name entails, it select all columns that contains a specified string.

iris %>% select(contains('Sepal')) %>% head(1) # same as starts_with
  Sepal.Length Sepal.Width
1          5.1         3.5
iris %>% select(contains('Width')) %>% head(1) # same as ends_with
  Sepal.Width Petal.Width
1         3.5         0.2
iris %>% select(contains('.')) %>% head(1) # contains "." will be selected
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1          5.1         3.5          1.4         0.2

Finally, we are going to conclude this section with where. where is not used alone. It is usually pair with a function return TRUE or FALSE. I think the most common use case for this package is paired with is.numeric. where(is.numeric) will select all numeric variables. A little tip, you need to pass is.numeric instead of is.numeric(). I will not go into the detail of why because this is out of the scope of this article. It required a little bit more advanced understanding of how function work in R. If you have that, you wouldn’t reading this article anyway.

iris %>% select(where(is.numeric)) %>% head(1) 
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1          5.1         3.5          1.4         0.2

Applying select syntax in this package

First, I will demonstrate the usage of linear regression. I will first create a data frame. You don’t need to know anything about how this data frame is created. Just know that it has 1 DV / outcome / response variable (i.e, y) and 5 IV / predictor variable (i.e, x1 to x5)

set.seed(1)
test_data = data.frame(y = rnorm(n = 100,mean = 2,sd = 3), 
           x1 = rnorm(n = 100,mean = 1.5, sd = 4),
           x2 = rnorm(n = 100,mean = 1.7, sd = 4), 
           x3 = rnorm(n = 100,mean = 1.5, sd = 4),
           x4 = rnorm(n = 100,mean = 2, sd = 4),
           x5 = rnorm(n = 100,mean = 1.5, sd = 4))

Ok, let’s fit that linear regression now.

# Without this package: 
model1 = lm(data = test_data, formula = y ~ x1 + x2 + x3 + x4 + x5)

# With this package: 
model2 = lm_model(data = test_data,
         response_variable = y, 
         predictor_variable = c(everything(),-y))
Fitting Model with lm:
 Formula = y ~ x1 + x2 + x3 + x4 + x5

This is already a step up from the basic lm() function. We can still make is even simpler by just passing everyhing(). The function is designed to remove the response variable from predictor variables (if selected) automatically. The following model3 is the same as model2

model3 = lm_model(data = test_data,
         response_variable = y, 
         predictor_variable = everything())
Fitting Model with lm:
 Formula = y ~ x1 + x2 + x3 + x4 + x5

The same logic is applied to all other functions in this package. Arguments that support dplyr::select syntax will ends with “support dplyr::select syntax” in the description of the argument.That’s it for this brief introduction. If you want to learn more about this package, I encourage you to check out this article or use vignette('quick-introduction') if you are in R Studio.