Read and Write Data

David Gerbing

Data Table

Analyze the data values for at least one variable, such as the annual salaries of employees at a company. Typically have available the data for multiple variables. Organize the data values into a specific kind of structure from which analysis proceeds.

Data Table: Organize data values into a rectangular data table with the data values for each variable in a column, and the name of the variable at the top of the column.

Store the structured data values within a computer file, such as an Excel or OpenDocument Spreadsheet (ODS) formatted file or text file. This file can be stored on your computer, an accessible local network, or the world wide web. The data table in the following figure, formatted as an Excel file, contains four variables: Years, Gender, Dept, and Salary plus an ID field called Name for a total of five columns.

Structure of a data table.

Structure of a data table.

Describe the data table by its columns, rows, and cell entries.

Data value: The contents of a single cell of a data table, a specific measurement.

For example, according to the data values for employee Darnell Ritchie, he has worked at the company for 7 years, identifies as a male, and works in administration with an annual salary of $43,788.26.

Missing data: A cell for which there is no recorded data value.

Two data values in this section of the data table are missing. The number of years James Wu has worked at the company is not recorded, nor is the department in which Alissa Jones works.

Variable name: A short, concise word or abbreviation that identifies a column of data values in a data table.

Instance or Example or Sample or Case: Each row of the data table, the data for a specific instance of a single person, organization, place, event, or whatever is being studied.

Encode the data table in one of a variety of computer file formats. Common formats include Excel files (.xlsx), OpenDocument Spreadsheet files (.ods), and text files the form of comma-separated value files (.csv) or tab-delimited text files (.txt).

Analysis of data can only proceed with the data table identified and the relevant variables identified by their name.

R data analysis functions analyze the data values for one or more specified variables, identified by their names, such as Salary.

Analysis requires the correct spelling of each variable name, including the same pattern of capitalization.

Read the Data into R

Your data organized as a data table exists somewhere as a data file stored on a computer system. To analyze data in a data table stored in a computer file, first read the data table from the computer file into a corresponding data table within a running R session.

Data frame: A data table stored within an R session, referenced by its name.

Each variable in a data table has a name, and so does the data table itself. Initially reference the data table stored on your computer system by its file name and location. When read into R, name the data table, the R data frame, with a name of your choice. Regardless of the file name on your computer system, typically name the data table within the active R session, the data frame, as d for data. Not only is d easy to type, but it is also the lessR default data frame name for the data processed by its various analysis functions.

When analyzing data read into R, the same data exists in two different locations: a computer file on your computer system, and an R data frame within a running R app. Different locations, different names: same data. On your computer system, identify the data table by its file name and location. Within a running R app, identify the same data by the name of the data frame, such as d, within which the data from the computer file was read.

To read the data from a file into a data frame of a running R application, as with every other task in R, accomplish the task with a function. The R ecosystem, base R and its many packages, presents many such functions. We use the lessR function Read() for its simplicity and for its useful output that helps understanding the data that was read.

The lessR function Read() can read data files in many file formats, including MS Excel. The most generic format is the csv format, for comma separated values. Read() also reads SPSS and SAS data files, as well as data files in R’s own native format, of type .rda.

The csv text format applies to text files with a comma, ,, delimiting adjacent data values and the decimal point represented by a period, .. If the data are in the form of a text file that follows the European and South American convention of using a comma , for the decimal point, and if adjacent values are delimited with a semicolon, ;, then either use Read2(), or manually add the parameters dec="," and sep=";". The __lessR__ functionRead2()follows the form of base Rread.csv2()`.

Browse for the data file

To read the data, direct R to the location of the data file.”)` R cannot read the data file until it knows where the data is stored. One option has you locate the data file on your computer system by browsing for it, navigating your file system until you locate the file.

To locate your data file by browsing through your file system, call the Read() function with an empty file reference (""), literally nothing between the quotes.

The following Read() statement reads the data stored as a rectangular data table from an external file stored on your computer system into an R data frame called d. The empty quotes indicate to R to open your file browser for you to locate the data file that already exists somewhere on your computer system).

d <- Read("")

As with all R (and Excel) functions, the call to invoke the function includes a matching set of parentheses. Any information within the parentheses specifies the information provided to the function for analysis.

The <- indicates to assign what is on the right of the expression, here the data read from an external file, to the object on the left, here the R data frame stored within R, named d in this example. You can also use an ordinary equals sign, =, to indicate the assignment, but the <- is more descriptive, and more widely used by R practitioners.

Specify location of the data file

Specify either the full path name of a file on your computer system, or specify a web address that locates the data table on the web. Also can explicitly select the location of the data file to be read within the quotes and parentheses. Specify either the full path name of a file on your computer system, or specify a web address that locates the data table on the web. Again, read the data into the d data frame.

d <- Read("path name" or "web address")

With Excel, R, or any other computer apps that processes data, enclose values that are character strings, such as a file name, in quotes. For example, to read the data stored on the web in the data file called employee.xlsx into the data frame d, invoke the following Read() function call.

d <- Read("http://web.pdx.edu/~gerbing/data/employee.xlsx")

This example reads a data file on the web. To specify a location on your computer, provide the full path name of your data file, its name and location. To obtain this path name, first browse for the file with Read(""). The resulting output displays the path name of the identified file. Copy this path name and insert between the quotes of Read(""), save this and other R function calls in a text file, and then run the code in the future to directly read the data file for future analyses without needing to browse for its location.

With the Read() function, put nothing between the quotes to browse for a data file, or specify the location of the data file on your computer system or the web. Direct the data read from a file into an R data frame, usually named d, but can choose any valid name.

Output of Read()

The Read() function displays useful output. Because R organizes analyses by variable name, it is crucial to know the exact variable names, including the pattern of capitalization.

In addition to the variable names, Read() displays each variable’s type as stored in the computer, as numbers with or without decimal digits, or as character strings. Also listed are the number of complete and missing values for each variable, the number of unique values for each variable, and sample data values.

The following lists the output from reading from a data file downloaded with lessR. All that is needed to read these data files is the name. A file on the web cannot be specified here because you may not have web access when this file is generated.

d <- Read("Employee")
## 
## >>> Suggestions
## Recommended binary format for data files: feather
##   Create with Write(d, "your_file", format="feather")
## More details about your data, Enter:  details()  for d, or  details(name)
## 
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## integer: Numeric data values, integers only
## double: Numeric data values with decimal digits
## ------------------------------------------------------------
## 
##     Variable                  Missing  Unique 
##         Name     Type  Values  Values  Values   First and last values
## ------------------------------------------------------------------------------------------
##  1     Years   integer     36       1      16   7  NA  7 ... 1  2  10
##  2    Gender character     37       0       2   M  M  W ... W  W  M
##  3      Dept character     36       1       5   ADMN  SALE  FINC ... MKTG  SALE  FINC
##  4    Salary    double     37       0      37   53788.26  94494.58 ... 56508.32  57562.36
##  5    JobSat character     35       2       3   med  low  high ... high  low  high
##  6      Plan   integer     37       0       3   1  1  2 ... 2  2  1
##  7       Pre   integer     37       0      27   82  62  90 ... 83  59  80
##  8      Post   integer     37       0      22   92  74  86 ... 90  71  87
## ------------------------------------------------------------------------------------------

To allow for many variables, Read() lists the information for each variable in a row. Note that the data file organizes the variables by column.

Get more information from Read(), especially regarding missing data, by setting brief to FALSE.

d <- Read("Employee", brief=FALSE)
## 
## >>> Suggestions
## Recommended binary format for data files: feather
##   Create with Write(d, "your_file", format="feather")
## 
## 
## ----------------------------------------------------------
## Dimensions: 8 variables over 37 rows of data
## 
## First two row names: Ritchie, Darnell     Wu, James 
## Last two row names:  Anastasiou, Crystal     Cassinelli, Anastis 
## ----------------------------------------------------------
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## integer: Numeric data values, integers only
## double: Numeric data values with decimal digits
## ------------------------------------------------------------
## 
## 
##     Variable                  Missing  Unique 
##         Name     Type  Values  Values  Values   First and last values
## ------------------------------------------------------------------------------------------
##  1     Years   integer     36       1      16   7  NA  7 ... 1  2  10
##  2    Gender character     37       0       2   M  M  W ... W  W  M
##  3      Dept character     36       1       5   ADMN  SALE  FINC ... MKTG  SALE  FINC
##  4    Salary    double     37       0      37   53788.26  94494.58 ... 56508.32  57562.36
##  5    JobSat character     35       2       3   med  low  high ... high  low  high
##  6      Plan   integer     37       0       3   1  1  2 ... 2  2  1
##  7       Pre   integer     37       0      27   82  62  90 ... 83  59  80
##  8      Post   integer     37       0      22   92  74  86 ... 90  71  87
## ------------------------------------------------------------------------------------------
## 
## 
## Missing Data Analysis
## ------------------------------------------------------------
## Number of cells in the data table:  296 
## Number of missing data values:  4 
## Proportion of missing data values:  0.014 
## Number of rows of data with missing values:  3 
## ------------------------------------------------------------
## 
##                    Years Gender Dept   Salary JobSat Plan Pre Post
## Wu, James             NA      M SALE 94494.58    low    1  62   74
## Jones, Alissa          5      W <NA> 53772.58   <NA>    1  65   62
## Korhalkar, Jessica     2      W ACCT 72502.50   <NA>    2  74   87
## ------------------------------------------------------------

Obtain this same information at any time during an R analysis by calling the function details(). Read() calls this same function after reading the data, with brief set to TRUE by default.

Two types of variables

Always distinguish continuous variables from categorical variables. This distinction between these two types of variables is fundamental in data analysis.

Continuous (quantitative) variable: A numerical variable with many possible values.

Categorical (qualitative) variable: A variable with relatively few unique labels as data values.

Examples of continuous variables are Salary or Time. Examples of categorical variables are Gender or State of Residence, each with just a relatively few number of possible values compared to numerical values. This distinction of continuous and categorical variables is common to all data analytics.

Sometimes that distinction gets a little confusing because variables with integer values, which are numeric, could be either quantitative or qualitative. For example, sometimes Male, Female, and Other are encoded as 0, 1, and 2, respectively, for three levels of the categorical variable Gender. However, these integer values are just labels for different non-numeric categories. Best to avoid this confusion. Instead, encode categorical variables with non-numeric values, such as Gender, for example, with M F, and O for Other.

Read Variable Labels

A variable label is a longer description of the corresponding variable than that of the variable name. The variable label displays in conjunction with the variable name on the text and visualization output to further clarify the interpretability of the output.

The variable label file has two columns. The the first column lists the variable names and the second column the corresponding labels. The file can be of type .csv, or .xlsx, or contained within lessR, as with the following example.

The variable labels must be read into the data frame l. Specify the var_labels parameter as TRUE to instruct the Read() function to read variable labels instead of data.

l <- Read("Employee_lbl", var_labels=TRUE)
## 
## >>> Suggestions
## Recommended binary format for data files: feather
##   Create with Write(d, "your_file", format="feather")
## More details about your data, Enter:  details()  for d, or  details(name)
## 
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## ------------------------------------------------------------
## 
##     Variable                  Missing  Unique 
##         Name     Type  Values  Values  Values   First and last values
## ------------------------------------------------------------------------------------------
##  1     label character      8       0       8   Time of Company Employment ... Test score on legal issues after instruction
## ------------------------------------------------------------------------------------------

Write the Data Table to a File

Relying upon the have package, lessR can write data with function Write() in formats: csv, Excel, ODS, and native R and native SPSS formats. The R format preserves a binary copy of the data frame as it is stored within R, as does the SPSS format for native SPSS files. The Excel and ODS formats yield worksheets. The csv format yields a comma separated value text file.

A recommended procedure begins the analysis with data in .csv, .xlsx, or .ods format. Then proceed with data cleaning and preparation, including needed transformations and re-coding. When the data is ready for analysis, save the cleaned, prepared data as a native R file of format .rda. This format is the most efficient for size and for speed of reading back into R, with all data preparations already completed. Can also write the data in a format such as Excel so that those on the team not using R can also access.

With the Write() function, specify the name of the file for the output, as well as the type of file with the format parameter, with the default of .csv. The following R statements that call Write() are not run here as the intent is not to create additional files.

Write the current contents of default data frame d to GoodData.csv.

Write(d, "GoodData")

Write the data as a Excel data table in an Excel file.

Write(d, "GoodData", format="Excel")

Can also use the abbreviation for an Excel file, wrt_x().

wrt_x(d, "GoodData")

Write the data as a R data table.

Write(d, "GoodData", format="R")

Can also use the abbreviation for an R file, wrt_r().

wrt_r(d, "GoodData")

Use format="ODS" to specify to write to the OpenDocument Spreadsheet format with file type .ods. Use format="SPSS" to specify to write to the SPSS format with file type .sav.

The output of Write() indicates the full path name of the written file.

Read SPSS (and SAS and Stata) Files

The haven package has an excellent function for reading SPSS files, read_spss(). Read() adds to the given functionality by preserving the SPSS variable labels and value labels. Read() invokes this function to read SPSS files, with the default file types of .sav and zsav. The functionality of haven also allows for reading SAS and Stata data files.

Value labels

Within SPSS, an integer scored variable can have value labels. An example is Likert scaled data with 1 representing a Strongly Disagree to 5 a Strongly Agree. The corresponding SPSS variable has integer values but displays the more informative value labels such as in bar charts. This type of categorical variable corresponds to a factor in the R system.

The read_spss() function preserves the value labels with a special variable type called haven_labels. These variables can be converted to an R factor for processing in the R system with the haven function as_factor(). Read() performs this conversion automatically for each relevant variable.

One problem is that the factor conversion preserves the value labels listed in the correct order, but looses the original integer scoring information. To preserve both the labels as a standard R factor, and to preserve the original scoring, Read() converts the original read variable with the labels (type haven_labels) into two variables. The first variable, an integer variable with the original integer scoring, has the name of the read variable. The corresponding factor variable has the same name as the read variable with the suffix \_f.

For example, the use of read_spss() results in the following, here showing just the first four lines of data. The variable region contains both the integer scoring and the value label that are part of the SPSS data file that was read.

# A tibble: 4 x 4
  city                         region growth income
  <chr>                     <dbl+lbl>  <dbl>  <dbl>
1 ALBANY-SCHNTADY-TROY,N.Y.    1 [NE]    -71   3313
2 ATLANTA,GA.                  2 [SE]    264   3153
3 BALTIMORE,MD.                1 [NE]     38   3540
4 BIRMINGHAM,ALA.              2 [SE]   -178   2528

With Read(), obtain the following standard R data frame. The variable region is now a standard R integer variable, and region_f is the corrsponding factor.

                       city region region_f growth income
1 ALBANY-SCHNTADY-TROY,N.Y.      1       NE    -71   3313
2               ATLANTA,GA.      2       SE    264   3153
3             BALTIMORE,MD.      1       NE     38   3540
4           BIRMINGHAM,ALA.      2       SE   -178   2528

With these data, the analyst may accomplish a numerical analysis with the integer variable, and for analyses such as a bar chart, instead, display the corresponding value labels.

Variable labels

lessR automatically accesses variable labels stored in a data frame named l, and then displays with the variable name in text and visualization output. Usually, the variable labels are stored in an Excel or .csv file with two columns, the variable name, and the variable label. There are no column titles, just the names and labels. Then read into the l data frame with the var_labels parameter set to TRUE.

l <- Read(file_reference, var_labels=TRUE)

Read() also processes the variable labels of these (usually) integer-scored variables with value labels in the SPSS data file. The overseers of the R system do not permit package authors to create stored data structures from internal R code. Only the user can create these structures. As such, Read() lists each variable name, a comma, and then the corresponding label. This example only has one such relevant variable, region and its factor equivalent.

Variable and Variable Label  --> See vignette("Read"), SPSS section
---------------------------
region,  region of US 
region_f,  region of US

To access these labels, copy the names and labels, paste into a text file, and then save as a file. Then read the file of names/labels into R with the preceding Read() statement.

Full Manual

Use the base R help() function to view the full manual for Read() or Write(). Simply enter a question mark followed by the name of the function.

?Read
?Write