% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/bread.R
\name{bread}
\alias{bread}
\title{Reads a file in table format, selecting columns, subsetting rows by number and filtering them by column values}
\usage{
bread(
  file = NULL,
  first_row = NULL,
  last_row = NULL,
  head = NULL,
  tail = NULL,
  colnames = NULL,
  colnums = NULL,
  patterns = NULL,
  filtered_columns = NULL,
  fixed = FALSE,
  range_min = NULL,
  range_max = NULL,
  numrange_columns = NULL,
  ...
)
}
\arguments{
\item{file}{String. Name or full path to a file compatible with data.table::fread()}

\item{first_row}{Numeric. First row of the portion of the file to subset.}

\item{last_row}{Numeric. Last row of the portion of the file to subset.}

\item{head}{Numeric. How many rows starting from the first in the file.}

\item{tail}{Numeric. How many rows starting from the last in the file.}

\item{colnames}{Vector of strings. Exact names of columns to select. If both colnames and colnums are provided, colnums will be prefered.}

\item{colnums}{Vector of numeric. Columns index numbers.}

\item{patterns}{Vector of strings. One or several patterns used to filter the data from the input file. Each element of the vector should correspond to the column to be filtered. Can use regular expressions.}

\item{filtered_columns}{Vector of strings or numeric. Optional. The columns to be filtered should be indicated through their names or their index number. Each element of the vector should correspond to the pattern with which it will be filtered.}

\item{fixed}{Logical. If TRUE, pattern is a string to be matched as is. Overrides all conflicting arguments.}

\item{range_min}{Vector of numeric. One or several minimal values used to filter (inclusively, as in superior OR EQUAL to that value) the data from the input file. Each element of the vector should correspond to the numrange_column to be filtered.}

\item{range_max}{Vector of numeric. One or several maximal values used to filter (inclusively, as in inferior OR EQUAL to that value) the data from the input file. Each element of the vector should correspond to the numrange_column to be filtered.}

\item{numrange_columns}{Vector of strings or numeric. The columns to be filtered should be indicated through their names or their index number. Each element of the vector should correspond to the range_min and range_man values with which it will be filtered.}

\item{...}{Arguments that must be passed to data.table::fread() like 'sep' or 'dec'.}
}
\value{
A data frame with the selected columns and the subsetted and filtered data
}
\description{
Wrapper for data.table::fread() simplifying the use of Unix commands like 'grep', 'cut', 'awk' and 'sed'
on a data file \emph{before} loading it in memory. The Unix commands are automatically generated
from the arguments.
This is useful if you want to load a big file too large for your available memory
(and encounter the 'cannot allocate vector of size' error) and know you can work on a
subsample. 'b' stands for 'big file'.
This function allows to subset rows by their index number, select columns and filter
with a pattern.
}
\details{
You can mix and match the row subsetting, the filtering by value and the selecting of columns.
In order, the function:
\enumerate{
\item subsets the rows by their numbers (with 'sed' & 'awk'). You need to input the index
number of the first and last rows you want to load in memory with fread(),
or alternatively use either the head or tail arguments to subset the first or
last rows of the file.
\item selects columns by index number or name (with 'cut'). If both colnames and
colnums are provided, colnums will be prefered.
\item filters the data selected so far with a pattern by column (with 'grep'). The
columns to be filtered should be indicated through their names or their index
number. Each element of the vector should correspond to the pattern with which
it will be filtered.
\item filters (inclusively, as in inferior/superior OR EQUAL) the data selected
so far by numerical value on a different set of provided columns with the 'sed' command.
}
}
\section{Warning}{

Best practice would probably be to load the big file in a SQL database or something.
Or not working on huge CSV files in the first place.
But if you have to, you hopefully won"t have to delve into the fascinating grammar of
Unix commands.
}

\examples{
file <- system.file('extdata', 'test.csv', package = 'bread')
## Select the columns numbered 1 and 3
bread(file = file, colnums = c(1,3))
## Select the columns named 'YEAR' and 'PRICE', then filter to keep only the
## value '2022' in column 'YEAR'
bread(file = file, colnames = c('YEAR', 'PRICE'),
      patterns = 2002, filtered_columns = 'YEAR')
## Select the columns names 'YEAR' and 'PRICE', then filter to keep only values
## superior or equal to 2004 in YEAR and to 2000 in PRICE
bread(file = file, colnames = c("YEAR", "PRICE"),
      range_min = c(2004,2000), numrange_columns = c(1,3))
## Subset to keep only the rows 10 to 18, select the columns named 'YEAR'
## and 'COLOR' then filter to keep only the value 'red' in column 'COLOR'
bread(file = file, colnames = c('YEAR', 'COLOR'),
      patterns = 'red', filtered_columns = 'COLOR',
      first_row = 10, last_row = 18)
}
\keyword{allocate}
\keyword{big}
\keyword{cut}
\keyword{file}
\keyword{select}
\keyword{size}
\keyword{vector}
