% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/process_plink.R
\name{process_plink}
\alias{process_plink}
\title{Preprocess PLINK files using the \code{bigsnpr} package}
\usage{
process_plink(
  data_dir,
  data_prefix,
  rds_dir = data_dir,
  rds_prefix,
  logfile = NULL,
  impute = TRUE,
  impute_method = "mode",
  id_var = "IID",
  parallel = TRUE,
  quiet = FALSE,
  overwrite = FALSE,
  ...
)
}
\arguments{
\item{data_dir}{The path to the bed/bim/fam data files, \emph{without} a trailing "/" (e.g., use \code{data_dir = '~/my_dir'}, \strong{not} \code{data_dir = '~/my_dir/'})}

\item{data_prefix}{The prefix (as a character string) of the bed/bim/fam data files (e.g., \code{data_prefix = 'mydata'})}

\item{rds_dir}{The path to the directory in which you want to create the new '.rds' and '.bk' files. Defaults to \code{data_dir}}

\item{rds_prefix}{String specifying the user's preferred filename for the to-be-created .rds file (will be created inside the \code{rds_dir} folder).
Note: 'rds_prefix' cannot be the same as 'data_prefix'.}

\item{logfile}{Optional: the prefix (as a character string) of the log file to be written in 'rds_dir'. Defaults to NULL (no log file written).
Note: supply only the prefix, not a file path; supplying a file path causes a "file not found" error. For example, if you want my_log.log, supply 'my_log', and the my_log.log file will appear in rds_dir.}

\item{impute}{Logical: should data be imputed? Defaults to TRUE.}

\item{impute_method}{If 'impute' = TRUE, this argument specifies the kind of imputation desired. Options are:
\itemize{
\item mode (default): Imputes the most frequent call. See \code{bigsnpr::snp_fastImputeSimple()} for details.
\item random: Imputes by sampling according to allele frequencies.
\item mean0: Imputes the rounded mean.
\item mean2: Imputes the mean rounded to 2 decimal places.
\item xgboost: Imputes using an algorithm based on local XGBoost models. See \code{bigsnpr::snp_fastImpute()} for details. Note: this can take several minutes, even for a relatively small data set.
}}

\item{id_var}{String specifying which column of the PLINK \code{.fam} file has the unique sample identifiers. Options are "IID" (default) and "FID"}

\item{parallel}{Logical: should the computations within this function be run in parallel? Defaults to TRUE. See \code{count_cores()} and \code{?bigparallelr::assert_cores} for more details.
In particular, the user should be aware that too much parallelization can make computations \emph{slower}.}

\item{quiet}{Logical: should messages printed to the console be silenced? Defaults to FALSE.}

\item{overwrite}{Logical: if \code{.bk}/\code{.rds} files already exist for the specified directory/prefix, should these be overwritten? Defaults to FALSE. Set to TRUE if, for example, you want to change the imputation method you are using.
\strong{Note}: If there are multiple \code{.rds} files with names that start with "std_prefix_...", \strong{this will error out}.
To protect users from accidentally deleting files with saved results, only one \code{.rds} file can be removed with this option.}

\item{...}{Optional: additional arguments to \code{bigsnpr::snp_fastImpute()} (relevant only if impute_method = "xgboost")}
}
\value{
The filepath to the '.rds' object created; see details for explanation.
}
\description{
Preprocess PLINK files using the \code{bigsnpr} package
}
\details{
Three files are created in the location specified by \code{rds_dir}:
\itemize{
\item 'rds_prefix.rds': This is a list with three items:
(1) \code{X}: the filebacked \code{bigmemory::big.matrix} object pointing to the imputed genotype data.
This matrix has type 'double', which is important for downstream operations in \code{create_design()}.
(2) \code{map}: a data.frame with the PLINK 'bim' data (i.e., the variant information).
(3) \code{fam}: a data.frame with the PLINK 'fam' data (i.e., the pedigree information).
\item 'prefix.bk': This is the backingfile that stores the numeric data of the genotype matrix.
\item 'rds_prefix.desc': This is the description file that accompanies the backingfile.
}

Note that \code{process_plink()} need only be run once for a given set of PLINK
files; in subsequent data analysis/scripts, \code{get_data()} will access the '.rds' file.

For an example, see the vignette on processing PLINK files.
}
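\examples{
# A minimal sketch of a typical call, not a definitive workflow.
# Assumptions (hypothetical, not part of the package): PLINK files named
# mydata.bed, mydata.bim and mydata.fam exist in the directory '~/my_dir'.
# The call is wrapped in \dontrun{} because those files are placeholders.
\dontrun{
rds_path <- process_plink(
  data_dir = "~/my_dir",          # no trailing "/"
  data_prefix = "mydata",
  rds_prefix = "mydata_imputed",  # must differ from data_prefix
  logfile = "process_log",        # prefix only, not a file path
  impute = TRUE,
  impute_method = "mode",
  id_var = "IID",
  overwrite = FALSE
)

# The returned value is the file path to the new '.rds' object
print(rds_path)
}
}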
