% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/run_VIProDesign.R
\name{run_VIProDesign}
\alias{run_VIProDesign}
\title{Run VIProDesign Workflow}
\usage{
run_VIProDesign(
  file,
  output_prefix,
  max_cluster_number = NULL,
  predefined_cluster_number = NULL,
  use_cd_hit = TRUE,
  cd_hit_path,
  cutoff = 0.99,
  remove_outliers = TRUE,
  verbose = FALSE
)
}
\arguments{
\item{file}{A string specifying the path to the input FASTA file containing protein sequences.}

\item{output_prefix}{A string specifying the prefix for output files generated by the workflow.}

\item{max_cluster_number}{An integer specifying the maximum number of clusters to evaluate (optional).}

\item{predefined_cluster_number}{An integer specifying a predefined number of clusters for PAM clustering (optional).}

\item{use_cd_hit}{A logical value indicating whether to remove redundant sequences using \code{cd-hit} (default: TRUE).}

\item{cd_hit_path}{A string specifying the path to the \code{cd-hit} executable (default: "cd-hit").}

\item{cutoff}{A numeric value specifying the redundancy cutoff for \code{cd-hit} (default: 0.99).}

\item{remove_outliers}{A logical value indicating whether to identify and remove outliers using DBSCAN clustering (default: TRUE).}

\item{verbose}{A logical value indicating whether to print detailed messages during execution (default: FALSE).}
}
\value{
A list containing the following elements:
\itemize{
  \item \code{filtered_file}: A file containing the filtered sequences (if redundancy removal was performed).
  \item \code{non_redundant_file}: A file containing the non-redundant sequences (if redundancy removal was performed).
  \item \code{no_outlier_obj}: An \code{AAStringSet} object containing the sequences with outliers removed (if outlier removal was performed).
  \item \code{clustering_info}: Clustering information generated by PAM clustering.
  \item \code{final_panel}: The final representative sequences selected by the workflow.
}
}
\description{
This function performs the VIProDesign workflow for clustering and analyzing protein sequences.
It includes steps for filtering sequences, removing redundancy, identifying and removing outliers,
and clustering sequences using PAM (Partitioning Around Medoids).
If \code{use_cd_hit = TRUE}, the \code{cd-hit} executable must be installed and accessible in the system's PATH
(or at the location given by \code{cd_hit_path}). If \code{cd-hit} is not available, the workflow skips
redundancy removal and proceeds with the filtered sequences.
}
\details{
To install \code{cd-hit}, you can use conda:
\preformatted{
conda install -c bioconda cd-hit
}
Or download it from the official website: \url{http://weizhong-lab.ucsd.edu/cd-hit/}

The workflow includes the following steps:
\itemize{
  \item Filtering sequences.
  \item Removing redundancy using \code{cd-hit}.
  \item Identifying and removing outliers using DBSCAN clustering.
  \item Performing PAM clustering to identify representative sequences.
  \item Calculating entropy to evaluate clustering quality.
}
}
\examples{
# Example usage:
temp_dir <- tempdir()
temp_prefix <- file.path(temp_dir, "output")
input_file <- system.file("extdata", "input.fasta", package = "VIProDesign")
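# Because redundancy removal depends on an external cd-hit executable, it can
# help to check for it before running the workflow. This is a hedged sketch
# using base R only; it is not part of the package API, and the names
# cd_hit_bin and have_cd_hit are illustrative. Sys.which() returns an empty
# string when the program cannot be found on the PATH.
cd_hit_bin <- unname(Sys.which("cd-hit"))
have_cd_hit <- nzchar(cd_hit_bin)
if (have_cd_hit) {
  message("cd-hit found at: ", cd_hit_bin)
} else {
  message("cd-hit not found on the PATH; consider use_cd_hit = FALSE")
}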
run_VIProDesign(
  file = input_file,
  output_prefix = temp_prefix,
  max_cluster_number = 5,
  use_cd_hit = TRUE,
  cd_hit_path = "cd-hit",  # adjust if cd-hit is not on your PATH
  cutoff = 0.99,
  remove_outliers = TRUE
)
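# The PAM step listed in the details can be illustrated on toy data. This is a
# hedged sketch only: it assumes the recommended cluster package is installed
# and uses made-up one-dimensional points with Euclidean distances, not the
# sequence distances that run_VIProDesign() works with internally. All toy_*
# names are illustrative.
if (requireNamespace("cluster", quietly = TRUE)) {
  toy_points <- c(a = 0, b = 0.1, c = 0.2, d = 10, e = 10.1, f = 10.2)
  toy_dist <- stats::dist(toy_points)
  toy_pam <- cluster::pam(toy_dist, k = 2)
  # Cluster membership of each point, then the labels of the chosen medoids
  print(toy_pam$clustering)
  print(toy_pam$medoids)
}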
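# Clean up only the files written under the "output" prefix, leaving any
# other files in the session temporary directory untouched
unlink(list.files(temp_dir, pattern = "^output", full.names = TRUE))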
}
