% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/group_category.r
\name{group_category}
\alias{group_category}
\alias{CollapseCategory}
\title{Group categories for discrete features}
\usage{
group_category(data, feature, threshold, measure, update = FALSE,
  category_name = "OTHER", exclude = NULL)
}
\arguments{
\item{data}{input data, e.g., a \link{data.table} or a \code{data.frame}.}

\item{feature}{name of the discrete feature to be collapsed.}

\item{threshold}{the bottom x\% of categories to be grouped, e.g., if set to 0.2 (20\%), categories within the bottom 20\% of cumulative frequency will be grouped.}

\item{measure}{name of the feature to be used as an alternative measure in place of category frequency.}

\item{update}{logical, indicating whether the data should be modified. The default is \code{FALSE}. Setting to \code{TRUE} will modify an input \link{data.table} object directly. For any other input class, an object of the input class will be returned.}

\item{category_name}{name of the new category if \code{update} is set to \code{TRUE}. The default is \code{"OTHER"}.}

\item{exclude}{categories to be excluded from grouping when \code{update} is set to \code{TRUE}.}
}
\value{
If \code{update} is set to \code{FALSE}, returns categories with cumulative frequency less than the input threshold. The output class will match the class of input data.
If \code{update} is set to \code{TRUE}, updated data will be returned, and the output class will match the class of input data.
}
\description{
Sometimes discrete features have sparse categories. This function will group the sparse categories for a discrete feature based on a given threshold.
}
\details{
If a continuous feature is passed to the argument \code{feature}, it will be coerced to \link{character-class}.
}
\examples{
# Load packages
library(data.table)

# Generate data
data <- data.table("a" = as.factor(round(rnorm(500, 10, 5))), "b" = rexp(500, 500))

# View cumulative frequency without collapsing categories
group_category(data, "a", 0.2)

# View cumulative frequency based on another measure
group_category(data, "a", 0.2, measure = "b")

# Group bottom 20\% categories based on cumulative frequency
group_category(data, "a", 0.2, update = TRUE)
plot_bar(data)

# Exclude categories from being grouped
dt <- data.table("a" = c(rep("c1", 25), rep("c2", 10), "c3", "c4"))
group_category(dt, "a", 0.8, update = TRUE, exclude = c("c3", "c4"))
plot_bar(dt)

# Return from non-data.table input
df <- data.frame("a" = as.factor(round(rnorm(50, 10, 5))), "b" = rexp(50, 10))
group_category(df, "a", 0.2)
group_category(df, "a", 0.2, measure = "b", update = TRUE)
group_category(df, "a", 0.2, update = TRUE)
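
# Illustrative sketch only, not the package's implementation: the threshold
# is compared against a cumulative share like the one computed here, assuming
# categories are ranked from most to least frequent.
x <- c(rep("c1", 25), rep("c2", 10), "c3", "c4")
cnt <- sort(table(x), decreasing = TRUE)
cum_pct <- cumsum(cnt) / sum(cnt)
cum_pct
# Under that assumption, categories whose cumulative share exceeds
# 1 - threshold fall into the bottom group (hypothetical threshold of 0.2)
bottom <- names(cum_pct)[cum_pct > 1 - 0.2]
bottom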
}
\keyword{group_category}
