% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/nlp_collocation.R
\name{keywords_collocation}
\alias{keywords_collocation}
\alias{collocation}
\title{Extract collocations - a sequence of terms which follow each other}
\usage{
keywords_collocation(x, term, group, ngram_max = 2, n_min = 2, sep = " ")

collocation(x, term, group, ngram_max = 2, n_min = 2, sep = " ")
}
\arguments{
\item{x}{a data.frame with one row per term, where the order of the rows corresponds to 
the natural order of the text. The data.frame \code{x} must also contain 
the columns named in \code{term} and \code{group}}

\item{term}{a character string with the name of the column from \code{x} which contains the term}

\item{group}{a character vector with the names of 1 or several columns from \code{x} which identify, 
for example, a document or a sentence. Collocations are computed within each 
group so that they do not span sentences or documents.}

\item{ngram_max}{integer indicating the maximum size of the collocations. Defaults to 2, meaning
that only bigrams are computed. If set to 3, collocations of both bigrams and trigrams are found.}

\item{n_min}{integer indicating the minimum number of times a collocation must
occur in the data in order to be returned. Defaults to 2.}

\item{sep}{character string with the separator which will be used to \code{paste} together
terms which are collocated. Defaults to a space: ' '.}
}
\value{
a data.frame with columns 
\itemize{
\item ngram: the number of terms which are combined
\item collocation: the terms which are combined
\item left: the left term of the collocation
\item right: the right term of the collocation
\item n: the number of times the collocation occurred in the data
\item n_left: the number of times the left element of the collocation occurred in the data
\item n_right: the number of times the right element of the collocation occurred in the data
\item pmi: the pointwise mutual information
\item md: mutual dependency
\item lfmd: log-frequency biased mutual dependency
}
}
\description{
Collocations are sequences of words or terms that co-occur more often than would be expected by chance.
Common collocations are adjectives + nouns, nouns followed by nouns, verbs and nouns, adverbs and adjectives,
verbs and prepositional phrases or verbs and adverbs.\cr
This function extracts relevant collocations and computes the following statistics on them,
which indicate how likely two terms are to be collocated rather than independent.
\itemize{
  \item PMI (pointwise mutual information): log2(P(w1w2) / (P(w1) P(w2)))
  \item MD (mutual dependency): log2(P(w1w2)^2 / (P(w1) P(w2)))
  \item LFMD (log-frequency biased mutual dependency): MD + log2(P(w1w2))
}
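As a worked illustration of the three statistics, assume purely hypothetical probabilities 
(chosen for the arithmetic, not estimated from any data set, and not a statement of how the 
function estimates them): P(w1w2) = 0.01, P(w1) = 0.05 and P(w2) = 0.04. Then
\deqn{PMI = \log_2 \frac{0.01}{0.05 \times 0.04} = \log_2 5 \approx 2.32}{PMI = log2(0.01 / (0.05 * 0.04)) = log2(5) ~ 2.32}
\deqn{MD = \log_2 \frac{0.01^2}{0.05 \times 0.04} = \log_2 0.05 \approx -4.32}{MD = log2(0.01^2 / (0.05 * 0.04)) = log2(0.05) ~ -4.32}
\deqn{LFMD = MD + \log_2 0.01 \approx -4.32 - 6.64 = -10.97}{LFMD = MD + log2(0.01) ~ -4.32 - 6.64 = -10.97}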
As natural language is non-random (otherwise you wouldn't understand what I'm saying), 
most combinations of terms are significant. That is why these indicators of collocation
are merely used to order the collocations.
}
\examples{
data(brussels_reviews_anno)
x <- subset(brussels_reviews_anno, language \%in\% "fr")
colloc <- keywords_collocation(x, term = "lemma", group = c("doc_id", "sentence_id"), 
                               ngram_max = 3, n_min = 10)
head(colloc, 10)

## Example on finding collocations of nouns preceded by an adjective
library(data.table)
x <- as.data.table(x)
x[, xpos_previous := txt_previous(xpos, n = 1), by = list(doc_id, sentence_id)]
x[, xpos_next := txt_next(xpos, n = 1), by = list(doc_id, sentence_id)]
x <- subset(x, (xpos \%in\% c("NN") & xpos_previous \%in\% c("JJ")) | 
               (xpos \%in\% c("JJ") & xpos_next \%in\% c("NN")))
colloc <- keywords_collocation(x, term = "lemma", group = c("doc_id", "sentence_id"), 
                               ngram_max = 2, n_min = 2)
head(colloc)
}
