\name{whatif}
\alias{whatif}
\title{Counterfactual Evaluation}
\description{
  Implements the methods described in King and Zeng (2006, 2007) for
  evaluating counterfactuals.
}
\usage{
whatif(formula = NULL, data, cfact, range = NULL, freq = NULL, nearby = 1, 
distance = "gower", miss = "list", choice = "both", return.inputs = FALSE, 
return.distance = FALSE, ...)
}
\arguments{
  \item{formula}{An optional formula without a dependent variable that
    is of class "formula" and that follows standard \code{R}
    conventions for formulas, e.g., \code{~ x1 + x2}.  Allows you to
    transform or otherwise re-specify combinations of the variables in
    both \code{data} and \code{cfact}.  To use this
    parameter, both \code{data} and \code{cfact} must be coercible
    to data frames; the variables of both \code{data} and
    \code{cfact} must be labeled; and all variables appearing in
    \code{formula} must also appear in both \code{data} and
    \code{cfact}.  Otherwise, errors are returned.  The intercept is
    automatically dropped.  Default is \code{NULL}.}
  \item{data}{May take one of the following forms:  
    \enumerate{
    \item An \code{R} model output object, such as the output from calls to
      \code{lm}, \code{glm}, and \code{zelig}.  Such an output
      object must be a list.  It must additionally have either a \code{formula}
      or \code{terms} component and either a \code{data} or \code{model} component; if it
      does not, an error is returned.  Of the latter, \code{whatif}
      first looks for \code{data}, which should contain either the original
      data set supplied as part of the model call (as in \code{glm})
      or the name of this data set (as in \code{zelig}), which is
      assumed to reside in the global environment.  If \code{data} does not
      exist, \code{whatif} then looks for \code{model}, which should
      contain the model frame (as in \code{lm}).  The intercept is
      automatically dropped from the extracted \emph{observed covariate data}
      set if the original model included one.  
    \item An \eqn{n}-by-\eqn{k} non-character (logical or numeric) matrix or
      data frame of \emph{observed covariate data} with \eqn{n} data points
      or units and \eqn{k} covariates.  All desired variable transformations
      and interaction terms should be included in this set of \eqn{k}
      covariates unless \code{formula} is alternatively used to
      produce them.  However, an intercept should not be.  Such a matrix
      may be obtained by passing model output (e.g., output from a call
      to \code{lm}) to \code{model.matrix} and excluding the intercept
      from the resulting matrix if one was
      fit.  Note that \code{whatif} will attempt to coerce data frames
      to their internal numeric values.  Hence, data frames should only
      contain logical, numeric, and factor columns; character columns
      will lead to an error being returned.
    \item A string.  Either the complete path (including file name) of the
      file containing the data or the path relative to your working
      directory.  This file should be a white space delimited text file.
      If it contains a header, you must include a column of row names as
      discussed in the help file for the \code{R} function
      \code{read.table}.  The data in the file should be as otherwise
      described in (2).
    } 
    Missing data is allowed and will be dealt with
    via the argument \code{miss}.  It should be flagged using
    \code{R}'s standard representation for missing data, \code{NA}.}
  \item{cfact}{An \code{R} object or a string.  If an \code{R} object,
    an \eqn{m}-by-\eqn{k} non-character matrix or data frame of
    \emph{counterfactuals} with \eqn{m} counterfactuals and the same \eqn{k}
    covariates (in the same order) as in \code{data}.  However, if
    \code{formula} is used to select a subset of the \eqn{k} covariates,
    then \code{cfact} may contain either only these \eqn{j \leq k}{j <= k}
    covariates or the complete set of \eqn{k} covariates.  An intercept
    should not be included as one of the covariates.  It will be
    automatically dropped from the counterfactuals generated by
    \pkg{Zelig} if the original model contained one.  Data frames
    will again be coerced to their internal numeric values if possible.
    If a string, either the complete path (including file name) of the
    file containing the counterfactuals or the path relative to your
    working directory.  This file should be a white space delimited text
    file.  See the discussion under \code{data} for instructions on
    dealing with a header.  All counterfactuals should be fully
    observed: if you supply counterfactuals with missing data, they will
    be list-wise deleted and a warning message will be printed to the screen.}
  \item{range}{An optional numeric vector of length \eqn{k}, where \eqn{k} is 
    the number of covariates.  Each element represents the range of the corresponding
    covariate for use in calculating Gower distances.  Use this argument
    when the covariate data do not represent the population of interest,
    for example because units were selected by stratification or
    experimental manipulation.
    By default, the range of each covariate is calculated
    from the data (the difference of its maximum and minimum values in
    the sample), which is appropriate when a simple random sampling
    design was used.  To supply your own range for the \eqn{j}th covariate,
    set the \eqn{j}th element of the vector equal to the desired range
    and all other elements equal to \code{NA}.  Default is \code{NULL}.}
  \item{freq}{An optional numeric vector of any positive length, the elements
    of which comprise a set of distances.  Used in calculating
    cumulative frequency distributions for the distances of the data
    points from each counterfactual.  For each such distance and
    counterfactual, the cumulative frequency is the fraction of observed
    covariate data points with distance to the counterfactual less
    than or equal to the supplied distance value.  The default varies
    with the distance measure used.  When the Gower distance measure is employed,
    frequencies are calculated for the sequence of Gower distances from
    0 to 1 in increments of 0.05.  When the Euclidean distance measure
    is employed, frequencies are calculated for the sequence of Euclidean
    distances from the minimum to the maximum observed distances in twenty
    equal increments, all rounded to two decimal places.  Default is \code{NULL}.}
  \item{nearby}{An optional scalar indicating
    which observed data points are considered to be nearby (i.e., within
    \code{nearby} geometric variabilities of) the counterfactuals.  Used to
    calculate the summary statistic returned by the function: the fraction
    of the observed data nearby each counterfactual.  By default, one
    geometric variability of the covariate data is used.  For example,
    setting \code{nearby} to 2 will identify the proportion of data points
    within two geometric variabilities of a counterfactual.  Default is 1.}
  \item{distance}{An optional string indicating which of two distance measures
    to employ.  The choices are \code{"gower"}, Gower's non-parametric
    distance measure (\eqn{G^2}), which is suitable for both qualitative
    and quantitative data; and \code{"euclidian"}, squared Euclidean distance,
    which is only suitable for quantitative data.  The default is \code{"gower"}.}
  \item{miss}{An optional string indicating the strategy for dealing
    with missing data in the observed covariate data set.
    \code{whatif} supports two possible missing data strategies:
    \code{"list"}, list-wise deletion of missing cases; and \code{"case"},
    ignoring missing data case-by-case.  Note that if \code{"case"} is
    selected, cases with missing values are deleted listwise for the
    convex hull test and for computing Euclidean distances, but pairwise
    deletion is used in computing the Gower distances to make maximal use
    of the available information.  The user is strongly encouraged to treat
    missing data using specialized tools such as \pkg{Amelia} prior to
    feeding the data to \code{whatif}.  Default is \code{"list"}.}
  \item{choice}{An optional string indicating which analyses to 
    undertake. The options are either \code{"hull"}, only perform the convex hull 
    membership test; \code{"distance"}, do not perform the convex
    hull test but do everything else, such as calculating the distance between
    each counterfactual and data point; or \code{"both"}, undertake both the
    convex hull test and the distance calculations (i.e., do everything).
    Default is \code{"both"}.}
  \item{return.inputs}{A Boolean; should the processed observed
    covariate and counterfactual data matrices on which all
    \code{whatif} computations are performed be returned?  Processing
    refers to internal \code{whatif} operations such as the subsetting
    of covariates via \code{formula}, the deletion of cases with
    missing values, and the coercion of data frames to numeric matrices.
    Primarily intended for diagnostic purposes.  If \code{TRUE}, these matrices
    are returned as a list.  Default is \code{FALSE}.}
  \item{return.distance}{A Boolean; should the matrix of distances
  between each counterfactual and data point be returned?  If
  \code{TRUE}, this matrix is returned as part of the output; if
  \code{FALSE}, it is not.  Default is \code{FALSE} due to the large
  size that this matrix may attain.}
  \item{...}{Further arguments passed to and from other methods.}
}
\details{
  This function is the primary tool for evaluating your counterfactuals.  
  Specifically, it:
  \enumerate{
    \item Determines whether or not your counterfactuals are in the
      convex hull of the observed covariate data.  
    \item Computes the distance of your counterfactuals from each of the \eqn{n}
      observed covariate data points.  The default distance function used is Gower's 
      non-parametric measure.
    \item Computes a summary statistic for each counterfactual based on 
      the distances in (2):  the fraction of observed covariate data points with 
      distances to your counterfactual less than a value you supply.  By
      default, this value is taken to be the geometric variability of the observed
      data.
    \item Computes the cumulative frequency distribution of each counterfactual
      for the distances in (2) using values that you supply.  By default, Gower
     distances from 0 to 1 in increments of 0.05 are used.
  }
}
\value{
  An object of class "whatif", a list consisting of the following
  five to seven elements (\code{inputs} and \code{dist} are optional):
  \item{call}{The original call to \code{whatif}.}
  \item{inputs}{A list with two elements, \code{data} and \code{cfact}.  Only
    present if \code{return.inputs} was set equal to \code{TRUE} in the call
    to \code{whatif}.  The first element is the processed observed
    covariate data matrix on which all \code{whatif} computations were
    performed.  The second element is the processed counterfactual data
    matrix.}
  \item{in.hull}{A logical vector of length \eqn{m}, where \eqn{m} is the number
    of counterfactuals.  Each element of the vector is \code{TRUE} if the corresponding
    counterfactual is in the convex hull and \code{FALSE} otherwise.}
  \item{dist}{An \eqn{m}-by-\eqn{n} numeric matrix, where \eqn{m} is
    the number of counterfactuals and \eqn{n} is the number of data points
    (units).  Only present if \code{return.distance} was set equal to \code{TRUE}
    in the call to \code{whatif}.  The \eqn{[i, j]}th entry of the matrix contains the
    distance between the \eqn{i}th counterfactual and the \eqn{j}th data point.}
  \item{geom.var}{A scalar.  The geometric variability of the observed covariate
    data.}
  \item{sum.stat}{A numeric vector of length \eqn{m}, where \eqn{m} is the
    number of counterfactuals.  The \eqn{i}th element contains the summary
    statistic for the \eqn{i}th counterfactual.  This summary statistic is
    the fraction of data points with distances to the counterfactual
    less than \code{nearby} geometric variabilities of the covariates
    (by default, one geometric variability).}
  \item{cum.freq}{A numeric matrix.  By default, the matrix has
    dimension \eqn{m}-by-21, where \eqn{m} is the number of
    counterfactuals; however, if you supplied your own frequencies via
    the argument \code{freq}, the matrix has dimension \eqn{m}-by-\eqn{f},
    where \eqn{f} is the length of \code{freq}.  Each row of the
    matrix contains the cumulative frequency distribution for the
    corresponding counterfactual calculated using either the distance 
    measure-specific default set of distance values or the set that you supplied (see 
    the discussion under the argument \code{freq}).  Hence, the \eqn{[i, j]}th
    entry of the matrix is the fraction of data points with 
    distances to the \eqn{i}th counterfactual less than or equal to the
    value represented by the \eqn{j}th column.  The column names contain these
    values.}
}
\references{King, Gary and Langche Zeng.  2006.  "The Dangers of 
  Extreme Counterfactuals."  \emph{Political Analysis} 14 (2).
  Available from \url{http://gking.harvard.edu}.

  King, Gary and Langche Zeng.  2007.  "When Can History Be Our Guide?
  The Pitfalls of Counterfactual Inference."  \emph{International Studies Quarterly}
  51 (March).  Available from \url{http://gking.harvard.edu}.}
\author{Stoll, Heather \email{hstoll@polsci.ucsb.edu}, King, Gary
  \email{king@harvard.edu} and Zeng, Langche \email{zeng@ucsd.edu}}
\note{This function requires the \pkg{lpSolve} package.}
\seealso{
  \code{\link{plot.whatif}},
  \code{\link{summary.whatif}},
  \code{\link{print.whatif}},
  \code{\link{print.summary.whatif}}
}
\examples{
##  Create example data sets and counterfactuals
set.seed(1)  # make the random example reproducible
my.cfact <- matrix(rnorm(3*5), ncol = 5)
my.data <- matrix(rnorm(100*5), ncol = 5)

##  Evaluate counterfactuals
my.result <- whatif(data = my.data, cfact = my.cfact)
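
##  Inspect the elements described under Value; a minimal sketch using
##  the element and method names documented in this help file
my.result$in.hull    # is each counterfactual in the convex hull?
my.result$sum.stat   # fraction of data points nearby each counterfactual
summary(my.result)   # see summary.whatif

##  Use formula to select a subset of covariates and request the
##  distance matrix; the data frames and their column names x1, x2, x3
##  are hypothetical and serve only as an illustration
df.data <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
df.cfact <- data.frame(x1 = rnorm(3), x2 = rnorm(3), x3 = rnorm(3))
my.subset <- whatif(formula = ~ x1 + x2, data = df.data, cfact = df.cfact,
                    return.distance = TRUE)
dim(my.subset$dist)  # documented as counterfactuals by data points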

##  Evaluate counterfactuals and supply your own Gower distances for
##  the cumulative frequency distributions
my.result <- whatif(cfact = my.cfact, data = my.data, freq = c(0, .25, .5, 1, 1.25, 1.5))
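
##  Illustration only, in base R rather than the package internals.
##  Assumption (not stated in this help file): the Gower distance is the
##  mean over covariates of the absolute difference divided by the
##  covariate's range.  As described under freq, the cumulative frequency
##  at a cutoff is the share of data points with distance <= that cutoff.
##  The helper gower.sketch and the toy values below are hypothetical.
gower.sketch <- function(dat, cf, rng) colMeans(abs(t(dat) - cf) / rng)
toy.data <- rbind(c(0, 0), c(1, 10), c(0.5, 5))
toy.cfact <- c(0, 5)
toy.range <- apply(toy.data, 2, max) - apply(toy.data, 2, min)
toy.dist <- gower.sketch(toy.data, toy.cfact, toy.range)
toy.dist                 # one distance per row of toy.data
mean(toy.dist <= 0.5)    # cumulative frequency at a cutoff of 0.5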
}
\keyword{htest}
\keyword{models}
\keyword{regression}
