Conditional moments tests where proposed by Tauchen (1985) and Newey (1985). As these tests are quite alike Lagrange multiplier (or score) tests, which is, along with the likelihood ratio and the Wald tests one of the three “classical” tests. These three tests have in common to test a set of hypothesis which enables to go from a unconstrained (or larger) model to a constrained (or smaller) model:

  • the Wald test is based on the values of the vector implied the restrictions evaluated for the constrained model,
  • the likelihood ratio is based on the comparison of the fits of both models (measured by the value of the log-likelihood function),
  • the score test is based on the value of the score (ie the vector of the first derivatives) of the log-likelihood of the unconstrained models using the estimates of the constrained model.

1 The score (or Lagrange multiplier) test

For the special case where the hypothesis can be writen as a linear combination of the coefficients, the unconstrained model can always be parametrized in a way such that the likelihood function of the larger model is: \(\ln L (\beta_1, \beta_2)\), with under H0: \(\beta_2 = 0\) and the larger model reduces to the smaller one. For the score test, the statistic is based on the gradient of the log-likelihood computed with the parameters obtained by estimating the constrained model: \(\hat{\beta}_1 ^c, 0\), which is:

\[ \frac{\partial \ln L}{\partial \beta}(\hat{\beta}_1^c, 0) = \left( \begin{array}{c} \frac{\partial \ln L}{\partial \beta_1}(\hat{\beta}_1^c, 0) \\ \frac{\partial \ln L}{\partial \beta_2}(\hat{\beta}_1^c, 0) \end{array} \right) = \left( \begin{array}{c} 0 \\ \frac{\partial \ln L}{\partial \beta_2}(\hat{\beta}_1^c, 0) \end{array} \right) \]

As an example, consider a linear or log-linear gaussian model for which one wishes to test the normality hypothesis. For the smaller model, we have:

\[ y_n = \alpha + \beta x_n + \epsilon_n \mbox{ with } \epsilon_n \sim N(0, \sigma_\epsilon) \]

for the linear model and:

\[ \ln y_n = \alpha + \beta x_n + \epsilon_n \mbox{ with } \epsilon_n \sim N(0, \sigma_\epsilon) \]

for the log-linear model. A potential larger model consists on taking a Box-Cox transformation of \(y\), denoted \(y^{(\lambda)}\):

\[ y^{(\lambda)} = \alpha + \beta x_n + \epsilon_n \mbox{ with } \epsilon_n \sim N(0, \sigma_\epsilon) \]

with:

\[ y^{(\lambda)} = \left\{ \begin{array}{lcl} \frac{y_n^\lambda - 1}{\lambda} & \mbox{if} & \lambda \neq 0 \\ \ln y_n & \mbox{if} & \lambda = 0 \end{array} \right. \]

which obviously reduce to the smaller log-linear model for \(\lambda = 0\). For \(\lambda = 1\), \(y ^ {(\lambda)} = y - 1\) and the larger model reduce to the linear model except that \(1\) should be substracted from the left hand side of the equation, which means that the intercept of the linear model fitted by ordinary least squares should be reduced by 1.

The log-likelihood of the larger model is:

\[ \ln L (\beta, \sigma, \lambda) = -\frac{N}{2}\ln 2\pi - N \ln \sigma - (1 - \lambda) \sum_{n=1}^N \ln y_n - \frac{1}{2\sigma ^ 2}\sum_{n=1}^N\left(y^{(\lambda)} - \beta^\top x_n\right) ^ 2 \]

The gradient of the log-likelihood is:

\[ \frac{\partial \ln L}{\partial \theta}(\theta) = \left( \begin{array}{c} \frac{\partial \ln L}{\partial \beta}\\ \frac{\partial \ln L}{\partial \sigma}\\ \frac{\partial \ln L}{\partial \lambda}\\ \end{array} \right) = \left( \begin{array}{c} \frac{1}{\sigma ^ 2} \sum_{n=1}^N \left(y^{(\lambda)} - \beta^\top x_n\right) x_n\\ -\frac{N}{\sigma} + \frac{1}{\sigma^3}\sum_{n=1}^N \left(y^{(\lambda)} - \beta^\top x_n\right) ^ 2\\ \sum_{n=1} ^ N \ln y_n - \frac{1}{\sigma ^ 2}\sum_{n=1}^N \left(y_n ^ {(\lambda)} - \beta^\top x_n\right) \left(y_n^{(\lambda)} \ln y_n - \frac{y_n ^ {(\lambda)} - \ln y_n}{\lambda}\right) \end{array} \right) \]

For the constrained linear (\(\hat{\lambda}_{i} = 1\)) and log-linear (\(\hat{\lambda}_o = 0\)) models, we have respectively:

\[ \frac{\partial \ln L}{\partial \theta}(\hat{\theta}_i) = \left( \begin{array}{c} 0 \\ 0 \\ \sum_{n=1} ^ N \ln y_n - \frac{1}{2\hat{\sigma}_c ^ 2}\sum_{n=1}^N \left(\ln y_n - \hat{\beta}_c^\top x_n\right) \left(1 + y(\ln y - 1)\right) \end{array} \right) \]

\[ \frac{\partial \ln L}{\partial \theta}(\hat{\theta}_o) = \left( \begin{array}{c} 0 \\ 0 \\ \sum_{n=1} ^ N \ln y_n - \frac{1}{2\hat{\sigma}_c ^ 2}\sum_{n=1}^N \left(\ln y_n - \hat{\beta}_c^\top x_n\right) \ln y_n^2 \end{array} \right) \]

The following function returns the log-likelihood for a box-cox model, given a vector of parameters coef, a matrix of covariates X and a vector of response y. It has further arguments:

  • sum: if FALSE, invidual contributions to the log-likelihood (a vector of length \(N\)) and to the gradient (a matrix of dimensions \(N\times K\) is returned,
  • gradient: if TRUE the analytical gradient is computed and returned as an attribute,
  • hessian: if TRUE the analytical hessian is computed and returned as an attribute.
lnL_bc <- function(coef, X, y, sum = FALSE, gradient = FALSE, hessian = FALSE){
    y <- y
    N <- length(y)
    K <- length(coef) - 3
    beta <- coef[1:(K + 1)]
    sigma <- coef[K + 2]
    lambda <- coef[K + 3]
    bX <- as.numeric(X %*% beta)
    if (lambda == 0){
        bc <- log(y)
        bc_l  <- 1 / 2 * log(y) ^ 2
        bc_ll <- 1 / 3 * log(y) ^ 3
    }
    else{
        bc <- (y ^ lambda - 1) / lambda
        bc_l <- bc * log(y) - (bc - log(y)) / lambda
        bc_ll <- bc_l * log(y) - (bc_l * lambda - (bc - log(y))) / lambda ^ 2
    }
    e <- bc - bX
    lnL <- - 1 / 2 * log(2 * pi) - log(sigma) - (1 - lambda) * log(y) -
        1 / (2 * sigma ^ 2) * e ^ 2

    if (gradient){
        g_beta <- 1 / sigma ^ 2 * e * X
        g_sigma <- - 1 / sigma + 1 / (sigma ^ 3) * e ^ 2
        g_lambda <- log(y) - 1 / (sigma ^ 2) * e * bc_l
        G <- cbind(g_beta, g_sigma,  g_lambda)
    }
    if (hessian){
        h_bb <- - 1 / sigma ^ 2 * crossprod(X)
        h_bs <- apply(- 2 / sigma ^ 3 * X * e, 2, sum)
        h_bl <- apply( 1 / sigma ^ 2 * bc_l * X, 2, sum)
        h_ss <- sum(1 / sigma ^ 2 - 3 / sigma ^ 4 * e ^ 2)
        h_sl <- sum(2 / sigma ^ 3 * e * bc_l)
        h_ll <- sum(- 1 / sigma ^ 2 * (e * bc_ll + bc_l ^ 2))
        H <- rbind(cbind(h_bb, sigma = h_bs, lambda = h_bl),
                   sigma = c(h_bs, h_ss, h_sl),
                   lambda = c(h_bl, h_sl, h_ll))
    }
    if (sum){
        lnL <- sum(lnL)
        if (gradient) G <- apply(G, 2, sum)
    }
    if (gradient) attr(lnL, "gradient") <- G
    if (hessian) attr(lnL, "hessian") <- H
    lnL
}

Consider as an example the cars data set which is provided by base R. The response is dist (the stoping distance) and the only covariate is speed (the speed of the car). The smaller linear and log-linear models can be computed using lm:

lin_lm <- lm(dist ~ speed + I(speed ^ 2), cars)
log_lm <- update(lin_lm, log(.) ~ .)

as stated previously, the linear model is a special case of the box-cox model if one is substracted from the intercept.

coefs_lin <- coef(lin_lm)
coefs_lin[1] <- coefs_lin[1] - 1

To use the lnL_bc function, we need to extract the response vector and the covariates matrix:

X <- model.matrix(lin_lm)
y <- model.response(model.frame(lin_lm))

Finally we can evaluate the log-likelihood of the box-cox model for the two special cases of the linear and the log-linear models using the coefficients obtained using lm (\(\sigma\) is a parameter in the log-likelihood, it is extracted from the lm models using the provided sigma2 function which differs from the sigma function by the fact that the deviance is divided by the number of observations and not the number of degrees of freedom of the regression).

sigma2 <- function(x) sigma(x) * sqrt(df.residual(x) / nobs(x))
coefs_lin <- c(coefs_lin, sigma = sigma2(lin_lm), lambda = 1)
coefs_log <- c(coef(log_lm), sigma = sigma2(log_lm), lambda = 0)
lnL_lin <- lnL_bc(coefs_lin, X, y, sum = TRUE, gradient = TRUE, hessian = FALSE)
lnL_log <- lnL_bc(coefs_log, X, y, sum = TRUE, gradient = TRUE, hessian = FALSE)
lnL_lin
## [1] -205.386
## attr(,"gradient")
##   (Intercept)         speed    I(speed^2)       g_sigma      g_lambda 
##  1.041376e-15  1.951564e-14  4.263673e-13  6.175616e-16 -2.184459e+01
lnL_log
## [1] -202.3903
## attr(,"gradient")
##   (Intercept)         speed    I(speed^2)       g_sigma      g_lambda 
## -2.936210e-13 -3.541334e-12 -4.704148e-11  2.620126e-13  2.956768e+01

According to the log-likelihood, the log-linear model fits fairly better the data than the linear one. As expected, all the elements of the gradient vector except the last one is 0. The derivative of the log-likelihood with respect with \(\lambda\) is negative for the linear model and positive for the log-linear model. This means that the log-likelihood should increase for some values of \(\lambda\) between 0 and 1. To fit the box-cox log-likelihood model, we use the maxLik::maxLik function. It is a good general practise to provide “good” (and different) starting values to unsures that the non-linear optimization process converges, and that it converges to a global maximum. We therefore run twice maxLik, with starting values corresponding to the linear and to the non-linear model:

library("maxLik")
bc_lin <- maxLik(lnL_bc, start = coefs_lin, X = X, y = y,
                 sum = TRUE, gradient = TRUE, hessian = TRUE)
bc_log <- maxLik(lnL_bc, start = coefs_log, X = X, y = y,
                 sum = TRUE, gradient = TRUE, hessian = TRUE)

We first check that the fitted model is the same for the two sets of starting values:

cbind(coef(bc_lin), coef(bc_log))
##                     [,1]         [,2]
## (Intercept)  0.213367880  0.213370726
## speed        0.582112891  0.582111088
## I(speed^2)  -0.005530642 -0.005530664
## sigma        1.365042140  1.365033776
## lambda       0.372872290  0.372870677

and we then summarize the fitted model:

summary(bc_lin)
## --------------------------------------------
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 15 iterations
## Return code 1: gradient close to zero (gradtol)
## Log-Likelihood: -197.3795 
## 5  free parameters
## Estimates:
##              Estimate Std. error t value Pr(> t)   
## (Intercept)  0.213368   1.351837   0.158 0.87459   
## speed        0.582113   0.232726   2.501 0.01237 * 
## I(speed^2)  -0.005531   0.006108  -0.905 0.36524   
## sigma        1.365042   0.650610   2.098 0.03590 * 
## lambda       0.372872   0.131795   2.829 0.00467 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------

As expected, the fitted value of \(\lambda\) lies between 0 and 1 and is statistically different from 0 and from 1. Note also that the log-likelihood of the box-cox model (-197.38) is larger than the one of the log-linear model (-202.39).

If the hypothesis corresponding to the smaller model are true, the distribution of the score vector is:

\[ g(\tilde{\theta}) \sim N(0, I(\theta)) \]

Were \(I\) is the information matrix. Therefore if H0 is true:

\[ g(\tilde{\theta}) ^ \top I(\theta) ^ {-1} g(\tilde{\theta}) \]

is a \(\chi^2\) with \(J\) degrees of freedom (\(J\) being the number of hypothesis, one in our example). \(I(\theta)=-\mbox{E}\left(\frac{\partial^2 \ln L}{\partial \theta \partial \theta^\top}(\theta)\right)\) needs to be estimate and two natural estimators are:

  • the empirical variance of the gradient: \(\sum_{n=1} ^ N g(\tilde{\theta}, x_n, y_n) g(\tilde{\theta}, x_n, y_n)^\top = G(\tilde{\theta}, X, y) ^ \top G(\tilde{\theta}, X, y)\) (also called the outer-product of the gradient), where \(G\) is an \(N\times K\) matrix for which each line is the contribution of an observation to the gradient,
  • the opposite of the hessian: \(-H(\tilde{\theta}) = -\frac{\partial^2 \ln L}{\partial \theta \partial \theta^\top}(\tilde{\theta})\).

The \(G\) matrix is obtained using the lnL_bc function with sum = FALSE and gradient = TRUE and the \(H\) matrix using hessian = TRUE.

lnL_lin <- lnL_bc(coefs_lin, X, y, sum = FALSE, gradient = TRUE, hessian = TRUE)
lnL_log <- lnL_bc(coefs_log, X, y, sum = FALSE, gradient = TRUE, hessian = TRUE)
G_lin <- attr(lnL_lin, "gradient")
H_lin <- attr(lnL_lin, "hessian")
G_log <- attr(lnL_log, "gradient")
H_log <- attr(lnL_log, "hessian")
g_lin <- apply(G_lin, 2, sum)
g_log <- apply(G_log, 2, sum)

The test statistics can then be computed using any of the two estimators of the information matrix:

as.numeric(g_lin %*% solve(- H_lin) %*% g_lin)
## [1] 22.48473
as.numeric(g_lin %*% solve(crossprod(G_lin)) %*% g_lin)
## [1] 18.08425
as.numeric(g_log %*% solve(- H_log) %*% g_log)
## [1] 8.797853
as.numeric(g_log %*% solve(crossprod(G_log)) %*% g_log)
## [1] 9.428458

The two variants of the same test statistics are very close whatever the approximation of the information matrix used. The critical value of a \(\chi^2\) with one degree of freedom at the 5% level being 3.84, the two hypothesis of linear and log normality are strongly rejected.

Note that \(g(\tilde{\theta}) = G(\tilde{\theta})^\top \iota\) where \(\iota\) is a vector of one of length \(N\). Therefore the outer product variant of the test is:

\[ \iota^\top G(\tilde{\theta}) \left[G(\tilde{\theta}) ^ \top G(\tilde{\theta})\right] ^ {-1} G(\tilde{\theta}) ^ \top \iota = \iota^\top M_G(\tilde{\theta}) \iota \]

where \(M_G(\tilde{\theta}) = G(\tilde{\theta})\left[G(\tilde{\theta}) ^ \top G(\tilde{\theta})\right] ^ {-1} G(\tilde{\theta}) ^ \top\) is the projection matrix on the columns of \(G(\tilde{\theta})\). \(M_G(\tilde{\theta}) \iota\) is therefore the fitted values of the regression of a column of ones with the columns of \(G(\tilde{\theta})\) and, as \(M_G(\tilde{\theta})\) is idempotent, the test statistic is the sum of the fitted values of this regression. The (uncetered as there is no intercept in this regression) \(R^2\) is this sum of the fitted values divided by the total variation of the response, which is equal to \(N\) (the response being a vector of ones of length \(N\)). Therefore, the test statistic can be obtained as \(N\) times the \(R^2\) of the regression of \(1\) one on the columns of \(G\):

iota <- rep(1, length(y))
lm_ones_lin <- lm(iota ~ G_lin - 1)
lm_ones_log <- lm(iota ~ G_log - 1)
summary(lm_ones_lin)$r.squared * nobs(lm_ones_lin)
## [1] 18.08425
summary(lm_ones_log)$r.squared * nobs(lm_ones_log)
## [1] 9.428458

2 The conditional moment test

The conditional moment test approach consists on constructing a vector of moments which are 0 under H0. Denoting \(\mu_n\) such a vector, we have, for the normality hypothesis:

\[ \mu_n(\theta) = \left( \begin{array}{c} \mu_{3n}\\ \mu_{4n} \end{array} \right)= \left(\begin{array}{c} \mbox{E}(\epsilon_n ^ 3) \\ \mbox{E}(\epsilon_n ^ 4 - 3 \sigma ^ 4) \end{array}\right) = 0 \]

Denoting \(\hat{\epsilon}_n = y_n - \tilde{\beta} ^ \top x_n\) or \(\hat{\epsilon}_n = \ln y_n - \tilde{\beta} ^ \top x_n\) the residuals of the linear or the log-linear model, the empirical moments are:

\[ m_n(\tilde{\theta}) = \left( \begin{array}{c} \hat{\epsilon}_n ^ 3 \\ \hat{\epsilon}_n ^ 4 - 3 \sigma ^ 4 \end{array} \right) \]

Define : \(M (\tilde{\theta}) ^ \top = \left(m_1, m_2, \ldots, m_N\right)\) an \(N\times r\) matrix for which each line is the moment vector for on observation and \(W(\theta)= \frac{1}{N}\sum_n \mbox{E}\left(\frac{\partial m_n}{\partial \theta}(\theta)\right)\) a \(K \times r\) matrix containing the derivatives of the empirical moments with the parameters of the model. Defining:

\[ Q = \left[M(\tilde{\theta}) - G(\tilde{\theta}) I(\theta) ^ {-1} W(\theta)\right]^\top \left[M(\tilde{\theta}) - G(\tilde{\theta}) I(\theta) ^ {-1} W(\theta)\right] \]

the test statistic is:

\[ m(\tilde{\theta}) ^ \top Q ^ {-1} m(\tilde{\theta}) \]

and under Ho follows a \(\chi^2\) with \(r\) (the length of \(m\)) degrees of freedom.

\(Q\) plays the role of the information matrix for the score test. As previously, different flavours of the test can be obtained depending on how \(Q\) is estimated:

  • numerically, using as previously \(\hat{I}_N = G(\tilde{\theta}) ^ \top G(\tilde{\theta})\) and \(\hat{W}_N = G(\tilde{\theta}) ^ \top M(\tilde{\theta})\),
  • using the analytical derivatives, using as previously \(\hat{I}_A = - H(\tilde{\theta})\) and \(\hat{W}_A = - \frac{\partial M}{\partial \theta}(\tilde{\theta})\)

In this latter case, the test statistic can also be obtained as \(N\) times the \(R^2\) of the regression of a column of one on the columns of \(G\) and \(M\). The following function performs the test.

cm_norm <- function(x, type = c("analytical", "opg", "reg")){
    type <- match.arg(type)
    X <- model.matrix(x)
    y <- model.response(model.frame(x))
    N <- length(y)
    K <- length(coef(x))
    coefs <- c(coef(x), sigma = sigma2(x))
    beta <- coefs[1:K]
    sigma <- coefs[K + 1]
    epsilon <- as.numeric(y - X %*% beta)
    G <- cbind(1 / sigma ^ 2 * X * epsilon,
               sigma = - 1  / sigma + 1 / sigma ^ 3 * epsilon ^ 2)
    M <- cbind(asym = epsilon ^ 3, kurt = epsilon ^ 4 - 3 * sigma ^ 4)
    m <- apply(M, 2, sum)
    if (type == "analytical"){
        I <- - rbind(cbind(- 1 / sigma ^ 2 * crossprod(X),
                           sigma = - 2 / sigma ^ 3 * apply(X * epsilon, 2, sum)),
                     sigma = c(- 2 / sigma ^ 3 * apply(X * epsilon, 2, sum),
                               1 / sigma ^ 2 - 3 / sigma ^ 4 * sum(epsilon ^ 2)))
        W <- - rbind(cbind(asym = - 3 * apply(epsilon ^ 2 * X, 2, sum),
                           kurt = - 4 * apply(epsilon ^ 3 * X, 2, sum)),
                     sigma = c(0, - 12 * N / sigma ^ 3))
    }
    if (type == "opg"){
        I <- crossprod(G)
        W <- crossprod(G, M)
    }
    if (type != "reg"){
        Q <- crossprod(M - G %*% solve(I) %*% W)
        stat <- as.numeric(m %*% solve(Q) %*% m)
    }
    else stat <- summary(lm(rep(1, N) ~ G + M - 1))$r.squared * N
    stat
}

For the linear model, we get:

cm_norm(lin_lm, "analytical")
## [1] 12.2857
cm_norm(lin_lm, "opg")
## [1] 13.41273
cm_norm(lin_lm, "reg")
## [1] 13.41273

and the normality hypothesis is therefore strongly rejected. For the log-linear model:

cm_norm(log_lm, "analytical")
## [1] 0.07940664
cm_norm(log_lm, "opg")
## [1] 0.6255081
cm_norm(log_lm, "reg")
## [1] 0.6255081

on the contrary, the normality hypothesis is not rejected.

References

Newey, Whitney K. 1985. “Maximum Likelihood Specification Testing and Conditional Moment Tests.” Econometrica 53 (5): 1047–70. https://www.jstor.org/stable/1911011.

Tauchen, George. 1985. “Diagnostic Testing and Evaluation of Maximum Likelihood Models.” Journal of Econometrics 30 (1): 415–43. https://doi.org/https://doi.org/10.1016/0304-4076(85)90149-6.