odds.n.ends vignette

Creating a small data set for demonstration

odds.n.ends was created in order to take the results from a binary logistic regression model estimated using the glm() package and compute model significance, model fit, and the odds ratios and 95% confidence intervals typically reported from binary logistic regression analyses.

The small demonstration data set includes three variables. The first is a binary outcome variable (sick) with two values, 1 and 0 where 1 represents sick and 0 represents not sick. The second is an integer representing age in years (age) as one of the predictors, and a three-category nominal variable showing smoking status (smoke).

# enter demo data
sick <- c(0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
          0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0)
age <- c(23, 25, 26, 34, 54, 46, 48, 95, 81, 42, 62, 25, 31, 49, 57, 52, 54, 63, 61, 50,
         43, 35, 26, 74, 34, 46, 43, 65, 81, 42, 62, 25, 21, 47, 51, 22, 34, 59, 26, 55)
smoke <- c('Former', 'Former', 'Former', 'Never', 'Current', 'Current', 'Current', 'Current', 'Never', 'Former', 'Never', 'Former', 'Current', 'Former', 'Never', 'Current', 'Current', 'Current', 'Former', 'Never','Former', 'Former', 'Former', 'Never', 'Current', 'Current', 'Current', 'Current', 'Never', 'Former', 'Never', 'Former', 'Current', 'Former', 'Never', 'Current', 'Current', 'Current', 'Former', 'Never')

# create data frame
smokeData <- data.frame(sick, age, smoke)

Estimating the model

The glm() function will be used to estimate a binary logistic regression model predicting the sick outcome based on age and smoke.

# estimate the logistic regression model object
logisticModel <- glm(formula = sick ~ age + smoke, data = smokeData, na.action = na.exclude, family = binomial(logit))

# print model summary for the logistic model object
summary(object = logisticModel)

## 
## Call:
## glm(formula = sick ~ age + smoke, family = binomial(logit), data = smokeData, 
##     na.action = na.exclude)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0490  -0.6251   0.3009   0.6955   1.9315  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -3.28649    1.58753  -2.070   0.0384 * 
## age          0.10442    0.03711   2.814   0.0049 **
## smokeFormer -1.12544    0.94693  -1.189   0.2346   
## smokeNever  -2.47194    1.25103  -1.976   0.0482 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 54.548  on 39  degrees of freedom
## Residual deviance: 37.896  on 36  degrees of freedom
## AIC: 45.896
## 
## Number of Fisher Scoring iterations: 5

Using odds.n.ends to get significance, fit, OR and CI

The summary contains model coefficients, coefficient significance, and deviance and AIC which are measures of lack of fit of the model. While this information is useful in determining which of the predictors is significant and whether the deviance (lack of fit) was reduced between a null model with no predictors in it and an estimated model.

# open odds.n.ends package
library(package = "odds.n.ends")

# get the  basics
odds.n.ends(mod = logisticModel)

## Waiting for profiling to be done...

## $`Logistic regression model significance`
## Chi-squared        d.f.           p 
##      16.652       3.000       0.001 
## 
## $`Contingency tables (model fit): frequency predicted`
##                 Number observed
## Number predicted  1  0 Sum
##              1   19  4  23
##              0    4 13  17
##              Sum 23 17  40
## 
## $`Count R-squared (model fit): percent correctly predicted`
## [1] 80
## 
## $`Model sensitivity`
## [1] 0.826087
## 
## $`Model specificity`
## [1] 0.7647059
## 
## $`Predictor odds ratios and 95% CI`
##                     OR       2.5 %    97.5 %
## (Intercept) 0.03738466 0.001102466 0.6610966
## age         1.11006273 1.041062741 1.2081565
## smokeFormer 0.32450861 0.045942281 2.0537937
## smokeNever  0.08442065 0.005379054 0.8158007

The results show that the model was statistically significantly better than a baseline model at explaining the outcome [\(\chi^2\)(3) = 16.652; p = .001]. The model correctly predicted 19 of those who were sick (sick = 1) and 13 of those who were not sick (sick = 0), for a total of 32 correctly predicted out of 40 (Count-\(R^2\) = .80 or 80% correctly predicted). The model was more sensitive, with 82.6% of those who were sick (the cases) correctly predicted, and less specific, with 76.5% of the members of the reference group correctly predicted. Age was a statistically significant predictor of the outcome; for every one year increase in age, the odds of being sick increased by 11% (OR = 1.11; 95% CI: 1.04 - 1.21). There was no statistically significant difference in odds of being sick for former smokers compared to current smokers. Never smokers had 92% lower odds of being sick compared to current smokers; this decrease was statistically significant (OR = .08; 95% CI: .005 - .82).

The odds.n.ends package has several additional options including the ability to get an ROC curve (use option rocPlot = TRUE) and histograms of predicted probabilities (use option predProbPlot = TRUE). Colors for these plots can be set with options color1 = and color2 =. Finally, the threshold for a predicted probability being counted as a case (outcome = 1) has a default value of .5, so any predicted probability that is .5 or higher will be counted as a case, and any predicted probability below .5 will be counted as a reference group member (outcome = 0). This threshold can be adjusted using the thresh = argument.