Introduction to DiD with Multiple Time Periods

Brantly Callaway and Pedro H.C. Sant’Anna

2022-07-19

Introduction

Difference-in-differences is one of the most common approaches for identifying and estimating the causal effect of participating in a treatment on some outcome.

The “canonical” version of DiD involves two periods and two groups. The untreated group never participates in the treatment, and the treated group becomes treated in the second period.

However, much applied work deals with cases where there are more than two time periods and different units can become treated at different points in time. Regardless of the number of time periods, by far the leading approach in applied work is to try to estimate the effect of the treatment using a two-way fixed effects (TWFE) linear regression. This works great in the case with two periods, but there are a number of recent methodological papers that suggest that there may be substantial drawbacks to using TWFE with multiple time periods.

This vignette briefly discusses the emerging literature on DiD with multiple time periods – both issues with standard approaches as well as remedies for these potential problems. The did package implements a number of these remedies. A vignette for how to use the did package is available here. The background article for these vignettes is Callaway and Sant’Anna (2021), “Difference-in-Differences with Multiple Time Periods”.

Background

To start with, we’ll consider some background material in this section. First, we’ll discuss DiD with two time periods and two groups – this is the “canonical” case of DiD. Second, we briefly consider issues with TWFE linear regressions when there are multiple time periods.

DiD with 2 Periods and 2 Groups

The baseline case for DiD is the one with two periods (let’s call these periods \(t\) and \(t-1\)) and two groups (a treated group and an untreated group).

Notation / Setup

Let \(Y_{it}(0)\) denote unit \(i\)’s untreated potential outcome in period \(t\) and \(Y_{it}(1)\) its treated potential outcome. Let \(D_i=1\) for units in the treated group and \(D_i=0\) for units in the untreated group. Observed outcomes are \(Y_{it-1} = Y_{it-1}(0)\) in the first period and \(Y_{it} = D_i Y_{it}(1) + (1-D_i) Y_{it}(0)\) in the second period. The parameter of interest is the average treatment effect on the treated, \[ ATT = E[Y_t(1) - Y_t(0) | D=1] \]

The main assumption in DiD designs is called the parallel trends assumption:

Parallel Trends Assumption

\[ E[Y_t(0) - Y_{t-1}(0)| D=1] = E[ Y_t(0)-Y_{t-1}(0) | D=0] \]

In words, this assumption says that the change (or “path”) in outcomes over time that units in the treated group would have experienced if they had not participated in the treatment is the same as the path of outcomes that units in the untreated group actually experienced. The parallel trends assumption allows for the level of untreated potential outcomes to differ across groups and is consistent with, for example, fixed effects models for untreated potential outcomes where the mean of the unobserved fixed effect can be different across groups.

This assumption is potentially useful because the path of untreated potential outcomes for units in the treated group (the term on the left in the above equation) is not known, but the researcher does observe the path of untreated potential outcomes for units in the untreated group (term on the right in the above equation). In fact, it is straightforward to show that, under the parallel trends assumption, the \(ATT\) is identified and given by \[ ATT = E[ Y_t - Y_{t-1}| D=1] - E[ Y_t - Y_{t-1}| D=0] \]
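
The identification argument can be spelled out in a few lines. The sketch below assumes the usual definition \(ATT = E[Y_t(1) - Y_t(0) | D=1]\), that nobody is treated in period \(t-1\) (so \(Y_{t-1} = Y_{t-1}(0)\) for every unit), and that observed outcomes in period \(t\) are \(Y_t(1)\) for the treated group and \(Y_t(0)\) for the untreated group. The second equality adds and subtracts \(E[Y_{t-1}(0) | D=1]\), the third uses the parallel trends assumption, and the last replaces potential outcomes with observed outcomes.

```latex
\begin{aligned}
ATT &= E[Y_t(1) - Y_t(0) \mid D=1] \\
    &= E[Y_t(1) - Y_{t-1}(0) \mid D=1] - E[Y_t(0) - Y_{t-1}(0) \mid D=1] \\
    &= E[Y_t(1) - Y_{t-1}(0) \mid D=1] - E[Y_t(0) - Y_{t-1}(0) \mid D=0] \\
    &= E[Y_t - Y_{t-1} \mid D=1] - E[Y_t - Y_{t-1} \mid D=0]
\end{aligned}
```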

That is, the \(ATT\) is the mean change in outcomes over time experienced by units in the treated group minus the mean change in outcomes over time experienced by units in the untreated group; the latter term, under the parallel trends assumption, is what the path of outcomes for units in the treated group would have been if they had not participated in the treatment.
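
As a concrete illustration of this formula, here is a minimal Python sketch that computes the difference-in-differences estimate from unit-level data. The function name `did_2x2` and the toy numbers are hypothetical and chosen only for illustration; they are not part of the did package.

```python
from statistics import mean


def did_2x2(rows):
    """Two-period, two-group DiD estimate.

    rows: list of (d, y_pre, y_post) tuples, where d is 1 for the treated
    group and 0 for the untreated group.
    """
    # Mean change in outcomes over time for each group.
    change_treated = mean(y_post - y_pre for d, y_pre, y_post in rows if d == 1)
    change_untreated = mean(y_post - y_pre for d, y_pre, y_post in rows if d == 0)
    # Treated change adjusted by the untreated change.
    return change_treated - change_untreated


# Hypothetical toy data: (treated indicator, outcome in t-1, outcome in t).
toy_rows = [
    (1, 10.0, 15.0),
    (1, 12.0, 17.0),
    (0, 4.0, 6.0),
    (0, 6.0, 8.0),
]

att_hat = did_2x2(toy_rows)
print(att_hat)
```

In this toy data the treated group starts at a higher level than the untreated group, which the parallel trends assumption allows. The treated group's mean change is 5, the untreated group's mean change is 2, and the estimate is their difference, 3.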

Two-way fixed effects regressions

Now let’s move to a more general case where there are \(\mathcal{T}\) total time periods. Denote particular time periods by \(t\) where \(t=1,\ldots,\mathcal{T}\).

By far the most common approach to trying to estimate the effect of a binary treatment in this setup is the TWFE linear regression. This is a regression like \[ Y_{it} = \theta_t + \eta_i + \alpha D_{it} + v_{it} \] where \(\theta_t\) is a time fixed effect, \(\eta_i\) is a unit fixed effect, \(D_{it}\) is a treatment dummy variable, \(v_{it}\) are time varying unobservables that are mean independent of everything else, and \(\alpha\) is presumably the parameter of interest. \(\alpha\) is often interpreted as the average effect of participating in the treatment.

Although this is essentially a standard approach in applied work, there are a number of recent papers that point out potentially severe drawbacks of using the TWFE estimation procedure. These include: Borusyak and Jaravel (2018), Goodman-Bacon (2021), de Chaisemartin and D’Haultfoeuille (2020), and Sun and Abraham (2021).

When will TWFE work?

  1. Effects really aren’t heterogeneous. If the effect of participating in the treatment really is \(\alpha\) for all units, TWFE will work great. That being said, in many applications, treatment effects are very likely to be heterogeneous – they may vary across different units or exhibit dynamics or change across different time periods. In particular applications, this is worth thinking about, but, in our view, heterogeneous effects of participating in some treatment are the leading case.

  2. There are only two time periods. This is the canonical case (2 periods, one group becomes treated in the second period, the other is never treated). In this case, under parallel trends and no anticipation, \(\alpha\) is going to be numerically equal to the \(ATT\). In other words, in this case, even though it looks like you have restricted the effect of participating in the treatment to be the same across all units, TWFE exhibits robustness to treatment effect heterogeneity. Unfortunately, this robustness to treatment effect heterogeneity does not continue to hold when there are more periods and groups become treated at different points in time.
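
The two-period case can be checked numerically. With two periods, first-differencing removes the unit fixed effect, so the TWFE coefficient coincides with the OLS slope from regressing the change in outcomes on the treatment-group dummy. The sketch below uses hypothetical toy data with heterogeneous unit-level effects (1, 3 and 8) and a common time trend; the helper `ols_slope` is an illustrative name, not a did package function.

```python
from statistics import mean


def ols_slope(x, y):
    """Slope coefficient from an OLS regression of y on x with an intercept."""
    x_bar, y_bar = mean(x), mean(y)
    num = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    den = sum((xi - x_bar) ** 2 for xi in x)
    return num / den


# Hypothetical toy data: three treated units with different effects,
# two untreated units, and a common trend of 2 between the periods.
d = [1, 1, 1, 0, 0]
effects = [1.0, 3.0, 8.0, 0.0, 0.0]
y_pre = [5.0, 7.0, 9.0, 1.0, 2.0]
y_post = [yp + 2.0 + eff for yp, eff in zip(y_pre, effects)]

# First-differenced outcomes; the unit fixed effect drops out.
delta_y = [post - pre for pre, post in zip(y_pre, y_post)]

alpha_hat = ols_slope(d, delta_y)
did_hat = (mean(dy for di, dy in zip(d, delta_y) if di == 1)
           - mean(dy for di, dy in zip(d, delta_y) if di == 0))
print(alpha_hat, did_hat)
```

Both numbers equal 4 (up to floating-point rounding), which is the average of the three unit-level effects. Heterogeneity across units does no harm here because there is only one comparison available: treated versus untreated.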

Why is TWFE not robust to treatment effect heterogeneity?

Entire papers have been written about this; see, e.g., Borusyak and Jaravel (2018), Goodman-Bacon (2021), de Chaisemartin and D’Haultfoeuille (2020), and Sun and Abraham (2021). But here is the short version: in a TWFE regression, units whose treatment status doesn’t change over time serve as the comparison group for units whose treatment status does change over time. With multiple time periods and variation in treatment timing, some of these comparisons are:

  1. newly treated units relative to “never treated” units;

  2. newly treated units relative to “not-yet treated” units;

  3. newly treated units relative to “already treated” units.

The first two of these comparisons are good (or at least in the spirit of DiD) in that they take the path of outcomes experienced by units that become treated and adjust it by the path of outcomes experienced by units that are not participating in the treatment. The third comparison is different though: it adjusts the path of outcomes for newly treated units by the path of outcomes for already treated units. But the path of outcomes for already treated units is not a path of untreated potential outcomes: it includes treatment effect dynamics. Thus, these dynamics appear in \(\alpha\), making it very hard to give a clear causal interpretation.

And this issue can have potentially severe consequences. For example, it is possible to come up with examples where the effect of participating in the treatment is positive for all units in all time periods, but the TWFE estimation procedure leads to estimating a negative effect of participating in the treatment. Even in the case where “negative weights” can be ruled out, \(\alpha\) recovers a weighted average of \(ATT\)s whose weights are hard to interpret.
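
One such example can be built by hand. The sketch below is a hypothetical, noiseless panel with three periods, an early group treated from period 2 and a late group treated in period 3, and no never-treated units. Untreated potential outcomes are zero everywhere, so observed outcomes are exactly the treatment effects, all of which are positive (1, 5 and 1). For a balanced panel, the TWFE coefficient can be computed by double-demeaning the treatment dummy, which is what the illustrative helper `twfe_alpha` does.

```python
from statistics import mean


def twfe_alpha(d, y):
    """TWFE coefficient on d for a balanced panel.

    d[i][t] and y[i][t] are nested lists indexed by unit i and period t.
    The treatment dummy is residualized on unit and time fixed effects by
    double-demeaning, and y is then regressed on that residual.
    """
    n_units, n_periods = len(d), len(d[0])
    unit_mean = [mean(row) for row in d]
    time_mean = [mean(d[i][t] for i in range(n_units)) for t in range(n_periods)]
    grand_mean = mean(unit_mean)  # equals the overall mean in a balanced panel
    num = den = 0.0
    for i in range(n_units):
        for t in range(n_periods):
            d_tilde = d[i][t] - unit_mean[i] - time_mean[t] + grand_mean
            num += d_tilde * y[i][t]
            den += d_tilde ** 2
    return num / den


# Row 0: early group (treated in periods 2 and 3), effects 1 then 5.
# Row 1: late group (treated in period 3 only), effect 1.
d = [[0, 1, 1],
     [0, 0, 1]]
y = [[0.0, 1.0, 5.0],
     [0.0, 0.0, 1.0]]

alpha_hat = twfe_alpha(d, y)
print(alpha_hat)
```

Here the TWFE coefficient is approximately -1 even though every treatment effect is positive. The culprit is the comparison of the late group with the already treated early group between periods 2 and 3: the early group's effect grows from 1 to 5 over that window, and that growth is subtracted from the late group's change.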

Treatment Effects in Difference in Differences Designs with Multiple Periods

In light of the potential problems with TWFE regressions in DiD designs with multiple periods, are there alternative approaches that can be used in this case?

Yes, and it turns out that it is not all that complicated! It is just a matter of using the “good/desirable” comparisons between groups instead of all possible comparisons.

To fix ideas, let’s provide some extended notation and be clear about the identifying assumptions that we are going to make.

Notation

Let \(Y_{it}(0)\) denote unit \(i\)’s untreated potential outcome in period \(t\), and \(Y_{it}(g)\) its potential outcome in period \(t\) if it first becomes treated in period \(g\). \(G_i\) is the period in which unit \(i\) first becomes treated (this defines the “group” a unit belongs to), \(C_i = 1\) for units that never participate in the treatment, and \(D_{it} = 1\) if unit \(i\) has been treated by period \(t\) and \(D_{it} = 0\) otherwise.

Main Assumptions

Staggered Treatment Adoption Assumption

Recall that \(D_{it} = 1\) if a unit \(i\) has been treated by time \(t\) and \(D_{it}=0\) otherwise. Then, for \(t=1,...,\mathcal{T}-1\), \(D_{it} = 1 \implies D_{i,t+1} = 1\).

Staggered treatment adoption implies that once a unit participates in the treatment, it remains treated. In other words, units do not “forget” about their treatment experience. This is a leading case in many applications in economics. For example, it would be the case for policies that roll out to different locations over some period of time. It would also be the case for many unit-level treatments that have a “scarring” effect. For example, in the context of job training, many applications define treatment as having ever participated in the program.
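
In practice it can be useful to verify staggered adoption in the data and to recover each unit's group at the same time. The following is a minimal sketch under the assumption that each unit's treatment history is stored as a list of 0/1 values for periods \(1,\ldots,\mathcal{T}\); the function name `first_treated_period` is hypothetical.

```python
def first_treated_period(d_path):
    """Return the 1-based period in which a unit is first treated.

    d_path: list of 0/1 treatment indicators for periods 1, 2, ..., T.
    Returns None for a never-treated unit. Raises ValueError if the unit
    switches from treated back to untreated, which would violate
    staggered treatment adoption.
    """
    g = None
    for t, d in enumerate(d_path, start=1):
        if d == 1 and g is None:
            g = t  # first period with treatment
        elif d == 0 and g is not None:
            raise ValueError(f"treatment switched off in period {t}")
    return g


print(first_treated_period([0, 0, 1, 1]))  # unit first treated in period 3
print(first_treated_period([0, 0, 0, 0]))  # never-treated unit
```

A unit with history `[0, 1, 0]` would raise an error, since it leaves the treatment after entering it.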

Within the DiD context, we believe it is hard to analyze non-staggered treatment setups without further restricting treatment effect heterogeneity across time, groups, treatment sequences, etc. That is the main reason we focus on this leading case.

Parallel Trends Assumption based on never-treated units

For all \(g=2,...,\mathcal{T}\), \(t=2,...,\mathcal{T}\) with \(t \ge g\), \[ E[ Y_t(0) - Y_{t-1}(0) | G=g] = E[ Y_t(0) - Y_{t-1}(0)| C=1] \]

This is a natural extension of the parallel trends assumption in the two periods and two groups case. It says that, in the absence of treatment, average untreated potential outcomes for the group first treated in time \(g\) and for the “never treated” group would have followed parallel paths in all post-treatment periods \(t \ge g\).

Note that the aforementioned parallel trends assumption relies on using the “never treated” units as the comparison group for all “eventually treated” groups. This presumes that (i) a (large enough) “never-treated” group is available in the data, and (ii) these units are “similar enough” to the eventually treated units such that they can indeed be used as a valid comparison group. In situations where these conditions are not satisfied, one can use an alternative parallel trends assumption that uses the not-yet treated units as valid comparison groups.

Parallel Trends Assumption based on not-yet treated units

For all \(g=2,...,\mathcal{T}\), \(s,t=2,...,\mathcal{T}\) with \(t \ge g\) and \(s \ge t\), \[ E[ Y_t(0) - Y_{t-1}(0) | G=g] = E[ Y_t(0) - Y_{t-1}(0)| D_s=0, G\not=g] \]

In plain English, this assumption states that units that are not yet treated by time \(s\) (with \(s \ge t\)) can be used as a valid comparison group when computing the average treatment effect for the group first treated in time \(g\). In general, this assumption uses more data when constructing comparison groups. However, as noted in Marcus and Sant’Anna (2021), this assumption does restrict some pre-treatment trends across different groups. In other words, there is no free lunch.

Group-Time Average Treatment Effects

The above assumptions are natural extensions of the identifying assumptions in the two periods and two groups case to the multiple periods case.

Likewise, a natural way to generalize the parameter of interest (the ATT) from the two periods and two groups case to the multiple periods case is to define group-time average treatment effects:

\[ ATT(g,t) = E[Y_t(g) - Y_t(0) | G=g] \]

This is the average effect of participating in the treatment for units in group \(g\) at time period \(t\). Notice that when there are two time periods and two groups (the canonical case), the average treatment effect on the treated is given by \(ATT = ATT(g=2,t=2)\).

To give a couple more examples, suppose that a researcher has access to three time periods. Then, \(ATT(g=2,t=3)\) is the average effect of participating in the treatment for the group of units that become treated in time period 2, in time period 3. Similarly, \(ATT(g=3,t=3)\) is the average effect of participating in the treatment for the group of units that become treated in time period 3, in time period 3.

Identification of Group-Time Average Treatment Effects

Under either version of the parallel trends assumptions mentioned above, it is straightforward to show that group-time average treatment effects are identified. For instance, when one imposes the parallel trends assumption based on “never-treated units”, we have that, for all \(t \ge g\), \[ ATT(g,t) = E[ Y_t - Y_{g-1}| G=g] - E[ Y_t - Y_{g-1}| C=1]. \] Alternatively, when one imposes the parallel trends assumption based on “not-yet-treated units”, we have that, for all \(t \ge g\), \[ ATT(g,t) = E[ Y_t - Y_{g-1}| G=g] - E[ Y_t - Y_{g-1}| D_t=0, G\not=g]. \]
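
For the never-treated case, the argument can be sketched as follows. It assumes that group \(g\) is still untreated in period \(g-1\) and does not anticipate the treatment, so that \(Y_{g-1} = Y_{g-1}(0)\) for that group, and that never-treated units always reveal \(Y_t(0)\). The first display writes the long difference in untreated potential outcomes as a telescoping sum and applies the parallel trends assumption to each period \(s\) with \(g \le s \le t\); the second display plugs this into the definition of \(ATT(g,t)\).

```latex
\begin{aligned}
E[Y_t(0) - Y_{g-1}(0) \mid G=g]
  &= \sum_{s=g}^{t} E[Y_s(0) - Y_{s-1}(0) \mid G=g] \\
  &= \sum_{s=g}^{t} E[Y_s(0) - Y_{s-1}(0) \mid C=1] \\
  &= E[Y_t - Y_{g-1} \mid C=1], \\[6pt]
ATT(g,t)
  &= E[Y_t(g) - Y_{g-1}(0) \mid G=g] - E[Y_t(0) - Y_{g-1}(0) \mid G=g] \\
  &= E[Y_t - Y_{g-1} \mid G=g] - E[Y_t - Y_{g-1} \mid C=1].
\end{aligned}
```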

These group-time average treatment effects are the building blocks of understanding the effect of participating in a treatment in DiD designs with multiple time periods.
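
The sample analogues of the two identification results are simple differences of means. The sketch below is a plain-Python illustration, not the estimator implemented in the did package (which also handles covariates and inference). The function name `att_group_time`, the `control` labels and the data layout are assumptions made for this example: each unit is a pair of its group (first treated period, or `None` if never treated) and its list of outcomes for periods \(1,\ldots,\mathcal{T}\).

```python
from statistics import mean


def att_group_time(units, g, t, control="nevertreated"):
    """Difference-of-means estimate of ATT(g, t) for t >= g >= 2.

    units: list of (group, outcomes) pairs; group is the first treated
    period (None if never treated), outcomes[k] is the outcome in period k+1.
    control: "nevertreated" uses units with group None; "notyettreated"
    uses every unit that is still untreated in period t.
    """
    if not 2 <= g <= t:
        raise ValueError("need 2 <= g <= t")

    def long_diff(y):
        # Y_t - Y_{g-1}, using 0-based list positions.
        return y[t - 1] - y[g - 2]

    treated = [long_diff(y) for grp, y in units if grp == g]
    if control == "nevertreated":
        comparison = [long_diff(y) for grp, y in units if grp is None]
    else:
        comparison = [long_diff(y) for grp, y in units
                      if grp is None or grp > t]
    return mean(treated) - mean(comparison)


# Hypothetical noiseless panel with 3 periods: untreated outcomes rise by 1
# each period; group 2 has effects 1 (period 2) and 5 (period 3); group 3
# has effect 1 (period 3).
toy_units = [
    (2, [10.0, 12.0, 17.0]),
    (2, [20.0, 22.0, 27.0]),
    (3, [5.0, 6.0, 8.0]),
    (None, [0.0, 1.0, 2.0]),
    (None, [3.0, 4.0, 5.0]),
]

att_hat = {(g, t): att_group_time(toy_units, g, t)
           for g, t in [(2, 2), (2, 3), (3, 3)]}
print(att_hat)
```

Because the toy data satisfy parallel trends exactly, the estimates recover the effects that were built in: 1 for \(ATT(2,2)\), 5 for \(ATT(2,3)\) and 1 for \(ATT(3,3)\). Calling `att_group_time(toy_units, 2, 2, control="notyettreated")` adds the group-3 unit to the comparison group and gives the same answer here.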

Aggregating Group-Time Average Treatment Effects

Group-time average treatment effects are natural parameters to identify in the context of DiD with multiple periods and multiple groups. But in many applications, there may be a lot of them. There are some benefits and costs here. The main benefit is that it is relatively straightforward to think about heterogeneous effects across groups and time using group-time average treatment effects. On the other hand, it can be hard to summarize them (e.g., they are not just a single number).

In our paper, Callaway and Sant’Anna (2021), “Difference-in-Differences with Multiple Time Periods”, we propose a number of ways to aggregate group-time average treatment effects. Here, we will just consider a few important ones that we think applied researchers are most often interested in. First, consider the average effect of participating in the treatment, separately for each group. This is given by

\[ \theta_S(g) = \frac{1}{\mathcal{T} - g + 1} \sum_{t=2}^{\mathcal{T}} \mathbf{1}\{g \leq t\} ATT(g,t). \]

This parameter may be of interest in its own right, since it allows one to highlight treatment effect heterogeneity with respect to treatment adoption period. Furthermore, it is fairly straightforward to further aggregate \(\theta_S(g)\) to get an easy-to-interpret overall effect parameter,

\[ \theta^O_S := \sum_{g=2}^{\mathcal{T}} \theta_S(g) P(G=g). \]

\(\theta^O_S\) is the overall effect of participating in the treatment across all groups that have ever participated in the treatment. In our view, this is close to being a multi-period analogue of the \(ATT\) in the two period case. Thus, if a researcher is constrained to report a single treatment effect summary parameter, we recommend reporting \(\theta^O_S\).
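
Both aggregations are short weighted sums once the group-time effects are available. The sketch below assumes the \(ATT(g,t)\) values are stored in a dictionary keyed by `(g, t)` and that \(P(G=g)\) is measured as each group's share among ever-treated units, in line with the description of \(\theta^O_S\) above. The function names, the toy effects and the group sizes are hypothetical.

```python
def theta_group(att, g, n_periods):
    """theta_S(g): average of ATT(g, t) over post-treatment periods t = g..T."""
    post_periods = range(g, n_periods + 1)
    return sum(att[(g, t)] for t in post_periods) / (n_periods - g + 1)


def theta_overall(att, group_sizes, n_periods):
    """theta^O_S: group-specific averages weighted by group shares.

    group_sizes maps each treated group g to its number of units; shares are
    taken among ever-treated units (an assumption of this sketch).
    """
    total = sum(group_sizes.values())
    return sum(theta_group(att, g, n_periods) * n / total
               for g, n in group_sizes.items())


# Hypothetical group-time effects with T = 3 periods.
att = {(2, 2): 1.0, (2, 3): 5.0, (3, 3): 1.0}
group_sizes = {2: 2, 3: 1}

print(theta_group(att, 2, 3))
print(theta_group(att, 3, 3))
print(theta_overall(att, group_sizes, 3))
```

With these numbers, \(\theta_S(2) = (1 + 5)/2 = 3\), \(\theta_S(3) = 1\), and the overall parameter is \(3 \cdot 2/3 + 1 \cdot 1/3 = 7/3\).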

In DiD setups with multiple periods, it is natural to ask “How do treatment effects vary with elapsed treatment time?” Here, researchers are interested in understanding treatment effect dynamics. This is at the heart of the event-study-type analysis that is widespread in applied work.

In this case, a natural way to aggregate the group-time average treatment effects to highlight treatment effect dynamics is given by

\[ \theta_D(e) := \sum_{g=2}^{\mathcal{T}} \mathbf{1} \{ g + e \leq \mathcal{T} \} ATT(g,g+e) P(G=g | G+e \leq \mathcal{T}). \]

This is the average effect of participating in the treatment for the group of units that have been exposed to the treatment for exactly \(e\) time periods.
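
The event-study aggregation can be sketched in the same way. The code below assumes \(ATT(g,t)\) values stored in a dictionary keyed by `(g, t)` and approximates \(P(G=g | G+e \leq \mathcal{T})\) by each group's share among the groups that are observed \(e\) periods after first being treated. The function name and toy inputs are hypothetical.

```python
def theta_dynamic(att, group_sizes, e, n_periods):
    """theta_D(e): weighted average of ATT(g, g + e) over eligible groups."""
    # Only groups observed e periods after first treatment contribute.
    eligible = {g: n for g, n in group_sizes.items() if g + e <= n_periods}
    if not eligible:
        raise ValueError(f"no group is observed {e} periods after treatment")
    total = sum(eligible.values())
    return sum(att[(g, g + e)] * n / total for g, n in eligible.items())


# Hypothetical group-time effects with T = 3 periods.
att = {(2, 2): 1.0, (2, 3): 5.0, (3, 3): 1.0}
group_sizes = {2: 2, 3: 1}

print(theta_dynamic(att, group_sizes, e=0, n_periods=3))
print(theta_dynamic(att, group_sizes, e=1, n_periods=3))
```

At \(e=0\) both groups contribute and the result is 1; at \(e=1\) only group 2 is observed, so the result is \(ATT(2,3) = 5\). The set of contributing groups changes with \(e\), which is worth keeping in mind when comparing effects across event times.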

All of these aggregations are available in the did package and examples with real data are available in our Getting Started with the did Package vignette. In Callaway and Sant’Anna (2021), we also discuss additional aggregation schemes. We encourage you to take a look!

Conclusion

This vignette has covered basic background issues on DiD with multiple periods. Callaway and Sant’Anna (2021) discusses many extensions and these are all provided in the did package as well. See our User Guides for more details.

References