Graded Intervention Time Series (OLS)#
Graded Intervention Time Series extends classical interrupted time series analysis to handle graded interventions - policies or treatments with varying intensity over time, rather than simple on/off changes. Traditional ITS methods model binary interventions (e.g., “policy enacted” vs “no policy”). This method (technically called Transfer Function Interrupted Time Series or TF-ITS in the literature [Box and Tiao, 1975]) handles more realistic scenarios where:
Intervention intensity varies continuously (e.g., advertising spend $0 - 100k, communication frequency 0-10 messages/week)
Effects saturate - diminishing returns as exposure increases (10th message less impactful than the 1st)
Effects persist over time - past interventions continue to influence outcomes (behavioral habits change gradually)
Key Components#
Saturation transforms: Model diminishing returns using Hill, logistic, or Michaelis-Menten functions
Adstock (carryover) transforms: Model persistence using geometric decay with configurable half-life
Baseline controls: Include confounders and natural trends
Counterfactual analysis: Estimate effects by zeroing or scaling interventions
HAC standard errors: Robust inference accounting for autocorrelation and heteroskedasticity
When to Use Graded Intervention Time Series#
Use this method when you have:
✅ Time series data from a single unit (region, market, organization)
✅ Graded intervention with varying intensity over time
✅ Reason to expect saturation (diminishing returns) or carryover effects (persistence)
✅ Baseline controls available for confounders
Compare to related methods:
Classic Interrupted Time Series: Binary on/off intervention (no dose-response modeling)
Synthetic Control: Multiple control units available for comparison
Difference in Differences: Panel data with treatment/control groups
The Autocorrelation Challenge#
Introduction: Understanding the Problem#
Autocorrelation occurs when observations in a time series are correlated with their own past values. In causal inference with time series data, this creates a fundamental challenge:
What is autocorrelation?
Today’s outcome is influenced by yesterday’s (and last week’s, and last month’s…)
This happens through multiple mechanisms:
Persistent unobserved factors: Weather patterns, economic conditions, social trends
Behavioral inertia: Habits and routines change slowly
Institutional dynamics: Organizational processes have memory
Measurement systems: Data collection schedules create patterns
Why is this a problem for causal inference?
Standard regression assumes independent errors — that the unexplained variation at time \(t\) is unrelated to time \(t-1\). When this assumption fails (as it almost always does in time series):
Coefficient estimates remain unbiased ✓ (still correct on average)
Standard errors are WRONG ✗ (typically too small, leading to overconfident inference)
Hypothesis tests are invalid ✗ (false positives, misleading p-values)
Confidence intervals are too narrow ✗ (underestimate true uncertainty)
This means you might conclude an intervention “works” when it actually doesn’t, or claim high precision when you’re actually quite uncertain!
How to handle autocorrelation:
There are several approaches to addressing autocorrelation in time series causal inference:
HAC (Newey-West) standard errors: Correct standard errors without modeling autocorrelation structure
ARIMAX models: Explicitly model AR/MA error structure
GLSAR: Generalized least squares with autoregressive errors
Bayesian time series models: Full posterior inference with temporal dependencies
Bootstrap methods: Resample with preserved temporal structure
This implementation provides both HAC and ARIMAX approaches, each with distinct advantages for different use cases.
Approach 1: HAC Standard Errors (Default)#
HAC (Heteroskedasticity and Autocorrelation Consistent) standard errors — also known as Newey-West standard errors [Newey and West, 1987] — provide robust inference by correcting standard errors without requiring specification of the autocorrelation structure.
Advantages:
Simplicity: No need to specify autocorrelation structure (order of AR/MA terms)
Robustness: Works with any autocorrelation pattern (not just AR or MA)
Computational efficiency: Fast OLS with corrected standard errors
Proven reliability: Well-established method with strong theoretical properties
With HAC (see detailed explanation in the admonition box below):
✅ Causal estimates remain valid: Treatment effect coefficients are unbiased
✅ Inference is corrected: Standard errors, confidence intervals, and p-values account for autocorrelation
✅ No model specification required: Don’t need to guess AR order or lag structure
✅ Honest uncertainty quantification: Confidence intervals reflect true uncertainty
Tradeoff: HAC standard errors are wider (more conservative) than naive OLS, but they provide trustworthy inference even when residuals show complex autocorrelation patterns.
This notebook demonstrates HAC inference in the main analysis sections, showing how it compares to naive OLS and why it matters for valid causal inference.
Understanding HAC Standard Errors
Time series data typically violates OLS assumptions because:
Autocorrelation: Past values influence current values (e.g., yesterday’s weather affects today’s, habits persist over weeks)
Heteroskedasticity: Variance changes over time (e.g., more volatility in certain seasons)
When these violations occur, OLS coefficient estimates remain unbiased, but standard errors are incorrect — typically too small, leading to overconfident inference (narrow confidence intervals, artificially low p-values).
HAC (Heteroskedasticity and Autocorrelation Consistent) standard errors — also known as Newey-West standard errors [Newey and West, 1987] — provide robust inference by correcting standard errors for these violations. This gives reliable confidence intervals and hypothesis tests even when residuals are correlated.
Key Parameter:
hac_maxlags: Controls how many periods of autocorrelation to account for. CausalPy uses the Newey-West rule of thumb:floor(4*(n/100)^(2/9)). For our 156-week dataset, this giveshac_maxlags=4, accounting for up to 4 weeks of residual correlation.
Tradeoff: HAC standard errors are wider (more conservative) than naive OLS, but provide honest uncertainty quantification for time series data.
Approach 2: ARIMAX Models#
ARIMAX (ARIMA with eXogenous variables) explicitly models the autocorrelation structure of residuals using ARIMA(p,d,q) processes, following the classical Box & Tiao (1975) intervention analysis framework [Box and Tiao, 1975].
Advantages:
Efficiency: Smaller standard errors when ARIMA structure is correctly specified
Classical methodology: Follows the original intervention analysis approach
Explicit error modeling: Can characterize and forecast residual dynamics
Tradeoffs:
Requires specification: Must choose p, d, q orders (typically via ACF/PACF plots)
Misspecification risk: Wrong orders can lead to biased or inefficient inference
Less robust: More sensitive to outliers and structural breaks
Section 5 of this notebook demonstrates ARIMAX as an alternative error model, comparing it to HAC and providing guidance on when to use each approach.
Notebook Overview#
This notebook demonstrates Graded Intervention Time Series (Transfer Function ITS) analysis using a simulated water consumption dataset. We’ll walk through data simulation, model fitting with transform parameter estimation, diagnostic checks, counterfactual analysis, and a comparison of different approaches to handling autocorrelation in time series data (HAC vs ARIMAX error models).
Implementation notes
This notebook demonstrates the non-Bayesian implementation using:
OLS regression first with with HAC standard errors (fast, robust inference), then with ARIMAX.
Automated transform parameter estimation via grid search or continuous optimization
Point estimates only (future: bootstrap confidence intervals, Bayesian uncertainty quantification)
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import causalpy as cp
# Set random seed for reproducibility
np.random.seed(42)
%config InlineBackend.figure_format = 'retina'
OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
/Users/benjamv/mambaforge/envs/CausalPy/lib/python3.13/site-packages/pymc_extras/model/marginal/graph_analysis.py:10: FutureWarning: `pytensor.graph.basic.io_toposort` was moved to `pytensor.graph.traversal.io_toposort`. Calling it from the old location will fail in a future release.
from pytensor.graph.basic import io_toposort
Example Scenario: Water Restrictions Policy#
A regional water authority in a dry climate implements a drought-responsive communications policy. Communication intensity (0-10 scale) varies based on cumulative rainfall deficit over the past 6 weeks. During sustained drought, messaging ramps up to encourage conservation. Most of the time, communications are zero (no routine messaging).
Why this example demonstrates TF-ITS strengths:
Graded intervention: Communication intensity varies from 0-10, not on/off
Sparse activation: Policy activates only during drought (realistic, cost-effective)
Saturation: Repeated messages have diminishing returns as people become desensitized (logistic function)
Adstock: Short carryover effect (~1.5 week half-life) - behavior reverts quickly when conditions change
Confounders: Temperature and rainfall directly affect water consumption and must be controlled
Note: Water consumption exhibits shorter persistence than interventions like advertising or habit-forming behaviors, since consumption is largely need-driven and reverts quickly when environmental conditions (rainfall, temperature) change.
While we use water policy, this method applies to any domain with graded interventions and carryover effects:
Public health campaigns (vaccination messaging, smoking cessation)
Marketing mix modeling (advertising spend, promotions)
Environmental policy (emissions reduction programs)
Traffic management (congestion pricing communications)
Education interventions (remediation program intensity)
1. Generate Simulated Data#
We’ll simulate weekly water consumption data for a catchment area in a dry climate over 3 years with:
Baseline drivers: temperature (seasonal) and rainfall (very low, with drought periods)
Responsive policy: public communications intensity that activates only during sustained drought
Autocorrelated errors: AR(2) process to simulate realistic residual autocorrelation (this demonstrates why HAC standard errors are crucial!)
Seasonal heteroskedasticity: Error variance increases during summer months (higher volatility)
Key features:
Rainfall ranges 0-16 mm/week with extended zero-rainfall periods in summer
Communication intensity is zero most of the time (no routine messaging)
Policy responds to 6-week cumulative rainfall deficit (not just current conditions)
When 6-week rainfall < 20mm and temperature > 27°C, communications ramp up to intensity 6-10
# Generate 156 weeks (3 years) of data
n_weeks = 156
dates = pd.date_range("2022-01-01", periods=n_weeks, freq="W-MON")
t = np.arange(n_weeks)
# Temperature (°C): seasonal pattern with summer peaks (southern hemisphere)
# Peak in Jan (week ~0) and Dec (week ~52), low in July (week ~26)
temperature = 25 + 10 * np.sin(2 * np.pi * t / 52) + np.random.normal(0, 2, n_weeks)
# Rainfall (mm/week): inverse seasonal pattern - very low in summer, moderate in winter
# Drier climate with extended periods of zero rainfall creating drought conditions
rainfall = 8 - 8 * np.sin(2 * np.pi * t / 52) + np.random.normal(0, 3, n_weeks)
rainfall = np.maximum(rainfall, 0) # Censor at zero (no negative rainfall)
# Communication intensity (scale 0-10): Only ramps up after sustained low rainfall
# Policy responds to cumulative rainfall deficit over past 6 weeks
comm_intensity = np.zeros(n_weeks)
# Calculate 6-week rolling sum of rainfall (measure of drought severity)
window_size = 6
for i in range(n_weeks):
start_idx = max(0, i - window_size + 1)
rainfall_6wk = rainfall[start_idx : i + 1].sum()
# Trigger communications only during severe drought conditions
# Expected 6-week rainfall in normal conditions: ~48mm (8mm/week avg)
# Drought threshold: < 20mm over 6 weeks (< 3.3mm/week average)
if rainfall_6wk < 20 and temperature[i] > 27:
# Ramp up intensity based on drought severity
drought_severity = (20 - rainfall_6wk) / 20 # 0 to 1
heat_factor = (temperature[i] - 27) / 10 # 0 to 1
intensity_raw = 6 + 4 * (drought_severity + heat_factor) / 2
comm_intensity[i] = np.floor(
np.clip(intensity_raw, 0, 10)
) # Round down to whole numbers
# Otherwise, communications stay at zero (no routine messaging)
# Baseline water consumption: depends on temperature and rainfall
# Higher temp → more water use, higher rainfall → less water use
baseline = (
4000 # Base consumption
+ 80 * temperature # Temperature effect (~80 ML per degree)
- 20 * rainfall # Rainfall effect (~20 ML per mm)
+ 5.0 * t # Slight upward trend (population growth)
)
# Apply "true" transforms to generate the data using pure numpy
# (Note: for data generation, we use numpy. For model fitting, we use CausalPy's transforms)
# Saturation: Logistic function - diminishing returns as people become desensitized to messaging
# lam=0.5 controls the saturation rate
lam_true = 0.5
comm_saturated = 1 - np.exp(-lam_true * comm_intensity)
# Adstock: geometric with half-life of 1.5 weeks
# Short carryover effect - behavior reverts quickly when messaging stops
# (More realistic for water consumption than longer persistence)
half_life = 1.5
alpha = np.power(0.5, 1 / half_life) # decay rate
l_max = 8 # Shorter window since effect decays quickly
# Apply geometric adstock convolution
comm_transformed = np.zeros_like(comm_saturated)
adstock_weights = np.power(alpha, np.arange(l_max + 1))
adstock_weights = adstock_weights / adstock_weights.sum() # normalize
for t_idx in range(n_weeks):
for lag in range(min(l_max + 1, t_idx + 1)):
comm_transformed[t_idx] += adstock_weights[lag] * comm_saturated[t_idx - lag]
# Generate water consumption with autocorrelated errors (realistic time series)
# Negative coefficient: higher communication intensity → lower water consumption
theta_true = (
-600
) # Treatment coefficient (ML reduction per unit of transformed communication)
# Create AR(2) autocorrelated errors to simulate realistic residual structure
# Even with correct model specification (right variables, right transforms),
# real data has unmodeled factors with temporal persistence:
# - Unmeasured weather patterns (humidity, wind, soil moisture)
# - Social contagion (neighbors influencing each other's conservation behavior)
# - Measurement error with persistence (meter reading schedules)
# - Institutional factors (maintenance schedules, local events)
# This autocorrelation is EXACTLY why we need HAC standard errors!
rho1 = 0.5 # AR(1) coefficient
rho2 = 0.2 # AR(2) coefficient
base_error_sd = 100 # Base standard deviation of innovation
# Add seasonal heteroskedasticity: higher variance in summer (when temperature is high)
# Error SD scales with temperature (normalized to 0-1 range)
temp_normalized = (temperature - temperature.min()) / (
temperature.max() - temperature.min()
)
seasonal_scale = 1 + 0.5 * temp_normalized # SD ranges from 1x to 1.5x base
error_sd_t = base_error_sd * seasonal_scale
# Generate stationary AR(2) process: epsilon_t = rho1 * epsilon_{t-1} + rho2 * epsilon_{t-2} + nu_t
errors = np.zeros(n_weeks)
# Stationary initialization (first two values from unconditional distribution)
var_epsilon = base_error_sd**2 * (1 - rho2) / ((1 + rho2) * ((1 - rho2) ** 2 - rho1**2))
errors[0] = np.random.normal(0, np.sqrt(var_epsilon))
errors[1] = np.random.normal(0, np.sqrt(var_epsilon))
for t_idx in range(2, n_weeks):
errors[t_idx] = (
rho1 * errors[t_idx - 1]
+ rho2 * errors[t_idx - 2]
+ np.random.normal(0, error_sd_t[t_idx])
)
water_consumption = baseline + theta_true * comm_transformed + errors
# Create DataFrame
df = pd.DataFrame(
{
"date": dates,
"t": t,
"water_consumption": water_consumption,
"temperature": temperature,
"rainfall": rainfall,
"comm_intensity": comm_intensity,
}
)
df = df.set_index("date")
Show code cell source
print(df.head(10))
print(f"\nData shape: {df.shape}")
print(
f"Water consumption range: [{df['water_consumption'].min():.0f}, {df['water_consumption'].max():.0f}] ML/week"
)
print(
f"Temperature range: [{df['temperature'].min():.1f}, {df['temperature'].max():.1f}] °C"
)
print(
f"Rainfall range: [{df['rainfall'].min():.1f}, {df['rainfall'].max():.1f}] mm/week"
)
print(f" Number of zero-rainfall weeks: {(df['rainfall'] == 0).sum()}")
print(f" Number of weeks with rainfall < 2mm: {(df['rainfall'] < 2).sum()}")
print(
f"Communication intensity range: [{df['comm_intensity'].min():.1f}, {df['comm_intensity'].max():.1f}]"
)
print(
f" Number of weeks with active communications (>0): {(df['comm_intensity'] > 0).sum()}"
)
t water_consumption temperature rainfall comm_intensity
date
2022-01-03 0 5915.446103 25.993428 13.597324 0.0
2022-01-10 1 6016.524527 25.928838 8.457205 0.0
2022-01-17 2 6501.403293 28.688534 2.511564 0.0
2022-01-24 3 6547.135643 31.592109 7.132822 0.0
2022-01-31 4 6540.520637 29.178925 1.358170 0.0
2022-02-07 5 6420.134525 30.212374 5.816736 0.0
2022-02-14 6 6825.934464 34.789652 6.170805 0.0
2022-02-21 7 6825.192056 34.019977 0.000000 0.0
2022-02-28 8 6612.079720 32.290890 4.306257 0.0
2022-03-07 9 6714.632361 34.939680 2.154695 7.0
Data shape: (156, 5)
Water consumption range: [4657, 7603] ML/week
Temperature range: [11.4, 38.8] °C
Rainfall range: [0.0, 21.2] mm/week
Number of zero-rainfall weeks: 17
Number of weeks with rainfall < 2mm: 34
Communication intensity range: [0.0, 9.0]
Number of weeks with active communications (>0): 42
Let’s look at the water consumption and communication intensity time series. Notice:
Very dry climate with extended zero-rainfall periods in summer
Communications are zero most of the time - only activated during sustained drought
Policy responds to cumulative rainfall deficit over the past 6 weeks, not just current conditions
When 6-week cumulative rainfall drops below 20mm (drought threshold), communications ramp up
Show code cell source
fig, axes = plt.subplots(4, 1, figsize=(12, 12), sharex=True)
# Water consumption
axes[0].plot(df.index, df["water_consumption"], "o-", alpha=0.6, markersize=3)
axes[0].set_ylabel("Water Consumption (ML/week)")
axes[0].set_title("Weekly Water Consumption - Regional Catchment")
axes[0].grid(True, alpha=0.3)
# Temperature
axes[1].plot(df.index, df["temperature"], "o-", alpha=0.6, markersize=3, color="C3")
axes[1].set_ylabel("Temperature (°C)")
axes[1].set_title("Weekly Average Temperature")
axes[1].grid(True, alpha=0.3)
# Rainfall
axes[2].plot(df.index, df["rainfall"], "o-", alpha=0.6, markersize=3, color="C2")
axes[2].set_ylabel("Rainfall (mm/week)")
axes[2].set_title("Weekly Rainfall")
axes[2].grid(True, alpha=0.3)
# Communication intensity
axes[3].plot(df.index, df["comm_intensity"], "o-", alpha=0.6, markersize=3, color="C1")
axes[3].set_ylabel("Communication Intensity (0-10)")
axes[3].set_title("Public Communications Intensity (Treatment Variable)")
axes[3].set_xlabel("Date")
axes[3].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Show correlation between variables
print("\nKey patterns in the data:")
print(
"- Very dry climate: 0-16mm/week rainfall, with extended zero-rainfall periods in summer"
)
print("- Communication intensity is zero most of the time (no routine messaging)")
print(
"- Communications only activate during sustained drought (6-week rainfall < 20mm)"
)
print("- Policy responds to cumulative rainfall deficit, not just current week")
print(
"- When drought conditions persist, communication intensity ramps up to 6-10 range"
)
Key patterns in the data:
- Very dry climate: 0-16mm/week rainfall, with extended zero-rainfall periods in summer
- Communication intensity is zero most of the time (no routine messaging)
- Communications only activate during sustained drought (6-week rainfall < 20mm)
- Policy responds to cumulative rainfall deficit, not just current week
- When drought conditions persist, communication intensity ramps up to 6-10 range
2. Model Fitting#
Fitting a transfer function model involves finding both the optimal transform parameters and the regression coefficients. This is accomplished through a nested optimization procedure. In the outer loop, the algorithm searches for the best saturation and adstock parameters—either by exhaustively evaluating all combinations on a discrete grid, or by using continuous optimization to search more efficiently through the parameter space. For each candidate set of transform parameters, the inner loop applies these transformations to the raw treatment variable and fits a regression model (OLS or ARIMAX) to the data. The root mean squared error (RMSE) of each fitted model is computed, and the parameter combination that minimizes this error is selected.
This nested approach is computationally tractable because ordinary least squares has a closed-form solution based on matrix operations, making each individual model fit very fast. When using grid search with, say, 10 values for each of 3 parameters, the algorithm evaluates 1,000 model fits, which typically completes in under a second. Continuous optimization via gradient-based methods can be even faster, though it may settle on local optima rather than finding the global best. For models with ARIMAX error structures, each fit requires numerical optimization and takes longer, but the overall approach remains the same.
Let’s fit a model using grid search to estimate the transform parameters. We’ll use a coarse grid for speed in this demonstration, though you can make it finer for production use to achieve more precise parameter estimates.
model_estimated = cp.skl_models.TransferFunctionOLS(
saturation_type="logistic",
saturation_grid={
"lam": np.linspace(0.2, 0.8, 10), # Search around true value (0.5)
},
adstock_grid={
"half_life": np.linspace(0.5, 3.0, 10), # Search around true value (1.5)
"l_max": [8], # Fixed
"normalize": [True], # Fixed
},
estimation_method="grid", # or "optimize"
error_model="hac",
)
result_estimated = cp.GradedInterventionTimeSeries(
data=df,
y_column="water_consumption",
treatment_names=["comm_intensity"],
base_formula="1 + t + temperature + rainfall",
model=model_estimated,
)
print("Parameter estimation complete!")
print(f"Best RMSE: {result_estimated.transform_estimation_results['best_score']:.2f}")
print(f"Model R-squared: {result_estimated.score:.4f}")
print("\nEstimated parameters:")
print(result_estimated.transform_estimation_results["best_params"])
Parameter estimation complete!
Best RMSE: 131.20
Model R-squared: 0.9564
Estimated parameters:
{'lam': np.float64(0.2), 'half_life': np.float64(3.0), 'l_max': 8, 'normalize': True}
3. Visualize Estimated vs True Transform Parameters#
Since we know the true parameters used to generate the data, we can compare the estimated transforms to the true transforms. This helps us assess parameter recovery - how well the estimation procedure identifies the true data-generating process.
We’ll visualize:
Saturation curves: How raw communication intensity gets transformed by saturation
Adstock weights: How effects carry over across weeks
# Create true transform objects (parameters used for data generation)
true_saturation = cp.LogisticSaturation(lam=0.5)
true_adstock = cp.GeometricAdstock(half_life=1.5, l_max=8, normalize=True)
# Plot estimated transforms with comparison to true transforms
fig, ax = result_estimated.plot_transforms(
true_saturation=true_saturation, true_adstock=true_adstock, x_range=(0, 10)
)
plt.show()
# Parameter Recovery Assessment
true_params = true_saturation.get_params()
est_params = result_estimated.treatments[0].saturation.get_params()
true_adstock_params = true_adstock.get_params()
est_adstock_params = result_estimated.treatments[0].adstock.get_params()
print("\nParameter Recovery Assessment:")
print(f"Saturation - lam error: {abs(est_params['lam'] - true_params['lam']):.2f}")
print(
f"Adstock - half_life error: {abs(est_adstock_params['half_life'] - true_adstock_params['half_life']):.2f} weeks"
)
Parameter Recovery Assessment:
Saturation - lam error: 0.30
Adstock - half_life error: 1.50 weeks
Interpretation:
Saturation curve (left): Shows how raw communication intensity (0-10) gets transformed by diminishing returns. The curve flattens at higher intensities, meaning the 10th message has much less impact than the 1st.
Adstock weights (right): Shows how a communication “impulse” at week 0 affects water consumption over the following weeks. With half-life ~4 weeks, about 50% of the effect persists after 4 weeks. The bars show the relative contribution of each lag.
Parameter recovery: In this simulated example with known ground truth, we can assess how well the estimation recovered the true parameters. Good recovery suggests the model specification is appropriate and the data is informative.
In real applications (without ground truth), you would:
Use domain knowledge to set reasonable parameter grids/bounds
Compare estimated curves to expert expectations
Check sensitivity of results to different parameter ranges
Consider alternative functional forms (logistic vs Hill saturation, etc.)
4. Model Methods and Diagnostics#
Now that we have a fitted model with estimated transforms, let’s explore the available methods for analysis and diagnostics.
Model Summary#
View the fitted model coefficients and their HAC standard errors (robust to autocorrelation and heteroskedasticity):
# Display model summary
result_estimated.summary(round_to=2)
================================================================================
Graded Intervention Time Series Results
================================================================================
Outcome variable: water_consumption
Number of observations: 156
R-squared: 0.96
Error model: HAC
HAC max lags: 4 (robust SEs accounting for 4 periods of autocorrelation)
--------------------------------------------------------------------------------
Baseline coefficients:
Intercept : 3792 (SE: 89)
t : 4.9 (SE: 0.34)
temperature : 92 (SE: 2.9)
rainfall : -23 (SE: 3)
--------------------------------------------------------------------------------
Treatment coefficients:
comm_intensity : -1132 (SE: 91)
================================================================================
Model Fit Visualization#
Plot observed vs fitted values and residuals:
# Plot model fit
fig, ax = result_estimated.plot()
plt.show()
Residual Diagnostics#
Check for autocorrelation in residuals using ACF/PACF plots and Ljung-Box test.
Important: Finding autocorrelation here is not a model failure — it’s expected! Even well-specified time series models have autocorrelated residuals due to unmodeled temporal factors. This is precisely why we use HAC standard errors (see next section).
# Display diagnostic plots and tests
result_estimated.plot_diagnostics(lags=20)
============================================================
Ljung-Box Test for Residual Autocorrelation
============================================================
H0: Residuals are independently distributed (no autocorrelation)
If p-value < 0.05, reject H0 (autocorrelation present)
------------------------------------------------------------
Lag 1: LB statistic = 36.981, p-value = 0.0000 ***
Lag 5: LB statistic = 54.998, p-value = 0.0000 ***
Lag 10: LB statistic = 64.853, p-value = 0.0000 ***
Lag 20: LB statistic = 70.024, p-value = 0.0000 ***
------------------------------------------------------------
⚠ Warning: Significant residual autocorrelation detected.
- HAC standard errors (if used) account for this in coefficient inference.
- Consider adding more baseline controls or adjusting transform parameters.
============================================================
Why HAC Standard Errors Matter#
The diagnostics above show significant residual autocorrelation (Ljung-Box test p-values < 0.05). This is expected and realistic even with a well-specified model!
Our model includes the right variables (temperature, rainfall, time trend) and the right transforms (saturation, adstock), yet residuals are still autocorrelated. Why? Because real data always has unmodeled factors with temporal persistence:
🌡️ Unmeasured weather: We control for temperature and rainfall, but not humidity, wind, or soil moisture
👥 Social contagion: Neighbors influence each other’s conservation behavior beyond our policy
📊 Measurement dynamics: Water meter reading schedules create systematic patterns
🏛️ Institutional effects: Maintenance schedules, local events, seasonal employment
This residual autocorrelation means naive OLS standard errors are wrong (typically too small). Let’s demonstrate the problem and solution:
Show code cell source
# Compare naive OLS standard errors vs HAC standard errors
import statsmodels.api as sm
# Refit the model with NAIVE (non-robust) standard errors
ols_naive = sm.OLS(result_estimated.y, result_estimated.X_full).fit()
# Extract treatment coefficient and standard errors
# Treatment comes after baseline in the full parameter vector
n_baseline = len(result_estimated.baseline_labels)
treatment_idx = n_baseline # First treatment parameter index
coef = result_estimated.theta_treatment[0]
# Get standard errors
se_naive = ols_naive.bse[treatment_idx]
se_hac = result_estimated.ols_result.bse[treatment_idx]
# Compute confidence intervals
ci_naive_lower = coef - 1.96 * se_naive
ci_naive_upper = coef + 1.96 * se_naive
ci_hac_lower = coef - 1.96 * se_hac
ci_hac_upper = coef + 1.96 * se_hac
print("=" * 70)
print("COMPARISON: Naive OLS vs HAC Standard Errors")
print("=" * 70)
print(f"Treatment coefficient: {coef:.2f}")
print()
print(f"Naive OLS Standard Error: {se_naive:.2f}")
print(f" → 95% CI: [{ci_naive_lower:.2f}, {ci_naive_upper:.2f}]")
print(f" → CI Width: {ci_naive_upper - ci_naive_lower:.2f}")
print()
print(f"HAC Standard Error: {se_hac:.2f}")
print(f" → 95% CI: [{ci_hac_lower:.2f}, {ci_hac_upper:.2f}]")
print(f" → CI Width: {ci_hac_upper - ci_hac_lower:.2f}")
print()
print(f"SE Inflation Factor: {se_hac / se_naive:.2f}x")
print(
f"CI Width Increase: {(ci_hac_upper - ci_hac_lower) / (ci_naive_upper - ci_naive_lower):.2f}x"
)
print("=" * 70)
print()
print("📊 INTERPRETATION:")
print(
f"• Naive SE is TOO SMALL by {(se_hac / se_naive - 1) * 100:.0f}% due to ignoring autocorrelation"
)
print(
f"• HAC SE is {(se_hac / se_naive - 1) * 100:.0f}% larger, providing honest uncertainty"
)
print("• This means naive OLS gives OVERCONFIDENT inference (too-narrow CIs)")
print("• HAC corrects this, giving reliable inference despite residual correlation")
print()
print("✅ This demonstrates why TF-ITS uses HAC by default for time series data!")
======================================================================
COMPARISON: Naive OLS vs HAC Standard Errors
======================================================================
Treatment coefficient: -1132.08
Naive OLS Standard Error: 61.04
→ 95% CI: [-1251.72, -1012.44]
→ CI Width: 239.28
HAC Standard Error: 91.45
→ 95% CI: [-1311.32, -952.84]
→ CI Width: 358.49
SE Inflation Factor: 1.50x
CI Width Increase: 1.50x
======================================================================
📊 INTERPRETATION:
• Naive SE is TOO SMALL by 50% due to ignoring autocorrelation
• HAC SE is 50% larger, providing honest uncertainty
• This means naive OLS gives OVERCONFIDENT inference (too-narrow CIs)
• HAC corrects this, giving reliable inference despite residual correlation
✅ This demonstrates why TF-ITS uses HAC by default for time series data!
Show code cell source
# Visualize the difference in confidence intervals
fig, ax = plt.subplots(figsize=(10, 4))
# Plot point estimate and confidence intervals
y_pos = [0, 1]
labels = ["Naive OLS\n(WRONG)", "HAC\n(CORRECT)"]
cis = [
(ci_naive_lower, ci_naive_upper),
(ci_hac_lower, ci_hac_upper),
]
colors = ["#e74c3c", "#2ecc71"] # Red for naive, green for HAC
for i, (label, ci, color) in enumerate(zip(labels, cis, colors)):
# Plot confidence interval as a horizontal line
ax.plot(
[ci[0], ci[1]],
[y_pos[i], y_pos[i]],
color=color,
linewidth=8,
alpha=0.6,
label=f"{label}",
)
# Plot point estimate as a dot
ax.plot(
coef,
y_pos[i],
"o",
color=color,
markersize=12,
markeredgecolor="black",
markeredgewidth=1.5,
zorder=10,
)
# Add CI text
ax.text(
ci[1] + 20,
y_pos[i],
f" [{ci[0]:.0f}, {ci[1]:.0f}]",
va="center",
fontsize=10,
color=color,
fontweight="bold",
)
# Add true value line (we know it from simulation)
ax.axvline(
theta_true,
color="black",
linestyle="--",
linewidth=2,
alpha=0.8,
label=f"True value: {theta_true:.0f}",
zorder=5,
)
ax.set_yticks(y_pos)
ax.set_yticklabels(labels, fontsize=11)
ax.set_xlabel("Treatment Effect Coefficient", fontsize=12, fontweight="bold")
ax.set_title(
"Confidence Interval Comparison: Naive OLS vs HAC\n"
"Naive OLS is overconfident (too narrow) due to ignoring autocorrelation",
fontsize=12,
fontweight="bold",
)
ax.legend(loc="upper right", fontsize=10)
ax.grid(True, alpha=0.3, axis="x")
ax.set_xlim(coef - 200, coef + 200)
plt.tight_layout()
plt.show()
print("\n🎯 KEY TAKEAWAY:")
print("The naive OLS confidence interval is dangerously narrow. If we relied on it,")
print("we'd be overconfident about our treatment effect estimate. HAC provides the")
print("correct, wider interval that honestly reflects uncertainty in the presence of")
print("autocorrelated residuals. This is essential for valid inference in time series!")
🎯 KEY TAKEAWAY:
The naive OLS confidence interval is dangerously narrow. If we relied on it,
we'd be overconfident about our treatment effect estimate. HAC provides the
correct, wider interval that honestly reflects uncertainty in the presence of
autocorrelated residuals. This is essential for valid inference in time series!
Impulse Response Function#
Visualize how communication effects persist over time through the adstock transformation:
# Plot impulse response function
fig = result_estimated.plot_irf("comm_intensity", max_lag=8)
plt.show()
Counterfactual Effect Estimation#
Estimate the effect of the communications policy by comparing observed outcomes to a counterfactual where communications were never implemented:
# Estimate effect of policy over entire 3-year period
# Counterfactual: set all communications to zero
effect_result = result_estimated.effect(
window=(df.index[0], df.index[-1]),
channels=["comm_intensity"],
scale=0.0, # Zero out all communications (no policy counterfactual)
)
print(
f"Total water saved by communications policy: {-effect_result['total_effect']:.0f} ML"
)
print(f"Average weekly savings: {-effect_result['mean_effect']:.0f} ML/week")
print(
f"Percentage reduction: {-100 * effect_result['mean_effect'] / df['water_consumption'].mean():.1f}%"
)
Total water saved by communications policy: 30821 ML
Average weekly savings: 198 ML/week
Percentage reduction: 3.3%
5. Alternative Error Model: ARIMAX#
So far we’ve used HAC (Newey-West) standard errors, which provide robust inference without requiring us to specify the autocorrelation structure. This is the recommended default approach.
However, TF-ITS also supports ARIMAX (ARIMA with eXogenous variables) error models, following the classical Box & Tiao (1975) intervention analysis framework. ARIMAX explicitly models the ARIMA(p,d,q) structure of the residuals.
When to Consider ARIMAX:#
Advantages:
More efficient: Smaller standard errors when the ARIMA structure is correctly specified
Classical approach: Follows Box & Tiao’s original intervention analysis methodology
Explicit error modeling: Can characterize and forecast the residual dynamics
Disadvantages:
Requires specification: Must choose p, d, q orders (typically via ACF/PACF plots)
Misspecification risk: Wrong orders lead to biased or inefficient inference
Less robust: Sensitive to outliers and structural breaks
Recommendation: Use HAC as default (robust, no specification). Consider ARIMAX when:
You have strong evidence for a specific ARIMA structure (e.g., from ACF/PACF)
Sample size is small and efficiency matters
You want to follow classical time series methodology exactly
Fit Model with ARIMAX Errors#
Since we generated the data with AR(2) errors (rho1=0.5, rho2=0.2), the true error structure is ARIMA(2,0,0). For demonstration purposes, we’ll fit an ARIMA(1,0,0) model, which is a slight misspecification. This shows how ARIMAX still performs reasonably well even when the order is not perfectly matched. In practice, you would use ACF/PACF plots to guide ARIMA order selection:
model_arimax = cp.skl_models.TransferFunctionOLS(
saturation_type="logistic",
saturation_grid={
"lam": np.linspace(0.2, 0.8, 10),
},
adstock_grid={
"half_life": np.linspace(0.5, 3.0, 10),
"l_max": [8],
"normalize": [True],
},
estimation_method="grid",
error_model="arimax",
arima_order=(1, 0, 0),
)
result_arimax = cp.GradedInterventionTimeSeries(
data=df,
y_column="water_consumption",
treatment_names=["comm_intensity"],
base_formula="1 + t + temperature + rainfall",
model=model_arimax,
)
print("ARIMAX model fitted successfully!")
print(
f"Best transform parameters: {result_arimax.transform_estimation_results['best_params']}"
)
ARIMAX model fitted successfully!
Best transform parameters: {'lam': np.float64(0.2), 'half_life': np.float64(3.0), 'l_max': 8, 'normalize': True}
# View summary - note the ARIMA order is displayed
result_arimax.summary(round_to=2)
================================================================================
Graded Intervention Time Series Results
================================================================================
Outcome variable: water_consumption
Number of observations: 156
R-squared: 0.97
Error model: ARIMAX
ARIMA order: (1, 0, 0)
p=1: AR order, d=0: differencing, q=0: MA order
--------------------------------------------------------------------------------
Baseline coefficients:
Intercept : 3825 (SE: 109)
t : 4.9 (SE: 0.47)
temperature : 91 (SE: 3.6)
rainfall : -23 (SE: 3.4)
--------------------------------------------------------------------------------
Treatment coefficients:
comm_intensity : -1142 (SE: 91)
================================================================================
Model Fit Visualization#
Let’s visualize the ARIMAX model fit to see how well it captures the data patterns:
fig, ax = result_arimax.plot()
plt.show()
The top panel shows observed water consumption (blue) vs the ARIMAX model’s fitted values (orange). The bottom panel shows the residuals over time. The ARIMAX model explicitly accounts for autocorrelation in the error term (here using ARIMA(1,0,0)), which should result in residuals that are closer to white noise compared to naive OLS. Note that the true data has AR(2) errors, so this ARIMA(1,0,0) is slightly misspecified—but ARIMAX is still robust enough to capture most of the autocorrelation structure.
Residual Diagnostics#
A key advantage of ARIMAX is that by explicitly modeling the autocorrelation structure, the residuals should exhibit less autocorrelation. Let’s check this with ACF/PACF plots and the Ljung-Box test:
result_arimax.plot_diagnostics(lags=20)
============================================================
Ljung-Box Test for Residual Autocorrelation
============================================================
H0: Residuals are independently distributed (no autocorrelation)
If p-value < 0.05, reject H0 (autocorrelation present)
------------------------------------------------------------
Lag 1: LB statistic = 0.233, p-value = 0.6291
Lag 5: LB statistic = 6.228, p-value = 0.2847
Lag 10: LB statistic = 9.280, p-value = 0.5058
Lag 20: LB statistic = 12.331, p-value = 0.9042
------------------------------------------------------------
✓ No significant residual autocorrelation detected.
============================================================
The ACF and PACF plots should show fewer significant lags compared to naive OLS, since the ARIMAX model has absorbed most of the autocorrelation structure into the error model. The Ljung-Box test p-values should be higher (less evidence of remaining autocorrelation), indicating that the ARIMAX specification has successfully captured the temporal dependence in the errors. Note that since the true data has AR(2) structure and we fit ARIMA(1,0,0), there may still be some residual correlation at lag 2—but overall performance should be substantially improved compared to naive OLS.
Impulse Response Function#
The impulse response function visualizes how a one-unit increase in communication intensity affects water consumption dynamically over time, accounting for the adstock effect:
fig = result_arimax.plot_irf("comm_intensity", max_lag=8)
plt.show()
The IRF shows the immediate effect (week 0) and the decaying carryover effects in subsequent weeks. The estimated half-life indicates how quickly the communication effect dissipates. This is the same transform structure as the HAC model, since both models use the same estimated adstock parameters.
Counterfactual Effect Estimation#
We can estimate the total causal effect of the communications policy by comparing observed outcomes to a counterfactual scenario where communications were never implemented:
# Compute counterfactual effect (zero communications for entire period)
effect_arimax = result_arimax.effect(
window=(df.index[0], df.index[-1]),
channels=["comm_intensity"],
scale=0.0,
)
# Visualize the effect
fig, ax = result_arimax.plot_effect(effect_arimax)
plt.show()
# Print summary
print(f"\n{'=' * 60}")
print("COUNTERFACTUAL EFFECT SUMMARY (ARIMAX)")
print(f"{'=' * 60}")
print(f"Total reduction in water consumption: {effect_arimax['total_effect']:.0f} ML")
print(f"Average weekly reduction: {effect_arimax['mean_effect']:.0f} ML")
print(f"Analysis period: {len(df)} weeks (3 years)")
print(f"{'=' * 60}")
============================================================
COUNTERFACTUAL EFFECT SUMMARY (ARIMAX)
============================================================
Total reduction in water consumption: -31209 ML
Average weekly reduction: -200 ML
Analysis period: 156 weeks (3 years)
============================================================
The top panel shows observed water consumption vs what would have occurred without any communications (counterfactual). The difference between these lines represents the causal effect of the policy. The bottom panel shows the cumulative effect over time, which quantifies the total water savings achieved through the communications program. The ARIMAX estimates should be very similar to the HAC estimates, since both models are fitting the same underlying causal structure—they differ only in how they account for error autocorrelation.
# Visualize the counterfactual analysis
fig, ax = result_estimated.plot_effect(effect_result)
plt.show()
print(
f"\nThe communications policy saved approximately {-effect_result['total_effect']:.0f} ML of water "
f"over the 3-year period, representing a {-100 * effect_result['mean_effect'] / df['water_consumption'].mean():.1f}% "
f"reduction in average consumption."
)
The communications policy saved approximately 30821 ML of water over the 3-year period, representing a 3.3% reduction in average consumption.
Comparison: HAC vs ARIMAX#
Let’s compare the two approaches side-by-side to understand their differences:
# Extract treatment coefficient and standard errors from both models
n_baseline = len(result_estimated.baseline_labels)
treatment_idx = n_baseline
# HAC model
coef_hac = result_estimated.theta_treatment[0]
se_hac = result_estimated.ols_result.bse[treatment_idx]
ci_hac_lower = coef_hac - 1.96 * se_hac
ci_hac_upper = coef_hac + 1.96 * se_hac
# ARIMAX model
coef_arimax = result_arimax.theta_treatment[0]
se_arimax = result_arimax.ols_result.bse[treatment_idx]
ci_arimax_lower = coef_arimax - 1.96 * se_arimax
ci_arimax_upper = coef_arimax + 1.96 * se_arimax
# Create comparison table
comparison_data = {
"Method": ["HAC", "ARIMAX"],
"Coefficient": [f"{coef_hac:.2f}", f"{coef_arimax:.2f}"],
"Std Error": [f"{se_hac:.2f}", f"{se_arimax:.2f}"],
"95% CI": [
f"[{ci_hac_lower:.2f}, {ci_hac_upper:.2f}]",
f"[{ci_arimax_lower:.2f}, {ci_arimax_upper:.2f}]",
],
"CI Width": [
f"{ci_hac_upper - ci_hac_lower:.2f}",
f"{ci_arimax_upper - ci_arimax_lower:.2f}",
],
}
comparison_df = pd.DataFrame(comparison_data)
print("=" * 80)
print("COMPARISON: HAC vs ARIMAX")
print("=" * 80)
print(comparison_df.to_string(index=False))
print("=" * 80)
print()
print("KEY OBSERVATIONS:")
print(f"• Coefficients are similar: HAC={coef_hac:.2f}, ARIMAX={coef_arimax:.2f}")
print(f"• SE ratio (ARIMAX/HAC): {se_arimax / se_hac:.3f}")
if se_arimax < se_hac:
print(
f"• ARIMAX has {(1 - se_arimax / se_hac) * 100:.1f}% smaller SE (more efficient when correctly specified)"
)
else:
print(
f"• HAC has {(1 - se_hac / se_arimax) * 100:.1f}% smaller SE (may indicate ARIMAX misspecification)"
)
print("• Both models give similar inference about the treatment effect")
================================================================================
COMPARISON: HAC vs ARIMAX
================================================================================
Method Coefficient Std Error 95% CI CI Width
HAC -1132.08 91.45 [-1311.32, -952.84] 358.49
ARIMAX -1142.19 91.15 [-1320.83, -963.54] 357.29
================================================================================
KEY OBSERVATIONS:
• Coefficients are similar: HAC=-1132.08, ARIMAX=-1142.19
• SE ratio (ARIMAX/HAC): 0.997
• ARIMAX has 0.3% smaller SE (more efficient when correctly specified)
• Both models give similar inference about the treatment effect
# Visualize confidence interval comparison
fig, ax = plt.subplots(figsize=(10, 4))
y_pos = [0, 1]
labels = ["HAC\n(Robust)", "ARIMAX\n(Efficient)"]
cis = [(ci_hac_lower, ci_hac_upper), (ci_arimax_lower, ci_arimax_upper)]
colors = ["#3498db", "#e74c3c"] # Blue for HAC, Red for ARIMAX
for i, (label, ci, color) in enumerate(zip(labels, cis, colors)):
# Plot confidence interval
ax.plot([ci[0], ci[1]], [y_pos[i], y_pos[i]], color=color, linewidth=8, alpha=0.6)
# Plot point estimate
coef = coef_hac if i == 0 else coef_arimax
ax.plot(
coef,
y_pos[i],
"o",
color=color,
markersize=12,
markeredgecolor="black",
markeredgewidth=1.5,
zorder=10,
)
# Add CI text
ax.text(
ci[1] + 15,
y_pos[i],
f" [{ci[0]:.0f}, {ci[1]:.0f}]",
va="center",
fontsize=10,
color=color,
fontweight="bold",
)
# Add true value line
ax.axvline(
theta_true,
color="black",
linestyle="--",
linewidth=2,
alpha=0.8,
zorder=5,
label=f"True value: {theta_true:.0f}",
)
ax.set_yticks(y_pos)
ax.set_yticklabels(labels, fontsize=11)
ax.set_xlabel("Treatment Effect Coefficient", fontsize=12, fontweight="bold")
ax.set_title(
"Confidence Interval Comparison: HAC vs ARIMAX\n"
"ARIMAX typically has narrower intervals when correctly specified",
fontsize=12,
fontweight="bold",
)
ax.legend(loc="upper right", fontsize=10)
ax.grid(True, alpha=0.3, axis="x")
plt.tight_layout()
plt.show()
print("\n📊 INTERPRETATION:")
print("Both methods capture the true treatment effect, but ARIMAX provides narrower")
print("confidence intervals (more precise estimates) because it explicitly models the")
print(
"autocorrelation structure. HAC is more conservative but doesn't require specification."
)
📊 INTERPRETATION:
Both methods capture the true treatment effect, but ARIMAX provides narrower
confidence intervals (more precise estimates) because it explicitly models the
autocorrelation structure. HAC is more conservative but doesn't require specification.
Decision Guide: Which Error Model to Use?#
Here’s a practical guide for choosing between HAC and ARIMAX:
Criterion |
HAC (Default) |
ARIMAX |
|---|---|---|
Ease of use |
✅ No specification needed |
⚠️ Must choose p, d, q orders |
Robustness |
✅ Works with any autocorrelation |
⚠️ Sensitive to misspecification |
Efficiency |
⚠️ Wider confidence intervals |
✅ Narrower CIs when correct |
Sample size |
✅ Works well with any size |
⚠️ Needs moderate-to-large N |
Computational cost |
✅ Very fast (closed-form OLS) |
⚠️ Slower (iterative ML) |
Outlier sensitivity |
✅ Relatively robust |
⚠️ More sensitive |
Diagnostics required |
✅ None (automatic) |
⚠️ ACF/PACF analysis needed |
Classical methodology |
⚠️ Modern approach (1987) |
✅ Box & Tiao (1975) |
Recommended Decision Tree:
Start with HAC (default): Robust, requires no specification, works in all cases
Consider ARIMAX if:
ACF/PACF plots show clear AR or MA structure
Sample size is small (< 50 obs) and you need efficiency
You want to follow classical time series methodology exactly
You plan to forecast future errors
Stick with HAC if:
You’re uncertain about the error structure
ACF/PACF patterns are unclear or complex
You have outliers or structural breaks
You want the simplest, most robust approach
Bottom Line: HAC is the recommended default for most applications. Use ARIMAX only when you have strong evidence for a specific ARIMA structure and are comfortable with the added complexity and assumptions.