Pre-Diagnostics Guide

Overview

The pre-diagnostics module provides automated validation of MMM inputs before model fitting. These tests help identify potential data quality issues that could affect model reliability.

Pre-diagnostics run automatically by default as part of the standard AMMM pipeline, during the DATA EXPLORATION phase.

What is Tested

1. Stationarity Tests (ADF + KPSS)

Purpose: Assess whether the dependent variable (target) exhibits stationarity or a unit root.

Why it matters: Non-stationary time series can lead to spurious correlations and unreliable inference in regression models.

Tests performed:

  • Augmented Dickey-Fuller (ADF): Tests the null hypothesis of a unit root

  • Kwiatkowski-Phillips-Schmidt-Shin (KPSS): Tests the null hypothesis of stationarity

Interpretation:

| ADF Result | KPSS Result | Conclusion |
|---|---|---|
| Reject H₀ (p < 0.05) | Fail to reject H₀ (p ≥ 0.05) | Likely stationary |
| Fail to reject H₀ (p ≥ 0.05) | Reject H₀ (p < 0.05) | Likely unit root |
| Other combinations | Other combinations | Inconclusive |
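The combined decision rule in the table above can be sketched as a small helper. This is illustrative only; `stationarity_conclusion` here is not the module's API, just a transcription of the table:

```python
def stationarity_conclusion(adf_pvalue, kpss_pvalue, alpha=0.05):
    """Combine ADF and KPSS p-values into a single conclusion.
    Illustrative helper, not the module's API."""
    adf_rejects = adf_pvalue < alpha    # ADF rejects the unit-root null
    kpss_rejects = kpss_pvalue < alpha  # KPSS rejects the stationarity null
    if adf_rejects and not kpss_rejects:
        return "likely stationary"
    if not adf_rejects and kpss_rejects:
        return "likely unit root"
    return "inconclusive"
```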

Remediation if unit root detected:

  • First differencing: Δy_t = y_t - y_{t-1}

  • Detrending: Remove linear or polynomial trends

  • Log transformation: For multiplicative trends
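As an illustration, the first and third remediations can be applied in a couple of lines with pandas/NumPy (the series below is made-up demo data, not pipeline output):

```python
import numpy as np
import pandas as pd

# Made-up weekly target series with an upward trend (illustrative only)
y = pd.Series([100, 104, 109, 115, 122, 130], name="sales")

# First differencing: delta_y_t = y_t - y_{t-1}
diff_y = y.diff().dropna()

# Log transformation, appropriate for multiplicative trends
# (requires strictly positive values)
log_y = np.log(y)
```

Re-run the stationarity tests on the transformed series before fitting.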

Note: By design, only the target variable is tested for stationarity. There is no requirement for regressors to be stationary in typical MMM applications.

2. Variance Inflation Factor (VIF)

Purpose: Detect multicollinearity among regressors (media spend channels + control variables).

Why it matters: High multicollinearity inflates coefficient variance, making it difficult to isolate individual channel effects.

Interpretation:

| VIF Value | Severity | Action |
|---|---|---|
| VIF < 5 | Low multicollinearity | No action needed |
| 5 ≤ VIF < 10 | Moderate multicollinearity | Monitor closely |
| VIF ≥ 10 | High multicollinearity | Flagged; consider remediation |

Remediation if high VIF detected:

  • Remove or combine highly correlated channels

  • Principal Component Analysis (PCA) on correlated features

  • Ridge regression or other regularisation techniques

  • Domain knowledge to select most important variables

Additional metrics:

  • Tolerance (1/VIF): Lower values indicate higher multicollinearity

  • Correlation matrix (max): Highest pairwise correlation for each variable
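For intuition, the VIF of column j is 1 / (1 − R²_j), where R²_j comes from regressing column j on the remaining regressors. A minimal NumPy sketch of that definition (not the module's implementation, which may differ):

```python
import numpy as np

def vif_scores(X):
    """VIF per column of design matrix X (n_samples x n_features).
    Minimal NumPy sketch; names are illustrative, not the module's API."""
    n, k = X.shape
    vifs = np.empty(k)
    for j in range(k):
        target = X[:, j]
        # Regress column j on the other columns plus an intercept
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / ((target - target.mean()) ** 2).sum()
        vifs[j] = 1.0 / (1.0 - r2)  # tolerance is simply 1 / VIF
    return vifs
```

Two independent columns yield VIFs near 1; two nearly identical columns yield very large VIFs.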

3. Transfer Entropy

Purpose: Detect directional information flow between media channels (X) and the target variable (Y).

Why it matters: Transfer entropy provides a non-linear, model-free measure of predictive relationships, complementing traditional correlation analysis.

What is computed:

  • TE(X→Y): Information flow from channel X to target Y

  • TE(Y→X): Information flow from target Y to channel X

  • p-values: Statistical significance via permutation test (200 permutations by default)

Direction classification:

| Condition | Direction | Interpretation |
|---|---|---|
| TE(X→Y) significant AND TE(X→Y) > TE(Y→X) | x→y | X likely predicts Y |
| TE(Y→X) significant AND TE(Y→X) > TE(X→Y) | y→x | Y likely predicts X (reverse causality?) |
| Both significant | bidirectional | Mutual predictive relationship |
| Neither significant | none | No strong directional relationship |
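One plausible reading of this classification as code (illustrative only; the module's exact tie-breaking rules may differ):

```python
def classify_direction(te_xy, te_yx, p_xy, p_yx, alpha=0.05):
    """Classify directional information flow from pairwise TE results.
    Illustrative reading of the table above, not the module's API."""
    sig_xy, sig_yx = p_xy < alpha, p_yx < alpha
    if sig_xy and sig_yx:
        return "bidirectional"
    if sig_xy and te_xy > te_yx:
        return "x→y"
    if sig_yx and te_yx > te_xy:
        return "y→x"
    return "none"
```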

Important Caveats:

⚠️ This implementation uses pairwise (unconditional) transfer entropy

  • Does NOT control for confounding variables

  • Cannot establish true causality

  • May detect spurious relationships due to common drivers

⚠️ Interpretation guidance:

  • Use TE as an exploratory tool, not confirmatory evidence

  • Significant TE(X→Y) suggests X may have predictive value for Y

  • Always combine with domain knowledge and theoretical understanding

  • For rigorous causal analysis, consider conditional TE or structural models

Optional: Include control variables in TE analysis by setting te_include_controls_in_x=True in the orchestrator function.
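For readers who want intuition for the estimator, here is a minimal plug-in estimate of lag-1 binned transfer entropy in NumPy. This is a sketch only; the module's estimator, binning scheme, and permutation test may differ:

```python
import numpy as np

def binned_te(x, y, bins=8):
    """Plug-in estimate of TE(X -> Y) at lag 1 with equal-width bins.
    Illustrative sketch; not the module's estimator."""
    # Discretise each series into `bins` equal-width bins (labels 0..bins-1)
    xb = np.digitize(x, np.histogram_bin_edges(x, bins=bins)[1:-1])
    yb = np.digitize(y, np.histogram_bin_edges(y, bins=bins)[1:-1])
    y_next, y_now, x_now = yb[1:], yb[:-1], xb[:-1]

    # Joint counts over (y_{t+1}, y_t, x_t)
    joint = np.zeros((bins, bins, bins))
    for a, b, c in zip(y_next, y_now, x_now):
        joint[a, b, c] += 1
    p = joint / joint.sum()

    p_ab = p.sum(axis=2)      # p(y_{t+1}, y_t)
    p_bc = p.sum(axis=0)      # p(y_t, x_t)
    p_b = p.sum(axis=(0, 2))  # p(y_t)

    # TE(X->Y) = sum over (a,b,c) of p(a,b,c) * log[ p(a|b,c) / p(a|b) ]
    te = 0.0
    for a in range(bins):
        for b in range(bins):
            for c in range(bins):
                if p[a, b, c] > 0:
                    te += p[a, b, c] * np.log(
                        p[a, b, c] * p_b[b] / (p_ab[a, b] * p_bc[b, c])
                    )
    return te
```

A permutation p-value would repeat this computation with x shuffled in time and compare the observed TE against the resulting null distribution.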

Output Files

All diagnostics save results to results/csv/.

For the complete specification of each CSV (column names and meanings), see the Reference Output Files page:

1. stationarity_summary.csv

| Column | Description |
|---|---|
| variable | Variable name (target column) |
| adf_stat | ADF test statistic |
| adf_pvalue | ADF p-value |
| adf_usedlag | Number of lags used in ADF test |
| adf_nobs | Number of observations used |
| kpss_stat | KPSS test statistic |
| kpss_pvalue | KPSS p-value |
| kpss_lags | Number of lags used in KPSS test |
| adf_stationary | Boolean: ADF rejects unit root (p < 0.05) |
| kpss_nonstationary | Boolean: KPSS rejects stationarity (p < 0.05) |
| stationarity_conclusion | Combined interpretation |

See reference: stationarity_summary.csv

2. vif_summary.csv

| Column | Description |
|---|---|
| variable | Variable name |
| vif | Variance Inflation Factor |
| tolerance | 1 / VIF |
| corr_max | Maximum absolute pairwise correlation |
| flag_high_vif | Boolean: VIF > 10 |

See reference: vif_summary.csv

3. transfer_entropy_summary.csv

| Column | Description |
|---|---|
| variable | Predictor variable name |
| te_x_to_y | Transfer entropy from X to Y |
| te_y_to_x | Transfer entropy from Y to X |
| p_x_to_y | p-value for X→Y |
| p_y_to_x | p-value for Y→X |
| significant_x_to_y | Boolean: p_x_to_y < 0.05 |
| significant_y_to_x | Boolean: p_y_to_x < 0.05 |
| direction | Directional classification |

See reference: transfer_entropy_summary.csv

Quick read example

import pandas as pd

stationarity = pd.read_csv('results/csv/stationarity_summary.csv')
vif = pd.read_csv('results/csv/vif_summary.csv')
te = pd.read_csv('results/csv/transfer_entropy_summary.csv')

print(stationarity.head())
print(vif.sort_values('vif', ascending=False).head())
print(te.head())

Integration

Automatic Execution

Pre-diagnostics run automatically when you execute:

python runme.py

The diagnostics execute during the DATA EXPLORATION phase, after media spend visualisations and before model fitting.

Programmatic Usage

You can also run diagnostics independently:

from src.diagnostics.pre_diagnostics import run_all_pre_diagnostics
import pandas as pd

# Load your data and config
data = pd.read_csv('your_data.csv')
config = {
    'date_col': 'date',
    'target_col': 'sales',
    'media': [
        {'display_name': 'TV', 'spend_col': 'tv_spend'},
        {'display_name': 'Digital', 'spend_col': 'digital_spend'}
    ],
    'extra_features_cols': ['price', 'competitor_activity']
}

# Run all diagnostics
result_paths = run_all_pre_diagnostics(
    data=data,
    config=config,
    results_dir='results'
)

# Print saved file paths
for filename, path in result_paths.items():
    print(f"{filename}: {path}")

Individual Tests

You can run tests individually for more control:

from src.diagnostics.pre_diagnostics import (
    run_stationarity_tests,
    run_vif_tests,
    run_transfer_entropy
)

# Stationarity test on target only
stationarity_df = run_stationarity_tests(
    data=data,
    date_col='date',
    cols=['sales']
)

# VIF test on regressors
vif_df = run_vif_tests(
    data=data,
    cols=['tv_spend', 'digital_spend', 'price']
)

# Transfer entropy
te_df = run_transfer_entropy(
    data=data,
    date_col='date',
    x_cols=['tv_spend', 'digital_spend'],
    y_col='sales',
    permutations=200  # Configurable
)

Advanced Configuration

# Include controls in transfer entropy analysis
result_paths = run_all_pre_diagnostics(
    data=data,
    config=config,
    results_dir='results',
    te_include_controls_in_x=True,  # Test controls → target
    te_kwargs={'permutations': 500, 'bins': 10}  # Custom TE settings
)

# Custom stationarity test settings
result_paths = run_all_pre_diagnostics(
    data=data,
    config=config,
    results_dir='results',
    stationarity_kwargs={
        'adf_regression': 'ct',  # Include trend in ADF
        'kpss_regression': 'ct'  # Include trend in KPSS
    }
)

Error Handling

The pre-diagnostics module is designed to be non-fatal:

  • If a diagnostic fails, it writes error information to the CSV

  • The pipeline continues with model fitting

  • Warnings are logged for non-critical issues (e.g., constant series, insufficient data)

Performance Considerations

  • Transfer Entropy is the most computationally expensive test

  • Default settings: 200 permutations, 8 bins

  • For large datasets or many channels, consider:

    • Reducing permutations (min 50 for exploratory analysis)

    • Running TE separately on a subset of channels

    • Using fewer bins for discretisation

Typical runtime for demo data (~80 weeks, 7 channels):

  • Stationarity: <1 second

  • VIF: <1 second

  • Transfer Entropy: 10-30 seconds

Best Practices

  1. Always review stationarity results: Non-stationary targets can invalidate regression assumptions

  2. Flag high VIF early: Multicollinearity issues are easier to address before model fitting

  3. Use TE as exploratory tool: Complement with domain knowledge and economic theory

  4. Document findings: Keep notes on which diagnostics flagged issues and how you addressed them

  5. Iterate: Run diagnostics again after data transformations or feature engineering

References

  • Stationarity: Dickey & Fuller (1979); Kwiatkowski et al. (1992)

  • VIF: Marquardt (1970); O’Brien (2007)

  • Transfer Entropy: Schreiber (2000); Bossomaier et al. (2016)

Limitations and Future Extensions

Current limitations:

  • TE is pairwise (unconditional) only

  • No automatic remediation suggestions

  • Fixed significance threshold (α = 0.05)

Planned extensions:

  • Conditional transfer entropy (control for confounders)

  • Multivariate TE

  • Automated data transformation recommendations

  • Time-varying diagnostics (rolling window analysis)


For questions or issues, please consult the main AMMM documentation or raise an issue on GitHub.