pre_diagnostics

Pre-diagnostic tests for validating marketing mix modeling (MMM) inputs.

This module provides three key diagnostics:

- Stationarity tests (ADF + KPSS) for the dependent variable
- VIF (Variance Inflation Factor) for multicollinearity among regressors
- Transfer Entropy for detecting directional information flow

All diagnostics are designed to run before model fitting to validate input data.

Module Contents

pre_diagnostics.run_stationarity_tests(data: pandas.DataFrame, date_col: str, cols: List[str], *, kpss_regression: str = 'c', kpss_nlags: str | int | None = 'auto', adf_maxlag: int | None = None, adf_regression: str = 'c', dropna: bool = True) → pandas.DataFrame

Run ADF and KPSS stationarity tests on specified columns.

Parameters:

data – DataFrame containing the time series data
date_col – Name of the date column for sorting
cols – List of column names to test (typically just the target variable)
kpss_regression – KPSS regression type - “c” (constant) or “ct” (constant+trend)
kpss_nlags – Number of lags for KPSS test - “auto”, integer, or None
adf_maxlag – Maximum lag for ADF test - integer or None (auto)
adf_regression – ADF regression type - “c”, “ct”, “ctt”, or “nc”
dropna – Whether to drop NA values before testing

Returns:

DataFrame with columns:

variable: variable name
adf_stat, adf_pvalue, adf_usedlag, adf_nobs: ADF test results
kpss_stat, kpss_pvalue, kpss_lags: KPSS test results
adf_stationary: boolean (adf_pvalue < 0.05, ADF rejects a unit root)
kpss_nonstationary: boolean (kpss_pvalue < 0.05, KPSS rejects stationarity)
stationarity_conclusion: “likely stationary”, “likely unit root”, or “inconclusive”

Return type:

pandas.DataFrame
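The reference does not spell out how the two flags are combined into stationarity_conclusion. A minimal sketch of one plausible rule is given here, assuming the usual reading of the two tests: ADF has a unit root as its null hypothesis, KPSS has stationarity as its null, and disagreement between them is reported as inconclusive. The helper name and the alpha parameter are hypothetical and are not part of the module.

```python
def stationarity_conclusion(adf_pvalue: float, kpss_pvalue: float,
                            alpha: float = 0.05) -> str:
    """Hypothetical helper: combine ADF and KPSS p-values into one label.

    Assumed rule, not confirmed by the module's documentation.
    """
    # ADF null: the series has a unit root. Rejection supports stationarity.
    adf_stationary = adf_pvalue < alpha
    # KPSS null: the series is stationary. Rejection argues against stationarity.
    kpss_nonstationary = kpss_pvalue < alpha

    if adf_stationary and not kpss_nonstationary:
        return "likely stationary"
    if not adf_stationary and kpss_nonstationary:
        return "likely unit root"
    # The two tests disagree, or neither rejects its null.
    return "inconclusive"


labels = [
    stationarity_conclusion(0.01, 0.10),  # ADF rejects, KPSS does not
    stationarity_conclusion(0.50, 0.01),  # ADF does not reject, KPSS rejects
    stationarity_conclusion(0.01, 0.01),  # both reject
    stationarity_conclusion(0.50, 0.10),  # neither rejects
]
```

Running both tests is useful because their null hypotheses are opposite, so agreement between them is stronger evidence than either result alone.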

pre_diagnostics.run_vif_tests(data: pandas.DataFrame, cols: List[str], *, include_constant: bool = True, dropna: str = 'pairwise') → pandas.DataFrame

Compute Variance Inflation Factors to assess multicollinearity.

Parameters:

data – DataFrame containing the regressor data
cols – List of column names to test (media + control variables)
include_constant – Whether to add a constant term for VIF calculation
dropna – NA handling strategy, either “pairwise” or “all”:
- “pairwise”: compute each VIF with rows valid for that variable
- “all”: drop rows with any NA across all cols once

Returns:

DataFrame with columns:

variable: variable name
vif: variance inflation factor
tolerance: 1/vif
corr_max: maximum absolute pairwise correlation
flag_high_vif: boolean (vif > 10)

Return type:

pandas.DataFrame
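For orientation, the VIF of a regressor is 1 / (1 - R²), where R² comes from regressing that column on all the other columns, and tolerance is its reciprocal. The sketch below computes this with NumPy on synthetic data. The function name and the data are illustrative assumptions, it always adds an intercept (comparable to include_constant=True), and it is not necessarily how run_vif_tests is implemented.

```python
import numpy as np


def variance_inflation_factors(X: np.ndarray) -> np.ndarray:
    """VIF for each column of X (rows are observations, columns are regressors)."""
    n_obs, n_cols = X.shape
    vifs = np.empty(n_cols)
    for j in range(n_cols):
        target = X[:, j]
        # Regress column j on an intercept plus all the other columns.
        design = np.column_stack([np.ones(n_obs), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        vifs[j] = 1.0 / (1.0 - r_squared)
    return vifs


# Synthetic regressors: c is nearly a copy of a, b is independent of both.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)
X = np.column_stack([a, b, c])

vif = variance_inflation_factors(X)
tolerance = 1.0 / vif
flag_high_vif = vif > 10  # same threshold as the flag_high_vif column
```

In this example the near-duplicate pair (a, c) is flagged while b is not, which is the pattern the flag_high_vif column is meant to surface.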

pre_diagnostics.run_transfer_entropy(data: pandas.DataFrame, date_col: str, x_cols: List[str], y_col: str, *, max_lag: int = 1, bins: int = 8, permutations: int = 200, random_state: int = 42, normalize: bool = True, dropna: bool = True) → pandas.DataFrame

Compute pairwise transfer entropy between X variables and Y.

This implements unconditional (pairwise) transfer entropy using discrete estimation via quantile binning. For each X, it computes both TE(X→Y) and TE(Y→X), each with a permutation-based significance test.

Important Caveat: This is pairwise (unconditional) TE. It does NOT control for confounding variables. For causal interpretation, conditional TE would be required.

Parameters:

data – DataFrame containing the time series data
date_col – Name of the date column for sorting
x_cols – List of predictor column names (e.g., media channels, controls)
y_col – Target variable column name
max_lag – Lag to use for TE calculation (default: 1)
bins – Number of quantile bins for discretization (default: 8)
permutations – Number of permutations for significance test (default: 200)
random_state – Random seed for reproducibility
normalize – Whether to normalize TE values (currently unused; reserved for future use)
dropna – Whether to drop NA values before computation

Returns:

DataFrame with columns:

variable: predictor variable name
te_x_to_y: transfer entropy from X to Y
te_y_to_x: transfer entropy from Y to X
p_x_to_y: p-value for X→Y
p_y_to_x: p-value for Y→X
significant_x_to_y: boolean (p_x_to_y < 0.05)
significant_y_to_x: boolean (p_y_to_x < 0.05)
direction: “x→y”, “y→x”, “bidirectional”, or “none”

Return type:

pandas.DataFrame
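To make the estimator concrete, here is a minimal standard-library sketch of pairwise transfer entropy with quantile binning and a permutation test. It is an illustration under stated assumptions rather than the module's code: the helper names are hypothetical, the surrogate scheme (shuffling the whole source series) is assumed, and the example uses 3 bins instead of the default 8 to keep the plug-in estimate stable on a short synthetic series.

```python
import math
import random
from collections import Counter


def quantile_bins(values, bins):
    """Assign each value a bin index in 0..bins-1 by rank (equal-count bins)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * bins // len(values)
    return labels


def transfer_entropy(x, y, lag=1):
    """Plug-in estimate of TE(X->Y), in bits, for discretized sequences."""
    # Each sample is (y_t, y_{t-lag}, x_{t-lag}).
    samples = [(y[t], y[t - lag], x[t - lag]) for t in range(lag, len(y))]
    n = len(samples)
    joint = Counter(samples)
    y_now_past = Counter((y1, y0) for y1, y0, _ in samples)
    y_past_x_past = Counter((y0, x0) for _, y0, x0 in samples)
    y_past = Counter(y0 for _, y0, _ in samples)
    te = 0.0
    for (y1, y0, x0), count in joint.items():
        # p(y1 | y0, x0) / p(y1 | y0), expressed with raw counts.
        ratio = count * y_past[y0] / (y_past_x_past[(y0, x0)] * y_now_past[(y1, y0)])
        te += count / n * math.log2(ratio)
    return te


def permutation_pvalue(x, y, lag=1, permutations=200, seed=42):
    """Return (observed TE, p-value) using shuffled-source surrogates."""
    rng = random.Random(seed)
    observed = transfer_entropy(x, y, lag)
    shuffled = list(x)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)  # destroys the temporal link between source and target
        if transfer_entropy(shuffled, y, lag) >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (hits + 1) / (permutations + 1)


# Synthetic data: y repeats x one step later, so information flows X -> Y.
data_rng = random.Random(0)
x_raw = [data_rng.gauss(0.0, 1.0) for _ in range(600)]
y_raw = [0.0] + x_raw[:-1]
x_binned = quantile_bins(x_raw, bins=3)
y_binned = quantile_bins(y_raw, bins=3)

te_xy, p_xy = permutation_pvalue(x_binned, y_binned)
te_yx, p_yx = permutation_pvalue(y_binned, x_binned)
```

The plug-in estimate is biased upward when the number of bins is large relative to the sample size, which is one reason to compare the observed value against shuffled surrogates instead of against zero.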

pre_diagnostics.run_all_pre_diagnostics(data: pandas.DataFrame, config: Dict[str, Any], results_dir: str, *, stationarity_cols: List[str] | None = None, vif_cols: List[str] | None = None, te_x_cols: List[str] | None = None, te_y_col: str | None = None, te_include_controls_in_x: bool = False, stationarity_kwargs: Dict[str, Any] | None = None, vif_kwargs: Dict[str, Any] | None = None, te_kwargs: Dict[str, Any] | None = None) → Dict[str, str]

Run all pre-diagnostics and save results to CSV files.

This orchestrator function runs stationarity tests, VIF analysis, and transfer entropy calculations with sensible defaults based on the config.

Default behaviour:

- Stationarity: Test only the target variable (Y)
- VIF: Test media spend columns + control variables
- Transfer Entropy: X = media spend (optionally + controls), Y = target

Parameters:

data – DataFrame containing the processed data
config – Configuration dictionary with keys:
- date_col: date column name (default: “date”)
- target_col: target variable name (default: “KPI”)
- media: list of dicts, each with a ‘spend_col’ key
- extra_features_cols: list of control variable names
results_dir – Directory to save CSV outputs
stationarity_cols – Override columns for stationarity testing (default: [target_col])
vif_cols – Override columns for VIF testing (default: media + controls)
te_x_cols – Override X columns for TE (default: media, optionally + controls)
te_y_col – Override Y column for TE (default: target_col)
te_include_controls_in_x – Whether to include controls in TE X variables
stationarity_kwargs – Additional kwargs for run_stationarity_tests
vif_kwargs – Additional kwargs for run_vif_tests
te_kwargs – Additional kwargs for run_transfer_entropy

Returns:

Dictionary mapping CSV filenames to their full paths
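The documented defaults can be read as a simple mapping from config to column lists. The sketch below shows that mapping for an example config. The helper resolve_default_columns and the column names (tv_spend, search_spend, price_index, sales) are hypothetical and exist only for illustration; the orchestrator's internal logic may differ.

```python
from typing import Any, Dict


def resolve_default_columns(config: Dict[str, Any],
                            te_include_controls_in_x: bool = False) -> Dict[str, Any]:
    """Hypothetical helper mirroring the documented default column choices."""
    target = config.get("target_col", "KPI")
    media = [m["spend_col"] for m in config.get("media", [])]
    controls = list(config.get("extra_features_cols", []))
    return {
        "date_col": config.get("date_col", "date"),
        "stationarity_cols": [target],   # only the target variable
        "vif_cols": media + controls,    # media spend + controls
        "te_x_cols": media + controls if te_include_controls_in_x else media,
        "te_y_col": target,
    }


# Example config using the documented keys; the values are made up.
config = {
    "date_col": "week",
    "target_col": "sales",
    "media": [{"spend_col": "tv_spend"}, {"spend_col": "search_spend"}],
    "extra_features_cols": ["price_index"],
}

defaults = resolve_default_columns(config)
with_controls = resolve_default_columns(config, te_include_controls_in_x=True)
```

Passing stationarity_cols, vif_cols, te_x_cols, or te_y_col to run_all_pre_diagnostics overrides the corresponding default, so a call such as run_all_pre_diagnostics(data, config, results_dir) with no overrides would be expected to use column lists like the ones computed here.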