Input Validator (diagnostics.input_validator)

Utilities for validating input data before modelling. Each check raises an error on failure, so data-quality problems surface before model training starts.

Key checks

  • check_nans(dataframe, target_col, media_cols, control_cols)

    • Verifies that the target, media, and control columns contain no NaN values.

    • Raises ValueError with a list of offending columns.

  • check_duplicate_columns(dataframe)

    • Ensures column names are unique.

    • Raises ValueError if duplicates are found.

  • check_date_column(date_series, config)

    • Validates chronology, frequency, missing dates, and weekly start day.

    • Parses the series with config.get('date_format') when a format is provided.

    • Raises ValueError if the dates are unsorted, irregularly spaced, or contain gaps.

  • check_column_variance(dataframe, columns, check_zeros_only=False)

    • Detects columns with zero variance (or, when check_zeros_only=True, only columns that are entirely zero).

    • Raises ValueError listing columns with issues.
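
The checks above can be approximated with a few pandas calls. The sketch below is not the module's implementation: the sketch_* function names, the error messages, and the exact rules are assumptions made for illustration. In particular, the date sketch only covers sort order and even spacing. It does not validate the weekly start day, and it would flag calendar-month data because months differ in length.

```python
import pandas as pd


def sketch_check_nans(dataframe, columns):
    """Raise ValueError naming every column in `columns` that contains NaN."""
    offending = [c for c in columns if dataframe[c].isna().any()]
    if offending:
        raise ValueError(f"NaN values found in columns: {offending}")


def sketch_check_duplicate_columns(dataframe):
    """Raise ValueError if any column name appears more than once."""
    dupes = dataframe.columns[dataframe.columns.duplicated()].unique().tolist()
    if dupes:
        raise ValueError(f"Duplicate column names: {dupes}")


def sketch_check_column_variance(dataframe, columns, check_zeros_only=False):
    """Raise ValueError for constant columns, or for all-zero columns only."""
    if check_zeros_only:
        bad = [c for c in columns if (dataframe[c] == 0).all()]
    else:
        # A column with at most one distinct value has zero variance.
        bad = [c for c in columns if dataframe[c].nunique(dropna=False) <= 1]
    if bad:
        raise ValueError(f"Columns with no usable variation: {bad}")


def sketch_check_date_column(date_series, config):
    """Raise ValueError if dates are unsorted or not evenly spaced."""
    # format=None lets pandas infer the format, mirroring an unset date_format.
    dates = pd.to_datetime(date_series, format=config.get("date_format"))
    if not dates.is_monotonic_increasing:
        raise ValueError("Dates are not sorted in ascending order")
    # More than one distinct step size means irregular spacing or a gap.
    steps = dates.diff().dropna().unique()
    if len(steps) > 1:
        raise ValueError(f"Irregular spacing or gaps; step sizes: {list(steps)}")
```

Collecting every offending column before raising, rather than stopping at the first one, matches the documented behaviour of listing all columns with issues in a single ValueError.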

Usage example

import pandas as pd
from src.diagnostics.input_validator import (
    check_nans,
    check_duplicate_columns,
    check_date_column,
    check_column_variance,
)

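# Example inputs (illustrative values only; replace with your own data)
df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=4, freq="W-MON"),  # weekly, Mondays
    "revenue": [120.0, 135.0, 128.0, 150.0],
    "tv_spend": [10.0, 12.0, 9.0, 14.0],
    "search_spend": [5.0, 6.0, 5.5, 7.0],
    "price": [9.99, 9.99, 10.49, 10.49],
    "competitor_index": [1.00, 1.02, 0.98, 1.05],
})
config = {"date_format": None}  # None: no explicit date format is provided
media_cols = ["tv_spend", "search_spend"]
control_cols = ["price", "competitor_index"]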

# 1) Duplicate columns
check_duplicate_columns(df)

# 2) Date column integrity
check_date_column(df["date"], config)

# 3) NaNs across core columns
check_nans(df, target_col="revenue", media_cols=media_cols, control_cols=control_cols)

# 4) Zero-variance checks
check_column_variance(df, columns=media_cols + control_cols, check_zeros_only=False)

Notes

  • Each validator prints a structured status message. On failure it raises ValueError, so a pipeline that calls the validators fails fast.

  • For MMM-specific diagnostics (stationarity, VIF, transfer entropy), see the Pre‑Diagnostics Guide.