Time Series ML Simulation

Two end-to-end time-series machine learning modeling scenarios built on synthetic data:

RMA Freight Forecasting — GRU recurrent network and Facebook Prophet for shipping weight prediction
Network Event Classification — K-means clustering with Latent Semantic Analysis (LSA) for unsupervised log categorization

Both pipelines run from data generation through model training, evaluation, and a FastAPI serving layer.

Blog Post

For a detailed walkthrough of the design decisions, code architecture, and results from this project, read the companion blog post:

Someone Else Owns my Best Code, So I Wrote It All Again: covers the full lifecycle from synthetic data generation through GRU and Prophet model training to the FastAPI serving layer, with code snippets and references throughout.

Quick Start

# Install dependencies
pip install -r requirements.txt

# Run the complete pipeline
python scripts/01_generate_data.py
python scripts/02_explore_data.py
python scripts/03_train_rma_model.py
python scripts/03b_train_rma_prophet.py
python scripts/04_train_clustering_model.py
python scripts/05_evaluate_models.py
python scripts/06_export_for_serving.py

# Start API server (optional)
uvicorn mlops.serving.app:app --host 0.0.0.0 --port 8000

Dataset Variants

The data generation script supports presets for creating different dataset configurations:

# List available presets
python scripts/01_generate_data.py --list-presets

# Generate a named variant from a preset
python scripts/01_generate_data.py -n small -p small --save-config

# Override specific parameters via CLI
python scripts/01_generate_data.py -n custom --rma-rows 10000 --event-rows 5000

# Train on a named variant
python scripts/03_train_rma_model.py -d small
python scripts/05_evaluate_models.py -d small

Preset	RMA Rows	Event Rows	Description
`baseline`	50,000	30,000	Default configuration
`small`	5,000	3,000	Quick iteration and testing
`high_anomaly`	50,000	30,000	Elevated anomaly rates
`imbalanced`	50,000	30,000	Skewed class distributions
`noisy`	50,000	30,000	Added noise to features
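
The way a preset and CLI overrides combine can be pictured as a dictionary merge. The sketch below is only an illustration: `PRESETS` and `resolve_config` are hypothetical names, the row counts are copied from the table above, and the real script reads its presets from `config/presets/` and may resolve them differently.

```python
from typing import Optional

# Row counts copied from the preset table; the structure itself is an assumption.
PRESETS = {
    "baseline": {"rma_rows": 50_000, "event_rows": 30_000},
    "small": {"rma_rows": 5_000, "event_rows": 3_000},
}


def resolve_config(
    preset: str = "baseline",
    rma_rows: Optional[int] = None,
    event_rows: Optional[int] = None,
) -> dict:
    """Start from a preset, then let explicitly passed values win."""
    config = dict(PRESETS[preset])  # copy so the preset itself is never mutated
    overrides = {"rma_rows": rma_rows, "event_rows": event_rows}
    config.update({key: value for key, value in overrides.items() if value is not None})
    return config


print(resolve_config("small"))
print(resolve_config(rma_rows=10_000, event_rows=5_000))
```

Copying the preset before updating it keeps repeated calls independent of each other.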

Synthetic Data

Both datasets are generated from scratch by 01_generate_data.py with controlled random seeds for reproducibility. See data/README.md for full column schemas and statistics.

RMA Shipping Data (`data/raw/rma_shipping_data.csv`)

~50,000 records over 2 years simulating RMA shipping activity across 5 regions and 15 SKU categories. Embedded patterns include weekly seasonality (lower weekend volume), month/quarter-end spikes, regional growth trends (APAC +8%/year), and urgency-correlated shipping methods.
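
As a rough illustration of how such patterns can be layered onto a seeded series, the sketch below multiplies a base daily volume by a growth factor, a weekend factor, and a month-end factor. Only the +8%/year growth figure comes from the description above; `daily_volume`, the base level, the 0.6 and 1.4 multipliers, the three-day month-end window, and the noise level are all assumptions, and the actual generator in `src/data_generation/` may work differently.

```python
import calendar
import random
from datetime import date, timedelta


def daily_volume(day: date, start: date, rng: random.Random,
                 base: float = 70.0, annual_growth: float = 0.08) -> float:
    """Expected record count for one day (all multipliers are illustrative)."""
    years_elapsed = (day - start).days / 365.0
    level = base * (1.0 + annual_growth) ** years_elapsed  # compounding growth trend
    if day.weekday() >= 5:  # Saturday or Sunday: lower weekend volume
        level *= 0.6
    last_day = calendar.monthrange(day.year, day.month)[1]
    if day.day > last_day - 3:  # final three days of the month: spike
        level *= 1.4
    return max(0.0, level * rng.gauss(1.0, 0.05))  # small multiplicative noise


start = date(2023, 1, 1)
rng = random.Random(42)  # fixed seed so the series is reproducible
series = [daily_volume(start + timedelta(days=i), start, rng) for i in range(730)]
print(f"{len(series)} days, about {sum(series):,.0f} records in total")
```

Passing a dedicated `random.Random` instance, rather than using the module-level functions, keeps the seed local to the generator.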

Network Events Data (`data/raw/network_events.csv`)

~30,000 events over 30 days with 6 ground-truth behavioral clusters (normal web, normal DB, suspicious scan, auth failure, data exfiltration, maintenance). Each cluster has distinct port ranges, duration profiles, byte volumes, and log message vocabulary for TF-IDF/LSA feature extraction.

Driver Scripts

Script	Purpose	Outputs
`01_generate_data.py`	Generate both synthetic datasets	`data/raw/*.csv`
`02_explore_data.py`	Exploratory data analysis and visualization	`outputs/figures/eda_*.png`
`03_train_rma_model.py`	Train GRU models (v1 → v3 progressive)	`outputs/models/rma_gru_v*/`
`03b_train_rma_prophet.py`	Train Prophet models (v1 basic, v2 with regressors)	`outputs/models/rma_prophet_v*/`
`04_train_clustering_model.py`	Train K-means + LSA clustering pipeline	`outputs/models/network_clustering_v1/`
`05_evaluate_models.py`	Cross-model evaluation and comparison	`outputs/figures/eval_*.png`
`06_export_for_serving.py`	Package models for FastAPI inference	`outputs/models/*/metadata.json`

Project Structure

time/
├── .github/
│   ├── workflows/tests.yml            # CI pipeline
│   └── ISSUE_TEMPLATE/                # Bug report & feature request templates
├── config/
│   ├── settings.yaml                  # Centralized configuration
│   └── presets/                       # Data generation presets (baseline, small, etc.)
├── data/
│   ├── raw/                           # Generated synthetic datasets
│   ├── processed/                     # Preprocessed data
│   └── README.md                      # Data dictionary
├── src/
│   ├── data_generation/               # Synthetic data generators
│   ├── preprocessing/                 # Feature engineering pipelines
│   ├── models/                        # Model architectures (GRU, Prophet, K-means, LSA)
│   ├── training/                      # Training pipelines
│   ├── evaluation/                    # Metrics and evaluation
│   ├── visualization/                 # Plotting utilities
│   └── utils/                         # Helper functions
├── mlops/
│   ├── model_registry.py              # Model serialization
│   ├── inference.py                   # Inference engines
│   └── serving/                       # FastAPI application
├── scripts/
│   ├── 01_generate_data.py            # Data generation
│   ├── 02_explore_data.py             # EDA and visualization
│   ├── 03_train_rma_model.py          # GRU model training
│   ├── 03b_train_rma_prophet.py       # Prophet model training
│   ├── 04_train_clustering_model.py   # K-means + LSA training
│   ├── 05_evaluate_models.py          # Model evaluation
│   └── 06_export_for_serving.py       # MLOps preparation
├── tests/
├── outputs/
│   ├── figures/                       # Generated visualizations
│   ├── models/                        # Saved model artifacts
│   └── logs/                          # Training logs
├── LICENSE                            # MIT License
├── CONTRIBUTING.md                    # Contribution guidelines
├── CODE_OF_CONDUCT.md                 # Contributor Covenant
├── CITATION.cff                       # Citation metadata
├── pyproject.toml                     # Project metadata and tool config
├── requirements.txt                   # Production dependencies
└── requirements-dev.txt               # Development dependencies

Scenario 1: RMA Freight Forecasting

Business Problem

Unpredictable freight costs for spare parts shipping lead to over-provisioning or costly emergency air shipments.

Current State: Naive persistence forecasting (last week's average) yields ~114 kg/day MAE.

Goal: Reduce forecast error by 35%+ to enable proactive freight allocation.
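
For reference, a naive persistence baseline and the error target can be written in a few lines. This is a minimal sketch: the function names are hypothetical, the toy weights are made up rather than taken from the project data, and "last week's average" is interpreted here as the mean of the final seven observations.

```python
from statistics import mean


def persistence_forecast(history, horizon=7, window=7):
    """Repeat the mean of the last `window` observations for every future day."""
    level = mean(history[-window:])
    return [level] * horizon


def mae(actual, predicted):
    """Mean absolute error between two equally long sequences."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def improvement_pct(baseline_mae, model_mae):
    """Relative error reduction versus the baseline, in percent."""
    return 100.0 * (baseline_mae - model_mae) / baseline_mae


# Toy daily shipping weights in kg (made-up numbers, not project data).
history = [520, 480, 510, 530, 495, 300, 280]
actual_next_week = [515, 490, 505, 540, 500, 310, 290]

forecast = persistence_forecast(history)
print(f"toy baseline MAE: {mae(actual_next_week, forecast):.1f} kg")
print(f"MAE needed for a 35% reduction from 114 kg: {114 * (1 - 0.35):.1f} kg")
```

With a 114 kg baseline, the 35% goal corresponds to an MAE of about 74 kg or lower.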

GRU Approach

A stacked GRU recurrent network trained with progressive complexity:

Input (30 days × features)
    │
    ▼
Embedding Layers (region, SKU, urgency, method)
    │
    ▼
Stacked GRU (2 layers, 64 units each, dropout=0.3)
    │
    ▼
Layer Normalization
    │
    ▼
Residual Connection (+ avg embeddings)
    │
    ▼
MLP Output (7-day forecast)
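
One way to read the diagram in PyTorch is sketched below. It is not the project's model class: `RMAForecaster`, the embedding size, the number of numeric features, and the urgency and method cardinalities are assumptions (the 5 regions and 15 SKU categories come from the dataset description), and the residual step is interpreted as adding a linear projection of the time-averaged embeddings to the normalized GRU output.

```python
import torch
from torch import nn


class RMAForecaster(nn.Module):
    def __init__(self, n_numeric, cardinalities, emb_dim=8, hidden=64, horizon=7):
        super().__init__()
        # One embedding table per categorical column (region, SKU, urgency, method).
        self.embeddings = nn.ModuleList([nn.Embedding(n, emb_dim) for n in cardinalities])
        emb_total = emb_dim * len(cardinalities)
        self.gru = nn.GRU(n_numeric + emb_total, hidden, num_layers=2,
                          batch_first=True, dropout=0.3)
        self.norm = nn.LayerNorm(hidden)
        self.skip = nn.Linear(emb_total, hidden)  # projects averaged embeddings
        self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, horizon))

    def forward(self, numeric, categorical):
        # numeric: (batch, seq_len, n_numeric); categorical: (batch, seq_len, n_columns) ids
        embedded = torch.cat(
            [emb(categorical[..., i]) for i, emb in enumerate(self.embeddings)], dim=-1
        )
        out, _ = self.gru(torch.cat([numeric, embedded], dim=-1))
        last = self.norm(out[:, -1])                    # normalized final time step
        fused = last + self.skip(embedded.mean(dim=1))  # residual with avg embeddings
        return self.head(fused)                         # (batch, horizon)


cardinalities = [5, 15, 3, 4]  # regions, SKU categories, then assumed urgency/method sizes
model = RMAForecaster(n_numeric=6, cardinalities=cardinalities)
numeric = torch.randn(8, 30, 6)
categorical = torch.stack([torch.randint(0, n, (8, 30)) for n in cardinalities], dim=-1)
print(model(numeric, categorical).shape)  # torch.Size([8, 7])
```

The dropout argument of `nn.GRU` applies between the two stacked layers, which matches a two-layer configuration.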

Prophet Approach

Facebook Prophet with multiplicative seasonality, trained in two versions: a baseline with automatic seasonality detection, and an enhanced version with exogenous regressors (month-end flags, failure rates, average urgency).

Progressive Model Improvement

Version	Changes	MAE (kg)	Improvement
Baseline	Naive persistence	114	—
GRU V1	Simple GRU, numerical only	63	44%
GRU V2	+ Categorical embeddings	63	44%
GRU V3	+ Layer norm, residual, dropout	62	46%
Prophet V1	Seasonality decomposition	—	—
Prophet V2	+ Exogenous regressors	—	—

Key Features

Sequence length: 30 days of history
Prediction horizon: 7 days ahead
Categorical embeddings: Learned dense vectors for region, SKU, urgency, and shipping method
Exogenous features: Repair cycle time, failure rates
Early stopping: Prevents overfitting

Scenario 2: Network Event Classification

Business Problem

Fragmented logs and inconsistent monitoring make real-time classification and root-cause analysis nearly impossible.

Solution

An unsupervised learning pipeline combining K-means clustering with Latent Semantic Analysis (LSA) to surface meaningful patterns and classify anomalous activity.

Pipeline Architecture

Network Events
    │
    ├─► Numerical Features ─► StandardScaler
    │
    └─► Log Messages ─► TF-IDF ─► LSA (20 components)
                                    │
                                    ▼
                            Feature Concatenation
                                    │
                                    ▼
                              K-Means Clustering
                                    │
                                    ▼
                          Cluster Interpretation
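
The flow in the diagram can be approximated with scikit-learn as follows. This is a sketch on made-up events, not the project's pipeline: the feature columns, the n-gram range, and the range of K values are assumptions, the LSA size is capped because the toy corpus is far too small for 20 components, and K is chosen by silhouette score alone.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Toy events (made up): [port, duration_ms, bytes_transferred] plus a log message.
numeric = np.array([
    [443, 120.0, 8000], [443, 150.5, 8192], [80, 110.0, 7500], [443, 135.0, 9000],
    [22, 5.0, 60], [23, 4.0, 64], [3389, 6.0, 58], [8080, 5.5, 62],
    [443, 60000.0, 5e6], [443, 75000.0, 6e6], [8443, 58000.0, 4.5e6], [443, 90000.0, 7e6],
])
messages = [
    "GET request completed successfully", "GET page served successfully",
    "POST form request completed", "GET static asset served",
    "connection refused probe on port", "port probe connection reset",
    "probe attempt connection refused", "connection reset during port probe",
    "large outbound transfer to external host", "sustained outbound upload external host",
    "large upload transfer outbound", "outbound transfer sustained to external address",
]

scaled = StandardScaler().fit_transform(numeric)
tfidf = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(messages)
n_components = min(20, min(tfidf.shape) - 1)  # 20 in the diagram; capped for toy data
lsa = TruncatedSVD(n_components=n_components, random_state=0).fit_transform(tfidf)
features = np.hstack([scaled, lsa])  # feature concatenation

best_k, best_score, best_labels = None, -1.0, None
for k in range(2, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
    score = silhouette_score(features, labels)
    if score > best_score:
        best_k, best_score, best_labels = k, score, labels

print(f"selected k={best_k} (silhouette {best_score:.2f})")
```

Scaling the numeric columns before concatenation keeps byte counts from dominating the Euclidean distances that K-means relies on.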

Discovered Clusters

Cluster	Characteristics	% of Events
Normal Web	Short duration, moderate bytes	~35%
Normal DB	Very short, internal traffic	~20%
Suspicious Scan	Tiny packets, many ports	~10%
Auth Failure	Error severity, SSH ports	~12%
Data Exfil	Long duration, high bytes	~8%
Maintenance	Off-hours, internal	~15%

Key Features

Automatic K selection: Elbow method + silhouette score
Text analysis: TF-IDF with n-grams + LSA
Anomaly scoring: Distance to cluster centroid
Interpretable results: Human-readable cluster labels
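
Distance-to-centroid scoring reduces to a nearest-centroid lookup. The sketch below uses plain Euclidean distance on two-dimensional toy points; `nearest_centroid` and the example centroids are hypothetical, and any alerting threshold on top of the score is left out because the project does not document one here.

```python
import math


def nearest_centroid(point, centroids):
    """Return (cluster_index, euclidean_distance) for the closest centroid."""
    distances = [math.dist(point, centroid) for centroid in centroids]
    index = min(range(len(centroids)), key=distances.__getitem__)
    return index, distances[index]


centroids = [(0.0, 0.0), (5.0, 5.0)]  # toy centroids in a 2-D feature space
for event in [(0.3, -0.4), (5.0, 9.0)]:
    cluster, score = nearest_centroid(event, centroids)
    print(event, "-> cluster", cluster, f"anomaly score {score:.2f}")
```

The second event is assigned to a cluster but sits far from its centroid, which is the situation a distance-based anomaly score is meant to flag.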

API Endpoints

Health Check

curl http://localhost:8000/api/v1/health

RMA Forecast

curl -X POST http://localhost:8000/api/v1/rma/forecast \
  -H 'Content-Type: application/json' \
  -d @outputs/models/example_rma_request.json

Network Classification

curl -X POST http://localhost:8000/api/v1/network/classify \
  -H 'Content-Type: application/json' \
  -d '{
    "source_ip": "10.0.0.1",
    "dest_ip": "192.168.1.100",
    "port": 443,
    "duration_ms": 150.5,
    "bytes_transferred": 8192,
    "protocol": "TCP",
    "log_message": "GET request completed successfully"
  }'
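
The same request can be prepared from Python with the standard library. In this sketch `build_classify_request` is a hypothetical helper; the endpoint path and payload fields are copied from the curl example above, and the response is left unparsed because its schema is not described here.

```python
import json
from urllib import request

BASE_URL = "http://localhost:8000"


def build_classify_request(event: dict, base_url: str = BASE_URL) -> request.Request:
    """Build (but do not send) a POST request for the classification endpoint."""
    return request.Request(
        url=f"{base_url}/api/v1/network/classify",
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


event = {
    "source_ip": "10.0.0.1",
    "dest_ip": "192.168.1.100",
    "port": 443,
    "duration_ms": 150.5,
    "bytes_transferred": 8192,
    "protocol": "TCP",
    "log_message": "GET request completed successfully",
}
req = build_classify_request(event)
print(req.get_method(), req.full_url)

# With the server running, send it and inspect the raw JSON response:
# with request.urlopen(req, timeout=10) as resp:
#     print(json.load(resp))
```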

Key Technical Decisions

Why GRU over LSTM?

Simpler architecture (2 gates vs 3)
Fewer parameters, faster training
Comparable performance on moderate-length sequences
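
The parameter saving can be checked with simple arithmetic. The sketch below counts weights for one recurrent layer using the textbook formulation with a single bias vector per block; `recurrent_params` is a hypothetical helper and the input size of 64 is an assumption. PyTorch stores two bias vectors per block, so its exact counts are slightly higher, but the 3:4 ratio is unchanged.

```python
def recurrent_params(input_size, hidden_size, blocks):
    """Weights and biases for one recurrent layer made of `blocks` weight blocks."""
    per_block = hidden_size * (input_size + hidden_size) + hidden_size
    return blocks * per_block


# A GRU has 2 gates plus a candidate state (3 blocks);
# an LSTM has 3 gates plus a candidate cell state (4 blocks).
gru = recurrent_params(input_size=64, hidden_size=64, blocks=3)
lstm = recurrent_params(input_size=64, hidden_size=64, blocks=4)
print(gru, lstm, gru / lstm)  # 24768 33024 0.75
```

For the same hidden size, a GRU layer therefore carries about 25% fewer parameters than an LSTM layer.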

Why K-means + LSA?

K-means: Fast, interpretable, scales well
LSA: Captures semantic relationships in text
Combined: Multi-modal feature fusion
No labels needed: Unsupervised discovery

Why PyTorch?

More Pythonic API
Easy debugging with eager execution
Strong community support
Flexible for research and production

Model Versioning

Models are saved with full artifacts:

# RMA GRU Model
outputs/models/rma_gru_v3/
├── model_weights.pt          # PyTorch state dict
├── preprocessor.joblib       # Encoders and scalers
├── config.json               # Model configuration
├── training_history.json     # Loss curves
└── metadata.json             # Version info

# RMA Prophet Model
outputs/models/rma_prophet_v2/
├── prophet_model.json        # Serialized Prophet model
├── config.json               # Model configuration
└── metadata.json             # Version info

# Clustering Model
outputs/models/network_clustering_v1/
├── clusterer.joblib          # K-means + LSA pipeline
├── preprocessor.joblib       # TF-IDF, scalers
├── cluster_interpretations.json
└── metadata.json
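
A layout like this can be produced with a small save routine. The sketch below writes only the two JSON files and uses a temporary directory; `save_artifacts` and the metadata fields are hypothetical, and the project's `mlops/model_registry.py` may store different information. Weights and preprocessors would typically be written alongside, for example with `torch.save(model.state_dict(), ...)` and `joblib.dump(...)`.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def save_artifacts(model_dir: Path, config: dict, version: str) -> Path:
    """Write config.json and metadata.json into a versioned model directory."""
    model_dir.mkdir(parents=True, exist_ok=True)
    (model_dir / "config.json").write_text(json.dumps(config, indent=2))
    metadata = {
        "version": version,  # hypothetical field names
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
    (model_dir / "metadata.json").write_text(json.dumps(metadata, indent=2))
    return model_dir


with tempfile.TemporaryDirectory() as tmp:
    config = {"hidden_size": 64, "num_layers": 2, "dropout": 0.3}
    out = save_artifacts(Path(tmp) / "rma_gru_v3", config, version="v3")
    saved = sorted(path.name for path in out.iterdir())
    reloaded = json.loads((out / "config.json").read_text())
    print(saved)  # ['config.json', 'metadata.json']
```

Keeping the configuration next to the weights means a model can be rebuilt for inference without consulting the training script.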

Requirements

Python 3.9+
PyTorch 2.0+
scikit-learn 1.3+
Prophet
FastAPI 0.100+
See requirements.txt for full list

Hardware

Tested on M2 MacBook Pro
Memory: ~4GB peak

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

# Dev setup
pip install -r requirements-dev.txt
pre-commit install

License

This project is licensed under the MIT License. See LICENSE for details.

Citation

If you use this project in your research, please cite it:

@software{norman2025timeseries,
  author    = {Norman, Justin D},
  title     = {Time Series ML Demonstration},
  version   = {0.1.0},
  url       = {https://github.com/stbiadmin/time},
  license   = {MIT}
}

See CITATION.cff for machine-readable citation metadata.
