Independent / Applied ML Project

Python | PyTorch Geometric | NetworkX | GeoPandas | SQL | XGBoost | spatial statistics

Graph-based traffic collision risk modeling with leakage-aware validation.

This project modeled collision risk from more than 1.1M collision records joined to road-network topology. Intersections, road segments, and neighborhoods were represented with spatial, temporal, traffic, structural, weather, and historical-risk covariates.

Key Outcomes

Graph structure improved risk ranking even under leakage-aware validation.

Records

1.1M+

collision records with road topology

Ranking

+23.4

hotspot-ranking AUROC points over tabular baselines

Recall

+31%

top-decile recall improvement

Project Breakdown

Problem, method, system, validation, results, reliability, and research value.

Problem

Spatial prediction can leak information through time and geography.

  • Nearby road segments and future collision patterns can make naive validation scores look better than the performance a deployed model would actually achieve.
  • The project needed to separate graph-structure signal from temporal leakage and spatial autocorrelation artifacts.
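To make the spatial autocorrelation concern concrete, here is a minimal sketch of global Moran's I with binary adjacency weights. It is an illustration only, not the project's diagnostic code: the function name `morans_i` and the toy four-segment chain are assumptions introduced here.

```python
def morans_i(values, neighbors):
    """Global Moran's I using binary (0/1) adjacency weights.

    values:    list of floats, one per spatial unit (e.g. collision counts).
    neighbors: dict mapping a unit index to the indices adjacent to it.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    denom = sum(d * d for d in dev)

    weight_sum = 0
    cross = 0.0
    for i, nbrs in neighbors.items():
        for j in nbrs:
            weight_sum += 1              # each directed neighbor pair has weight 1
            cross += dev[i] * dev[j]     # co-deviation of adjacent units

    if denom == 0 or weight_sum == 0:
        raise ValueError("Moran's I is undefined for constant values or an empty graph")
    return (n / weight_sum) * (cross / denom)


# Hypothetical chain of four road segments: 0 - 1 - 2 - 3
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
clustered = morans_i([0, 0, 1, 1], chain)     # similar values sit next to each other
alternating = morans_i([0, 1, 0, 1], chain)   # neighbors always differ
```

Positive values indicate that neighboring units resemble each other, which is the situation in which a random train/test split can reward a model for memorizing nearby labels.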

Method

Collision risk was formulated as heterogeneous spatiotemporal graph prediction.

  • Nodes and edges represented intersections, road segments, and neighborhoods with spatial, temporal, traffic, structural, weather, and historical-risk covariates.
  • Compared message passing against temporal, tabular, kernel, XGBoost, and geospatial baselines.
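As a rough illustration of what message passing adds over a tabular model, the sketch below runs one round of mean-neighbor aggregation in plain Python. It resembles the aggregation step of a GraphSAGE-style layer without learned weights, and it is not the project's PyTorch Geometric model; the function name, node ids, and feature values are hypothetical.

```python
def mean_aggregate(features, edges):
    """One round of mean-neighbor message passing on an undirected graph.

    features: dict mapping node id -> list of floats (all the same length).
    edges:    list of (u, v) pairs.
    Returns a dict mapping node id -> own features followed by the mean of
    its neighbors' features (zeros for isolated nodes).
    """
    neighbors = {node: [] for node in features}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)

    dim = len(next(iter(features.values())))
    out = {}
    for node, feat in features.items():
        nbrs = neighbors[node]
        if nbrs:
            agg = [sum(features[m][k] for m in nbrs) / len(nbrs) for k in range(dim)]
        else:
            agg = [0.0] * dim
        out[node] = list(feat) + agg   # concatenate self and neighborhood summary
    return out


# Hypothetical historical-risk feature on three connected segments plus one isolated one.
feats = {"a": [1.0], "b": [3.0], "c": [5.0], "d": [7.0]}
updated = mean_aggregate(feats, [("a", "b"), ("b", "c")])
```

A tabular baseline sees only the first half of each output vector; the second half is the relational signal that the comparison is designed to isolate.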

System / Stack

The pipeline joined geospatial processing with graph ML.

  • Used Python, PyTorch Geometric, NetworkX, GeoPandas, road-network topology, SQL, XGBoost, spatial statistics, and calibration tooling.
  • Built feature construction, graph assembly, model comparison, calibration, and risk-map outputs.

Validation Methodology

Evaluation used temporal and geographic defenses against leakage.

  • Used temporal cutoffs, geographic buffer zones, held-out corridors, future-information audits, and spatial autocorrelation diagnostics.
  • Compared GCN, GraphSAGE, GATv2, temporal aggregation, XGBoost, kernel, tabular, and non-graph geospatial models.
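The temporal cutoff and geographic buffer can be combined in a single split, sketched below under simplifying assumptions: coordinates are already projected to metres, the held-out area is an axis-aligned box, and records in the buffer ring are dropped instead of reassigned. The record layout and the names `distance_to_box` and `leakage_aware_split` are hypothetical, not taken from the project.

```python
from datetime import date
from math import hypot


def distance_to_box(x, y, box):
    """Euclidean distance from a point to an axis-aligned box (0.0 if inside)."""
    xmin, ymin, xmax, ymax = box
    dx = max(xmin - x, 0.0, x - xmax)
    dy = max(ymin - y, 0.0, y - ymax)
    return hypot(dx, dy)


def leakage_aware_split(records, cutoff, test_box, buffer_m):
    """Split records by time and space.

    Train: strictly before `cutoff` AND farther than `buffer_m` from `test_box`.
    Test:  on or after `cutoff` AND inside `test_box`.
    Everything else (buffer ring, past data inside the box, future data
    outside the box) is discarded so it cannot leak across the split.
    """
    train, test = [], []
    for rec in records:
        dist = distance_to_box(rec["x"], rec["y"], test_box)
        if rec["when"] >= cutoff:
            if dist == 0.0:
                test.append(rec)
        elif dist > buffer_m:
            train.append(rec)
    return train, test


records = [
    {"id": "A", "when": date(2021, 6, 1), "x": 300.0, "y": 50.0},  # past, far away
    {"id": "B", "when": date(2021, 6, 1), "x": 120.0, "y": 50.0},  # past, in buffer
    {"id": "C", "when": date(2022, 3, 1), "x": 50.0, "y": 50.0},   # future, in box
    {"id": "D", "when": date(2022, 3, 1), "x": 300.0, "y": 50.0},  # future, outside
    {"id": "E", "when": date(2021, 6, 1), "x": 50.0, "y": 50.0},   # past, in box
]
train, test = leakage_aware_split(
    records, cutoff=date(2022, 1, 1), test_box=(0.0, 0.0, 100.0, 100.0), buffer_m=50.0
)
```

Discarding the buffer ring costs some training data, but it keeps segments adjacent to the test area from carrying near-duplicate labels into training.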

Results

Topology improved hotspot ranking.

  • Improved hotspot-ranking AUROC by 23.4 points and top-decile recall by 31% over tabular baselines.
  • Connectivity, neighborhood propagation, centrality, temporal exposure windows, and message passing contributed to the lift.
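For readers unfamiliar with the two headline metrics, the sketch below computes AUROC and top-decile recall from scores and binary labels. It assumes top-decile recall means the share of all positive locations that fall within the highest-scored 10%; the page does not spell out the exact definition, and the toy scores and labels are invented for illustration.

```python
def auroc(scores, labels):
    """Probability that a random positive outranks a random negative.

    Ties count as half a win. This pairwise form is O(P * N), which is fine
    for a sketch but not for millions of records.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def top_fraction_recall(scores, labels, fraction=0.1):
    """Share of all positives captured in the top `fraction` of scores."""
    total = sum(labels)
    if total == 0:
        raise ValueError("recall is undefined without positives")
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = sum(labels[i] for i in ranked[:k])
    return hits / total


# Hypothetical risk scores for ten locations, two of which are true hotspots.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
auc = auroc(scores, labels)
decile_recall = top_fraction_recall(scores, labels)
```

AUROC summarizes ranking quality over all locations, while top-decile recall reflects how many true hotspots a fixed inspection budget would reach.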

Failure Modes / Reliability Checks

The model was stress-tested for spatial shortcuts.

  • Added calibrated risk maps, ablations, counterfactual edge removal, conformal-style risk sets, reliability curves, ranking-stability checks, and future-information audits.
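A reliability curve of the kind listed above can be sketched with equal-width probability bins. This is an illustrative implementation, not the project's calibration tooling; the function names, bin count, and toy predictions are assumptions.

```python
def reliability_curve(probs, labels, n_bins=10):
    """Return (mean predicted prob, observed rate, count) per non-empty bin."""
    sums = [0.0] * n_bins
    hits = [0] * n_bins
    counts = [0] * n_bins
    for p, y in zip(probs, labels):
        b = min(int(p * n_bins), n_bins - 1)   # p == 1.0 falls in the last bin
        sums[b] += p
        hits[b] += y
        counts[b] += 1
    return [
        (sums[b] / counts[b], hits[b] / counts[b], counts[b])
        for b in range(n_bins)
        if counts[b]
    ]


def expected_calibration_error(probs, labels, n_bins=10):
    """Count-weighted mean gap between predicted and observed rates."""
    n = len(probs)
    curve = reliability_curve(probs, labels, n_bins)
    return sum(count / n * abs(pred - obs) for pred, obs, count in curve)


# Hypothetical predictions: 25% risk realized once in four, 75% realized three in four.
probs = [0.25, 0.25, 0.25, 0.25, 0.75, 0.75, 0.75, 0.75]
labels = [0, 0, 0, 1, 1, 1, 1, 0]
curve = reliability_curve(probs, labels, n_bins=2)
ece = expected_calibration_error(probs, labels, n_bins=2)
```

A well-calibrated model produces points near the diagonal, where the predicted probability matches the observed rate; large gaps in high-risk bins matter most when the scores feed a risk map.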

Why It Matters for Research

The project connects graph learning to scientific validation discipline.

  • It asks whether relational structure improves risk modeling after the validation protocol removes shortcuts that would not survive deployment.