
Alpha Radar - Predicting PumpFun Token Trajectories

Domain: Quantitative Finance & Blockchain Data

Stack: CatBoost · Transformers · TCN · Feature Engineering · Python

Links: Kaggle Leaderboard · Dataset · 2nd Place / 150+ teams


Retrospective
Thought Process

Raw tick sequences can't be fed directly into tabular models without losing temporal structure, but pure sequence models are too fragile for hard recall constraints. So I decoupled the two: use TCN and Transformer architectures strictly as encoders to compress noisy tick data into dense embeddings, then feed those into CatBoost with carefully calibrated thresholds. EDA revealed a physical signature of actual alpha tokens: they consistently hit maximum price velocity right at the end of the 30-second window. That became a targeted engineered feature.

What I Learned

Optimizing for a hard constraint like Jaccard plus a 0.75 recall floor makes you think about risk tolerance instead of accuracy. The TCN/Transformer as encoder and gradient booster as classifier is a solid production pattern that maps directly to HFT signal generation and fraud detection.


Developed a constrained ML pipeline to predict alpha token trajectories on Solana (PumpFun) within a 30-second launch window. Used TCN and Transformer encoders to extract features from noisy tick-level data, feeding a CatBoost classifier trained on 64,208 tokens. Achieved a Jaccard index of 0.21 while satisfying a strict 0.75 recall constraint. Secured 2nd place out of 150+ teams.


§1.  The Domain & The Problem

PumpFun is a Solana protocol where thousands of micro-cap tokens launch daily. Alpha tokens are the rare, profitable ones that survive the initial chaos.

Blockchain transaction data at launch is extremely noisy, dominated by MEV bots, sniper algorithms, and rug-pulls. The competition enforced a hard constraint: 0.75 recall was mandatory. In a heavily imbalanced dataset where 99% of tokens are junk, forcing a model to catch 75% of the rare positive class mathematically destroys standard precision.
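The precision collapse is straightforward arithmetic. A minimal sketch (all numbers below are illustrative assumptions, not competition results): once the threshold is pushed low enough to satisfy the recall floor, even a modest false-positive rate on the huge negative class swamps the true positives.

```python
# Illustrative arithmetic: why a hard recall floor on a ~99% junk
# dataset caps achievable precision. The 1% positive rate and 5% FPR
# are assumed values for illustration, not from the competition data.
def precision_at_recall_floor(n_tokens, pos_rate, recall, fpr):
    """Precision when the threshold is lowered just enough to hit the recall floor."""
    pos = n_tokens * pos_rate
    neg = n_tokens - pos
    tp = recall * pos            # true positives the floor demands
    fp = fpr * neg               # false positives admitted at that threshold
    return tp / (tp + fp)

# 64,208 tokens, ~1% alpha, 75% recall, a modest 5% false-positive rate:
print(round(precision_at_recall_floor(64_208, 0.01, 0.75, 0.05), 3))  # → 0.132
```

Even under these generous assumptions, precision lands near 13%, which is why the pipeline optimizes Jaccard under the constraint rather than chasing accuracy.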


§2.  The Mental Model & Trade-offs

Feeding raw, irregular ticks into a tabular model loses temporal structure. A pure sequence model is too fragile to calibrate for strict asymmetric constraints like a 0.75 recall floor.

Deep Feature Extraction: TCN and Transformer architectures were used strictly as encoders, parsing tick sequences and compressing them into dense embeddings. These were concatenated with token metadata to create a rich tabular dataset.
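The encoder-then-tabular pattern can be sketched as follows. This is a minimal NumPy illustration with random (untrained) filters; the actual pipeline used trained TCN and Transformer encoders, and the metadata fields appended at the end are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def causal_conv1d(x, w, dilation=1):
    """Dilated causal 1-D convolution: the output at t sees only inputs <= t."""
    T, k = len(x), len(w)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])
    return np.array([sum(w[j] * xp[t + pad - j * dilation] for j in range(k))
                     for t in range(T)])

def tcn_encode(ticks, n_filters=8):
    """Compress a variable-length tick series into a fixed dense embedding
    via stacked dilated causal convolutions + global average pooling."""
    emb = []
    for _ in range(n_filters):
        w = rng.standard_normal(3)
        h = np.maximum(causal_conv1d(ticks, w, dilation=1), 0)  # layer 1, ReLU
        h = np.maximum(causal_conv1d(h, w, dilation=2), 0)      # layer 2, wider receptive field
        emb.append(h.mean())                                    # pool over time
    return np.array(emb)

prices = rng.lognormal(size=120)                 # fake 30s of ticks at ~4 Hz
embedding = tcn_encode(np.diff(np.log(prices)))  # log-returns -> dense embedding
row = np.concatenate([embedding, [prices[-1], len(prices)]])  # + token metadata
print(row.shape)  # → (10,)
```

The key property is that the output row has a fixed width regardless of how many ticks the token produced, so it can be concatenated with metadata into a tabular dataset.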

Market Microstructure Insight: Through EDA, a physical signature emerged: stable, rising coins consistently hit their maximum price velocity at the very end of the 30-second window. Features were engineered specifically to capture this late-window momentum.
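One hedged sketch of such a feature (function name, tail length, and the ratio formulation are my assumptions, not the competition code): compare peak price velocity in the final seconds of the window against the peak over the whole window.

```python
import numpy as np

def late_window_momentum(ts, price, window=30.0, tail=5.0):
    """Ratio of max price velocity in the last `tail` seconds to the max
    over the whole window; values near 1.0 mean the token peaked its
    velocity right at the end (the hypothesized alpha signature)."""
    v = np.abs(np.diff(price) / np.diff(ts))   # tick-to-tick velocity
    mids = (ts[1:] + ts[:-1]) / 2              # timestamp of each velocity sample
    late = v[mids >= window - tail]
    if len(late) == 0 or v.max() == 0:
        return 0.0
    return float(late.max() / v.max())

# Synthetic token: flat for 25s, then a sharp ramp in the final 5s.
ts = np.linspace(0.0, 30.0, 61)
price = np.concatenate([np.linspace(1.0, 1.1, 50), np.linspace(1.1, 1.6, 11)])
print(late_window_momentum(ts, price))  # → 1.0
```

A bot-sniped token, by contrast, typically front-loads its velocity in the first seconds, pushing this ratio toward zero.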

Classification: The final dataset (64,208 tokens) was fed into CatBoost with decision thresholds heavily tuned to maximize the Jaccard Index while keeping the 0.75 recall floor intact.
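The threshold tuning step is model-agnostic and can be sketched directly: scan candidate thresholds over the classifier's predicted probabilities, discard any that break the recall floor, and keep the one maximizing Jaccard. (In the pipeline this would run on CatBoost's out-of-fold probabilities; the synthetic data below is illustrative.)

```python
import numpy as np

def tune_threshold(y_true, p, recall_floor=0.75):
    """Pick the decision threshold maximizing Jaccard = TP / (TP + FP + FN)
    subject to recall >= recall_floor."""
    best_t, best_j = 0.0, -1.0
    for t in np.unique(p):
        pred = p >= t
        tp = int(np.sum(pred & (y_true == 1)))
        fp = int(np.sum(pred & (y_true == 0)))
        fn = int(np.sum(~pred & (y_true == 1)))
        if tp / max(tp + fn, 1) < recall_floor:
            continue                      # violates the hard constraint
        j = tp / max(tp + fp + fn, 1)
        if j > best_j:
            best_j, best_t = j, t
    return best_t, best_j

y = np.array([1, 1, 1, 1, 0, 0, 0, 0])
p = np.array([0.9, 0.8, 0.7, 0.2, 0.6, 0.3, 0.2, 0.1])
print(tune_threshold(y, p))  # → (0.7, 0.75)
```

Note the asymmetry: thresholds with higher recall but worse Jaccard (e.g. catching all positives at the cost of many false positives) are considered but lose to the one sitting exactly on the floor.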


§3.  The Architecture

Sequence Ingestion: Tick-level data (price, volume, wallet IDs, timestamps) fed into a TCN/Transformer hybrid encoder.

Feature Engineering: Market microstructure signals targeting late-stage 30-second momentum gradients that differentiate organic alpha buying from bot-driven sniper volume.

Classification Engine: CatBoost handling massive categorical features and dense embeddings, calibrated to tolerate high false-positive rates to satisfy the recall floor.