Three projects, and the story of how each one actually got built — including the parts that didn't work the first time.
The MIDI/Piano Generation Pipeline
It started with a simple, precise ask: take raw, quantized MIDI — the kind with no rubato, no dynamics, no pedal, dead flat and metronomic — and make it sound like a real pianist played it. That became midi-transformer. The first two generations were quick, disposable experiments; gen-3, nicknamed PerformanceNet, was the one that actually worked — a ~10M-parameter model that learned real expressive timing and dynamics well enough to become the project's baseline for months.
Then came the ambitious part. Gens 4 through 7 tried to push past gen-3 with bigger models (up to ~90M params), bf16 autocast, gradient checkpointing, and a custom "velocity trajectory loss" meant to fix the one thing gen-3 never nailed: velocity correlation. Almost none of it worked. The trajectory loss turned out to be structurally incapable of helping velocity: a clean, useful negative result, but a negative result all the same. Rubato modeling was worse: at one point what looked like expressive timing turned out, on close inspection of the note counts, to be a fully degenerate output, not subtle humanization but the model collapsing. A separate gen-7 attempt, pretraining a GPT-style tokenized model on the 80×-larger ARIA-MIDI corpus, got further, but banked only about 5 hours of pretraining before hitting its time budget at roughly 11% of the planned run. Between the failed experiments and the half-finished pretrain, the project's file tree became, in a phrase used at the time, "a complete mess."
The fix was to stop trying to salvage it in place. Gen-3 was declared the one production model and gens 4–7's checkpoints were deleted outright — deprecated, not hoarded. The promising GPT-style idea got its own clean fork, piano-humanizer, carrying over only the model and runtime code, none of the wreckage. Partway into rebuilding the humanizer, a second realization hit: a lot of the crowdsourced/rough MIDI it needed to run on was itself full of quantization noise, so before humanizing anything, that noise had to go. That spun off a third project, piano-normalizer — an ASAP-trained denoiser using a deliberately constrained "anchored edit-transducer": timing-only add/subtract edits, input pitches always preserved, capped at one edit per bar. The constraint was the point — after watching a free-form model degenerate earlier, a narrow, interpretable edit space was chosen on purpose.
A fourth project, piano-generator, took a different branch entirely: a from-scratch generator, not reusing the normalizer or humanizer, pretrained on ARIA performance data with its own tokenizer. Its first overnight baseline ran in June, and it hit the same ~28M-effective-capacity ceiling the earlier models had — loss bouncing in a tight band no matter how the architecture (RoPE, form mixing) was tweaked, a reminder that some limits are about capacity, not cleverness.
By the time four related projects existed, the housekeeping cost showed up: over 50GB combined, duplicate token caches, and a midi-transformer.git folder bloated to 2.22GB from an early accidental git add -A of the entire dataset. Investigation confirmed it was 236,000 unreachable loose objects, not real history — a git gc --prune=now brought it down to 13MB without touching the actual 7-commit history. The datasets themselves got consolidated into one canonical ~/midi-corpus, with the other three projects reaching it through directory junctions instead of copies — one source of truth, no more silent duplication.
The Weather Model
This one began as infrastructure work before it was a model at all. cdsapi was already installed, a Copernicus CDS account and API key already in hand — the goal was just to get ERA5 climate data flowing. The first surprise came immediately: the API kept returning a 403 required licence error. Each ERA5 dataset — ERA5-Land included — carries its own separate license that has to be accepted once, manually, on that dataset's web page. No amount of retrying the API call fixed it; the fix lived entirely outside the code.
Next came a subtler trap that would have quietly corrupted the data if it had gone unnoticed. Some variables, like 2m temperature, are snapshots: a clean instantaneous reading at whatever hour you ask for. Others (precipitation, solar radiation) are accumulated fields: each value is a total built up over a period ending at the requested hour. In ERA5-Land that period starts at 00 UTC, so a single 12:00 pull doesn't give you "conditions at noon"; it gives you everything that accumulated between midnight and noon, and recovering any single hour means differencing consecutive totals (plain ERA5 hourly data instead stores only the preceding hour's total, the 11:00–12:00 delta). Treating those the same way as the snapshot variables would have baked a silent bug into the whole downstream dataset. Rather than solve that immediately, the call was made to ship the six snapshot-friendly state variables first and explicitly defer precipitation and solar radiation until the accumulation logic could be handled properly: a scope cut made for correctness, not convenience.
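The difference between the two kinds of variable can be made concrete with a small sketch. It assumes the ERA5-Land convention in which an accumulated field is a running total that restarts at 00 UTC; the function name and the sample numbers are illustrative and not taken from the project.

```python
def deaccumulate(running_totals):
    """Turn a running total that restarts each day into per-hour increments.

    running_totals[i] is assumed to be the total accumulated from the start
    of the day through the end of hour i + 1.
    """
    increments = []
    previous = 0.0
    for total in running_totals:
        increments.append(total - previous)
        previous = total
    return increments


# Hypothetical precipitation totals (mm) for the first four hours of a day.
totals = [0.0, 0.5, 0.5, 2.0]
hourly = deaccumulate(totals)
print(hourly)
```

Reading `totals[3]` directly as "rain at hour 4" would report 2.0 mm, while the differenced series shows that only 1.5 mm fell in that hour. A snapshot variable such as 2m temperature needs no such step.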
Performance was its own problem. Serial, one-request-at-a-time pulls across a multi-decade, near-global grid were slow enough to be impractical. Raising concurrency to a pool of 5 requests brought the full run down to roughly 1.5–2 hours — but CDS's queue would occasionally push back with transient hiccups or network drops, so the pipeline was built to keep going through individual failures and print a clear summary of exactly which years failed at the end, exiting non-zero so nothing failed silently. Storage held one more surprise: converting GRIB to NetCDF4 doesn't shrink anything by itself — a deflated float32 NetCDF file can end up the same size as, or bigger than, the packed GRIB it came from. Real savings required also regridding to 0.25° and reducing to single daily (noon) snapshots, not just changing file format. That full ERA5-Land pull — 6 state variables, noon, submitted through the pool-of-5 — finished on June 30th.
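The keep-going-and-summarize behavior can be sketched with the standard library's thread pool. This is a minimal illustration, not the project's downloader: `fake_fetch` is a stand-in for a real per-year download, and the two failing years are simulated.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_pool(years, fetch_year, max_workers=5):
    """Fetch every year concurrently; return the sorted list of years that failed."""
    failed = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_year, year): year for year in years}
        for future in as_completed(futures):
            year = futures[future]
            try:
                future.result()
            except Exception as exc:  # one bad year must not stop the rest
                print(f"{year}: FAILED ({exc})")
                failed.append(year)
    return sorted(failed)


def fake_fetch(year):
    # Stand-in for a real download; pretends two years hit a transient error.
    if year in (1987, 2003):
        raise ConnectionError("simulated network drop")


def main(years):
    """Return a process exit code: 0 if everything downloaded, 1 otherwise."""
    failed = run_pool(years, fake_fetch)
    if failed:
        print(f"{len(failed)} year(s) failed: {failed}")
        return 1
    print("all years downloaded")
    return 0


exit_code = main(range(1980, 2010))
```

In a real script the return value would be passed to `sys.exit` so that a scheduler or shell sees the non-zero status. Collecting failures instead of raising on the first one is what lets a long run finish the years that can succeed.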
With the data pipeline solid, weather-model itself came together as a nanoGPT-style generative forecaster predicting next-day weather from that history. The 3060 Ti's 8GB was the sizing constraint from day one: because each day's worth of data tokenizes down to very little, a 4–8 layer, ~256-dimension model with a full year of context trains comfortably, with room to add gradient checkpointing later if it got wider. Version 1 — six variables, a single location — trained overnight on July 1st. Version 2, adding a pressure variable and expanding from one location to genuine spatial modeling, ran into the float-precision class of bug the MIDI project had already met once before: summing many small values in float32 quietly loses precision once a running total grows large enough, the same family of issue as the earlier timing-drift bug at large absolute timestamps. Recognizing the pattern from having debugged it once already made the second occurrence much faster to catch and correct.
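The float32 failure mode described here is easy to reproduce in isolation. The sketch below is not the project's code: it emulates float32 rounding with the standard library's `struct` module (the helper names `f32` and `running_total_f32` are made up for this illustration) and adds the same small step to a small and to a large running total.

```python
import struct


def f32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def running_total_f32(start, step, count):
    """Add `step` to `start` `count` times, rounding to float32 after each add."""
    total = f32(start)
    for _ in range(count):
        total = f32(total + step)
    return total


small = running_total_f32(0.0, 0.001, 1000)       # total stays small
large = running_total_f32(100000.0, 0.001, 1000)  # total is already large

print(small)             # close to 1.0
print(large - 100000.0)  # how much of the added 1.0 survived
```

Near 100000.0 the gap between adjacent float32 values is 1/128 (about 0.0078), so each 0.001 increment rounds away entirely and the large total never moves. Common remedies are accumulating in float64 or keeping values relative to a nearby origin; which one the project chose is not stated.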
yelp-gpt: A GPT-2 Trained from Scratch on Yelp Reviews
The raw material was already sitting on disk — the Yelp Open Dataset, converted earlier into five Parquet tables (7M reviews, 2M users) chosen specifically because the working set fit comfortably in RAM, making it fast to explore interactively with D-Tale's spreadsheet-like web UI. The new question was more ambitious: could a GPT-2-scale model be trained from scratch, nanoGPT-style, on nothing but that review text?
The hardware reality check came first: an RTX 3060 Ti with 8GB of VRAM, running torch 2.12.1 with CUDA confirmed working. Eight gigabytes is not much room for a 124M-parameter model, so gradient checkpointing was switched on defensively from the start, before any real tuning — the safe assumption being that memory would be tight.
The batch-size tuning that followed overturned that assumption. Naive intuition says a bigger batch trains faster, up to whatever the GPU can physically hold. In practice, batch sizes of 8 and above spilled out of dedicated VRAM into shared system memory, and that spillover made throughput worse, not better, despite technically "fitting." The fastest configuration turned out to be a small batch of 4 combined with gradient checkpointing and gradient accumulation (32 micro-batches averaged into an effective batch of 128), and it took an empirical sweep to find, since the naive bigger-batch assumption pointed the wrong way. A second look afterward revealed that checkpointing itself was barely needed at that batch size: the random-init model was using only 2.92GB out of 8, leaving roughly 5GB of headroom unused while checkpointing quietly cost some of that hard-won speed. That tension, safety margin bought at the price of throughput, got weighed directly rather than assumed away, since going too aggressive risked an out-of-memory crash hours into an unattended overnight run.
Before committing to the real run, an operational loose end had to be tied off: a stray training process from an earlier attempt was still technically alive. It had to be confirmed killed, and the GPU verified as actually free (down to just the memory used by the display itself), before the real job started. That way the notification that the old task had failed could be read correctly as an expected side effect of the kill, not as a symptom of an unrelated crash contaminating the new run.
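Why 32 micro-batches of 4 can stand in for one batch of 128 is easiest to see on a toy problem. The sketch below uses a one-parameter model in plain Python rather than the actual training loop, so everything except the two batch constants is an illustrative assumption.

```python
import random

MICRO_BATCH = 4   # examples per forward/backward pass (from the text)
ACCUM_STEPS = 32  # micro-batches per optimizer step (from the text)


def mse_grad(w, batch):
    """Gradient of mean squared error for the toy model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


random.seed(0)
data = [(random.uniform(-1, 1), random.uniform(-1, 1))
        for _ in range(MICRO_BATCH * ACCUM_STEPS)]
w = 0.3

# Gradient from one pass over the full effective batch of 128 examples.
full_grad = mse_grad(w, data)

# The same gradient built up from 32 micro-batches of 4 examples each.
accumulated = 0.0
for step in range(ACCUM_STEPS):
    micro = data[step * MICRO_BATCH:(step + 1) * MICRO_BATCH]
    accumulated += mse_grad(w, micro) / ACCUM_STEPS

print(full_grad, accumulated)
```

The two values agree up to floating-point rounding, which is why only the micro-batch of 4 ever has to fit in VRAM. In a framework training loop the usual equivalent is to scale each micro-batch loss by `1 / ACCUM_STEPS` before the backward pass and step the optimizer once every 32 passes.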
With the environment sorted, the training options were laid out side by side: a tight small-model quick check, a full from-scratch GPT-2 Small (124M params, 12 layers, 768 dimensions, 512-token context) at roughly 12–20 hours with gradient checkpointing and accumulation, and a warm-started fine-tune of a pretrained GPT-2 as a faster alternative. The from-scratch run was picked deliberately as the "best from-scratch" path rather than the shortcut, over roughly 1.33B tokens — about one full epoch across the 7M reviews — reaching a validation loss of 1.65. Sized up against the other two projects running in parallel around the same time — the MIDI model's 54.9M-token performance corpus and the weather model's ERA5-Land grid — yelp-gpt was, by a wide margin, the largest token budget of the three.
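The 124M figure and the length of the run can be sanity-checked with a little arithmetic. This is a sketch under two assumptions that the text does not state: that the model uses GPT-2's 50,257-token BPE vocabulary with tied input and output embeddings, and that every one of the 128 sequences in an effective batch is a full 512 tokens.

```python
def gpt2_param_count(n_layer, d_model, vocab_size, n_ctx):
    """Parameter count for a GPT-2-style decoder with tied embeddings."""
    # Per block: 12*d^2 weights (attention 4*d^2, MLP 8*d^2),
    # plus 9*d biases and 4*d for the two LayerNorms.
    per_block = 12 * d_model**2 + 13 * d_model
    final_norm = 2 * d_model
    token_emb = vocab_size * d_model
    pos_emb = n_ctx * d_model
    return n_layer * per_block + final_norm + token_emb + pos_emb


VOCAB = 50257  # assumption: GPT-2 BPE vocabulary; the tokenizer is not stated
params = gpt2_param_count(n_layer=12, d_model=768, vocab_size=VOCAB, n_ctx=512)

TOKENS_PER_STEP = 128 * 512  # effective batch x context length
optimizer_steps = 1_330_000_000 // TOKENS_PER_STEP

print(f"{params:,} parameters, about {optimizer_steps:,} optimizer steps")
```

Under those assumptions the count lands just over 124M, slightly below the stock GPT-2 Small because the position table covers 512 positions instead of 1024, and 1.33B tokens works out to roughly twenty thousand optimizer steps.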
midi-transformer: From a Deterministic Regressor to Gen-3
The project's first model, gen-1, was about as simple as a "humanizer" can be: PerformanceRegressor, a deterministic model trained with plain L1 loss. Give it a score, it predicts one specific timing/velocity/pedal value per note. That works, but it can't express that a phrase could reasonably be played several different ways — there's exactly one output per input, no room for interpretation.
Gen-2 tried to fix that by rebuilding the model as PerformanceCVAE — a conditional VAE with a heteroscedastic output head, so the model could both sample different plausible performances from a latent z and report its own uncertainty per note. Half of that worked: the heteroscedastic head produced a genuinely useful per-note confidence signal. The latent side collapsed. val_kl sat pinned at exactly the free-bits floor every epoch, and sampling different z values barely moved the output at all. The cause was structural, not a tuning miss: the score features alone were already predictable enough that the decoder had no incentive to listen to z, and the conditioning path was just one additive bias getting washed out across six transformer layers.
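A KL value pinned at the floor is diagnostic because of how free bits work. The sketch below assumes the common per-dimension variant, which may not be the exact form used in gen-2; the floor value and the KL numbers are hypothetical.

```python
def free_bits_kl(kl_per_dim, free_bits):
    """KL term with a per-dimension floor: a dimension below the floor costs exactly the floor."""
    return sum(max(kl, free_bits) for kl in kl_per_dim)


FLOOR = 0.25  # hypothetical free-bits value, in nats per latent dimension

# Collapsed posterior: every dimension carries about zero information.
collapsed = free_bits_kl([0.0, 0.0, 0.0, 0.0], FLOOR)

# Healthy posterior: some dimensions rise above the floor.
active = free_bits_kl([0.0, 0.6, 0.1, 0.9], FLOOR)

print(collapsed, active)
```

When every dimension sits under the floor, the reported term equals the number of dimensions times the floor and stops responding to training. A value that matches that product epoch after epoch therefore means the latent is carrying nothing, not that it has converged.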
That failure only became visible because of an eval harness built specifically to catch it. Validation NLL alone was actively misleading — it dropped to 0.011 while the model was quietly collapsing. So gen-2's harness added a second measurement: smoothed Pearson correlation between the model's deadpan prediction and the real human performance, averaged over a rolling window of notes. Sweeping that window size exposed the real problem precisely: velocity correlation held steady (~0.62–0.65) at every window size, but rubato correlation fell apart as the window widened — 0.72 raw, down to 0.46 at 8 notes, down to 0.33 at 32. The model was reproducing local, score-driven timing but had no notion of phrase-level rubato, the slow accelerando/ritardando arc a real pianist shapes across a passage.
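The window sweep can be illustrated on synthetic data. The sketch below is one plausible reading of "smoothed Pearson correlation" (smooth both series with a moving average, then correlate); it is not the project's harness, and the two signals are constructed so that the prediction captures note-to-note detail but misses a slow phrase arc.

```python
import math
from statistics import fmean


def rolling_mean(values, window):
    """Moving average; window=1 returns the series unchanged."""
    return [fmean(values[i:i + window]) for i in range(len(values) - window + 1)]


def pearson(a, b):
    mean_a, mean_b = fmean(a), fmean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def smoothed_corr(pred, truth, window):
    return pearson(rolling_mean(pred, window), rolling_mean(truth, window))


N = 400
local = [math.sin(2 * math.pi * i / 5) for i in range(N)]      # note-to-note detail
arc = [0.5 * math.sin(2 * math.pi * i / N) for i in range(N)]  # slow phrase arc
truth = [l + a for l, a in zip(local, arc)]
pred = local                                                   # prediction misses the arc

scores = {w: smoothed_corr(pred, truth, w) for w in (1, 8, 32)}
for window, score in scores.items():
    print(window, round(score, 3))
```

Smoothing averages away the fast component that the prediction gets right and leaves the slow arc it never modeled, so the correlation falls as the window widens. That is the same qualitative signature as the rubato numbers in the text; a series that stays flat across window sizes, like the velocity correlation, indicates the slow structure is being matched too.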
Gen-3, PerformanceNet, was built to attack that specific finding, not just to be a bigger model. The most important fix was almost embarrassingly upstream of the network itself: gen-2's timing target — onset offset from the nearest metric grid point — was capped at ±half a grid step by construction. That cap physically deleted phrase-level rubato from the training data before the model ever saw it; no amount of retraining could have recovered it. Gen-3 re-represented timing as a local-tempo deviation, log(perf_IOI / score_IOI), an unbounded target that round-trips exactly. It replaced the collapsed VAE latent with five observed, supervised style descriptors (rubato magnitude, dynamic range, articulation, pedal amount, tempo) — a conditioning signal that can't collapse because it isn't sampled. And it changed how progress itself got measured: training now selects the best checkpoint by a composite "expression score" — the average of smoothed velocity and rubato correlation — instead of NLL, precisely because NLL had already been shown blind to the thing that mattered.
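The two timing representations can be compared directly. The sketch below assumes strictly increasing onsets (a single melodic line, no chords) and uses made-up onset times; the function names are illustrative and not taken from the codebase.

```python
import math


def encode_log_ioi(score_onsets, perf_onsets):
    """Local-tempo deviation per interval: log(perf_IOI / score_IOI)."""
    return [
        math.log((perf_onsets[i + 1] - perf_onsets[i])
                 / (score_onsets[i + 1] - score_onsets[i]))
        for i in range(len(score_onsets) - 1)
    ]


def decode_log_ioi(score_onsets, deviations, first_onset):
    """Rebuild performed onsets from the deviations and the first onset."""
    onsets = [first_onset]
    for i, dev in enumerate(deviations):
        score_ioi = score_onsets[i + 1] - score_onsets[i]
        onsets.append(onsets[-1] + score_ioi * math.exp(dev))
    return onsets


def grid_offset(onset, step):
    """Offset from the nearest grid point; always within half a step."""
    return onset - round(onset / step) * step


score = [0.0, 0.5, 1.0, 1.5, 2.0]   # quantized onsets (seconds)
perf = [0.0, 0.55, 1.2, 1.95, 2.8]  # hypothetical ritardando

deviations = encode_log_ioi(score, perf)
rebuilt = decode_log_ioi(score, deviations, perf[0])
offsets = [grid_offset(t, 0.5) for t in perf]

print([round(d, 3) for d in deviations])
print([round(o, 3) for o in offsets])
```

The log-ratio deviations grow steadily as the phrase slows and decode back to the performed onsets up to floating-point rounding. The grid offsets wrap around instead: the last note arrives 0.8 s after its score position, yet its offset from the nearest grid point is only about -0.2 s, so the slow-down is invisible in that representation.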
To make that comparison visible rather than just a number in a log, a dedicated eval viewer (eval_ui.py) was built: pick any of 177 held-out test pieces and it plots ground truth against gen-2 and gen-3 side by side. Below is that tool running live on a real held-out piece — the middle panel is gen-3's unbounded rubato representation, tracing an actual phrase arc; the bottom panel is gen-2's bounded representation, visibly flat by construction, unable to show a phrase arc even in principle.
[Figure: live capture from eval_ui.py (localhost:5001), the eval-comparison tool built alongside gen-3. Piece: Alexander Scriabin, Étrangeté, Op. 63 (721 notes); ground truth vs gen-2 vs gen-3. Panels, top to bottom: velocity (0–1); rubato as local tempo, log-ratio (+ slower / − faster), gen-3 + gen-5 vs GT phrase arcs; gen-2 rubato as onset offset vs grid in ms, where GT is bounded to ±½ step (no phrase arc).]