World Model Evidence
This page collects the scored comparisons we can publish: world-model forecasts against hidden futures and GPT baselines on the same visible context.
Bismarck is the cleanest showcase. One dense PDF becomes 1,922 dated events; the test hides the later history, GPT gets the same pre-branch record, and the page then lets you try sharper alternate-history forks from the same state. The rest of this page keeps the broader scorecard honest across public records, private-shape aggregates, and synthetic stress tests.
Every test set is also split into calm windows and break windows, because an overall average flatters every forecaster. Middle-earth shows that honestly: persistence, the baseline that simply forecasts no change, wins the calm windows there. The interesting score is what happens when momentum fails.
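As a rough illustration of why the split matters, here is a minimal Python sketch of scoring a forecaster and a persistence baseline separately on calm and break windows. Everything in it is an assumption for illustration: the function name `split_scores`, the numeric series, the use of mean absolute error, and the rule that a window counts as a break when the hidden value moves more than `break_threshold` from the last visible value. This page does not say how its windows are actually labeled or scored.

```python
from statistics import mean


def split_scores(last_seen, actual, forecast, break_threshold):
    """Mean absolute error per window type, for a model and for persistence.

    Hypothetical sketch: a window is labeled "break" when the hidden value
    differs from the last visible value by more than break_threshold.
    """
    buckets = {
        "calm": {"model": [], "persistence": []},
        "break": {"model": [], "persistence": []},
    }
    for seen, truth, guess in zip(last_seen, actual, forecast):
        kind = "break" if abs(truth - seen) > break_threshold else "calm"
        # Model error: distance between the forecast and the hidden value.
        buckets[kind]["model"].append(abs(guess - truth))
        # Persistence error: forecast "no change" from the last visible value.
        buckets[kind]["persistence"].append(abs(seen - truth))
    return {
        kind: {name: (mean(errs) if errs else None) for name, errs in group.items()}
        for kind, group in buckets.items()
    }


# Made-up numbers, not results from this page.
scores = split_scores(
    last_seen=[10, 10, 10, 10],
    actual=[10, 11, 20, 2],
    forecast=[11, 10, 18, 5],
    break_threshold=3,
)
```

With these made-up inputs, persistence has the lower error on the calm windows and the higher error on the break windows, which is the pattern a single pooled average would hide.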