Stanford's Foundation Model Transparency Index fell from 58 to 40 in a single year, wiping out two years of gains. The criteria got stricter, but the researchers are unambiguous: this reflects genuine deterioration.
The models reaching human-level performance on PhD-level science benchmarks are the same ones disclosing less about training data, compute, and downstream impact. Companies volunteer capability scores eagerly. Responsible AI benchmarks get blank rows.
Every measurement problem this publication has tracked gets worse when the system being measured becomes less visible. You cannot govern what you cannot inspect, and inspection is retreating precisely as deployment stakes compound.
