Understanding Performance Metrics

2025-10-26 • NHLForecasts

At NHLForecasts.com, we're not hiding behind flashy graphics and vague "expert analysis." Everything our model does—and how well it does it—is laid bare on our Performance page. Think of this as our report card, updated constantly, showing you exactly where we shine and where we struggle. Here's how to read it.

Time Windows: Daily, 30-Day, Season-to-Date, Multi-Season

We publish metrics across multiple windows because a model can be stable overall yet drift in the short term. Use the ranges like this:

Daily: Day-by-day results for the last two weeks. It's a quick pulse check that shows streaks, slumps, and whether the model is running hot or cold.
Last 30 Days: A more reliable short-term view that smooths daily variance.
Season-to-Date: How the model is handling the current season as the league evolves.
Multi-Season: The longest view, anchored by the earliest season in our database (explicitly shown on the page).

The Basics: Accuracy Isn't Everything (But It's a Good Start)

Let's start with the simplest question: How often do we get the winner right? Our accuracy metric answers that, typically landing somewhere between 55% and 60%. That might not sound dominant, but in hockey, where a lucky bounce or a hot goalie can flip any game, it is a meaningful edge over the 50% you would get from a coin flip. It also clears the roughly 52.4% win rate a bettor needs to break even at standard -110 odds.

But here's the thing: raw accuracy is like judging a poker player by hands won instead of money earned. A prediction model needs to know when it knows something. And that brings us to the metrics that really matter.

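We also publish the average probability assigned to the actual winner. Think of it as the model’s “odds of picking the winner.” If that average is 60%, then across all games the model gave a 60% chance, on average, to the side that ended up winning. Unlike raw accuracy, this number moves with how confident we were, not just with whether the pick was right.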
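
As a rough illustration of how these two headline numbers differ, here is a minimal Python sketch. The function name, the input format (a home-win probability per game plus a flag for whether the home team won), and the four sample games are all assumptions made up for this example, not our production code.

```python
def accuracy_and_winner_prob(home_win_probs, home_won):
    """Return (accuracy, mean probability assigned to the actual winner)."""
    correct = 0
    prob_on_winner = 0.0
    for p, won in zip(home_win_probs, home_won):
        # The pick is whichever side the model favors.
        picked_home = p >= 0.5
        correct += picked_home == won
        # Probability the model gave to the side that actually won.
        prob_on_winner += p if won else 1.0 - p
    n = len(home_won)
    return correct / n, prob_on_winner / n


# Hypothetical data: model's home-win probability and whether the home team won.
probs = [0.62, 0.55, 0.48, 0.70]
results = [True, False, False, True]

accuracy, avg_winner_prob = accuracy_and_winner_prob(probs, results)
print(f"accuracy={accuracy:.2%}, avg prob on winner={avg_winner_prob:.2%}")
```

On these four made-up games the pick is right three times out of four (75%), while the average probability on the actual winner is only about 57%, because most of the picks were close calls.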

The Brier Score: Punishing Overconfidence

Imagine two scenarios. In one, our model says the Avalanche have a 51% chance to beat the Coyotes, and Arizona wins. In another, it says the Avalanche have a 95% chance, and Arizona still pulls the upset. Both count as one miss in the accuracy column, but that second miss? That's way worse, because we weren't just wrong; we were confidently wrong.

Enter the Brier Score, which measures prediction quality by squaring the difference between our probability and the outcome (1 if the event happened, 0 if it didn't). It's calculated as the average of (predicted probability - actual outcome)². Perfect predictions score 0.0; calling every game a 50/50 coin flip lands at 0.25. Our model typically posts Brier Scores between 0.23 and 0.25, which puts us in respectable territory. Anything under 0.20 would be exceptional and, frankly, suspicious in a sport as chaotic as hockey.

Think of it this way: the Brier Score rewards a weatherman who says "90% chance of rain" when it's clearly going to pour and who honestly admits "it's a coin flip" when it isn't clear. Lower scores mean we're not just right more often; we're right about how right we are.
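
To make the formula concrete, here is a small Python sketch that scores two missed predictions, one at 51% and one at 95%, alongside a coin-flip baseline. The function and variable names are illustrative assumptions; outcomes are coded as 1 when the predicted team wins and 0 when it loses.

```python
def brier_score(probs, outcomes):
    """Average of (predicted probability - actual outcome)^2, outcomes coded 1 or 0."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


# Two misses: the team we favored lost in both cases.
cautious_miss = brier_score([0.51], [0])   # said 51%, team lost
confident_miss = brier_score([0.95], [0])  # said 95%, team lost

# Baseline: calling two games 50/50, one of which goes each way.
coin_flip = brier_score([0.5, 0.5], [1, 0])

print(cautious_miss, confident_miss, coin_flip)
```

The 51% miss costs about 0.26, barely worse than the coin flip's 0.25, while the 95% miss costs about 0.90. Accuracy treats the two misses identically; the Brier Score does not.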

Log Loss: Penalty for Confident Misses

Log loss is another calibration-aware metric. It assigns huge penalties to confident predictions that are wrong. A model that says "90%" and misses gets punished much more than one that says "55%" and misses. Lower is better.
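
For readers who want to see the penalty curve in numbers, here is a hedged sketch of binary log loss using the natural logarithm. The clipping constant `eps` and the function name are assumptions for this example; the Performance page may use a different convention.

```python
import math


def log_loss(probs, outcomes, eps=1e-15):
    """Mean negative log-likelihood for binary outcomes coded 1 or 0."""
    total = 0.0
    for p, o in zip(probs, outcomes):
        # Clip so a prediction of exactly 0 or 1 cannot produce log(0).
        p = min(max(p, eps), 1 - eps)
        total += -(o * math.log(p) + (1 - o) * math.log(1 - p))
    return total / len(probs)


miss_at_90 = log_loss([0.90], [0])  # said 90%, event did not happen
miss_at_55 = log_loss([0.55], [0])  # said 55%, event did not happen

print(round(miss_at_90, 3), round(miss_at_55, 3))
```

The 90% miss scores about 2.30 and the 55% miss about 0.80, so the confident miss is punished nearly three times as hard.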

Total Goals: The Over/Under Question

We also track Over 5.5 calibration. For every game we publish a probability that the total goes over 5.5. We then group those predictions into deciles and check whether “70%” really means 70%. If the over/under line is off, it shows up fast in those calibration tables.

To make the totals section even more intuitive, we also report the average probability we assigned to the actual over/under outcome. It’s the totals equivalent of “odds of picking the winner.”

Hockey bettors live and die by the total, and our RMSE (Root Mean Squared Error) metric shows how close our predicted totals get. An RMSE around 2.0 to 2.5 goals means our predicted total typically misses the actual total by about that much, with large misses weighing more heavily than small ones. Given that NHL games usually see 4 to 8 goals, that's a reasonable margin, though we're always working to tighten it.
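
Here is a minimal sketch of the RMSE calculation on four made-up games. The predicted and actual totals are hypothetical numbers chosen only to show the arithmetic.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual values."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical predicted total goals and the totals that actually occurred.
predicted_totals = [5.8, 6.1, 5.4, 6.3]
actual_totals = [4, 7, 9, 6]

totals_rmse = rmse(predicted_totals, actual_totals)
print(round(totals_rmse, 2))
```

The result here is about 2.07 goals. Most of that comes from the single 9-goal game, which is the squaring at work: one big miss moves RMSE more than several small ones.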

Calibration: Where the Magic Happens

This is the most important concept on the entire Performance page, so stay with us. Calibration asks a deceptively simple question: When we say something will happen 70% of the time, does it actually happen 70% of the time?

We test this by grouping our predictions into ten buckets—games where we predicted 0-10%, 10-20%, all the way up to 90-100%—and checking whether reality matched our confidence level. If we called 100 games at 60% probability and the favorite won exactly 60 of them, that's perfect calibration for that bucket.
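
The bucketing step can be sketched in a few lines of Python. This is an illustration under assumptions, not our actual pipeline: the function name, the tuple layout of each row, and the sample data are all invented for the example.

```python
def calibration_table(probs, outcomes, n_bins=10):
    """Group predictions into equal-width buckets and compare predicted vs. observed.

    Returns one row per non-empty bucket:
    (bucket_low, bucket_high, games, mean_predicted, observed_rate).
    """
    buckets = [[] for _ in range(n_bins)]
    for p, o in zip(probs, outcomes):
        # A probability of exactly 1.0 is placed in the top bucket.
        idx = min(int(p * n_bins), n_bins - 1)
        buckets[idx].append((p, o))

    rows = []
    for i, bucket in enumerate(buckets):
        if not bucket:
            continue
        n = len(bucket)
        mean_predicted = sum(p for p, _ in bucket) / n
        observed_rate = sum(o for _, o in bucket) / n
        rows.append((i / n_bins, (i + 1) / n_bins, n, mean_predicted, observed_rate))
    return rows


# Hypothetical: 100 games called at 62%, of which the pick won 60.
probs = [0.62] * 100
outcomes = [1] * 60 + [0] * 40

table = calibration_table(probs, outcomes)
for low, high, n, predicted, observed in table:
    print(f"{low:.0%}-{high:.0%}: n={n}, predicted={predicted:.3f}, observed={observed:.3f}")
```

In this toy case everything lands in the 60-70% bucket, where the average prediction (0.62) sits two points above the observed win rate (0.60). A well-calibrated model keeps that gap small in every bucket.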

Here's why this matters more than anything else: imagine two models that both correctly pick winners 60% of the time over 100 games. Model A just slaps "60% confidence" on every single game. Model B actually differentiates, going 70% confident on 50 games (and winning about 35 of them) while admitting "yeah, this one's a toss-up" at 50% on the other 50 (and splitting them evenly). Both are 60% accurate and both are well calibrated, but Model B is far more useful: it tells you when to lean in and when to back off.
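
One way to see the gap between two equally accurate models is to score them with the Brier Score. The sketch below uses illustrative numbers of our own choosing for a hypothetical 100-game slate: a flat model that says 60% on every game, and a differentiating model that says 70% on half the games and 50% on the rest. Both are calibrated and both are 60% accurate by construction.

```python
def brier_score(probs, outcomes):
    """Average of (predicted probability - actual outcome)^2, outcomes coded 1 or 0."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


# Flat model: 60% on its pick in all 100 games; the pick wins 60 of them.
flat_probs = [0.60] * 100
flat_outcomes = [1] * 60 + [0] * 40

# Differentiating model: 70% on 50 games (35 wins), 50% on 50 games (25 wins).
diff_probs = [0.70] * 50 + [0.50] * 50
diff_outcomes = [1] * 35 + [0] * 15 + [1] * 25 + [0] * 25

brier_flat = brier_score(flat_probs, flat_outcomes)
brier_diff = brier_score(diff_probs, diff_outcomes)
print(round(brier_flat, 4), round(brier_diff, 4))
```

The flat model scores 0.24 and the differentiating model 0.23. Identical accuracy, identical calibration, but the model that separates strong spots from toss-ups earns the better score.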

Perfect calibration means our probabilities aren't just numbers—they're trustworthy signals you can actually use.

Starter Calibration: How Goalies Affect Reliability

We track calibration separately for games with known starters versus games where we are still predicting the starter. If starter-confirmed games are noticeably better calibrated, that tells us goalie uncertainty is a real source of noise.

Team-by-Team: Finding the Blind Spots

Not all teams are created equal, and our model knows it. The team-specific calibration breaks down how we perform for each franchise when they're playing at home. If we consistently give the Maple Leafs a 65% win probability in home games but they only win 58% of the time, that's a bias we need to address.

Small biases, within 3 percentage points or so, are usually just statistical noise. Anything bigger suggests we're missing something. Maybe it's a mid-season coaching change that rewired the team's identity. Maybe it's injuries piling up faster than our historical data can reflect. Maybe it's something about their playing style that our features don't quite capture yet.

Sample size matters here, too. If we've only predicted 15 home games for a team, wild variance is expected. But 50+ games? That's a pattern worth noting.
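
A quick way to judge whether a gap is big enough to matter is to express it in standard errors. The sketch below is a simplification that assumes every game in the sample carried the same predicted probability and treats outcomes as independent coin flips with that probability; the function name is made up for this example.

```python
import math


def bias_in_standard_errors(predicted_rate, observed_rate, n_games):
    """How many binomial standard errors separate the observed rate from the predicted one."""
    standard_error = math.sqrt(predicted_rate * (1 - predicted_rate) / n_games)
    return (observed_rate - predicted_rate) / standard_error


# Same gap (65% predicted, 58% observed) at two different sample sizes.
gap_at_15_games = bias_in_standard_errors(0.65, 0.58, 15)
gap_at_50_games = bias_in_standard_errors(0.65, 0.58, 50)

print(round(gap_at_15_games, 2), round(gap_at_50_games, 2))
```

Under these assumptions, the 7-point gap is only about 0.6 standard errors at 15 games and about 1.0 at 50. The same gap gets steadily harder to dismiss as the sample grows, which is why we weigh team-level biases by the number of games behind them.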

Venue Effects: Ice Isn't Just Ice

Some arenas have personality. Ball Arena in Denver sits about a mile above sea level, where the thin air can sap visiting teams. Some rinks are slightly larger or smaller than standard. Some buildings have crowds that create genuine home-ice advantage; others might as well be libraries.

Our venue-specific calibration checks whether certain buildings consistently break our predictions. If we're over-predicting home wins at one arena and under-predicting at another, it might be time to dig into what makes that rink unique. Geography, atmosphere, ice conditions—hockey is full of quirks that don't show up in box scores but absolutely matter in real games.

In-Game Checkpoints: Are Live Probabilities Honest?

In-game probabilities change constantly. We track key checkpoints—pregame, end of 1st, end of 2nd, 10:00 left in the 3rd, 5:00 left, and OT start—and ask the same calibration question: when we say 65%, does it really happen 65% of the time?

This is the best way to judge whether the live model is "too jumpy" or appropriately confident as the clock winds down.

xG Holdout Metrics and Splits

For expected goals (xG), we report holdout metrics (ROC AUC, log loss, Brier). These are calculated on a temporal test split to avoid leakage. We also break performance down by shot type and strength state (even strength, power play, penalty kill, empty net) to see where the model is strongest or weakest.
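
For readers unfamiliar with temporal splits, here is a minimal sketch of the idea: everything before a cutoff date is available for training, and evaluation uses only later shots, so the model is never scored on games it could have learned from. The record layout, field names, cutoff date, and xG values are all hypothetical, and the AUC helper is a simple pairwise version rather than whatever our pipeline uses.

```python
from datetime import date


def temporal_split(shots, cutoff):
    """Train on shots before the cutoff date; hold out shots on or after it."""
    train = [s for s in shots if s["date"] < cutoff]
    test = [s for s in shots if s["date"] >= cutoff]
    return train, test


def roc_auc(scores, labels):
    """Chance that a random goal gets a higher score than a random non-goal (ties count half)."""
    goals = [s for s, y in zip(scores, labels) if y == 1]
    misses = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((g > m) + 0.5 * (g == m) for g in goals for m in misses)
    return wins / (len(goals) * len(misses))


# Hypothetical shot records with an already-computed xG value.
shots = [
    {"date": date(2024, 10, 5), "xg": 0.04, "goal": 0},
    {"date": date(2024, 11, 12), "xg": 0.31, "goal": 1},
    {"date": date(2025, 1, 20), "xg": 0.08, "goal": 0},
    {"date": date(2025, 2, 3), "xg": 0.22, "goal": 1},
    {"date": date(2025, 3, 9), "xg": 0.15, "goal": 0},
]

train, test = temporal_split(shots, cutoff=date(2025, 1, 1))
holdout_auc = roc_auc([s["xg"] for s in test], [s["goal"] for s in test])
print(len(train), len(test), holdout_auc)
```

In a real workflow the model would be fit on `train` and then used to score `test`; here the xG values are supplied directly so the example stays short. The same split-then-score pattern applies to log loss and Brier, and can be repeated within each shot type or strength state.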

Playoff Performance

Playoff games are a different environment—higher leverage, shorter series, and more goalie volatility. We track performance on playoff games separately so you can see whether the model generalizes beyond the regular season.

How to Actually Use This Stuff

First, check our overall calibration. If the predicted probabilities line up nicely with observed outcomes across all buckets, you can trust the numbers we're putting out there. Our probabilities aren't just educated guesses—they're grounded in what actually happens on the ice.

Second, scan for team and venue biases. If you're looking at a prediction for a team we tend to overrate, mentally shade that probability down a few percentage points. Not enough to flip your read on the game, but enough to stay honest about the uncertainty.

Third, pay attention to sample sizes. A 10% bias over 75 games? That's real. A 15% bias over 12 games? That's probably just randomness talking.

Finally, watch for trends over time. We update these metrics constantly and retrain our models to stay sharp. If you see calibration starting to drift, we're probably already on it—either tweaking the algorithm or investigating why the league might be shifting in ways our model hasn't caught yet.

The Bottom Line

We built this Performance page because transparency matters. Too many prediction sites hide behind mysterious "proprietary models" and cherry-picked stats. We're showing you everything: where we excel, where we're merely okay, and where we're working to improve.

No model is perfect—especially in hockey, where a hot goalie can turn a 70% favorite into a shutout victim. But what we can promise is honesty about how we're doing, measured against the only opponent that matters: reality. Check back regularly, because unlike some teams in rebuilding mode, we're committed to getting better every season.
