The Quant’s Guide to Crushing MLB: Predicting Outcomes with AI and Modern Analytics

Posted April 30, 2026, 10:28 a.m. by Dave

Winning in baseball analytics starts with clear, calibrated probabilities, not hot takes. As a pro sports analyst who builds AI models, I will show you how to turn pregame data into reliable MLB win odds you can trust. We will blend Statcast metrics, weather, travel, and bullpen fatigue, then validate, calibrate, and translate those numbers into actionable prices. This isn’t about guessing which team "wants it more." This is about building a system that extracts value from the noise.

Table Of Contents

Problem framing and target definition
Data pipeline and collection
Feature engineering and modeling
Backtesting, calibration, and deployment
Workflow and tools
Step-by-step build, from zero to daily runs
Practical feature templates
Calibration and reliability in practice
Using third-party data and tools effectively
Common pitfalls and how to fix them
Lightweight examples for pricing and staking
Deliverables you should produce every day
Final notes on process quality
Conclusion
Frequently Asked Questions (FAQs)

Key Takeaways

Price each MLB game with win probabilities, not bold picks. Use only info known before first pitch to avoid leakage.
Build a steady data flow using Statcast, projected lineups, weather, park, and travel data. Fix late scratches fast and version your inputs.
Engineer the right signals, such as starting pitcher K-BB%, xwOBA allowed, batter platoon splits, bullpen fatigue, and catcher framing.
Test the model the right way with rolling windows, track Brier and log loss, and check calibration.
Turn probabilities into fair moneylines, compare them to the closing line, and stake with small fractional Kelly.
ATSwins.ai shows our edge as an AI-powered sports prediction platform with data-driven picks, player props, betting splits, and profit tracking across NFL, NBA, MLB, NHL, and NCAA. Free and paid plans help bettors make smarter choices and stay accountable.

Problem framing and target definition

Define the prediction target

Predict a probability, not a pick. The unit of prediction is a single MLB game’s pregame home-team win probability. Probability beats picks because it lets you price edges against the market. It also enables bankroll sizing and risk control while supporting calibration checks so you can fix drift before it hurts your results.

For labeling, let y = 1 if the home team wins and y = 0 if the home team loses. You should exclude suspensions that resume the next day if pregame prices are reset. Rainouts with no first pitch should be left unlabeled and removed from training entirely. In terms of scope, focus on full-season daily pregame modeling across all MLB games on the slate. Predictions should be finalized 30 to 60 minutes before first pitch to capture lineups and weather but avoid late in-game signals. There is no live in-game modeling in this specific workflow.

Establish clear baselines

Baselines create honest context for your model’s performance and expected edge. The market close baseline is a high bar. Convert the closing American moneyline for each side to an implied probability, remove the vigorish by rescaling the two sides so they sum to 1, and use the result as the baseline probability. The market is notoriously tough to beat after removing the vig.
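As a minimal sketch of that conversion, the snippet below turns a pair of American moneylines into no-vig probabilities. It assumes the simplest de-vig method, proportional rescaling of both sides. The function names and the -130/+110 example prices are illustrative and not taken from any real book.

```python
def american_to_implied(odds: int) -> float:
    """Raw implied probability (vig still included) from an American moneyline."""
    if odds < 0:
        return -odds / (-odds + 100.0)
    return 100.0 / (odds + 100.0)


def no_vig_probs(home_odds: int, away_odds: int) -> tuple:
    """Rescale both sides proportionally so the two probabilities sum to 1."""
    home_raw = american_to_implied(home_odds)
    away_raw = american_to_implied(away_odds)
    total = home_raw + away_raw  # exceeds 1.0 by the book's margin
    return home_raw / total, away_raw / total


# Illustrative closing prices: home -130, away +110.
home_p, away_p = no_vig_probs(-130, 110)
print(round(home_p, 4), round(away_p, 4))
```

Proportional rescaling is only one way to strip the margin. Other methods shift the favorite and underdog differently, so treat the choice as a modeling assumption and keep it fixed across your backtest.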

A naive Elo baseline is another option. Start with a league-average prior and update team ratings by game result with a modest K-factor. Add a simple home field bump and pitcher bump. It is crude but fast and a decent sanity check for your more complex models. Beyond baselines, consider logistic regression for its interpretability or gradient boosting for capturing interactions. Calibrated stacking often provides the best Brier or log loss results, though it has more moving parts and is harder to maintain.
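Here is a rough sketch of the Elo baseline using the standard logistic Elo formula. The K-factor of 4, the 24-point home bump, and the `pitcher_bump` input are placeholder assumptions to tune on your own data, not recommended values.

```python
def elo_expected(home_rating, away_rating, home_bump=24.0, pitcher_bump=0.0):
    """Home win probability from Elo rating points.

    pitcher_bump is the home starter's adjustment minus the away starter's,
    in rating points (a hypothetical input for this sketch).
    """
    diff = home_rating + home_bump + pitcher_bump - away_rating
    return 1.0 / (1.0 + 10 ** (-diff / 400.0))


def elo_update(home_rating, away_rating, home_won, k=4.0, **bumps):
    """Move both ratings by K times the surprise in the result."""
    expected = elo_expected(home_rating, away_rating, **bumps)
    delta = k * ((1.0 if home_won else 0.0) - expected)
    return home_rating + delta, away_rating - delta


home, away = 1500.0, 1500.0  # league-average prior for both clubs
pregame_prob = elo_expected(home, away)
home, away = elo_update(home, away, home_won=True)
print(round(pregame_prob, 3), round(home, 2), round(away, 2))
```

The update is zero-sum, so the league-average rating stays put over a season.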

Avoid leakage at all costs

Only use information known before the first pitch. This includes confirmed lineups, probable pitchers, travel and rest, forecasted weather, and historical performance up to yesterday. No in-game data or end-of-day stats updated after first pitch should ever enter your training set.

Strict timestamping is mandatory. All datasets must carry an as-of timestamp. Features for a game should be created from snapshots with timestamps earlier than the scheduled first pitch. Regarding lineup uncertainty, if you model expected lineups, do not substitute actual after-the-fact starters during training unless you constrain the as-of time to when lineups were publicly known. Lean on primary sources like Statcast and reproducible machine learning practices with transparent evaluation to keep the model honest.
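One way to enforce the as-of rule is to select features only from snapshots stamped before first pitch. This is a minimal sketch. The snapshot layout (a dict with an `as_of` key) and the sample values are hypothetical.

```python
from datetime import datetime, timezone


def latest_snapshot_before(snapshots, first_pitch):
    """Return the newest snapshot whose as_of is strictly before first pitch.

    Returns None when nothing qualifies, so the caller can fall back to a
    prior instead of silently using a post-game value.
    """
    eligible = [s for s in snapshots if s["as_of"] < first_pitch]
    if not eligible:
        return None
    return max(eligible, key=lambda s: s["as_of"])


first_pitch = datetime(2026, 4, 30, 23, 5, tzinfo=timezone.utc)
snapshots = [
    {"as_of": datetime(2026, 4, 30, 14, 0, tzinfo=timezone.utc), "xwoba_allowed": 0.301},
    {"as_of": datetime(2026, 4, 30, 21, 30, tzinfo=timezone.utc), "xwoba_allowed": 0.298},
    # Written after the game ended: must never reach the feature set.
    {"as_of": datetime(2026, 5, 1, 3, 0, tzinfo=timezone.utc), "xwoba_allowed": 0.310},
]
chosen = latest_snapshot_before(snapshots, first_pitch)
print(chosen["xwoba_allowed"])
```

Keeping every timestamp timezone-aware avoids off-by-hours mistakes when games and data feeds sit in different time zones.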

Scope the evaluation target

The primary objective is achieving calibrated probabilities with low Brier score and low log loss. The secondary objective is generating profit when priced against the market using a conservative staking plan. The goal is stable, season-long daily operation with weekly retraining to keep the model fresh.

Data pipeline and collection

Identify authoritative sources

Core feeds you can trust and reproduce include Baseball Savant for pitch-level and Statcast batted-ball detail. You should also look to FanGraphs for advanced player and team metrics plus park factors. Retrosheet is the gold standard for historical event data, game logs, and play-by-play.

For weather, use the National Weather Service or a paid sports weather API with stadium-specific forecasts. Roster and transaction data should come from official MLB logs. Odds should include at least one reputable book’s open and close, plus consensus if available. If you are benchmarking off public picks, keep a daily record of slate odds and outcomes. Platforms like ATSwins MLB results provide a historical view you can use as a cross-check.

Build a reproducible ETL

Aim for a simple, modular flow. Ingest the daily game schedule with probable pitchers and projected lineups. Fetch last-7, last-30, and season-to-date player stats using only data through yesterday. Retrieve bullpen usage from previous days, including pitches thrown and leverage. Bring in park factors and weather forecasts updated on an hourly cadence. Finally, download closing odds for completed games.

Standardize team and player IDs across sources to create a unique game ID. Create keys for batter-pitcher matchups with platoon sides. Version and store your data by keeping raw snapshots in date-stamped folders and writing cleaned datasets with version tags. Always perform schema checks and range checks for temperature or wind speed to ensure data integrity.
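The ID and integrity checks can start very small. The sketch below builds a game key that survives doubleheaders and validates weather fields. The field names, the key format, and the numeric bounds are assumptions for illustration.

```python
RANGES = {"temp_f": (20.0, 120.0), "wind_mph": (0.0, 60.0)}  # assumed plausible bounds
REQUIRED_FIELDS = ("game_id", "temp_f", "wind_mph")


def make_game_id(date_iso, away, home, game_number=1):
    """Build a key that stays unique across doubleheaders."""
    return f"{date_iso}_{away}_{home}_{game_number}"


def validate_row(row):
    """Return a list of problems; an empty list means the row passed."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in row]
    for name, (low, high) in RANGES.items():
        if name not in row:
            continue  # already reported as missing
        value = row[name]
        if not isinstance(value, (int, float)) or not low <= value <= high:
            problems.append(f"{name} out of range: {value!r}")
    return problems


row = {"game_id": make_game_id("2026-04-30", "NYY", "BOS"), "temp_f": 68.0, "wind_mph": 140.0}
print(validate_row(row))
```

Returning a list of problems instead of raising on the first one lets a daily run log every bad field at once.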

Match lineups to probable pitchers

If lineups are not confirmed, build expected lineups from recent usage and platoon rules. Use a simple optimizer to fill by position while prioritizing healthy starters. Once confirmed, update features that depend on the lineup such as aggregate wOBA, ISO, and K rate. Map each starter by player ID. If a late scratch occurs, recalculate starter features and bullpen fatigue impact to re-price the game within minutes.

Handle missingness and late scratches

When a call-up has no MLB stats, backfill with minor-league translations or league-average priors with shrinkage. If weather is unknown, use the stadium default and flag it with a binary feature. For openers, maintain a fallback template that allocates expected innings to the first and second pitchers based on recent usage, then aggregate each pitcher’s features weighted by expected innings share.
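For the opener case, the innings-share aggregation is a weighted average. A minimal sketch, with made-up xwOBA values and expected innings:

```python
def innings_weighted(values_and_innings):
    """Weighted average of a pitcher metric by expected innings share.

    values_and_innings: iterable of (metric_value, expected_innings) pairs.
    """
    pairs = list(values_and_innings)
    total_innings = sum(innings for _, innings in pairs)
    if total_innings <= 0:
        raise ValueError("expected innings must sum to a positive number")
    return sum(value * innings for value, innings in pairs) / total_innings


# Hypothetical opener (1.5 expected innings) followed by a bulk pitcher (4.5).
blended_xwoba = innings_weighted([(0.300, 1.5), (0.330, 4.5)])
print(round(blended_xwoba, 4))
```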

Data timeliness and caching

Cache upstream pulls for 24 hours where possible. Re-pull only high-volatility items like lineups, weather, and odds. Implement a freshness watchdog. If today’s 4 pm ET dataset is stale, the system should alert you and fail gracefully rather than using outdated information.
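A freshness watchdog can be a single check that raises instead of letting a stale input through. The exception class, function name, and the 90-minute limit below are assumptions for this sketch.

```python
from datetime import datetime, timedelta, timezone


class StaleDataError(RuntimeError):
    """Raised when an input snapshot is older than its allowed age."""


def check_freshness(name, as_of, now, max_age):
    """Return the snapshot's age, or raise if it exceeds max_age."""
    age = now - as_of
    if age > max_age:
        raise StaleDataError(f"{name} is {age} old (limit {max_age})")
    return age


now = datetime(2026, 4, 30, 20, 0, tzinfo=timezone.utc)
lineups_as_of = datetime(2026, 4, 30, 19, 30, tzinfo=timezone.utc)
weather_as_of = datetime(2026, 4, 30, 14, 0, tzinfo=timezone.utc)

print(check_freshness("lineups", lineups_as_of, now, timedelta(minutes=90)))
try:
    check_freshness("weather", weather_as_of, now, timedelta(minutes=90))
except StaleDataError as err:
    print(f"ALERT: {err}")  # alert and skip pricing rather than use old data
```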

Feature engineering and modeling

Pitcher-centric features

For starters, focus on recent true talent indicators with regression-to-the-mean. This includes xwOBA allowed, xERA, K-BB%, and hard-hit percentage. Blend the last-30 days with rest-of-season projections. Track pitch mix deltas to see if usage of primary pitches has changed over the last three starts. Separate xwOBA allowed versus right-handed and left-handed batters using hierarchical shrinkage. Include interactions between pitcher fly-ball rate and weather variables.
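Regression to the mean can be sketched as a sample-size-weighted blend of the recent observation and a prior such as a rest-of-season projection. The `prior_weight` of 300 is a placeholder, not an established stabilization point for xwOBA.

```python
def shrink_to_prior(observed, sample_size, prior, prior_weight):
    """Blend an observed rate with a prior, weighting by sample size.

    prior_weight acts like a count of pseudo-observations at the prior value.
    """
    if sample_size < 0 or prior_weight <= 0:
        raise ValueError("sample_size must be >= 0 and prior_weight > 0")
    return (observed * sample_size + prior * prior_weight) / (sample_size + prior_weight)


# Hypothetical starter: .280 xwOBA allowed over his last 100 batters faced,
# against a projection of .320.
blended = shrink_to_prior(observed=0.280, sample_size=100, prior=0.320, prior_weight=300)
print(round(blended, 4))
```

With no recent data the estimate equals the prior, and it moves toward the observed rate as the sample grows.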

For bullpens, create a fatigue index based on rolling 3-day and 5-day pitches thrown per reliever. Factor in talent tiers by calculating the weighted average xFIP for the top three relievers likely to be used in closing leverage situations. Don't forget catchers: add framing runs per 1,000 pitches and passed-ball rates to account for defensive value.
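One possible shape for the fatigue index is a weighted sum of each reliever's 3-day and 5-day pitch counts. The 0.7/0.3 weights and the input layout below are assumptions.

```python
def rolling_pitches(appearances, window_days):
    """Sum pitches over the last window_days; days_ago=1 means yesterday."""
    return sum(pitches for days_ago, pitches in appearances if 1 <= days_ago <= window_days)


def fatigue_index(bullpen, w3=0.7, w5=0.3):
    """Per-reliever fatigue score from 3-day and 5-day pitch totals."""
    return {
        name: w3 * rolling_pitches(apps, 3) + w5 * rolling_pitches(apps, 5)
        for name, apps in bullpen.items()
    }


# Hypothetical usage log: (days_ago, pitches thrown).
bullpen = {
    "closer": [(1, 22), (2, 18)],
    "setup": [(4, 30)],
    "long_man": [],
}
fatigue = fatigue_index(bullpen)
print(fatigue)
```

Only appearances through yesterday enter the score, which keeps the feature consistent with the pregame as-of rule.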

Batter and lineup features

For each lineup slot, compute a split-adjusted projection versus the opposing starter’s handedness. Aggregate this to a lineup-level wOBA and ISO. Look at team K rate and chase rate against the pitcher’s typical zone usage. Apply park factors for home runs and doubles to adjust baseline ISO. Weighted averages of baserunning runs for the active lineup are useful against batteries that allow steals. Team defense, measured in Outs Above Average, can suppress BABIP.
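The lineup aggregation can be sketched as a weighted average of each hitter's projection against the starter's throwing hand. The dictionary keys, the sample projections, and the slot weights (standing in for expected plate appearances) are hypothetical.

```python
def lineup_woba(lineup, pitcher_hand, slot_weights):
    """Weighted lineup wOBA using each batter's split vs the starter's hand.

    lineup: list of dicts with 'woba_vs_L' and 'woba_vs_R', in batting order.
    slot_weights: one weight per slot, for example expected plate appearances.
    """
    if pitcher_hand not in ("L", "R"):
        raise ValueError("pitcher_hand must be 'L' or 'R'")
    if len(slot_weights) < len(lineup):
        raise ValueError("need one weight per lineup slot")
    key = "woba_vs_L" if pitcher_hand == "L" else "woba_vs_R"
    weights = slot_weights[: len(lineup)]
    return sum(b[key] * w for b, w in zip(lineup, weights)) / sum(weights)


# Three-batter toy lineup; a real one would carry nine slots.
lineup = [
    {"woba_vs_L": 0.360, "woba_vs_R": 0.340},
    {"woba_vs_L": 0.310, "woba_vs_R": 0.350},
    {"woba_vs_L": 0.290, "woba_vs_R": 0.300},
]
vs_right = lineup_woba(lineup, "R", slot_weights=[1.0, 1.0, 1.0])
vs_left = lineup_woba(lineup, "L", slot_weights=[1.0, 1.0, 1.0])
print(round(vs_right, 4), round(vs_left, 4))
```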

Park, schedule, and environment

Include park-specific factors for runs, home runs, and even foul territory, which affects pop-up outs. Weather is huge. Include temperature, humidity, and wind speed binned into categories. Distance traveled since the prior game and timezone hops should also be factored in. Penalize teams on their fourth game in three cities. If the umpire crew is pre-announced, include a zone size proxy.
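The travel and wind features can be sketched with a great-circle distance and a simple binning rule. The wind thresholds, bin labels, and city coordinates below are rough, illustrative assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def wind_bin(speed_mph):
    """Bucket wind speed into coarse categories (thresholds are assumptions)."""
    if speed_mph < 5:
        return "calm"
    if speed_mph < 12:
        return "moderate"
    return "strong"


def timezone_hops(prev_utc_offset, next_utc_offset):
    """Number of time zones crossed between consecutive game cities."""
    return abs(next_utc_offset - prev_utc_offset)


# Approximate city-center coordinates for New York and Los Angeles.
trip_km = haversine_km(40.71, -74.01, 34.05, -118.24)
print(round(trip_km), wind_bin(14.0), timezone_hops(-4, -7))
```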

Modeling choices and configurations

Start simple with logistic regression to establish a baseline. Add key interactions like pitcher ground-ball percentage versus wind direction. Calibrate this with Platt scaling. If you move to XGBoost or LightGBM, use monotonic constraints so that relationships remain logical: for example, a higher starter K-BB% should never lower that team’s predicted win probability. scikit-learn covers the logistic regression and calibration steps.

Time-aware cross-validation

Use rolling-origin evaluation by splitting the season into monthly blocks. Train on March and April, then validate on May; next, train on March through May and validate on June, and so on. Never peek ahead. For each block, freeze all transforms on the training window before applying them to the validation window. Avoid target encoding leakage by computing encodings using only the training fold with smoothing.
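A minimal sketch of rolling-origin splits with a frozen transform follows. It assumes an expanding training window and uses a plain standardizer as the example transform. The helper names and temperature values are illustrative.

```python
from statistics import mean, pstdev


def rolling_origin_splits(blocks, min_train=2):
    """Yield (train_blocks, validation_block) with an expanding train window."""
    for i in range(min_train, len(blocks)):
        yield blocks[:i], blocks[i]


def fit_scaler(train_values):
    """Learn mean and standard deviation from the training window only."""
    sigma = pstdev(train_values)
    return mean(train_values), sigma if sigma > 0 else 1.0


def apply_scaler(values, scaler):
    """Apply the frozen training-window parameters to any window."""
    mu, sigma = scaler
    return [(v - mu) / sigma for v in values]


months = ["Mar", "Apr", "May", "Jun", "Jul"]
for train, valid in rolling_origin_splits(months):
    print(train, "->", valid)

# The scaler never sees validation data: fit on train, then reuse as-is.
scaler = fit_scaler([60.0, 70.0, 80.0])   # training-window game temperatures
print(apply_scaler([90.0], scaler))        # validation-window temperature
```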

Backtesting, calibration, and deployment

Choose evaluation metrics with intent

Brier score is your main metric for probabilistic accuracy. Log loss is useful for punishing overconfident wrong calls and for model selection. ROC-AUC is informative but secondary: it measures only how well the model ranks games, so it can look fine even when calibration is way off, and most MLB games sit in a narrow probability band where getting the level right matters more than the ordering.
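Both primary metrics are short enough to write by hand, which helps when checking a library's output. A sketch with toy predictions (the clipping constant `eps` is an assumption to avoid taking the log of zero):

```python
import math


def brier_score(probs, outcomes):
    """Mean squared error between predicted probability and the 0/1 result."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def log_loss(probs, outcomes, eps=1e-15):
    """Mean negative log-likelihood; probabilities are clipped away from 0 and 1."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


probs = [0.58, 0.47, 0.62, 0.51]
outcomes = [1, 0, 0, 1]
print(round(brier_score(probs, outcomes), 4), round(log_loss(probs, outcomes), 4))
```

A coin-flip forecast of 0.5 on every game scores 0.25 on Brier and about 0.693 on log loss, which is a handy floor for sanity checks.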

Make calibration your priority

Compare predicted win rates versus actual outcomes in decile bins. The closer the reliability curve is to the diagonal, the better. Post-fit calibration using isotonic regression on validation folds often yields excellent results. Refit these weekly as player features drift throughout the long season.
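The reliability check can be sketched as a binned table plus expected calibration error (ECE). This version assumes ten equal-width probability bins. Quantile-based deciles are an equally reasonable reading of "decile bins".

```python
def reliability_table(probs, outcomes, n_bins=10):
    """Return (bin_index, count, mean_prediction, win_rate) for non-empty bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bin
        bins[idx].append((p, y))
    table = []
    for idx, items in enumerate(bins):
        if not items:
            continue
        mean_pred = sum(p for p, _ in items) / len(items)
        win_rate = sum(y for _, y in items) / len(items)
        table.append((idx, len(items), mean_pred, win_rate))
    return table


def expected_calibration_error(table):
    """Count-weighted average gap between mean prediction and win rate."""
    total = sum(count for _, count, _, _ in table)
    return sum(count / total * abs(mp - wr) for _, count, mp, wr in table)


probs = [0.55, 0.55, 0.65, 0.65]
outcomes = [1, 0, 1, 1]
table = reliability_table(probs, outcomes)
for row in table:
    print(row)
print(round(expected_calibration_error(table), 3))
```

Plotting mean prediction against win rate for each row gives the reliability curve. Bins with few games are noisy, so read them with their counts in view.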

Compare against the market close

Convert market American odds to implied probabilities and remove the dual-sided vigorish. Edge estimation is simply your probability minus the market probability. Track how these edges translate into realized returns using a simulated staking plan. If you never beat the close, re-check for stale features or slow inputs; if a backtest shows an edge that disappears in live results, suspect leakage.

Manage staking with fractional Kelly

Convert the market price to decimal odds d and let b = d - 1. With model probability p, the full Kelly fraction is f = (b × p - (1 - p)) / b. Use fractional Kelly, perhaps 25% to 50% of f, to reduce volatility, so your stake equals your bankroll multiplied by that scaled fraction. Cap per-bet exposure at 1% or 2% of your bankroll to protect against model shocks or unexpected roster moves.
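A minimal sketch of capped fractional Kelly staking, using the standard Kelly formula for a single binary bet. The function names and default parameters are illustrative.

```python
def kelly_fraction(prob, decimal_odds):
    """Full Kelly fraction of bankroll; zero when there is no positive edge."""
    b = decimal_odds - 1.0  # net profit per unit staked
    if b <= 0:
        raise ValueError("decimal_odds must be greater than 1")
    return max(0.0, (b * prob - (1.0 - prob)) / b)


def stake(bankroll, prob, decimal_odds, kelly_multiplier=0.25, cap=0.02):
    """Bankroll times the scaled Kelly fraction, limited by a per-bet cap."""
    fraction = min(kelly_multiplier * kelly_fraction(prob, decimal_odds), cap)
    return bankroll * fraction


full = kelly_fraction(0.58, 1.87)
print(round(full, 4))
print(round(stake(1000.0, 0.58, 1.87, kelly_multiplier=0.25, cap=0.015), 2))
```

The cap binds whenever the scaled Kelly fraction exceeds it, so a single mispriced game cannot dominate the bankroll.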

Workflow and tools

Daily operating checklist

Two hours before first pitch, pull the schedule and validate IDs. Fetch confirmed lineups or build expected ones. Update weather and park conditions. Compute all pitcher and lineup features with time-stamped datasets. Score the slate, convert to fair odds, and compute edges. Run your final sanity checks for injuries and weather before publishing.

Live monitoring and alerts

Set up dashboards to track Brier by day versus a 14-day rolling baseline. Monitor trends in expected calibration error (ECE) alongside data freshness. If you prefer a ready-made interface for tracking results and splits while you iterate on the model, bookmark ATSwins and its MLB sections for a complementary view and quick benchmarking.

Step-by-step build, from zero to daily runs

First, set up the training dataset using the last two full seasons plus the current season. Create labels from game results and freeze the data as-of time. Second, engineer your first-wave features including xwOBA, K-BB%, and lineup-level wOBA. Third, train a transparent baseline model like logistic regression. Fourth, add nonlinearity with XGBoost, using early stopping to prevent overfitting. Fifth, stack your models and calibrate the final output. Sixth, price your games and simulate betting with fractional Kelly. Seventh, validate against the market close. Finally, deploy the system and monitor for drift.

Practical feature templates

A pitcher feature block should include the starter ID, recent xwOBA allowed over 30 days, K-BB percentage, and pitch mix deltas. Baseball Savant and FanGraphs, the core feeds named earlier, can fill these fields. For lineups, track weighted wOBA versus the specific handedness of the pitcher and team defensive runs saved. The context block must include park factors and weather data like temperature and wind speed.

Calibration and reliability in practice

Bin your predictions into deciles and plot the mean prediction versus the actual outcome. If the curve sits above the diagonal, the model is under-predicting: teams in those bins win more often than it says. If it sits below, the model is over-predicting. Adjust either case with isotonic regression. Beware of season phase shifts as the ball carries differently in April than in July. Respect out-of-sample discipline and never refit your hyperparameters on the test month.

Common pitfalls and how to fix them

The biggest killer is leakage via end-of-day stats. Fix this by delaying all stats by one day. Overconfidence after a hot week should be handled by re-checking calibration. If you are ignoring bullpen context, incorporate the last three days of usage. For late scratches, ensure you have a quick-recalc path that refreshes only the affected features and reprices the game instantly.

Lightweight examples for pricing and staking

If your model outputs a 0.58 probability for a home win, the fair American odds are approximately -138. If the market is offering -115, you have a clear edge. Using fractional Kelly, if the market decimal odds are 1.87 and your probability is 0.58, your Kelly fraction is roughly 0.097. At 25% Kelly, you would stake about 2.4% of your bankroll, though you may want to cap this at 1.5% for safety.
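The arithmetic above can be reproduced with two small converters. This is a sketch with illustrative function names. The 0.58 probability and the -115 market price are the values from the example.

```python
def prob_to_american(prob):
    """Fair (no-margin) American moneyline for a win probability."""
    if not 0.0 < prob < 1.0:
        raise ValueError("prob must be strictly between 0 and 1")
    if prob >= 0.5:
        return -100.0 * prob / (1.0 - prob)
    return 100.0 * (1.0 - prob) / prob


def american_to_decimal(odds):
    """Decimal odds (total return per unit staked) from an American moneyline."""
    if odds < 0:
        return 1.0 + 100.0 / -odds
    return 1.0 + odds / 100.0


p = 0.58
fair_ml = prob_to_american(p)
market_decimal = american_to_decimal(-115)
b = market_decimal - 1.0
full_kelly = (b * p - (1.0 - p)) / b
quarter_kelly = 0.25 * full_kelly
print(round(fair_ml), round(market_decimal, 2), round(full_kelly, 3), round(quarter_kelly, 3))
```

The fair price is shorter than the market's -115, which is where the edge comes from. The quarter-Kelly figure is the uncapped stake as a share of bankroll.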

Deliverables you should produce every day

Every day you should generate a CSV of probabilities per game including the game ID, fair moneylines, and recommended stakes. Produce a JSON for your dashboard to track Brier scores and log loss. Finally, write a markdown summary explaining the top three edges and any overrides due to late scratches or weather alerts.
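The daily CSV can come straight from the standard library. The column names and the sample row below are assumptions for illustration. Swap the in-memory buffer for a dated file in a real run.

```python
import csv
import io

FIELDS = ["game_id", "home_win_prob", "fair_home_ml", "fair_away_ml", "stake_pct"]


def write_slate_csv(rows, handle):
    """Write one row per game with a fixed, versionable column order."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


rows = [
    {"game_id": "2026-04-30_NYY_BOS_1", "home_win_prob": 0.58,
     "fair_home_ml": -138, "fair_away_ml": 138, "stake_pct": 1.5},
]
buffer = io.StringIO()  # in a real run: open(path, "w", newline="")
write_slate_csv(rows, buffer)
print(buffer.getvalue())
```

`DictWriter` raises on unexpected keys by default, which works as a cheap schema check on the output side.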

Final notes on process quality

Prefer primary data over scraped summaries. If you are in doubt, re-derive the metric yourself. Keep everything reproducible by versioning your data, code, and parameters. Always post your predictions before the first pitch and never edit them after the fact. Reviewing the latest MLB standings or news can help provide a sanity check for your model's outputs.

Conclusion

Building an AI model for MLB isn't a "set it and forget it" project. It requires constant data hygiene, rigorous calibration, and an obsession with avoiding leakage. By focusing on probabilities rather than winners, you position yourself to find value where the general public finds noise. Stick to the process, manage your bankroll, and let the data do the heavy lifting.

Frequently Asked Questions (FAQs)

How do I handle doubleheaders in my AI model?

Treat each game of a doubleheader as a separate entry in your data pipeline. However, pay extra attention to bullpen fatigue and lineup changes between Game 1 and Game 2. Bullpens are often stretched thin, and catchers rarely start both games. You should also check for any league-specific rule changes for doubleheaders that might affect the run environment.

What is the best way to account for the MLB trade deadline?

The trade deadline creates a massive shift in team talent that historical data might not immediately capture. When a star player is traded, your lineup-level features should update immediately based on the new roster. You might also add a "roster churn" flag to down-weight historical team-level stats in favor of individual player projections for the first week after the deadline.

Can I use this model for player props instead of game outcomes?

Yes, but you will need to change the target variable. Instead of a binary win/loss outcome, you would predict count outcomes like strikeouts, total bases, or earned runs, usually as a distribution so you can price over/under lines. The underlying features like pitcher xwOBA and batter platoon splits remain highly relevant, but you’ll need to calibrate for the specific prop market you are targeting. Monitoring player injury reports is even more critical for props.

How much historical data do I actually need?

While you can start with a single season, two to three years of historical data is generally the sweet spot. This allows the model to see enough variety in weather, park effects, and player matchups to find meaningful patterns. Going back further than three or four years can be risky because the "meta" of the game changes, such as shifts in league-wide strikeout rates or the introduction of the pitch clock.

What should I do if my model disagrees wildly with the market?

First, check for data errors. Did a pitcher get scratched? Is there a massive storm in the forecast that your model missed? If the data is clean, a large disagreement often means you've found a high-value edge or the market knows something your model doesn't. Start by taking a smaller stake on these "outlier" games until you can verify if your model has a blind spot or a genuine breakthrough. High-authority MLB news sites can often reveal the context you might be missing.