Creating an Unbiased AI Agent to Select the Top 68 NCAA Basketball Teams for the Big Dance

Every March, college basketball fans anxiously await the announcement of the 68 teams selected for the NCAA Tournament. Having played basketball in high school and college, and having been a big fan ever since, I find myself every year just as captivated, and equally frustrated, by the controversy surrounding who gets in and who gets left out. While rooted in tradition, the existing process often suffers from perceived bias, inconsistent criteria, and a lack of transparency.

That’s why I’ve decided to take a different approach. I want to build a fully auditable, explainable AI Agent to replicate and improve the NCAA’s team selection process. The goal isn’t just automation but fairness, repeatability, and trust.

Technically, my first step would be to architect a pipeline that pulls together disparate datasets using a data federation approach via a platform like Databricks. I would ingest historical tournament data, NET rankings, advanced stats (from sources like KenPom), win/loss records, conference strength ratings, player injuries, and betting market insights. The data would be normalized and stored in Delta tables with version control to allow backtesting.

Next, I’d create a feature store to engineer dozens of features per team, including win shares, adjusted offensive and defensive efficiency, performance vs. top-quadrant teams, and momentum metrics (e.g., last 10 games). Each feature would be statistically profiled and normalized across seasons to ensure comparability.
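As a minimal sketch of the momentum piece, assuming a hypothetical game log with team, game_date, won, and margin columns, the rolling features could be computed with pandas before being written to a Delta-backed feature table:

```python
import pandas as pd

# Hypothetical game log with one row per team-game: team, game_date, won (0/1), margin
games = pd.read_parquet("team_game_log.parquet")
games = games.sort_values(["team", "game_date"])

grouped = games.groupby("team")
# Momentum over the most recent 10 games (require at least 3 games before emitting a value)
games["win_pct_last10"] = grouped["won"].transform(
    lambda s: s.rolling(10, min_periods=3).mean()
)
games["avg_margin_last10"] = grouped["margin"].transform(
    lambda s: s.rolling(10, min_periods=3).mean()
)

# Keep each team's latest row as its current-season momentum snapshot
team_features = games.groupby("team").tail(1)[
    ["team", "win_pct_last10", "avg_margin_last10"]
]
```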

Then, I would train an ensemble of machine learning models (e.g., XGBoost, Random Forests, and Logistic Regression for baseline interpretability), using supervised learning where the target is binary: "selected" or "not selected" based on prior committee decisions. To reduce overfitting and leakage, I’d use seasonal, time-based validation, meaning models would be trained on past seasons and tested on held-out future ones.
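A minimal sketch of that seasonal holdout, assuming a hypothetical team_seasons DataFrame with a season column, a binary selected label, and a few illustrative feature columns:

```python
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

FEATURES = ["net_rank", "adj_off_eff", "adj_def_eff", "q1_wins", "sos_pctl"]  # illustrative

# Train on older seasons, evaluate on held-out recent seasons to avoid leakage
train = team_seasons[team_seasons["season"] <= 2021]
test = team_seasons[team_seasons["season"].isin([2022, 2023])]

model = XGBClassifier(n_estimators=400, max_depth=4, learning_rate=0.05,
                      eval_metric="logloss")
model.fit(train[FEATURES], train["selected"])

probs = model.predict_proba(test[FEATURES])[:, 1]
print("Holdout AUC:", roc_auc_score(test["selected"], probs))
```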

Most importantly, I would integrate fairness constraints and use tools like Fairlearn or AIF360 to measure bias across conference types (Power Five vs. mid-major), regions, and school sizes. I’d use SHAP values to provide local explainability for each selection, enabling users to audit why Team A ranked higher than Team B.
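For the bias-measurement piece, here is a minimal Fairlearn sketch, assuming hypothetical arrays y_true (committee decisions), y_pred (model picks), and conf_tier (e.g., "power5" vs. "mid_major") used as the sensitive feature:

```python
from fairlearn.metrics import MetricFrame, selection_rate
from sklearn.metrics import precision_score, recall_score

# Disaggregate selection behavior by conference tier
audit = MetricFrame(
    metrics={"recall": recall_score,
             "precision": precision_score,
             "selection_rate": selection_rate},
    y_true=y_true,
    y_pred=y_pred,
    sensitive_features=conf_tier,
)
print(audit.by_group)      # per-group recall, precision, and selection rate
print(audit.difference())  # largest gap between groups, per metric
```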

All this would be wrapped in a user interface built for both fans and analysts, with real-time sliders to explore "what-if" scenarios. This is not just a model—it’s a decision support system built on responsible AI principles.

In this article, I’ll walk through each phase of designing and developing an unbiased AI Agent for NCAA selection. This system would be grounded in the spirit of the Selection Committee but enhanced by transparency, data integrity, and ethical machine learning.

Goal: Replicate and Enhance the Committee’s Process

The NCAA Selection Committee uses a multifaceted approach to determine the 68 teams that enter the tournament each year. Their evaluation includes:

  • NET Rankings: A proprietary algorithm that factors in game results, strength of schedule, game location, scoring margin (capped at 10 points per game), and net offensive and defensive efficiency.
  • Strength of Schedule (SoS): A calculated measure that considers the difficulty of a team's opponents, with additional weighting for out-of-conference schedules and performance against top-tier teams.
  • Quality Wins (Q1/Q2): Wins against teams ranked in Quadrants 1 and 2 based on game location and opponent strength. These metrics provide context for performance against high-quality competition.
  • Win-loss records: A holistic view of consistency and resilience, accounting for overall record, recent performance, and conference play.
  • Head-to-head matchups and common opponents: Comparative data that helps separate bubble teams with similar profiles.
  • Player injuries and roster availability: Considering team performance with and without key players, especially during key losses or wins.

To emulate this logic with fidelity, I would deconstruct each element into quantifiable variables and encode them into features within the AI Agent’s model. For example, I would build temporal models to track evolving NET and SoS ratings throughout the season, ensuring that each team's data is anchored in the timing of key wins and losses.

Quality wins would be encoded using dynamic quadrant tracking systems, recalculating quadrant status as rankings change. Head-to-head and common opponents would be analyzed using graph-based embeddings, allowing the model to detect community structures and indirect competitive strength.
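A minimal sketch of that dynamic quadrant logic, using the published NET-based cutoffs and re-run nightly as rankings move:

```python
def quadrant(opp_net_rank: int, site: str) -> int:
    """NCAA quadrant for a single game, given the opponent's current NET rank
    and the game site ("home", "neutral", or "away").

    Cutoffs follow the published quadrant definitions (Q1: home 1-30,
    neutral 1-50, away 1-75, and so on); recomputing this as NET changes
    gives the dynamic quadrant tracking described above.
    """
    site_idx = {"home": 0, "neutral": 1, "away": 2}[site]
    upper_bounds = {
        1: (30, 50, 75),
        2: (75, 100, 135),
        3: (160, 200, 240),
    }
    for q, bounds in upper_bounds.items():
        if opp_net_rank <= bounds[site_idx]:
            return q
    return 4

# Example: a road win over the NET #60 team counts as a Quadrant 1 result
assert quadrant(60, "away") == 1
```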

For injuries, I would merge team rosters with player impact ratings and game-level box score data, then derive adjusted team efficiency deltas based on the presence or absence of starters.

Finally, the model would be evaluated on predictive accuracy and fairness metrics—ensuring?institutional neutrality?by explicitly measuring and adjusting for bias against non-Power Five schools, small conferences, and underdog profiles. The result: a robust, auditable AI system that treats every program equally, using performance and context—not legacy reputation—as the basis for selection.

Define and Mitigate Bias in Tournament Selection

A core motivation behind building this AI Agent is to address the institutional and algorithmic bias that undermines the NCAA selection process. Bias erodes trust and systematically disadvantages teams that perform well on the court but are excluded due to subjective or legacy-based reasoning.

Bias manifests across three critical dimensions in AI systems:

  • Data bias: The NCAA selection history disproportionately favors Power Five conference teams, creating a skewed distribution in training data. To counteract this, I would implement sampling correction strategies such as propensity score matching and stratified sampling, ensuring balanced data representation across conference tiers. Synthetic oversampling (e.g., SMOTE) could also boost the visibility of underrepresented mid-major team profiles in the training set (a minimal reweighting sketch follows this list).
  • Model bias: Historical decisions embed human biases into the AI’s learned behavior. I would introduce fairness-aware modeling techniques to mitigate this, including adversarial de-biasing, fairness constraints in the optimization objective, and feature decorrelation. For example, conference affiliation and historical tournament appearance could be explicitly excluded from the model or used only as protected attributes for bias measurement.
  • Interpretation bias: Users may over-rely on simplistic narratives like “momentum” or conference prestige even in interpretable systems. I’d combat this by building model interpretability into the user experience. SHAP explanations would surface counterintuitive results and highlight instances where underdog teams have statistically superior credentials.
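As one concrete option for the data-bias point above, here is a minimal reweighting sketch, assuming a hypothetical training frame df with conf_tier and selected columns plus a FEATURES list; SMOTE from imbalanced-learn would be a drop-in alternative if oversampling were preferred over reweighting:

```python
from sklearn.utils.class_weight import compute_sample_weight
from xgboost import XGBClassifier

# Stratify weights jointly on conference tier and selection outcome so that
# historically rare combinations (e.g., selected mid-majors) are not drowned out
strata = df["conf_tier"].astype(str) + "_" + df["selected"].astype(str)
weights = compute_sample_weight(class_weight="balanced", y=strata)

model = XGBClassifier(n_estimators=300, max_depth=4, eval_metric="logloss")
model.fit(df[FEATURES], df["selected"], sample_weight=weights)
```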

These concerns are not theoretical. In 2025, despite weaker performance metrics, the University of North Carolina was controversially selected over multiple mid-major programs. The committee member representing UNC had a conflict of interest and was later reported to have received a bonus tied to the team's inclusion. This situation highlights why a transparent, auditable AI system is not just a technical improvement—it’s an ethical imperative.

To promote fairness and accountability, I embed the following safeguards into the AI Agent:

  • Balanced representation: I use reweighting algorithms based on inverse frequency and class-conditional likelihood to equalize influence across team archetypes. During model evaluation, I disaggregate performance metrics (precision, recall, calibration) by group identifiers such as conference, region, and budget tier.
  • Explainable metrics and criteria: All features in the model pipeline are vetted for causal relevance. Metrics like Q1 win count, adjusted efficiency margin, and strength-of-schedule percentile are computed from first principles and normalized across seasons. Model output is explained via SHAP waterfall plots, LIME approximations, and decision trees that trace selection thresholds.
  • Auditable decision trails: Each selection is logged with full metadata, including feature inputs, preprocessing transformations, SHAP attribution vectors, model version hash, and any human overrides. The metadata is stored in Delta Lake tables and indexed with Unity Catalog to ensure version-controlled reproducibility.

This multi-layered approach ensures that the AI Agent doesn’t simply replicate human bias more efficiently—it actively identifies, measures, and corrects it. As a result, the system provides a fairer, data-driven, and transparent alternative to traditional selection processes.

Aggregate and Normalize Multi-Source Data

An unbiased AI Agent needs a rich, multi-dimensional dataset that encapsulates historical outcomes and the contextual performance signals that inform selection decisions. Specifically, the dataset should include:

  • Historical selection and seeding data from past NCAA tournaments, including which teams were selected, their seedings, and any available justifications from the selection committee. This serves as a supervised learning ground truth and enables retrospective validation of the AI model.
  • Granular team statistics, including adjusted offensive/defensive efficiency (from sources like KenPom and Bart Torvik's T-Rank), rebound percentages, steal rates, assist-to-turnover ratios, and effective possession efficiency, which allow the model to identify high-performing but low-visibility teams.
  • Game-by-game performance logs, from which features like win streaks, second-half performance deltas, clutch-game success rates, and margin-of-victory distributions are derived. Time-window aggregations (e.g., the last five games, conference play, and Q1 matchups) would be encoded using temporal convolutional feature engineering.
  • Ensemble rank comparisons across NET and secondary rankings (Sagarin, Massey, Haslametrics). Differences between these rankings and actual selection outcomes highlight systemic bias patterns that can be statistically corrected in the model.
  • Conference strength indicators, including weighted inter-conference win ratios, opponent average NET, and adjusted SoS by venue, are important to ensure that non-Power Five teams are accurately evaluated in context.
  • Player-level data includes individual on- and off-court impact (e.g., RAPM, BPM), injury timelines, and missed games. For example, a team with a three-game losing streak while missing its top scorer shouldn't be penalized like a healthy, underperforming team.
  • Venue-adjusted outcomes, calculated using a location-adjusted Elo model to correct for home-court advantage and neutral-site volatility. A neutral win over a top-20 opponent should weigh more than a home win over a lower-tier team (a minimal Elo sketch follows this list).
  • Roster analytics include minutes continuity, percentage of returning starters, experience-weighted team performance, and bench depth. These metrics explain why a veteran team may be more consistent than a statistically similar but younger team.
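A minimal sketch of the venue-adjusted Elo update mentioned above, with the home-court bonus treated as a tunable assumption:

```python
def elo_update(r_team: float, r_opp: float, team_won: bool,
               site: str, k: float = 32.0, home_edge: float = 80.0) -> float:
    """One Elo rating update with a simple venue adjustment.

    home_edge is an assumed home-court bonus in rating points (roughly
    60-100 is a common range for college basketball); neutral-site games
    get no adjustment.
    """
    venue = {"home": home_edge, "away": -home_edge, "neutral": 0.0}[site]
    expected = 1.0 / (1.0 + 10 ** (-((r_team + venue) - r_opp) / 400.0))
    return r_team + k * ((1.0 if team_won else 0.0) - expected)

# A neutral-site win over a stronger opponent moves the rating more than
# a home win over a weaker one
print(elo_update(1600, 1700, True, "neutral"))  # larger gain
print(elo_update(1600, 1500, True, "home"))     # smaller gain
```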

I would use Databricks as the centralized analytics and machine learning platform to consolidate these datasets. Each data source—ranging from NCAA statistics archives, public APIs, and licensed third-party analytics—would be ingested through scheduled ETL workflows orchestrated via Delta Live Tables. The ingestion process would support real-time updates and historical replay, enabling accurate mid-season model refreshes.

For example, using Databricks Autoloader and schema evolution features, I would configure pipelines to automatically process updated KenPom and NET rankings each night while maintaining full historical context. I would also capture and version-control any drift in team metrics or opponent strength over the season.
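A minimal Auto Loader sketch of that nightly ingestion, intended to run on Databricks where a spark session is in scope; the paths and table name are illustrative:

```python
from pyspark.sql.functions import current_timestamp

(spark.readStream
    .format("cloudFiles")                                    # Databricks Auto Loader
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/net_rankings/schema")
    .option("cloudFiles.inferColumnTypes", "true")
    .load("/mnt/raw/net_rankings/")                          # nightly ranking drops land here
    .withColumn("ingested_at", current_timestamp())
    .writeStream
    .option("checkpointLocation", "/mnt/checkpoints/net_rankings")
    .option("mergeSchema", "true")                           # tolerate columns added over time
    .trigger(availableNow=True)                              # process all new files, then stop
    .toTable("bronze.net_rankings"))
```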

During data normalization, features such as offensive efficiency would be Z-score normalized seasonally, while ordinal features like rankings would be converted into percentiles and delta metrics. This ensures that the model doesn't overweight raw values and can understand ranking improvements over time.
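A minimal pandas sketch of that normalization step (shown in pandas for brevity; the production pipeline would do the same in Spark), assuming hypothetical adj_off_eff, net_rank, and net_rank_midseason columns:

```python
# Z-score continuous efficiency metrics within each season so that a
# high-scoring season doesn't dominate a slower-paced one
df["adj_off_eff_z"] = df.groupby("season")["adj_off_eff"].transform(
    lambda s: (s - s.mean()) / s.std()
)

# Convert ordinal NET rank into a within-season percentile (1.0 = best team)
df["net_rank_pctl"] = df.groupby("season")["net_rank"].rank(
    pct=True, ascending=False
)

# Delta metric: ranking improvement since a mid-season snapshot
df["net_rank_delta"] = df["net_rank_midseason"] - df["net_rank"]
```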

Unity Catalog would enforce data governance and policy across all personas (analyst, engineer, reviewer). Role-based access would allow model retraining without exposing sensitive raw data, and metadata tracking would allow full transparency into how each dataset contributes to final outputs.

Finally, I would use built-in Quality Monitoring to detect outlier games (e.g., 40-point blowouts, bench-only lineups) and label them as low-confidence training samples. This would safeguard the model from being trained on statistical anomalies and enhance its robustness.

Through this rigorous, federated data pipeline, the AI Agent would operate with unparalleled breadth and depth, capturing the statistical narrative and the human nuance that defines college basketball performance.

Train a Transparent ML Model

The AI Agent should be built on a supervised learning framework, using ensemble-based machine learning techniques that support binary classification. The goal is to replicate the NCAA tournament selection process by training the model to output "Selected" or "Not Selected" for each team based on over a decade of historical data. These labels are paired with contextual metadata, such as conference bid counts and at-large availability, to ground the model in the structural constraints of tournament selection.

The feature engineering strategy involves creating over 100 features that fall into five primary categories. First, quantitative metrics include Adjusted Net Efficiency (ANE), Pythagorean winning percentage, and KenPom's proprietary stats, such as Luck Rating and Effective Height. These inputs capture a team’s season-long statistical quality. Temporal indicators add context to when performance occurs—like rolling averages of key metrics across the final 5–15 games, current win streaks, and mid-season volatility.
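As a small worked example of the Pythagorean component, with the exponent treated as a tunable assumption:

```python
def pythagorean_win_pct(points_for: float, points_against: float,
                        exponent: float = 11.5) -> float:
    """Expected winning percentage from season scoring averages or totals.

    The exponent is a tunable assumption; published values for college
    basketball generally fall somewhere in the 10-14 range.
    """
    pf, pa = points_for ** exponent, points_against ** exponent
    return pf / (pf + pa)

# A team outscoring opponents 78-72 on average projects to win about 72% of games
print(round(pythagorean_win_pct(78, 72), 3))
```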

Contextual metrics address broader program dynamics. Strength of Record (SOR) and a custom Bubble Index measure how a team's resume compares to historical tournament thresholds. The Player Continuity Index captures the percentage of returning contributors, influencing consistency and late-season performance. I would apply learned embeddings for categorical variables like conference tier to capture non-linear conference relationships. Finally, injury-derived impact scores reflect team strength fluctuations by calculating performance deltas with and without high-usage players, weighted by game importance.

The architecture relies on a layered ensemble of base learners—XGBoost, CatBoost, and Random Forests—highly effective for structured data. I also include a logistic regression model with L1 regularization to establish a transparent baseline. The predictions from these models are fed into a meta-learner, such as a gradient boosting machine or shallow MLP, to optimize final selection confidence.
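A minimal stacking sketch along those lines, using scikit-learn's StackingClassifier with a gradient boosting meta-learner (CatBoost omitted here for brevity) and hypothetical X_train, y_train, and X_candidates inputs:

```python
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

stack = StackingClassifier(
    estimators=[
        ("xgb", XGBClassifier(n_estimators=400, max_depth=4, learning_rate=0.05,
                              eval_metric="logloss")),
        ("rf", RandomForestClassifier(n_estimators=500, min_samples_leaf=5)),
        ("lr", LogisticRegression(penalty="l1", solver="liblinear", C=0.5,
                                  max_iter=2000)),   # transparent baseline
    ],
    final_estimator=GradientBoostingClassifier(n_estimators=100, max_depth=2),
    stack_method="predict_proba",   # meta-learner sees base-model probabilities
    cv=5,
)
stack.fit(X_train, y_train)
selection_probs = stack.predict_proba(X_candidates)[:, 1]
```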

A Bayesian Ridge Regression layer sits atop the ensemble to support decision-making under uncertainty. This component provides confidence intervals for each selection probability, giving analysts a clearer picture of which teams are definitively in, and which reside on the bubble.

A rigorous cross-validation pipeline enforces model robustness. I use nested, stratified, time-aware k-fold validation to preserve seasonal context while preventing overfitting. In addition, the latest 1–2 seasons are held out entirely to simulate real-world performance.

Fairness is a top priority. I evaluate the model using Equal Opportunity Difference, Disparate Impact Ratio, and Group Calibration Error. These are visualized with Fairlearn dashboards to show how selection rates vary across protected groups, such as conference affiliation, geographic region, or budget tier.

The system employs three mitigation strategies to actively reduce bias. First, reweighting adjusts the influence of training samples based on representation imbalance. Second, adversarial debiasing introduces a secondary model trained to detect protected attributes; if it succeeds, the primary model is penalized, reducing information leakage. Third, I apply constraint-based regularization to decorrelate inputs known to encode systemic bias.

Interpretability is integral. SHAP values explain global and local model behavior, while LIME and counterfactual explanations help analysts answer “what if” questions—like whether an extra Q1 win would’ve changed a selection outcome. These tools make the model’s logic transparent and support human oversight.

Finally, I use Bayesian optimization (via Optuna) for hyperparameter tuning to prevent overfitting. Regularization techniques such as early stopping, dropout layers, and L2 penalties are included to enhance generalizability, especially for edge cases like the COVID-19 shortened 2020 season.
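A minimal Optuna sketch of that tuning loop, assuming hypothetical X_train, y_train, and a time-aware season_folds splitter defined elsewhere in the pipeline:

```python
import optuna
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

def objective(trial):
    params = {
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "max_depth": trial.suggest_int("max_depth", 3, 8),
        "min_child_weight": trial.suggest_int("min_child_weight", 1, 10),
        "subsample": trial.suggest_float("subsample", 0.6, 1.0),
        "reg_lambda": trial.suggest_float("reg_lambda", 1e-2, 10.0, log=True),
    }
    model = XGBClassifier(n_estimators=300, eval_metric="logloss", **params)
    scores = cross_val_score(model, X_train, y_train, cv=season_folds,
                             scoring="neg_log_loss")
    return scores.mean()

study = optuna.create_study(direction="maximize")  # maximize negative log loss
study.optimize(objective, n_trials=50)
print(study.best_params)
```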

These techniques form a resilient, auditable, and fair machine-learning pipeline that supports NCAA selection. The AI Agent doesn’t replace human decision-makers—it empowers them with data-driven insights, institutional accountability, and unprecedented transparency.

Add Explainability and Human-in-the-Loop Review

To establish trust in the AI Agent and enable both technical validation and stakeholder engagement, explainability must be woven into the system’s design from the start. A black-box approach simply won’t suffice when decisions affect real teams, careers, and institutions. Instead, I would build an integrated explainability framework powered by robust computational tools, interactive diagnostics, and human-in-the-loop override mechanisms.

At the core of the interpretability layer are SHAP (SHapley Additive exPlanations) values. These would be calculated for every team prediction at both the global model level and for individual team-level inference. SHAP values allow us to break down the predicted probability into contributions from each feature—such as Adjusted Offensive Efficiency, Q1 wins, or Strength of Schedule. SHAP values would be aggregated across base models using a weighted average and stored in a version-controlled attribution database to ensure consistency across the ensemble. Each explanation would be indexed by team, season, and model version for seamless querying and audit.
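A minimal SHAP sketch for the boosted-tree member of the ensemble, assuming hypothetical xgb_model, X_background, and X_candidates objects:

```python
import shap

# Explain the boosted-tree component of the ensemble
explainer = shap.Explainer(xgb_model, X_background)   # X_background: sample of training rows
explanation = explainer(X_candidates)

bubble_idx = 42  # hypothetical row index for a team under review
shap.plots.waterfall(explanation[bubble_idx])  # per-feature push toward/away from selection
shap.plots.beeswarm(explanation)               # global view across all candidate teams
```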

To make these insights accessible, I would implement a multi-tiered visualization dashboard using Plotly Dash for developer-led exploration and Power BI for broader stakeholder access. This interface would include:

  • SHAP summary plots to show which features have the most significant influence across all teams.
  • SHAP force and waterfall plots to explain how each feature influenced the prediction for a given team.
  • Delta analysis tables to simulate how selection probabilities would shift if a team gained or lost key wins.
  • Side-by-side comparison modules to show why one team was ranked ahead of another.
  • Visual clustering of teams using dimensionality reduction (t-SNE or UMAP), highlighting similarities in team profiles and selection outcomes.

The system would include a permission-controlled override engine to maintain governance and human judgment. Overrides would be initiated through a secured interface and require structured justifications. Each override would trigger an MLflow experiment run tagged with metadata such as user_id, reason_code, delta_probability, and timestamp. Overrides would also compare pre- and post-intervention SHAP paths to show how the model's explanation changed. If the volume of overrides crosses a defined threshold, it would automatically trigger retraining with adjusted weights to account for systematic corrections.
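A minimal sketch of that override logging, using standard MLflow calls; the function name, tag keys, and example values are illustrative:

```python
import mlflow

def log_override(user_id: str, team: str, season: int, reason_code: str,
                 delta_probability: float, shap_before: dict, shap_after: dict):
    """Record a human override as its own MLflow run for later audit."""
    with mlflow.start_run(run_name=f"override-{team}-{season}"):
        mlflow.set_tags({
            "user_id": user_id,
            "reason_code": reason_code,
            "team": team,
            "season": str(season),
        })
        mlflow.log_metric("delta_probability", delta_probability)
        mlflow.log_dict(shap_before, "shap_before.json")
        mlflow.log_dict(shap_after, "shap_after.json")

log_override("analyst_17", "Example State", 2024, "injury_context",
             delta_probability=0.06,
             shap_before={"q1_wins": 0.12}, shap_after={"q1_wins": 0.18})
```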

Scenario simulation is another critical component. I would implement a forward-simulation layer using cached model pipelines to ensure responsiveness. For instance, a user could simulate a scenario where a team loses its top scorer mid-season, and the system would recalculate team performance metrics like adjusted net rating, effective height, and player win shares. These updated inputs would be run through the model to produce revised selection probabilities, along with new SHAP explanations. Scenarios could be exported as interactive notebooks, PDFs, or JSON payloads to share with analysts, committee members, or journalists.

Finally, to ensure traceability, every interaction with the system—from model inference to dashboard access—would be logged through Databricks Unity Catalog and Azure Monitor. Role-based access control (RBAC) would govern who can view explanations, edit override thresholds, or initiate scenario simulations. These controls guarantee secure access and full accountability.

Altogether, this layer transforms the AI Agent into a system that not only makes recommendations but can explain, justify, and adapt them. It becomes a transparent, auditable, and human-aware platform capable of supporting informed decision-making while maintaining fairness and governance at every step.

Test and Validate the Model

Before deployment, rigorous model testing is essential to ensure the AI Agent performs consistently across several dimensions, including predictive accuracy, generalization to unseen seasons, fairness across subgroups, sensitivity to input perturbations, and transparency of decision rationale. To accomplish this, I would design a comprehensive testing strategy that incorporates standard machine learning evaluation techniques and domain-specific validation relevant to NCAA tournament selection.

To evaluate predictive performance, I would start by calculating core classification metrics: AUC-ROC (to measure overall discriminatory ability), precision and recall (to assess false positives and false negatives), F1-score (for balance), and log loss (to penalize overconfident errors). Importantly, these metrics wouldn’t be measured globally alone; I would segment the evaluation across protected subgroups such as Power Five vs. mid-major programs, public vs. private institutions, and budget quartiles to assess any disparities in model performance.

Time-aware validation is essential, given the chronological nature of NCAA seasons. I would implement nested k-fold cross-validation with folds grouped by season to ensure that no data from future games or outcomes leaks into model training. For deployment simulation, I would hold out entire seasons (e.g., 2022 and 2023) and evaluate the model on those after training on prior years. This "out-of-time" testing mirrors the real-world inference pipeline where models are applied prospectively on the current tournament field.

For fairness validation, I would apply statistical fairness audits using frameworks like Fairlearn and AIF360. Key fairness metrics include Equal Opportunity Difference (disparity in true positive rates across groups), Disparate Impact Ratio (the ratio of selection rates across protected classes), Statistical Parity Difference, and Group Calibration Error. I would visualize these metrics with parity dashboards and integrate fairness gates into the CI/CD workflow to prevent production deployment of any model version that regresses in fairness.

Robustness is another critical dimension. I would conduct perturbation-based sensitivity testing, in which features such as NET rating, Strength of Schedule, or win/loss records are adjusted incrementally to simulate uncertainty, missing data, or borderline game outcomes. The model should maintain stability under minor input changes; high variance would flag brittleness. To evaluate this systematically, I would track the standard deviation of selection probability across perturbation sweeps for each team.
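A minimal sketch of that perturbation sweep, reusing the stacked model and candidate feature frame from the earlier sketches:

```python
import numpy as np

def sensitivity_std(model, row, feature, deltas=(-3, -2, -1, 0, 1, 2, 3)):
    """Std. dev. of selection probability as one feature is nudged up and down.

    `row` is a single-team feature vector (pandas Series); a large spread
    flags brittleness around that team's profile.
    """
    probs = []
    for d in deltas:
        perturbed = row.copy()
        perturbed[feature] += d
        probs.append(model.predict_proba(perturbed.to_frame().T)[0, 1])
    return float(np.std(probs))

# Example: how stable is a bubble team's probability to small NET-rank shifts?
print(sensitivity_std(stack, X_candidates.iloc[42], "net_rank"))
```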

I would also run ablation studies, removing groups of features (e.g., injury indicators, temporal win trends, or player continuity metrics) to measure the marginal contribution of each feature class. This not only aids interpretability but validates that the model is not overly reliant on any single source of input or proxy variable.

In addition to synthetic tests, I would conduct post-hoc case-based evaluations using known controversial bubble teams from previous seasons. I would compare the model’s decision with the actual NCAA committee choice and public discourse for each case. Each evaluation would be accompanied by a SHAP explanation showing which features most influenced the selection score.

Finally, all test cases, metrics, plots, and SHAP summaries would be stored in an MLflow registry with versioning. Each batch of tests would be assigned a unique run ID and linked to the model artifact hash, ensuring full reproducibility and auditability.

This layered testing infrastructure guarantees that the AI Agent delivers not only accurate predictions but also fairness, robustness, and traceable transparency—qualities essential to institutional trust in any decision automation system.

Rank and Select the Top 68

The Agent finalizes its picks through a multi-stage, rules-based and probabilistically-guided pipeline designed for both transparency and replicability:

  1. Auto-Bid Integration: The 32 automatic qualifiers—conference tournament winners—are directly ingested into the bracket as locked selections. The AI Agent listens for real-time updates via NCAA API endpoints and integrates champions automatically through a watch-triggered update job. These teams are flagged as immutable for downstream ranking models.
  2. At-Large Selection Ranking: Using the previously described ensemble model, each of the remaining eligible teams (typically over 300) is scored based on its model-derived selection probability. The top 36 non-auto-bid teams by selection score are added to the field, and additional post-model filters are then applied to ensure compliance with NCAA selection principles (a minimal field-assembly sketch follows this list).
  3. Seeding Algorithm Simulation: Once the 68 teams are finalized, I use a constraint satisfaction optimization engine to simulate the seeding process and enforce the committee's bracketing constraints.
  4. Last Four In / First Four Out Simulation: The Agent conducts a probabilistic sensitivity analysis to identify which bubble teams sit just inside and just outside the field, and how stable those placements are.
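A minimal field-assembly sketch for steps 1 and 2, assuming a hypothetical scores DataFrame with team and selection_prob columns plus a set of the 32 auto-bid winners:

```python
import pandas as pd

def build_field(scores: pd.DataFrame, auto_bid_teams: set) -> pd.DataFrame:
    """Assemble the 68-team field: 32 locked auto-bids plus the top 36
    at-large teams by model selection probability."""
    auto = scores[scores["team"].isin(auto_bid_teams)].assign(bid_type="auto")
    at_large = (scores[~scores["team"].isin(auto_bid_teams)]
                .nlargest(36, "selection_prob")
                .assign(bid_type="at_large"))
    field = pd.concat([auto, at_large]).sort_values("selection_prob", ascending=False)
    assert len(field) == 68, "Expected 32 auto-bids + 36 at-large selections"
    return field
```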

All outputs from this process are exportable to CSV, JSON, or bracket visualization formats and can be re-run with new inputs using version-controlled pipelines.

This automated system emulates and arguably exceeds the manual selection process by incorporating more variables, less subjective bias, and a fully reproducible audit trail.

Final Thoughts

This AI Agent is more than a tech showcase—it’s a blueprint for democratizing tournament selection, addressing systemic opacity, and re-centering decision-making around quantifiable performance and ethical AI.

For data scientists, fans, and NCAA stakeholders, this signals the emergence of a new paradigm: one in which technology meets transparency and trust is earned through reproducibility, fairness audits, and model accountability.

As AI continues to evolve, future iterations of this Agent could incorporate reinforcement learning to dynamically adjust selection strategy as new data becomes available or utilize federated learning to securely incorporate proprietary institutional data without compromising privacy. Advanced language models could ingest and interpret media sentiment, injury reports, and press conference transcripts to weigh human context alongside metrics.

Eventually, selection models could integrate real-time player tracking and biomechanical data, using edge inference on computer vision pipelines to measure defensive spacing or stamina degradation—unlocking even deeper insights into team quality.

Let’s not just make March Madness thrilling; let’s make it trustworthy, explainable, and future-ready.


