
How to Measure AI Screening Quality

Five core metrics for AI screening quality: completion rate, pass-through rate, predictive validity (target r = 0.20-0.40), candidate NPS, and adverse impact.

Outhire Team
2026-01-29
8 min read

TL;DR: Measure AI screening quality across five dimensions: completion rate (target 70-80%), pass-through rate (15-35%), predictive validity (r = 0.20-0.40 for screen score vs. next-stage advancement), candidate NPS (target 30+), and adverse impact ratio (four-fifths rule per EEOC guidelines). Review monthly for operations, quarterly for validity and fairness, and annually for a full audit. A fast screen that advances the wrong candidates is a fast failure.

The Five Core Quality Metrics

| Metric | What It Measures | Target Benchmark | Review Cadence | Source |
|---|---|---|---|---|
| Completion rate | % invited candidates who finish | 70-80% (good); 80%+ (excellent) | Monthly | Aptitude Research 2025 |
| Pass-through rate | % screened who advance to next stage | 15-35% (well-calibrated) | Monthly | |
| Predictive validity | Correlation between screen scores and outcomes | r = 0.20-0.40 | Quarterly | Schmidt & Hunter, 1998 |
| Candidate NPS | Candidate experience rating | 30+ (good); 50+ (excellent) | Monthly | |
| Adverse impact ratio | Selection rate parity across demographics | Four-fifths rule compliance | Quarterly | EEOC Uniform Guidelines |

1. Completion Rate

Benchmarks: Below 60% = significant design issues. 60-70% = room for improvement. 70-80% = good. Above 80% = excellent.

How to improve: Shorten screen length (every additional minute reduces completion 2-4 percentage points per Aptitude Research 2025), simplify instructions, use SMS invitations (+15-20 pts over email), ensure mobile compatibility.
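
As a minimal sketch, the completion rate and its benchmark band can be computed per invitation channel. The function names, channel names, and counts below are hypothetical; only the band cut-offs come from the benchmarks above.

```python
def completion_rate(invited: int, completed: int) -> float:
    """Percentage of invited candidates who finished the screen."""
    if invited <= 0:
        raise ValueError("invited must be positive")
    return 100.0 * completed / invited


def completion_band(rate_pct: float) -> str:
    """Map a completion rate to the benchmark bands listed above."""
    if rate_pct < 60:
        return "significant design issues"
    if rate_pct < 70:
        return "room for improvement"
    if rate_pct <= 80:
        return "good"
    return "excellent"


# Hypothetical monthly counts per channel: (invited, completed).
by_channel = {"email": (500, 310), "sms": (400, 324)}

for channel, (invited, completed) in by_channel.items():
    rate = completion_rate(invited, completed)
    print(f"{channel}: {rate:.1f}% ({completion_band(rate)})")
```

Splitting by channel is what makes the number actionable: an overall rate can look acceptable while one channel drags it down.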

2. Pass-Through Rate

Benchmarks: Below 15% = criteria too restrictive or misaligned with applicant pool. 15-35% = well-calibrated. Above 50% = insufficient screening depth.

How to improve: Adjust scoring thresholds, refine rubrics, ensure criteria match actual job requirements (not aspirational profiles).
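
The same calculation applies to pass-through rate. The sketch below is illustrative only: the role names and counts are invented, and because the benchmarks above do not label the 35-50% range, the helper returns None there rather than guessing.

```python
from typing import Optional


def pass_through_rate(screened: int, advanced: int) -> float:
    """Percentage of screened candidates who advance to the next stage."""
    if screened <= 0:
        raise ValueError("screened must be positive")
    return 100.0 * advanced / screened


def pass_through_band(rate_pct: float) -> Optional[str]:
    """Map a pass-through rate to the benchmark bands listed above."""
    if rate_pct < 15:
        return "too restrictive or misaligned"
    if rate_pct <= 35:
        return "well-calibrated"
    if rate_pct > 50:
        return "insufficient screening depth"
    return None  # 35-50% is not labeled in the benchmarks above


# Hypothetical counts per role: (screened, advanced).
by_role = {"support agent": (200, 50), "account executive": (80, 48)}

for role, (screened, advanced) in by_role.items():
    rate = pass_through_rate(screened, advanced)
    print(f"{role}: {rate:.0f}% ({pass_through_band(rate)})")
```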

3. Predictive Validity

The most important quality metric. If high scores don't correlate with downstream success, screening generates noise, not signal.

How to measure: Correlate AI scores with advancement through interviews, offer rate, offer acceptance, 90-day retention, and hiring manager satisfaction.

Benchmark context: Schmidt & Hunter's 1998 meta-analysis found structured interviews predict job performance at r = 0.51. A single screening stage should not be expected to match that: r = 0.20-0.40 is a meaningful result, and values toward the top of that range suggest the screen captures genuine signal.
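
One way to compute this, sketched under assumptions: treat next-stage advancement as a 0/1 outcome and take the Pearson correlation with the screen score (with a binary outcome this is the point-biserial correlation). The scores and outcomes below are invented and far cleaner than real hiring data; the function name is hypothetical.

```python
from math import sqrt


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need two sequences of equal length, n >= 3")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one side never varies")
    return sxy / sqrt(sxx * syy)


# Hypothetical data: AI screen score and whether the candidate advanced (1) or not (0).
scores = [82, 74, 91, 55, 63, 48, 77, 69, 58, 88]
advanced = [1, 1, 1, 0, 0, 0, 1, 0, 0, 1]

print(f"score vs. advancement: r = {pearson_r(scores, advanced):.2f}")
```

The same function works for the other outcomes listed above (offer, acceptance, 90-day retention) by swapping in a different 0/1 column.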

How to improve: Rank individual questions by predictive value. Remove/replace the bottom 20%. Weight predictive questions more heavily.
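
A rough sketch of the ranking step, assuming each question has its own numeric score per candidate and that "predictive value" is measured as the correlation between that score and a 0/1 outcome. The question IDs, scores, and helper names are hypothetical.

```python
from math import sqrt


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation; raises if either sequence has no variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one side never varies")
    return sxy / sqrt(sxx * syy)


def rank_questions(question_scores: dict[str, list[float]],
                   outcomes: list[float]) -> list[tuple[str, float]]:
    """Return (question, r) pairs sorted from most to least predictive."""
    pairs = [(q, pearson_r(s, outcomes)) for q, s in question_scores.items()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


def bottom_fraction(ranked: list[tuple[str, float]], fraction: float = 0.2) -> list[str]:
    """Questions in the least-predictive `fraction` (at least one)."""
    k = max(1, int(len(ranked) * fraction))
    return [q for q, _ in ranked[-k:]]


# Hypothetical per-question scores (1-5) for six candidates, and whether each advanced.
outcomes = [1, 1, 1, 0, 0, 0]
question_scores = {
    "q1": [5, 5, 4, 2, 1, 1],
    "q2": [4, 5, 3, 3, 2, 2],
    "q3": [3, 4, 3, 2, 3, 2],
    "q4": [3, 2, 3, 3, 2, 3],
    "q5": [1, 2, 1, 4, 5, 4],
}

ranked = rank_questions(question_scores, outcomes)
print("replace candidates:", bottom_fraction(ranked))
```

With only six candidates the per-question correlations are very noisy; in practice the ranking should wait until several months of data have accumulated.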

4. Candidate NPS

Benchmarks: Below 0 = serious problems. 0-30 = average. 30-50 = good. Above 50 = excellent.
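
The article does not spell out how the score is computed. The sketch below assumes the standard Net Promoter convention: candidates rate the experience from 0 to 10, ratings of 9-10 count as promoters, 0-6 as detractors, and NPS is the percentage of promoters minus the percentage of detractors. The ratings are invented, and the handling of the exact boundaries (30 and 50) is an assumption.

```python
def net_promoter_score(ratings: list[int]) -> float:
    """NPS on a -100..100 scale from 0-10 ratings (standard convention)."""
    if not ratings:
        raise ValueError("no ratings supplied")
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return 100.0 * (promoters - detractors) / len(ratings)


def nps_band(nps: float) -> str:
    """Map an NPS to the bands above; boundary handling is an assumption."""
    if nps < 0:
        return "serious problems"
    if nps < 30:
        return "average"
    if nps <= 50:
        return "good"
    return "excellent"


# Hypothetical post-screen survey responses.
ratings = [10, 9, 9, 8, 7, 6, 3, 10, 9, 8]
nps = net_promoter_score(ratings)
print(f"NPS = {nps:.0f} ({nps_band(nps)})")
```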

Common complaints: screen length, question relevance, voice quality, lack of clarity about next steps. A 2024 Gartner survey found 67% of candidates were comfortable with AI assessments given clear process transparency.

5. Adverse Impact Ratio

Standard: A widely used benchmark is the four-fifths rule: the pass rate for any demographic group should be at least 80% of the pass rate of the group with the highest rate. Many organizations also run annual bias audits as part of their fair-hiring program.
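
A minimal sketch of the four-fifths check, assuming pass counts are available per group. The group labels and counts are hypothetical, and a ratio below 0.8 is treated here as a flag for closer review rather than a conclusion.

```python
def adverse_impact_ratios(counts: dict[str, tuple[int, int]]) -> dict[str, float]:
    """counts maps group -> (screened, passed).

    Returns each group's pass rate divided by the highest group's pass rate.
    """
    rates = {group: passed / screened for group, (screened, passed) in counts.items()}
    top_rate = max(rates.values())
    if top_rate == 0:
        raise ValueError("no group has any passes")
    return {group: rate / top_rate for group, rate in rates.items()}


def below_four_fifths(counts: dict[str, tuple[int, int]],
                      threshold: float = 0.8) -> list[str]:
    """Groups whose ratio falls below the four-fifths threshold."""
    ratios = adverse_impact_ratios(counts)
    return sorted(group for group, ratio in ratios.items() if ratio < threshold)


# Hypothetical quarterly counts: (screened, passed).
counts = {"group_a": (200, 60), "group_b": (150, 30), "group_c": (100, 27)}

for group, ratio in adverse_impact_ratios(counts).items():
    print(f"{group}: impact ratio {ratio:.2f}")
print("flag for review:", below_four_fifths(counts))
```

Here group_b passes at 20% against a top rate of 30%, a ratio of about 0.67, so it is flagged; group_c at 27% (ratio 0.90) is not.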

How to improve: Identify which questions/criteria drive disparity. Revise rubrics. Add alternative assessment paths. Monitor continuously at high volume.

Quality Measurement Framework

Monthly Review

  • Completion rate (overall and by channel/source)
  • Pass-through rate (by role and department)
  • Candidate satisfaction scores and comment themes
  • System uptime and technical issues

Quarterly Deep Dive

  • Predictive validity: score vs. interview outcome correlation
  • Adverse impact analysis across protected categories
  • Question-level analysis: most/least predictive questions
  • Benchmark comparison (industry standards + historical performance)
  • Recruiter feedback on result quality

Annual Audit

  • End-to-end predictive validity including post-hire performance data
  • Formal adverse impact study with statistical rigor
  • Question set refresh based on accumulated data
  • Scoring model recalibration
  • Technology review against current market

Optimization Strategies

Question optimization. After collecting several months of data, rank questions by predictive value. Replace the bottom 20% and reallocate time to higher-signal questions.

Threshold tuning. Use data to find the optimal pass/fail threshold: the point where lowering it further produces diminishing returns in subsequent stages, because few of the additional candidates it admits go on to succeed.
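
One possible way to look for that point is to step the threshold down and check how the newly admitted band of candidates fared at the next stage. This sketch assumes next-stage outcomes are known for candidates below the current cut-off (for example from a period when the threshold was lower); the data and function name are hypothetical.

```python
from math import inf
from typing import Optional


def marginal_yield(records: list[tuple[float, bool]],
                   thresholds: list[float]) -> list[tuple[float, int, Optional[float]]]:
    """For each threshold (highest first), describe the band it newly admits.

    records: (screen_score, succeeded_at_next_stage) per candidate.
    Returns (threshold, candidates_in_band, success_rate_in_band) tuples.
    """
    steps = []
    upper = inf
    for threshold in sorted(thresholds, reverse=True):
        band = [ok for score, ok in records if threshold <= score < upper]
        rate = sum(band) / len(band) if band else None
        steps.append((threshold, len(band), rate))
        upper = threshold
    return steps


# Hypothetical historical screens with known next-stage results.
records = [(90, True), (85, True), (80, True), (75, True), (70, False),
           (65, True), (60, False), (55, False), (50, False), (45, False)]

for threshold, count, rate in marginal_yield(records, [80, 60, 40]):
    print(f"lowering to {threshold}: +{count} candidates, success rate {rate}")
```

In this toy data, dropping the cut-off from 60 to 40 admits three more candidates and none succeed, which is the diminishing-returns signal the threshold should sit above.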

Rubric refinement. Scoring rubrics are hypotheses. If "strong" scorers on a question don't outperform "acceptable" scorers in interviews, the rubric needs adjustment.
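
That check can be sketched as a comparison of interview pass rates per rubric rating. The rating labels follow the example above; the records, function names, and the 10-point minimum gap are illustrative assumptions.

```python
from collections import defaultdict


def interview_pass_rate_by_rating(records: list[tuple[str, bool]]) -> dict[str, float]:
    """records: (rubric_rating, passed_interview) for one question."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for rating, passed in records:
        totals[rating] += 1
        passes[rating] += int(passed)
    return {rating: passes[rating] / totals[rating] for rating in totals}


def rubric_discriminates(rates: dict[str, float], min_gap: float = 0.10) -> bool:
    """True if 'strong' scorers beat 'acceptable' scorers by at least min_gap."""
    return rates["strong"] - rates["acceptable"] >= min_gap


# Hypothetical outcomes for one question.
records = [("strong", True), ("strong", False), ("strong", True), ("strong", False),
           ("acceptable", True), ("acceptable", False), ("acceptable", True),
           ("acceptable", False)]

rates = interview_pass_rate_by_rating(records)
print(rates, "-> rubric discriminates:", rubric_discriminates(rates))
```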

Feedback loops. Set up structured channels for recruiter feedback ("Was the summary accurate? Did the score match your assessment?") and hiring manager feedback ("Did AI-screened candidates meet expectations?").

Common Pitfalls

Measuring only speed. Reducing time-to-screen is a process metric, not a quality metric.

Ignoring candidate feedback. Negative experiences damage employer brand and reduce completion rates over time.

Set and forget. Question relevance degrades as roles evolve and candidate pools change. Continuous optimization is not optional.

Optimizing a single metric. Maximizing pass-through at the expense of validity (or vice versa) creates hidden problems. Balance all five metrics.

Frequently Asked Questions

How long does it take to measure predictive validity?

At least one full hiring cycle, typically 3-6 months, with a minimum of 50-100 completed screens that have known outcomes. With fewer, the correlation estimate is too noisy to act on. Start tracking from day one so data is available when needed.
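
To illustrate why sample size matters, the sketch below puts an approximate 95% confidence interval around an observed correlation using the Fisher z-transformation. This is a standard textbook approximation, not something the article prescribes, and the function name is hypothetical.

```python
from math import atanh, sqrt, tanh


def r_confidence_interval(r: float, n: int, z_crit: float = 1.96) -> tuple[float, float]:
    """Approximate confidence interval for a Pearson r via Fisher's z.

    z_crit = 1.96 corresponds to a two-sided 95% interval.
    """
    if n <= 3:
        raise ValueError("need more than 3 observations")
    if not -1.0 < r < 1.0:
        raise ValueError("r must be strictly between -1 and 1")
    z = atanh(r)
    standard_error = 1.0 / sqrt(n - 3)
    return tanh(z - z_crit * standard_error), tanh(z + z_crit * standard_error)


for n in (50, 100, 400):
    low, high = r_confidence_interval(0.30, n)
    print(f"n = {n}: r = 0.30, 95% CI roughly {low:.2f} to {high:.2f}")
```

At n = 50 the interval around r = 0.30 reaches almost down to zero, which is why 50-100 screens is best read as a floor rather than a comfortable sample.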

What is a good predictive validity score?

r = 0.20-0.40 is meaningful for a single screening stage, and it is the range AI phone screening should target for score vs. next-stage advancement. For comparison, structured interviews predict job performance at r = 0.51 (Schmidt & Hunter, 1998).

How to measure adverse impact without demographic data?

Options: voluntary demographic collection through the ATS, proxy analysis, or third-party demographic estimation tools. Work with legal/compliance teams before choosing one. At minimum, monitor pass-through rates across measurable dimensions (geography, application source).

Should all questions be weighted equally?

Start with equal weighting, then adjust based on predictive validity data, weighting more predictive questions higher. Some organizations use ML to optimize weights, but this adds complexity and requires careful bias monitoring.
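
A simple, non-ML version of this can be sketched as follows. It assumes each question has a numeric score and a measured correlation with outcomes; making weights proportional to that correlation (with negative values clipped to zero) is one illustrative choice, not a recommended formula, and all names and numbers are hypothetical.

```python
from typing import Optional


def weights_from_validity(question_r: dict[str, float]) -> dict[str, float]:
    """Weights proportional to each question's correlation, clipped at zero."""
    clipped = {q: max(r, 0.0) for q, r in question_r.items()}
    total = sum(clipped.values())
    if total == 0:
        # No question shows positive validity: fall back to equal weights.
        return {q: 1.0 / len(clipped) for q in clipped}
    return {q: value / total for q, value in clipped.items()}


def weighted_score(answers: dict[str, float],
                   weights: Optional[dict[str, float]] = None) -> float:
    """Weighted average of question scores; equal weighting when weights is None."""
    if weights is None:
        weights = {q: 1.0 for q in answers}
    total_weight = sum(weights[q] for q in answers)
    if total_weight == 0:
        raise ValueError("weights sum to zero")
    return sum(answers[q] * weights[q] for q in answers) / total_weight


# Hypothetical per-question validity and one candidate's question scores.
weights = weights_from_validity({"q1": 0.3, "q2": 0.1, "q3": -0.05})
answers = {"q1": 4.0, "q2": 2.0, "q3": 5.0}

print("equal weighting:", round(weighted_score(answers), 2))
print("validity weighting:", round(weighted_score(answers, weights), 2))
```

Any change in weights shifts who passes, so the adverse impact check should be rerun after each reweighting.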

How to benchmark against other companies?

Industry benchmarks are still emerging. Your AI vendor may provide anonymized cross-customer data. Industry analyst reports publish ranges. The most useful comparison: your own quarter-over-quarter improvement.
