How to Measure AI Screening Quality

Five core metrics for AI screening quality: completion rate, pass-through rate, predictive validity (target r = 0.20-0.40), candidate NPS, and adverse impact.

TL;DR: Measure AI screening quality across five dimensions: completion rate (target 70-80%), pass-through rate (15-35%), predictive validity (r = 0.20-0.40 for screen score vs. next-stage advancement), candidate NPS (target 30+), and adverse impact ratio (four-fifths rule per EEOC guidelines). Review monthly for operations, quarterly for validity and fairness, annually for full audit. A fast screen that advances the wrong candidates is a fast failure.

The Five Core Quality Metrics

Metric	What It Measures	Target Benchmark	Review Cadence	Source
Completion rate	% invited candidates who finish	70-80% (good); 80%+ (excellent)	Monthly	Aptitude Research 2025
Pass-through rate	% screened who advance to next stage	15-35% (well-calibrated)	Monthly	Not cited
Predictive validity	Correlation between screen scores and outcomes	r = 0.20-0.40	Quarterly	Schmidt & Hunter, 1998
Candidate NPS	Candidate experience rating	30+ (good); 50+ (excellent)	Monthly	Not cited
Adverse impact ratio	Selection rate parity across demographics	Four-fifths rule compliance	Quarterly	EEOC Uniform Guidelines

1. Completion Rate

Benchmarks: Below 60% = significant design issues. 60-70% = room for improvement. 70-80% = good. Above 80% = excellent.

How to improve: Shorten the screen (every additional minute reduces completion by 2-4 percentage points, per Aptitude Research 2025), simplify instructions, send invitations by SMS (15-20 percentage points higher completion than email), and ensure mobile compatibility.
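As a minimal sketch, the benchmark bands above can be applied to raw invitation counts as follows. The function names are illustrative, and the handling of exact boundary values (for example, treating exactly 80% as excellent) is an assumption, since the bands do not say which side a boundary falls on.

```python
def completion_rate(invited: int, completed: int) -> float:
    """Return the completion rate as a percentage of invited candidates."""
    if invited <= 0:
        raise ValueError("invited must be a positive count")
    return 100.0 * completed / invited


def completion_band(rate_pct: float) -> str:
    """Map a completion rate (in percent) to the benchmark bands.

    Assumption: each band includes its lower boundary.
    """
    if rate_pct < 60:
        return "significant design issues"
    if rate_pct < 70:
        return "room for improvement"
    if rate_pct < 80:
        return "good"
    return "excellent"


if __name__ == "__main__":
    rate = completion_rate(invited=200, completed=150)
    print(f"{rate:.1f}% -> {completion_band(rate)}")
```

Running the same calculation per invitation channel (SMS vs. email) would show whether a channel change is actually moving the number.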

2. Pass-Through Rate

Benchmarks: Below 15% = criteria too restrictive or misaligned with applicant pool. 15-35% = well-calibrated. Above 50% = insufficient screening depth.

How to improve: Adjust scoring thresholds, refine rubrics, ensure criteria match actual job requirements (not aspirational profiles).
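A rough sketch of how pass-through could be tracked per role and checked against the bands above. The record layout (a role name paired with an advanced/not-advanced flag) is hypothetical. Because the benchmarks do not classify rates between 35% and 50%, the sketch reports that range as unclassified rather than guessing.

```python
from collections import defaultdict


def pass_through_by_role(records):
    """records: iterable of (role, advanced) pairs -> {role: rate in percent}."""
    screened = defaultdict(int)
    advanced = defaultdict(int)
    for role, did_advance in records:
        screened[role] += 1
        advanced[role] += int(bool(did_advance))
    return {role: 100.0 * advanced[role] / screened[role] for role in screened}


def pass_through_flag(rate_pct: float) -> str:
    """Label a pass-through rate using the benchmark bands."""
    if rate_pct < 15:
        return "too restrictive or misaligned"
    if rate_pct <= 35:
        return "well-calibrated"
    if rate_pct > 50:
        return "insufficient screening depth"
    return "between bands (not classified by the benchmarks)"


if __name__ == "__main__":
    # Hypothetical data: 10 screens per role.
    data = (
        [("sales", True)] * 2 + [("sales", False)] * 8
        + [("engineering", True)] * 6 + [("engineering", False)] * 4
    )
    for role, rate in pass_through_by_role(data).items():
        print(role, f"{rate:.0f}%", pass_through_flag(rate))
```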

3. Predictive Validity

The most important quality metric. If high scores don't correlate with downstream success, screening generates noise, not signal.

How to measure: Correlate AI scores with advancement through interviews, offer rate, offer acceptance, 90-day retention, and hiring manager satisfaction.

Benchmark context: Schmidt & Hunter's 1998 meta-analysis found structured interviews predict job performance at r = 0.51. For a single screening stage, r = 0.20-0.40 is meaningful, and values toward the top of that range suggest the screen captures genuine signal rather than noise.
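One way to compute this metric, sketched under assumptions: code next-stage advancement as 0/1 and take the Pearson correlation with the screen score (with a binary outcome this is the point-biserial correlation). The function name and sample data are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Hypothetical screen scores and whether each candidate advanced (1) or not (0).
    scores = [55, 60, 72, 80, 91, 45]
    advanced = [0, 0, 1, 1, 1, 0]
    print(f"r = {pearson_r(scores, advanced):.2f}")
```

The same function works for the other outcomes listed above (offer rate, 90-day retention), as long as each candidate has one score and one outcome value.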

How to improve: Rank individual questions by predictive value. Remove/replace the bottom 20%. Weight predictive questions more heavily.
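A sketch of the ranking step, assuming "predictive value" is measured as each question's correlation with a 0/1 outcome such as next-stage advancement. That definition, the data layout, and the function names are assumptions for illustration.

```python
import math


def _pearson(xs, ys):
    """Pearson correlation (helper; inputs must be non-constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def rank_questions(question_scores, outcomes):
    """Return [(question_id, r)] sorted from most to least predictive.

    question_scores: {question_id: [score for each candidate]}
    outcomes: [0/1 outcome for each candidate, in the same order]
    """
    pairs = [(qid, _pearson(scores, outcomes)) for qid, scores in question_scores.items()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


def bottom_fraction(ranked, fraction=0.2):
    """Question ids in the lowest-ranked fraction (at least one question)."""
    k = max(1, int(len(ranked) * fraction))
    return [qid for qid, _ in ranked[-k:]]


if __name__ == "__main__":
    outcomes = [1, 1, 0, 0, 1]  # hypothetical: advanced or not
    question_scores = {
        "q1": [5, 4, 2, 1, 5],
        "q2": [4, 3, 3, 2, 4],
        "q3": [3, 2, 3, 2, 3],
        "q4": [2, 3, 3, 2, 2],
        "q5": [1, 2, 5, 4, 1],
    }
    ranked = rank_questions(question_scores, outcomes)
    print("replace candidates:", bottom_fraction(ranked))
```

With only a handful of candidates, per-question correlations are very noisy, so a ranking like this is more trustworthy after several months of data.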

4. Candidate NPS

Benchmarks: Below 0 = serious problems. 0-30 = average. 30-50 = good. Above 50 = excellent.
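For reference, a small sketch of the NPS calculation, assuming the standard definition: candidates rate the experience from 0 to 10, promoters score 9-10, detractors score 0-6, and NPS is the percentage of promoters minus the percentage of detractors. The band boundaries follow the benchmarks above, with exactly 30 counted as good and exactly 50 as excellent (an assumption taken from the "30+" and "50+" targets).

```python
def nps(ratings):
    """Net Promoter Score from 0-10 ratings, on a -100..100 scale."""
    if not ratings:
        raise ValueError("need at least one rating")
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return 100.0 * (promoters - detractors) / len(ratings)


def nps_band(score: float) -> str:
    """Map an NPS value to the benchmark bands."""
    if score < 0:
        return "serious problems"
    if score < 30:
        return "average"
    if score < 50:
        return "good"
    return "excellent"


if __name__ == "__main__":
    survey = [10, 9, 9, 8, 7, 6, 10, 3, 9, 8]  # hypothetical responses
    score = nps(survey)
    print(score, nps_band(score))  # 5 promoters, 2 detractors of 10 -> 30.0 good
```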

Common complaints: screen length, question relevance, voice quality, lack of clarity about next steps. A 2024 Gartner survey found 67% of candidates were comfortable with AI assessments given clear process transparency.

5. Adverse Impact Ratio

Standard: A widely used benchmark is the four-fifths rule: the pass rate for any demographic group should be at least 80% of the rate for the group with the highest pass rate. Many organizations also run annual bias audits as part of their fair-hiring program.
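The four-fifths check can be sketched as follows. Group labels and counts are hypothetical, and the function names are illustrative. The ratio is a screening heuristic rather than a statistical test, so small groups can trip or pass the threshold by chance.

```python
def adverse_impact_ratios(counts):
    """Return {group: selection rate / highest group's selection rate}.

    counts: {group: (passed, screened)}
    """
    rates = {}
    for group, (passed, screened) in counts.items():
        if screened <= 0:
            raise ValueError(f"group {group!r} has no screened candidates")
        rates[group] = passed / screened
    top_rate = max(rates.values())
    if top_rate == 0:
        raise ValueError("no group has any passing candidates")
    return {group: rate / top_rate for group, rate in rates.items()}


def groups_below_four_fifths(counts, threshold=0.8):
    """Groups whose impact ratio falls below the four-fifths threshold."""
    ratios = adverse_impact_ratios(counts)
    return sorted(group for group, ratio in ratios.items() if ratio < threshold)


if __name__ == "__main__":
    counts = {
        "group_a": (50, 100),  # 50% pass rate (highest)
        "group_b": (30, 100),  # 30% pass rate -> ratio 0.6
        "group_c": (45, 100),  # 45% pass rate -> ratio 0.9
    }
    print(groups_below_four_fifths(counts))  # ['group_b']
```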

How to improve: Identify which questions/criteria drive disparity. Revise rubrics. Add alternative assessment paths. Monitor continuously at high volume.

Quality Measurement Framework

Monthly Review

Completion rate (overall and by channel/source)
Pass-through rate (by role and department)
Candidate satisfaction scores and comment themes
System uptime and technical issues

Quarterly Deep Dive

Predictive validity: score vs. interview outcome correlation
Adverse impact analysis across protected categories
Question-level analysis: most/least predictive questions
Benchmark comparison (industry standards + historical performance)
Recruiter feedback on result quality

Annual Audit

End-to-end predictive validity including post-hire performance data
Formal adverse impact study with statistical rigor
Question set refresh based on accumulated data
Scoring model recalibration
Technology review against current market

Optimization Strategies

Question optimization. After collecting several months of data, rank questions by predictive value. Replace the bottom 20% and reallocate time to higher-signal questions.

Threshold tuning. Use outcome data to find the pass/fail threshold below which the additional candidates advanced rarely succeed in subsequent stages, so that lowering it further yields diminishing returns.
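A sketch of a threshold sweep, under an important assumption: next-stage outcomes are known for candidates across the score range, including some below the current cutoff (for example, because a sample of lower scorers was advanced anyway). The data layout and function name are hypothetical.

```python
def threshold_sweep(candidates, thresholds):
    """Summarize next-stage results at each candidate pass threshold.

    candidates: list of (screen_score, succeeded_next_stage) pairs
    thresholds: score cutoffs to evaluate (a candidate passes if score >= cutoff)
    """
    rows = []
    for cutoff in sorted(thresholds, reverse=True):
        outcomes = [ok for score, ok in candidates if score >= cutoff]
        passed = len(outcomes)
        successes = sum(1 for ok in outcomes if ok)
        rows.append({
            "threshold": cutoff,
            "passed": passed,
            "successes": successes,
            "success_rate": successes / passed if passed else None,
        })
    return rows


if __name__ == "__main__":
    history = [(90, True), (85, True), (80, False), (70, True),
               (65, False), (60, False), (50, False)]
    for row in threshold_sweep(history, thresholds=[80, 70, 60]):
        print(row)
```

Reading the rows from the highest cutoff down shows how many extra successes each lower cutoff buys relative to the extra interviews it creates.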

Rubric refinement. Scoring rubrics are hypotheses. If "strong" scorers on a question don't outperform "acceptable" scorers in interviews, the rubric needs adjustment.

Feedback loops. Set up structured channels for recruiter feedback ("Was the summary accurate? Did the score match your assessment?") and hiring manager feedback ("Did AI-screened candidates meet expectations?").

Common Pitfalls

Measuring only speed. Reducing time-to-screen is a process metric, not a quality metric.

Ignoring candidate feedback. Negative experiences damage employer brand and reduce completion rates over time.

Set and forget. Question relevance degrades as roles evolve and candidate pools change. Continuous optimization is not optional.

Optimizing a single metric. Maximizing pass-through at the expense of validity (or vice versa) creates hidden problems. Balance all five metrics.

Frequently Asked Questions

How long to measure predictive validity?

At least one full hiring cycle, typically 3-6 months, and at least 50-100 completed screens with known outcomes before a correlation is statistically meaningful. Start tracking from day one so data is available when needed.
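To see why sample size matters, here is a sketch of an approximate 95% confidence interval for a correlation using the Fisher z-transformation. This is a standard textbook approximation, not a method described above; it assumes roughly bivariate-normal data, so with a 0/1 outcome it should be read as a rough guide only.

```python
import math


def pearson_ci(r: float, n: int, z_crit: float = 1.96):
    """Approximate confidence interval for a Pearson correlation.

    r: observed correlation, strictly between -1 and 1
    n: number of candidates with both a score and an outcome (must exceed 3)
    z_crit: normal critical value (1.96 for roughly 95% coverage)
    """
    if n <= 3:
        raise ValueError("need more than 3 observations")
    if not -1.0 < r < 1.0:
        raise ValueError("r must be strictly between -1 and 1")
    z = math.atanh(r)                # Fisher z-transform
    se = 1.0 / math.sqrt(n - 3)      # standard error in z-space
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


if __name__ == "__main__":
    for n in (50, 100, 400):
        low, high = pearson_ci(0.30, n)
        print(f"n={n}: r=0.30, approx. 95% CI ({low:.2f}, {high:.2f})")
```

With r = 0.30 and n = 50, the interval runs from roughly 0.02 to 0.53, which is why 50 screens is a floor rather than a comfortable sample.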

What is a good predictive validity score?

r = 0.20-0.35 is meaningful for any single assessment method. For comparison, Schmidt & Hunter (1998) report r = 0.51 for structured interviews. AI phone screening should target r = 0.20-0.40 for score vs. next-stage advancement.

How to measure adverse impact without demographic data?

Options: voluntary demographic collection through the ATS, proxy analysis, or third-party demographic estimation tools. Work with legal/compliance teams before choosing one. At minimum, monitor pass-through rates across measurable dimensions (geography, application source).

Should all questions be weighted equally?

Start with equal weighting, then adjust based on predictive validity data, giving more predictive questions higher weights. Some organizations use ML to optimize weights, but this adds complexity and requires careful bias monitoring.
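One simple, non-ML way to move from equal weights to validity-based weights is sketched below: make each question's weight proportional to its measured correlation with outcomes, floor negative correlations at zero, and fall back to equal weights when nothing is predictive. This scheme and the names used are assumptions for illustration, not a prescribed method.

```python
def validity_weights(validities):
    """Turn per-question validity (correlation) into weights that sum to 1.

    validities: {question_id: correlation with the outcome}
    Negative correlations are floored at zero (an assumption of this sketch).
    """
    if not validities:
        raise ValueError("need at least one question")
    clipped = {qid: max(r, 0.0) for qid, r in validities.items()}
    total = sum(clipped.values())
    if total == 0:
        # Nothing is predictive yet: keep equal weighting.
        return {qid: 1.0 / len(clipped) for qid in clipped}
    return {qid: value / total for qid, value in clipped.items()}


def weighted_score(answers, weights):
    """Combine per-question scores into one screen score."""
    return sum(weights[qid] * answers[qid] for qid in weights)


if __name__ == "__main__":
    weights = validity_weights({"q1": 0.30, "q2": 0.10, "q3": -0.05})
    print(weights)
    print(weighted_score({"q1": 4, "q2": 2, "q3": 5}, weights))
```

Any reweighting changes who passes, so adverse impact should be rechecked after weights are updated.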

How to benchmark against other companies?

Industry benchmarks are still emerging. Your AI vendor may provide anonymized cross-customer data. Industry analyst reports publish ranges. The most useful comparison: your own quarter-over-quarter improvement.