AI Scores Ki-67 in Breast Cancer with Near-Perfect Consistency—But Human Oversight Still Essential

In the quiet hum of a modern pathology lab in Chengdu, a digital revolution is unfolding—not with dramatic fanfare, but with pixel-by-pixel precision. At Sichuan University’s West China Hospital, a team led by pathologist Ji Bao has just completed one of the most rigorous real-world validations to date of an artificial intelligence (AI) system designed to quantify Ki-67, a pivotal biomarker in breast cancer diagnostics. Their findings, published in the Journal of Sichuan University (Medical Science), reveal something both promising and sobering: AI can score Ki-67 with near-perfect internal consistency—yet it still cannot replace the trained eye of a human pathologist.

The implications ripple far beyond this single study. Across oncology, reproducibility in biomarker scoring remains a persistent pain point. Pathologists know firsthand how subjective Ki-67 evaluation can be. One doctor, squinting through the eyepiece at 400× magnification, might estimate 35% proliferating tumor cells. Another, scanning the same slide moments later, might call it 48%. Such discrepancies aren’t negligence—they’re inherent to a process that relies on visual estimation, mental tallying, and clinical intuition built over decades. It’s a system stretched thin by rising caseloads, workforce shortages, and the ever-increasing demand for precision in cancer care.

Enter AI—not as a replacement, but as a co-pilot.

The study by Yang Deng, Fengling Li, Hangyu Qin, Yanyan Zhou, Qiqi Zhou, Juan Mei, Li Li, Honghong Liu, Yizhe Wang, Hong Bu, and Ji Bao tested two distinct AI-assisted approaches on 100 real clinical cases of invasive ductal carcinoma (IDC), the most common form of breast cancer in women worldwide. Both methods used whole-slide images (WSIs) generated from digitized hematoxylin-eosin (HE) and Ki-67 immunohistochemistry (IHC)-stained tissue sections—high-resolution digital replicas of traditional microscope slides. But how the AI interacted with these images—and with the pathologist—differed dramatically.

One system operated fully automatically. Once a slide was scanned, the AI took over completely: first identifying the invasive carcinoma region on the HE image, then aligning—“registering”—that region onto the corresponding Ki-67-stained slide, and finally detecting and classifying each tumor cell nucleus as Ki-67 positive (brown stain) or negative (blue counterstain). No human input was required during analysis. The entire workflow, from upload to final percentage, took 5 to 8 minutes per case.
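For readers who think in code, that workflow reduces to three model-driven stages followed by simple arithmetic. The sketch below is illustrative only: the function names and the pluggable model components are assumptions, since the paper does not describe the authors' actual implementation.

```python
from typing import Callable, Tuple
import numpy as np

def score_ki67_fully_automated(
    he_wsi: np.ndarray,
    ihc_wsi: np.ndarray,
    segment_invasive_region: Callable[[np.ndarray], np.ndarray],
    register_he_to_ihc: Callable[[np.ndarray, np.ndarray, np.ndarray], np.ndarray],
    count_nuclei: Callable[[np.ndarray, np.ndarray], Tuple[int, int]],
) -> float:
    """Skeleton of the fully automated workflow: segment, register, count.

    The three callables stand in for the study's deep-learning components,
    whose architectures are not detailed in this article; all names here
    are hypothetical, not the authors' actual API.
    """
    # 1. Locate the invasive carcinoma region on the HE whole-slide image.
    tumor_mask_he = segment_invasive_region(he_wsi)

    # 2. Map ("register") that region onto the Ki-67 IHC slide.
    tumor_mask_ihc = register_he_to_ihc(he_wsi, ihc_wsi, tumor_mask_he)

    # 3. Classify tumor nuclei in the region as Ki-67 positive (brown)
    #    or negative (blue); returns (positive_count, total_count).
    positive, total = count_nuclei(ihc_wsi, tumor_mask_ihc)

    # 4. Report the Ki-67 index as a percentage.
    return 100.0 * positive / total if total else 0.0
```

The point of the sketch is the ordering: segmentation and registration must succeed before any nucleus is counted, which is why misregistration (discussed later) can distort the final percentage.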

The second system was semi-automated—a hybrid model closer to how pathologists already work. Here, the physician manually selected 5 to 10 representative high-power fields (HPFs) under the microscope, much like they would in routine practice. But instead of counting cells by hand or mentally estimating, they triggered a smart microscope—the ARM-50 from Sunny Optical—to perform the counting in each selected field in under 20 seconds. The device then averaged the results to generate a final Ki-67 index. Total time per case: just 2 to 3 minutes.
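The per-case arithmetic behind that final index is straightforward: compute the positive fraction in each selected field, then average across fields. A minimal sketch, assuming a simple (positive, total) count per field; the data structure and names are illustrative, not the ARM-50's actual interface:

```python
def ki67_index_from_fields(field_counts):
    """Average the per-field Ki-67 indices over the selected HPFs.

    `field_counts` is a list of (positive_nuclei, total_tumor_nuclei)
    pairs, one per manually chosen high-power field. Structure and names
    are assumptions for illustration.
    """
    per_field = [100.0 * pos / total for pos, total in field_counts if total > 0]
    return sum(per_field) / len(per_field)

# Hypothetical counts from five selected fields:
fields = [(132, 400), (98, 350), (210, 520), (75, 300), (160, 450)]
print(f"Ki-67 index: {ki67_index_from_fields(fields):.1f}%")
```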

For comparison, the team used the original diagnostic reports from West China Hospital’s pathology department as the “manual” baseline—reflecting how Ki-67 is scored in real clinical settings today.

The results were revealing—and nuanced.

When the two AI methods were pitted against each other, they were astonishingly aligned. Every single case—100 out of 100—showed a difference of ≤10 percentage points. In statistical terms, their intra-class correlation coefficient (ICC), a gold-standard measure of agreement for continuous data, hit 0.992. For context: an ICC above 0.75 is considered “excellent” reproducibility; 0.992 is clinical metrology territory—the kind of consistency you’d expect from a calibrated lab instrument, not a biological interpretation tool. This near-perfect harmony underscores a fundamental strength of AI: standardization. Unlike human observers, algorithms apply identical logic across every pixel, every nucleus, every slide. They don’t get tired, distracted, or biased by recent cases. They don’t vary their counting thresholds mid-session. This repeatability isn’t just convenient—it’s foundational for reliable longitudinal monitoring and multi-center clinical trials, where scoring drift can obscure real treatment effects.
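For readers who want the measure itself, the ICC is derived from a two-way ANOVA decomposition of the case-by-method score matrix. The study does not state which ICC form it used, so the sketch below assumes the common ICC(2,1) variant (two-way random effects, absolute agreement, single rater):

```python
import numpy as np

def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is an (n_cases, k_methods) array, e.g. 100 Ki-67 percentages
    scored by two methods. The specific ICC form is an assumption for
    illustration; the paper reports the values but not the variant.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)   # per-case means
    col_means = x.mean(axis=0)   # per-method means

    ss_rows = k * ((row_means - grand) ** 2).sum()
    ss_cols = n * ((col_means - grand) ** 2).sum()
    ss_total = ((x - grand) ** 2).sum()
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )
```

Intuitively, the coefficient approaches 1 when between-case variation dwarfs both rater (method) effects and residual noise, which is exactly what a 0.992 value implies here.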

But the more clinically crucial comparisons were between AI and human scoring.

Here, the picture softened—but didn't blur. Against manual assessment, the fully automated AI agreed within 10 percentage points in 78% of cases (78/100). In another 17%, the gap was moderate (11–29 points). Only 5 cases—5%—showed major discrepancies (≥30 points). The semi-automated AI performed slightly less consistently by this strict metric: 60% agreement within 10 points, 37% in the moderate range, and just 3% with large gaps. Their ICCs—0.720 and 0.724, respectively—fall just below the 0.75 "excellent" benchmark, but well above the 0.40 threshold for "poor" agreement. In practical terms? Clinically acceptable—and notably more stable than inter-observer human variation, which previous studies have shown can exceed 30 points in up to a quarter of cases.
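To make the banding concrete, here is one way to tally paired scores into those three discrepancy bands. The thresholds follow the paper's reporting; the input sequences would be the 100 paired Ki-67 percentages, which are not reproduced in this article:

```python
def discrepancy_bands(ai_scores, manual_scores):
    """Tally absolute AI-vs-manual differences into three bands.

    Bands mirror the study's reporting: <=10 points, 11-29 points,
    >=30 points. Inputs are paired Ki-67 percentages for the same cases.
    """
    bands = {"<=10": 0, "11-29": 0, ">=30": 0}
    for ai, manual in zip(ai_scores, manual_scores):
        diff = abs(ai - manual)
        if diff <= 10:
            bands["<=10"] += 1
        elif diff < 30:
            bands["11-29"] += 1
        else:
            bands[">=30"] += 1
    return bands
```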

So what explains the remaining gaps? The authors are refreshingly candid: it’s less about AI failing, and more about the inherent ambiguity of the current manual standard itself.

First, Ki-67 scoring in routine pathology is rarely exhaustive. It's typically an estimate, not a census. Pathologists scan several fields, anchor on visual patterns—dense clusters of brown nuclei, sparse negatives—and make a global judgment. This is efficient and often accurate, but introduces natural variability. When AI delivers a precise 42.7% and a pathologist reports "approximately 45%," is that a 2.3-point error—or just two valid expressions of the same biological reality? The study's small number of large discrepancies (3–5 cases) likely reflects edge cases where estimation diverged sharply from ground-truth enumeration—not systemic failure.

Second, and more profoundly, there remains no universal protocol for Ki-67 assessment. Do you count at low power (×100) for broad sampling or at high power (×400) for cellular detail? Do you average five fields or ten? Do you report the mean or the maximum proliferative hotspot? Guidelines (like those from the International Ki-67 in Breast Cancer Working Group) offer recommendations—but leave room for institutional and individual interpretation. In this study, the "manual" baseline wasn't a single expert re-reviewing all 100 slides under strict protocol; it was archival reports generated under real-world clinical conditions, where such variations inevitably accumulate. Thus, AI isn't disagreeing with truth—it's disagreeing with convention, and sometimes highlighting where that convention lacks rigor.

Third, AI, for all its consistency, isn’t infallible in edge scenarios. Misregistration between HE and Ki-67 slides—especially if tissue sections aren’t perfectly aligned during processing—can misplace the tumor region. Overlapping nuclei, weak or heterogeneous staining, or inflammation mimicking tumor proliferation can still challenge even advanced models. The authors openly acknowledge this: “the accuracy [of AI] is still sometimes inferior to that of highly experienced pathologists.” The goal, then, isn’t autonomy—it’s augmentation.

Which brings us to the most pragmatic insight from this work: choice of AI workflow should match clinical need.

The fully automated system shines in high-volume screening or retrospective studies, where minimizing hands-on time is paramount. A pathologist uploads a batch of slides at the end of the day; by morning, AI has generated preliminary scores for review. The physician’s role shifts from primary counter to final validator—a more sustainable, higher-value use of expertise.

The semi-automated approach, however, resonates more deeply with diagnostic culture. It preserves the pathologist's agency: they decide where the tumor is most representative, they exclude crush and other artifacts, they contextualize the count within architectural patterns visible only at the microscope. The AI simply eliminates the tedious, error-prone step of manual tallying. It's not outsourcing judgment—it's outsourcing arithmetic. And at 2–3 minutes per case (versus 5–8 for full automation), it's also faster in this implementation, likely because selective field analysis requires less computational heavy lifting than whole-slide segmentation.

Crucially, neither method threatens the pathologist’s ultimate authority. As the authors stress: “AI cannot—and must not—fully replace pathologists.” Instead, it relieves them of the “mechanical and repetitive” burdens that contribute to diagnostic fatigue and burnout, freeing cognitive bandwidth for the irreplaceable tasks: integrating Ki-67 with ER, PR, HER2, histologic grade, and clinical history; recognizing unusual subtypes; spotting mimics; and communicating nuanced risk to oncologists and patients.

This isn’t theoretical. Consider the stakes of Ki-67 in breast cancer. It’s not just a number—it’s a compass. High Ki-67 (>20–30%, depending on context) helps distinguish Luminal B from Luminal A hormone receptor–positive cancers, guiding decisions between endocrine therapy alone versus adding chemotherapy. In triple-negative disease, elevated Ki-67 may signal better response to neoadjuvant chemo—and eligibility for clinical trials. Misclassification can mean undertreatment—or unnecessary toxicity. Consistency, therefore, isn’t academic; it’s therapeutic.

Yet adoption faces hurdles beyond algorithmic performance. Integrating AI into clinical workflows demands more than software—it requires rethinking lab logistics (slide scanning throughput, storage, IT infrastructure), retraining staff, and, most delicately, managing expectations. Clinicians may overtrust AI’s “precision” or dismiss it as a black box. Regulators must ensure validation extends beyond ideal research datasets to messy, real-world variability in staining quality and tissue processing. Reimbursement models need to recognize the value of this augmented diagnostic labor.

The West China Hospital team is already looking ahead. Their next steps? Scaling to larger cohorts, incorporating treatment response and long-term survival data to see whether AI-derived Ki-67 better predicts outcomes than manual scores, and exploring hybrid scoring strategies—like combining hotspot and average measurements algorithmically. Ultimately, they aim to contribute quantitative rigor to the ongoing global effort to standardize Ki-67 assessment—not by imposing a rigid rule, but by providing data-driven evidence for what works best.
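As a purely hypothetical illustration of what combining hotspot and average measurements algorithmically might look like, one simple strategy is a weighted blend of the maximum-field and mean-field readings. Nothing below comes from the paper; the formulation and the hotspot_weight parameter are invented to show the idea:

```python
def hybrid_ki67(field_percentages, hotspot_weight=0.5):
    """Blend the hotspot (maximum-field) and average Ki-67 readings.

    The study only says such hybrid strategies will be explored; this
    weighted-average scheme and the default weight are illustrative
    assumptions, not the authors' method.
    """
    average = sum(field_percentages) / len(field_percentages)
    hotspot = max(field_percentages)
    return hotspot_weight * hotspot + (1.0 - hotspot_weight) * average
```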

What makes this study stand out isn’t just its technical rigor—it’s its humility. In an era of AI hype, where “autonomous diagnosis” headlines are common, this group refuses to overclaim. They show AI at its most useful: not as a silver bullet, but as a scalpel—sharp, precise, and wielded best by a skilled hand. The future of pathology won’t be humans or machines. It will be humans, empowered—consistently, reproducibly, sustainably—by machines.

And in that future, a pathologist in a rural clinic, armed with a smart microscope and cloud-connected AI, might deliver Ki-67 assessments just as reliable as those from a top-tier academic center. That’s not disruption. That’s democratization. That’s progress.


Yang Deng¹, Fengling Li¹, Hangyu Qin¹, Yanyan Zhou¹, Qiqi Zhou¹, Juan Mei¹, Li Li¹, Honghong Liu¹, Yizhe Wang², Hong Bu¹, Ji Bao¹
¹Institute of Clinical Pathology, West China Hospital, Sichuan University, Chengdu 610041, China
²Chengdu Knowledge Vision Science and Technology Co. Ltd., Chengdu 610041, China
Journal of Sichuan University (Medical Science)
DOI: 10.12182/20210460202