Artificial Intelligence and the Challenge of False Positives in Lung CT Screening
In the rapidly evolving landscape of medical imaging, artificial intelligence (AI) has emerged as a transformative force, particularly in the detection of pulmonary nodules through computed tomography (CT) scans. Designed to enhance diagnostic accuracy and streamline radiological workflows, AI systems powered by deep learning algorithms have demonstrated impressive sensitivity—often exceeding 95%—in identifying potential lung lesions. However, despite these advancements, a persistent and significant challenge remains: the high rate of false positive pulmonary nodules (FPPNs). These erroneous detections not only burden radiologists with unnecessary follow-up assessments but also risk causing undue anxiety for patients and contributing to inefficient use of healthcare resources.
A recent study conducted by ZUO Lingzi and HUANG Yan from the Department of Radiology at Shenyang Dazhong Hospital sheds new light on this critical issue. Published in Chinese Medical Devices, their research investigates the distribution, characteristics, and underlying causes of false positive nodules mistakenly flagged by AI during routine lung CT screenings. By analyzing a cohort of 500 asymptomatic individuals undergoing preventive health exams, the team aimed to uncover patterns that could inform both clinical practice and future algorithmic improvements.
The study utilized a commercially available AI software developed by Yitu Technology, a leading Chinese artificial intelligence company known for its medical imaging solutions. All CT scans were acquired using a GE Optima CT670 128-slice spiral scanner with standardized protocols, including reconstruction at 1.25 mm slice thickness and dual-window settings for lung and soft tissue evaluation. After automated nodule detection, two experienced radiologists independently reviewed all flagged regions, reaching consensus on whether each finding represented a true or false positive. This human-in-the-loop validation ensured clinical relevance and minimized subjective bias in the final classification.
The results revealed a substantial burden of false positives. Of the 1,518 nodules detected by the AI system, only 740 were confirmed as true positives, leaving 778 classified as false alarms, an average of 1.6 false nodules per scan. This translates to a false positive rate of approximately 51.2%, underscoring the gap between high sensitivity and clinical precision. Existing literature has reported even higher figures, ranging from 4 to 22 false positives per scan for various deep learning models, and the consistency of elevated error rates across different populations suggests systemic limitations in current AI architectures.
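For readers who want to reproduce the arithmetic, the headline figures follow directly from the reported counts. The short Python snippet below simply recomputes them; the counts come from the study, while the variable names are ours.

```python
# Counts reported in the study
total_detections = 1518   # nodules flagged by the AI across 500 scans
true_positives = 740      # confirmed by radiologist consensus
scans = 500

false_positives = total_detections - true_positives   # 778
fp_fraction = false_positives / total_detections      # ~0.5125, reported as ~51.2%
fp_per_scan = false_positives / scans                 # ~1.56, reported as 1.6

print(f"false positives: {false_positives}")
print(f"false positive fraction: {fp_fraction:.2%}")
print(f"false positives per scan: {fp_per_scan:.2f}")
```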
One of the most striking findings was the disproportionate prevalence of sub-centimeter nodules among false detections. Of the 778 false positives, 534 (68.6%) measured less than 5 mm in diameter. These tiny structures, often indistinguishable from benign anatomical features on axial slices alone, pose a particular challenge for AI systems that rely heavily on local texture and intensity patterns without full contextual integration. The researchers noted that many of these micro-nodules corresponded to normal pulmonary vasculature, linear fibrotic strands, or partial volume effects at vessel bifurcations—structures that mimic the rounded morphology expected of early-stage tumors.
This observation has important implications for screening protocols. If AI systems continue to flag large numbers of sub-5 mm findings, the downstream impact on radiology departments could be significant. Follow-up imaging, additional patient consultations, and prolonged monitoring schedules increase costs and resource utilization. Moreover, the psychological toll on patients who receive an initial “abnormal” result—only to later learn it was a false alarm—cannot be overlooked. The authors suggest that refining size-based thresholds within AI models might reduce this burden. For instance, suppressing alerts for nodules below a certain size unless accompanied by high-risk morphological features could improve specificity without sacrificing sensitivity for clinically meaningful lesions.
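As a rough illustration of what such a size-based gate might look like in post-processing, consider the sketch below. The 5 mm cutoff echoes the study's sub-5 mm finding, but the `Nodule` fields and high-risk feature flags are hypothetical placeholders, not part of the evaluated software.

```python
from dataclasses import dataclass

@dataclass
class Nodule:
    diameter_mm: float
    spiculated: bool = False    # hypothetical high-risk morphology flag
    part_solid: bool = False

def should_alert(n: Nodule, size_cutoff_mm: float = 5.0) -> bool:
    """Suppress sub-cutoff detections unless a high-risk feature is present."""
    if n.diameter_mm >= size_cutoff_mm:
        return True
    # Below the cutoff, only escalate when morphology raises concern.
    return n.spiculated or n.part_solid

# Example: a bland 3 mm focus is suppressed; a spiculated 4 mm one is not.
print(should_alert(Nodule(diameter_mm=3.0)))                   # False
print(should_alert(Nodule(diameter_mm=4.0, spiculated=True)))  # True
```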
Beyond size, the study examined how nodule density influenced false positive rates. Pulmonary nodules are typically categorized as solid, part-solid, or pure ground-glass opacity (GGO), each carrying different implications for malignancy risk. In this analysis, part-solid nodules exhibited the highest proportion of false positives at 69.7% (23 of the 33 AI-detected part-solid nodules were deemed false), significantly higher than solid nodules (48.3%) and pure GGOs (53.3%). This finding is particularly noteworthy because part-solid nodules are clinically significant: they are associated with a higher probability of adenocarcinoma in situ or minimally invasive adenocarcinoma. A high false positive rate in this category is therefore especially problematic, as it may lead to overdiagnosis or unnecessary interventions.
The elevated false positive rate for part-solid nodules likely stems from the inherent complexity of their appearance. Unlike uniformly dense solid nodules or diffusely hazy GGOs, part-solid lesions contain mixed components that can resemble clustered inflammatory changes, atelectasis, or overlapping structures such as bronchiectasis with mucus plugging. The AI model, trained primarily on isolated nodule examples, may struggle to disentangle these overlapping patterns without access to broader anatomical context or multiplanar reconstructions.
Interestingly, the overall number of false positives per scan increased steadily with patient age. Using Spearman rank correlation analysis, the researchers found a strong positive association between age and false positive detection rate (rs = 0.986, P < 0.05). The lowest rate was observed in the 25–34 age group (1.2 false nodules per scan), while the highest occurred in those aged 75 and above (3.0 per scan). This trend mirrors the natural history of pulmonary aging, where cumulative exposure to environmental insults, recurrent infections, and chronic inflammation lead to structural changes such as interstitial fibrosis, pleural thickening, and vascular remodeling—all of which can mimic nodular pathology.
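The age trend lends itself to a quick sanity check with a standard rank correlation. In the sketch below, only the endpoint rates (1.2 and 3.0 false nodules per scan) come from the study; the age-group midpoints and intermediate values are illustrative assumptions, so the computed coefficient will not reproduce the reported rs = 0.986 exactly.

```python
from scipy.stats import spearmanr

# Age-group midpoints and false positives per scan.
# 1.2 (ages 25-34) and 3.0 (ages 75+) are reported by the study;
# the middle values are assumptions sketching a monotone trend.
age_midpoints = [30, 40, 50, 60, 70, 80]
fp_per_scan = [1.2, 1.5, 1.8, 2.2, 2.6, 3.0]

rs, p_value = spearmanr(age_midpoints, fp_per_scan)
print(f"Spearman rs = {rs:.3f}, p = {p_value:.4f}")
# A strictly increasing sequence yields rs = 1.0; the study's rs = 0.986
# is consistent with a near-monotone rise across age groups.
```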
This age-related increase in false positives underscores a key limitation of one-size-fits-all AI models. Current systems are often trained on heterogeneous datasets without sufficient stratification by demographic or physiological variables. As a result, they may not adequately account for the increased background “noise” present in older lungs. Future iterations of AI detection software could benefit from age-adjusted thresholds or adaptive learning frameworks that modulate sensitivity based on patient characteristics.
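A minimal sketch of what an age-adjusted operating point could look like follows, assuming a hypothetical model confidence score in [0, 1]; the baseline and slope are invented for illustration, not tuned values.

```python
def age_adjusted_threshold(age: int, base: float = 0.50,
                           per_decade: float = 0.04) -> float:
    """Raise the confidence required to trigger an alert as age increases,
    compensating for the extra nodule-mimicking "noise" in older lungs.
    All constants are illustrative assumptions."""
    decades_past_30 = max(0, (age - 30) / 10)
    return min(0.90, base + per_decade * decades_past_30)

# A 30-year-old is flagged at confidence >= 0.50; a 75-year-old at >= 0.68.
print(age_adjusted_threshold(30), age_adjusted_threshold(75))
```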
To better understand the root causes of misclassification, ZUO and HUANG performed a detailed etiological analysis of the 778 false positives. They identified 14 distinct anatomical or imaging artifacts responsible for erroneous detections, grouping them into broader categories for clarity. The most common cause was pleural nodularity, accounting for 21.5% of all false positives (167 cases). These lesions arise when focal thickenings of the visceral pleura extend into the lung parenchyma, creating a rounded appearance that AI interprets as an intrapulmonary nodule. Radiologically, these can be distinguished by their broad base of attachment to the pleura and associated pleural thickening—a feature not always captured in single-slice analysis.
Vascular structures were the second most frequent source of confusion, contributing to 28.4% of false alarms. This category included vessel wall thickening (13.8%), bifurcations (12.2%), and curvilinear segments (2.4%). Small-caliber vessels, especially when cut obliquely on axial imaging, can project as circular opacities resembling nodules. Similarly, branching points where two vessels diverge may appear as a central dot surrounded by a halo—a pattern easily mistaken for a part-solid nodule. The AI’s inability to trace vessel continuity across multiple slices limits its capacity to resolve such ambiguities.
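As a simple illustration of the kind of 3D continuity reasoning the authors imply, the sketch below scores a candidate region's shape: a vessel followed across several slices forms an elongated tube, while a genuine nodule is compact. The elongation cutoff and the assumption of a pre-segmented binary mask are ours, offered only as a heuristic.

```python
import numpy as np

def looks_like_vessel(mask: np.ndarray, cutoff: float = 3.0) -> bool:
    """Crude 3D shape test for a candidate region.

    `mask` is a boolean 3D array (an assumed pre-segmented candidate).
    The ratio of the longest to shortest principal axis separates
    tubular vessels from compact nodules; the cutoff is illustrative.
    """
    coords = np.argwhere(mask).astype(float)
    if len(coords) < 4:
        return False  # too few voxels to judge shape
    eigvals = np.sort(np.linalg.eigvalsh(np.cov(coords.T)))[::-1]
    elongation = np.sqrt(eigvals[0] / max(eigvals[-1], 1e-6))
    return elongation > cutoff

# Example: a 20-voxel-long, 2-voxel-wide "tube" reads as a vessel.
tube = np.zeros((24, 8, 8), dtype=bool)
tube[2:22, 3:5, 3:5] = True
print(looks_like_vessel(tube))  # True
```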
Linear fibrotic strands, or cord-like shadows, constituted 17.9% of false positives. These are typically sequelae of prior infections, such as tuberculosis or pneumonia, and appear as thin, linear densities radiating from the pleura. When oriented perpendicularly to the scan plane, they manifest as small, round foci that satisfy the AI’s geometric criteria for a nodule. Likewise, thickened interlobular septa—part of the pulmonary lobular architecture—were misclassified in 9.0% of cases. These form polygonal or ring-like patterns that simulate ground-glass nodules, particularly in areas of early interstitial disease.
Other notable contributors included pleural plaques (5.4%), interlobar pleural thickening (4.5%), patchy consolidative opacities (4.5%), and bronchiectasis with wall thickening (3.1%). Less common but still relevant were tree-in-bud patterns (1.7%), mucoid impaction (1.0%), and mediastinal vascular protrusions (0.5%). Each of these entities shares visual similarities with true nodules in terms of size, density, or border definition, yet differs fundamentally in origin and clinical significance.
The researchers emphasized that certain causes were more likely to produce specific nodule types. For example, pleural nodules, cords, and thickened vessels predominantly led to false solid nodules. In contrast, vascular bifurcations and lobular structures were more frequently misclassified as pure ground-glass opacities. Part-solid false positives, though rare, were primarily associated with confluent inflammatory patterns such as clustered tree-in-bud lesions or aggregated bronchiectatic segments filled with secretions.
These findings highlight a critical gap in current AI training paradigms: insufficient representation of benign mimics. Most deep learning models are trained on datasets enriched with confirmed malignant or suspicious nodules, with relatively fewer examples of complex benign anatomy. As a result, the AI learns to recognize “nodule-like” patterns but lacks robust negative examples that teach it what not to flag. Incorporating a wider array of non-nodular pulmonary structures into training sets—alongside detailed annotations explaining why they should be excluded—could enhance discriminative power.
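One way to act on this, sketched below under our own assumptions, is to oversample annotated benign mimics (vessels, pleural tags, scars) as hard negatives during training, so the network sees abundant examples of what not to flag. The category labels and sampling weights here are illustrative, not taken from the study.

```python
import torch
from torch.utils.data import WeightedRandomSampler

# Illustrative per-sample categories: 0 = easy negative, 1 = hard negative
# (a benign mimic such as a vessel bifurcation or pleural tag), 2 = nodule.
categories = torch.tensor([0, 0, 0, 1, 1, 2, 2, 0, 1, 2])

# Upweight hard negatives so each epoch is rich in benign mimics.
weight_by_category = {0: 1.0, 1: 3.0, 2: 2.0}  # assumed ratios
weights = torch.tensor([weight_by_category[int(c)] for c in categories])

sampler = WeightedRandomSampler(weights, num_samples=len(weights),
                                replacement=True)
# Passing `sampler=sampler` to a DataLoader over the patch dataset yields
# roughly 3x more benign mimics per epoch than easy negatives.
```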
Moreover, the spatial reasoning capabilities of current convolutional neural networks (CNNs) remain limited. While 3D CNNs have shown promise in capturing volumetric context compared to 2D counterparts, their deployment in clinical settings is still nascent. The majority of commercial AI tools operate on 2D slice-by-slice analysis, missing crucial information available through multiplanar reformats or longitudinal tracking. Expanding the use of 3D contextual networks, as suggested in prior research, could allow AI to follow vessel paths, trace pleural attachments, and assess lesion continuity across slices—capabilities essential for reducing false positives.
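To make the 2D-versus-3D distinction concrete, here is a minimal 3D convolutional patch classifier in PyTorch. It is a toy sketch, not the architecture of any commercial product: it takes a small volumetric patch around a candidate, letting the network see adjacent slices, vessel continuity, and pleural attachments, and outputs a nodule-versus-mimic probability.

```python
import torch
import torch.nn as nn

class Patch3DClassifier(nn.Module):
    """Toy 3D CNN: classifies a 32x32x32 CT patch as nodule vs. mimic."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),                       # 32 -> 16
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),                       # 16 -> 8
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),               # global pooling
        )
        self.classifier = nn.Linear(64, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, depth, height, width), an HU-normalized patch
        h = self.features(x).flatten(1)
        return torch.sigmoid(self.classifier(h))  # P(true nodule)

# Example: one random 32^3 patch through the network.
model = Patch3DClassifier()
prob = model(torch.randn(1, 1, 32, 32, 32))
print(prob.shape)  # torch.Size([1, 1])
```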
Another consideration is the role of radiologist-AI collaboration. Rather than viewing AI as a standalone diagnostic tool, the optimal model may be one of synergistic partnership. Radiologists bring contextual knowledge, pattern recognition expertise, and clinical judgment that AI currently lacks. Conversely, AI excels at rapid, consistent scanning of large datasets. When combined, the two can complement each other—AI flags potential areas of interest, and the radiologist applies higher-order reasoning to confirm or dismiss them. However, this workflow only succeeds if the AI output is manageable in volume and quality. An excessive number of false positives overwhelms the human reviewer, negating efficiency gains.
The authors also noted that their study population consisted of healthy individuals without diffuse lung disease, extensive scarring, or multiple pleural abnormalities. This selection criterion was intentional, aimed at isolating baseline false positive rates in a relatively clean cohort. Nevertheless, it introduces a potential bias: in real-world clinical practice, many patients have complex pulmonary histories that exacerbate false detection rates. Previous studies have shown that the presence of interstitial lung disease or post-infectious fibrosis can dramatically increase AI-generated false positives. Thus, while the reported average of 1.6 false nodules per scan provides a useful benchmark, actual performance in sicker populations may be less favorable.
Nonetheless, the insights gained from this work offer actionable pathways for improvement. First, developers should consider implementing size-stratified filtering, where alerts for sub-5 mm findings are suppressed unless accompanied by high-risk features such as spiculation or growth over time. Second, density-specific calibration could help address the disproportionately high false positive rate in part-solid nodules. Third, integrating age as a covariate in risk prediction models may allow for dynamic adjustment of detection thresholds. Finally, expanding training datasets to include diverse examples of pleural lesions, vascular variants, and inflammatory patterns would strengthen the AI’s ability to differentiate true pathology from mimicry.
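Building on the size and age sketches above, density-specific calibration could be as simple as per-class operating thresholds, tuned so that the part-solid channel, the weakest in this study, demands more evidence before alerting. The numbers below are placeholders, not derived values; in practice they would be fit on a validation set.

```python
# Hypothetical per-density operating thresholds; in practice these would
# be tuned on a validation set to equalize false positive rates per class.
DENSITY_THRESHOLDS = {
    "solid": 0.50,
    "ground_glass": 0.55,
    "part_solid": 0.70,  # highest false positive proportion in the study
}

def calibrated_alert(density: str, confidence: float) -> bool:
    """Alert only when model confidence clears the class-specific bar."""
    return confidence >= DENSITY_THRESHOLDS[density]

print(calibrated_alert("part_solid", 0.60))  # False: below the stricter bar
print(calibrated_alert("solid", 0.60))       # True
```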
From a clinical perspective, radiologists can benefit from heightened awareness of the most common sources of AI error. Familiarity with the typical appearances of pleural-based nodules, oblique vessel cuts, and lobular septal thickening enables quicker dismissal of false alarms. Furthermore, adopting a systematic review approach—such as scrolling through multiple adjacent slices or switching between lung and mediastinal windows—can aid in distinguishing real nodules from artifacts.
In conclusion, while AI holds immense promise for revolutionizing lung cancer screening, its current iteration is far from perfect. The study by ZUO Lingzi and HUANG Yan demonstrates that false positive pulmonary nodules are not random errors but follow discernible patterns rooted in anatomical complexity and algorithmic limitations. By understanding these patterns—both in terms of distribution and underlying causes—the medical community can work toward refining AI tools to be more precise, reliable, and clinically useful. As the technology matures, the goal should not be complete automation, but rather intelligent augmentation: AI handling the grunt work of initial detection, while radiologists focus on interpretation, integration, and decision-making. Only through such collaboration can the full potential of AI in pulmonary imaging be realized.
ZUO Lingzi, HUANG Yan, Department of Radiology, Shenyang Dazhong Hospital. Artificial Intelligence and the Challenge of False Positives in Lung CT Screening. Chinese Medical Devices. doi:10.3969/j.issn.1674-1633.2021.10.041