Deep Learning Tool Shows Promise—and Limits—in Assessing Heart Function Across Cardiomyopathies

In a rapidly evolving field where artificial intelligence is increasingly embedded into clinical workflows, a new study highlights both the potential and the pitfalls of using deep learning to evaluate heart function in patients with distinct forms of cardiomyopathy. The research, conducted by a team at Fudan University’s Zhongshan Hospital and the Shanghai Institute of Medical Imaging, demonstrates that while convolutional neural networks (CNNs) can reliably quantify left ventricular function in hypertrophic cardiomyopathy (HCM), their performance falters in the more structurally complex setting of dilated cardiomyopathy (DCM).

The findings, published in Chinese Journal of Clinical Medicine, underscore a critical reality in the deployment of AI in cardiology: algorithmic accuracy is not universal—it is context-dependent. As hospitals and imaging centers rush to adopt AI-driven tools to streamline cardiac magnetic resonance (CMR) analysis, this study serves as a timely reminder that one-size-fits-all automation may not yet be clinically viable across all disease phenotypes.

Led by Jiajun Guo, Hongfei Lu, Jiaqi She, Dong Wu, Mengsu Zeng, and Hang Jin, the research team retrospectively analyzed CMR data from 393 individuals scanned between March 2014 and November 2019. The cohort included 125 patients diagnosed with HCM, 133 with DCM, and 135 healthy controls. All scans were performed on a 1.5T MRI system using standard steady-state free precession (SSFP) cine sequences, which remain the gold standard for assessing cardiac structure and function.

The team compared manual measurements—performed independently by two experienced radiologists—with automated assessments generated by CVI 5.3.4, a commercially available software platform that employs a deep learning–based CNN for left ventricular segmentation and functional quantification. The four key parameters evaluated were end-diastolic volume (EDV), end-systolic volume (ESV), ejection fraction (EF), and stroke volume (SV).
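These four metrics are tightly coupled: SV is the difference between EDV and ESV, and EF is SV expressed as a percentage of EDV. A minimal sketch of the arithmetic (the volumes below are hypothetical, not data from the study):

```python
def lv_function(edv_ml: float, esv_ml: float) -> dict:
    """Derive stroke volume (SV) and ejection fraction (EF) from
    end-diastolic and end-systolic volumes: SV = EDV - ESV, EF = SV / EDV."""
    sv = edv_ml - esv_ml
    ef_pct = 100.0 * sv / edv_ml
    return {"SV_ml": sv, "EF_pct": ef_pct}

# Hypothetical DCM-like volumes: a dilated ventricle with a low EF
print(lv_function(200.0, 150.0))  # SV 50 mL, EF 25 %
```

The same two segmented volumes therefore determine all four reported parameters, which is why segmentation errors propagate directly into EF and SV.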

In the HCM group, the automated system showed remarkable concordance with manual analysis. Differences in all four parameters were not statistically significant, and coefficients of determination (r²) exceeded 0.95 for EDV and ESV and 0.87 for EF and SV. Bland-Altman plots confirmed tight limits of agreement, suggesting that the AI tool could be trusted to deliver consistent, accurate readings in this population.
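Bland-Altman analysis summarizes agreement between two measurement methods as the mean paired difference (bias) plus 95% limits of agreement at ±1.96 standard deviations. A minimal sketch, using made-up EF readings rather than the study's data:

```python
from statistics import mean, stdev

def bland_altman(manual, automated):
    """Return (bias, lower LoA, upper LoA): mean of the paired differences
    (automated - manual) and mean +/- 1.96 * SD of those differences."""
    diffs = [a - m for m, a in zip(manual, automated)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample SD, n - 1 in the denominator
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Hypothetical paired EF readings (%) from manual and automated analysis
manual_ef = [60.0, 62.0, 58.0, 61.0]
auto_ef = [61.0, 61.0, 59.0, 62.0]
print(bland_altman(manual_ef, auto_ef))
```

"Tight limits of agreement" means this ±1.96 SD band is narrow relative to clinically meaningful differences in the parameter.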

The story was markedly different for DCM patients. Here, the algorithm consistently overestimated ESV and underestimated both EF and SV compared with expert manual tracing. The discrepancies were not trivial: mean EF was reported as 21.0% by the AI versus 24.0% by radiologists, a difference that could influence clinical decisions regarding heart failure management or transplant eligibility. The correlations, while still statistically significant, were notably weaker, especially for SV (r² = 0.646) and EF (r² = 0.776), indicating reduced reliability.

Why this disparity? The researchers point to anatomical and physiological differences between the two conditions. HCM is characterized by thickened myocardial walls and a relatively preserved or even hyperdynamic left ventricle. These features provide clear tissue boundaries that CNNs can readily detect. In contrast, DCM involves chamber dilation, wall thinning, and often irregular endocardial contours due to trabeculations or motion artifacts. These factors challenge the segmentation algorithm, particularly at the apex and base of the heart, where signal-to-noise ratios are lower and anatomical landmarks less distinct.

Indeed, error analysis revealed that segmentation failures occurred in 24.8% of DCM cases—more than double the rate in HCM (12.0%) and controls (12.6%). Common issues included misidentification of the endocardial and epicardial borders, omission of certain slices, and incorrect assignment of end-systolic phase. In one illustrative case described in the paper, the algorithm mistakenly traced the gastric wall as part of the epicardial boundary, while failing to detect the true apex due to poor tissue contrast.

Despite these limitations, the study uncovered a counterintuitive insight: while the AI was less accurate in DCM, its output was actually more diagnostically useful for this condition than for HCM. Receiver operating characteristic (ROC) analysis showed that automatically derived EF had an area under the curve (AUC) of 0.932 for detecting DCM, with a sensitivity of 92.31% and specificity of 82.96%. In contrast, the same parameter yielded an AUC of only 0.695 for HCM, with lower specificity (54.07%).
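The AUC can be read as the probability that a randomly chosen patient scores higher than a randomly chosen control, while sensitivity and specificity follow from a particular cutoff. A small illustrative sketch (the scores below are arbitrary, not the study's measurements; for DCM detection one might score on low EF, e.g. score = -EF):

```python
def auc(control_scores, patient_scores):
    """AUC as P(patient score > control score), ties counted as 0.5
    (equivalent to the normalized Mann-Whitney U statistic)."""
    wins = 0.0
    for p in patient_scores:
        for c in control_scores:
            if p > c:
                wins += 1.0
            elif p == c:
                wins += 0.5
    return wins / (len(patient_scores) * len(control_scores))

def sens_spec(control_scores, patient_scores, cutoff):
    """Sensitivity and specificity when score >= cutoff is called 'disease'."""
    sens = sum(s >= cutoff for s in patient_scores) / len(patient_scores)
    spec = sum(s < cutoff for s in control_scores) / len(control_scores)
    return sens, spec

# Arbitrary illustrative scores for three controls and three patients
controls = [0.10, 0.20, 0.30]
patients = [0.25, 0.40, 0.50]
print(auc(controls, patients), sens_spec(controls, patients, 0.25))
```

The reported figures (AUC 0.932, sensitivity 92.31%, specificity 82.96% for DCM) correspond to one such cutoff on automatically derived EF.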

This paradox stems from the nature of the diseases themselves. DCM is defined by systolic dysfunction and chamber enlargement—precisely the metrics the AI measures, even if imperfectly. HCM, however, is primarily diagnosed by wall thickness and patterns of hypertrophy, not by volumetric or ejection parameters alone. Thus, while the AI may accurately compute EF in HCM, that number alone lacks sufficient discriminatory power.

The implications for clinical practice are nuanced. For centers managing large volumes of HCM patients, AI-assisted CMR analysis could significantly reduce radiologist workload without compromising diagnostic fidelity. But for DCM, where therapeutic decisions often hinge on precise EF thresholds, the current generation of algorithms may require manual correction—defeating the purpose of full automation.

The authors emphasize that these findings do not invalidate AI in cardiac imaging but rather call for disease-specific validation and calibration. “The performance of deep learning models is not absolute,” said Hang Jin, the study’s corresponding author. “It must be evaluated within the specific clinical and anatomical contexts in which it will be used.”

This message aligns with growing consensus in the medical AI community: robustness across diverse patient populations is not guaranteed, even with high-performing algorithms. Training data matters. Most existing CNNs for cardiac segmentation were developed using datasets dominated by healthy subjects or common pathologies like ischemic heart disease. Rare or structurally extreme conditions—such as advanced DCM—are often underrepresented, leading to poor generalization.

The study also touches on technical constraints of current cine CMR protocols. Because scans rely on breath-holding, coverage may be incomplete, especially at the cardiac apex. Motion between slices and slight misalignment of cardiac phases can further confuse automated systems that assume temporal and spatial continuity. While human experts can mentally compensate for these imperfections, algorithms treat each slice in isolation, making them vulnerable to cumulative errors.

Looking ahead, the researchers suggest several paths to improvement. One is expanding training datasets to include more diverse cardiomyopathy phenotypes, particularly those with extreme remodeling. Another is integrating anatomical priors—rules derived from cardiac physiology—into the neural network architecture to guide segmentation in ambiguous regions. A third is developing error-detection mechanisms that flag implausible results, such as sudden jumps in ventricular volume between adjacent slices, prompting human review.
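The third suggestion, flagging implausible results for human review, can be sketched as a simple plausibility check on the slice stack; the interface and the 50% threshold here are hypothetical, not from the paper:

```python
def flag_implausible_slices(slice_areas, max_rel_jump=0.5):
    """Return indices of slices whose segmented LV area changes by more
    than max_rel_jump relative to the larger neighbor, marking them as
    candidates for human review. The threshold is a tuning parameter."""
    flagged = []
    for i in range(1, len(slice_areas)):
        prev_a, cur_a = slice_areas[i - 1], slice_areas[i]
        larger = max(prev_a, cur_a)
        if larger > 0 and abs(cur_a - prev_a) / larger > max_rel_jump:
            flagged.append(i)
    return flagged

# A mid-stack collapse from 950 to 200 mm^2 is physiologically implausible
print(flag_implausible_slices([1000.0, 950.0, 200.0, 900.0]))  # [2, 3]
```

A check like this would not fix a bad segmentation, but it could route the failure modes described above, such as omitted slices or mistraced borders, to a radiologist instead of silently propagating them into EF.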

Importantly, the study was conducted using a 1.5T scanner, the most widely available field strength in clinical practice. While higher-field (3T) systems offer better signal-to-noise ratios, they are less common and introduce different artifacts. Thus, the findings are highly relevant to real-world settings.

From a regulatory and implementation standpoint, the work reinforces the need for transparency in AI validation. Vendors often tout high overall accuracy, but clinicians need to know how performance varies across subpopulations. Regulatory bodies like the FDA and EMA are increasingly requiring subgroup analyses in AI submissions—a trend this study strongly supports.

The study's credibility rests on familiar markers: the authors are affiliated with leading academic medical institutions, the work underwent peer review, and the methodology adheres to established standards in cardiovascular imaging. No conflicts of interest were declared, and the limitations, including the retrospective, single-center design, are openly acknowledged.

As AI continues its march into radiology departments, such rigorous, context-aware evaluations will be essential. Automation promises efficiency, but only if it is reliable where it matters most. This study shows that in cardiology, as in life, the devil is in the details—and sometimes, the details are in the dilated ventricle.

Authors: Jiajun Guo¹,², Hongfei Lu², Jiaqi She², Dong Wu¹,², Mengsu Zeng¹,², Hang Jin¹,²
Affiliations: ¹Shanghai Institute of Medical Imaging, Shanghai 200032, China; ²Department of Radiology, Zhongshan Hospital, Fudan University, Shanghai 200032, China
Journal: Chinese Journal of Clinical Medicine, 2021, 28(4): 675–681
DOI: 10.12025/j.issn.1008-6358.2021.20210203