Experts Pinpoint Critical Benchmarks for AI Medical Device Quality Systems

In the rapidly evolving landscape of healthcare technology, the integration of artificial intelligence into medical devices promises unprecedented advancements in diagnostics, treatment planning, and patient care. Yet, as these sophisticated tools move from research labs to clinical settings, a critical question looms large: how can manufacturers ensure these complex, data-driven systems are not only innovative but also consistently safe, reliable, and of the highest quality? A groundbreaking study published in the March 2021 issue of China Medical Devices offers a concrete answer, providing the industry with its first scientifically derived framework of key indicators for building robust quality management systems specifically tailored for AI medical devices.

This research, spearheaded by Liu Yi from Beihang University and Beijing Beiling Special Purpose Vehicle Co., Ltd., alongside collaborators Wang Hao and Li Shu from the National Institutes for Food and Drug Control, Ren Haiping from Sinopharm Group, and senior author Fan Yubo from Beihang University, addresses a glaring gap in the regulatory and manufacturing ecosystem. While traditional medical devices operate under well-established quality standards like YY/T 0287-2017, AI-powered devices introduce unique complexities. Their performance is not static; it evolves with data, algorithms, and real-world deployment, making conventional quality control measures insufficient. The study’s core mission was to bridge this gap, moving beyond generic guidelines to identify the precise levers manufacturers must pull to guarantee product excellence from the design phase through to post-market surveillance.

The methodology employed was as rigorous as the subject matter demanded. The research team didn’t rely on theoretical models or isolated case studies. Instead, they convened a panel of fifteen of China’s foremost authorities in the field of medical AI. This elite group comprised chief scientists, regulatory reviewers from inspection institutes, researchers from national academies, industry association leaders, and senior R&D executives from leading AI medical device companies. This deliberate mix ensured that the findings would be grounded in both cutting-edge scientific understanding and the gritty realities of industrial production and regulatory compliance. The experts’ collective authority was quantified and found to be exceptionally high, with a calculated Cr value of 0.91, far exceeding the 0.70 threshold for a highly credible consultation. This underscores the study’s foundation in deep, practical expertise.
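In Delphi-style consultations of this kind, the authority coefficient (Cr) is conventionally computed as the mean of each expert's judgment-basis coefficient (Ca) and familiarity coefficient (Cs). The paper is not quoted on its exact formula, so the sketch below assumes that convention, with purely illustrative coefficient values:

```python
# Hedged sketch of the conventional Delphi authority coefficient:
# Cr = (Ca + Cs) / 2 per expert, averaged over the panel.
def authority_coefficient(ca: float, cs: float) -> float:
    """Cr for a single expert: mean of judgment basis and familiarity."""
    return (ca + cs) / 2

def panel_cr(ca_values, cs_values):
    """Panel-level Cr: mean of the individual experts' coefficients."""
    per_expert = [authority_coefficient(a, s) for a, s in zip(ca_values, cs_values)]
    return sum(per_expert) / len(per_expert)

# Hypothetical illustrative values (not taken from the paper):
ca = [0.95, 0.90, 0.92]
cs = [0.90, 0.88, 0.91]
print(round(panel_cr(ca, cs), 2))  # a panel is considered highly credible above 0.70
```

Under this convention, the study's reported Cr of 0.91 reflects a panel whose members both rely on strong judgment bases and are highly familiar with the subject.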

The process unfolded over two meticulous rounds of expert consultation. The first round served as a massive filtering operation. Starting with the full set of 60 tertiary indicators outlined in the YY/T 0287-2017 standard, the experts, guided by their knowledge of AI-specific guidelines like the “Deep Learning Decision Support Software Review Points” and the “Good Machine Learning Practice (GMLP) Report,” engaged in a focused “brainstorming” session. Their task was to ruthlessly eliminate any indicators that were irrelevant to the unique nature of AI software. This wasn’t about diluting standards but about precision targeting. The result was a leaner, more relevant framework: 5 primary categories, 12 secondary categories, and a focused set of 36 tertiary indicators deemed essential for AI medical device quality.

The second round was where the real prioritization happened. Experts were presented with the refined list of 36 indicators and asked to score their relative importance using a Likert 5-point scale. This wasn’t a simple vote; it was a weighted assessment of criticality. The data from these 15 completed questionnaires—achieving a perfect 100% response rate—was then subjected to sophisticated statistical analysis using IBM SPSS software. The team didn’t just look at average scores; they examined standard deviation, coefficient of variation (CV), and “full score rate” to understand not just what was important, but where there was consensus. A low CV and a high full score rate indicate that experts weren’t just agreeing an indicator was important, but that they were emphatically united on its paramount status.
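The consensus metrics described above are standard descriptive statistics. A minimal sketch of how they might be computed for one indicator's scores (the function name and the example scores are illustrative, not from the paper; CV here assumes the sample standard deviation divided by the mean):

```python
from statistics import mean, stdev

def indicator_stats(scores, max_score=5):
    """Summary statistics for one indicator's Likert-scale scores:
    mean, sample SD, coefficient of variation, and full-score rate."""
    m = mean(scores)
    sd = stdev(scores)  # sample standard deviation (n-1 denominator)
    return {
        "mean": m,
        "sd": sd,
        "cv": sd / m,  # low CV signals strong consensus among experts
        "full_score_rate": scores.count(max_score) / len(scores),
    }

# Hypothetical panel of 15 Likert scores (illustrative only):
stats = indicator_stats([5, 5, 4, 5, 4, 5, 5, 5, 4, 5, 5, 5, 4, 5, 5])
print({k: round(v, 4) for k, v in stats.items()})
```

An indicator with a low CV and a high full-score rate, as these measures make concrete, is one the panel agrees on emphatically rather than merely on average.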

The findings were both illuminating and decisive. All 36 indicators received respectable average scores ranging from 3.40 to 4.93, with low standard deviations (all under 1.00) and low CV values (all under 25%), indicating broad agreement on their overall relevance. However, three indicators emerged not just as important, but as absolutely critical, commanding near-unanimous, top-tier scores from the expert panel. The undisputed champion was “Design and Development Verification,” scoring an exceptional 4.93. This was followed closely by a tie between “Design and Development Confirmation” and “Design and Development Change Control,” both scoring 4.87. The statistical measures for these top three were remarkable: the CV was a mere 5.23% for the top indicator and 7.23% for the tied pair, and the “full score rate”—the percentage of experts who gave the maximum possible score—was an astounding 93.33% and 86.67% respectively. This level of consensus is rare in expert consultations and speaks volumes about the non-negotiable nature of these processes for AI medical devices.
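These summary figures are internally consistent, which can be checked with simple arithmetic. The raw score distributions below are assumed reconstructions (the paper publishes only the summaries): 14 fives and one 4 reproduce the top indicator's numbers, and 13 fives with two 4s reproduce the tied pair's, when CV is taken as sample standard deviation over mean:

```python
from statistics import mean, stdev

# Assumed reconstructions of the raw panel scores (illustrative; the
# paper reports only the summary statistics, not individual responses).
distributions = {
    "Design and Development Verification": [5] * 14 + [4],
    "tied pair (Confirmation / Change Control)": [5] * 13 + [4] * 2,
}

summaries = {}
for name, scores in distributions.items():
    m = mean(scores)
    cv = stdev(scores) / m                # sample SD divided by the mean
    full = scores.count(5) / len(scores)  # share of experts awarding 5/5
    summaries[name] = (round(m, 2), f"{cv:.2%}", f"{full:.2%}")
    print(name, summaries[name])
# Design and Development Verification (4.93, '5.23%', '93.33%')
# tied pair (Confirmation / Change Control) (4.87, '7.23%', '86.67%')
```

That one dissenting 4 out of fifteen scores is all that separates the top indicator from a perfect consensus underlines how unusual this level of agreement is.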

Why do these three areas command such universal respect? The answer lies in the fundamental nature of AI. Unlike a scalpel or a stethoscope, an AI medical device is a dynamic, learning system. “Design and Development Verification” ensures that the software, at every stage of its creation, meets its meticulously defined technical specifications. For an AI algorithm, this means verifying that the model architecture is sound, that the training process is reproducible, and that the code performs as intended under controlled conditions. It’s about building the machine correctly. “Design and Development Confirmation,” on the other hand, is about ensuring the machine solves the right problem. This involves validating the AI’s performance in a real or simulated clinical environment, proving that it accurately detects diabetic retinopathy, calculates coronary blood flow, or identifies tumors as intended, and does so safely and effectively for the end-user—the clinician and the patient. It’s the ultimate test of clinical utility.

Perhaps the most crucial, and most uniquely AI-centric, is “Design and Development Change Control.” AI models are not set in stone. They may need to be retrained with new data, updated to fix bugs, or modified to adapt to new clinical protocols. A seemingly minor tweak to an algorithm can have profound, unforeseen consequences on its diagnostic accuracy. This indicator mandates a rigorous, documented process for managing every single change. It requires impact assessments, re-verification, and often re-confirmation before any update is deployed. In a world where software updates are routine, this control is the bedrock of patient safety, preventing a “patch” from becoming a peril.

The implications of this research extend far beyond the factory floor. For regulatory bodies like the National Medical Products Administration (NMPA), which established an AI Medical Device Standardization Technical Committee in 2019, this study provides a scientifically validated checklist. It offers concrete guidance on what to scrutinize during pre-market reviews and post-market surveillance, moving inspections from a generic checklist to a targeted evaluation of the most critical risk points. For hospitals and clinicians adopting these technologies, it provides a framework for due diligence. When evaluating an AI diagnostic tool, they can now ask potential vendors not just about accuracy rates, but about their processes for verification, confirmation, and, critically, how they manage software updates. This empowers healthcare providers to be more informed and proactive partners in ensuring patient safety.

For the AI medical device industry itself, which saw its first wave of nine approved products in China by the end of 2020, this research is a roadmap for sustainable growth. Startups and established players alike are often laser-focused on achieving breakthrough performance and securing regulatory approval. This study serves as a powerful reminder that long-term success and market trust are built on a foundation of rigorous, systematized quality management. By focusing their resources on mastering these three key areas—verification, confirmation, and change control—companies can streamline their development processes, reduce the risk of costly recalls or regulatory setbacks, and build products that clinicians can trust implicitly. It shifts the competitive advantage from merely having the smartest algorithm to having the most robust and reliable development and deployment process.

The authors are careful to position their work as a foundational step, not a final destination. They acknowledge that while they have identified the critical pillars, the detailed implementation of these quality systems will vary based on the specific type of AI device—whether it’s an imaging analysis tool, a clinical decision support system, or a natural language processor for medical records. The next phase of research, as hinted at in the paper’s conclusion, involves a deeper dive into the current practices of leading AI medical device companies. This will allow for the refinement and operationalization of these key indicators, transforming them from high-level concepts into actionable, industry-specific protocols and best practices.

This research arrives at a pivotal moment. Globally, regulatory agencies are scrambling to adapt their frameworks for AI. The U.S. FDA’s approval of IDx-DR in 2018 was a watershed moment, and since then, approvals for tools like Imagen’s OsteoDetect and Subtle Medical’s SubtlePET have accelerated. China is rapidly catching up, driven by strong policy support and a vibrant innovation ecosystem. In this race, the focus cannot be solely on speed to market. The true winners will be those who prioritize quality and safety from the very beginning. This study by Liu Yi, Wang Hao, Li Shu, Ren Haiping, and Fan Yubo provides the essential blueprint for achieving that goal. It moves the conversation from theoretical concerns about AI “black boxes” to practical, actionable steps for building trustworthy, high-quality medical technology. In doing so, it doesn’t just advance academic knowledge; it has the potential to directly improve patient outcomes by ensuring that the powerful AI tools entering our hospitals are as reliable as they are revolutionary.

By Liu Yi, Wang Hao, Li Shu, Ren Haiping, Fan Yubo. Published in China Medical Devices, Vol. 36, No. 03, 2021. doi:10.3969/j.issn.1674-1633.2021.03.005