Peking Union Medical College Hospital Unveils First Standardized DR Fundus Image Database for AI Research

In a landmark move aimed at advancing the development and validation of artificial intelligence (AI) systems for diabetic retinopathy (DR) screening, researchers from Peking Union Medical College Hospital have established China’s first standardized, high-quality color fundus photography database dedicated to AI research. The initiative, detailed in a recent publication in Medical Journal of Peking Union Medical College Hospital, introduces a rigorously annotated dataset of 15,000 real-world fundus images, setting a new benchmark for transparency, reproducibility, and clinical relevance in ophthalmic AI.

Diabetic retinopathy remains one of the leading causes of preventable blindness worldwide. In China alone, over 30 million individuals suffer from the condition, yet access to timely screening is severely limited by a shortage of trained retinal specialists. AI-powered diagnostic tools have emerged as promising solutions to bridge this gap, but their clinical adoption hinges on robust, standardized validation against high-quality, representative datasets. Until now, most publicly available DR datasets, such as the Kaggle DR competition dataset and MESSIDOR-2, have faced criticism for inconsistent image quality, limited demographic diversity, and insufficient alignment with real-world clinical workflows.

Addressing these challenges head-on, the team led by Dr. Chen Youxin at the Department of Ophthalmology, Peking Union Medical College Hospital—affiliated with the Chinese Academy of Medical Sciences and home to the Key Laboratory of Ocular Fundus Diseases—has developed a meticulously curated database that reflects the complexities of actual clinical practice. The dataset includes images from 14 regions across eight provinces and municipalities in China, captured using 12 different mainstream fundus camera models, including Canon, Topcon, Zeiss, Kowa, and SUOER devices. This deliberate inclusion of heterogeneous imaging equipment ensures that AI models trained or validated on this database will be more generalizable across diverse healthcare settings.

One of the most distinctive features of this database is its comprehensive representation of the full spectrum of DR severity. It encompasses not only the standard international clinical classifications—from no DR to mild, moderate, severe non-proliferative DR, and proliferative DR—but also includes post-laser treatment cases, images of poor quality deemed ungradable by experts, and cases with coexisting retinal pathologies such as age-related macular degeneration or retinal vein occlusion. This mirrors the messy reality of clinical screening environments, where comorbidities and suboptimal image acquisition are common.

The creation of this database followed an exceptionally stringent protocol spanning data acquisition, de-identification, preprocessing, annotation, quality control, and governance. All images were captured as 40°–55° posterior pole color fundus photographs centered on the midpoint between the optic disc and macula, saved in standard raster formats (JPG/JPEG/BMP/PNG) with a minimum resolution of 30 pixels per degree. This ensures sufficient detail for both human graders and deep learning algorithms to detect subtle microaneurysms, hemorrhages, exudates, and neovascularization.
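The pixels-per-degree criterion translates directly into a minimum diameter for the circular fundus field: a 45° image, for instance, must span at least 45 × 30 = 1350 pixels. A minimal sketch of such a check (the function name and signature are illustrative, not from the paper):

```python
def meets_resolution(circle_diameter_px: int, field_deg: float,
                     min_ppd: float = 30.0) -> bool:
    """Check a fundus image against a pixels-per-degree floor.

    circle_diameter_px: diameter of the circular fundus field in pixels.
    field_deg: angular field of view of the camera (40-55 degrees here).
    min_ppd: required minimum resolution, 30 px/degree per the protocol.
    """
    return circle_diameter_px / field_deg >= min_ppd

# A 45-degree image needs a fundus circle of at least 1350 px across.
assert meets_resolution(1350, 45) is True
assert meets_resolution(1200, 45) is False
```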

Crucially, the team implemented a three-tiered annotation system involving ophthalmologists with escalating levels of expertise. Initial annotations were performed by qualified retinal specialists holding at least a master’s degree in ophthalmology and two years of clinical experience in retinal diseases. These were then reviewed by senior assessors—attending physicians with 3–5 years of subspecialty experience. In cases of disagreement, final arbitration was conducted by retinal experts holding associate chief physician titles or higher, with a minimum of eight years of dedicated practice. Each image was evaluated by an odd-numbered panel (typically three graders), and only those achieving consensus—or resolved through structured discussion and expert arbitration—were included. Images that could not reach agreement were excluded, preserving the integrity of the ground truth labels.

To ensure data privacy and regulatory compliance, all images underwent 100% de-identification using ImageMagick software, stripping embedded metadata and any visual identifiers. Preprocessing was handled via OpenCV and custom Python scripts to standardize image dimensions: each circular fundus field was inscribed within a square canvas, with peripheral areas filled in black. This step minimizes unnecessary computational load during model training without altering the original pixel data or introducing artifacts.
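The square-canvas step can be illustrated with a short NumPy sketch: the shorter axis is padded symmetrically with black so the circular field is inscribed in a square, leaving the original pixels untouched. This is an assumed reconstruction of the described preprocessing, not the team's published script:

```python
import numpy as np

def pad_to_square(img: np.ndarray, fill: int = 0) -> np.ndarray:
    """Pad an H x W (x C) fundus image to a square canvas.

    The shorter axis is padded symmetrically with black pixels so the
    circular fundus field sits inscribed in a square, without rescaling
    or otherwise altering the original pixel data.
    """
    h, w = img.shape[:2]
    side = max(h, w)
    top = (side - h) // 2
    left = (side - w) // 2
    canvas = np.full((side, side) + img.shape[2:], fill, dtype=img.dtype)
    canvas[top:top + h, left:left + w] = img
    return canvas

# A 1000 x 1500 image becomes 1500 x 1500 with black bars top and bottom.
```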

Quality assurance extended beyond annotation accuracy. Every grader’s performance was continuously monitored using Cohen’s Kappa statistics, benchmarked against the median label of the group. Only graders achieving a Kappa value above 0.6, the conventional threshold for substantial agreement, were permitted to continue. Those falling below this threshold underwent retraining and re-evaluation, reinforcing a culture of accountability and precision.
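Cohen’s Kappa corrects raw percent agreement for the agreement expected by chance. A self-contained sketch of the statistic as it might be applied to grader monitoring (the example labels are invented for illustration):

```python
from collections import Counter

def cohens_kappa(rater: list[int], reference: list[int]) -> float:
    """Cohen's kappa between one grader and the panel-median labels.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement rate and p_e is the agreement expected by chance from
    each rater's marginal label frequencies.
    """
    n = len(rater)
    p_o = sum(a == b for a, b in zip(rater, reference)) / n
    ra, rb = Counter(rater), Counter(reference)
    p_e = sum(ra[k] * rb[k] for k in set(ra) | set(rb)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# A grader is retained only if kappa against the group median exceeds 0.6.
rater     = [0, 1, 2, 2, 3, 4, 0, 1, 2, 3]
reference = [0, 1, 2, 2, 3, 4, 0, 1, 1, 3]
assert cohens_kappa(rater, reference) > 0.6
```

Note that 9/10 raw agreement here yields a kappa of about 0.87, lower than the raw rate, because chance agreement is discounted.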

The database underwent both internal and external validation. Internal review was conducted by senior retinal specialists at Peking Union Medical College Hospital. For external validation, a panel of 10 chief physicians from top-tier tertiary hospitals across China independently graded a randomly selected 8–10% sample of the dataset. The resulting inter-rater agreement between the database labels and the external expert consensus yielded a Kappa value of 0.968—classified as “almost perfect agreement”—a testament to the reliability of the annotations.

Perhaps most significantly, the database was designed with flexibility in mind. It can be reconfigured into multiple sub-datasets tailored to specific AI validation scenarios: binary classification (DR vs. no DR), referral-based triage (requiring specialist review vs. not), fine-grained staging (international DR severity levels 0–4), presence of laser scars, coexisting pathologies, or image gradability. This modular architecture allows developers to test their algorithms under conditions that closely mimic real-world deployment—whether in primary care clinics with limited imaging quality or tertiary centers managing complex cases.
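Because each image carries a fine-grained severity grade, the task-specific labels can be derived on the fly. A minimal sketch, assuming the common convention that "referable" means moderate non-proliferative DR or worse (grade ≥ 2), which the article does not specify:

```python
def relabel(grade: int, task: str) -> int:
    """Derive a task-specific label from the fine-grained DR grade (0-4).

    'binary'    : any DR (grades 1-4) vs. no DR (grade 0).
    'referable' : needs specialist review (assumed here to be grade >= 2,
                  i.e. moderate NPDR or worse) vs. non-referable.
    'staging'   : the original five-level international grade, unchanged.
    """
    if task == "binary":
        return int(grade >= 1)
    if task == "referable":
        return int(grade >= 2)
    if task == "staging":
        return grade
    raise ValueError(f"unknown task: {task}")

grades = [0, 1, 2, 3, 4]
assert [relabel(g, "binary") for g in grades] == [0, 1, 1, 1, 1]
assert [relabel(g, "referable") for g in grades] == [0, 0, 1, 1, 1]
```

The same pattern extends to the other sub-dataset axes the article lists (laser scars, coexisting pathologies, gradability), each of which would simply be a separate label column to filter or stratify on.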

In July 2019, the database was publicly released through the Artificial Intelligence Medical Device Innovation Cooperation Platform (AIMD), marking a pivotal shift toward open science in Chinese medical AI. Unlike proprietary datasets used by commercial entities—which remain undisclosed and prevent head-to-head comparisons—this open-access resource enables fair, transparent benchmarking of competing algorithms. To date, it has supported numerous research initiatives and regulatory submissions, including three DR AI systems that have received approval from China’s National Medical Products Administration (NMPA).

The team also established a dynamic update mechanism, committing to add 1,000 new externally validated images annually. This ensures the database evolves alongside clinical practice and technological advancements, maintaining its relevance as a gold-standard reference. Comprehensive documentation governs every stage of the database lifecycle—from data collection and annotation protocols to security policies and version control—adhering to international standards for AI research, including the SPIRIT-AI guidelines.

From a regulatory and ethical standpoint, the project exemplifies best practices in responsible AI development. By grounding the dataset in real-world clinical diversity, enforcing rigorous quality control, and prioritizing transparency, the Peking Union team has addressed key concerns that have historically plagued medical AI: overfitting to idealized data, poor generalizability, and lack of auditability. Their work not only advances DR screening but also provides a replicable blueprint for building standardized databases in other medical imaging domains—from dermatology to radiology.

As AI transitions from research labs to clinical workflows, the need for trustworthy, clinically aligned validation resources becomes ever more critical. The Peking Union DR database represents a significant stride toward that goal. It empowers developers to build more robust systems, enables regulators to conduct more meaningful evaluations, and ultimately promises to accelerate the delivery of equitable, high-quality eye care to millions of diabetic patients across China and beyond.

Looking ahead, the team envisions expanding the database to include multimodal data—such as optical coherence tomography (OCT) scans and longitudinal follow-up images—to support next-generation AI models capable of predicting disease progression and treatment response. They also advocate for international collaboration to harmonize annotation standards and create globally representative datasets, fostering a more inclusive and effective AI ecosystem in ophthalmology.

In an era where data is the foundation of AI, the quality, diversity, and integrity of that data determine the trustworthiness of the resulting tools. With this pioneering effort, Peking Union Medical College Hospital has not only raised the bar for diabetic retinopathy AI research but also reaffirmed the indispensable role of clinician-led, patient-centered innovation in the digital health revolution.

Authors: Weihong Yu, Xiao Zhang, Chan Wu, Huan Chen, Zhikun Yang, Feng He, Zhiqiao Zhang, Bilei Zhang, Di Gong, Yuelin Wang, Jingyuan Yang, Bing Li, Yanyuan Sun, Yajing Ma, Huiqin Lu, Wei Xia, Wei Zhou, Donglei Zhang, Qingmin Pan, Ning Yang, Shuna Wang, Xiaolei Sun, Ying Yu, Chang Su, Bo Wan, Mingqi Wang, Min Wang, Youxin Chen. Affiliations include the Department of Ophthalmology, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China, and multiple collaborating hospitals nationwide. Published in Medical Journal of Peking Union Medical College Hospital, 2021, 12(5): 684–688. DOI: 10.12290/xhyxzz.2021-0613.