Xinhua News Agency Pioneers AI-Driven Journalism in Media Convergence Era
In the high-stakes race to reshape storytelling for the digital age, few institutions carry the weight—and responsibility—of Xinhua News Agency. As China’s national news wire and one of the world’s largest media organizations, Xinhua has long served as both a chronicler and influencer of global narratives. Yet in an era where public attention fractures across platforms and misinformation spreads faster than verified facts, even the most authoritative newsrooms must reinvent themselves—not just to survive, but to lead.
What makes Xinhua’s recent evolution especially compelling is not merely its scale of technological adoption, but the philosophy underpinning it: technology must serve journalism, not replace it. This is not a slogan plastered on internal memos; it is operationalized daily in the agency’s Beijing headquarters through a deliberate, iterative fusion of editorial insight and engineering rigor.
At the heart of this transformation sits Cheng Peng, Director of the Technology R&D Center at Xinhua’s Technical Bureau. Over the past decade, Cheng and his team have quietly built one of the most sophisticated in-house AI infrastructures in global journalism—not by purchasing off-the-shelf solutions, but by co-developing tools with reporters, for reporters. Their work offers a masterclass in how institutional knowledge, editorial judgment, and algorithmic innovation can converge to produce journalism that is faster, deeper, and more resonant—without sacrificing accuracy or authority.
The journey began not with grand declarations, but with a mundane pain point: audio transcription. Reporters returning from field assignments often carried hours of interview recordings—on phones, tape recorders, satellite phones—each requiring laborious manual transcription before editing could begin. In 2016, as part of the “Xinhua All-Media Project,” the Technical Bureau deployed its first integrated speech platform, internally codenamed Yinxun (“Audio Messages”). Rather than relying solely on cloud APIs, the team opted for private deployment of adapted speech models—prioritizing data sovereignty and real-time responsiveness crucial for breaking news.
But the real breakthrough wasn’t the algorithm; it was the Yinxun Box, a compact, 3D-printed hardware-software hybrid device designed to ingest audio from virtually any source—telephone lines, analog radios, live broadcast feeds—and pipe cleaned, timestamped transcripts directly into the newsroom’s editorial system. This seemingly small innovation shaved hours off production cycles during major events like the National People’s Congress sessions or natural disaster coverage. More importantly, it established a working culture: engineers began sitting in news meetings, listening to reporters describe bottlenecks—not as end users, but as collaborators.
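The hand-off the Yinxun Box performs can be sketched in miniature: take recognizer output segments and emit cleaned, timestamped transcript lines for the editorial system. This is an illustrative sketch only; the segment format and field layout are assumptions, not the device's real interface.

```python
def format_timestamp(seconds):
    """Render elapsed seconds as HH:MM:SS."""
    m, s = divmod(int(seconds), 60)
    h, m = divmod(m, 60)
    return f"{h:02d}:{m:02d}:{s:02d}"

def to_transcript(segments):
    """segments: iterable of (start_seconds, text) pairs -> transcript lines."""
    lines = []
    for start, text in segments:
        cleaned = " ".join(text.split())  # collapse stray whitespace from recognition
        if cleaned:
            lines.append(f"[{format_timestamp(start)}] {cleaned}")
    return lines
```

The value of such a stage is less the formatting than the contract: downstream editorial tools can rely on every line carrying a timestamp, whatever the audio source was.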
That mindset shift proved decisive. When the photography desk needed to curate a year-end feature on global sports moments—but struggled to sift through hundreds of thousands of images—the Technical Bureau didn’t just run a generic image search. Instead, they built a multi-modal retrieval engine tuned to journalistic semantics. Editors could filter by dominant color palette (e.g., “find images dominated by gold and red for Olympic victory shots”), facial expression intensity (“triumph,” “defeat,” “tension”), or compositional balance (“rule of thirds,” “centered portrait”). The system learned not from stock photo metadata, but from decades of Xinhua’s own captioned archives—over 23 million images, many with rich, human-written context.
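One of the filters described above, dominant color palette, can be sketched as a simple pixel-bucketing pass. The bucket boundaries and threshold below are invented placeholders, not Xinhua's actual tuning; a production system would work on perceptual color spaces rather than raw RGB cutoffs.

```python
from collections import Counter

def classify_pixel(rgb):
    """Map an (R, G, B) tuple to a coarse palette bucket (illustrative cutoffs)."""
    r, g, b = rgb
    if r > 180 and g > 140 and b < 100:
        return "gold"
    if r > 150 and g < 90 and b < 90:
        return "red"
    return "other"

def dominant_palette(pixels, threshold=0.5):
    """Return the bucket covering more than `threshold` of pixels, else None."""
    counts = Counter(classify_pixel(p) for p in pixels)
    bucket, n = counts.most_common(1)[0]
    return bucket if n / len(pixels) > threshold else None
```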
The result? A visually stunning retrospective published across Xinhua’s digital properties within 48 hours—a feat previously requiring weeks of manual sorting. Crucially, the tool wasn’t shelved after one use. Based on direct feedback, it was expanded to recognize symbolic elements: national flags, sensitive insignia, crowd density. During politically delicate summits, editors could instantly flag images containing unintended visual cues—preventing diplomatic missteps before publication.
Such tight feedback loops define Xinhua’s approach. Consider Jiaozhen (“Truth Check”), an in-house AI proofreading suite. Unlike commercial grammar checkers trained on generic web text, Jiaozhen was fine-tuned on Xinhua’s historical corpus—over seventy years of published news reports, policy documents, and official statements. It doesn’t just catch typos; it catches contextual errors: misattributed policy terminology, incorrect protocol titles, inconsistent geographical references. For instance, it flags if “Taiwan Province” is rendered without the full formal designation, or if a senior leader’s official title omits a required honorific phrase.
Behind Jiaozhen lie six deep-learning models—including BERT and LSTM variants—working in concert. But its real advantage lies in its training data: curated by veteran copy editors who encoded decades of institutional style guidelines into structured correction rules. Third-party evaluations confirm that its precision rivals leading market alternatives and, in domain-specific accuracy, exceeds them. Yet what reporters value most isn't the tech spec sheet: it's that the system understands what "correct" means in the Xinhua newsroom.
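The rule layer sitting alongside those models can be sketched as pattern-plus-message pairs. The rule below is a made-up placeholder (an incomplete protocol title), not an actual Jiaozhen rule; the real rules were encoded by copy editors from the house style guide.

```python
import re

# Each rule pairs a trigger pattern with an editor-facing message.
# The negative lookahead fires when the short form appears without
# the required completion.
STYLE_RULES = [
    (re.compile(r"\bDirector\b(?! General)"),
     "protocol title may be incomplete: expected 'Director General'"),
]

def check(text):
    """Return a list of (message, offset) for every rule violation."""
    issues = []
    for pattern, message in STYLE_RULES:
        for m in pattern.finditer(text):
            issues.append((message, m.start()))
    return issues
```

A design point worth noting: unlike a statistical model, each flag here is traceable to one named rule, which is what lets copy editors audit and refine the rulebook over time.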
Perhaps the most visible sign of this deep integration is in creative storytelling. In 2021, ahead of the UNESCO World Heritage Committee session in Fuzhou, Xinhua’s multimedia team wanted to make ancient sites feel immediate, alive—not distant relics. Partnering with the Technical Bureau, they produced Let’s Go See World Heritage!, an H5 interactive experience that went viral, drawing over 150 million views.
The magic? Using the First Order Motion Model—a computer vision technique that animates still portraits by mapping facial landmarks—the team brought terracotta warriors, poet Li Bai, and explorer Zhang Qian to life. With subtle lip sync and expressive gestures, these figures narrated their own histories, set against photorealistic reconstructions of the Great Wall, Mogao Caves, and Fujian tulou earthen buildings. Users didn’t just scroll; they leaned in. Comments flooded in: “It’s like stepping into a history book that breathes.” “My child asked to watch it three times.”
Critically, the engineers didn’t stop at the demo. Early versions suffered from unnatural distortions when animating full-body statues or low-resolution archival photos. The fix wasn’t a single algorithm tweak, but a pipeline redesign: first, detect and isolate the face; then apply motion mapping only to that region; finally, seamlessly blend the animated segment back into the original image using adaptive masking and super-resolution enhancement. This iterative, product-driven optimization—seven major releases over four months—enabled mass production. To date, the same framework has powered over seven major interactive features, each with unique visual styles but shared technical DNA.
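The redesigned pipeline (isolate the face, animate only that crop, blend back with an adaptive mask) can be sketched end to end on toy data. Everything here is illustrative: images are plain 2D float grids, and the motion model itself is stubbed with a brightness shift, since the real First Order Motion Model is a trained network.

```python
def crop(image, box):
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]

def fake_motion_model(region):
    """Stand-in for the motion model's output on the face crop."""
    return [[min(1.0, v + 0.2) for v in row] for row in region]

def feather_weight(y, x, h, w, margin=1):
    """1.0 in the crop interior, tapering to 0.5 at the border (adaptive mask)."""
    edge = min(y, x, h - 1 - y, w - 1 - x)
    return 1.0 if edge >= margin else 0.5

def blend_back(image, animated, box):
    """Composite the animated crop into the original using feathered weights."""
    top, left, _, _ = box
    h, w = len(animated), len(animated[0])
    out = [row[:] for row in image]
    for y in range(h):
        for x in range(w):
            a = feather_weight(y, x, h, w)
            out[top + y][left + x] = (a * animated[y][x]
                                      + (1 - a) * image[top + y][left + x])
    return out

def animate_face(image, face_box):
    region = crop(image, face_box)
    return blend_back(image, fake_motion_model(region), face_box)
```

The structural point matches the article's fix: motion mapping never touches pixels outside the detected face box, so full-body statues and low-resolution backgrounds stay undistorted.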
Xinhua’s leadership didn’t rest there. Anticipating the rollout of 5G messaging—carrier-native rich communication services that bypass app stores—the Technical Bureau launched China’s first 5G news message in December 2020. Within months, they built the Xinhua 5G Intelligent Editorial Platform, transforming the news production chain. Reporters on assignment can now file text, images, and short video clips directly via 5G message—no app download, no login, just tap-and-send—while AI agents auto-tag locations, flag sensitive content, and even draft headline variants.
The platform doubles as a secure citizen-journalism channel: eyewitnesses can submit multimedia tips with verified phone identity (thanks to carrier-level authentication), and editors triage them alongside staff reports. During flood coverage in Henan province, over 30% of early ground-level footage came via this channel—verified, geotagged, and timestamped within minutes. The system won a national “Blossom Cup” 5G Application Award in 2021, not for technical novelty alone, but for operational impact.
None of this happens in isolation. Xinhua’s strategy recognizes a hard truth in the AI era: compute power and model size matter—but so does data sovereignty and domain specificity. While tech giants train trillion-parameter models on scraped web data, Xinhua is cultivating its own advantage: a uniquely rich, multi-decade, cross-modal archive of professionally reported news events. Photos with verified captions. Broadcast transcripts aligned to video frames. Policy documents with revision histories.
In mid-2021, Xinhua partnered with the Beijing Academy of Artificial Intelligence to adapt the massive “WuDao” language model—not on generic text, but on its own corpus—creating the first pre-trained model specialized for news generation and analysis. Early tests show marked improvements in tasks like automatic caption writing, headline consistency, and even classical Chinese poetry composition for cultural features—blending factual reporting with literary tradition.
More ambitiously, the agency is now undertaking large-scale manual annotation of its archive—not just tagging objects or people, but narrative roles: “protagonist,” “witness,” “authority figure,” “affected civilian.” Such granular semantic layering could one day allow AI to assist in reconstructing event chronologies from fragmented reports or identifying underrepresented perspectives across coverage history.
This human-in-the-loop philosophy extends to crisis response. When the pandemic demanded masked-face recognition—an Achilles’ heel for most commercial systems—Xinhua’s engineers didn’t wait for vendors. They gathered tens of thousands of in-house images of journalists and officials wearing masks, retrained their ArcFace-based recognition model, and integrated it into the newsroom’s photo management system within weeks. Today, the system identifies individuals with 92% accuracy even in partial-obscuration scenarios—critical for tracking officials across press conferences or verifying sources in sensitive investigations.
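The identification step after retraining can be sketched as nearest-neighbor matching over embeddings with cosine similarity, the scheme ArcFace-style models are typically paired with. The vectors, names, and threshold below are toy placeholders; real embeddings come from the retrained network.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def identify(query, gallery, threshold=0.6):
    """Return the best-matching identity, or None if below threshold."""
    best_name, best_score = None, -1.0
    for name, emb in gallery.items():
        s = cosine(query, emb)
        if s > best_score:
            best_name, best_score = name, s
    return best_name if best_score >= threshold else None
```

The threshold is the operational lever: raising it trades recall for precision, which matters when a match is used to verify a source rather than merely suggest one.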
Equally emblematic is the North America Radar—a bespoke public opinion monitoring system built not for marketing, but for news judgment. Commercial tools often drown editors in volume. Xinhua’s version, co-designed with veteran foreign desk editors, weights signals by journalistic relevance: Is the source a verified expert? Does the post correlate with official statements? Has it been amplified by credible institutions? The system surfaced early chatter about supply chain disruptions at U.S. ports in spring 2021—weeks before mainstream coverage—enabling Xinhua to position its reporting ahead of the curve. Editors credit it with boosting overseas tip efficiency by 50%.
Even celebratory projects reflect this discipline. For the 2021 Mid-Autumn Festival, the initial concept was lighthearted: deliver mooncakes to the mythical Moon Rabbit. But when a reporter who covered China’s lunar missions suggested tying it to the Chang’e-5 sample-return mission, the team pivoted. The final product, “Lunar Exploration Squad, Mission Accepted!”, became an interactive 3D game where players guide a rover to collect moon soil—blending festive tradition with cutting-edge science literacy. It drew over 100 million engagements and earned Xinhua’s internal “Best Story” award for the month.
What emerges is not a tech showcase, but a new operating system for journalism: one where algorithms amplify—not automate—human expertise. The Technical Bureau doesn’t hand tools to editors; its engineers sit with them, in the same open-plan newsroom, adjusting parameters in real time as deadlines loom. Every major release undergoes “dogfooding”—the devs use their own tools to file internal memos, proving utility before rollout.
This culture traces back to leadership. Following President Xi Jinping’s 2019 directive on “using mainstream values to steer algorithms,” Xinhua’s leadership mandated that no AI system be deployed without editorial oversight panels—comprising senior reporters, fact-checkers, and ethics officers—reviewing bias, transparency, and societal impact. The result is AI that reinforces journalistic norms, rather than eroding them.
Looking ahead, Cheng Peng and his team are exploring generative AI—cautiously. While large language models can draft press releases or summarize reports, Xinhua insists final output must bear the imprint of human judgment. Their internal guideline is clear: AI may propose three headline options, but the editor chooses—and justifies—the final one. It may suggest related stories, but the reporter decides relevance. It may flag inconsistencies, but the copy editor makes the call.
In an industry anxious about obsolescence, Xinhua offers a different narrative: augmentation over replacement. Its technical infrastructure isn’t a black box; it’s a transparent workshop where journalists and engineers speak each other’s languages—literally and figuratively. That fluency, built over years of shared deadlines and mutual respect, is harder to replicate than any algorithm.
As media convergence deepens globally, Xinhua’s experience suggests a vital lesson: the future of news isn’t about choosing between people and machines. It’s about designing systems where the best of both can thrive—together.
Cheng Peng, Technology R&D Center, Xinhua News Agency Technical Bureau, Beijing 100083, China
China Media Technology, 2021(12):15–18
DOI: 10.19483/j.cnki.11-4653/n.2021.12.003