AI-Powered Book Acquisition Model Boosts Library Efficiency

In a significant stride toward data-driven decision-making in academic libraries, researchers from Shanxi University of Finance and Economics and Taiyuan Daran Science and Technology Co. Ltd have developed an artificial intelligence–based book acquisition mechanism that leverages naive Bayes classification and text segmentation to predict circulation potential with high accuracy. The model, detailed in a recent study published in the Journal of Modern Information, demonstrates how libraries can move beyond subjective selection criteria and embrace quantifiable, objective metrics to optimize their collections.

Traditionally, library acquisition has relied heavily on librarians’ professional judgment, publisher reputation, or reader surveys—methods that, while valuable, are inherently prone to human bias, incomplete data, and shifting demand patterns. The new approach flips this paradigm by using historical circulation data as a proxy for real user behavior, then applying machine learning to uncover hidden patterns in book metadata—specifically titles and publishers—to forecast which new titles are most likely to circulate.

The research team, led by Wang Hong, Wang Yaqin, and Huang Jianguo, focused their analysis on TP18-classified books—a Chinese Library Classification code corresponding to artificial intelligence and related computational topics. This choice was strategic: as AI becomes a cornerstone of academic and industrial innovation, demand for authoritative, accessible resources in this field has surged, making it an ideal testbed for predictive acquisition models.

Using a dataset comprising 249 previously acquired AI-related titles from Taiyuan University of Science and Technology’s library—196 of which had been borrowed at least once—the team constructed a feature-rich corpus based on book titles and publishing houses. Chinese text segmentation via the Jieba algorithm was employed to extract meaningful keywords from titles, while publisher names were retained as discrete categorical features. Author names, despite their intuitive relevance, were excluded due to extreme sparsity in the dataset—most authors appeared only once or twice, offering insufficient statistical signal for reliable modeling.
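The feature-extraction step can be sketched as follows. The study itself uses the Jieba segmenter (`jieba.lcut`); to keep this sketch dependency-free, a toy greedy longest-match tokenizer with a hand-made dictionary stands in for it, and the example titles and publisher are invented for illustration.

```python
# Sketch of the feature-extraction step: segment each book title into
# keywords and keep the publisher name as a single categorical feature.
# TOY_DICT and toy_segment are stand-ins for a real segmenter such as
# jieba.lcut; the vocabulary here is illustrative only.

TOY_DICT = {"深度学习", "神经网络", "机器学习", "人工智能", "入门", "实战"}

def toy_segment(title: str) -> list[str]:
    """Greedy longest-match segmentation against TOY_DICT
    (a simplified stand-in for jieba.lcut)."""
    tokens, i = [], 0
    while i < len(title):
        match = None
        for j in range(len(title), i, -1):  # try longest candidate first
            if title[i:j] in TOY_DICT:
                match = title[i:j]
                break
        if match:
            tokens.append(match)
            i += len(match)
        else:
            i += 1  # skip characters not covered by the dictionary
    return tokens

def extract_features(title: str, publisher: str) -> list[str]:
    # Title keywords plus the publisher retained as one discrete feature.
    return toy_segment(title) + [f"PUB={publisher}"]

features = extract_features("深度学习入门", "人民邮电出版社")
```

In the real pipeline the dictionary-backed Jieba segmenter replaces `toy_segment`, but the output shape is the same: a bag of title keywords with the publisher appended as a single categorical token.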

The resulting dataset was transformed into a document–term matrix with 524 documents (including 275 new, unacquired titles from a 2018 Xinhua Bookstore catalog) and 697 unique terms. Each entry in the matrix indicated the presence or absence of a term in a given book’s metadata—a binary representation well-suited for naive Bayes classification, which assumes feature independence and performs robustly even with high-dimensional, sparse data.
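Building such a binary document-term matrix is straightforward; the following minimal sketch (with invented book data) mirrors the presence/absence representation described above.

```python
# Sketch: build a binary document-term matrix from per-book feature
# lists (title keywords plus publisher token). The three documents
# below are invented for illustration.

docs = [
    ["深度学习", "入门", "PUB=科学出版社"],
    ["机器学习", "实战", "PUB=清华大学出版社"],
    ["神经网络", "深度学习", "PUB=机械工业出版社"],
]

# Stable vocabulary: every unique term across the corpus, sorted so
# that column positions are reproducible.
vocab = sorted({term for doc in docs for term in doc})

def binary_row(doc: list[str], vocab: list[str]) -> list[int]:
    """1 if the term occurs in the document's metadata, else 0."""
    present = set(doc)
    return [1 if term in present else 0 for term in vocab]

dtm = [binary_row(doc, vocab) for doc in docs]
```

In the study the same construction yields a 524 x 697 matrix; sparse storage would normally be used at that scale, but the logic is identical.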

The model was trained on 70% of the historical circulation data and validated on the remaining 30%. Results were striking: the classifier achieved an overall prediction accuracy of approximately 83.8%, corresponding to an error rate of just 16.2%. In practical terms, roughly five out of six of the model's acquire-or-skip judgments matched patrons' actual borrowing behavior, a marked improvement over guesswork or intuition-based selection.
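The end-to-end workflow can be illustrated with a compact from-scratch Bernoulli naive Bayes. The study does not specify its tooling, so this is a hedged sketch on synthetic data: binary feature rows, a 70/30 split, Laplace-smoothed conditionals, and an accuracy computation.

```python
import math
import random

# Minimal Bernoulli naive Bayes with Laplace smoothing, sketching the
# train/validate workflow described above. Rows of X are binary
# presence/absence vectors; y is 1 (circulated) or 0 (never circulated).
# All data here is synthetic and for illustration only.

def train_nb(X, y):
    n = len(y)
    classes = sorted(set(y))
    prior = {c: sum(1 for label in y if label == c) / n for c in classes}
    cond = {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        n_c = len(rows)
        # P(term present | class), Laplace-smoothed: (count + 1) / (n_c + 2)
        cond[c] = [
            (sum(r[j] for r in rows) + 1) / (n_c + 2)
            for j in range(len(X[0]))
        ]
    return prior, cond

def predict(x, prior, cond):
    best, best_lp = None, float("-inf")
    for c, p in prior.items():
        lp = math.log(p)  # log-space to avoid underflow
        for j, xj in enumerate(x):
            pj = cond[c][j]
            lp += math.log(pj if xj else 1 - pj)
        if lp > best_lp:
            best, best_lp = c, lp
    return best

# Synthetic corpus: feature 0 (e.g. "deep learning" in the title)
# fully determines the label under a toy rule.
random.seed(0)
X = [[random.randint(0, 1) for _ in range(4)] for _ in range(100)]
y = [x[0] for x in X]

split = int(0.7 * len(X))  # 70/30 train-validate split
prior, cond = train_nb(X[:split], y[:split])
preds = [predict(x, prior, cond) for x in X[split:]]
acc = sum(p == t for p, t in zip(preds, y[split:])) / len(preds)
```

Because the toy labels depend on a single feature, the sketch recovers them almost perfectly; real circulation data is noisier, which is why the study's 83.8% accuracy is a meaningful result rather than a ceiling.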

More importantly, the system identified 131 out of 275 candidate titles as having high circulation potential. Among these, 109 were estimated to be genuinely high-demand works, while only 22 were likely to remain unused. This level of precision allows acquisition librarians to align purchasing decisions with actual user behavior, reduce waste from underutilized purchases, and maximize the return on limited collection development budgets.
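For readers who want the arithmetic, the hit rate within the recommended set follows directly from these counts (a derived figure, not one stated explicitly in the study):

```latex
\text{precision} = \frac{109}{131} \approx 83.2\%
```

which is consistent with the model's overall 83.8% accuracy on the validation set.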

The implications extend beyond AI-related collections. The methodology is generalizable: any subject category with sufficient historical circulation data can serve as the foundation for a similar predictive model. As academic libraries worldwide grapple with shrinking budgets and rising expectations for relevance and impact, such tools offer a path toward smarter, more accountable stewardship of information resources.

Critically, the model does not replace librarians—it empowers them. By automating the initial screening of thousands of new titles, it frees professionals to focus on nuanced tasks like evaluating scholarly depth, pedagogical fit, or interdisciplinary relevance—dimensions that algorithms cannot yet assess. The system acts as a first-pass filter, highlighting titles with strong behavioral signals so that human expertise can be applied more strategically.

The study also sheds light on the relative influence of different bibliographic features. Publisher emerged as a powerful predictor: titles from established academic presses like Science Press, Tsinghua University Press, and China Machine Press were far more likely to circulate than those from lesser-known imprints. This aligns with longstanding library practices but now has empirical validation. Title keywords, meanwhile, revealed thematic trends—terms like “deep learning,” “neural networks,” and “machine learning” consistently correlated with higher circulation, reflecting the dominant research and teaching interests in the field.

Notably, the researchers avoided using circulation frequency as a metric, opting instead for a binary classification (circulated vs. never circulated). This decision was deliberate: it prevents the model from overemphasizing a few highly popular titles at the expense of moderately used but still valuable works. In academic settings, a book borrowed just once may still be critically important for a specialized course or research project. The binary approach ensures broader intellectual coverage.

The adoption of naive Bayes—a probabilistic classifier rooted in Bayes’ theorem—was both pragmatic and theoretically sound. Despite its “naive” assumption of feature independence (i.e., that the presence of “deep learning” in a title is unrelated to the publisher being “Tsinghua University Press”), the algorithm has proven remarkably effective in text classification tasks. Its simplicity enables fast training, interpretability, and resilience to overfitting—key advantages in resource-constrained library environments.
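In standard form, the classifier scores each class $c$ (circulated or never circulated) by combining the class prior with the per-term likelihoods; this is the textbook naive Bayes decision rule rather than a formula quoted from the paper:

```latex
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
```

where $w_1, \dots, w_n$ are the binary term features drawn from a book's title and publisher, and the independence assumption is exactly what lets the joint likelihood factor into this product.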

Moreover, the team implemented Laplace smoothing to handle zero-probability issues—a common challenge when new terms appear in test data that weren’t seen during training. This technique ensures the model remains robust even when encountering novel phrasing in emerging subfields of AI.
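Add-one (Laplace) smoothing in the Bernoulli setting is conventionally written as follows, where $\mathrm{count}(w_i, c)$ is the number of class-$c$ training books whose metadata contains term $w_i$, $N_c$ is the number of class-$c$ books, and the $+2$ in the denominator reflects the two possible outcomes (present or absent):

```latex
P(w_i \mid c) = \frac{\mathrm{count}(w_i, c) + 1}{N_c + 2}
```

With this estimate, a term never seen in a given class contributes a small but nonzero probability instead of zeroing out the entire product.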

From a policy perspective, the study underscores a broader shift in library science: from reactive collection building to proactive, predictive curation. As digital catalogs expand and print budgets contract, the ability to anticipate demand before purchase becomes not just desirable but essential. This model provides a replicable framework for that transition.

The research also contributes to the growing body of work on AI applications in cultural institutions. While much attention has focused on chatbots, metadata enrichment, or digitization, this study addresses a foundational operational challenge—acquisition—with a lightweight, transparent, and auditable AI solution. Unlike deep learning black boxes, naive Bayes offers traceable decision logic, satisfying institutional needs for accountability and explainability.

Looking ahead, the authors suggest several avenues for refinement. Incorporating abstracts or table-of-contents data could enhance semantic granularity. Integrating temporal trends—such as rising interest in generative AI post-2022—would improve responsiveness to fast-moving fields. And combining this model with user preference data (e.g., course syllabi, faculty research profiles) could create a hybrid system that balances behavioral evidence with institutional mission.

For now, the demonstrated efficacy in the TP18 category offers a compelling proof of concept. Libraries investing in data infrastructure and analytical capacity can replicate this approach across disciplines, transforming acquisition from an art into a science—without losing the human judgment that remains central to scholarly communication.

As higher education faces increasing pressure to demonstrate value and efficiency, such innovations position libraries not as passive repositories but as dynamic, data-informed partners in knowledge creation and dissemination. In an era where every dollar counts and every book must earn its shelf space, predictive acquisition isn’t just smart—it’s necessary.

Wang Hong¹, Wang Yaqin², Huang Jianguo³
¹Library, Shanxi University of Finance and Economics, Taiyuan 030006, China
²School of Information, Shanxi University of Finance and Economics, Taiyuan 030006, China
³Taiyuan Daran Science and Technology Co. Ltd, Taiyuan 030006, China
Journal of Modern Information, DOI: 10.3969/j.issn.1008-0821.2021.09.008