AI-Powered Random Forest Model Achieves 94.8% Accuracy in Chinese Phrase Relation Classification

In a compelling demonstration of the synergy between computational linguistics and artificial intelligence, researchers have developed a novel method to classify grammatical relations in Chinese noun-noun (N1+N2) phrases with remarkable precision. By leveraging a random forest algorithm trained on a custom-built corpus, the approach achieves a classification accuracy of 94.8%—a significant milestone in natural language processing (NLP) tasks involving syntactic ambiguity resolution in Mandarin Chinese.

The study, led by Quan Yang from the College of Chinese Language and Culture at Beijing Normal University, addresses a longstanding bottleneck in Chinese language processing: determining the correct grammatical relationship within ambiguous two-noun constructions. Such structures are common in Mandarin and can manifest as attributive (e.g., “chicken soup”), coordinate (“bread and butter”), appositive (“President Biden”), or subject-predicate phrases (“market boom”). The human brain resolves these relations almost instantly through context and world knowledge, but machines require explicit modeling—especially in a language like Chinese, which lacks inflectional markers and relies heavily on word order and semantics.

The research, published in the Journal of Chongqing University of Technology (Natural Science), introduces a data-driven, linguistically informed framework that combines semantic similarity metrics with ensemble machine learning. Rather than relying on hand-coded linguistic rules—a traditional but brittle approach—the team engineered a feature set derived from the HIT-SCIR Tongyici Cilin Extended Edition, a hierarchical Chinese thesaurus that encodes lexical semantics in a five-level tree structure. Each noun in a candidate phrase was mapped to its Cilin semantic code, and seven features were extracted for classification: an overall word-similarity score, plus comparisons between the two nouns’ codes at each of the five hierarchical levels and at the terminal relationship marker (which indicates synonymy, relatedness, or uniqueness).
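
To make that feature set concrete, here is a minimal, hypothetical Python sketch of how such features could be derived from two Cilin codes. The eight-character code layout (five levels plus a final “=”, “#”, or “@” marker) follows the published Cilin Extended Edition format, but the function names, the toy similarity formula, and the example codes are illustrative assumptions rather than the authors’ implementation.

```python
# Hypothetical sketch (not the authors' code): derive the seven phrase features
# from two Cilin Extended Edition codes such as "Aa01A01=". Each code packs
# five hierarchical levels (1 letter, 1 letter, 2 digits, 1 letter, 2 digits)
# plus a terminal marker ("=", "#", or "@").

def cilin_levels(code: str):
    """Split an eight-character Cilin code into its five level substrings and marker."""
    return [code[0], code[1], code[2:4], code[4], code[5:7]], code[7]

def similarity(code1: str, code2: str) -> float:
    """Toy overall similarity: fraction of levels matched top-down until the
    first mismatch. The paper's exact similarity formula may differ."""
    levels1, _ = cilin_levels(code1)
    levels2, _ = cilin_levels(code2)
    depth = 0
    for a, b in zip(levels1, levels2):
        if a != b:
            break
        depth += 1
    return depth / 5.0

def phrase_features(code1: str, code2: str):
    """Seven features per N1+N2 phrase: overall similarity, five per-level
    match indicators, and agreement of the terminal relationship markers."""
    levels1, marker1 = cilin_levels(code1)
    levels2, marker2 = cilin_levels(code2)
    level_matches = [int(a == b) for a, b in zip(levels1, levels2)]
    return [similarity(code1, code2), *level_matches, int(marker1 == marker2)]

# Example: two nouns sharing the first three Cilin levels
print(phrase_features("Bh07A02=", "Bh07B01#"))   # -> [0.6, 1, 1, 1, 0, 0, 0]
```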

What sets this work apart is not just the feature engineering, but the strategic application of the random forest algorithm. Unlike single decision trees, which can overfit to noise or idiosyncrasies in training data, random forests mitigate variance by aggregating predictions from multiple decorrelated trees. In this implementation, each of the 21 decision trees in the ensemble was trained on a bootstrap sample of the data and a random subset of five features drawn from the full set of seven. The final prediction emerged via majority voting—a robust mechanism that enhances generalization.
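
For reference, the same ensemble configuration can be expressed in a few lines of scikit-learn. This is a hedged sketch rather than the authors’ implementation: scikit-learn draws its feature subset at every split (rather than once per tree) and grows CART rather than C4.5 trees, so it approximates, not reproduces, the published setup.

```python
# Minimal scikit-learn sketch of the ensemble described above: 21 trees,
# bootstrap resampling, 5-of-7 feature subsets, and majority voting.
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=21,   # 21 decision trees in the ensemble
    max_features=5,    # each split considers a random subset of 5 of the 7 features
    bootstrap=True,    # each tree is trained on a bootstrap sample of the data
    random_state=0,
)
# model.fit(X_train, y_train)     # X_train: the 7 Cilin-derived features per phrase
# y_pred = model.predict(X_test)  # final label by majority vote across the 21 trees
```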

The team constructed a meticulously curated corpus of 5,098 unique N1+N2 phrases, sourced from the BCC corpus (covering news, literature, and scientific texts) and manually annotated into four grammatical categories: attributive (95.14%), coordinate (3.30%), appositive (1.26%), and subject-predicate (0.31%). The distribution reflects real-world usage patterns, where attributive constructions dominate. To evaluate performance, the dataset was split 80:20 into training and test sets, yielding 4,078 and 1,020 instances respectively. No duplicates existed between sets, ensuring a rigorous assessment.
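
The split arithmetic maps directly onto a standard train/test partition; the snippet below uses placeholder arrays of the same size and an arbitrary seed, since the article does not specify how the partition was drawn.

```python
# Sketch of the 80:20 partition described above, on placeholder data of the same
# size; the real feature vectors, labels, and random seed are assumptions.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.zeros((5098, 7))               # stand-in for the 7-feature vectors
y = np.array(["attributive"] * 5098)  # stand-in for the four relation labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_train.shape[0], X_test.shape[0])   # -> 4078 1020
```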

The model achieved an overall accuracy of 94.8% on the test set—a figure that stands out in the context of syntactic ambiguity resolution, where even small improvements often require substantial innovation. Detailed breakdowns reveal nuanced performance across categories. The dominant attributive class was identified with 93.73% correctness, though 14 instances were mislabeled as coordinate—a known challenge due to overlapping semantic domains (e.g., “apple pie” vs. “apple and pie”). By contrast, rare classes like appositive and subject-predicate achieved 100% precision: when the model predicted these labels, it was always correct. However, recall for these categories remained modest (30.77% and 33.33%, respectively), indicating that the algorithm often defaulted to the more frequent attributive label when uncertain.
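
The asymmetry between precision and recall for the rare classes is easier to see with a small worked example. The labels below are invented for illustration (they are not the paper’s test data), but they reproduce the reported pattern: 4 of 13 appositive phrases recovered, none predicted incorrectly.

```python
# Toy illustration of 100% precision with ~31% recall for a rare class: the
# model finds only 4 of 13 true appositive phrases, but every appositive
# prediction it makes is correct. Numbers are invented, not the paper's data.
from sklearn.metrics import classification_report

y_true = ["appositive"] * 13 + ["attributive"] * 87
y_pred = ["appositive"] * 4 + ["attributive"] * 9 + ["attributive"] * 87
print(classification_report(y_true, y_pred, zero_division=0))
# appositive: precision 1.00, recall 0.31 (4/13); attributive absorbs the rest
```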

This trade-off is characteristic of imbalanced datasets, and the authors acknowledge it as a key area for future refinement. Strategies could include synthetic data generation, cost-sensitive learning, or integrating external semantic resources like BabelNet or HowNet to enrich feature representation. Nevertheless, the current results underscore a crucial insight: even with extreme class imbalance, a well-constructed machine learning pipeline can deliver high precision for minority classes—vital for downstream applications where false positives carry high costs, such as in legal or medical text processing.
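
Of the strategies listed, cost-sensitive learning is the most direct to sketch: in scikit-learn it amounts to re-weighting classes inversely to their frequency. This is an assumption about one possible refinement, not an experiment reported in the paper.

```python
# Hedged sketch of cost-sensitive learning as a possible refinement (not an
# experiment from the paper): rare classes are up-weighted so that
# misclassifying them costs more during tree growth.
from sklearn.ensemble import RandomForestClassifier

weighted_model = RandomForestClassifier(
    n_estimators=21,
    max_features=5,
    class_weight="balanced",   # weight each class inversely to its frequency
    random_state=0,
)
# weighted_model.fit(X_train, y_train)
```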

From a technological standpoint, the study validates random forests as a pragmatic and powerful tool for structured NLP tasks. Unlike deep learning models that demand massive labeled datasets and extensive computational resources, random forests offer interpretability, training efficiency, and strong baseline performance with modest data. Each decision tree in the forest can be inspected to understand which features—such as Cilin Level 1 (broad semantic category) or word similarity scores—were most decisive in a given prediction. This transparency aligns with growing demands for explainable AI in sensitive domains.
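
That interpretability claim can be made concrete: a fitted scikit-learn forest exposes aggregate feature importances, and any individual tree can be printed as a set of rules. The snippet below trains on synthetic placeholder data purely to show the inspection calls; the feature names follow the hypothetical ordering from the earlier sketch, not the paper.

```python
# Sketch of the interpretability point above: inspect feature importances and
# the rules of one of the 21 trees. Trained on synthetic placeholder data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

rng = np.random.default_rng(0)
X_demo = rng.random((200, 7))              # placeholder 7-feature vectors
y_demo = (X_demo[:, 0] > 0.5).astype(int)  # placeholder labels
forest = RandomForestClassifier(n_estimators=21, max_features=5, random_state=0)
forest.fit(X_demo, y_demo)

names = ["overall_similarity", "level1_match", "level2_match", "level3_match",
         "level4_match", "level5_match", "marker_match"]
for name, score in zip(names, forest.feature_importances_):
    print(f"{name}: {score:.3f}")          # aggregate importance of each feature
print(export_text(forest.estimators_[0], feature_names=names))  # one tree's rules
```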

Moreover, the approach exemplifies effective integration of linguistic knowledge into machine learning. Rather than treating text as mere sequences of symbols, the model incorporates decades of lexicographic work encoded in Cilin. The semantic hierarchy acts as a scaffold, enabling the algorithm to reason about word relationships in a way that mirrors human cognitive categories. For instance, two nouns sharing the same Cilin Level 1 code (e.g., both falling under “food”) are more likely to form coordinate or attributive pairs than if they belong to disparate domains like “food” and “abstract concept.”

The implications extend beyond academic interest. Accurate phrase-level parsing is foundational to numerous real-world NLP applications. In machine translation, misclassifying “stone lion” as a coordinate phrase (implying two separate entities) instead of an attributive one (a lion made of stone) could lead to comical or confusing outputs. In information extraction systems, correctly identifying “Apple CEO” as an appositive structure is essential for linking entities in knowledge graphs. Similarly, sentiment analysis models benefit from understanding whether “price hike” is a neutral noun phrase or carries negative connotation through its syntactic framing.

The research also contributes to broader discussions about cross-linguistic NLP. Much of the field’s progress has centered on English, leveraging its overt inflectional markers and abundant annotated resources. Languages like Chinese, with their isolating morphology and context-dependent syntax, present distinct challenges that necessitate tailored solutions. This study demonstrates that combining language-specific resources (like Cilin) with general-purpose ML algorithms can yield high-performing systems without requiring language-universal architectures.

Critically, the work adheres to principles of methodological rigor and reproducibility. The corpus construction process—automated filtering followed by manual validation—ensures data quality. Feature definitions are explicit and grounded in established linguistic theory. The choice of C4.5 for decision tree induction, a well-documented algorithm, further enhances replicability. While the paper notes that the implementation used MATLAB, the described pipeline could be readily adapted to open-source frameworks like scikit-learn.

Looking ahead, the authors suggest expanding the framework to other phrase types beyond N1+N2, such as verb-object or adjective-noun constructions. They also propose enriching the feature set with distributional semantics from word embeddings or contextualized representations from transformer models—though they caution against discarding symbolic linguistic features, which provide complementary signals to statistical patterns.

The success of this approach also invites reflection on the nature of grammatical relations themselves. Traditionally viewed as discrete categories, the high correlation between semantic similarity and syntactic function in this model suggests a more gradient reality. Perhaps the line between, say, attributive and coordinate is not categorical but probabilistic, shaped by semantic distance, frequency, and discourse context. Machine learning models, by capturing these statistical regularities, may help refine linguistic theories themselves.

In an era where AI systems increasingly mediate human communication—from smart assistants to automated journalism—the ability to parse language with human-like nuance is not merely a technical goal but a social imperative. Errors in syntactic interpretation can propagate into bias, misinformation, or exclusion. Research like Yang’s bridges the gap between theoretical linguistics and practical AI, ensuring that language technologies serve diverse linguistic communities equitably.

The 94.8% accuracy figure, while impressive, is ultimately less significant than the methodological blueprint it represents: a fusion of domain-specific knowledge, careful data curation, and judicious algorithm selection. As NLP continues to evolve, such hybrid approaches—neither purely symbolic nor purely statistical—may offer the most sustainable path toward robust, interpretable, and linguistically aware AI.

In summary, this study demonstrates that even with modest computational means and focused linguistic insight, significant advances in language processing are attainable. It reinforces the value of interdisciplinary collaboration and offers a replicable template for tackling syntactic ambiguity in other resource-rich but morphologically sparse languages. As natural language interfaces become ubiquitous, the quiet precision of models like this one will underpin the reliability of countless digital interactions.

Quan Yang, College of Chinese Language and Culture, Beijing Normal University, Beijing 100875, China. Journal of Chongqing University of Technology (Natural Science), 2021, 35(7): 125–130. doi:10.3969/j.issn.1674-8425(z).2021.07.015