New Siamese Bi-GRU Model Sets Benchmark in Humor Comparison Challenge
In a rapidly evolving corner of artificial intelligence—where language models don’t just parse meaning but feel the punchline—a team from Jiangnan University has quietly delivered a breakthrough that redefines how machines understand humor. Their work isn’t about detecting whether a sentence is funny—a task already tackled, imperfectly, by earlier systems. Instead, they’ve cracked a far more nuanced, human-like challenge: which of two jokes is funnier?
This may sound trivial, even frivolous—until you consider what it demands from a machine: sensitivity to incongruity, timing, cultural context, wordplay, and emotional surprise—all while avoiding brittle handcrafted rules. The model, dubbed S-BiGRU-AT (Siamese Bidirectional GRU with Attention), achieved a micro-averaged accuracy of 70.9% on the SemEval-2017 Task 6 #HashtagWars dataset, outperforming prior deep learning baselines by up to 15.5%. More impressively, it does so without relying on manually engineered linguistic or semantic features—a long-standing bottleneck in computational humor research.
Why does this matter? Because humor isn’t just entertainment. It’s a high-bandwidth signal of social intelligence. The ability to rank humor—to sense subtlety, to weigh absurdity against plausibility, to detect irony buried in a 280-character tweet—is a stepping stone toward AI that doesn’t just respond, but connects.
Humor computation has, for decades, lived in the shadow of more “serious” NLP tasks: machine translation, question answering, sentiment analysis. Early attempts treated jokes like anomalies—statistical outliers, syntactic violations, or violations of Gricean maxims. Researchers built classifiers using features like alliteration frequency, lexical ambiguity scores, or incongruity ratios between adjacent clauses. One study might count how many homophones appear; another might flag sentences where sentiment flips abruptly from positive to negative.
These approaches had one critical flaw: they assumed humor could be reduced to a checklist of surface features.
“Humor is inherently combinatorial,” explains one researcher familiar with the Jiangnan team’s work, who asked not to be named. “A pun only works if the double meaning lands at the right moment—too early and it’s predictable; too late and it’s confusing. A sarcastic tweet hinges on shared cultural knowledge. You can’t capture that with hand-coded rules alone.”
The SemEval-2017 #HashtagWars competition crystallized this challenge. Organized by the Special Interest Group on the Lexicon (SIGLEX), the task asked participants to compare pairs of tweets written for the same hashtag prompt—e.g., #BreakUpIn5Words—and pick the funnier one. Past winners included entries like “It’s not you, it’s my spaceship landing tomorrow” versus the more literal “You don’t like potatoes.” The first wins—not because it’s grammatically superior, but because it weaponizes absurdity with perfect deadpan timing.
Crucially, the ground truth wasn’t binary (funny/not funny), but ordinal. That distinction forced models to operate on a spectrum—not just recognizing humor, but calibrating it.
Enter the S-BiGRU-AT model.
At its core lies a Siamese architecture—a design borrowed from face recognition systems, where two identical subnetworks process different inputs in parallel, projecting them into a shared embedding space. Here, each “twin” encodes one tweet. Because the networks share weights, they learn a consistent, comparable representation—not of topics or sentiments, but of humorous potential.
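The weight sharing is easy to see in code. Below is a minimal sketch in PyTorch, not the authors' released implementation: `TweetEncoder` would be the BiGRU-and-attention encoder described next, and the comparison head's shape is illustrative.

```python
import torch
import torch.nn as nn

class SiameseComparator(nn.Module):
    """One encoder, two inputs: the essence of a Siamese design."""
    def __init__(self, encoder: nn.Module, repr_dim: int):
        super().__init__()
        self.encoder = encoder              # a single encoder instance...
        self.head = nn.Sequential(          # ...feeds a shared scoring head
            nn.Linear(2 * repr_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, tweet_a, tweet_b):
        va = self.encoder(tweet_a)          # the same weights embed both
        vb = self.encoder(tweet_b)          # tweets, so they are comparable
        return self.head(torch.cat([va, vb], dim=-1))
```

A positive logit reads as "tweet A is funnier." Because both tweets flow through identical weights, anything the encoder learns about humor from one side of a training pair transfers automatically to the other.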
But what makes this Siamese network so effective?
First, it eschews convolutional layers—ubiquitous in early NLP models—for bidirectional GRUs (Gated Recurrent Units). CNNs, while fast, treat text like a bag of local patterns, discarding long-range dependencies and word order. That’s fatal for humor, where the punchline often lives in the final word, subverting everything before it.
GRUs, by contrast, maintain a hidden state that evolves token-by-token, remembering or forgetting context as needed. The bidirectional variant processes the sentence forward and backward, letting each word “know” not just what came before, but what comes after—a necessity when the setup and punchline are separated by clauses.
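A few lines of PyTorch make the difference concrete (the dimensions here are illustrative, not taken from the paper):

```python
import torch
import torch.nn as nn

emb_dim, hidden = 100, 128
bigru = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)

tokens = torch.randn(1, 15, emb_dim)   # one tweet: 15 embedded tokens
annotations, _ = bigru(tokens)         # shape: (1, 15, 2 * hidden)
# Each token's annotation concatenates a forward state (everything read
# so far) with a backward state (everything still to come), so even the
# first word "knows" about the punchline waiting at the end.
```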
Then comes the context-aware attention mechanism, arguably the model’s secret sauce.
Instead of compressing an entire tweet into a single vector (e.g., by grabbing only the final GRU state), attention computes a weighted average over all word annotations. Each word is assigned a scalar weight—learned, not hardcoded—reflecting its contribution to the overall humorous effect.
The paper includes a telling example: in a tweet for #IfIWerePresident, the words “estranged” and “pardon” received significantly higher attention scores than functional words like “my” or “I”. This isn’t coincidental. “Estranged” evokes familial drama; “pardon” hints at presidential overreach or absurd mercy. Together, they set up a darkly comic scenario—e.g., “I’d pardon my estranged brother for stealing the nuclear codes… again.” The model didn’t just detect keywords; it sensed narrative tension.
Critically, the attention vector—called *u_h* in the architecture—is learned jointly with the rest of the network. There’s no predefined list of “funny words.” The system discovers, through exposure to thousands of joke pairs, which lexical and contextual signals reliably tip the scale.
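In code, the mechanism is compact. Here is a sketch in the spirit of the description above (the layer names are mine, and the paper's exact parameterization may differ):

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Collapse per-word BiGRU annotations into one tweet vector."""
    def __init__(self, ann_dim: int):
        super().__init__()
        self.proj = nn.Linear(ann_dim, ann_dim)        # a key for each word
        self.u_h = nn.Parameter(torch.randn(ann_dim))  # learned context vector

    def forward(self, annotations):                    # (batch, seq, ann_dim)
        keys = torch.tanh(self.proj(annotations))
        scores = keys @ self.u_h                       # one scalar per word
        alpha = torch.softmax(scores, dim=1)           # attention weights
        # Weighted average: words like "estranged" and "pardon" dominate
        # the tweet vector when their learned weights are high.
        return (alpha.unsqueeze(-1) * annotations).sum(dim=1)
```

Note that `u_h` is just another trainable parameter. Nothing in the code enumerates funny words; the weights emerge entirely from exposure to labeled joke pairs.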
Data preparation was equally thoughtful.
Rather than rely on generic word embeddings (e.g., Word2Vec trained on Wikipedia), the team used GloVe embeddings pre-trained on a corpus of 330 million raw English tweets—capturing internet-born slang, hashtag conventions, emoji semantics (treated as tokens), and the staccato rhythm of microblogging.
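Wiring such vectors into a network typically means building an embedding matrix keyed to the model's vocabulary. A minimal sketch, assuming a locally downloaded tweet-trained GloVe text file (the filename is a placeholder, not the paper's exact resource):

```python
import numpy as np

def load_glove(path, vocab, dim=100):
    """Fill an embedding matrix for `vocab` from a GloVe text file.

    Tokens missing from the file (fresh slang, rare hashtags) keep small
    random vectors, so they can still be fine-tuned during training.
    """
    matrix = np.random.uniform(-0.05, 0.05, (len(vocab), dim)).astype("float32")
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            if word in vocab and len(values) == dim:
                matrix[vocab[word]] = np.asarray(values, dtype="float32")
    return matrix

# vocab maps token -> row index, e.g. {"lol": 0, "#fail": 1, ...};
# the path below is a placeholder for whatever tweet-trained GloVe
# release is available locally.
# embeddings = load_glove("glove.twitter.27B.100d.txt", vocab)
```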
The preprocessing pipeline—built on tools from the DataStories team—went further. It normalized URLs to a generic `<url>` token and @-handles to `<user>`, and split compound hashtags (*#BreakUpIn5Words* → *break up in 5 words*). It corrected common misspellings using a Viterbi algorithm informed by Twitter unigram and bigram statistics. Crucially, it preserved expressive typography: words wrapped in asterisks (*very*), tildes (~sarcastic~), or repeated letters (sooooo) were tokenized meaningfully, not stripped away.

This fidelity to *how people actually write online* proved decisive. Humor on social media thrives on stylistic quirks—the extra “o” in *noooo*, the delayed ellipsis…, the strategic ALL CAPS. Strip those away, and you’re not analyzing humor—you’re analyzing its corpse.
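The DataStories preprocessing stack is available as that team's open-source ekphrasis library. A configuration along the lines of its documented options is sketched below; the paper does not spell out its exact flag settings, so treat these as an approximation:

```python
from ekphrasis.classes.preprocessor import TextPreProcessor
from ekphrasis.classes.tokenizer import SocialTokenizer
from ekphrasis.dicts.emoticons import emoticons

text_processor = TextPreProcessor(
    # replace URLs and @-handles with placeholder tokens (<url>, <user>)
    normalize=["url", "email", "user", "number"],
    # annotate, rather than delete, stylistic signals: ALL CAPS, sooooo, *very*
    annotate={"hashtag", "allcaps", "elongated", "repeated", "emphasis"},
    segmenter="twitter",        # split #BreakUpIn5Words into plain words
    corrector="twitter",        # Viterbi spell correction from tweet statistics
    unpack_hashtags=True,
    spell_correct_elong=False,
    tokenizer=SocialTokenizer(lowercase=True).tokenize,
    dicts=[emoticons],          # map emoticons like :) to semantic tokens
)

print(" ".join(text_processor.pre_process_doc(
    "CANT WAIT for #BreakUpIn5Words!!! sooooo good :)")))
# roughly: cant <allcaps> wait for <hashtag> break up in 5 words </hashtag>
# ! <repeated> soo <elongated> good <happy>
```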
—

Training employed rigorous safeguards against overfitting. With only ~109,000 labeled pairs in the training set—tiny by modern deep learning standards—the risk of memorization was high. The team countered this with:

– **Dropout** (0.2 in GRU layers, 0.3 elsewhere)
– **Gaussian noise injection** (σ = 0.3) on all layers
– **L2 regularization** (λ = 0.0001)
– **Bayesian hyperparameter optimization**, prioritizing generalization over peak training accuracy

They used **ReLU activations** in the final comparison layer—not sigmoid or tanh. Why? Because ReLU’s linear response for positive inputs avoids the vanishing gradient problem that plagues deeper networks, enabling faster convergence and more stable learning. It also aligns with how humor decisions often feel: not a smooth probability curve, but a decisive *aha!* or *meh*.

Evaluation followed **Leave-One-Hashtag-Out Cross-Validation (LOOCV)**—a stringent protocol where the model is trained on tweets from 105 hashtags and tested on the *held-out* one. This mimics real-world deployment: can the system generalize to *new* joke formats, not just rehash old ones? The answer: yes—with caveats.

Performance varied across hashtag categories. For *#BadJobIn5Words* and *#BreakUpIn5Words*, the model hit 74.6% and 93.0% accuracy, respectively—likely because these prompts encourage formulaic, setup-punchline structures. But for *#CerealSongs* (e.g., rewriting pop lyrics about breakfast food), accuracy dipped to 67.9%. Abstract or referential humor—relying on shared knowledge of songs, brands, or memes—remains harder to pin down.

Even so, the *consistency* across hashtags was remarkable. Over 80% of the 106 hashtags achieved accuracy between 60% and 80%, with only a handful falling below 55%. This suggests the model isn’t latching onto shallow artifacts (e.g., exclamation points = funny), but learning transferable patterns.
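Operationally, this is leave-one-group-out cross-validation with hashtags as the groups. A compact sketch with scikit-learn, where `build_model` stands in for a hypothetical S-BiGRU-AT wrapper exposing sklearn-style `fit`/`predict` (this is not the authors' code):

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

def leave_one_hashtag_out(pairs, labels, hashtags, build_model):
    """Train on all hashtags but one; test on the held-out hashtag.

    pairs, labels, hashtags: parallel numpy arrays over all tweet pairs.
    """
    per_hashtag, correct, total = {}, 0, 0
    for train_idx, test_idx in LeaveOneGroupOut().split(
            pairs, labels, groups=hashtags):
        model = build_model()                       # fresh model per fold
        model.fit(pairs[train_idx], labels[train_idx])
        preds = model.predict(pairs[test_idx])
        hits = int((preds == labels[test_idx]).sum())
        per_hashtag[hashtags[test_idx[0]]] = hits / len(test_idx)
        correct, total = correct + hits, total + len(test_idx)
    # Micro-average: every tweet pair counts equally across all folds.
    return per_hashtag, correct / total
```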
—

How does this stack up against predecessors? The strongest prior contender was *DataStories*—a Siamese LSTM with attention, also from SemEval-2017. While impressive, it used **sigmoid** in its final layer—a choice the Jiangnan team deliberately revised to **ReLU**, gaining 1.8% in average accuracy.

Why such a small architectural change yielded measurable gains reveals a deeper truth: in humor, *contrast matters*. Sigmoid compresses outputs into a narrow range near 0.5, blurring distinctions between “mildly amusing” and “laugh-out-loud.” ReLU, by contrast, lets confident predictions stretch arbitrarily high—mirroring how humans *experience* humor: not as a gradient, but as spikes.

Other baselines fared worse. A character-level CNN (Potash et al.) achieved 63.7%—proof that spelling and morphology alone can’t carry the weight. A token-level RNN with handcrafted features scored just 55.4%, underscoring the limits of manual feature engineering.

Even ablated versions of S-BiGRU-AT suffered. Replacing BiGRU with a feedforward network (**S-FFNN**) dropped accuracy to 56.7%. Swapping attention for max-pooling eroded sensitivity to key words. The full architecture—Siamese + BiGRU + Attention + ReLU—proved *synergistic*.

—

Beyond metrics, the model’s behavior echoes human intuition. Consider two hypothetical tweets for *#Shakespeare*:

> A) “O, wilt thou lend me thine Wi-Fi password, fair Juliet?”
> B) “To binge, or not to binge: that is the question—Netflix hath made me weak.”

Most humans would pick (A). It’s not just the anachronism—it’s the *specificity* of “Wi-Fi password” (a modern vulnerability) meeting “fair Juliet” (romantic idealism). The Jiangnan model agrees—and its attention scores light up *“Wi-Fi”* and *“password”* far more than *“thine”* or *“O”*.

Now flip it: replace *“Wi-Fi password”* with *“phone”*. The joke softens. *Phone* is generic; *Wi-Fi password* implies trust, intimacy, digital intrusion—all fertile ground for comedy. The model’s confidence drops accordingly. This isn’t statistical mimicry. It’s *contextual reasoning*.

—

Still, the work isn’t without limitations. LOOCV, while rigorous, is computationally expensive—training 106 separate models. Future work could explore multi-task learning or meta-learning to share knowledge across hashtags more efficiently.

More fundamentally, the model operates at the *tweet level*. It doesn’t model audience—why a joke kills in Boston but bombs in Bangkok. It doesn’t track evolving meme ecology. And it can’t *generate* humor, only rank it.

The authors hint at a path forward: **character-level modeling**. By digging below the word level, future systems might detect puns relying on phonetic similarity (*“lettuce turnip the beet”*), misspellings used for comic effect (*“doughnut”* → *“do not”*), or even emoji sequences that subvert expectations (e.g., an emoji equation that resolves to “my pizza budget after rent”).

There’s also untapped potential in *multimodal* humor—jokes that fuse image and text, like meme templates. Today’s models treat them separately. Tomorrow’s might fuse attention across modalities.

—

What’s the bigger picture? We’re entering an era where AI must do more than inform—it must *entertain*, *comfort*, and *delight*. Chatbots that crack a well-timed joke during a frustrating support call see higher satisfaction. Virtual therapists using gentle humor build rapport faster. Even autonomous vehicles might use light humor to defuse road-rage tension (*“I brake for squirrels… and existential dread.”*).

But deploying humor AI carries risks. A model trained on Twitter’s adversarial, often cynical humor might default to sarcasm—alienating users who prefer warmth. Bias is also a concern: if training data overrepresents certain demographics, the model may favor their comedic styles, marginalizing others.

The Jiangnan team’s approach—data-rich, feature-lean, attention-guided—offers a more *adaptive* foundation. By learning directly from human comparisons, not expert rules, it inherits the diversity (and contradictions) of real-world humor. That’s not just technically elegant. It’s ethically essential.

—

As AI ventures beyond efficiency into the messy realm of *human experience*, tasks like humor comparison serve as canaries in the coal mine. Can machines learn the rhythms of play? The grammar of surprise? The ethics of laughter?

The S-BiGRU-AT model doesn’t answer all these questions. But it proves, decisively, that machines can learn to *discriminate*—to sense not just *that* something is funny, but *why* one thing is funnier than another.
And in a world drowning in content, that may be the most valuable filter of all.

—

Gu Yan¹, Xia Hongbin¹,², Liu Yuan¹,²
¹School of Artificial Intelligence & Computer, Jiangnan University, Wuxi, Jiangsu 214122, China
²Jiangsu Key Laboratory of Media Design & Software Technology, Wuxi, Jiangsu 214122, China
*Application Research of Computers*, Vol. 38, No. 4, April 2021, pp. 1017–1021
DOI: 10.19734/j.issn.1001-3695.2020.05.0118