Short answer: yes, this is essentially expected behavior.
You are seeing exactly what this setup would be expected to do. DistilBERT (and the other models) have learned “does this look like a typical, well-formed excuse in my dataset?” rather than “is this event physically possible in the real world?”. Dragons, portals, teleportation, wizards, aliens, new dimensions all look like perfectly normal sentences, and your labels never strongly say “these should be 0”, so the classifier keeps them on the believable side.
I will go through:
- What your model actually learned (and why all architectures behave the same).
- How this connects to known results on semantic/physical plausibility.
- Direct answers to your five questions.
- A concrete upgrade path: data changes, models, and a multi-stage pipeline.
1. What your classifier actually learned
1.1 Objective + data distribution ⇒ “linguistic plausibility”, not “world realism”
Your setup:
- Input: excuse text only.
- Label: 0 = unbelievable, 1 = believable.
- Dataset: ~400k excuses, globally balanced.
- Crucial skew: the 0 class has very few explicitly impossible/fantasy events.
The training objective is standard:
- Minimize cross-entropy on these labels.
So the model will learn:
“Given an excuse, predict the label humans used in the dataset.”
It will not learn:
“Given an excuse, decide if it is physically possible in the real world.”
If almost all examples near “I teleported to school / I went to school / I ran to school” are labeled 1 in your training data, then the classifier sees teleportation as just another verb in a familiar template and generalizes to “1”.
This is exactly the kind of failure that semantic-plausibility work has documented:
- Wang et al. (2018) show that distributional models trained on text do poorly on the physical plausibility of events (e.g., "man swallow paintball" vs "man swallow candy") unless you inject explicit world knowledge. (ACL Anthology)
- Porada et al. (2019) show that even BERT, when trained for plausibility, is highly dependent on the coverage and labels for specific event types; when certain implausible patterns are never labeled as such, it fails to mark them implausible. (ACL Anthology)
Your believability classifier is in the same regime: it has no strong supervised evidence that fantasy explanations are a distinct region that should map to 0.
1.2 Pretraining doesn’t fix this: LMs see fantasy all the time
DistilBERT comes from BERT, which is pretrained on large corpora (BooksCorpus, Wikipedia). Those corpora contain:
- Realistic text,
- But also fiction, sci-fi, fantasy.
So in pretraining, the model learns that “dragon”, “portal”, “wizard” are perfectly normal words in many contexts. This agrees with general findings:
- Event-knowledge work shows LMs do encode a gradient of plausible vs implausible events, but their scores are driven by both plausibility and surface features (frequency, syntax) and are not a clean model of physics. (arXiv)
When you fine-tune DistilBERT on your excuses, you overwrite and reshape that knowledge to best fit your labels. If your labels don’t say “dragon excuses ⇒ 0”, the fine-tuning will not make that connection, even if the base model internally “knows” dragons are fictional.
1.3 Why all your other models fail on the same sentences
Naive Bayes, Logistic Regression, SVM, LSTM, LSTM+heuristics all see:
- The same input distribution.
- The same labels.
- The same skew: almost no fantasy labelled 0.
So they converge on roughly the same rule: “short, grammatical first-person excuses that match common structures are more likely to be 1”.
This matches what semantic-plausibility papers find:
- Simple distributional / embedding-based models also misclassify implausible events if those events never show up labeled as implausible in training. (ACL Anthology)
DistilBERT just does a better job on in-distribution cases, giving you 76% overall accuracy, but the underlying definition of “believable” is still governed by the dataset and labels, not physics.
2. How this lines up with the research on plausibility
You are essentially training a semantic plausibility classifier but with “excuse believability” as the label.
Key empirical findings from that literature:
- Distributional LMs learn text-based plausibility, not full world knowledge.
  - Wang et al. (2018) and Porada et al. (2019) explicitly show that distributional models and even BERT struggle with physically implausible / impossible events if they are not well represented and labeled. (ACL Anthology)
- LLMs do have some event knowledge but still confuse impossible, unlikely, and fictional.
  - Kauf et al. (2023) test several LMs on minimal event pairs (possible vs impossible; likely vs unlikely) and find that LMs generally rank possible events above impossible, but still show substantial errors and sensitivity to surface features. (arXiv)
- Dedicated plausibility estimators (like Vera) are trained explicitly to spot implausible statements.
  - Vera (Liu et al., 2023) is trained on ~7M commonsense statements from QA datasets and commonsense KGs to output a plausibility score. It is explicitly designed to filter LM-generated nonsense and trivial commonsense errors. (arXiv)
Your setup is closer to the first group: a task-specific classifier relying on your labels. You are not yet using a Vera-style plausibility model or a semantics-only benchmark; hence the fantasy blind spot.
3. Direct answers to your questions
3.1 Is this normal for DistilBERT (no commonsense grounding)?
Yes, this is normal.
More precisely:
- DistilBERT is not grounded in perception or physics. It is a text-only model.
- It learns statistical co-occurrence patterns in text plus whatever you add via fine-tuning.
- Semantic-plausibility studies show these models can rank possible vs impossible events reasonably well when explicitly trained and tested on that task, but they still often assign non-trivial plausibility to impossible or fictional events, especially outside their training distribution. (arXiv)
So DistilBERT acting as if “A dragon burned my homework” is “believable” within your task is expected given:
- The pretraining objective (masked language modelling on mixed-genre text).
- The fine-tuning objective (fit your labels, not physical realism).
- The lack of strong negative examples marking fantasy as “unbelievable”.
3.2 Do you simply lack enough fantasy/impossible samples?
Yes, but the issue is slightly broader than “not enough examples”:
- Coverage: the "unbelievable" class has few explicit fantasy/impossible events, so the classifier never sees that subspace labeled 0.
- Label semantics: if any fantasy excuses are labeled 1 (e.g., humorous entries), that actively teaches the model the opposite mapping ("fantasy can be believable").
- Class conditioning: balanced global labels (≈50/50) do not guarantee that each important subtype within the 0 class (fantasy, contradictions, extreme coincidences) is well represented.
The semantic-plausibility literature consistently shows that implausible events are harder and underrepresented, and performance is tightly coupled to how well the training data covers their patterns. (ACL Anthology)
So you do need more fantasy/impossible samples labeled 0, but you also need to treat this as a distinct slice of the problem and monitor it explicitly.
3.3 Should you add a separate “unrealistic/fantasy detector” before the classifier?
This is a good idea, but it should complement, not replace, better data.
A separate “realism / impossibility” module (Stage 1) makes sense when:
- You conceptually see the task as two steps:
  - Check whether the excuse is realistic (physically possible).
  - Given that it is realistic, decide whether it sounds believable.
- You want strong guarantees like: "If it is physically impossible, we always output 0."
Patterns that others use (very close to your plan):
- Rule-based detector (a minimal sketch follows this list):
  - Maintain a curated list of clearly fantasy tokens (dragon, wizard, portal, spell, alien, teleport, demon, etc.).
  - If any appear, mark the excuse as "impossible/fantasy" and force label 0 or a separate tag.
  - This is simple but surprisingly effective for the extreme cases you listed.
- General plausibility model as a gate (Vera-style):
  - Use a model like Vera, which outputs a plausibility score for declarative statements. (arXiv)
  - Treat low scores as a "Stage-1 failure" and override the DistilBERT classifier.
- OOD / novelty detector:
  - Learn an out-of-distribution detector on DistilBERT embeddings or logits: if an excuse is far from the training distribution, treat it as "suspect" and fall back to a conservative rule (e.g., 0). (ACL Anthology)
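To make the rule-based option concrete, here is a minimal sketch. The `FANTASY_TERMS` list and the `is_fantasy_excuse` helper are illustrative names I am introducing here, not something from your codebase, and the keyword list would grow out of error analysis on your own data:

```python
import re

# Illustrative, not exhaustive: extend this from error analysis on your data.
FANTASY_TERMS = {
    "dragon", "wizard", "witch", "portal", "spell", "magic", "alien",
    "teleport", "teleported", "demon", "ghost", "time machine",
    "another dimension", "unicorn", "vampire",
}

def is_fantasy_excuse(text: str) -> bool:
    """Return True if the excuse contains an obviously fantastical term."""
    lowered = text.lower()
    # Word-boundary match so "spell" does not fire on "spelling", etc.
    return any(re.search(rf"\b{re.escape(term)}\b", lowered) for term in FANTASY_TERMS)

print(is_fantasy_excuse("A dragon burned my homework"))    # True
print(is_fantasy_excuse("The bus broke down on the way"))  # False
```

A pure keyword filter will obviously miss paraphrases ("a giant fire-breathing lizard"), which is why it works best as one conservative signal among several rather than the whole Stage 1.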
So yes, a separate “unrealistic/fantasy detector” is a reasonable architectural step. But you will still get the best results if you also:
- Add explicit fantasy/impossible data.
- Possibly change the label scheme to separate “impossible” from “possible-but-unconvincing.”
3.4 Would larger models (DeBERTa, RoBERTa-large) improve realism?
They can help, but they are not a silver bullet.
Larger transformer encoders like RoBERTa-large and DeBERTa-v3-large:
- Encode richer contextual semantics and commonsense.
- Often perform better on benchmarks involving commonsense QA and plausibility judgments. (ACL Anthology)
However:
- They still optimize your labels; they do not spontaneously override what your dataset tells them.
- If fantasy excuses are absent or inconsistently labeled, they will still behave poorly on that slice.
So you can expect:
- Some robustness and calibration improvements.
- Potentially better behavior once you add explicit impossible/fantasy supervision or multi-task with plausibility datasets.
But upgrading the encoder without fixing data and task definition will not, by itself, fix your dragons and portals.
3.5 Should you build a multi-stage pipeline?
Given your goals, a multi-stage pipeline is a good fit. Conceptually:
- Stage 1 – Realism / plausibility / OOD gate.
- Stage 2 – Believability among realistic excuses.
- Optional: an abstention / "I'm not sure" output. For high-uncertainty or OOD cases, allow the system to abstain rather than make a hard 0/1 call. (ACL Anthology)
This matches the broader trend in the literature: decouple commonsense plausibility from the downstream task, and use dedicated plausibility modules as filters or priors rather than relying on a single classifier to learn everything.
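As a sketch of how the stages could fit together: `is_fantasy_excuse` is the rule filter sketched in section 3.3, while `plausibility_score` and `believability_model` are placeholders for whatever scorer and classifier you plug in (they are not real APIs), and both thresholds are assumptions to tune on a dev set.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    label: str   # "believable", "unbelievable", or "abstain"
    reason: str  # which stage decided, kept for logging / error analysis

# Assumed thresholds -- tune both on a held-out dev set.
PLAUSIBILITY_CUTOFF = 0.2   # below this, Stage 1 calls the excuse unrealistic
ABSTAIN_MARGIN = 0.15       # Stage 2 abstains if P(believable) is within 0.5 +/- margin

def judge_excuse(text: str) -> Verdict:
    # Stage 1a: hard rule gate for obvious fantasy (rule-filter sketch in section 3.3).
    if is_fantasy_excuse(text):
        return Verdict("unbelievable", "stage1: fantasy keyword")

    # Stage 1b: generic plausibility gate. `plausibility_score` is a placeholder for
    # whatever scorer you use (e.g., a Vera-style model returning a value in [0, 1]).
    if plausibility_score(text) < PLAUSIBILITY_CUTOFF:
        return Verdict("unbelievable", "stage1: low plausibility")

    # Stage 2: your fine-tuned classifier. `believability_model` is a placeholder
    # returning P(believable) for the excuse.
    p_believable = believability_model(text)
    if abs(p_believable - 0.5) < ABSTAIN_MARGIN:
        return Verdict("abstain", "stage2: low confidence")
    return Verdict("believable" if p_believable >= 0.5 else "unbelievable",
                   "stage2: classifier decision")
```

The point of the `reason` field is auditability: when a dragon excuse is rejected, you can see which stage caught it.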
4. Concrete upgrade plan
Here is a practical path you can follow, building on your existing DistilBERT setup.
4.1 Build explicit evaluation slices
First define the slices you care about:
- Fantasy / impossible excuses
  - Your examples ("A dragon burned my homework", "My homework fell into another dimension", etc.).
  - Extend these with many variants (teleportation, magic, aliens, time travel).
- Physically possible but extremely implausible excuses
  - "The President came to my house and confiscated my homework."
  - "All printers in the city simultaneously exploded."
- Normal realistic believable / unbelievable excuses
  - What your current dataset already has lots of.
Create a few hundred to a few thousand examples per slice and keep them out of training as a test/dev set. You will track:
- Overall accuracy/F1.
- Slice-wise metrics (especially fantasy slice).
Semantic-plausibility work does exactly this kind of slice analysis and consistently finds implausible / impossible slices to be the hardest. (ACL Anthology)
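A small sketch of what the slice-wise report could look like, assuming you store a slice name alongside each held-out example (the column names here are made up for illustration):

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score

def slice_report(df: pd.DataFrame) -> pd.DataFrame:
    """Assumes df has columns: text, label (0/1), pred (0/1), slice (str)."""
    rows = []
    for slice_name, group in df.groupby("slice"):
        rows.append({
            "slice": slice_name,
            "n": len(group),
            "accuracy": accuracy_score(group["label"], group["pred"]),
            "f1": f1_score(group["label"], group["pred"], zero_division=0),
        })
    rows.append({
        "slice": "ALL",
        "n": len(df),
        "accuracy": accuracy_score(df["label"], df["pred"]),
        "f1": f1_score(df["label"], df["pred"], zero_division=0),
    })
    return pd.DataFrame(rows)

# Usage: eval_df holds your model's predictions on the held-out slices.
# print(slice_report(eval_df))
```

Running this after every retrain makes regressions on the fantasy slice visible immediately, instead of being hidden inside a single overall accuracy number.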
4.2 Augment training with fantasy/impossible excuses
Next, add explicit negatives for your weak spot:
- Manual + synthetic generation of new fantasy/impossible excuses, labeled 0.
- Contrastive pairs (sketched in code at the end of this subsection):
  - For each realistic excuse, produce a fantasy variant with minimal edits:
    - "My dog ate my homework." → "A dragon ate my homework."
    - "The bus broke down." → "The bus teleported into space."
  - Train the model so that those minimal changes flip the label, reinforcing that the unusual entity/verb is the reason.
Counterfactual augmentation like this has been shown to improve robustness and make models rely on truly causal features rather than incidental correlations. (arXiv)
- Mild oversampling: oversample fantasy/impossible examples during fine-tuning so they form a meaningful percentage of each batch (e.g., 10–30%, not 0.01%).
Then:
- Retrain DistilBERT with the same hyperparameters as a baseline.
- Check fantasy slice metrics; they should improve markedly, even before adding a Stage-1 filter.
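A rough sketch of both ideas follows. The substitution table, helper names, and the 30% target share are illustrative assumptions; in practice you would want far more varied templates (or an LLM in the loop) than simple string replacement.

```python
from torch.utils.data import WeightedRandomSampler

# Illustrative minimal-edit substitutions: realistic phrase -> fantasy phrase.
FANTASY_SWAPS = {
    "my dog ate": "a dragon ate",
    "the bus broke down": "the bus teleported into space",
    "i lost": "a wizard stole",
    "my printer broke": "my printer was cursed by a witch",
}

def fantasy_variant(excuse: str) -> str | None:
    """Return a fantasy counterpart of a realistic excuse, or None if no template applies."""
    lowered = excuse.lower()
    for realistic, fantasy in FANTASY_SWAPS.items():
        if realistic in lowered:
            return lowered.replace(realistic, fantasy)
    return None

def augment(examples: list[tuple[str, int]]) -> list[tuple[str, int]]:
    """For each believable excuse (label 1), add a fantasy variant labeled 0."""
    augmented = list(examples)
    for text, label in examples:
        if label == 1 and (variant := fantasy_variant(text)) is not None:
            augmented.append((variant, 0))
    return augmented

def oversampling_weights(is_fantasy: list[bool], target_share: float = 0.3) -> WeightedRandomSampler:
    """Sampler that makes fantasy examples roughly `target_share` of each epoch."""
    n_fantasy = sum(is_fantasy)
    n_other = len(is_fantasy) - n_fantasy
    w_fantasy = target_share / max(n_fantasy, 1)
    w_other = (1 - target_share) / max(n_other, 1)
    weights = [w_fantasy if f else w_other for f in is_fantasy]
    return WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)

# Usage: apply `augment` offline, then pass the sampler to your DataLoader, e.g.
# DataLoader(train_dataset, batch_size=32, sampler=oversampling_weights(flags)).
```

With the augmentation applied offline and the sampler wired into training, the fantasy slice stops being a vanishing fraction of each epoch.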
4.3 Consider richer labels: separate “impossible” from “unbelievable but possible”
If you want the classifier itself to distinguish causes, move beyond a single bit (note that these ids remap your current 0 = unbelievable / 1 = believable convention):
- 0: believable.
- 1: unbelievable but physically possible.
- 2: physically impossible / fantasy.
Or multi-label:
- physically_possible ∈ {0,1}
- believable_as_excuse ∈ {0,1}
This mirrors how plausibility datasets like PAP and some event-knowledge benchmarks treat plausibility as multi-level and explicitly distinguish impossible from unlikely. (Wiley Online Library)
You can always collapse 1 and 2 into “unbelievable” at inference, but training with separate labels teaches the model that fantasy is a special failure mode.
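Collapsing at inference is then just a mapping over the model's argmax. A minimal sketch for the 3-class option (label ids as listed above; the checkpoint name is the stock DistilBERT for illustration and would be your fine-tuned 3-class model in practice):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

ID2FINE = {0: "believable", 1: "unbelievable_but_possible", 2: "impossible_fantasy"}
FINE2BINARY = {"believable": 1, "unbelievable_but_possible": 0, "impossible_fantasy": 0}

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=3
)  # in practice, load your fine-tuned 3-class checkpoint here

def classify(text: str) -> tuple[str, int]:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    fine = ID2FINE[int(logits.argmax(dim=-1))]
    # Collapse back to your original 0/1 scheme (1 = believable) for downstream use.
    return fine, FINE2BINARY[fine]   # e.g. ("impossible_fantasy", 0)
```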
4.4 Add a Stage-1 plausibility / realism filter
Once you have better training data, add a simple Stage-1 module in front of DistilBERT:
- Rule filter: check for obviously fantasy tokens and patterns; if any are found, flag the excuse as impossible and output 0 directly.
- Plausibility score: use a plausibility model (e.g., Vera) to score the excuse; a low score ⇒ classify it as unrealistic. (arXiv)
- OOD detector: train an OOD detector on the DistilBERT embedding space to identify excuses that deviate strongly from typical training examples (fantasy often will). (ACL Anthology)
Combine them conservatively: if any strong signal says “impossible / out-of-distribution”, you override Stage-2.
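The OOD piece is the least standard of the three. One simple version, sketched below, fits a Gaussian to mean-pooled DistilBERT embeddings of your training excuses and flags anything far away in Mahalanobis distance; the threshold value is an assumption you would calibrate on held-out data, and `train_texts` stands in for (a sample of) your training excuses.

```python
import numpy as np
import torch
from sklearn.covariance import EmpiricalCovariance
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoder = AutoModel.from_pretrained("distilbert-base-uncased")

def embed(texts: list[str]) -> np.ndarray:
    """Mean-pooled DistilBERT embeddings, one vector per excuse."""
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state      # [batch, seq, dim]
    mask = inputs["attention_mask"].unsqueeze(-1)          # [batch, seq, 1]
    pooled = (hidden * mask).sum(1) / mask.sum(1)          # average over real tokens only
    return pooled.numpy()

# Fit the "typical excuse" region on a few thousand training excuses
# (needs more rows than the 768 embedding dimensions, or the covariance is singular).
train_embeddings = embed(train_texts)
cov = EmpiricalCovariance().fit(train_embeddings)

def is_out_of_distribution(text: str, threshold: float = 1.5e3) -> bool:
    """Flag excuses whose embedding is far from the training cloud."""
    dist = cov.mahalanobis(embed([text]))[0]   # squared Mahalanobis distance
    return dist > threshold                    # assumed cutoff; calibrate on a dev set
```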
4.5 Only then consider larger encoders
With all of the above in place, trying DeBERTa-v3-base/large or RoBERTa-large is worthwhile:
- They tend to give better calibrated scores and more robust semantic distinctions when trained on the same data. (ACL Anthology)
But they will now also have better data and labels to learn from, so any improvements will be meaningful and not just capacity-driven overfitting of the old label semantics.
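Swapping the encoder is then essentially a one-line change in the fine-tuning script, assuming you are on the Hugging Face AutoModel stack (checkpoint names shown are public Hub models; DeBERTa-v3 additionally needs the sentencepiece package installed):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "microsoft/deberta-v3-base"   # or "microsoft/deberta-v3-large", "roberta-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2   # or 3 if you adopt the richer label scheme from 4.3
)
# The rest of the fine-tuning setup stays the same, though larger models
# usually prefer a smaller learning rate (e.g., 1e-5 instead of 2e-5).
```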
5. Summary
- Your classifier (DistilBERT or otherwise) has learned linguistic and dataset-internal “believability”, not physical realism.
- This is normal for BERT-style models, and well documented in semantic-plausibility and event-knowledge research. (ACL Anthology)
- Yes, you are missing fantasy/impossible examples in the 0 class, and that is a key cause.
- A separate “unrealistic/fantasy detector” (Stage-1) is a good idea, especially in combination with rule-based checks and a generic plausibility model like Vera. (arXiv)
- Larger models can help a bit but will not solve the fundamental issue without better data and an explicit realism notion.
- A multi-stage pipeline (Stage-1 realism, Stage-2 excuse-believability) plus targeted data augmentation and slice-wise evaluation is a solid, scalable way to get the behavior you want.
If you implement:
- A dedicated fantasy/impossible test slice,
- Augmented training with explicit fantasy negatives and contrastive pairs, and
- A simple Stage-1 filter (rules + plausibility score),
you should see your dragons / portals / wizards reliably move from “believable” to “unbelievable,” while maintaining or improving your overall 0.76 accuracy.