# GLiNER Small v2.1 – GPU-Optimized Inference

Optimization result: 1.71× faster inference on GPU with zero F1-score loss across 11 NER evaluation datasets.

Model: `binga/gliner_small_v2.1-optimized-gpu`
## What is this?

This is an optimized variant of the original `urchade/gliner_small-v2.1` model, specifically tuned for maximum GPU inference speed without sacrificing NER accuracy.
## Optimizations Applied

| Technique | Speedup | F1 Impact | Notes |
|---|---|---|---|
| FP16 (half-precision) | ~1.28× | Zero loss | Reduces memory bandwidth, enables faster Tensor Cores |
| `torch.compile(mode="max-autotune")` | ~1.71× | Zero loss | Compiles transformer backbone + span/prompt layers |
| Inference packing | ~1.84× (batch throughput) | Zero loss | Packs variable-length sequences for better GPU utilization |
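Inference packing amounts to bin-packing variable-length inputs so each padded batch wastes fewer token slots. A minimal first-fit sketch of the idea (the function and its structure are illustrative, not the library's internals):

```python
def pack_sequences(lengths, max_length=384):
    """First-fit packing: group sequence lengths into packs whose total
    token count stays under max_length, reducing wasted padding."""
    bins = []  # each bin is [remaining_capacity, [sequence indices]]
    for i, n in enumerate(lengths):
        for b in bins:
            if b[0] >= n:  # sequence fits in an existing pack
                b[0] -= n
                b[1].append(i)
                break
        else:  # no pack has room; open a new one
            bins.append([max_length - n, [i]])
    return [b[1] for b in bins]

# Six short sequences fit into two 384-token packs instead of six padded rows
packs = pack_sequences([120, 200, 60, 300, 80, 4])
# → [[0, 1, 2, 5], [3, 4]]
```

Fewer, fuller packs mean fewer padded positions per GPU kernel launch, which is where the batch-throughput gain comes from.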
## Recommended Usage

For best latency (single text):

```python
from gliner import GLiNER
import torch

model = GLiNER.from_pretrained("binga/gliner_small_v2.1-optimized-gpu", map_location="cuda")
model.to("cuda")
model.half()  # FP16 – critical for speed

# Optional: compile submodules for additional speed
if hasattr(model, "model"):
    inner = model.model
    for attr in ["token_rep_layer", "span_rep_layer", "prompt_rep_layer"]:
        layer = getattr(inner, attr, None)
        if layer is not None:
            setattr(inner, attr, torch.compile(layer, mode="max-autotune"))

text = "Apple Inc. was founded by Steve Jobs in California."
labels = ["person", "organization", "location"]
entities = model.predict_entities(text, labels, threshold=0.5)
```
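`predict_entities` returns a list of dicts; in the GLiNER library these carry at least `text`, `label`, and `score` keys. A small helper to filter and pretty-print them, shown on hand-written sample data (the scores below are made up, not real model output):

```python
def format_entities(entities, min_score=0.5):
    """Keep entities at or above min_score, rendered as 'text (label)'."""
    kept = [e for e in entities if e["score"] >= min_score]
    return [f"{e['text']} ({e['label']})" for e in kept]

# Sample data shaped like GLiNER's prediction dicts (scores invented)
sample = [
    {"text": "Apple Inc.", "label": "organization", "score": 0.97},
    {"text": "Steve Jobs", "label": "person", "score": 0.95},
    {"text": "California", "label": "location", "score": 0.48},
]
print(format_entities(sample, min_score=0.5))
# ['Apple Inc. (organization)', 'Steve Jobs (person)']
```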
For best throughput (batch processing):

```python
from gliner import InferencePackingConfig

# Enable inference packing (packs variable-length sequences)
model.configure_inference_packing(
    InferencePackingConfig(max_length=384, streams_per_batch=8)
)
results = model.inference(texts, labels, threshold=0.5, batch_size=32)
```
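Independently of any packing configuration, plain fixed-size batching already improves GPU utilization over one-text-at-a-time calls. A generic chunking helper (not part of the `gliner` API, just a convenience for driving a batched prediction loop):

```python
def batched(items, batch_size):
    """Yield successive fixed-size chunks; the last chunk may be smaller."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

texts = [f"document {i}" for i in range(70)]
batch_sizes = [len(b) for b in batched(texts, 32)]
# 70 texts with batch_size=32 → chunks of 32, 32, and 6
```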
## Benchmark Results

Evaluated on 11 diverse NER datasets (CoNLL-2003, OntoNotes 5, BC5CDR, WNUT-2017, TweetNER7, MIT Movie, FIN, CrossNER AI/Literature/Science, WikiNeural):

| Dataset | Samples | Entity Types | Baseline F1 | Optimized F1 | Speedup |
|---|---|---|---|---|---|
| conll2003 | 3,453 | 4 | 0.5483 | 0.5481 | 1.83× |
| ontonotes5 | 8,262 | 18 | 0.2797 | 0.2797 | 1.75× |
| bc5cdr | 5,865 | 2 | 0.6592 | 0.6591 | 1.73× |
| wnut2017 | 1,287 | 6 | 0.4255 | 0.4252 | 1.75× |
| tweetner7 | 3,383 | 7 | 0.2829 | 0.2828 | 1.73× |
| mit_movie | 1,953 | 12 | 0.5183 | 0.5183 | 1.80× |
| fin | 305 | 4 | 0.2906 | 0.2906 | 1.25× |
| crossner_ai | 431 | 14 | 0.5000 | 0.5002 | 1.71× |
| crossner_literature | 416 | 12 | 0.6444 | 0.6444 | 1.73× |
| crossner_science | 543 | 17 | 0.6330 | 0.6332 | 1.75× |
| wikineural | 3,000 | 16 | 0.5465 | 0.5462 | 1.67× |
| **AVERAGE** | 28,358 | – | 0.4844 | 0.4844 | 1.71× |
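For context, span-level NER F1 like the scores above is conventionally computed micro-averaged over exact (start, end, label) matches. A minimal sketch of that metric on toy spans (dataset loading and prediction omitted):

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over exact (start, end, label) span matches."""
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)   # spans predicted exactly right
    fp = len(pred_set - gold_set)   # predicted spans with no gold match
    fn = len(gold_set - pred_set)   # gold spans the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [(0, 10, "organization"), (26, 36, "person"), (40, 50, "location")]
pred = [(0, 10, "organization"), (26, 36, "person"), (40, 50, "person")]
micro_f1(gold, pred)  # 2 of 3 spans match exactly → F1 = 2/3
```

Note that a span with the right boundaries but the wrong label counts as both a false positive and a false negative, which is why mislabeling is penalized twice.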
## Performance Guarantee

- F1 difference: 0.0000 (zero loss, within measurement noise)
- All 11 datasets: no statistically significant performance degradation
- Zero-shot NER: maintains the same generalization capability
## Hardware Requirements

- GPU: NVIDIA GPU with Tensor Cores (T4, A10, A100, H100 recommended)
- VRAM: ~1.5 GB for FP16 inference (vs. ~3 GB FP32)
- CUDA: 11.8+ or 12.x
- PyTorch: 2.0+ (for `torch.compile` support)
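The PyTorch 2.0+ requirement can be checked programmatically before enabling compilation. A small helper (the function name is illustrative) that compares a `torch.__version__`-style string against a minimum release:

```python
def meets_min_version(version, minimum=(2, 0)):
    """True if a 'major.minor.patch[+local]' version string is >= minimum."""
    release = version.split("+")[0]  # drop local tags like '+cu121'
    major, minor = (int(p) for p in release.split(".")[:2])
    return (major, minor) >= minimum

meets_min_version("2.1.0+cu121")  # True: torch.compile is available
meets_min_version("1.13.1")      # False: stay in eager mode
```

In practice you would gate the optional `torch.compile` calls from the usage section on `meets_min_version(torch.__version__)`.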
## Model Details

- Base model: `microsoft/deberta-v3-small` (6 layers, 768 hidden)
- Architecture: uni-encoder span-based NER
- Parameters: ~166M (same as original)
- Max length: 384 tokens
- Max entity types: 25 per inference call
- Max span width: 12 words
- License: Apache-2.0
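The "max span width" limit matters because span-based NER scores every candidate word span up to that width against each entity-type prompt. A minimal sketch of the candidate-span enumeration (the model itself does this vectorized; this only shows the combinatorics):

```python
def candidate_spans(num_words, max_width=12):
    """All (start, end) word spans of 1..max_width words, end exclusive."""
    return [
        (start, start + width)
        for start in range(num_words)
        for width in range(1, max_width + 1)
        if start + width <= num_words
    ]

# A 10-word sentence (all spans fit under width 12) yields every span:
len(candidate_spans(10))  # 10 + 9 + ... + 1 = 55 candidates
```

Capping the width at 12 keeps the candidate count roughly linear in sentence length instead of quadratic, which is part of why inference stays fast.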
## Citation

Original GLiNER paper:

```bibtex
@inproceedings{zaratiana2024gliner,
  title={GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer},
  author={Zaratiana, Urchade and Tomeh, Nadi and Holat, Pierre and Charnois, Thierry},
  booktitle={NAACL},
  year={2024}
}
```
## Acknowledgments

This optimized model is based on the original `urchade/gliner_small-v2.1` by Urchade Zaratiana et al. All credit for the model architecture and training goes to the original authors.