# GLiNER Small v2.1 – GPU-Optimized Inference

Optimization result: 1.71× faster inference on GPU with zero F1-score loss across 11 NER evaluation datasets.

Model: `binga/gliner_small_v2.1-optimized-gpu`
## What is this?

This is an optimized variant of the original `urchade/gliner_small-v2.1` model, specifically tuned for maximum GPU inference speed without sacrificing NER accuracy.
## Optimizations Applied

| Technique | Speedup | F1 Impact | Notes |
|---|---|---|---|
| FP16 (half-precision) | ~1.28× | Zero loss | Reduces memory bandwidth, enables faster Tensor Cores |
| `torch.compile(mode="max-autotune")` | ~1.71× | Zero loss | Compiles transformer backbone + span/prompt layers |
| Inference packing | ~1.84× (batch throughput) | Zero loss | Packs variable-length sequences for better GPU utilization |
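Inference packing amounts to bin-packing variable-length inputs so each padded batch wastes fewer token slots. A minimal first-fit sketch of the idea (the function and its structure are illustrative, not the library's internals):

```python
def pack_sequences(lengths, max_length=384):
    """First-fit packing: group sequence lengths into packs whose total
    token count stays under max_length, reducing wasted padding."""
    bins = []  # each bin is [remaining_capacity, [sequence indices]]
    for i, n in enumerate(lengths):
        for b in bins:
            if b[0] >= n:  # sequence fits in an existing pack
                b[0] -= n
                b[1].append(i)
                break
        else:  # no pack has room; open a new one
            bins.append([max_length - n, [i]])
    return [b[1] for b in bins]

# Six short sequences fit into two 384-token packs instead of six padded rows
packs = pack_sequences([120, 200, 60, 300, 80, 4])
# → [[0, 1, 2, 5], [3, 4]]
```

Fewer, fuller packs mean fewer padded positions per GPU kernel launch, which is where the batch-throughput gain comes from.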
## Recommended Usage

For best latency (single text):

```python
from gliner import GLiNER
import torch

model = GLiNER.from_pretrained("binga/gliner_small_v2.1-optimized-gpu", map_location="cuda")
model.to("cuda")
model.half()  # FP16 – critical for speed

# Optional: compile submodules for additional speed
if hasattr(model, "model"):
    inner = model.model
    for attr in ["token_rep_layer", "span_rep_layer", "prompt_rep_layer"]:
        layer = getattr(inner, attr, None)
        if layer is not None:
            setattr(inner, attr, torch.compile(layer, mode="max-autotune"))

text = "Apple Inc. was founded by Steve Jobs in California."
labels = ["person", "organization", "location"]
entities = model.predict_entities(text, labels, threshold=0.5)
```
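`predict_entities` returns a list of dicts; in the GLiNER library these carry at least `text`, `label`, and `score` keys. A small helper to filter and pretty-print them, shown on hand-written sample data (the scores below are made up, not real model output):

```python
def format_entities(entities, min_score=0.5):
    """Keep entities at or above min_score, rendered as 'text (label)'."""
    kept = [e for e in entities if e["score"] >= min_score]
    return [f"{e['text']} ({e['label']})" for e in kept]

# Sample data shaped like GLiNER's prediction dicts (scores invented)
sample = [
    {"text": "Apple Inc.", "label": "organization", "score": 0.97},
    {"text": "Steve Jobs", "label": "person", "score": 0.95},
    {"text": "California", "label": "location", "score": 0.48},
]
print(format_entities(sample, min_score=0.5))
# ['Apple Inc. (organization)', 'Steve Jobs (person)']
```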
For best throughput (batch processing):

```python
from gliner import InferencePackingConfig

# Enable inference packing (packs variable-length sequences)
model.configure_inference_packing(
    InferencePackingConfig(max_length=384, streams_per_batch=8)
)
results = model.inference(texts, labels, threshold=0.5, batch_size=32)
```
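Independently of any packing configuration, plain fixed-size batching already improves GPU utilization over one-text-at-a-time calls. A generic chunking helper (not part of the `gliner` API, just a convenience for driving a batched prediction loop):

```python
def batched(items, batch_size):
    """Yield successive fixed-size chunks; the last chunk may be smaller."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

texts = [f"document {i}" for i in range(70)]
batch_sizes = [len(b) for b in batched(texts, 32)]
# 70 texts with batch_size=32 → chunks of 32, 32, and 6
```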
## Benchmark Results

Evaluated on 11 diverse NER datasets (CoNLL-2003, OntoNotes 5, BC5CDR, WNUT-2017, TweetNER7, MIT Movie, FIN, CrossNER AI/Literature/Science, WikiNeural):

| Dataset | Samples | Entity Types | Baseline F1 | Optimized F1 | Speedup |
|---|---|---|---|---|---|
| conll2003 | 3,453 | 4 | 0.5483 | 0.5481 | 1.83× |
| ontonotes5 | 8,262 | 18 | 0.2797 | 0.2797 | 1.75× |
| bc5cdr | 5,865 | 2 | 0.6592 | 0.6591 | 1.73× |
| wnut2017 | 1,287 | 6 | 0.4255 | 0.4252 | 1.75× |
| tweetner7 | 3,383 | 7 | 0.2829 | 0.2828 | 1.73× |
| mit_movie | 1,953 | 12 | 0.5183 | 0.5183 | 1.80× |
| fin | 305 | 4 | 0.2906 | 0.2906 | 1.25× |
| crossner_ai | 431 | 14 | 0.5000 | 0.5002 | 1.71× |
| crossner_literature | 416 | 12 | 0.6444 | 0.6444 | 1.73× |
| crossner_science | 543 | 17 | 0.6330 | 0.6332 | 1.75× |
| wikineural | 3,000 | 16 | 0.5465 | 0.5462 | 1.67× |
| **AVERAGE** | 28,358 | – | 0.4844 | 0.4844 | 1.71× |
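For context, span-level NER F1 like the scores above is conventionally computed micro-averaged over exact (start, end, label) matches. A minimal sketch of that metric on toy spans (dataset loading and prediction omitted):

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over exact (start, end, label) span matches."""
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)   # spans predicted exactly right
    fp = len(pred_set - gold_set)   # predicted spans with no gold match
    fn = len(gold_set - pred_set)   # gold spans the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [(0, 10, "organization"), (26, 36, "person"), (40, 50, "location")]
pred = [(0, 10, "organization"), (26, 36, "person"), (40, 50, "person")]
micro_f1(gold, pred)  # 2 of 3 spans match exactly → F1 = 2/3
```

Note that a span with the right boundaries but the wrong label counts as both a false positive and a false negative, which is why mislabeling is penalized twice.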
## Performance Guarantee

- F1 difference: 0.0000 (zero loss, within measurement noise)
- All 11 datasets: no statistically significant performance degradation
- Zero-shot NER: maintains the same generalization capability
## Hardware Requirements

- GPU: NVIDIA GPU with Tensor Cores (T4, A10, A100, H100 recommended)
- VRAM: ~1.5 GB for FP16 inference (vs. ~3 GB FP32)
- CUDA: 11.8+ or 12.x
- PyTorch: 2.0+ (for `torch.compile` support)
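The PyTorch 2.0+ requirement can be checked programmatically before enabling compilation. A small helper (the function name is illustrative) that compares a `torch.__version__`-style string against a minimum release:

```python
def meets_min_version(version, minimum=(2, 0)):
    """True if a 'major.minor.patch[+local]' version string is >= minimum."""
    release = version.split("+")[0]  # drop local tags like '+cu121'
    major, minor = (int(p) for p in release.split(".")[:2])
    return (major, minor) >= minimum

meets_min_version("2.1.0+cu121")  # True: torch.compile is available
meets_min_version("1.13.1")      # False: stay in eager mode
```

In practice you would gate the optional `torch.compile` calls from the usage section on `meets_min_version(torch.__version__)`.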
## Model Details

- Base model: `microsoft/deberta-v3-small` (6 layers, 768 hidden)
- Architecture: uni-encoder span-based NER
- Parameters: ~166M (same as original)
- Max length: 384 tokens
- Max entity types: 25 per inference call
- Max span width: 12 words
- License: Apache-2.0
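The "max span width" limit matters because span-based NER scores every candidate word span up to that width against each entity-type prompt. A minimal sketch of the candidate-span enumeration (the model itself does this vectorized; this only shows the combinatorics):

```python
def candidate_spans(num_words, max_width=12):
    """All (start, end) word spans of 1..max_width words, end exclusive."""
    return [
        (start, start + width)
        for start in range(num_words)
        for width in range(1, max_width + 1)
        if start + width <= num_words
    ]

# A 10-word sentence (all spans fit under width 12) yields every span:
len(candidate_spans(10))  # 10 + 9 + ... + 1 = 55 candidates
```

Capping the width at 12 keeps the candidate count roughly linear in sentence length instead of quadratic, which is part of why inference stays fast.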
## Citation

Original GLiNER paper:

```bibtex
@inproceedings{zaratiana2024gliner,
  title={GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer},
  author={Zaratiana, Urchade and Tomeh, Nadi and Holat, Pierre and Charnois, Thierry},
  booktitle={NAACL},
  year={2024}
}
```
## Acknowledgments

This optimized model is based on the original `urchade/gliner_small-v2.1` by Urchade Zaratiana et al. All credit for the model architecture and training goes to the original authors.