
GLiNER Small v2.1 — GPU-Optimized Inference

Optimization result: 1.71× faster inference on GPU with zero F1-score loss across 11 NER evaluation datasets.

Model: binga/gliner_small_v2.1-optimized-gpu

What is this?

This is an optimized variant of the original urchade/gliner_small-v2.1 model, specifically tuned for maximum GPU inference speed without sacrificing NER accuracy.

Optimizations Applied

| Technique | Speedup | F1 impact | Notes |
|---|---|---|---|
| FP16 (half-precision) | ~1.28× | Zero loss | Reduces memory bandwidth, enables faster Tensor Core math |
| torch.compile(mode="max-autotune") | ~1.71× | Zero loss | Compiles the transformer backbone plus span/prompt layers |
| Inference packing | ~1.84× (batch throughput) | Zero loss | Packs variable-length sequences for better GPU utilization |
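
The packing step is the least standard of the three techniques. As a rough illustration (a sketch only; the actual packing logic shipped with this model may differ), greedy first-fit packing of tokenized sequence lengths into a 384-token budget looks like this:

```python
def pack_sequences(lengths, max_length=384):
    """Greedy first-fit-decreasing packing: group variable-length
    sequences into bins whose total token count stays within
    max_length, so each GPU batch row carries less padding."""
    bins = []  # each bin: [total_length, [sequence indices]]
    for idx, n in sorted(enumerate(lengths), key=lambda p: -p[1]):
        for b in bins:
            if b[0] + n <= max_length:
                b[0] += n
                b[1].append(idx)
                break
        else:
            bins.append([n, [idx]])
    return [b[1] for b in bins]
```

Five sequences of lengths 300, 80, 120, 60, and 350 tokens pack into three bins instead of five padded rows, which is where the batch-throughput gain comes from.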

Recommended Usage

For best latency (single text):

```python
from gliner import GLiNER
import torch

model = GLiNER.from_pretrained("binga/gliner_small_v2.1-optimized-gpu", map_location="cuda")
model.to("cuda")
model.half()  # FP16 — critical for speed

# Optional: compile submodules for additional speed
if hasattr(model, "model"):
    inner = model.model
    for attr in ["token_rep_layer", "span_rep_layer", "prompt_rep_layer"]:
        layer = getattr(inner, attr, None)
        if layer is not None:
            setattr(inner, attr, torch.compile(layer, mode="max-autotune"))

text = "Apple Inc. was founded by Steve Jobs in California."
labels = ["person", "organization", "location"]
entities = model.predict_entities(text, labels, threshold=0.5)
```
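
In the gliner library, predict_entities returns a list of span dicts; assuming the usual "text", "label", and "score" keys, a small helper (not part of this repo) can group the output by label:

```python
def summarize_entities(entities, min_score=0.5):
    """Group predicted spans by label, keeping only confident ones.
    Assumes each entity is a dict with "text", "label" and "score"
    keys, as returned by GLiNER's predict_entities."""
    by_label = {}
    for ent in entities:
        if ent.get("score", 1.0) >= min_score:
            by_label.setdefault(ent["label"], []).append(ent["text"])
    return by_label
```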

For best throughput (batch processing):

```python
from gliner import InferencePackingConfig

# Enable inference packing (packs variable-length sequences)
model.configure_inference_packing(
    InferencePackingConfig(max_length=384, streams_per_batch=8)
)

results = model.inference(texts, labels, threshold=0.5, batch_size=32)
```
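
To verify the speedup on your own hardware, a minimal timing harness can help. Here `run` stands in for any inference callable; passing torch.cuda.synchronize as `synchronize` is essential when timing GPU work, since kernel launches are asynchronous:

```python
import time

def measure_latency(run, warmup=3, iters=10, synchronize=None):
    """Median wall-clock latency of `run` in milliseconds.
    Pass synchronize=torch.cuda.synchronize when timing GPU code,
    otherwise async kernel launches make the numbers meaningless."""
    for _ in range(warmup):
        run()
    timings = []
    for _ in range(iters):
        start = time.perf_counter()
        run()
        if synchronize is not None:
            synchronize()
        timings.append((time.perf_counter() - start) * 1000.0)
    return sorted(timings)[len(timings) // 2]
```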

Benchmark Results

Evaluated on 11 diverse NER datasets (CoNLL-2003, OntoNotes 5, BC5CDR, WNUT-2017, TweetNER7, MIT Movie, Fin, CrossNER AI/Literature/Science, WikiNeural):

| Dataset | Samples | Entity types | Baseline F1 | Optimized F1 | Speedup |
|---|---|---|---|---|---|
| conll2003 | 3,453 | 4 | 0.5483 | 0.5481 | 1.83× |
| ontonotes5 | 8,262 | 18 | 0.2797 | 0.2797 | 1.75× |
| bc5cdr | 5,865 | 2 | 0.6592 | 0.6591 | 1.73× |
| wnut2017 | 1,287 | 6 | 0.4255 | 0.4252 | 1.75× |
| tweetner7 | 3,383 | 7 | 0.2829 | 0.2828 | 1.73× |
| mit_movie | 1,953 | 12 | 0.5183 | 0.5183 | 1.80× |
| fin | 305 | 4 | 0.2906 | 0.2906 | 1.25× |
| crossner_ai | 431 | 14 | 0.5000 | 0.5002 | 1.71× |
| crossner_literature | 416 | 12 | 0.6444 | 0.6444 | 1.73× |
| crossner_science | 543 | 17 | 0.6330 | 0.6332 | 1.75× |
| wikineural | 3,000 | 16 | 0.5465 | 0.5462 | 1.67× |
| AVERAGE | 28,358 | — | 0.4844 | 0.4844 | 1.71× |
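
The AVERAGE F1 is the unweighted macro-average of the per-dataset F1 columns, which can be checked directly from the table:

```python
baseline_f1 = [0.5483, 0.2797, 0.6592, 0.4255, 0.2829, 0.5183,
               0.2906, 0.5000, 0.6444, 0.6330, 0.5465]
optimized_f1 = [0.5481, 0.2797, 0.6591, 0.4252, 0.2828, 0.5183,
                0.2906, 0.5002, 0.6444, 0.6332, 0.5462]

macro_baseline = sum(baseline_f1) / len(baseline_f1)
macro_optimized = sum(optimized_f1) / len(optimized_f1)
# macro_baseline rounds to 0.4844; the optimized mean differs by well
# under 1e-4, i.e. within rounding of the reported figures.
```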

Performance Guarantee

  • F1 difference: 0.0000 averaged across datasets (zero loss, within measurement noise)
  • All 11 datasets: No statistically significant performance degradation
  • Zero-shot NER: Maintains the same generalization capability

Hardware Requirements

  • GPU: NVIDIA GPU with Tensor Cores (T4, A10, A100, H100 recommended)
  • VRAM: ~1.5GB for FP16 inference (vs ~3GB FP32)
  • CUDA: 11.8+ or 12.x
  • PyTorch: 2.0+ (for torch.compile support)
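
The requirements above can be checked programmatically. A small sketch, with torch imported lazily so the check degrades gracefully when PyTorch is absent (Tensor Cores require CUDA compute capability 7.0, i.e. Volta, or newer):

```python
def check_fp16_readiness():
    """Report whether this environment meets the hardware requirements
    for the FP16 + torch.compile fast path."""
    try:
        import torch
    except ImportError:
        return "pytorch not installed"
    if not hasattr(torch, "compile"):
        return "pytorch < 2.0: torch.compile unavailable"
    if not torch.cuda.is_available():
        return "no CUDA device: FP16 Tensor Core path unavailable"
    major, minor = torch.cuda.get_device_capability()
    if (major, minor) < (7, 0):
        return "GPU lacks Tensor Cores (compute capability < 7.0)"
    return "ok"
```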

Model Details

  • Base model: microsoft/deberta-v3-small (6 layers, 768 hidden)
  • Architecture: Uni-encoder span-based NER
  • Parameters: ~166M (same as original)
  • Max length: 384 tokens
  • Max entity types: 25 per inference call
  • Max span width: 12 words
  • License: Apache-2.0
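
Because each call accepts at most 25 entity types, larger label sets must be split across calls and the spans merged afterwards. A hedged sketch, where `predict_fn` stands in for model.predict_entities:

```python
def predict_many_labels(predict_fn, text, labels, max_types=25, threshold=0.5):
    """Split a large label set into chunks of at most max_types,
    run one prediction per chunk, and concatenate the spans."""
    entities = []
    for i in range(0, len(labels), max_types):
        chunk = labels[i:i + max_types]
        entities.extend(predict_fn(text, chunk, threshold=threshold))
    return entities
```

Note that scores are not calibrated across calls, so overlapping spans predicted under different label chunks may need deduplication downstream.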

Citation

Original GLiNER paper:

@inproceedings{zaratiana2024gliner,
  title={GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer},
  author={Zaratiana, Urchade and Tomeh, Nadi and Holat, Pierre and Charnois, Thierry},
  booktitle={NAACL},
  year={2024}
}

Acknowledgments

This optimized model is based on the original urchade/gliner_small-v2.1 by Urchade Zaratiana et al. All credit for the model architecture and training goes to the original authors.
