---
language:
  - es
license: apache-2.0
base_model: emilyalsentzer/Bio_ClinicalBERT
tags:
  - token-classification
  - ner
  - pii
  - pii-detection
  - de-identification
  - privacy
  - healthcare
  - medical
  - clinical
  - phi
  - spanish
  - pytorch
  - transformers
  - openmed
pipeline_tag: token-classification
library_name: transformers
metrics:
  - f1
  - precision
  - recall
model-index:
  - name: OpenMed-PII-Spanish-BioClinicalBERT-110M-v1
    results:
      - task:
          type: token-classification
          name: Named Entity Recognition
        dataset:
          name: AI4Privacy (Spanish subset)
          type: ai4privacy/pii-masking-400k
          split: test
        metrics:
          - type: f1
            value: 0.8411
            name: F1 (micro)
          - type: precision
            value: 0.8342
            name: Precision
          - type: recall
            value: 0.8481
            name: Recall
widget:
  - text: >-
      Dr. Carlos García (DNI: 12345678A) puede ser contactado en
      carlos.garcia@hospital.es o al +34 612 345 678. Vive en Calle Gran Vía
      25, 28013 Madrid.
    example_title: Clinical Note with PII (Spanish)
---
OpenMed-PII-Spanish-BioClinicalBERT-110M-v1
Spanish PII Detection Model | 110M Parameters | Open Source
Model Description
OpenMed-PII-Spanish-BioClinicalBERT-110M-v1 is a transformer-based token classification model fine-tuned for Personally Identifiable Information (PII) detection in Spanish text. This model identifies and classifies 54 types of sensitive information including names, addresses, social security numbers, medical record numbers, and more.
Key Features
Spanish-Optimized: Specifically trained on Spanish text for optimal performance
High Accuracy: Achieves a micro-averaged F1 of 0.8411 (precision 0.8342, recall 0.8481) on the Spanish subset of AI4Privacy
Comprehensive Coverage: Detects 54 entity types spanning personal, financial, medical, and contact information
Privacy-Focused: Designed for de-identification and compliance with GDPR and other privacy regulations
Production-Ready: Optimized for real-world text processing pipelines
Performance
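Evaluated on the test split of the Spanish subset of the AI4Privacy dataset (ai4privacy/pii-masking-400k):

| Metric | Value |
|---|---|
| F1 (micro) | 0.8411 |
| Precision | 0.8342 |
| Recall | 0.8481 |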
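As a quick sanity check on the reported numbers, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the precision (0.8342) and recall (0.8481) given in the card metadata, assuming those two figures are micro-averaged like the reported F1 (the metadata only labels F1 as micro).

```python
# Precision and recall as reported in the model card metadata.
# Assumption: both are micro-averaged, like the reported F1.
precision = 0.8342
recall = 0.8481

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.4f}")  # F1 = 0.8411
```

The result agrees with the reported F1 of 0.8411 to four decimal places.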
This model detects 54 PII entity types organized into categories:
Identifiers (22 types)
| Entity | Description |
|---|---|
| ACCOUNTNAME | Account name |
| BANKACCOUNT | Bank account |
| BIC | BIC |
| BITCOINADDRESS | Bitcoin address |
| CREDITCARD | Credit card |
| CREDITCARDISSUER | Credit card issuer |
| CVV | CVV |
| ETHEREUMADDRESS | Ethereum address |
| IBAN | IBAN |
| IMEI | IMEI |
| ... | and 12 more |
Personal Info (11 types)
| Entity | Description |
|---|---|
| AGE | Age |
| DATEOFBIRTH | Date of birth |
| EYECOLOR | Eye color |
| FIRSTNAME | First name |
| GENDER | Gender |
| HEIGHT | Height |
| LASTNAME | Last name |
| MIDDLENAME | Middle name |
| OCCUPATION | Occupation |
| PREFIX | Prefix |
| ... | and 1 more |
Contact Info (2 types)
| Entity | Description |
|---|---|
| EMAIL | Email |
| PHONE | Phone |
Location (9 types)
| Entity | Description |
|---|---|
| BUILDINGNUMBER | Building number |
| CITY | City |
| COUNTY | County |
| GPSCOORDINATES | GPS coordinates |
| ORDINALDIRECTION | Ordinal direction |
| SECONDARYADDRESS | Secondary address |
| STATE | State |
| STREET | Street |
| ZIPCODE | ZIP code |
Organization (3 types)
| Entity | Description |
|---|---|
| JOBDEPARTMENT | Job department |
| JOBTITLE | Job title |
| ORGANIZATION | Organization |
Financial (5 types)
| Entity | Description |
|---|---|
| AMOUNT | Amount |
| CURRENCY | Currency |
| CURRENCYCODE | Currency code |
| CURRENCYNAME | Currency name |
| CURRENCYSYMBOL | Currency symbol |
Temporal (2 types)
| Entity | Description |
|---|---|
| DATE | Date |
| TIME | Time |
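The category listings above can be used to keep only certain kinds of entities, for example when a pipeline should act on contact details but leave dates alone. The following is a minimal sketch: `CATEGORY_LABELS` and `entities_in_categories` are hypothetical names, the mapping covers only the categories listed in full (Identifiers and Personal Info are partly elided), and it assumes the pipeline reports these labels unchanged in `entity_group`.

```python
# Labels copied from the category listings in this card. Identifiers and
# Personal Info are left out because the card lists them only partially.
CATEGORY_LABELS = {
    "Contact Info": {"EMAIL", "PHONE"},
    "Location": {
        "BUILDINGNUMBER", "CITY", "COUNTY", "GPSCOORDINATES", "ORDINALDIRECTION",
        "SECONDARYADDRESS", "STATE", "STREET", "ZIPCODE",
    },
    "Organization": {"JOBDEPARTMENT", "JOBTITLE", "ORGANIZATION"},
    "Financial": {"AMOUNT", "CURRENCY", "CURRENCYCODE", "CURRENCYNAME", "CURRENCYSYMBOL"},
    "Temporal": {"DATE", "TIME"},
}


def entities_in_categories(entities, categories):
    """Keep entities whose label belongs to one of the named categories."""
    wanted = set().union(*(CATEGORY_LABELS[name] for name in categories))
    return [ent for ent in entities if ent["entity_group"] in wanted]


# Hand-written stand-in for pipeline output (not real model predictions).
sample_entities = [
    {"entity_group": "EMAIL", "word": "maria.lopez@email.es"},
    {"entity_group": "CITY", "word": "Madrid"},
    {"entity_group": "DATE", "word": "15/03/1985"},
]
contact_and_location = entities_in_categories(sample_entities, ["Contact Info", "Location"])
print([ent["entity_group"] for ent in contact_and_location])
```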
Usage
Quick Start
from transformers import pipeline

# Load the PII detection pipeline
ner = pipeline("ner", model="OpenMed/OpenMed-PII-Spanish-BioClinicalBERT-110M-v1", aggregation_strategy="simple")

text = """Paciente María López (nacida el 15/03/1985, DNI: 87654321B) fue atendida hoy.
Contacto: maria.lopez@email.es, Teléfono: +34 612 345 678.
Dirección: Calle Serrano 42, 28001 Madrid."""

entities = ner(text)
for entity in entities:
    print(f"{entity['entity_group']}: {entity['word']} (score: {entity['score']:.3f})")
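Important (Accent Handling): This model was trained on text without diacritical marks (accents). For best results, strip accents from your input before inference. When the input uses precomposed accented characters (NFC form), each accented character is replaced by a single base character, so character offsets are preserved and you can map entities back to the original text.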
import unicodedata
def strip_accents(text: str) -> str:
    """Remove diacritical marks while keeping the base characters."""
    nfc = unicodedata.normalize("NFC", text)
    nfd = unicodedata.normalize("NFD", nfc)
    # Drop combining marks (Unicode category "Mn")
    stripped = "".join(ch for ch in nfd if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)

text = strip_accents(text)  # call before passing to the pipeline
entities = ner(text)
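Mapping entities back to the accented original could look like the sketch below. It is only an illustration: `restore_original_spans` is a hypothetical helper, the entity list is hand-written rather than real model output, and the approach assumes that accent stripping leaves the text length unchanged (which the helper checks before reusing the offsets).

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose, drop combining marks (category "Mn"), then recompose."""
    nfd = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in nfd if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)


def restore_original_spans(original: str, entities: list) -> list:
    """Re-read each entity's text from the accented original (hypothetical helper)."""
    normalized = unicodedata.normalize("NFC", original)
    # Offsets are only reusable if stripping did not change the length.
    if len(strip_accents(normalized)) != len(normalized):
        raise ValueError("Accent stripping changed the text length; offsets would not align.")
    return [{**ent, "word": normalized[ent["start"]:ent["end"]]} for ent in entities]


original = "Paciente María López"
# Hand-written stand-in for pipeline output on strip_accents(original).
entities = [
    {"entity_group": "FIRSTNAME", "start": 9, "end": 14, "word": "Maria"},
    {"entity_group": "LASTNAME", "start": 15, "end": 20, "word": "Lopez"},
]
for ent in restore_original_spans(original, entities):
    print(ent["entity_group"], ent["word"])
```

The offsets refer to the NFC-normalized original, so text stored in another normalization form should be normalized once and kept in that form for any later slicing.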
De-identification Example
def redact_pii(text, entities, placeholder=None):
    """Replace detected PII with placeholders.

    Uses the entity label (e.g. [EMAIL]) unless a fixed placeholder is given.
    """
    # Sort entities by start position (descending) to preserve offsets
    sorted_entities = sorted(entities, key=lambda x: x['start'], reverse=True)
    redacted = text
    for ent in sorted_entities:
        replacement = placeholder or f"[{ent['entity_group']}]"
        redacted = redacted[:ent['start']] + replacement + redacted[ent['end']:]
    return redacted
# Apply de-identification
redacted_text = redact_pii(text, entities)
print(redacted_text)
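Because each aggregated entity carries a `score`, redaction can be limited to predictions above a confidence threshold. The sketch below shows one way to do this; `redact_confident` and the default `min_score=0.5` are illustrative assumptions, not values recommended by the model authors, and the entity list is hand-written rather than real model output.

```python
def redact_confident(text, entities, min_score=0.5):
    """Redact only entities whose score reaches min_score (hypothetical helper)."""
    kept = [ent for ent in entities if ent["score"] >= min_score]
    # Replace from the end of the text so earlier offsets stay valid.
    for ent in sorted(kept, key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text


sample = "Contacto: maria.lopez@email.es"
email = "maria.lopez@email.es"
start = sample.index(email)
# Hand-written stand-in for pipeline output, with one low-confidence entry.
sample_entities = [
    {"entity_group": "CITY", "start": 0, "end": 8, "score": 0.21},
    {"entity_group": "EMAIL", "start": start, "end": start + len(email), "score": 0.98},
]
print(redact_confident(sample, sample_entities))  # Contacto: [EMAIL]
```

For de-identification a missed entity is usually costlier than an unnecessary redaction, so a low threshold (or none at all) may be the safer starting point.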
Batch Processing
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

model_name = "OpenMed/OpenMed-PII-Spanish-BioClinicalBERT-110M-v1"
model = AutoModelForTokenClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

texts = [
    "Paciente María López (nacida el 15/03/1985, DNI: 87654321B) fue atendida hoy.",
    "Contacto: maria.lopez@email.es, Teléfono: +34 612 345 678.",
]

# Tokenize the batch, padding to the longest text
inputs = tokenizer(texts, return_tensors='pt', padding=True, truncation=True)

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# Pick the highest-scoring label ID for each token
predictions = torch.argmax(outputs.logits, dim=-1)
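The `predictions` tensor holds label IDs, one per token, including padding positions. A minimal sketch of turning those IDs into readable labels follows; `decode_predictions` is a hypothetical helper, and the toy labels (`O`, `B-FIRSTNAME`) are assumptions for illustration, since the actual label names come from `model.config.id2label`.

```python
def decode_predictions(token_lists, label_ids, attention_mask, id2label):
    """Pair each non-padding token with its predicted label name."""
    decoded = []
    for tokens, ids, mask in zip(token_lists, label_ids, attention_mask):
        decoded.append(
            [(tok, id2label[i]) for tok, i, m in zip(tokens, ids, mask) if m == 1]
        )
    return decoded


# Toy inputs standing in for real tokenizer/model output.
toy_tokens = [["[CLS]", "maria", "[SEP]", "[PAD]"]]
toy_label_ids = [[0, 1, 0, 0]]
toy_mask = [[1, 1, 1, 0]]
toy_id2label = {0: "O", 1: "B-FIRSTNAME"}  # hypothetical label names
print(decode_predictions(toy_tokens, toy_label_ids, toy_mask, toy_id2label))

# With the objects from the batch example, the call would be roughly:
# token_lists = [tokenizer.convert_ids_to_tokens(ids) for ids in inputs["input_ids"].tolist()]
# decoded = decode_predictions(
#     token_lists,
#     predictions.tolist(),
#     inputs["attention_mask"].tolist(),
#     model.config.id2label,
# )
```

Special tokens such as `[CLS]` and `[SEP]` have an attention mask of 1, so they remain in the output and may need separate filtering. The result is per-token (subword) labels, unlike the pipeline's aggregated entities.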