ProtSent ESM-2 35M

A contrastively fine-tuned ESM-2 35M protein language model that produces fixed-length embeddings in which biologically similar proteins lie close together.

This is the best-performing 35M variant. It was trained without hard negatives, which improved 20 of 23 downstream tasks over the baseline, compared with 16 of 23 for the full model.

Paper: ProtSent: Protein Sentence Transformers
Code: github.com/oriel9p/ProtSent
150M model: oriel9p/protsent-esm2-150M

Training

ProtSent applies contrastive fine-tuning using the SentenceTransformers framework with MultipleNegativesRankingLoss (MNRL) and CoSENT on ESM-2 backbones.

This variant was trained on four complementary data sources with round-robin sampling:

| Dataset | Size | Loss |
|---|---|---|
| Pfam families (linclust@70%) | 32.9M domains | MNRL |
| AlphaFold DB structural pairs (Foldseek-grouped) | 133.9M sequences | MNRL |
| STRING-DB v12 PPI (score >= 400) | 36.5M pairs | MNRL |
| ProteinGym DMS / clinical | 2.2M pairs | CoSENT |
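Round-robin sampling over several sources can be sketched as below. This is only an illustration of the idea, not the training code: the function name, the toy batch labels, and the stopping rule (stop as soon as any source runs out) are assumptions.

```python
from itertools import cycle


def round_robin_batches(sources):
    """Yield (source_name, batch) pairs, visiting the sources in a fixed cycle.

    Assumption: iteration stops as soon as any source is exhausted.
    """
    iterators = {name: iter(batches) for name, batches in sources.items()}
    for name in cycle(iterators):
        try:
            yield name, next(iterators[name])
        except StopIteration:
            return


# Toy stand-ins for the four data sources; each "batch" is just a label here.
sources = {
    "pfam": ["pfam_0", "pfam_1"],
    "afdb": ["afdb_0", "afdb_1"],
    "string": ["string_0", "string_1"],
    "proteingym": ["pg_0", "pg_1"],
}

schedule = [name for name, _ in round_robin_batches(sources)]
print(schedule)
```

With this scheme every source contributes the same number of batches regardless of its size, so the smallest source is not drowned out by the largest.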

Key hyperparameters: AdamW optimizer, cosine LR schedule, batch size 1024, temperature 0.05, dropout 0.1. Trained on a single NVIDIA RTX 6000 Ada 48GB in ~3-4 hours.
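To show where the temperature enters an in-batch-negatives loss such as MNRL, here is a minimal pure-Python sketch. It is not the SentenceTransformers implementation. The function names and toy vectors are hypothetical, and dividing cosine similarity by the temperature (equivalent to a scale factor of 20 for a temperature of 0.05) is an assumption about how the listed value is applied.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mnrl_loss(anchors, positives, temperature=0.05):
    """Mean cross-entropy where anchor i must rank positive i above all
    other positives in the batch (which act as negatives)."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(x - max_logit) for x in logits)
        )
        total += log_sum_exp - logits[i]
    return total / len(anchors)


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # positive i is close to anchor i
swapped = [[0.1, 0.9], [0.9, 0.1]]   # positives assigned to the wrong anchors

loss_aligned = mnrl_loss(anchors, aligned)
loss_swapped = mnrl_loss(anchors, swapped)
print(loss_aligned, loss_swapped)
```

A lower temperature sharpens the softmax, so mismatched pairs are penalized more heavily. A larger batch supplies more in-batch negatives per anchor.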

Quick Start

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("oriel9p/protsent-esm2-35M")

sequences = [
    "MKTLLLTLVVVTIVCLDLGYT",
    "MKTLLLTLVVVTIVCLDLGYN",  # similar: differs from the first only at the last residue
    "AGWYRSPQEGLKPVDTFKDIV",  # different: unrelated sequence
]

embeddings = model.encode(sequences)

Compute similarity:

from sentence_transformers.util import cos_sim

# Cosine similarity of the first sequence to each of the other two
similarities = cos_sim(embeddings[0], embeddings[1:])
print(similarities)

Results

Embeddings were evaluated with a KNN probe (k=3, Euclidean distance) on 23 downstream tasks. This variant (w/o hard negatives) improves 20 of 23 tasks over baseline ESM-2 35M, with a mean relative improvement of +7.9%.

Selected highlights vs. baseline ESM-2 35M:

| Task | Metric | Baseline | ProtSent | Change |
|---|---|---|---|---|
| Remote Homology (Fold) | F1 Macro | .223 | .313 | +40.5% |
| RhlA Enzyme Mutations | Spearman | .236 | .418 | +77.2% |
| Beta-lactamase (PEER) | Spearman | .670 | .793 | +18.5% |
| Fluorescence (TAPE) | Spearman | .490 | .567 | +15.6% |
| PPI (Bernett) | AUC | .560 | .589 | +5.3% |
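For readers unfamiliar with KNN probing, the classification case can be sketched as follows: the embeddings stay frozen and a query is labeled by majority vote among its k nearest training embeddings. This is a minimal illustration rather than the evaluation code used for the numbers above; the function name and the two-dimensional toy vectors are hypothetical stand-ins for real embeddings.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Label a query by majority vote among its k nearest training
    embeddings, using Euclidean distance."""
    nearest = sorted(
        range(len(train_X)),
        key=lambda i: math.dist(train_X[i], query),
    )[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy 2-D vectors standing in for protein embeddings.
train_X = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
           [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]]
train_y = ["fold_a", "fold_a", "fold_a", "fold_b", "fold_b", "fold_b"]

pred_near_a = knn_predict(train_X, train_y, [0.05, 0.05])
pred_near_b = knn_predict(train_X, train_y, [0.95, 0.95])
print(pred_near_a, pred_near_b)
```

Because the probe has no trainable parameters, its score depends only on the geometry of the embedding space. For regression tasks a KNN probe would typically average the neighbors' target values instead of voting.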

Intended Use

General-purpose protein embeddings for downstream tasks including classification, regression, retrieval, clustering, and similarity search. The embeddings capture evolutionary, structural, and functional relationships.
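As one example of the retrieval use case, the sketch below ranks a small database of embeddings by cosine similarity to a query. The vectors are hypothetical placeholders for the output of `model.encode`, and the helper names are illustrative only.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (
        math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    )


def top_k(query, database, k=2):
    """Return the k (name, score) entries most similar to the query, best first."""
    scored = ((name, cosine(query, vec)) for name, vec in database.items())
    return heapq.nlargest(k, scored, key=lambda item: item[1])


# Placeholder vectors; in practice these would come from model.encode(...).
database = {
    "protein_a": [0.9, 0.1, 0.0],
    "protein_b": [0.0, 1.0, 0.0],
    "protein_c": [0.8, 0.2, 0.1],
}

hits = top_k([1.0, 0.0, 0.0], database, k=2)
print(hits)
```

For large databases an approximate nearest-neighbor index would usually replace this exhaustive scan.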

Citation

@article{ofer2026protsent,
  title={ProtSent: Protein Sentence Transformers},
  author={Ofer, Dan and Perets, Oriel and Linial, Michal and Rappoport, Nadav},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2026}
}