ProtSent ESM-2 35M

Contrastively fine-tuned ESM-2 35M protein language model, producing fixed-length embeddings where biological similarity maps to embedding proximity.

This is the best-performing 35M variant. It was trained without hard negatives, a setup that improved 20 of 23 downstream tasks over baseline ESM-2 35M, compared with 16 of 23 for the full model.

Paper: ProtSent: Protein Sentence Transformers
Code: github.com/oriel9p/ProtSent
150M model: oriel9p/protsent-esm2-150M

Training

ProtSent applies contrastive fine-tuning using the SentenceTransformers framework with MultipleNegativesRankingLoss (MNRL) and CoSENT on ESM-2 backbones.
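To make the two objectives concrete, here is a minimal pure-Python sketch of what MNRL and CoSENT compute on a toy batch. It is an illustration under stated assumptions, not the training code: the helper names are hypothetical, the temperature of 0.05 is the value listed under the key hyperparameters, and the CoSENT scale of 20 is an assumption chosen to mirror 1/0.05. Actual training would go through the loss classes in `sentence_transformers.losses`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mnrl_loss(anchors, positives, temperature=0.05):
    """In-batch-negatives loss: positives[i] is the target for anchors[i],
    and every other positive in the batch acts as a negative."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        peak = max(logits)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
        losses.append(log_norm - logits[i])  # cross-entropy with label i
    return sum(losses) / len(losses)

def cosent_loss(similarities, labels, scale=20.0):
    """Ranking loss: log(1 + sum exp(scale * (s_low - s_high))) over all
    pairs of examples where the first has a strictly higher label."""
    terms = [
        scale * (s_low - s_high)
        for s_high, y_high in zip(similarities, labels)
        for s_low, y_low in zip(similarities, labels)
        if y_high > y_low
    ]
    if not terms:
        return 0.0
    peak = max(0.0, max(terms))  # stable log(1 + sum(exp(terms)))
    return peak + math.log(math.exp(-peak) + sum(math.exp(t - peak) for t in terms))

# Toy 2-D vectors standing in for protein embeddings.
anchors = [[1.0, 0.0], [0.0, 1.0]]
mnrl_aligned = mnrl_loss(anchors, [[0.9, 0.1], [0.1, 0.9]])
mnrl_swapped = mnrl_loss(anchors, [[0.1, 0.9], [0.9, 0.1]])

# Predicted similarities that agree / disagree with the label ordering.
cosent_ordered = cosent_loss([0.9, 0.5, 0.1], [1.0, 0.5, 0.0])
cosent_reversed = cosent_loss([0.1, 0.5, 0.9], [1.0, 0.5, 0.0])
```

In both cases the loss is near zero when the embeddings already respect the supervision (matching pairs closest, or similarities ranked like the labels) and grows large when they contradict it.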

This variant was trained on four complementary data sources with round-robin sampling:

Dataset	Rows/Pairs	Loss
Pfam families (linclust@70%)	32.9M domains	MNRL
AlphaFold DB structural pairs (Foldseek-grouped)	133.9M sequences	MNRL
STRING-DB v12 PPI (score >= 400)	36.5M pairs	MNRL
ProteinGym DMS / clinical	2.2M pairs	CoSENT
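
The round-robin schedule over the four sources can be pictured with the short sketch below. It is a hedged illustration rather than the authors' sampler: the source names and toy batches are placeholders, and dropping a source from the rotation once it is exhausted is an assumption, since the card does not state the stopping rule.

```python
def round_robin(sources):
    """Yield (source_name, batch) pairs, taking one batch per source in turn.

    `sources` maps a name to an iterable of batches. Exhausted sources are
    dropped from the rotation (an assumption; the real stop rule may differ).
    """
    iterators = {name: iter(batches) for name, batches in sources.items()}
    while iterators:
        for name in list(iterators):
            try:
                yield name, next(iterators[name])
            except StopIteration:
                del iterators[name]

# Placeholder batches; real batches would hold sequence pairs.
toy_sources = {
    "pfam": ["pfam_0", "pfam_1"],
    "afdb": ["afdb_0", "afdb_1", "afdb_2"],
    "string": ["string_0"],
    "proteingym": ["pg_0", "pg_1"],
}
schedule = list(round_robin(toy_sources))
```

Alternating sources this way keeps any single large dataset from dominating consecutive optimizer steps, and each batch can be paired with the loss listed for its source in the table above.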

Key hyperparameters: AdamW optimizer, cosine LR schedule, batch size 1024, temperature 0.05, dropout 0.1. Trained on a single NVIDIA RTX 6000 Ada 48GB in ~3-4 hours.

Quick Start

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("oriel9p/protsent-esm2-35M")

sequences = [
    "MKTLLLTLVVVTIVCLDLGYT",
    "MKTLLLTLVVVTIVCLDLGYN",  # similar
    "AGWYRSPQEGLKPVDTFKDIV",  # different
]

embeddings = model.encode(sequences)

Compute similarity:

from sentence_transformers.util import cos_sim

similarities = cos_sim(embeddings[0], embeddings[1:])
print(similarities)

Results

Embeddings are evaluated with a KNN probe (k=3, Euclidean distance) on 23 downstream tasks. This variant (w/o hard negatives) improves 20 of 23 tasks over baseline ESM-2 35M, with a mean relative improvement of +7.9%.

Selected highlights vs. baseline ESM-2 35M:

Task	Metric	Baseline	ProtSent	Change
Remote Homology (Fold)	F1 Macro	.223	.313	+40.5%
RhlA Enzyme Mutations	Spearman	.236	.418	+77.2%
Beta-lactamase (PEER)	Spearman	.670	.793	+18.5%
Fluorescence (TAPE)	Spearman	.490	.567	+15.6%
PPI (Bernett)	AUC	.560	.589	+5.3%
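
For readers unfamiliar with the evaluation protocol, a KNN probe with k=3 and Euclidean distance can be sketched as below. This is a toy classification version with made-up 2-D vectors and hypothetical labels, not the benchmark code; in practice the vectors would come from `model.encode(...)`, and regression tasks (those scored with Spearman) would presumably average neighbor targets instead of voting.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Majority vote over the k nearest training embeddings (Euclidean)."""
    ranked = sorted(
        zip(train_x, train_y),
        key=lambda pair: math.dist(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy 2-D "embeddings" forming two well-separated groups.
train_x = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [1.0, 0.9], [0.9, 1.0], [1.1, 1.0]]
train_y = ["fold_a", "fold_a", "fold_a", "fold_b", "fold_b", "fold_b"]

pred_near_a = knn_predict(train_x, train_y, [0.1, 0.1])
pred_near_b = knn_predict(train_x, train_y, [1.0, 1.0])
```

Because the probe has no trainable parameters, its score depends only on the geometry of the embedding space, which makes it a direct check of whether similar proteins end up close together.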

Intended Use

General-purpose protein embeddings for downstream tasks including classification, regression, retrieval, clustering, and similarity search. The embeddings capture evolutionary, structural, and functional relationships.

Citation

@article{ofer2026protsent,
  title={ProtSent: Protein Sentence Transformers},
  author={Ofer, Dan and Perets, Oriel and Linial, Michal and Rappoport, Nadav},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2026}
}
