Spaces: HuggingFaceBio/carbon-demo

lvwerra (HF Staff), Claude Opus 4.7 (1M context) committed 16 days ago

Commit a36d457 (1 parent: 6a79bbb)

bp-level: rewrite snippets for fns revision (single checkpoint, batched score_sequence)


Files changed (1)

demo.html +58 -58


@@ -2046,15 +2046,13 @@ for name, ids in zip(species_prefixes, new_ids):
     6-mer axis. Reach for bp-level <em>scoring</em> whenever the task is about
     a specific base: variant-effect prediction, single-nucleotide mutational
     scans, comparing the likelihood of a reference and an alternate allele at
-    one position. Two complementary delivery paths: generation ships as a
-    transformers <code>custom_generate</code> method at
-    <code>HuggingFaceBio/carbon-generate</code> that works on the plain
-    <code>Carbon-3B</code>/<code>8B</code>/<code>500M</code> checkpoints
-    (standard <code>LlamaForCausalLM</code>, no custom modeling file).
-    Scoring ships in the <code>-remote</code> variants of those same
-    checkpoints, which add a <code>score_sequence(seq)</code> method that
-    returns per-base distributions and the probability of the observed base
-    at every position.
   </div>
   <details class="code-snippet">
@@ -2065,66 +2063,68 @@ for name, ids in zip(species_prefixes, new_ids):
         <button class="code-snippet__tab"        data-tab="score"    type="button">score</button>
       </div>
       <button class="code-snippet__copy" type="button">Copy</button>
-      <div class="code-snippet__panel active" data-tab="generate"><pre><code>from transformers import AutoModelForCausalLM, AutoTokenizer
 import torch
-tok = AutoTokenizer.from_pretrained(
-    "HuggingFaceBio/Carbon-3B", trust_remote_code=True,
-)
-model = AutoModelForCausalLM.from_pretrained(
-    "HuggingFaceBio/Carbon-3B",
-    dtype=torch.bfloat16, device_map="auto",
-)
-prompt = "&lt;dna&gt;ATGCGCTAGCTACGATCGATCGTAGCTAGCTAGCTAGCTACG"
-inputs = tok(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
-# `custom_generate` injects a logits processor that marginalizes the
-# 6-mer logits to per-base distributions and samples each of the 6
-# positions independently, then forces the matching 6-mer token. All
-# standard generation knobs (temperature, top_p, top_k, repetition_penalty)
-# still apply, they just act on the per-base marginals.
-out = model.generate(
-    **inputs,
-    max_new_tokens=128,         # 128 6-mer tokens ~= 768 bp of continuation
-    custom_generate="HuggingFaceBio/carbon-generate",
     trust_remote_code=True,
-    tokenizer=tok,
-    do_sample=True, temperature=0.8, top_p=0.9,
-)
-# Slice off the prompt and decode the continuation as plain DNA.
-new_ids = out[0, inputs["input_ids"].shape[1]:]
-print(tok.decode(new_ids, skip_special_tokens=True))</code></pre></div>
-      <div class="code-snippet__panel" data-tab="score"><pre><code>from transformers import AutoModelForCausalLM, AutoTokenizer
-import torch, math
-# The -remote variants bundle modeling code that exposes
-# `score_sequence(seq)` directly on the model. It returns, for every
-# position in the input DNA, the marginal P(base | context) and the
-# probability of the observed base.
-tok = AutoTokenizer.from_pretrained(
-    "HuggingFaceBio/Carbon-3B-remote", trust_remote_code=True,
-)
 model = AutoModelForCausalLM.from_pretrained(
-    "HuggingFaceBio/Carbon-3B-remote",
     trust_remote_code=True,
-    dtype=torch.bfloat16, device_map="auto",
-)
-ref = "ATGCGCTAGCTACGATCGATCGTAGCTAGCTAGCTAGCTACG"
-alt = ref[:20] + "G" + ref[21:]          # single-base substitution at pos 20
-# bp_probs: [seq_len, 4]   marginal P(A/T/C/G | context) at each position
-# actual:   [seq_len]      P(observed base | context) at each position
-bp_probs_ref, actual_ref = model.score_sequence(ref)
-bp_probs_alt, actual_alt = model.score_sequence(alt)
-# log-likelihood delta at the substituted position
-# is the per-base variant-effect score in its simplest form.
-delta = math.log(actual_alt[20].item() + 1e-12) \
-      - math.log(actual_ref[20].item() + 1e-12)
-print(f"log P(alt) - log P(ref) at pos 20: {delta:+.3f}")</code></pre></div>
     </div>
   </details>
   </div>

     6-mer axis. Reach for bp-level <em>scoring</em> whenever the task is about
     a specific base: variant-effect prediction, single-nucleotide mutational
     scans, comparing the likelihood of a reference and an alternate allele at
+    one position. Both paths ship together on the <code>fns</code> revision of
+    the <code>Carbon-3B</code>/<code>8B</code>/<code>500M</code> checkpoints:
+    plain <code>.generate()</code> already produces bp-resolution output (the
+    tokenizer exposes the kmer width as <code>tokenizer.k</code>), and the
+    model gains a <code>score_sequence(seqs)</code> method that batches a list
+    of sequences and returns per-base distributions plus the probability of
+    the observed base at every position.
   </div>
   <details class="code-snippet">
         <button class="code-snippet__tab"        data-tab="score"    type="button">score</button>
       </div>
       <button class="code-snippet__copy" type="button">Copy</button>
+      <div class="code-snippet__panel active" data-tab="generate"><pre><code>import math
 import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+model_id = "HuggingFaceBio/Carbon-3B"
+revision = "fns"
+device = "cuda"
+tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    revision=revision,
     trust_remote_code=True,
+    dtype=torch.bfloat16,
+).to(device).eval()
+context = "ATGCGCTAGCTACGATCGATCGTAGCTAGCTAGCTAGCTACG"
+n_bp = 60
+inputs = tokenizer(f"&lt;dna&gt;{context}", return_tensors="pt", add_special_tokens=False).to(device)
+with torch.no_grad():
+    output_ids = model.generate(
+        **inputs,
+        max_new_tokens=math.ceil(n_bp / tokenizer.k),
+        do_sample=False,
+        pad_token_id=tokenizer.eos_token_id,
+    )
+generated_ids = output_ids[0, inputs.input_ids.shape[1]:]
+generated_dna = tokenizer.decode(generated_ids, skip_special_tokens=True)[:n_bp]
+print(generated_dna)</code></pre></div>
+      <div class="code-snippet__panel" data-tab="score"><pre><code>import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+model_id = "HuggingFaceBio/Carbon-3B"
+revision = "fns"
+device = "cuda"
+tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision, trust_remote_code=True)
 model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    revision=revision,
     trust_remote_code=True,
+    dtype=torch.bfloat16,
+).to(device).eval()
+reference = "GGGCTATAAAGGCCATCGATCGATCGATCGATCGATCGATCG"
+perturbed = "GGGCGCGCGCGGCCATCGATCGATCGATCGATCGATCGATCG"
+# score_sequence accepts a list of sequences and returns, for each one,
+# the [seq_len, 4] marginal P(A/T/C/G | context) and the [seq_len]
+# probability of the observed base.
+with torch.no_grad():
+    bp_probs, actual_probs = model.score_sequence([reference, perturbed])
+scores = [torch.log(p.clamp_min(1e-12)).mean().item() for p in actual_probs]
+print(f"reference mean bp logp: {scores[0]:.4f}")
+print(f"perturbed mean bp logp: {scores[1]:.4f}")
+print(f"reference preferred: {scores[0] > scores[1]}")</code></pre></div>
     </div>
   </details>
   </div>
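
Note on the arithmetic in the rewritten snippets: the generate tab sizes `max_new_tokens` as `math.ceil(n_bp / tokenizer.k)` and then truncates the decoded string to `n_bp`, and the score tab reduces each sequence to the mean log-probability of its observed bases, with each probability floored at `1e-12`. The following is a minimal, model-free sketch of just those two calculations, not the demo's code. The helper names `tokens_for_bp` and `mean_bp_logp` are hypothetical, `k = 6` is assumed from the 6-mer tokenization the text describes, and the probabilities are made up for illustration.

```python
import math


def tokens_for_bp(n_bp: int, k: int) -> int:
    """Number of k-mer tokens needed to cover at least n_bp bases."""
    return math.ceil(n_bp / k)


def mean_bp_logp(observed_probs, eps: float = 1e-12) -> float:
    """Mean log-probability of the observed bases.

    Each probability is floored at eps before the log, mirroring
    torch.log(p.clamp_min(1e-12)).mean() in the score snippet.
    """
    return sum(math.log(max(p, eps)) for p in observed_probs) / len(observed_probs)


# k = 6 is an assumption taken from the 6-mer wording; the snippet reads it
# from tokenizer.k instead of hard-coding it.
print(tokens_for_bp(60, 6))  # 10
print(tokens_for_bp(61, 6))  # 11: one extra token, hence the [:n_bp] truncation

# Made-up per-base probabilities, for illustration only (not model output).
reference_probs = [0.9, 0.8, 0.7]
perturbed_probs = [0.9, 0.2, 0.7]
print(mean_bp_logp(reference_probs) > mean_bp_logp(perturbed_probs))  # True
```

Because the score is a mean over positions, it can be compared across sequences of slightly different lengths. It does, however, dilute the effect of a single substitution as the sequence grows, which is where the per-position delta used in the removed snippet is the sharper tool.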