Clarke — PRD Technical Specification

Version: 1.0 | Date: 13 February 2026 | Author: Project Lead
Status: Final — engineering blueprint for AI agent (Codex) execution
Parent document: clarke_PRD_masterplan.md
Scope: Architecture, directory structure, technology stack, data models, API contracts, model serving, FHIR specification, synthetic data, frontend–backend integration, error handling, testing, and known pitfalls
Not in scope: Strategic rationale (masterplan.md), build sequencing (implementation.md), visual styling (design_guidelines.md), user journey (userflow.md), granular task list (tasks.md)

1. Project Directory Tree

clarke/
├── app.py                          # Gradio application entry point (launches UI + mounts FastAPI)
├── Dockerfile                      # HF Spaces Docker config (nvidia/cuda:12.4.1-runtime-ubuntu22.04)
├── requirements.txt                # Pinned Python dependencies
├── .env.template                   # Environment variable template (copy to .env)
├── README.md                       # Project overview, architecture diagram, setup, evaluation, licence
├── LICENSE                         # Apache 2.0
├── submission_checklist.md         # Competition submission verification checklist
├── evaluation_report.md            # Quantitative evaluation results (WER, BLEU, ROUGE-L, fact recall)
│
├── backend/
│   ├── __init__.py
│   ├── orchestrator.py             # Core pipeline coordinator: audio → transcript → context → letter
│   ├── api.py                      # FastAPI endpoints (patient, consultation, document, health)
│   ├── config.py                   # Centralised configuration (env vars, model IDs, timeouts)
│   ├── models/
│   │   ├── __init__.py
│   │   ├── medasr.py               # MedASR loading, audio preprocessing, transcription pipeline
│   │   ├── ehr_agent.py            # MedGemma 4B EHR agent: FHIR tool-calling or deterministic fallback
│   │   ├── doc_generator.py        # MedGemma 27B document generation: prompt assembly + inference
│   │   └── model_manager.py        # Shared GPU memory management, model lifecycle, health checks
│   ├── fhir/
│   │   ├── __init__.py
│   │   ├── client.py               # Async FHIR REST client (httpx) for querying HAPI FHIR / mock API
│   │   ├── tools.py                # FHIR tool functions for EHR agent (search_patients, get_conditions, etc.)
│   │   ├── mock_api.py             # Mock FHIR API (FastAPI endpoints returning pre-loaded JSON) — fallback
│   │   └── queries.py              # Deterministic FHIR query patterns (fallback for agentic tool-calling)
│   ├── prompts/
│   │   ├── document_generation.j2  # Jinja2 template: system + transcript + context → letter prompt
│   │   ├── ehr_agent_system.txt    # System prompt for MedGemma 4B EHR agent
│   │   └── context_synthesis.j2    # Jinja2 template: raw FHIR resources → structured context JSON
│   ├── schemas.py                  # Pydantic data models (Patient, Consultation, Transcript, etc.)
│   ├── audio.py                    # Audio format conversion (WebM → WAV 16kHz mono via ffmpeg/pydub)
│   ├── errors.py                   # Custom exception classes, error response models, logging config
│   └── utils.py                    # Shared utilities (timing decorators, JSON sanitisation)
│
├── frontend/
│   ├── __init__.py
│   ├── ui.py                       # Gradio Blocks UI definition (all screens S1–S6)
│   ├── theme.py                    # Gradio theme: Clarke colour tokens, typography, spacing
│   ├── components.py               # Reusable Gradio component builders (patient card, status badge, etc.)
│   ├── state.py                    # Gradio session state management
│   └── assets/
│       ├── style.css               # Custom CSS (design_guidelines.md §1–§5 tokens and animations)
│       ├── clarke_logo.svg         # Clarke shield/C logo in SVG
│       └── favicon.ico             # Browser tab icon
│
├── data/
│   ├── synthea/
│   │   ├── generate.sh             # Synthea generation script (50 UK-style patients)
│   │   └── uk_config/              # Synthea UK module config (names, NHS numbers, mmol/L, BNF drugs)
│   ├── fhir_bundles/
│   │   └── *.json                  # Pre-generated FHIR Bundle JSON files (50 patients) for mock API
│   ├── demo/
│   │   ├── mrs_thompson.wav        # Demo audio: 67F, T2DM, rising HbA1c (~60s, 16kHz mono WAV)
│   │   ├── mr_okafor.wav           # Demo audio: chest pain follow-up (~60s, 16kHz mono WAV)
│   │   ├── ms_patel.wav            # Demo audio: asthma review (~60s, 16kHz mono WAV)
│   │   ├── mrs_thompson_transcript.txt  # Ground-truth transcript for WER evaluation
│   │   ├── mr_okafor_transcript.txt
│   │   └── ms_patel_transcript.txt
│   ├── training/
│   │   ├── train.jsonl             # 200 training triplets (transcript, context, reference letter)
│   │   └── test.jsonl              # 50 held-out test triplets
│   └── clinic_list.json            # Demo clinic list metadata (5 patients for dashboard)
│
├── finetuning/
│   ├── train_lora.py               # QLoRA fine-tuning script for MedGemma 27B
│   ├── generate_training_data.py   # Script to generate training triplets via Claude API
│   └── merge_adapter.py            # Merge LoRA adapter with base model (optional, for evaluation)
│
├── evaluation/
│   ├── eval_medasr.py              # WER evaluation: MedASR vs Whisper on test clips
│   ├── eval_ehr_agent.py           # Fact recall / precision / hallucination evaluation
│   ├── eval_doc_gen.py             # BLEU / ROUGE-L evaluation on held-out test set
│   └── gold_standards/
│       └── *.json                  # Gold-standard context summaries for 20 test patients
│
├── tests/
│   ├── test_api.py                 # API endpoint unit tests (one per endpoint)
│   ├── test_medasr.py              # MedASR pipeline unit tests
│   ├── test_ehr_agent.py           # EHR agent unit tests
│   ├── test_doc_generator.py       # Document generator unit tests
│   ├── test_fhir_client.py         # FHIR client unit tests
│   ├── test_schemas.py             # Pydantic model validation tests
│   └── test_e2e.py                 # End-to-end pipeline test (audio → transcript → context → letter)
│
└── scripts/
    ├── start.sh                    # Single-command launch script (starts FHIR + FastAPI + Gradio)
    ├── health_check.sh             # Verify all services running
    └── setup_fhir.sh               # Load synthetic data into FHIR server

2. Technology Stack

Package	Version	Purpose	Notes
Python	3.11.x	Runtime	HF Spaces base
PyTorch	2.4.x	ML framework	CUDA 12.4 build
transformers	≥4.50	Model loading (MedASR, MedGemma)	HuggingFace; Gemma 3 support, which the MedGemma checkpoints rely on, arrived in 4.50
bitsandbytes	0.44.x	4-bit NF4 quantisation	`pip install --break-system-packages`
accelerate	1.2.x	Device mapping for multi-GPU/CPU offload
peft	0.13.x	LoRA / QLoRA fine-tuning
trl	0.12.x	SFTTrainer for supervised fine-tuning
datasets	3.2.x	HF Datasets for training data loading
Gradio	5.x	Frontend UI framework	Served within HF Space
FastAPI	0.109.x	Backend REST API	Mounted within Gradio app
uvicorn	0.27.x	ASGI server for FastAPI
httpx	0.27.x	Async HTTP client (FHIR REST calls)
pydub	0.25.x	Audio resampling, channel conversion	Requires ffmpeg
librosa	0.10.x	Audio waveform loading / preprocessing
ffmpeg	4.4.x (system)	WebM → WAV format conversion	System package, not pip; version installed by apt on the Ubuntu 22.04 base image
jinja2	3.1.x	Prompt template engine
jiwer	3.0.x	WER computation for MedASR evaluation
rouge_score	latest	ROUGE-L for document generation evaluation
sacrebleu	latest	BLEU for document generation evaluation
openai-whisper	latest	ASR baseline comparison only	Uses the large-v3 checkpoint
reportlab	4.2.x	PDF export of clinic letters
wandb	0.18.x	Experiment tracking (fine-tuning)
huggingface_hub	latest	Model upload, Space deployment
python-dotenv	latest	.env file loading
loguru	latest	Structured logging

Compute allocation:

Component	Runs On	Approx VRAM
MedASR (105M)	GPU	~0.5 GB
MedGemma 4B (4-bit NF4)	GPU	~3 GB
MedGemma 27B (4-bit NF4)	GPU	~16 GB
FHIR server (HAPI or mock)	CPU	0 (CPU/RAM only)
FastAPI / Gradio	CPU	0
Total GPU		~19.5 GB (fits A100 40GB with headroom for KV cache + fine-tuning)
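
As a rough cross-check on the table above, NF4 stores about 4 bits (0.5 bytes) per parameter, so a weight-only lower bound follows directly from the parameter counts. The sketch below is illustrative only: the helper name and the decimal-GB convention are assumptions, and the larger figures in the table presumably also cover quantisation metadata and runtime overhead.

```python
def nf4_weight_gb(n_params: float) -> float:
    """Weight-only lower bound for 4-bit storage: 0.5 bytes per parameter, in decimal GB."""
    return n_params * 0.5 / 1e9


# Parameter counts taken from the compute allocation table
medgemma_4b_gb = nf4_weight_gb(4e9)
medgemma_27b_gb = nf4_weight_gb(27e9)

# Sum of the table's approximate per-model VRAM figures (MedASR + 4B + 27B)
table_total_gb = 0.5 + 3 + 16

print(f"4B weights >= {medgemma_4b_gb:.1f} GB, 27B weights >= {medgemma_27b_gb:.1f} GB")
print(f"Table total: {table_total_gb:.1f} GB of 40 GB")
```

The gap between the table total and 40 GB is what remains for KV cache and activations.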

3. Infrastructure and Environment Specification

3a. Environment Variables

# .env.template — copy to .env and fill values

# === Model Configuration ===
MEDASR_MODEL_ID=google/medasr
MEDGEMMA_4B_MODEL_ID=google/medgemma-1.5-4b-it
MEDGEMMA_27B_MODEL_ID=google/medgemma-27b-text-it
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx           # HuggingFace token (gated model access)
QUANTIZE_4BIT=true                          # Enable 4-bit NF4 quantisation for 4B and 27B
USE_FLASH_ATTENTION=true                    # Enable flash attention if supported

# === FHIR Configuration ===
FHIR_SERVER_URL=http://localhost:8080/fhir  # HAPI FHIR or mock API base URL
USE_MOCK_FHIR=false                         # Set true to use mock FHIR API (fallback)
FHIR_TIMEOUT_S=10                           # FHIR query timeout in seconds

# === Application Configuration ===
APP_HOST=0.0.0.0
APP_PORT=7860                               # Gradio default port on HF Spaces
LOG_LEVEL=INFO                              # DEBUG | INFO | WARNING | ERROR
MAX_AUDIO_DURATION_S=1800                   # Maximum recording length (30 min)
PIPELINE_TIMEOUT_S=120                      # Max time for full pipeline (End Consultation → letter)
DOC_GEN_MAX_TOKENS=2048                     # Max tokens for MedGemma 27B generation
DOC_GEN_TEMPERATURE=0.3                     # Low temperature for factual clinical text

# === Fine-tuning (optional, Phase 4) ===
WANDB_API_KEY=                              # Weights & Biases API key
WANDB_PROJECT=clarke-finetuning
LORA_RANK=16
LORA_ALPHA=32
LORA_DROPOUT=0.05
TRAINING_EPOCHS=3
LEARNING_RATE=2e-4
BATCH_SIZE=2
GRAD_ACCUM_STEPS=8
MAX_SEQ_LENGTH=4096

3b. Cloud Deployment (Primary — HF Spaces A100)

Dockerfile:

FROM nvidia/cuda:12.4.1-runtime-ubuntu22.04

RUN apt-get update && apt-get install -y \
    python3.11 python3.11-venv python3-pip \
    ffmpeg curl wget git && \
    rm -rf /var/lib/apt/lists/*

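# Use a Python 3.11 virtual environment so that `python` and `pip` both target 3.11.
# The apt-provided pip on Ubuntu 22.04 belongs to Python 3.10 and predates --break-system-packages.
RUN python3.11 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt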

COPY . .

# Make the launch script executable and start the application
RUN chmod +x scripts/start.sh
EXPOSE 7860
CMD ["scripts/start.sh"]

scripts/start.sh:

#!/bin/bash
set -e

echo "=== Clarke Startup ==="

# 1. Start mock FHIR API (or HAPI FHIR) in background
if [ "$USE_MOCK_FHIR" = "true" ]; then
    echo "[1/3] Starting mock FHIR API..."
    python -m backend.fhir.mock_api &
    FHIR_PID=$!
    sleep 2
    echo "[1/3] Mock FHIR API running (PID: $FHIR_PID)"
else
    echo "[1/3] Using external FHIR server at $FHIR_SERVER_URL"
fi

# 2. Verify GPU
echo "[2/3] Checking GPU..."
python -c "import torch; assert torch.cuda.is_available(), 'No GPU'; print(f'GPU: {torch.cuda.get_device_name(0)}, VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB')"

# 3. Launch Gradio app (which mounts FastAPI)
echo "[3/3] Starting Clarke application on port ${APP_PORT:-7860}..."
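# app.py blocks until shutdown, so nothing placed after this line would run while the app is up.
# exec hands the process over to the app so that it receives container stop signals directly.
exec python app.py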

HF Spaces metadata (in README.md YAML frontmatter):

---
title: Clarke
emoji: 🩺
colorFrom: blue
colorTo: yellow
sdk: docker
app_port: 7860
suggested_hardware: a100-large
---

3c. Local Development (MacBook Pro M2 8GB — No GPU)

Local development runs only lightweight components. AI models are either mocked or served from a remote cloud GPU.

Local setup:

Clone repo. Copy .env.template to .env.
Set USE_MOCK_FHIR=true in .env.
Set model IDs to mock to activate stubs: MEDASR_MODEL_ID=mock, MEDGEMMA_4B_MODEL_ID=mock, MEDGEMMA_27B_MODEL_ID=mock.
pip install -r requirements.txt (CPU-only PyTorch).
bash scripts/start.sh → Gradio UI at http://localhost:7860.

Model stubs (when model ID = "mock"):

Each model module in backend/models/ checks the model ID. If mock, it returns pre-loaded fixture data instead of running inference:

MedASR mock: Returns the ground-truth transcript from data/demo/*.txt for known demo audio files, or a generic placeholder transcript for unknown audio.
MedGemma 4B mock: Returns pre-built context JSON from data/fhir_bundles/ for known patient IDs.
MedGemma 27B mock: Returns a pre-written reference letter from data/training/test.jsonl for known patient IDs.

This allows full frontend + integration development without GPU access.
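
A minimal sketch of how the MedASR stub could dispatch on the model ID, assuming the transcript file naming shown in the directory tree (`{stem}_transcript.txt`). The function names, the placeholder text, and the `demo_dir` parameter are hypothetical and only illustrate the mock switch described above.

```python
from pathlib import Path

# Hypothetical placeholder returned for audio without a ground-truth transcript
PLACEHOLDER_TRANSCRIPT = "[mock transcript: no ground-truth file for this audio]"


def mock_transcribe(audio_path: str, demo_dir: str = "data/demo") -> str:
    """Return the ground-truth transcript for a known demo clip, else a placeholder."""
    stem = Path(audio_path).stem  # e.g. "mrs_thompson"
    transcript_file = Path(demo_dir) / f"{stem}_transcript.txt"
    if transcript_file.is_file():
        return transcript_file.read_text(encoding="utf-8").strip()
    return PLACEHOLDER_TRANSCRIPT


def transcribe(audio_path: str, model_id: str, demo_dir: str = "data/demo") -> str:
    """Dispatch to the stub when the configured model ID is 'mock'."""
    if model_id == "mock":
        return mock_transcribe(audio_path, demo_dir)
    # Real MedASR inference (see §6a) is deliberately left out of this sketch
    raise NotImplementedError("real MedASR inference is out of scope for this sketch")
```

The same pattern would apply to the two MedGemma modules, with fixture JSON or a reference letter in place of the transcript.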

4. Data Models and Schemas

All models defined as Pydantic v2 BaseModel in backend/schemas.py.

"""Clarke data models — Pydantic v2 schemas for all system objects."""

from __future__ import annotations
from pydantic import BaseModel, Field
from typing import Optional
from enum import Enum
from datetime import datetime


# === Enums ===

class ConsultationStatus(str, Enum):
    IDLE = "idle"
    RECORDING = "recording"
    PAUSED = "paused"
    PROCESSING = "processing"
    REVIEW = "review"
    SIGNED_OFF = "signed_off"


class PipelineStage(str, Enum):
    TRANSCRIBING = "transcribing"
    RETRIEVING_CONTEXT = "retrieving_context"
    GENERATING_DOCUMENT = "generating_document"
    COMPLETE = "complete"
    FAILED = "failed"


# === Core Models ===

class Patient(BaseModel):
    """A patient in the clinic list."""
    id: str = Field(description="FHIR Patient resource ID")
    nhs_number: str = Field(description="NHS number (format: XXX XXX XXXX)")
    name: str = Field(description="Full name (e.g., 'Mrs. Margaret Thompson')")
    date_of_birth: str = Field(description="DOB in DD/MM/YYYY format")
    age: int
    sex: str = Field(description="'Male' or 'Female'")
    appointment_time: str = Field(description="HH:MM format")
    summary: str = Field(description="One-line clinical summary for dashboard card")


class LabResult(BaseModel):
    """A single laboratory result with trend."""
    name: str = Field(description="e.g., 'HbA1c'")
    value: str = Field(description="e.g., '55'")
    unit: str = Field(description="e.g., 'mmol/mol'")
    reference_range: Optional[str] = Field(default=None, description="e.g., '20-42'")
    date: str = Field(description="ISO date of result")
    trend: Optional[str] = Field(default=None, description="'rising', 'falling', 'stable', or None")
    previous_value: Optional[str] = Field(default=None, description="Previous result value")
    previous_date: Optional[str] = Field(default=None)
    fhir_resource_id: Optional[str] = Field(default=None, description="Source FHIR Observation ID")


class PatientContext(BaseModel):
    """Structured patient context synthesised by the EHR Agent from FHIR data."""
    patient_id: str
    demographics: dict = Field(description="name, dob, nhs_number, age, sex, address")
    problem_list: list[str] = Field(description="Active diagnoses, e.g., ['Type 2 Diabetes Mellitus (2019)', ...]")
    medications: list[dict] = Field(description="[{'name': 'Metformin', 'dose': '1g', 'frequency': 'BD', 'fhir_id': '...'}]")
    allergies: list[dict] = Field(description="[{'substance': 'Penicillin', 'reaction': 'Anaphylaxis', 'severity': 'high'}]")
    recent_labs: list[LabResult] = Field(default_factory=list)
    recent_imaging: list[dict] = Field(default_factory=list, description="[{'type': 'CXR', 'date': '...', 'summary': '...'}]")
    clinical_flags: list[str] = Field(default_factory=list, description="['HbA1c rising trend over 6 months']")
    last_letter_excerpt: Optional[str] = Field(default=None, description="Key excerpt from most recent clinic letter")
    retrieval_warnings: list[str] = Field(default_factory=list, description="Warnings if some FHIR queries failed")
    retrieved_at: str = Field(description="ISO timestamp of retrieval")


class Transcript(BaseModel):
    """Consultation transcript produced by MedASR."""
    consultation_id: str
    text: str = Field(description="Full transcript text")
    duration_s: float = Field(description="Audio duration in seconds")
    word_count: int
    created_at: str


class DocumentSection(BaseModel):
    """A single section of the generated clinical letter."""
    heading: str = Field(description="e.g., 'History of presenting complaint'")
    content: str = Field(description="Section body text")
    editable: bool = Field(default=True)
    fhir_sources: list[str] = Field(default_factory=list, description="FHIR resource IDs cited in this section")


class ClinicalDocument(BaseModel):
    """A generated NHS clinical letter."""
    consultation_id: str
    letter_date: str
    patient_name: str
    patient_dob: str
    nhs_number: str
    addressee: str = Field(description="GP name and address")
    salutation: str = Field(description="e.g., 'Dear Dr. Patel,'")
    sections: list[DocumentSection]
    medications_list: list[str] = Field(description="Current medications (formatted)")
    sign_off: str = Field(description="e.g., 'Dr. S. Chen, Consultant Diabetologist'")
    status: ConsultationStatus = ConsultationStatus.REVIEW
    generated_at: str
    generation_time_s: float = Field(description="Time taken for MedGemma 27B inference")
    discrepancies: list[dict] = Field(default_factory=list, description="[{'type': 'allergy_mismatch', 'detail': '...'}]")


class Consultation(BaseModel):
    """A complete consultation session — links patient, transcript, context, and document."""
    id: str = Field(description="Unique consultation ID (UUID)")
    patient: Patient
    status: ConsultationStatus = ConsultationStatus.IDLE
    pipeline_stage: Optional[PipelineStage] = None
    context: Optional[PatientContext] = None
    transcript: Optional[Transcript] = None
    document: Optional[ClinicalDocument] = None
    started_at: Optional[str] = None
    ended_at: Optional[str] = None
    audio_file_path: Optional[str] = None


class PipelineProgress(BaseModel):
    """Real-time pipeline progress updates pushed to the UI."""
    consultation_id: str
    stage: PipelineStage
    progress_pct: int = Field(ge=0, le=100)
    message: str = Field(description="Human-readable status, e.g., 'Finalising transcript...'")


class ErrorResponse(BaseModel):
    """Standardised error response format."""
    error: str = Field(description="Error category: 'model_error', 'fhir_error', 'audio_error', 'timeout', 'not_found'")
    message: str = Field(description="Human-readable error message for UI display")
    detail: Optional[str] = Field(default=None, description="Technical detail (logged, not shown to user)")
    consultation_id: Optional[str] = None
    timestamp: str

Relationships:

A Consultation belongs to one Patient and has at most one Transcript, one PatientContext, and one ClinicalDocument.
A PatientContext contains a list of LabResult objects (recent_labs); its other collections are plain dicts or strings.
A ClinicalDocument contains a list of DocumentSection objects.
PipelineProgress is a transient event emitted during processing — not persisted.

5. API Contracts

All endpoints are served by FastAPI, mounted within the Gradio app at /api/v1/.

5a. Endpoint Summary

Method	Path	Description
GET	`/api/v1/health`	System health check (all models + FHIR)
GET	`/api/v1/patients`	List all patients in clinic list
GET	`/api/v1/patients/{patient_id}`	Get single patient details
POST	`/api/v1/patients/{patient_id}/context`	Trigger EHR Agent context retrieval
POST	`/api/v1/consultations/start`	Start a consultation (begin recording session)
POST	`/api/v1/consultations/{id}/audio`	Upload audio chunk or complete audio file
POST	`/api/v1/consultations/{id}/end`	End consultation → trigger full pipeline
GET	`/api/v1/consultations/{id}/transcript`	Get current transcript
GET	`/api/v1/consultations/{id}/document`	Get generated document
POST	`/api/v1/consultations/{id}/document/regenerate-section`	Regenerate one section
POST	`/api/v1/consultations/{id}/document/sign-off`	Sign off document
GET	`/api/v1/consultations/{id}/progress`	Get current pipeline progress

5b. Endpoint Details

GET /api/v1/health

// Response 200
{
  "status": "healthy",
  "models": {
    "medasr": {"loaded": true, "device": "cuda:0"},
    "medgemma_4b": {"loaded": true, "device": "cuda:0", "quantised": "4bit"},
    "medgemma_27b": {"loaded": true, "device": "cuda:0", "quantised": "4bit"}
  },
  "fhir": {"status": "connected", "patient_count": 50},
  "gpu": {"name": "A100-SXM4-40GB", "vram_used_gb": 19.5, "vram_total_gb": 40.0},
  "timestamp": "2026-02-13T14:00:00Z"
}

GET /api/v1/patients

// Response 200
{
  "patients": [
    {
      "id": "pt-001",
      "nhs_number": "943 476 5829",
      "name": "Mrs. Margaret Thompson",
      "date_of_birth": "14/03/1958",
      "age": 67,
      "sex": "Female",
      "appointment_time": "14:00",
      "summary": "Follow-up — Type 2 Diabetes, rising HbA1c"
    }
  ]
}

POST /api/v1/patients/{patient_id}/context

// Request: empty body (patient_id in URL path)
// Response 200: PatientContext JSON (see §4 schema)
// Response 404: {"error": "fhir_error", "message": "Patient not found in EHR", ...}
// Response 504: {"error": "timeout", "message": "EHR context retrieval timed out", ...}

POST /api/v1/consultations/start

// Request
{"patient_id": "pt-001"}

// Response 201
{
  "consultation_id": "cons-uuid-xxxx",
  "patient_id": "pt-001",
  "status": "recording",
  "started_at": "2026-02-13T14:05:00Z"
}

POST /api/v1/consultations/{id}/audio

// Request: multipart/form-data
// Field: "audio_file" — WAV file (16kHz mono) or WebM (server converts)
// Field: "is_final" — boolean (true = complete audio, false = chunk for streaming)

// Response 200
{"consultation_id": "cons-uuid-xxxx", "audio_received": true, "duration_s": 62.5}
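
The server-side WebM to WAV conversion mentioned above could be done by shelling out to ffmpeg. This is a sketch under that assumption: the function names are hypothetical, and the spec's audio.py may use pydub instead. The flags used (`-y`, `-i`, `-ac`, `-ar`, `-f`) are standard ffmpeg options.

```python
import subprocess


def build_ffmpeg_command(src_path: str, dst_path: str) -> list[str]:
    """Build an ffmpeg command that converts any supported input to 16 kHz mono WAV."""
    return [
        "ffmpeg",
        "-y",            # overwrite the output file if it already exists
        "-i", src_path,  # input; container and codec are auto-detected (e.g. WebM/Opus)
        "-ac", "1",      # downmix to a single (mono) channel
        "-ar", "16000",  # resample to 16 kHz, the rate MedASR expects
        "-f", "wav",     # force WAV output
        dst_path,
    ]


def convert_to_wav(src_path: str, dst_path: str) -> None:
    """Run the conversion; raises CalledProcessError if ffmpeg exits non-zero."""
    subprocess.run(build_ffmpeg_command(src_path, dst_path), check=True, capture_output=True)
```

Keeping the command builder separate from the subprocess call makes the conversion easy to unit-test without ffmpeg installed.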

POST /api/v1/consultations/{id}/end

This is the main pipeline trigger. It finalises the transcript, synthesises context, and generates the document.

// Request: empty body (or optionally upload final audio)
// Response 202 (Accepted — processing started)
{
  "consultation_id": "cons-uuid-xxxx",
  "status": "processing",
  "pipeline_stage": "transcribing",
  "message": "Pipeline started. Poll /progress for updates."
}

// Error 408: {"error": "timeout", "message": "Pipeline exceeded 120s timeout", ...}
// Error 500: {"error": "model_error", "message": "Document generation failed", ...}

GET /api/v1/consultations/{id}/document

// Response 200: ClinicalDocument JSON (see §4 schema)
// Response 404: {"error": "not_found", "message": "No document generated yet"}

POST /api/v1/consultations/{id}/document/regenerate-section

// Request
{"section_index": 2, "instruction": "Make this section more concise"}

// Response 200
{"section_index": 2, "heading": "Investigation results", "content": "...(regenerated)..."}

POST /api/v1/consultations/{id}/document/sign-off

// Request
{"edited_sections": [{"index": 1, "content": "Updated text..."}]}

// Response 200
{"consultation_id": "...", "status": "signed_off", "signed_at": "2026-02-13T14:08:00Z"}

GET /api/v1/consultations/{id}/progress

// Response 200: PipelineProgress JSON
{
  "consultation_id": "...",
  "stage": "generating_document",
  "progress_pct": 66,
  "message": "Generating clinical letter..."
}
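
A client-side polling loop for this endpoint might look like the sketch below. The HTTP call is injected as `fetch_progress` so the example stays self-contained; that callable, the function name, and the one-second interval are assumptions, while the terminal stage names and the 120 s limit come from §4 and PIPELINE_TIMEOUT_S.

```python
import time
from typing import Callable

# Terminal PipelineStage values from §4
TERMINAL_STAGES = {"complete", "failed"}


def wait_for_pipeline(
    fetch_progress: Callable[[], dict],
    timeout_s: float = 120.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the pipeline reaches a terminal stage or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        progress = fetch_progress()  # expected to return a PipelineProgress-shaped dict
        if progress["stage"] in TERMINAL_STAGES:
            return progress
        if time.monotonic() >= deadline:
            raise TimeoutError(f"pipeline still at stage {progress['stage']!r} after {timeout_s}s")
        sleep(interval_s)
```

In the real client, `fetch_progress` would wrap a GET request to `/api/v1/consultations/{id}/progress` and return the decoded JSON.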

6. Model Serving Specification

6a. MedASR (Speech Recognition)

Property	Value
HF Model ID	`google/medasr`
Parameters	105M
Architecture	Conformer-based ASR (AutoModelForSpeechSeq2Seq)
GPU VRAM	~0.5 GB
Quantisation	None needed (small model)
Loading	`transformers.pipeline("automatic-speech-recognition", model="google/medasr", device="cuda:0")`

Input format:

# 16kHz mono WAV, float32 waveform
# Loaded via librosa or pydub
import librosa
waveform, sr = librosa.load("audio.wav", sr=16000, mono=True)
# waveform: numpy array, shape (n_samples,), dtype float32

Inference call:

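# asr_pipeline is the object returned by the transformers.pipeline(...) call in the table above
result = asr_pipeline(
    waveform,
    chunk_length_s=20,
    stride_length_s=(4, 2),
    return_timestamps=True,
    # language/task are Whisper-style generation options; drop generate_kwargs if the MedASR checkpoint rejects them
    generate_kwargs={"language": "en", "task": "transcribe"},
)
transcript_text = result["text"]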

Output format:

{
  "text": "Hello Mrs Thompson, good to see you again. How have you been since we last met?",
  "chunks": [
    {"text": "Hello Mrs Thompson,", "timestamp": [0.0, 1.5]},
    {"text": "good to see you again.", "timestamp": [1.5, 3.2]}
  ]
}

Timeout: 30s per 60s of audio. Retry: 1 retry on timeout.

Mock (local dev): Return text from data/demo/{patient}_transcript.txt.

6b. MedGemma 1.5 4B (EHR Agent)

Property	Value
HF Model ID	`google/medgemma-1.5-4b-it`
Parameters	4B
Architecture	Gemma-based, instruction-tuned, multimodal (text capabilities used)
GPU VRAM	~3 GB (4-bit NF4)
Quantisation	4-bit NF4 via bitsandbytes (`load_in_4bit=True`, `bnb_4bit_compute_dtype=bfloat16`)

Loading:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained("google/medgemma-1.5-4b-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/medgemma-1.5-4b-it",
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

Primary mode — Agentic tool-calling via LangGraph:

The EHR Agent receives a patient ID, plans which FHIR queries to run, executes them, and synthesises a structured PatientContext JSON. If LangGraph tool-calling works reliably, this is the preferred mode.

Fallback mode — Deterministic FHIR + MedGemma summarisation:

If MedGemma 4B's instruction-following is unreliable (see §12), use deterministic Python functions to execute a fixed set of FHIR queries, then pass the raw FHIR JSON to MedGemma 4B for summarisation into the PatientContext schema only.

FHIR tool functions available to the agent:

# Signatures only; bodies live in backend/fhir/tools.py. Query strings follow §7c.
def search_patients(name: str) -> list[dict]: ...       # GET /fhir/Patient?name={name}&_count=10
def get_conditions(patient_id: str) -> list[dict]: ...  # GET /fhir/Condition?patient={id}&clinical-status=active
def get_medications(patient_id: str) -> list[dict]: ... # GET /fhir/MedicationRequest?patient={id}&status=active
def get_observations(patient_id: str, category: str = "laboratory") -> list[dict]: ...
    # GET /fhir/Observation?patient={id}&category={category}&_sort=-date&_count=20
def get_allergies(patient_id: str) -> list[dict]: ...   # GET /fhir/AllergyIntolerance?patient={id}
def get_diagnostic_reports(patient_id: str) -> list[dict]: ...
    # GET /fhir/DiagnosticReport?patient={id}&_sort=-date&_count=5
def get_recent_encounters(patient_id: str) -> list[dict]: ...
    # GET /fhir/Encounter?patient={id}&_sort=-date&_count=3

System prompt (ehr_agent_system.txt):

You are a clinical EHR navigation agent. Your task is to retrieve and synthesise a patient's medical context from FHIR resources to support clinical documentation.

Given a patient ID, use the available FHIR tools to retrieve:
1. Demographics (Patient resource)
2. Active conditions/diagnoses (Condition resources)
3. Current medications (MedicationRequest resources)
4. Allergies (AllergyIntolerance resources)
5. Recent laboratory results — last 6 months (Observation resources, category=laboratory)
6. Recent imaging reports (DiagnosticReport resources)

After retrieval, synthesise the data into the following JSON structure ONLY. Do not include any explanation, commentary, or markdown formatting. Output ONLY valid JSON:

{
  "patient_id": "...",
  "demographics": {...},
  "problem_list": ["..."],
  "medications": [{...}],
  "allergies": [{...}],
  "recent_labs": [{...}],
  "recent_imaging": [{...}],
  "clinical_flags": ["..."],
  "last_letter_excerpt": "...",
  "retrieval_warnings": [],
  "retrieved_at": "..."
}

Output parsing (critical — see §12):

import json
import re

def parse_agent_output(raw_output: str) -> dict:
    """Extract the first JSON object from MedGemma 4B output, stripping meta-commentary."""
    # Remove system prompt leaks
    raw_output = re.sub(r'<\|system\|>.*?<\|end\|>', '', raw_output, flags=re.DOTALL)
    # Remove markdown code fences (```json or bare ```)
    raw_output = re.sub(r'```(?:json)?\s*', '', raw_output)
    # Try each '{' in turn; raw_decode parses one complete object and ignores trailing text,
    # so commentary containing braces after the JSON no longer breaks parsing
    decoder = json.JSONDecoder()
    for match in re.finditer(r'\{', raw_output):
        try:
            obj, _ = decoder.raw_decode(raw_output, match.start())
        except json.JSONDecodeError:
            continue
        return obj
    raise ValueError("No valid JSON found in agent output")

Timeout: 15s for context retrieval. Retry: 1 retry. Fallback on failure: Return partial context with retrieval_warnings.
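
The partial-context fallback can be sketched as a gather step that records a warning for each failed query instead of aborting. This is an illustration under stated assumptions: the `fetchers` mapping, the function name, and the warning format are hypothetical. The queries run concurrently, so applying the 15 s limit to each one keeps the whole retrieval near that bound.

```python
import asyncio
from typing import Awaitable, Callable

Fetcher = Callable[[str], Awaitable[list]]


async def gather_context(patient_id: str, fetchers: dict[str, Fetcher], timeout_s: float = 15.0) -> dict:
    """Run all FHIR fetchers concurrently; failed fields become [] plus a warning."""
    context: dict = {"patient_id": patient_id, "retrieval_warnings": []}

    async def run_one(field: str, fetch: Fetcher) -> None:
        try:
            context[field] = await asyncio.wait_for(fetch(patient_id), timeout=timeout_s)
        except Exception as exc:  # timeouts and FHIR/HTTP errors alike
            context[field] = []
            context["retrieval_warnings"].append(f"{field}: {type(exc).__name__}")

    await asyncio.gather(*(run_one(field, fetch) for field, fetch in fetchers.items()))
    return context
```

A caller would pass something like `{"medications": get_medications, "allergies": get_allergies}` with async versions of the tool functions, then surface `retrieval_warnings` in the UI.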

6c. MedGemma 27B (Document Generation)

Property	Value
HF Model ID	`google/medgemma-27b-text-it`
Parameters	27B
Architecture	Gemma-based, text-only, instruction-tuned
GPU VRAM	~16 GB (4-bit NF4)
Quantisation	4-bit NF4 via bitsandbytes (same config as 4B but for 27B model)

Loading:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained("google/medgemma-27b-text-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/medgemma-27b-text-it",
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

Prompt template (document_generation.j2):

<|system|>
You are an NHS clinical documentation assistant. Generate a structured NHS clinic letter from the consultation transcript and patient context provided below.

FORMAT REQUIREMENTS:
- Date: {{ letter_date }}
- Addressee: GP (name and address from patient record)
- Re: Patient name, DOB
- Salutation: "Dear Dr. [GP name],"
- Sections: History of presenting complaint | Examination findings (if discussed) | Investigation results (use EXACT values from patient context) | Assessment and plan | Current medications
- Sign-off: "Yours sincerely, {{ clinician_name }}, {{ clinician_title }}"

RULES:
1. Use EXACT lab values from the patient context — do not fabricate or round values.
2. Include both positive and negative findings discussed in the consultation.
3. If the transcript mentions a result, cross-reference it with the patient context. If values differ, flag with [DISCREPANCY].
4. Write in third person, past tense, formal British medical English.
5. Do NOT include information not discussed in the consultation or present in the patient context.
6. Keep the letter concise — aim for 300-500 words.
<|end|>

<|user|>
## CONSULTATION TRANSCRIPT
{{ transcript }}

## PATIENT CONTEXT (from Electronic Health Record)
{{ context_json }}

Generate the NHS clinic letter now.
<|end|>

<|assistant|>

Generation parameters:

generation_config = {
    "max_new_tokens": 2048,
    "temperature": 0.3,
    "top_p": 0.9,
    "top_k": 40,
    "do_sample": True,
    "repetition_penalty": 1.1,
}

Output parsing: Split generated text into sections by detecting headings (bold markers or known section names). Return as list of DocumentSection objects.
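
One way to implement the heading-based split is sketched below, assuming the model emits each section heading on its own line, optionally wrapped in `**`, prefixed with `#`, or followed by a colon. The heading list mirrors the prompt template; the function name and the dict output (standing in for DocumentSection) are illustrative, and text before the first heading, such as the salutation, is ignored here.

```python
# Section names taken from the FORMAT REQUIREMENTS in the prompt template
KNOWN_HEADINGS = [
    "History of presenting complaint",
    "Examination findings",
    "Investigation results",
    "Assessment and plan",
    "Current medications",
]


def split_sections(letter_text: str) -> list[dict]:
    """Split a generated letter into {'heading', 'content'} dicts by known headings."""
    lookup = {h.lower(): h for h in KNOWN_HEADINGS}
    sections: list[dict] = []
    current = None
    for line in letter_text.splitlines():
        # Normalise '**Heading**', '## Heading' and 'Heading:' to a bare lower-case name
        candidate = line.strip().strip("*#").strip().rstrip(":").strip().lower()
        if candidate in lookup:
            current = {"heading": lookup[candidate], "lines": []}
            sections.append(current)
        elif current is not None:
            current["lines"].append(line)
    return [{"heading": s["heading"], "content": "\n".join(s["lines"]).strip()} for s in sections]
```

Each returned dict maps directly onto the `heading` and `content` fields of DocumentSection in §4.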

Timeout: 90s. Retry: 1 retry with reduced max_new_tokens=1024. Fallback on total failure: Use MedGemma 4B with extensive prompt engineering for generation (lower quality but functional).

Fallback loading if 27B fails on A100 40GB:

Try GGUF Q8_0 via Ollama (ollama run hf.co/unsloth/medgemma-27b-it-GGUF:Q8_0). Switch inference to Ollama REST API (POST http://localhost:11434/api/generate).
If Ollama fails: use MedGemma 4B for generation.

7. FHIR Server Specification

7a. Primary: HAPI FHIR Server

Image: hapiproject/hapi:v7.4.0
FHIR Version: R4
Port: 8080 (internal)
Data: 50 Synthea-generated UK-style patients loaded via POST /fhir Bundle transactions.

7b. FHIR Resources Used

Resource Type	Purpose	Key Fields
Patient	Demographics	name, birthDate, identifier (NHS number), gender, address
Condition	Problem list / diagnoses	code (SNOMED), clinicalStatus, onsetDateTime
MedicationRequest	Current medications	medicationCodeableConcept, dosageInstruction, status=active
Observation	Lab results	code (LOINC), valueQuantity, effectiveDateTime, referenceRange
AllergyIntolerance	Allergies	code, reaction, criticality
DiagnosticReport	Imaging / reports	code, conclusion, effectiveDateTime
Encounter	Recent visits	type, period, reasonCode

7c. FHIR Query Patterns

GET /fhir/Patient/{id}
GET /fhir/Patient?name={name}&_count=10
GET /fhir/Condition?patient={id}&clinical-status=active
GET /fhir/MedicationRequest?patient={id}&status=active
GET /fhir/Observation?patient={id}&category=laboratory&_sort=-date&_count=20
GET /fhir/AllergyIntolerance?patient={id}
GET /fhir/DiagnosticReport?patient={id}&_sort=-date&_count=5
GET /fhir/Encounter?patient={id}&_sort=-date&_count=3
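
As an illustration, the per-patient search patterns above can be built from one table of parameters. This is a sketch only: `build_search_url`, `SEARCH_PARAMS`, and the base URL (host assumed; port 8080 from §7a) are hypothetical names, not part of the specified FHIR client.

```python
from urllib.parse import urlencode

FHIR_BASE_URL = "http://localhost:8080/fhir"  # assumed host; port per §7a

# Extra search parameters per resource type, mirroring the patterns listed above.
SEARCH_PARAMS = {
    "Condition": {"clinical-status": "active"},
    "MedicationRequest": {"status": "active"},
    "Observation": {"category": "laboratory", "_sort": "-date", "_count": 20},
    "AllergyIntolerance": {},
    "DiagnosticReport": {"_sort": "-date", "_count": 5},
    "Encounter": {"_sort": "-date", "_count": 3},
}


def build_search_url(
    resource_type: str, patient_id: str, base_url: str = FHIR_BASE_URL
) -> str:
    """Return the search URL for one resource type, scoped to one patient."""
    if resource_type not in SEARCH_PARAMS:
        raise ValueError(f"Unsupported resource type: {resource_type}")
    params = {"patient": patient_id, **SEARCH_PARAMS[resource_type]}
    return f"{base_url}/{resource_type}?{urlencode(params)}"
```

Keeping the parameters in one table also gives `test_fhir_client.py` a single place to compare against when checking that URLs match the expected patterns.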

7d. Fallback: Mock FHIR API

If HAPI FHIR setup fails within HF Spaces Docker (Risk 5 in masterplan.md §8), replace with backend/fhir/mock_api.py — a FastAPI app that serves pre-loaded JSON from data/fhir_bundles/. Exposes the same REST endpoints. The EHR Agent code makes identical HTTP calls either way.

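# mock_api.py (simplified structure)
# Assumed layout: one Bundle JSON file per patient at data/fhir_bundles/{patient_id}.json
import json
from pathlib import Path
from fastapi import FastAPI, HTTPException

app = FastAPI()
BUNDLES_DIR = Path("data/fhir_bundles")

def load_resources(patient_id: str, resource_type: str) -> list[dict]:
    """Return every resource of one type from the patient's bundle file."""
    path = BUNDLES_DIR / f"{patient_id}.json"
    entries = json.loads(path.read_text(encoding="utf-8")).get("entry", []) if path.is_file() else []
    return [e["resource"] for e in entries if e["resource"].get("resourceType") == resource_type]

def as_searchset(resources: list[dict]) -> dict:
    """Wrap results in a searchset Bundle, the shape a FHIR search returns."""
    return {"resourceType": "Bundle", "type": "searchset", "total": len(resources),
            "entry": [{"resource": r} for r in resources]}

@app.get("/fhir/Patient/{patient_id}")
async def get_patient(patient_id: str):
    matches = load_resources(patient_id, "Patient")
    if not matches:
        raise HTTPException(status_code=404, detail="Patient not found")
    return matches[0]

@app.get("/fhir/Condition")
async def get_conditions(patient: str):
    return as_searchset(load_resources(patient, "Condition"))

# ... same pattern for all resource types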

8. Synthetic Data Specification

8a. Patients

50 Synthea-generated patients with UK customisation:

Names: UK-style (e.g., Margaret Thompson, Emeka Okafor, Priya Patel). Manually patched for 5 demo patients.
Identifiers: NHS numbers (format: XXX XXX XXXX, 10 digits, valid checksum).
Units: mmol/L for glucose, mmol/mol for HbA1c, μmol/L for creatinine, mL/min/1.73 m² for eGFR.
Drug names: BNF-standard (metformin, ramipril, atorvastatin — not brand names).
Clinical scenarios: Distributed across: diabetes (10), COPD (5), heart failure (5), CKD (5), hypertension (5), cancer follow-up (3), mental health (3), orthopaedic (3), asthma (5), miscellaneous (6).
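
The "valid checksum" requirement for NHS numbers refers to the Modulus 11 check: the first nine digits are weighted 10 down to 2, and the tenth digit is the check digit. The sketch below shows that check; the function names are hypothetical and not part of the spec.

```python
def is_valid_nhs_number(value: str) -> bool:
    """Validate a 10-digit NHS number using the Modulus 11 check digit."""
    digits = value.replace(" ", "")
    if len(digits) != 10 or not (digits.isascii() and digits.isdigit()):
        return False
    # Weight the first nine digits 10, 9, ..., 2 and sum the products.
    total = sum(int(d) * w for d, w in zip(digits[:9], range(10, 1, -1)))
    check = 11 - (total % 11)
    if check == 11:
        check = 0
    # A computed check digit of 10 means the number can never be valid.
    return check != 10 and check == int(digits[9])


def format_nhs_number(value: str) -> str:
    """Format a 10-digit NHS number as XXX XXX XXXX."""
    digits = value.replace(" ", "")
    return f"{digits[:3]} {digits[3:6]} {digits[6:]}"
```

Running a check like this over every generated identifier before loading the bundles would catch Synthea patients whose patched NHS numbers fail validation.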

8b. Demo Patients (3 primary + 2 supporting)

#	Name	Age/Sex	Scenario	Key FHIR Data	Demo Audio
1	Mrs. Margaret Thompson	67F	T2DM, rising HbA1c, start gliclazide	HbA1c 55↑ (was 48), eGFR 52↓, Penicillin allergy, Metformin 1g BD	✅ ~60s WAV
2	Mr. Emeka Okafor	54M	Chest pain follow-up post-angiography	Normal coronaries on angiogram, Troponin negative, BP 148/92	✅ ~60s WAV
3	Ms. Priya Patel	28F	Asthma review, poor inhaler technique	Peak flow 320 (pred 450), Salbutamol 4x/week, no preventer	✅ ~60s WAV
4	Mr. David Williams	72M	Heart failure review	EF 35%, BNP 450, on bisoprolol + ramipril + furosemide	Dashboard only
5	Mrs. Fatima Khan	45F	Depression follow-up	PHQ-9 score 12, on sertraline 100mg	Dashboard only

8c. Example FHIR Patient Resource

{
  "resourceType": "Patient",
  "id": "pt-001",
  "identifier": [
    {
      "system": "https://fhir.nhs.uk/Id/nhs-number",
      "value": "9434765919"
    }
  ],
  "name": [
    {
      "use": "official",
      "prefix": ["Mrs"],
      "given": ["Margaret"],
      "family": "Thompson"
    }
  ],
  "gender": "female",
  "birthDate": "1958-03-14",
  "address": [
    {
      "line": ["12 Oak Lane"],
      "city": "London",
      "postalCode": "SE1 4AB",
      "country": "GB"
    }
  ],
  "generalPractitioner": [
    {
      "display": "Dr. R. Patel, Riverside Medical Centre"
    }
  ]
}

8d. Example FHIR Observation (Lab Result)

{
  "resourceType": "Observation",
  "id": "obs-hba1c-001",
  "status": "final",
  "category": [
    {
      "coding": [
        {
          "system": "http://terminology.hl7.org/CodeSystem/observation-category",
          "code": "laboratory"
        }
      ]
    }
  ],
  "code": {
    "coding": [
      {
        "system": "http://loinc.org",
        "code": "59261-8",
        "display": "Hemoglobin A1c/Hemoglobin.total in Blood by IFCC protocol"
      }
    ],
    "text": "HbA1c"
  },
  "subject": {"reference": "Patient/pt-001"},
  "effectiveDateTime": "2026-01-15",
  "valueQuantity": {
    "value": 55,
    "unit": "mmol/mol",
    "system": "http://unitsofmeasure.org",
    "code": "mmol/mol"
  },
  "referenceRange": [
    {
      "low": {"value": 20, "unit": "mmol/mol"},
      "high": {"value": 42, "unit": "mmol/mol"},
      "text": "20-42 mmol/mol (normal)"
    }
  ]
}

8e. Example FHIR AllergyIntolerance

{
  "resourceType": "AllergyIntolerance",
  "id": "allergy-001",
  "clinicalStatus": {
    "coding": [{"system": "http://terminology.hl7.org/CodeSystem/allergyintolerance-clinical", "code": "active"}]
  },
  "type": "allergy",
  "category": ["medication"],
  "criticality": "high",
  "code": {
    "coding": [{"system": "http://snomed.info/sct", "code": "764146007", "display": "Penicillin"}],
    "text": "Penicillin"
  },
  "patient": {"reference": "Patient/pt-001"},
  "reaction": [
    {
      "manifestation": [{"coding": [{"display": "Anaphylaxis"}]}],
      "severity": "severe"
    }
  ]
}

8f. Example FHIR Condition

{
  "resourceType": "Condition",
  "id": "cond-t2dm-001",
  "clinicalStatus": {
    "coding": [{"system": "http://terminology.hl7.org/CodeSystem/condition-clinical", "code": "active"}]
  },
  "code": {
    "coding": [{"system": "http://snomed.info/sct", "code": "44054006", "display": "Type 2 diabetes mellitus"}],
    "text": "Type 2 Diabetes Mellitus"
  },
  "subject": {"reference": "Patient/pt-001"},
  "onsetDateTime": "2019-06-01"
}

8g. Audio Files

Format: WAV, 16kHz sample rate, mono, 16-bit PCM.
Duration: 60–90 seconds each.
Content: Simulated clinician–patient dialogue. Clear speech, minimal background noise. UK-accented English where possible.
Generation: Self-recorded or generated via TTS (e.g., Google Cloud TTS with en-GB voices). Post-processed with ffmpeg -i input.webm -ar 16000 -ac 1 -acodec pcm_s16le output.wav.

8h. Demo Clinic List (clinic_list.json)

{
  "clinician": {
    "name": "Dr. Sarah Chen",
    "specialty": "Diabetes & Endocrinology",
    "title": "Consultant Diabetologist"
  },
  "date": "13 February 2026",
  "patients": [
    {"id": "pt-001", "name": "Mrs. Margaret Thompson", "age": 67, "sex": "Female", "time": "14:00", "summary": "Follow-up — Type 2 Diabetes, rising HbA1c"},
    {"id": "pt-002", "name": "Mr. Emeka Okafor", "age": 54, "sex": "Male", "time": "14:20", "summary": "Follow-up — Chest pain, post-angiography"},
    {"id": "pt-003", "name": "Ms. Priya Patel", "age": 28, "sex": "Female", "time": "14:40", "summary": "Review — Asthma, poor symptom control"},
    {"id": "pt-004", "name": "Mr. David Williams", "age": 72, "sex": "Male", "time": "15:00", "summary": "Review — Heart failure, recent decompensation"},
    {"id": "pt-005", "name": "Mrs. Fatima Khan", "age": 45, "sex": "Female", "time": "15:20", "summary": "Follow-up — Depression, medication review"}
  ]
}

9. Frontend–Backend Integration

9a. Architecture Pattern

Single Gradio app with embedded FastAPI. The Gradio Blocks UI and the FastAPI backend run in the same Python process. Gradio handles the browser-facing UI and calls backend functions directly via Python (no HTTP for Gradio ↔ backend communication within the same process). The FastAPI routes are mounted for external access and for structured API contracts (testing, future clients).

# app.py — simplified structure
import gradio as gr
from fastapi import FastAPI
from backend.api import router as api_router
from frontend.ui import build_ui

fast_api = FastAPI()
fast_api.include_router(api_router, prefix="/api/v1")

demo = build_ui()  # Returns gr.Blocks
# mount_gradio_app returns the FastAPI app with the Gradio UI mounted at "/"
fast_api = gr.mount_gradio_app(fast_api, demo, path="/")

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(fast_api, host="0.0.0.0", port=7860)

9b. State Management

Session state: Managed via gr.State — holds the current Consultation object (including patient, transcript, context, document, status).
No database. All state is in-memory per session. State resets on page refresh (per userflow.md §5).
No localStorage. No browser-side persistence.

9c. Screen ↔ Backend Mapping

Screen	User Action	Gradio Event	Backend Call	UI Update
S1 (Dashboard)	Click patient card	`gr.Button.click`	`get_patient_context(patient_id)` → calls FHIR agent	Transition to S2, populate context panel
S2 (Patient Context)	Click "Start Consultation"	`gr.Button.click`	`start_consultation(patient_id)`	Transition to S3, start audio capture JS
S3 (Live Consultation)	Audio streaming	JavaScript `MediaRecorder` → WebSocket or periodic upload	`upload_audio_chunk()`	Live transcript panel updates
S3	Click "End Consultation"	`gr.Button.click`	`end_consultation(consultation_id)` → triggers full pipeline	Transition to S4, show progress
S4 (Processing)	Automatic	Polling via `gr.Timer` `tick` event	`get_pipeline_progress()`	Progress bar fills through 3 stages
S4 → S5	Pipeline complete	Progress reaches 100%	`get_document()`	Transition to S5, reveal letter
S5 (Document Review)	Click paragraph to edit	JavaScript `contenteditable`	Edit stored in `gr.State`	Paragraph highlight, gold border
S5	Click "Regenerate" on section	`gr.Button.click`	`regenerate_section(consultation_id, section_idx)`	Section skeleton → new text
S5	Click "Sign Off"	`gr.Button.click`	`sign_off(consultation_id, edited_sections)`	Transition to S6, status → green
S6 (Signed Off)	Click "Next Patient"	`gr.Button.click`	Reset `gr.State`	Transition to S1

9d. Real-Time Updates

Pipeline progress (S4): Use gr.Timer(1) and its tick event to call get_pipeline_progress() every second during the processing state. When stage == "complete", stop polling by returning gr.Timer(active=False) from the handler, then transition to S5.

Live transcript (S3): Two options depending on Gradio capability:

Preferred: JavaScript interop — browser MediaRecorder captures audio chunks every 5s, sends via fetch() to /api/v1/consultations/{id}/audio, receives partial transcript in response. Update transcript gr.Textbox via Gradio event.
Fallback: No streaming transcript. Audio is captured entirely in browser, sent as one file when "End Consultation" is clicked. Transcript appears only during processing.

9e. Audio Capture

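// JavaScript injected into the page via the js= or head= parameters of gr.Blocks
// Captures audio from the browser microphone and sends chunks to the backend.
// Only the first chunk carries the WebM header, so the server must append
// later chunks to it before decoding.

async function startRecording(consultationId) {
    const stream = await navigator.mediaDevices.getUserMedia({audio: true});
    const mediaRecorder = new MediaRecorder(stream, {mimeType: 'audio/webm;codecs=opus'});
    mediaRecorder.ondataavailable = async (e) => {
        if (e.data.size === 0) return; // Skip empty chunks
        const formData = new FormData();
        formData.append('audio_file', e.data, 'chunk.webm');
        formData.append('is_final', 'false');
        await fetch(`/api/v1/consultations/${consultationId}/audio`, {
            method: 'POST', body: formData
        });
    };
    mediaRecorder.start(5000); // Chunk every 5 seconds
    return mediaRecorder; // Caller stops it on "End Consultation"
}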

Server-side conversion in backend/audio.py:

from pydub import AudioSegment

def convert_to_wav_16k(input_path: str, output_path: str) -> str:
    """Convert any audio format to 16kHz mono WAV for MedASR."""
    audio = AudioSegment.from_file(input_path)
    audio = audio.set_frame_rate(16000).set_channels(1).set_sample_width(2)
    audio.export(output_path, format="wav")
    return output_path

10. Error Handling and Resilience

10a. Tiered Error Strategy

Tier 1 — Self-healing (user never notices):

Failure	Retry Policy	Circuit Breaker
FHIR query timeout	2 retries, backoff [1s, 3s]	After 3 consecutive failures, mark FHIR as degraded
MedASR chunk processing error	1 retry immediately	Skip chunk, proceed with remaining audio
MedGemma 4B slow response	1 retry with simplified prompt	After 2 failures, switch to deterministic FHIR fallback
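
The FHIR row of the Tier 1 table (2 retries with backoff [1s, 3s], degraded after 3 consecutive failures) could be sketched as follows. This is illustrative only: `FhirCircuit` and `query_with_retry` are hypothetical names, the built-in `TimeoutError` stands in for whatever timeout exception the HTTP client raises, and a "failure" is assumed to mean one query that still fails after its retries.

```python
import time
from typing import Any, Callable, Optional, Sequence


class FhirCircuit:
    """Counts consecutive failed queries; degraded once the threshold is hit."""

    def __init__(self, failure_threshold: int = 3) -> None:
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    @property
    def degraded(self) -> bool:
        return self.consecutive_failures >= self.failure_threshold

    def record(self, success: bool) -> None:
        # Any success resets the streak; each failure extends it.
        self.consecutive_failures = 0 if success else self.consecutive_failures + 1


def query_with_retry(
    fetch: Callable[[], Any],
    circuit: FhirCircuit,
    backoff_s: Sequence[float] = (1.0, 3.0),
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[Any]:
    """Run fetch() with retries; return None if degraded or all attempts time out."""
    if circuit.degraded:
        return None  # Skip the call entirely while FHIR is marked degraded.
    for attempt in range(len(backoff_s) + 1):
        try:
            result = fetch()
        except TimeoutError:  # assumed stand-in for the client's timeout error
            if attempt < len(backoff_s):
                sleep(backoff_s[attempt])
            continue
        circuit.record(success=True)
        return result
    circuit.record(success=False)
    return None
```

The sketch never re-opens a degraded circuit. A real implementation would need some reset rule, such as a periodic probe request, which the table above does not specify.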

Tier 2 — Graceful degradation (user informed, workflow continues):

Component Failure	Degraded Behaviour	User Sees
MedGemma 4B returns no/partial context	Generate letter from transcript only (no EHR enrichment)	Warning badge on S2: "Some records unavailable"
MedASR returns empty transcript	Prompt user to re-record or upload audio	Alert on S4: "Audio could not be transcribed"
FHIR server down entirely	All context panels show "EHR unavailable"	Warning on S2 + letter generated from transcript only
Section regeneration fails	Keep existing section text	Toast: "Could not regenerate. Original text preserved."

Tier 3 — Informative failure (user must act):

Failure	Error Message	Actions
MedGemma 27B OOM	"Document generation failed due to server memory. Please try again."	"Retry" button, "Return to Dashboard"
Audio file corrupted	"The audio file appears to be corrupted. Please re-record."	"Re-record", "Upload Audio File"
Pipeline timeout (>120s)	"Document generation is taking longer than expected."	"Retry", "Return to Dashboard"
GPU unavailable	"Clarke requires GPU acceleration which is currently unavailable."	"Return to Dashboard"

10b. Error Response Format

All API errors return ErrorResponse (see §4 schema):

{
  "error": "model_error",
  "message": "Document generation failed. Please try again.",
  "detail": "RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB",
  "consultation_id": "cons-uuid-xxxx",
  "timestamp": "2026-02-13T14:07:32Z"
}

message is shown to the user.
detail is logged but never shown to the user.

10c. Logging

from loguru import logger

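# Default "component" so records logged without bind() still format correctly
logger.configure(extra={"component": "app"})

# Format: timestamp | level | component | message
logger.add(
    "logs/clarke_{time}.log",
    format="{time:YYYY-MM-DD HH:mm:ss.SSS} | {level:<7} | {extra[component]:<15} | {message}",
    rotation="50 MB",
    retention="7 days",
    level="DEBUG",
)

# Usage: keyword arguments fill the {placeholders} in the message
logger.bind(component="medasr").info("Transcription complete in {duration_s}s, {word_count} words", duration_s=12.3, word_count=245)
# loguru has no exc_info argument; opt(exception=True) attaches the active exception's traceback
logger.bind(component="ehr_agent").opt(exception=True).error("FHIR query failed for patient {patient_id}", patient_id="pt-001")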

Logs stored in logs/ directory (not user-accessible). On HF Spaces, print() output goes to the Space's log tab.

11. Testing Specification

11a. Unit Tests

Test File	What It Validates	Pass Criteria
`test_schemas.py`	All Pydantic models validate with valid data, reject invalid data	All valid fixtures pass validation; invalid fixtures raise `ValidationError`
`test_medasr.py`	Audio preprocessing (resample, mono conversion); mock transcription returns expected text	WAV output is 16kHz mono; mock returns correct transcript
`test_ehr_agent.py`	FHIR tool functions return valid JSON; output parser extracts JSON from messy output; deterministic fallback works	Tools return FHIR-compliant JSON; parser handles meta-commentary
`test_doc_generator.py`	Prompt template renders correctly with all variables; output parser splits sections; mock generation returns valid ClinicalDocument	Template contains transcript + context; parsed sections have headings
`test_fhir_client.py`	FHIR client constructs correct query URLs; handles 404 and timeout gracefully	URLs match expected patterns; errors return empty results, not exceptions
`test_api.py`	Each API endpoint returns correct status code and schema	`/health` → 200, `/patients` → 200 with list, `/consultations/start` → 201, etc.

11b. Integration Tests

Test	What It Validates	Pass Criteria
`test_e2e.py::test_full_pipeline`	Audio file → MedASR → transcript → EHR Agent → context → MedGemma 27B → document	Document contains both transcript content AND FHIR-sourced lab values
`test_e2e.py::test_mrs_thompson_scenario`	Mrs Thompson demo scenario produces clinically appropriate letter	Letter mentions HbA1c 55, eGFR 52, Penicillin allergy, gliclazide
`test_e2e.py::test_pipeline_timeout`	Pipeline respects 120s timeout and returns graceful error	Returns ErrorResponse with `error="timeout"` within 130s
`test_e2e.py::test_fhir_failure_degradation`	Pipeline continues when FHIR server is unreachable	Letter is generated from transcript only; context warnings present

11c. Smoke Tests (Pre-Demo)

Open Clarke in incognito browser → dashboard loads with 5 patients.
Select Mrs. Thompson → context panel populates within 10s.
Click "Start Consultation" → play mrs_thompson.wav → click "End Consultation".
Letter appears within 60s containing HbA1c value from FHIR.
Edit one paragraph → Sign Off → status turns green.

12. Known Technical Pitfalls and Defensive Coding Requirements

Pitfall 1: MedGemma 4B Instruction-Following Bugs

Issue: MedGemma 4B (google/medgemma-1.5-4b-it) is reported to leak system prompts into output, generate meta-commentary ("Here is the JSON you requested:"), output chain-of-thought training artifacts, and include special tokens in responses.

Defensive coding:

Always parse output with parse_agent_output() (§6b) — never pass raw model output to downstream components.
Strip everything before the first { and after the last } in JSON extraction.
Validate parsed JSON against the PatientContext Pydantic schema — reject and retry if validation fails.
Implement deterministic FHIR fallback in backend/fhir/queries.py — if agentic tool-calling fails after 2 attempts, switch to hardcoded FHIR queries + MedGemma 4B summarisation-only mode.
Test with ≥5 different patient scenarios before declaring the agent working.
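
The "first { to last }" extraction rule above can be sketched as below. The helper name `extract_json_object` is hypothetical; the project's own `parse_agent_output()` is defined in §6b and may differ. Validation against the `PatientContext` schema would follow as a separate step.

```python
import json
from typing import Any, Optional


def extract_json_object(raw: str) -> Optional[dict[str, Any]]:
    """Return the JSON object embedded in raw model output, or None.

    Everything before the first "{" and after the last "}" is discarded,
    which drops meta-commentary, code-fence markers, and trailing special tokens.
    """
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(raw[start : end + 1])
    except json.JSONDecodeError:
        return None  # Caller should treat this as a failed attempt and retry.
    return parsed if isinstance(parsed, dict) else None
```

Returning `None` instead of raising lets the caller count failed attempts and switch to the deterministic FHIR fallback after the second one.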

Pitfall 2: MedGemma 27B VRAM Requirements

Issue: MedGemma 27B requires ~54GB VRAM unquantised. Even with 4-bit NF4 quantisation (~16GB), three models loaded simultaneously need ~19.5GB, leaving limited headroom on A100 40GB.

Defensive coding:

Always load with BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_use_double_quant=True). Without bnb_4bit_quant_type="nf4" the library defaults to FP4.
Use device_map="auto" to allow accelerate to manage memory.
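Monitor VRAM before inference: torch.cuda.memory_allocated() returns bytes. If more than 35GB is allocated, call torch.cuda.empty_cache() before generation to release unused cached blocks. This does not free memory held by live tensors such as an active KV cache.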
For fine-tuning: Unload MedASR and MedGemma 4B before training. Reload after.
Fallback path: If 27B fails to load → try Ollama GGUF Q8_0 → if that fails → use 4B for generation.

Pitfall 3: GPU Memory Management with Three Concurrent Models

Issue: MedASR, MedGemma 4B, and MedGemma 27B all on one GPU. KV cache growth during generation can cause OOM.

Defensive coding:

Sequential inference, not parallel. Never run two models simultaneously.
Call torch.cuda.empty_cache() between each model's inference step.
Set max_new_tokens conservatively (2048 for 27B, 512 for 4B).
Implement OOM recovery in orchestrator: catch torch.cuda.OutOfMemoryError, clear cache, reduce max_new_tokens by 50%, retry once.
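
The OOM recovery rule (catch, clear cache, halve `max_new_tokens`, retry once) can be sketched without importing torch. The function name and the `generate` callable are hypothetical. In the orchestrator, `oom_error` would be `torch.cuda.OutOfMemoryError` and `clear_cache` would be `torch.cuda.empty_cache`; the defaults below are stand-ins so the sketch runs on its own.

```python
from typing import Callable, Type


def generate_with_oom_retry(
    generate: Callable[[int], str],                  # hypothetical: takes max_new_tokens
    max_new_tokens: int = 2048,
    oom_error: Type[BaseException] = MemoryError,    # stand-in for torch.cuda.OutOfMemoryError
    clear_cache: Callable[[], None] = lambda: None,  # stand-in for torch.cuda.empty_cache
) -> str:
    """Run generate(); on OOM, clear the cache and retry once with half the tokens."""
    try:
        return generate(max_new_tokens)
    except oom_error:
        clear_cache()
        # A second OOM propagates to the caller, which maps it to the Tier 3 error.
        return generate(max(1, max_new_tokens // 2))
```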

Pitfall 4: Gradio-Specific Limitations

Issue: Gradio has constraints around JavaScript interop, WebSocket support, and custom CSS injection.

Defensive coding:

Audio capture: Use gr.Audio(sources=["microphone"]) as primary method. If JavaScript MediaRecorder interop is unreliable in Gradio 5.x, rely on this native component alone (records complete audio, no streaming).
Custom CSS: Inject via gr.Blocks(css_paths="frontend/assets/style.css"); in Gradio 5.x the css parameter expects a CSS string, not a file path. Test that CSS custom properties (--clarke-blue, etc.) apply correctly.
State management: Use gr.State for consultation state. Test that state persists across event callbacks within a single session.
Inline editing: Gradio doesn't natively support contenteditable. Use gr.Textbox(interactive=True) per section, or inject custom HTML with JavaScript for inline editing. Test editing works before committing to a pattern.

Pitfall 5: HAPI FHIR Server in Docker-in-Docker

Issue: HF Spaces runs inside Docker. Running HAPI FHIR (another Docker container) inside that may not work.

Defensive coding:

Default to mock FHIR API (USE_MOCK_FHIR=true) for HF Spaces deployment. The mock API serves identical data.
If HAPI FHIR is used: Run as a separate process (not Docker-in-Docker). Use HAPI FHIR's embedded mode (Java JAR) or the Python mock as primary.
Decision point: If FHIR server isn't running by end of Hour 2, switch to mock immediately.

Pitfall 6: Audio Format Compatibility

Issue: Browser MediaRecorder outputs WebM/Opus. MedASR expects 16kHz mono WAV.

Defensive coding:

Always convert server-side via backend/audio.py using pydub + ffmpeg.
Validate audio before passing to MedASR: check sample rate = 16000, channels = 1, duration > 5s.
Handle empty/corrupted audio: Return ErrorResponse with actionable message, not a raw exception.
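
The pre-MedASR checks listed above (16000 Hz, mono, longer than 5s, readable file) can be sketched with the standard-library `wave` module. `validate_wav` and `AudioValidationError` are hypothetical names; the orchestrator would translate the exception into an `ErrorResponse`.

```python
import wave


class AudioValidationError(ValueError):
    """Raised with a user-facing message when audio is unusable."""


def validate_wav(path: str, min_duration_s: float = 5.0) -> float:
    """Check a converted WAV file and return its duration in seconds."""
    try:
        with wave.open(path, "rb") as wav:
            rate = wav.getframerate()
            channels = wav.getnchannels()
            frames = wav.getnframes()
    except (wave.Error, EOFError) as exc:
        # Empty or malformed files fail here instead of surfacing a raw exception.
        raise AudioValidationError(
            "The audio file appears to be corrupted. Please re-record."
        ) from exc
    if rate != 16000:
        raise AudioValidationError(f"Expected 16000 Hz audio, got {rate} Hz.")
    if channels != 1:
        raise AudioValidationError(f"Expected mono audio, got {channels} channels.")
    duration_s = frames / rate
    if duration_s <= min_duration_s:
        raise AudioValidationError(
            f"Recording is too short ({duration_s:.1f}s). Please re-record."
        )
    return duration_s
```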

This document is the engineering blueprint for Clarke. Every directory, schema, endpoint, model configuration, and error path is defined here with enough specificity for Codex to implement without ambiguity. When a value can be specified exactly, it is. All specifications are consistent with clarke_PRD_masterplan.md (goals, constraints, risks), clarke_PRD_implementation.md (build sequence), clarke_PRD_design_guidelines.md (visual tokens referenced by frontend), and clarke_PRD_userflow.md (every screen and interaction mapped to a backend call). Uncertain decisions are flagged with fallback paths.