Instructions for using odytrice/kenichi-thinking-GGUF with libraries and local apps.

Libraries

How to use odytrice/kenichi-thinking-GGUF with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="odytrice/kenichi-thinking-GGUF")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("odytrice/kenichi-thinking-GGUF", dtype="auto")

llama-cpp-python

How to use odytrice/kenichi-thinking-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="odytrice/kenichi-thinking-GGUF",
	filename="F16.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": [
				{
					"type": "text",
					"text": "Describe this image in one sentence."
				},
				{
					"type": "image_url",
					"image_url": {
						"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
					}
				}
			]
		}
	]
)

llama.cpp

How to use odytrice/kenichi-thinking-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf odytrice/kenichi-thinking-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf odytrice/kenichi-thinking-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf odytrice/kenichi-thinking-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf odytrice/kenichi-thinking-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf odytrice/kenichi-thinking-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf odytrice/kenichi-thinking-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf odytrice/kenichi-thinking-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf odytrice/kenichi-thinking-GGUF:Q4_K_M

Use Docker

docker model run hf.co/odytrice/kenichi-thinking-GGUF:Q4_K_M
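
Once `llama-server` is running, it can also be queried from Python. The following is a minimal sketch using only the standard library. It assumes the server listens on `http://localhost:8080/v1` (the base URL used in the Pi configuration in this card); the helper names `build_payload` and `chat` are illustrative and not part of any library.

```python
import json
import urllib.request

# Assumed default: llama-server's OpenAI-compatible endpoint on port 8080.
BASE_URL = "http://localhost:8080/v1"
MODEL_ID = "odytrice/kenichi-thinking-GGUF:Q4_K_M"


def build_payload(prompt, image_url=None, model=MODEL_ID):
    """Build an OpenAI-style chat completion request body."""
    content = [{"type": "text", "text": prompt}]
    if image_url is not None:
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    return {"model": model, "messages": [{"role": "user", "content": content}]}


def chat(prompt, image_url=None, base_url=BASE_URL, model=MODEL_ID):
    """POST the request to a running server and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_payload(prompt, image_url, model)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Calling `chat("Write an F# hello world.")` requires the server to be up. The same helper should also work against the vLLM or SGLang servers described in this card by passing their base URL (for example `http://localhost:30000/v1` for SGLang) and the model name used there.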

vLLM

How to use odytrice/kenichi-thinking-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "odytrice/kenichi-thinking-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "odytrice/kenichi-thinking-GGUF",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/odytrice/kenichi-thinking-GGUF:Q4_K_M

SGLang

How to use odytrice/kenichi-thinking-GGUF with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "odytrice/kenichi-thinking-GGUF" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "odytrice/kenichi-thinking-GGUF",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "odytrice/kenichi-thinking-GGUF" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "odytrice/kenichi-thinking-GGUF",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Ollama
How to use odytrice/kenichi-thinking-GGUF with Ollama:
```
ollama run hf.co/odytrice/kenichi-thinking-GGUF:Q4_K_M
```

Unsloth Studio

How to use odytrice/kenichi-thinking-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for odytrice/kenichi-thinking-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for odytrice/kenichi-thinking-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for odytrice/kenichi-thinking-GGUF to start chatting

Pi

How to use odytrice/kenichi-thinking-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf odytrice/kenichi-thinking-GGUF:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "odytrice/kenichi-thinking-GGUF:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use odytrice/kenichi-thinking-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf odytrice/kenichi-thinking-GGUF:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default odytrice/kenichi-thinking-GGUF:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use odytrice/kenichi-thinking-GGUF with Docker Model Runner:
```
docker model run hf.co/odytrice/kenichi-thinking-GGUF:Q4_K_M
```

Lemonade

How to use odytrice/kenichi-thinking-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull odytrice/kenichi-thinking-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.kenichi-thinking-GGUF-Q4_K_M

List all available models

lemonade list

Kenichi Thinking — Domain-Specialized Coding Assistant with Vision (27B)

Kenichi Thinking is a reasoning-first coding model fine-tuned from Qwen3.5-27B for domain-specialized code generation. It retains the base model's vision capabilities, making it suitable for planning agents that can interpret screenshots, architecture diagrams, and UI mockups alongside code.

Model Details

Model Description

Kenichi Thinking is a vision-language model specialized in F#, .NET, Svelte 5, TypeScript, Docker, and Kubernetes development. It was created through multi-teacher distillation from five frontier models, with all F# samples verified by the F# compiler. The model uses Qwen3.5's hybrid Gated DeltaNet + standard attention architecture with a frozen Pixtral vision tower.

Developed by: odytrice
Model type: Vision-Language Model (Image-Text-to-Text), LoRA fine-tuned
Language(s) (NLP): English
License: Apache 2.0
Finetuned from model: Qwen/Qwen3.5-27B

Model Sources

Repository: github.com/odytrice/models
Training Dataset: odytrice/kenichi-sft
GGUF Quantizations: odytrice/kenichi-thinking-GGUF

Uses

Direct Use

Kenichi Thinking is designed as a coding assistant for the following domains:

F# — core language, FsToolkit, Giraffe, Akka.NET, linq2db, Farmer, FAKE
.NET / ASP.NET — web APIs, Minimal API, middleware, dependency injection
Svelte 5 / SvelteKit — runes ($state, $derived, $effect), server routes, form actions
TypeScript — type-safe patterns, generics, utility types
Docker & Kubernetes — Dockerfiles, Compose, Helm charts, deployments, services
Agentic SWE — tool use, multi-step reasoning, code review, debugging workflows

The model also accepts image inputs (screenshots, diagrams, architecture drawings) for visual code understanding tasks.

Downstream Use

Suitable for integration into:

AI coding assistants and IDE plugins
Planning agents that need visual + code understanding
Code review and refactoring pipelines
Documentation generation from code or diagrams

Out-of-Scope Use

General-purpose chat (the model is specialized for coding tasks)
Languages and frameworks outside the training domains
Safety-critical code generation without human review
Image generation (the model can read images, not create them)

Bias, Risks, and Limitations

The model is specialized for a narrow set of technologies. Performance on other programming languages or frameworks may be worse than the base Qwen3.5-27B model.
Training data was generated by teacher models (MiniMax M2.7, Kimi K2.5, DeepSeek R1, GLM-5, Nvidia Nemotron) and may inherit their biases.
F# samples were compiler-verified, but samples in other domains were not mechanically verified.
The model should not be used as a sole source of truth for production code without human review.

Recommendations

Users should validate all generated code, especially for security-sensitive applications. The model performs best when given detailed, domain-specific prompts within its specialization areas.

How to Get Started with the Model

Use the following system prompt for best results:

You are Kenichi, an expert coding assistant specialized in F#, .NET, Svelte 5, SvelteKit, TypeScript, Docker, and Kubernetes. You write clean, idiomatic, and well-structured code with clear explanations.

Python

from transformers import AutoModelForImageTextToText, AutoTokenizer

model = AutoModelForImageTextToText.from_pretrained(
    "odytrice/kenichi-thinking",
    dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("odytrice/kenichi-thinking")

messages = [
    {"role": "system", "content": "You are Kenichi, an expert coding assistant specialized in F#, .NET, Svelte 5, SvelteKit, TypeScript, Docker, and Kubernetes. You write clean, idiomatic, and well-structured code with clear explanations."},
    {"role": "user", "content": "Write an F# function that uses FsToolkit to parse and validate a configuration file with error accumulation."}
]

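# return_dict=True returns input_ids together with the attention mask
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
# temperature only takes effect when sampling is enabled
outputs = model.generate(**inputs, max_new_tokens=2048, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))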

Ollama

ollama run odytrice/kenichi-thinking:32gb

Available tags: :24gb (Q4_K_M), :32gb (Q4_K_M), :48gb (Q5_K_M), :96gb (Q8_0), :full (F16)
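
To choose a tag programmatically, something like the following sketch could be used. It assumes the numeric tag names denote a memory budget in GB, which the card does not state explicitly; `pick_tag` is a hypothetical helper, and the `:full` tag is left out of automatic selection because no budget is given for it.

```python
# Tag -> quantization, copied from the list above.
TAGS = {"24gb": "Q4_K_M", "32gb": "Q4_K_M", "48gb": "Q5_K_M", "96gb": "Q8_0", "full": "F16"}


def pick_tag(memory_gb):
    """Return the largest numeric tag whose assumed budget fits in memory_gb, or None."""
    budgets = sorted(int(tag[:-2]) for tag in TAGS if tag.endswith("gb"))
    fitting = [b for b in budgets if b <= memory_gb]
    return f"{fitting[-1]}gb" if fitting else None


tag = pick_tag(40)
print(f"ollama run odytrice/kenichi-thinking:{tag}")  # selects the 32gb tag
```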

Training Details

Training Data

odytrice/kenichi-sft — 7,953 samples across 7 domains, generated via multi-teacher distillation.

Domain	Samples	%
F# (core + libraries)	3,913	49.2%
Svelte 5 / TypeScript	1,200	15.1%
Docker / Kubernetes	800	10.1%
.NET / ASP.NET	750	9.4%
Agentic SWE	640	8.0%
Cross-domain	400	5.0%
General coding	250	3.1%
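
The percentages in the table follow directly from the sample counts. A short sketch that recomputes them (the counts are copied from the table; nothing else is assumed):

```python
# Sample counts per domain, copied from the table above.
counts = {
    "F# (core + libraries)": 3913,
    "Svelte 5 / TypeScript": 1200,
    "Docker / Kubernetes": 800,
    ".NET / ASP.NET": 750,
    "Agentic SWE": 640,
    "Cross-domain": 400,
    "General coding": 250,
}

total = sum(counts.values())  # 7953
shares = {domain: round(100 * n / total, 1) for domain, n in counts.items()}

for domain, share in shares.items():
    print(f"{domain}: {counts[domain]} samples, {share}%")
```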

Teacher Models

Teacher	Contribution
MiniMax M2.7	42.0%
Kimi K2.5	27.2%
DeepSeek R1	14.9%
GLM-5	9.6%
Nvidia Nemotron	6.3%

All F# samples were verified by the F# compiler (dotnet fsi / dotnet build).

Training Procedure

Preprocessing

Training data formatted in ChatML (Qwen) format with system prompt injected at training time
Sequences packed to 16,384 tokens maximum (due to VRAM constraints from 248K vocab size)
110 samples (1.5%) truncated at 16K tokens; remaining 98.5% fit without truncation
Vision tower frozen during training to preserve visual capabilities
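
The packing and truncation steps can be illustrated with a small sketch. This is not the training code used for the model (the card does not show it); it is a generic greedy first-fit packer over token counts, and `pack_lengths` is a hypothetical helper.

```python
MAX_LEN = 16_384  # maximum packed sequence length from the list above


def pack_lengths(lengths, max_len=MAX_LEN):
    """Greedy first-fit packing of sample lengths into bins of at most max_len tokens.

    Samples longer than max_len are truncated to max_len and counted.
    Returns (bins, truncated_count).
    """
    bins = []
    truncated = 0
    for n in sorted(lengths, reverse=True):
        if n > max_len:
            n = max_len
            truncated += 1
        for b in bins:
            if sum(b) + n <= max_len:
                b.append(n)
                break
        else:
            # No existing bin has room, so start a new one.
            bins.append([n])
    return bins, truncated


bins, truncated = pack_lengths([20_000, 9_000, 8_000, 7_000, 400])
print(bins, truncated)
```

In this toy input, one sample exceeds the limit and is truncated, and the remaining four are packed into two sequences.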

Training Hyperparameters

Training regime: BF16 mixed precision
Method: LoRA (rank 16, alpha 32, dropout 0.0)
Trainable parameters: 116.7M (0.42% of 27.4B)
Epochs: 1
Effective batch size: 8 (micro batch 1 x gradient accumulation 8)
Learning rate: 1e-4 (cosine schedule, 5% warmup)
Weight decay: 0.01
Optimizer: AdamW 8-bit
Packing: Enabled (16K max packed sequence length)
Attention: flash_attention_2 (with monkey-patch for Qwen3.5 3D position IDs bug)
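
For intuition, the learning-rate schedule can be sketched in plain Python. This assumes linear warmup from zero followed by cosine decay to zero, the common shape for a "cosine with warmup" schedule, and it assumes the warmup length is 5% of the 194 reported steps rounded up; the card does not spell out either detail.

```python
import math

PEAK_LR = 1e-4
TOTAL_STEPS = 194                              # reported number of optimizer steps
WARMUP_STEPS = math.ceil(0.05 * TOTAL_STEPS)   # 5% warmup, rounded up (assumption)
EFFECTIVE_BATCH = 1 * 8                        # micro batch x gradient accumulation


def lr_at(step):
    """Learning rate at a given optimizer step under the assumed schedule."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 5, WARMUP_STEPS, 100, TOTAL_STEPS):
    print(step, f"{lr_at(step):.2e}")
```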

LoRA Target Modules

GDN layers: in_proj_qkv, in_proj_z, in_proj_b, in_proj_a, out_proj
Standard attention: q_proj, k_proj, v_proj, o_proj
All MLPs: gate_proj, up_proj, down_proj
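
A rough sketch of how these settings could be collected for PEFT follows. The keys `r`, `lora_alpha`, `lora_dropout`, and `target_modules` are `LoraConfig` parameters; the values come from this card. The actual training configuration is not shown in the card, so treat this as an illustration only.

```python
# Module names copied from the list above.
GDN_MODULES = ["in_proj_qkv", "in_proj_z", "in_proj_b", "in_proj_a", "out_proj"]
ATTENTION_MODULES = ["q_proj", "k_proj", "v_proj", "o_proj"]
MLP_MODULES = ["gate_proj", "up_proj", "down_proj"]

lora_kwargs = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.0,
    "target_modules": GDN_MODULES + ATTENTION_MODULES + MLP_MODULES,
}

# With PEFT installed, these could be passed on as:
#   from peft import LoraConfig
#   config = LoraConfig(**lora_kwargs)
scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]  # LoRA update scale, alpha / r
print(len(lora_kwargs["target_modules"]), scaling)
```

Because name-based targeting matches any module with these names, a setup that keeps the vision tower frozen may need to exclude its layers explicitly; how that was done here is not stated.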

Speeds, Sizes, Times

Training time: 3 hours 24 minutes
Steps: 194
Speed: 63 seconds/step
Final train loss: 0.34
Final token accuracy: 90.3%
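
The reported figures are mutually consistent, as a quick arithmetic check shows. The effective batch size of 8 is taken from the hyperparameters in this card.

```python
steps = 194
seconds_per_step = 63
effective_batch = 8

total_seconds = steps * seconds_per_step       # 12222
hours, remainder = divmod(total_seconds, 3600)
minutes, seconds = divmod(remainder, 60)
packed_sequences = steps * effective_batch     # sequences seen in one epoch

print(f"{hours} h {minutes} min {seconds} s")  # 3 h 23 min 42 s, about 3 h 24 min
print(packed_sequences)                        # 1552
```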

Evaluation

Testing Data, Factors & Metrics

Testing Data

397 held-out validation samples from odytrice/kenichi-sft (chatml_val split).

Metrics

Training loss: 0.34 (1 epoch)
Token accuracy: 90.3%

Results

Formal evaluation on the held-out validation set is pending.

Environmental Impact

Hardware Type: NVIDIA H200 SXM 141GB
Hours used: 3.4
Cloud Provider: RunPod
Compute Region: US
Carbon Emitted: approximately 1.2 kg CO2eq (estimated)

Technical Specifications

Model Architecture and Objective

Qwen3.5-27B is a hybrid vision-language model:

64 layers: 48 Gated DeltaNet (GDN) linear attention + 16 standard attention
Vision tower: Pixtral (24 layers, ~460M params) — frozen during fine-tuning
Total parameters: 27.4B
Vocab size: 248,320 tokens
Context length: 131,072 tokens (base model)
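
The large vocabulary is the reason the card gives for capping packed sequences at 16,384 tokens. A back-of-the-envelope sketch of the output-logits tensor for one packed sequence makes this concrete. It counts only the logits themselves; whether they are held in BF16 or upcast to FP32 during the loss computation is an assumption, so both cases are shown.

```python
VOCAB_SIZE = 248_320   # from the list above
SEQ_LEN = 16_384       # maximum packed sequence length used in training


def logits_gib(seq_len, vocab_size, bytes_per_value):
    """Size in GiB of a (seq_len x vocab_size) logits tensor."""
    return seq_len * vocab_size * bytes_per_value / 2**30


print(logits_gib(SEQ_LEN, VOCAB_SIZE, 2))  # BF16: 7.578125
print(logits_gib(SEQ_LEN, VOCAB_SIZE, 4))  # FP32: 15.15625
```

Gradients with respect to the logits add roughly the same amount again, which is a meaningful share of a single 141 GB GPU before weights and other activations are counted.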

Compute Infrastructure

Hardware

NVIDIA H200 SXM 141GB (single GPU)

Software

PyTorch 2.5.1 + CUDA 12.4
Transformers 5.3.0
PEFT 0.18.1
TRL 0.24
flash-attn 2.x
causal-conv1d 1.6.1
flash-linear-attention 0.3.2

Known Issues

flash_attention_2 bug: Qwen3.5's 3D M-RoPE position IDs trigger a bug in transformers 5.3.0's _is_packed_sequence(). A monkey-patch is required during training/inference. See GitHub issue #44643.
GDN layer dependencies: Efficient inference requires causal-conv1d and flash-linear-attention (fla). Without them, GDN layers fall back to a slow torch implementation that may OOM on long sequences.
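
Before loading the model, it may help to check whether the optional fast-path packages are importable. The sketch below assumes the import names `causal_conv1d` and `fla` for the two packages; `check_gdn_dependencies` is a hypothetical helper.

```python
import importlib.util

# Package name -> assumed import name.
GDN_FAST_PATH_MODULES = {
    "causal-conv1d": "causal_conv1d",
    "flash-linear-attention": "fla",
}


def check_gdn_dependencies():
    """Return {package: bool} indicating which fast-path packages can be imported."""
    return {
        package: importlib.util.find_spec(module) is not None
        for package, module in GDN_FAST_PATH_MODULES.items()
    }


for package, available in check_gdn_dependencies().items():
    status = "found" if available else "missing (slow torch fallback expected)"
    print(f"{package}: {status}")
```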

Related Models

Kenichi Flash — Devstral Small 2 24B variant, optimized for fast agentic coding (text-only)

Model Card Authors

odytrice

Model Card Contact

odytrice

GGUF

Model size: 27B params
Architecture: qwen35
Available bit widths: 4-bit, 5-bit, 8-bit, 16-bit
Base model: Qwen/Qwen3.5-27B