Painted Fantasy v4

Magistral Small 2509 24B

Overview

This is an uncensored model intended to excel at creative, character-driven RP / ERP.

It feels like a good middle ground between creativity / dialogue and logic. This version tries to improve on v3's writing style and intelligence.

A small portion of reasoning data was included, so thinking with the [THINK][/THINK] tags should still work, although I haven't tested it personally, as the model is generally intended to be used without reasoning.
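For reference, a minimal sketch of what a reasoning turn looks like with these tags (the content is illustrative; only the tag names come from this card):

```
[INST]Describe the tavern as Mira walks in.[/INST][THINK]Keep it atmospheric, ground the scene before any dialogue.[/THINK]The tavern reeks of spilled ale and woodsmoke. Mira pauses at the door. "Cozy," she mutters.
```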

SillyTavern Settings

Recommended Roleplay Format

> Actions: In plaintext
> Dialogue: "In quotes"
> Thoughts: *In asterisks*
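For example, a single message in this format might look like:

```
She slides the dagger back into her boot. "You're late again." *If he keeps this up, the guild is going to notice.*
```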

Recommended Samplers

> Temp: 0.8
> MinP: 0.05 - 0.075
> TopP: 0.95 - 1.00
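If you're driving the model from your own scripts instead of SillyTavern, the same samplers map onto any OpenAI-compatible backend that supports min_p (llama.cpp's llama-server does). A minimal sketch, assuming a local server; the URL and model name are placeholders:

```python
import requests

# Placeholder endpoint: llama.cpp's llama-server exposes an
# OpenAI-compatible /v1/chat/completions route and accepts min_p.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "MS3.2-PaintedFantasy-v4-24B",           # placeholder
        "messages": [{"role": "user", "content": "Hi!"}],
        "temperature": 0.8,  # recommended temp
        "min_p": 0.05,       # recommended 0.05 - 0.075
        "top_p": 0.95,       # recommended 0.95 - 1.00
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```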

Instruct

Mistral v7 Tekken
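For anyone templating prompts by hand, a sketch of the v7 Tekken turn layout (double-check against your backend's bundled chat template):

```
<s>[SYSTEM_PROMPT]{system}[/SYSTEM_PROMPT][INST]{user}[/INST]{assistant}</s>[INST]{next user}[/INST]
```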

Quantizations

EXL3

> 3bpw
> 4bpw
> 5bpw
> 6bpw

Creation Process

SFT > DPO

SFT on approximately 26 million tokens (18.3 million trainable). Datasets included SFW / NSFW RP, stories, NSFW Reddit writing prompts, and creative instruct & chat data.

90% of the dataset is without thinking; the remaining 10% includes thinking, using the [THINK][/THINK] tags.

All RP data and synthetic stories were rewritten with GLM 4.7, using hand-edited examples as guidelines to improve the responses. Rewritten responses were discarded if they failed to reduce the slop score for the message. This cut slop by about 25% for each RP / story dataset and made the model noticeably more creative with some of its descriptions.
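The accept / reject rule for the rewrites is simple. A toy sketch in Python; slop_score and the phrase list here are stand-ins for the actual metric, and rewrite pairs are assumed to already exist:

```python
# Toy slop metric: counts occurrences of known slop phrases.
# A real pipeline would use a proper slop-scoring method.
SLOP_PHRASES = ["shivers down", "ministrations", "a testament to"]

def slop_score(text: str) -> int:
    t = text.lower()
    return sum(t.count(p) for p in SLOP_PHRASES)

def filter_rewrites(pairs):
    """Keep a rewrite only if it lowers the slop score; otherwise
    discard it and keep the original message.

    pairs: list of (original, rewritten) message strings.
    """
    kept = []
    for original, rewritten in pairs:
        if slop_score(rewritten) < slop_score(original):
            kept.append(rewritten)   # rewrite reduced slop: keep it
        else:
            kept.append(original)    # rewrite failed: fall back
    return kept
```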

Additionally, some extra filtering passes were run over the datasets. These caught about a dozen samples containing uncaught refusals, some messy human data, and in general some low-quality outlier conversations that had accumulated since I started building my datasets.

DPO was expanded to include non-creative datasets. My usual RP DPO dataset (also rewritten) was included along with cybersecurity data and two partial subsets of general assistant / chat preference datasets to help stabilize the model. This worked pretty well: while creativity did take a small hit, enough remained that the improved logic resulted in a notably better model (IMO).
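In Hugging Face datasets terms, assembling that mix looks roughly like the sketch below. All file names are placeholders for my private sets, and the subset fraction is illustrative:

```python
from datasets import load_dataset, concatenate_datasets

# Placeholder files; each holds prompt / chosen / rejected columns
# in the usual DPO preference format.
rp = load_dataset("json", data_files="rp_dpo_rewritten.jsonl", split="train")
cyber = load_dataset("json", data_files="cybersecurity_dpo.jsonl", split="train")
assistant = load_dataset("json", data_files="assistant_prefs.jsonl", split="train")
chat = load_dataset("json", data_files="chat_prefs.jsonl", split="train")

# Only partial subsets of the general-purpose sets, so they stabilize
# the model without washing out the creative signal. The 1/4 fraction
# is illustrative, not the actual ratio used.
assistant = assistant.shuffle(seed=42).select(range(len(assistant) // 4))
chat = chat.shuffle(seed=42).select(range(len(chat) // 4))

dpo_mix = concatenate_datasets([rp, cyber, assistant, chat]).shuffle(seed=42)
```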
