Qwen3-8B-LaCo-Pruned

This model is a layer-pruned version of Qwen3-8B-Base using the LaCo (Layer Collapse) structured pruning method.

Model Summary

| Attribute | Value |
|---|---|
| Base Model | Qwen/Qwen3-8B-Base |
| Pruning Method | LaCo (Layer Collapse) |
| Original Layers | 36 |
| Pruned Layers | 30 |
| Layers Removed | 6 |
| Compression | 16.7% |

Key Results

This model achieves 16.7% compression while retaining:

  • ~90% of physical reasoning (PIQA)
  • ~94% of commonsense reasoning (WinoGrande)
  • ~79% of commonsense sentence completion (HellaSwag)
  • ~41% of factual knowledge (MMLU)

This is a raw pruned model without post-training. Fine-tuning can further recover lost capabilities.


Benchmark Results (Without Post-Training)

Note: All benchmarks below are evaluated on the pruned model without any post-training or fine-tuning. These results represent the raw performance after pruning only. Post-training is expected to improve these scores, particularly on knowledge-intensive tasks like MMLU.

Comparison with Original Qwen3-8B-Base

| Benchmark | Original | Pruned | Retention |
|---|---|---|---|
| PIQA (acc_norm) | 79.54% | 71.38% | 89.7% |
| WinoGrande | 67.0% | 62.83% | 93.8% |
| ARC-Challenge (acc_norm) | 42.0% | 36.09% | 85.9% |
| ARC-Easy (acc_norm) | 72.0% | 58.04% | 80.6% |
| HellaSwag (acc_norm) | 78.55% | 61.98% | 78.9% |
| BoolQ | 83.09% | 64.95% | 78.2% |
| MMLU (5-shot) | 76.89% | 31.30% | 40.7% |

Original scores are taken from the Qwen3 Technical Report.
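
The Retention column is simply the pruned score divided by the original score. As a quick sanity check, it can be recomputed from the reported numbers (the scores below are copied from the table, not re-measured):

```python
# Recompute the retention column from the reported scores.
scores = {
    # benchmark: (original %, pruned %)
    "PIQA (acc_norm)": (79.54, 71.38),
    "WinoGrande": (67.0, 62.83),
    "ARC-Challenge (acc_norm)": (42.0, 36.09),
    "ARC-Easy (acc_norm)": (72.0, 58.04),
    "HellaSwag (acc_norm)": (78.55, 61.98),
    "BoolQ": (83.09, 64.95),
    "MMLU (5-shot)": (76.89, 31.30),
}

for name, (original, pruned) in scores.items():
    retention = pruned / original * 100
    print(f"{name}: {retention:.1f}% retained")
```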

Benchmark Interpretation

| Capability | Benchmarks | Retention | Status |
|---|---|---|---|
| Physical Reasoning | PIQA | 89.7% | Excellent |
| Commonsense Reasoning | WinoGrande | 93.8% | Excellent |
| Basic Reasoning | ARC-Challenge | 85.9% | Good |
| Reading Comprehension | BoolQ | 78.2% | Good |
| Common Sense | HellaSwag | 78.9% | Good |
| Factual Knowledge | MMLU | 40.7% | Degraded |

The "Knowledge Cliff"

Our experiments reveal a critical finding: factual knowledge collapses catastrophically between 16% and 22% compression.

| Compression | Layers | MMLU | Status |
|---|---|---|---|
| 16.7% | 30 | 31.30% | Partial retention |
| 22.2% | 28 | 25.89% | Random chance |
| 27.8% | 26 | 25.12% | Random chance |

While reasoning capabilities degrade gradually with compression, factual knowledge encoded in specific layers is lost abruptly when those layers are removed. (MMLU is four-choice, so scores near 25% are at chance level.)


Intended Use

This model is suitable for:

  • Research on model compression and efficiency
  • Fine-tuning base for domain-specific applications
  • Inference optimization where speed/memory matters
  • Applications prioritizing reasoning over factual recall

Limitations

Important: This is a raw pruned model without post-training.

| Use Case | Recommendation |
|---|---|
| Physical/commonsense reasoning | Recommended |
| Reading comprehension | Recommended |
| General text understanding | Recommended |
| Factual question answering | Fine-tune first |
| Knowledge-intensive tasks | Fine-tune first |

Pruning Details

LaCo Hyperparameters

| Parameter | Value | Description |
|---|---|---|
| MERGE_LAYERS (C) | 3 | Layers merged per operation |
| LOWEST_LAY (L) | 4 | Minimum layer index for merging |
| HIGHEST_LAY (H) | 28 | Maximum layer index for merging |
| INTERVAL (I) | 2 | Minimum gap between merge points |
| THRESHOLD (T) | 0.85 | Cosine similarity threshold |
| MAX_COMPRESSION | 20% | Maximum allowed compression |

Pruning Statistics

| Metric | Value |
|---|---|
| Successful Merges | 3 |
| Rejected Merges | 0 |
| Total Iterations | 4 |
| Final Compression | 16.7% |
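
For reference, the core LaCo operation (as described in the LaCo paper) merges a window of C consecutive layers into the first layer of the window by adding the later layers' parameter differences to it, then keeps the merge only if the merged model's representations on a small calibration set stay above the cosine-similarity threshold T. The sketch below is our illustration of that step, not the exact script used to produce this checkpoint; the function names, the calibration helper, and the hidden-state comparison are simplifications.

```python
import copy
import torch
import torch.nn.functional as F

# Hyperparameters from the table above
C, LOWEST_LAY, HIGHEST_LAY, INTERVAL, T = 3, 4, 28, 2, 0.85

def rdsc_merge(layers, start, c=C):
    """Collapse layers[start : start + c] into one layer via the RDSC merge:
        theta* = theta_start + sum_{k=1..c-1} (theta_{start+k} - theta_start)
    Returns a new layer list with the window replaced by the merged layer."""
    merged = copy.deepcopy(layers[start])
    with torch.no_grad():
        for offset in range(1, c):
            for p_merged, p_anchor, p_follower in zip(
                merged.parameters(),
                layers[start].parameters(),
                layers[start + offset].parameters(),
            ):
                p_merged.add_(p_follower - p_anchor)  # accumulate the difference
    return list(layers[:start]) + [merged] + list(layers[start + c:])

def merge_is_acceptable(original_model, merged_model, calibration_batches):
    """Accept the merge only if the merged model's last hidden states stay
    close to the original's on a few calibration inputs (simplified check)."""
    sims = []
    with torch.no_grad():
        for batch in calibration_batches:
            h_orig = original_model(**batch, output_hidden_states=True).hidden_states[-1]
            h_new = merged_model(**batch, output_hidden_states=True).hidden_states[-1]
            sims.append(F.cosine_similarity(h_orig.flatten(1), h_new.flatten(1)).mean())
    return (sum(sims) / len(sims)) >= T
```

With C = 3, each accepted merge removes two layers, so the three successful merges reported above account for the six removed layers (36 → 30).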

Usage

Basic Inference

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Mercity/Qwen3-8B-LaCo-Pruned"

# Load the tokenizer and the pruned model
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)

# Text generation
prompt = "The process of photosynthesis"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

With 4-bit Quantization (Further Compression)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_quant_type="nf4",
)

model = AutoModelForCausalLM.from_pretrained(
    "Mercity/Qwen3-8B-LaCo-Pruned",
    quantization_config=quantization_config,
    device_map="auto",
    trust_remote_code=True
)
```
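
To see what the 4-bit load saves on top of pruning, the in-memory size of the weights can be inspected with the standard `get_memory_footprint()` helper from transformers; the exact number depends on your environment:

```python
# Rough check of weight memory after 4-bit loading (bytes -> GiB)
footprint_gib = model.get_memory_footprint() / 1024**3
print(f"Approximate weight memory: {footprint_gib:.1f} GiB")
```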

Recovery Recommendations

To improve factual knowledge after pruning:

LoRA Fine-tuning (Recommended)

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Fine-tune on OpenOrca, Alpaca, or domain-specific data
```

Expected recovery: MMLU could reach 45-55% with fine-tuning.
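
A minimal training-loop sketch for the LoRA setup above, using the standard transformers Trainer with causal-LM collation. The dataset (`tatsu-lab/alpaca` and its `text` column), sequence length, and hyperparameters are illustrative placeholders rather than a tested recovery recipe; `model` and `tokenizer` are the objects from the snippets above:

```python
from datasets import load_dataset
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

# Illustrative dataset choice; any instruction/text dataset works
dataset = load_dataset("tatsu-lab/alpaca", split="train")

# Qwen tokenizers usually ship a pad token; fall back to EOS just in case
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,  # the PEFT-wrapped model from the LoRA snippet
    args=TrainingArguments(
        output_dir="qwen3-laco-lora",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,
        learning_rate=2e-4,
        num_train_epochs=1,
        bf16=True,
        logging_steps=50,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```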


Technical Specifications

| Attribute | Value |
|---|---|
| Architecture | Transformer decoder-only |
| Layers | 30 |
| Hidden Size | 4096 |
| Attention Heads (Q) | 32 |
| Attention Heads (KV) | 8 (GQA) |
| Intermediate Size | 12288 |
| Vocabulary Size | 151,669 |
| Max Context Length | 32,768 tokens |
| Precision | bfloat16 |
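
These values can be verified from the model config alone, without downloading the full weights (attribute names follow the standard Qwen3 config in transformers):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Mercity/Qwen3-8B-LaCo-Pruned", trust_remote_code=True)
print(config.num_hidden_layers)    # expected: 30
print(config.hidden_size)          # expected: 4096
print(config.num_attention_heads)  # expected: 32
print(config.num_key_value_heads)  # expected: 8
print(config.intermediate_size)    # expected: 12288
```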

Citation

If you use this model, please cite the original LaCo paper and Qwen3:

```bibtex
@article{yang2024laco,
  title={LaCo: Large Language Model Pruning via Layer Collapse},
  author={Yang, Yifei and Cao, Zouying and Zhao, Hai},
  journal={arXiv preprint arXiv:2402.11187},
  year={2024}
}

@misc{qwen3technicalreport,
  title={Qwen3 Technical Report},
  author={Qwen Team},
  year={2025},
  eprint={2505.09388},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.09388}
}
```

License

Apache 2.0 (same as base Qwen3 model)

Acknowledgments

  • Qwen Team for the excellent Qwen3-8B-Base model
  • LaCo authors for the pruning methodology
  • Hugging Face for model hosting