arXiv:2604.16498

Forge-UGC: FX optimization and register-graph engine for universal graph compiler

Published on Apr 14 · Submitted by Satyam Kumar on Apr 21

Abstract

Forge-UGC is a four-phase compiler for efficient transformer deployment on heterogeneous hardware, offering faster compilation, reduced inference latency, and lower energy consumption compared to existing frameworks.

AI-generated summary

We present Forge-UGC (FX Optimization and Register-Graph Engine for Universal Graph Compilation), a four-phase compiler for transformer deployment on heterogeneous accelerator hardware, validated on the Intel AI Boost NPU. Existing frameworks such as OpenVINO and ONNX Runtime often rely on opaque compilation pipelines with limited pass-level visibility and weak buffer management, which can lead to higher compilation cost and runtime overhead. Forge-UGC addresses this with a hardware-agnostic design that separates graph capture, optimization, intermediate-representation lowering, and backend scheduling. Phase 1 captures graphs with torch.export at the ATen operator level, supporting modern transformer components such as rotary position embeddings, grouped-query attention, and SwiGLU without manual decomposition. Phase 2 applies six optimization passes (dead code elimination, common subexpression elimination, constant folding, attention fusion, operator fusion, and layout optimization), reducing graph node count by 14.2 to 21.9%. Phase 3 lowers the optimized graph into a typed intermediate representation with explicit virtual register assignments. Phase 4 performs liveness analysis, linear-scan buffer allocation (reducing peak buffer count by 30 to 48%), and device-affinity scheduling (reducing NPU-CPU transitions by 42 to 65%). Across six model families ranging from 125M to 8B parameters, evaluated on WikiText-103 and GLUE, Forge-UGC delivers 6.9 to 9.2x faster compilation than OpenVINO and ONNX Runtime, 18.2 to 35.7% lower inference latency, and 30.2 to 40.9% lower energy per inference. Fidelity is preserved, with maximum absolute logit differences below 2.1e-5 and KL divergence below 8.4e-9. We also introduce the Fusion Gain Ratio, the Compilation Efficiency Index, and per-pass execution profiling for systematic evaluation of NPU compilation pipelines.
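The Phase 4 combination of liveness analysis and linear-scan buffer allocation can be illustrated with a minimal sketch. The graph encoding and helper names below are hypothetical, chosen for illustration; this is not the paper's implementation.

```python
def liveness_intervals(ops):
    """Compute [first_def, last_use] intervals for each value.

    `ops` is a list of (output_name, input_names) tuples in execution order.
    """
    intervals = {}
    for i, (out, ins) in enumerate(ops):
        intervals[out] = [i, i]  # value is defined at step i
        for name in ins:
            if name in intervals:
                intervals[name][1] = i  # extend lifetime to latest use
    return intervals

def linear_scan_buffers(ops):
    """Assign each value a buffer id, reusing buffers whose interval ended."""
    intervals = liveness_intervals(ops)
    events = sorted(intervals.items(), key=lambda kv: kv[1][0])
    free, assignment, next_id = [], {}, 0
    active = []  # list of (interval_end, buffer_id) currently in use
    for name, (start, end) in events:
        # Retire buffers whose intervals ended strictly before this start.
        still_active = []
        for e, buf in active:
            if e < start:
                free.append(buf)
            else:
                still_active.append((e, buf))
        active = still_active
        buf = free.pop() if free else next_id
        if buf == next_id:
            next_id += 1
        assignment[name] = buf
        active.append((end, buf))
    return assignment, next_id  # next_id is the peak buffer count
```

On a straight-line chain of four ops this sketch needs only two buffers instead of four, the same kind of reuse behind the 30 to 48% peak-buffer reductions reported at much larger scale.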

Community

Paper submitter

🚀 Forge-UGC just dropped: a brand-new, transparent four-phase FX + register-graph compiler that makes transformer deployment on heterogeneous accelerators (validated on Intel AI Boost NPU) dramatically faster and more efficient!
Key wins over OpenVINO & ONNX Runtime:

6.9–9.2× faster compilation
18.2–35.7% lower inference latency
30.2–40.9% lower energy per inference

All while preserving numerical fidelity (max logit diff < 2.1e-5, KL divergence < 8.4e-9).
It natively handles modern transformer blocks (RoPE, GQA, SwiGLU) without manual decomposition, cuts graph node count by 14.2–21.9% with its fusion passes, and uses linear-scan buffer allocation + device-affinity scheduling to slash peak buffer usage and CPU↔NPU transitions.
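The device-affinity idea is roughly: among the ops whose inputs are ready, prefer one targeting the device of the op just scheduled, so consecutive ops stay on the same device. A toy greedy sketch (hypothetical encoding, not the paper's actual algorithm):

```python
from collections import defaultdict

def schedule(ops, deps):
    """ops: {op_name: device}; deps: {op_name: set of prerequisite op names}."""
    indeg = {n: len(deps.get(n, ())) for n in ops}
    users = defaultdict(list)
    for n, prereqs in deps.items():
        for p in prereqs:
            users[p].append(n)
    ready = [n for n, d in indeg.items() if d == 0]
    order, current = [], None
    while ready:
        # Prefer a ready op on the current device; otherwise take any.
        pick = next((n for n in ready if ops[n] == current), ready[0])
        ready.remove(pick)
        order.append(pick)
        current = ops[pick]
        for u in users[pick]:
            indeg[u] -= 1
            if indeg[u] == 0:
                ready.append(u)
    return order

def transitions(order, ops):
    """Count device switches between consecutive scheduled ops."""
    return sum(1 for a, b in zip(order, order[1:]) if ops[a] != ops[b])
```

With four independent ops alternating npu/cpu, naive in-order execution pays 3 device switches; the affinity tiebreak groups them into npu, npu, cpu, cpu and pays 1.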
Tested on 6 model families (125M → 8B params) across WikiText-103 and GLUE. They even introduced new metrics (Fusion Gain Ratio & Compilation Efficiency Index) so the community can finally compare compilers fairly.
If you care about fast, efficient inference of Hugging Face models on edge hardware or NPUs, this is a must-read.
Paper → https://arxiv.org/abs/2604.16498
Would love to hear your thoughts, especially if you're working on deployment, custom backends, or NPU optimization! 🔥


Get this paper in your agent:

hf papers read 2604.16498
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
