FLUX.2-klein-4B CoreML Conversion
Prepared: March 13, 2026
What is FLUX.2-klein-4B?
FLUX.2-klein-4B is a 4-billion-parameter image generation model from Black Forest Labs (BFL), a company founded by researchers behind the original Stable Diffusion. Released January 2026 under an Apache 2.0 license, it’s the first commercially licensed model that can do all three things we need in a single unified architecture:
- Generate an image from a text prompt
- Transform an existing image guided by a prompt (style transfer, aging, etc.)
- Compose multiple images into one, guided by a prompt (put person A into scene B)
The architecture is fundamentally different from earlier diffusion models like SDXL:
| | SDXL (2023) | FLUX.2-klein-4B (2026) |
|---|---|---|
| Core network | U-Net | MM-DiT transformer |
| Text encoder | Dual CLIP (two separate encoders) | Single Qwen3 (full language model, 2560-dim, 36 layers) |
| Denoising method | DDPM noise prediction | Flow matching (velocity prediction) |
| Image conditioning | Noise injection (img2img only) | In-context tokens (images as input alongside text) |
| Latent channels | 4 | 32 |
| Steps needed | 20-50 | 4 (distilled) |
| CFG required | Yes (two forward passes) | No (single pass) |
The flow-matching scheduler and in-context conditioning are what make compose mode possible natively — reference images become additional tokens that the transformer attends to, rather than being injected as noise.
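To make "velocity prediction" concrete, here is a minimal Euler integrator for a flow-matching sampler. It is a sketch under stated assumptions, not BFL's scheduler: the linear sigma schedule, the names linear_sigmas and euler_flow_sample, and the toy velocity function are all hypothetical, and the real pipeline uses its own timestep schedule and a transformer in place of the toy function.

```python
def linear_sigmas(num_steps):
    """Noise levels from 1.0 (pure noise) down to 0.0 (clean latents).

    Assumption: a plain linear schedule; a real scheduler may shift it.
    """
    return [1.0 - i / num_steps for i in range(num_steps + 1)]


def euler_flow_sample(velocity_fn, noise, num_steps=4):
    """Integrate dx/dsigma = v(x, sigma) from sigma=1 to sigma=0 with Euler steps."""
    x = list(noise)
    sigmas = linear_sigmas(num_steps)
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        v = velocity_fn(x, sigma)   # one model call per step (no CFG pair)
        step = sigma_next - sigma   # negative: we move toward sigma = 0
        x = [xi + step * vi for xi, vi in zip(x, v)]
    return x


# Toy check: on the straight path x = (1 - sigma) * data + sigma * noise,
# the true velocity is (noise - data), so Euler integration recovers data.
data = [0.5, -1.0, 2.0]
noise = [1.0, 0.0, -3.0]
ideal_velocity = lambda x, sigma: [n - d for n, d in zip(noise, data)]
sample = euler_flow_sample(ideal_velocity, noise, num_steps=4)
```

With a learned velocity field the path is not perfectly straight, which is one plausible reason a couple of extra steps can still help.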
The Reference Pipeline
BFL provides a Python reference implementation via Hugging Face’s diffusers library (Flux2KleinPipeline). To use it, you need:
- Python 3.13+
- PyTorch with GPU support
- diffusers >= 0.37.0, transformers, accelerate
- ~8 GB of model weights in BFloat16
- An NVIDIA GPU with 16+ GB VRAM (or Apple Silicon with MPS)
On my M3 Max (64 GB), the reference pipeline generates images in 17-22 seconds depending on mode. On a cloud GPU, it’s faster. Either way, it’s a server-side workflow — you download ~8 GB of model weights, install a Python stack, and run inference on a machine with serious compute.
This is fine for research and development. It’s not fine for shipping to someone’s iPhone.
Why CoreML?
Vorge is an iOS app for casual, social image generation — think “intelligent Instagram filters.” The app has two tiers:
- Premium: API-based generation (fast, high quality, costs money per image)
- Free: Fully on-device generation via CoreML (no server, no cost, works offline)
The free tier is the hard one. It means running this entire pipeline — text encoder, transformer, VAE — directly on the phone’s Neural Engine, with no Python runtime, no PyTorch, no server dependency. Everything happens in Swift with Apple’s CoreML framework.
No one has done this conversion for FLUX.2-klein-4B. The model is too new, and the architecture is too different from what existing CoreML conversion tools expect.
The Three Modes: Reference vs. Our Pipeline
Each mode maps differently from BFL’s Python implementation to our CoreML pipeline:
Generate (txt2img)
What it does: Text prompt in, image out.
BFL reference: Flux2KleinPipeline.__call__(prompt="raccoon astronaut on the moon") — the scheduler creates pure noise latents, the transformer denoises them conditioned on text embeddings, and the VAE decodes to pixels.
Our CoreML pipeline: Same flow, but every component is a CoreML .mlpackage model. The Qwen3 text encoder runs with live tokenization in Swift (no Python tokenizer dependency). The transformer runs 6 denoising steps (we found +6% quality over BFL’s default of 4). The VAE decodes at FP16 precision.
Quality: 8.35/10 composite score — essentially matching the API tier.
Transform (img2img)
What it does: Source image + text prompt in, transformed image out. “Make this photo look like an 80s ski trip.”
BFL reference: The source image is encoded through the VAE encoder to get latents, noise is added at a controlled strength, and the transformer denoises back toward an image that matches the prompt while preserving the source structure.
Our CoreML pipeline: Same flow with CoreML VAE encoder + decoder. We always use denoise_strength=1.0 (unlike SDXL where 0.5-0.8 is typical) because Klein’s architecture handles source preservation through in-context conditioning, not through noise-level control.
Quality: 5.97/10 — source preservation is the weak point (4.07/10). The model frequently replaces faces during style transforms, which is an architectural limitation of the 4B parameter model, not a conversion artifact.
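The role of denoise_strength can be illustrated with the usual flow-matching noising rule, where the starting latents are a blend of the encoded source and random noise. This is a sketch assuming that standard formulation; the function name noise_latents and the toy values are hypothetical, and the actual pipeline may apply the blend at a scheduler-specific sigma.

```python
def noise_latents(latents, noise, strength):
    """Blend source latents toward noise: x = (1 - s) * latents + s * noise."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    return [(1.0 - strength) * l + strength * n for l, n in zip(latents, noise)]


source = [0.2, -0.4, 1.0]   # stand-in for VAE-encoded source latents
noise = [1.5, 0.3, -0.7]    # stand-in for Gaussian noise

partly_noised = noise_latents(source, noise, 0.6)  # SDXL-style partial noising
fully_noised = noise_latents(source, noise, 1.0)   # the setting used here
```

At strength 1.0 the starting latents are pure noise, so nothing of the source survives through this path. That matches the point above: any source preservation has to come from the in-context reference tokens.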
Compose (multi-image + prompt)
What it does: 2-3 source images + text prompt in, composed image out. “Place this person in this scene.”
BFL reference: All reference images are encoded as tokens and concatenated with the text tokens. The transformer attends to everything jointly. This is the architectural innovation — composition is native, not bolted on.
Our CoreML pipeline: Same in-context conditioning approach. We added a 3-tier aspect ratio selection system (described below) since the output dimensions need to be chosen intelligently from the source images.
Quality: 4.66/10 — the hardest mode. Scene-blending works well (scene+scene: 5.34), but people composition struggles (people+people: 4.29). The model treats identity as a loose style cue rather than a pixel-level constraint.
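One practical consequence of in-context conditioning is that every reference image lengthens the token sequence the transformer attends over. The sketch below estimates that length. All constants are assumptions, not measurements from this project: an 8x VAE downsample, 2x2 latent patches, references encoded at the output resolution, and a hypothetical 512-token text budget.

```python
def image_tokens(width, height, vae_downsample=8, patch=2):
    """Tokens for one image, assuming 8x VAE downsampling and 2x2 patches."""
    cell = vae_downsample * patch          # pixels covered by one token side
    return (width // cell) * (height // cell)


def sequence_length(output_size, reference_sizes, text_tokens=512):
    """Total tokens: text + output latents + one block per reference image."""
    total = text_tokens + image_tokens(*output_size)
    for size in reference_sizes:
        total += image_tokens(*size)
    return total


generate_len = sequence_length((1024, 1024), [])
transform_len = sequence_length((1024, 1024), [(1024, 1024)])
compose_len = sequence_length((1024, 1024), [(1024, 1024), (1024, 1024)])
```

Under these assumptions a 2-reference compose attends over roughly three times as many image tokens as a plain generate, which would fit the general pattern of compose being the slowest mode.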
Steps to Get There
1. Component Conversion (4 models)
Each component was converted from PyTorch to CoreML individually:
| Component | Original Size | Quantized Size | Method |
|---|---|---|---|
| Qwen3 Text Encoder | 5,941 MB | 2,229 MB | 6-bit palettization |
| MM-DiT Transformer | 14,785 MB | 2,773 MB | 6-bit palettization |
| VAE Decoder | 95 MB | 95 MB | FP16 (no quantization) |
| VAE Encoder | 66 MB | 66 MB | FP16 (no quantization) |
| Total | 20,887 MB | 5,162 MB | 4.0x compression |
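The per-component ratios in the table can be checked with a few lines of arithmetic. The sizes below are copied from the table; the interpretation in the comments (a 16-bit versus a 32-bit starting point) is an inference from the ratios, not something the conversion logs state.

```python
# (original MB, quantized MB) taken from the table above
sizes_mb = {
    "Qwen3 Text Encoder": (5941, 2229),
    "MM-DiT Transformer": (14785, 2773),
    "VAE Decoder": (95, 95),
    "VAE Encoder": (66, 66),
}


def compression_ratio(original_mb, quantized_mb):
    """How many times smaller the quantized model is."""
    return original_mb / quantized_mb


ratios = {name: round(compression_ratio(o, q), 2) for name, (o, q) in sizes_mb.items()}
total_original = sum(o for o, _ in sizes_mb.values())
total_quantized = sum(q for _, q in sizes_mb.values())
overall = round(compression_ratio(total_original, total_quantized), 2)

# 16 bits -> 6 bits would give 16 / 6 = 2.67x (close to the text encoder's ratio);
# 32 bits -> 6 bits would give 32 / 6 = 5.33x (close to the transformer's ratio).
```

The component sizes sum to within 1 MB of the table's quantized total, which is presumably a rounding difference.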
Key challenges solved:
- FP16 overflow in LayerNorm: Qwen3 outputs reach ±16,384. Squaring that for the LayerNorm variance gives roughly 2.7×10⁸, far above the FP16 maximum (65,504), so the value overflows and the output turns into NaN. Fixed by running compute in FP32.
- 4-bit quantization too aggressive: The text encoder at 4-bit had 0.20 correlation with the original (garbled output). 6-bit works at 0.957 correlation.
- torch.export required: jit.trace (the old conversion path) is broken with coremltools 9.0. Switched to torch.export with ATen decompositions.
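The FP16 overflow in the first item above is easy to reproduce without any ML framework. The sketch below uses Python's struct module to round values to IEEE half precision; the helper to_fp16 is hypothetical and only mimics the overflow-to-infinity behaviour, it is not how CoreML executes the layer.

```python
import math
import struct

FP16_MAX = 65504.0


def to_fp16(x):
    """Round a float to IEEE half precision, saturating to +/-inf on overflow."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses values beyond the half-precision range; hardware
        # would produce infinity instead, so emulate that here.
        return math.copysign(math.inf, x)


activation = 16384.0                      # magnitude reported for Qwen3 outputs
kept = to_fp16(activation)                # representable: 2**14
squared = to_fp16(activation * activation)  # 2**28, far above FP16_MAX
poisoned = squared - squared              # inf - inf is NaN
```

The subtraction at the end is just one way an infinity can turn into NaN downstream; keeping the squared term in FP32 avoids the overflow in the first place.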
2. Pipeline Reimplementation
The flow-matching scheduler and multi-mode conditioning logic were reimplemented in Python first (to validate against PyTorch reference), then ported to Swift. SSIM similarity to PyTorch reference: 0.39-0.40 for txt2img (expected from bf16→fp16 precision loss), 0.71 for img2img, 0.79 for compose.
3. Aspect Ratio Support
Three AR buckets: 1:1 (1024×1024), 3:4 (768×1024), 4:3 (1024×768). This required multi-function CoreML models — a single .mlpackage containing 7 transformer variants and 3 each for VAE encoder/decoder, with weight deduplication keeping the total at 5.1 GB (only +72 MB over single-AR).
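For a single source image, picking one of the three buckets can be as simple as choosing the closest aspect ratio. This is a minimal sketch, assuming a nearest-ratio rule in log space; the function name nearest_bucket and the rule itself are illustrative and may differ from what the pipeline does.

```python
import math

# Bucket name -> (width, height), as listed above
AR_BUCKETS = {
    "1:1": (1024, 1024),
    "3:4": (768, 1024),
    "4:3": (1024, 768),
}


def nearest_bucket(width, height):
    """Return the bucket whose aspect ratio is closest to width:height.

    Distances are compared on log(ratio) so that 3:4 and 4:3 are treated
    symmetrically around square.
    """
    target = math.log(width / height)

    def distance(name):
        bucket_w, bucket_h = AR_BUCKETS[name]
        return abs(math.log(bucket_w / bucket_h) - target)

    return min(AR_BUCKETS, key=distance)


portrait = nearest_bucket(3024, 4032)    # typical phone portrait photo
landscape = nearest_bucket(4032, 3024)
square = nearest_bucket(1080, 1080)
```

The returned name would then select the matching function inside the multi-function model at runtime.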
4. Swift Port
Full Swift CLI with:
- Live Qwen3 tokenization via swift-transformers (no Python runtime)
- Runtime function selection for AR-aware inference
- CoreML model loading with .mlmodelc pre-compilation
- Performance: ~41s generate, ~75s transform, ~128s compose on M3 Max
- Peak RAM: ~3.7 GB for generate/transform, ~4.5 GB for 2-ref compose (fits iPhone 15 Pro Max)
5. Parameter Optimization
60+ runs across systematic parameter sweeps:
- 6 steps recommended over BFL’s default 4 (+6% quality, 50% more time — worth it for our use case)
- No systematic seed structure — seed variance is per-node noise, not a tunable knob
- denoise_strength=1.0 always — lower values hurt Klein unlike SDXL
- Steps floor: Even 1 step produces usable output (CLIP 28.32) — viable for fast previews
6. AR Selection for Compose
When composing multiple images, what aspect ratio should the output be? We built a 3-tier additive system:
- Tier 1 — Vote: Majority vote across source image dimensions (EXIF-corrected)
- Tier 2 — Vision: Apple Vision framework saliency + face detection analysis
- Tier 3 — LLM: Apple Intelligence on-device reasoning via FoundationModels
The tiers disagree 86% of the time. After 2,286 LLM calls across 254 nodes × 3 modes × 3 runs, we found that faces-only mode (dropping saliency data, keeping only face count) gives the best results: 84.8% consistency, 52% agreement with Tier 1. The LLM has a systematic portrait bias (84%) that saliency data amplifies — removing saliency removes the noise.
Each tier is optional with graceful fallback for older devices.
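Tier 1 is the simplest of the three and can be sketched in a few lines. The version below is an illustration, not the shipped logic: the names orientation and vote_aspect_ratio, the exact-equality test for square images, and the tie-break to 1:1 are all assumptions. The EXIF handling relies on the standard convention that orientation values 5 to 8 denote a 90-degree rotation, so width and height swap.

```python
from collections import Counter

ROTATED_EXIF = {5, 6, 7, 8}  # EXIF orientations that swap width and height


def orientation(width, height, exif_orientation=1):
    """Classify one source image into an AR bucket after EXIF correction."""
    if exif_orientation in ROTATED_EXIF:
        width, height = height, width
    if width == height:
        return "1:1"
    return "4:3" if width > height else "3:4"


def vote_aspect_ratio(sources, tie_default="1:1"):
    """Majority vote over sources given as (width, height[, exif_orientation])."""
    ranked = Counter(orientation(*s) for s in sources).most_common()
    if not ranked:
        return tie_default
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return tie_default  # assumption: ties fall back to square
    return ranked[0][0]


choice = vote_aspect_ratio([
    (4000, 3000, 6),  # stored landscape, displayed portrait
    (3000, 4000, 1),  # portrait
    (1000, 1000, 1),  # square
])
```

Because this tier needs only image dimensions and EXIF tags, it can serve as the fallback on devices without Vision or Apple Intelligence support.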
Safety & Quality Systems
Scorer Engineering
We needed automated quality evaluation to benchmark at scale. The journey:
- GPT-4o: Clustered all scores at 7-9, couldn’t distinguish good outputs from broken ones. Identity replacement scored 7.3 when it should score 3.
- Three prompt iterations (strict caps → distribution targets → balanced): Improved artifact detection but hit a structural ceiling on visual comparison.
- Claude Opus API: Successfully detected merge artifacts (2.8 vs GPT-4o’s 7.7) but had a “sycophancy problem” — described clearly broken outputs as high quality. Discrimination score: 0.12 (nearly zero).
- GPT-5.4 with v4 chain-of-thought prompts: Discrimination score 3.53 (good outputs 7.55, bad outputs 4.02). Consistent (σ < 0.3), cheap (~$1.30 for 65 nodes), zero parse errors.
GPT-5.4 v4 CoT is our production scorer.
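The discrimination figures quoted above are consistent with a simple gap between the mean score of known-good outputs and the mean score of known-bad outputs (7.55 - 4.02 = 3.53). Assuming that definition, the metric is a one-liner; the function name discrimination is hypothetical.

```python
from statistics import mean


def discrimination(good_scores, bad_scores):
    """Gap between mean scores of known-good and known-bad outputs."""
    return mean(good_scores) - mean(bad_scores)


# Using the reported group means for GPT-5.4 with v4 prompts
gpt54_gap = round(discrimination([7.55], [4.02]), 2)

# A scorer that rates everything the same has no discrimination at all
flat_gap = discrimination([8, 9], [8, 9])
```

A scorer that clusters everything at 7-9 lands near zero on this measure no matter how plausible its individual scores look.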
AR Selection (3-tier)
Described above — uses on-device Apple Intelligence to choose output dimensions intelligently for compose mode, with fallback through Vision framework and simple voting for older devices.
Quality vs. Mode
The scoring revealed a clear quality hierarchy that maps to product decisions:
| Mode | Score | Implication |
|---|---|---|
| Generate | 8.35 | Ship confidently — matches API quality |
| Transform | 5.97 | Ship with caveats — identity preservation is weak |
| Compose (scene+scene) | 5.34 | Works well — lean into environment blending |
| Compose (people) | 4.18-4.29 | Risky — consider UI guidance away from multi-person composition |
No Explicit NSFW Filter (Yet)
The pipeline does not include a safety classifier. BFL’s model has some implicit safety training, but there’s no hard filter. This is a remaining item for production.
Where We Are
18 Phases Complete
| Phase | What | Key Result |
|---|---|---|
| 0 | Discovery & setup | Architecture mapped, all assumptions corrected |
| 1-4 | Component conversion | All 4 models → CoreML, individually validated |
| 5 | Python CoreML pipeline | All 3 modes working, SSIM 0.39-0.79 vs PyTorch |
| 6 | Benchmark v1 (txt2img) | 83% of API ceiling, 14 nodes |
| 7-10 | Quantization + Swift port | 5.2 GB total, live tokenization, no Python runtime |
| 11 | Benchmark v2 (all modes) | 89.5% of API ceiling, 14 real Vorge nodes |
| 12-13 | Parameter sweeps | 6 steps, no seed structure, no quality cliff |
| 14 | Aspect ratio support | 3 AR buckets, multi-function models, no quality regression |
| 15 | AR orientation fix | Dimension swap + EXIF handling fixed |
| 16 | Compose AR selection | 3-tier system (Vote + Vision + LLM), 82 nodes tested |
| 17 | LLM prompt experiment | 2,286 calls, faces-only mode wins |
| 18 | Full pipeline benchmark | 65 nodes scored by GPT-5.4: 5.16 overall composite |
| 18b | Scorer prompt engineering | GPT-4o ceiling identified, Claude prototype tested |
| 18c | Scorer model comparison | GPT-5.4 v4 CoT beats Claude Opus API |
| 18d | GPT-5.4 full rescore | Definitive 65-node benchmark with failure mode analysis |
All acceptance criteria from the original task are met.
Remaining Work
- Neural Engine benchmark: all testing so far uses cpuAndGPU compute units. Haven’t tested cpuAndNE (Neural Engine), which is the actual target for iPhone deployment and could be significantly faster.
- 8 GB device gating: 3-ref compose is marginal at ~5 GB peak RAM. Need to test on real iPhone 15 Pro Max and potentially gate 3-ref compose on 8 GB devices.
- Source preservation improvements: the model’s Achilles heel (4.07-4.15 scores). This is likely an architectural limitation of 4B parameters, but pipeline-level tuning (adapter strength, two-pass generation) may help.
- NSFW safety classifier: no hard content filter exists yet.
- ODR packaging: models need to be split into ≤512 MB asset packs for Apple’s On-Demand Resources delivery system.
- Production Swift integration: the Swift CLI validates the pipeline works; it needs to be integrated into the Vorge iOS app as a framework.
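For the ODR packaging item, a quick estimate of how many asset packs the quantized models would need is sketched below. It assumes each model is packaged separately, that model files can be split at arbitrary points, and it uses the single-AR quantized sizes from the conversion table; none of that has been validated against an actual ODR build.

```python
import math

ODR_PACK_LIMIT_MB = 512

# Quantized sizes in MB from the component conversion table
model_sizes_mb = {
    "Qwen3 Text Encoder": 2229,
    "MM-DiT Transformer": 2773,
    "VAE Decoder": 95,
    "VAE Encoder": 66,
}


def packs_needed(size_mb, limit_mb=ODR_PACK_LIMIT_MB):
    """Minimum number of asset packs for one model of the given size."""
    return math.ceil(size_mb / limit_mb)


pack_counts = {name: packs_needed(size) for name, size in model_sizes_mb.items()}
total_packs = sum(pack_counts.values())
```

Grouping the two small VAE models into one pack would reduce the count by one, at the cost of downloading the encoder even for generate-only use.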
48 Technical Lessons Documented
The full process document captures 48 lessons learned across all phases — from torch.export migration patterns to multi-function model weight deduplication to the discovery that high SSIM doesn’t correlate with high quality (a node with SSIM 0.80 scored only 5.1/10 because it had high structural similarity but poor task completion).
Six Identified Failure Modes
The GPT-5.4 benchmark identified six distinct failure patterns in compose/transform mode:
- Collage/split-screen — model tiles sources side-by-side instead of integrating them
- Identity replacement — generates plausible but wrong people (affects ~80% of transform nodes with faces)
- Subject fusion — merges two subjects into one grotesque figure
- Source dropping — picks one source image and ignores the others entirely
- Subject hallucination — invents extra people or duplicates subjects
- Species/type change — converts cats to humans during style transfer
These are documented with specific examples, root causes, and tuning priorities. Generate mode avoids all of them. The quality cliff from generate (8.35) to compose (4.66) is almost entirely about whether source identity must be preserved — when it doesn’t (scene+scene, objects, stylized characters), composition works well.
18 phases, 65 benchmark nodes, 2,286 LLM calls, 5.2 GB of quantized models, zero Python runtime dependencies. From a server-side research pipeline to something that fits in your pocket.