Fix Word Salad (Tensor Corruption) on ARM Devices for i2_s Quantization by uno-km · Pull Request #551 · microsoft/BitNet

uno-km · 2026-04-26T13:56:52Z

Hello,
The code has been updated to resolve ARM architecture build failures. The entire process of identifying the root causes, applying fixes, and running final tests to ensure stability has been completed.
While Gemini was used for the initial draft, the final solution is based on my own technical validation and experience. For more details on how these ARM-specific issues were tackled, please check my blog post:
https://uno-kim.tistory.com/462
Best regards,

🐛 The Problem

When running models quantized to i2_s (1.58-bit) on non-Apple ARM devices (e.g., Android devices with Exynos/Snapdragon via Termux/PRoot), the model produces severe "Word Salad" (meaningless token generation).

🔍 Root Cause Analysis

The root cause was a memory layout mismatch between the x86 packing phase and the ARM NEON unpacking phase:

QK_I2_S was hardcoded to 64 for __ARM_NEON in the macro definition, while the GGUF models packed on x86 strictly enforce QK=128.
The packing logic in quantize_i2_s for NEON used a 16-stride jump, whereas AVX2 uses a 32-stride.
The decoding kernels (ggml_vec_dot_i2_i8_s_*) had hardcoded loop unrolling that assumed a 64-block size, causing out-of-bounds padding reads and complete accumulator corruption.

🛠️ Changes Implemented

Macro Standardization: Enforced #define QK_I2_S 128 across all architectures.
Packing Sync: Updated quantize_i2_s for NEON to use the exact same 32-stride packing layout as AVX2.
Kernel Rewrite: Completely refactored the 1x1, 1xN, and Nx1 NEON kernels. Removed the legacy QK=64 loop unrolling and replaced it with a dynamic block loop (nb = n / 128).
Bit Extraction: Aligned the 2-bit MSB-to-LSB extraction sequence to be 100% mathematically identical to the _mm256_srli_epi16 logic.
Safety Upgrade: Changed the final horizontal sum to vaddlvq_s32 (64-bit) to ensure absolute safety against overflow during massive multi-threaded prompt evaluations.

🧪 Testing

Device: Samsung Galaxy A35 (Exynos 1380, ARMv8.2-A).
Environment: Ubuntu PRoot via Termux.
Result: Successfully generates coherent text with full multi-threading (-t 8). The DotProd (__ARM_FEATURE_DOTPROD) hardware acceleration and FMA fallbacks both work flawlessly.

변경

Changes: Unified QK Standard: Strictly enforced QK_I2_S = 128 across NEON and Scalar paths to match the standard GGUF packing layout. Refactored Loop Logic: Removed legacy group32_num and la_num chunks. Replaced with a clean, block-level loop to prevent pointer corruption. NEON Optimization: Implemented a dual 16-byte chunk load strategy within the 32-byte weight block to maximize SIMD register utilization. Mathematical Alignment: Synchronized bit-unpacking order (MSB to LSB) with the AVX2 reference. Implemented 32-stride interleaved memory fetching for activations (y). Removed redundant (-1) offset mapping to leverage zero-mean distribution properties, matching the high-performance AVX2 kernel behavior. Result: Completely resolved the word salad issue on Exynos/Snapdragon chips. Validated logical consistency across AVX2, NEON, and Pure C++ Scalar fallback paths.

…g QK layout to 128 - Standardized `QK_I2_S` to 128 for `__ARM_NEON` to match the x86 GGUF packing standard. - Fixed memory misalignment in `quantize_i2_s` by updating the packing stride to 32. - Refactored `ggml_vec_dot_i2_i8_s` NEON kernels (1x1, 1xN, Nx1) to use a dynamic block-level loop (`nb = n / QK`) instead of hardcoded 64-stride loop unrolling. - Aligned interleaved memory fetching (`vld1q_s8`) with the AVX2 logic. - Upgraded accumulator horizontal sum to `vaddlvq_s32` (64-bit) to prevent potential 32-bit integer overflow in extended context scenarios. Tested on Exynos 1380 (Android PRoot) with `-t 8`. Output generation is now 100% stable without word salad.

uno-km · 2026-04-26T14:22:58Z

@microsoft-github-policy-service agree

betovildoza · 2026-04-26T19:24:05Z

Hey,
We'll keep an eye on this PR. We experienced very similar "word salad" / tensor corruption issues on Oracle Cloud ARM Ampere A1 with the official i2_s model (see #468 and #470).
Looking forward once it's ready.

uno-km · 2026-04-27T10:29:22Z

Hi @betovildoza,

Thank you for reviewing! After reading through Issue #468, I can confidently say that this PR is the exact antidote for Bug 3 (GGGG garbage) and Bug 4 (Word Salad) you mentioned.

I noticed you suspected act_sums or scale offsets for Bug 4. However, the true root cause was much deeper in the memory layout phase.
The x86 packing phase enforces QK=128 (32-stride), but the __ARM_NEON macros and kernel loop unrolling were strictly hardcoded to QK=64 (16-stride). Because of this mismatch, the ARM NEON kernels were unpacking out-of-bounds memory and reading complete garbage, which naturally led to the semantic incoherence (Word Salad) regardless of the decoding path.

By synchronizing the NEON packing logic to a 32-stride and replacing the hardcoded 64-block loop with a dynamic 128-block logic, both the GGGG issue and the Word Salad are completely resolved. Also, changing the horizontal sum to vaddlvq_s32 prevented the accumulator overflow.

It's running perfectly on my Exynos environment now. I'm highly confident this will bring your Oracle Ampere A1 back to life. Let me know if you need any help testing it on your end!

uno-km added 5 commits April 24, 2026 17:02

Update ggml-bitnet-mad.cpp

67eb757

Update ggml-bitnet-mad.cpp

4fcd280

Refactor scalar fallback for ARM/Exynos in ggml-bitnet-mad

8cdd415

변경

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix Word Salad (Tensor Corruption) on ARM Devices for i2_s Quantization#551

Fix Word Salad (Tensor Corruption) on ARM Devices for i2_s Quantization#551
uno-km wants to merge 5 commits intomicrosoft:mainfrom
uno-km:main

uno-km commented Apr 26, 2026

Uh oh!

uno-km commented Apr 26, 2026

Uh oh!

betovildoza commented Apr 26, 2026

Uh oh!

uno-km commented Apr 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

uno-km commented Apr 26, 2026

🐛 The Problem

🔍 Root Cause Analysis

🛠️ Changes Implemented

🧪 Testing

Uh oh!

uno-km commented Apr 26, 2026

Uh oh!

betovildoza commented Apr 26, 2026

Uh oh!

uno-km commented Apr 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants