Skip to content

Fix Word Salad (Tensor Corruption) on ARM Devices for i2_s Quantization#551

Open
uno-km wants to merge 5 commits intomicrosoft:mainfrom
uno-km:main
Open

Fix Word Salad (Tensor Corruption) on ARM Devices for i2_s Quantization#551
uno-km wants to merge 5 commits intomicrosoft:mainfrom
uno-km:main

Conversation

@uno-km
Copy link
Copy Markdown

@uno-km uno-km commented Apr 26, 2026

Hello,
The code has been updated to resolve ARM architecture build failures. The entire process of identifying the root causes, applying fixes, and running final tests to ensure stability has been completed.
While Gemini was used for the initial draft, the final solution is based on my own technical validation and experience. For more details on how these ARM-specific issues were tackled, please check my blog post:
https://uno-kim.tistory.com/462
Best regards,

🐛 The Problem

When running models quantized to i2_s (1.58-bit) on non-Apple ARM devices (e.g., Android devices with Exynos/Snapdragon via Termux/PRoot), the model produces severe "Word Salad" (meaningless token generation).

🔍 Root Cause Analysis

The root cause was a memory layout mismatch between the x86 packing phase and the ARM NEON unpacking phase:

  1. QK_I2_S was hardcoded to 64 for __ARM_NEON in the macro definition, while the GGUF models packed on x86 strictly enforce QK=128.
  2. The packing logic in quantize_i2_s for NEON used a 16-stride jump, whereas AVX2 uses a 32-stride.
  3. The decoding kernels (ggml_vec_dot_i2_i8_s_*) had hardcoded loop unrolling that assumed a 64-block size, causing out-of-bounds padding reads and complete accumulator corruption.

🛠️ Changes Implemented

  • Macro Standardization: Enforced #define QK_I2_S 128 across all architectures.
  • Packing Sync: Updated quantize_i2_s for NEON to use the exact same 32-stride packing layout as AVX2.
  • Kernel Rewrite: Completely refactored the 1x1, 1xN, and Nx1 NEON kernels. Removed the legacy QK=64 loop unrolling and replaced it with a dynamic block loop (nb = n / 128).
  • Bit Extraction: Aligned the 2-bit MSB-to-LSB extraction sequence to be 100% mathematically identical to the _mm256_srli_epi16 logic.
  • Safety Upgrade: Changed the final horizontal sum to vaddlvq_s32 (64-bit) to ensure absolute safety against overflow during massive multi-threaded prompt evaluations.

🧪 Testing

  • Device: Samsung Galaxy A35 (Exynos 1380, ARMv8.2-A).
  • Environment: Ubuntu PRoot via Termux.
  • Result: Successfully generates coherent text with full multi-threading (-t 8). The DotProd (__ARM_FEATURE_DOTPROD) hardware acceleration and FMA fallbacks both work flawlessly.

uno-km added 5 commits April 24, 2026 17:02
Changes:

Unified QK Standard: Strictly enforced QK_I2_S = 128 across NEON and Scalar paths to match the standard GGUF packing layout.

Refactored Loop Logic: Removed legacy group32_num and la_num chunks. Replaced with a clean, block-level loop to prevent pointer corruption.

NEON Optimization: Implemented a dual 16-byte chunk load strategy within the 32-byte weight block to maximize SIMD register utilization.

Mathematical Alignment:

Synchronized bit-unpacking order (MSB to LSB) with the AVX2 reference.

Implemented 32-stride interleaved memory fetching for activations (y).

Removed redundant (-1) offset mapping to leverage zero-mean distribution properties, matching the high-performance AVX2 kernel behavior.

Result:

Completely resolved the word salad issue on Exynos/Snapdragon chips.

Validated logical consistency across AVX2, NEON, and Pure C++ Scalar fallback paths.
…g QK layout to 128

- Standardized `QK_I2_S` to 128 for `__ARM_NEON` to match the x86 GGUF packing standard.
- Fixed memory misalignment in `quantize_i2_s` by updating the packing stride to 32.
- Refactored `ggml_vec_dot_i2_i8_s` NEON kernels (1x1, 1xN, Nx1) to use a dynamic block-level loop (`nb = n / QK`) instead of hardcoded 64-stride loop unrolling.
- Aligned interleaved memory fetching (`vld1q_s8`) with the AVX2 logic.
- Upgraded accumulator horizontal sum to `vaddlvq_s32` (64-bit) to prevent potential 32-bit integer overflow in extended context scenarios.

Tested on Exynos 1380 (Android PRoot) with `-t 8`. Output generation is now 100% stable without word salad.
@uno-km
Copy link
Copy Markdown
Author

uno-km commented Apr 26, 2026

@microsoft-github-policy-service agree

@betovildoza
Copy link
Copy Markdown

Hey,
We'll keep an eye on this PR. We experienced very similar "word salad" / tensor corruption issues on Oracle Cloud ARM Ampere A1 with the official i2_s model (see #468 and #470).
Looking forward once it's ready.

@uno-km
Copy link
Copy Markdown
Author

uno-km commented Apr 27, 2026

Hi @betovildoza,

Thank you for reviewing! After reading through Issue #468, I can confidently say that this PR is the exact antidote for Bug 3 (GGGG garbage) and Bug 4 (Word Salad) you mentioned.

I noticed you suspected act_sums or scale offsets for Bug 4. However, the true root cause was much deeper in the memory layout phase.
The x86 packing phase enforces QK=128 (32-stride), but the __ARM_NEON macros and kernel loop unrolling were strictly hardcoded to QK=64 (16-stride). Because of this mismatch, the ARM NEON kernels were unpacking out-of-bounds memory and reading complete garbage, which naturally led to the semantic incoherence (Word Salad) regardless of the decoding path.

By synchronizing the NEON packing logic to a 32-stride and replacing the hardcoded 64-block loop with a dynamic 128-block logic, both the GGGG issue and the Word Salad are completely resolved. Also, changing the horizontal sum to vaddlvq_s32 prevented the accumulator overflow.

It's running perfectly on my Exynos environment now. I'm highly confident this will bring your Oracle Ampere A1 back to life. Let me know if you need any help testing it on your end!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants