SHA-3 Crypto Engine Architecture: Sponge Construction, SHAKE/cSHAKE/KMAC, and Verilog RTL


📅 May 13, 2026 · Hardware Security · Cryptographic Hardware · PQC

Published by NIST as FIPS 202 in 2015, SHA-3 is far more than a next-generation hash function. Built on a single primitive, the sponge construction, it is the foundation for a multipurpose cryptographic engine that handles hashing, XOFs (extendable-output functions), MACs, and KDFs from one core. The NIST post-quantum cryptography (PQC) standards ML-KEM and ML-DSA are built on SHA-3/SHAKE, and SLH-DSA defines SHAKE-based parameter sets alongside its SHA-2 ones, so SHA-3 has become essential IP in next-generation secure SoCs. This article covers the full picture: from the mathematical definition of the algorithm, through Verilog RTL design trade-offs, to side-channel resistance.

1. 🧽 Sponge Construction: Breaking from the Merkle-Damgård Paradigm

SHA-1 and SHA-2 both adopted the Merkle-Damgård construction, which is inherently vulnerable to the length-extension attack: an adversary who knows only H(M) and the length of M can compute H(M ‖ pad ‖ suffix) without knowing M, where pad is the padding the hash function itself appended to M. This forces the use of wrappers like HMAC. SHA-3 eliminates this class of attack structurally by introducing the sponge construction.

🔄 Two-Phase Operation

Absorbing: Split the input message into blocks of r bits (the rate) → XOR each block into the internal state → apply one Keccak-f[1600] permutation per block.

Squeezing: After all input is absorbed, extract r bits of output at a time; if more output is needed, invoke Keccak-f again before the next extraction.

▶ Internal state is fixed at b = r + c = 1600 bits. The capacity c determines the security strength.

📊 Absorb → Squeeze Data Flow

Message input M (variable length) → Padding + domain suffix (01 / 1111 / 00, then pad10*1) → Absorb: XOR + Keccak-f (1600-bit state, 24 rounds per block) → Squeeze: extract r bits → Output Z

Length-extension immune: the c capacity bits are never exposed externally.

2. ⚙️ Keccak-f[1600] Round Function Breakdown

The core permutation iterates through 24 rounds, each applying five sequential transformations. Crucially, it uses only bitwise logic — no multiplications or additions. The absence of arithmetic datapaths eliminates carry chains and simplifies timing closure, making SHA-3 exceptionally hardware-friendly.

Step | Symbol | Function | Hardware Cost
Theta | θ | Column-wise diffusion | XOR trees only
Rho | ρ | Per-lane bit rotation | Wire reorder only (0 gates)
Pi | π | Lane position permutation | Wire reorder only (0 gates)
Chi | χ | The sole nonlinear transformation | AND·NOT·XOR (area-critical)
Iota | ι | Symmetry breaking | 24 round constants via LUT, XORed into lane (0,0)

💡 Only χ is nonlinear. θ and ι are XOR-only, and ρ and π are pure wire reroutes that synthesize no gates. In practice, most of the RTL area comes from the θ and χ gates and the 1,600 state flip-flops. (Step mappings: FIPS 202 §3.2)
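The absorb/squeeze flow and the five step mappings can be sketched as a small software model. This is an illustrative reference model, not RTL: the function names (`rol64`, `keccak_f1600`, `sponge`) are made up for this example, the round constants are generated with the LFSR described in FIPS 202 rather than a lookup table, and the domain suffix is passed in its byte-level form (0x06 for SHA-3, 0x1F for SHAKE). The result is cross-checked against Python's `hashlib`.

```python
import hashlib

MASK64 = (1 << 64) - 1

def rol64(v, n):
    """Rotate a 64-bit lane left by n bits (rho is only this, i.e. wiring in hardware)."""
    n %= 64
    return ((v << n) | (v >> (64 - n))) & MASK64

def keccak_f1600(state):
    """Apply the 24-round permutation to a 200-byte state and return the new state."""
    # a[x][y] holds one 64-bit lane, loaded little-endian
    a = [[int.from_bytes(state[8*(x+5*y):8*(x+5*y)+8], "little") for y in range(5)]
         for x in range(5)]
    lfsr = 1
    for _ in range(24):
        # theta: column parities, then XOR two neighbouring parities into every lane
        c = [a[x][0] ^ a[x][1] ^ a[x][2] ^ a[x][3] ^ a[x][4] for x in range(5)]
        d = [c[(x+4) % 5] ^ rol64(c[(x+1) % 5], 1) for x in range(5)]
        a = [[a[x][y] ^ d[x] for y in range(5)] for x in range(5)]
        # rho + pi: rotate each lane and move it to its new (x, y) position
        x, y, cur = 1, 0, a[1][0]
        for t in range(24):
            x, y = y, (2*x + 3*y) % 5
            cur, a[x][y] = a[x][y], rol64(cur, (t+1)*(t+2)//2)
        # chi: the only nonlinear step, applied along each row
        for y in range(5):
            row = [a[x][y] for x in range(5)]
            for x in range(5):
                a[x][y] = row[x] ^ (~row[(x+1) % 5] & row[(x+2) % 5])
        # iota: XOR the round constant (generated by an 8-bit LFSR) into lane (0, 0)
        for j in range(7):
            lfsr = ((lfsr << 1) ^ ((lfsr >> 7) * 0x71)) % 256
            if lfsr & 2:
                a[0][0] ^= 1 << ((1 << j) - 1)
    out = bytearray(200)
    for x in range(5):
        for y in range(5):
            out[8*(x+5*y):8*(x+5*y)+8] = a[x][y].to_bytes(8, "little")
    return out

def sponge(rate_bytes, msg, suffix, out_len):
    """Byte-oriented sponge: pad, absorb rate-sized blocks, then squeeze out_len bytes."""
    padded = bytearray(msg) + bytes([suffix])        # domain suffix + first pad bit
    padded += bytes(-len(padded) % rate_bytes)       # zero fill to the rate boundary
    padded[-1] ^= 0x80                               # final pad bit of pad10*1
    state = bytearray(200)
    for off in range(0, len(padded), rate_bytes):    # absorbing phase
        for i in range(rate_bytes):
            state[i] ^= padded[off + i]
        state = keccak_f1600(state)
    out = bytearray()
    while True:                                      # squeezing phase
        out += state[:rate_bytes]
        if len(out) >= out_len:
            return bytes(out[:out_len])
        state = keccak_f1600(state)

# SHA3-256: rate 136 bytes, suffix 0x06. SHAKE128: rate 168 bytes, suffix 0x1F.
assert sponge(136, b"abc", 0x06, 32) == hashlib.sha3_256(b"abc").digest()
assert sponge(168, b"abc", 0x1F, 64) == hashlib.shake_128(b"abc").digest(64)
```

Only the capacity never appears in `out`: each squeeze step copies the first `rate_bytes` of the state and nothing else, which is the structural reason length extension does not apply.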

3. 🌐 SHA-3 Family and SHAKE — The Power of XOFs

3.1 Fixed-Output Hash Functions (FIPS 202)

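SHA3-224 / SHA3-256 / SHA3-384 / SHA3-512: the output length d is fixed to the number in the function name (in bits), and the capacity is c = 2d. Collision resistance is d/2 = c/4 bits; preimage resistance is d = c/2 bits.

📐 Security Parameter Comparison

SHA3-224: c=448, 112-bit collision / 224-bit preimage
SHA3-256: c=512, 128-bit collision / 256-bit preimage
SHA3-384: c=768, 192-bit collision / 384-bit preimage
SHA3-512: c=1024, 256-bit collision / 512-bit preimage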
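The parameters above follow from simple arithmetic on the 1,600-bit state. A short sketch, assuming the FIPS 202 relation c = 2d for a d-bit digest (the helper name `sha3_params` is illustrative):

```python
STATE_BITS = 1600  # b = r + c

def sha3_params(d):
    """Derive sponge parameters for the fixed-output function SHA3-d."""
    c = 2 * d                      # capacity is twice the digest length
    r = STATE_BITS - c             # the rest of the state is the rate
    return {
        "capacity_bits": c,
        "rate_bits": r,
        "rate_bytes": r // 8,
        "collision_bits": d // 2,  # generic collision bound for a d-bit digest
        "preimage_bits": d,
    }

for d in (224, 256, 384, 512):
    p = sha3_params(d)
    print(f"SHA3-{d}: c={p['capacity_bits']}, r={p['rate_bits']} bits "
          f"({p['rate_bytes']} bytes/block), collision={p['collision_bits']}, "
          f"preimage={p['preimage_bits']}")
```

A larger capacity buys security at the cost of rate: SHA3-512 absorbs 72 bytes per permutation call, while SHA3-256 absorbs 136.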

3.2 SHAKE — Extendable-Output Functions

SHAKE128(M, L): 128-bit security, L-bit variable output

SHAKE256(M, L): 256-bit security, L-bit variable output

▶ A shorter output of the same input is always an exact prefix of a longer one. This property makes SHAKE a natural fit for PRNGs, KDFs (key derivation functions), mask generation, and PQC signatures such as SPHINCS+ and Dilithium.

▶ Domain separation: append suffix 1111 + pad10*1 to the message (SHA-3 fixed-output uses 01). The differing suffix bits separate the two families while they share the same Keccak-f permutation.
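Both properties are easy to observe with Python's standard `hashlib`. The snippet below is only a demonstration on one arbitrary message; the variable names are illustrative.

```python
import hashlib

msg = b"sponge demo"

# Prefix property: a 16-byte SHAKE128 output is the start of the 64-byte output.
short_out = hashlib.shake_128(msg).digest(16)
long_out = hashlib.shake_128(msg).digest(64)
prefix_holds = long_out[:16] == short_out

# Domain separation: SHA3-256 and SHAKE256 share c = 512 and r = 1088,
# so for the same message only the suffix bits (01 vs 1111) differ.
sha3_digest = hashlib.sha3_256(msg).digest()
shake_digest = hashlib.shake_256(msg).digest(32)
domains_differ = sha3_digest != shake_digest

print("prefix property holds:", prefix_holds)
print("SHA3-256 and SHAKE256 outputs differ:", domains_differ)
```

The prefix property is also why a protocol must bind the intended output length into the input when outputs of different lengths should be unrelated, which is what KMAC does with right_encode(L).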

4. 🔑 cSHAKE and KMAC: The Encoding Layer Defined by SP 800-185

4.1 Encoding Helper Functions — Why They Are Necessary

When absorbing variable-length arguments — keys, customization strings, output lengths — into a single bitstream, boundary ambiguity can cause distinct input combinations to hash to the same value. NIST SP 800-185 defines four encoding functions that eliminate this risk.

Function | Definition | Use
left_encode(x) | Big-endian bytes of integer x, prefixed with their byte count n | Mark a length before a message field
right_encode(x) | Big-endian bytes of integer x, suffixed with their byte count n | Encode output length L in KMAC
encode_string(S) | left_encode(bit-length of S) ‖ S | Eliminate string boundary ambiguity
bytepad(X, w) | left_encode(w) ‖ X, zero-padded to a multiple of w bytes | Align to rate boundary
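A byte-oriented sketch of the four helpers, following the SP 800-185 definitions for byte-aligned inputs (bit-granular strings are not handled here), with a small example of the ambiguity they remove:

```python
def left_encode(x: int) -> bytes:
    """Byte count n, then x as n big-endian bytes (n is at least 1)."""
    n = max(1, (x.bit_length() + 7) // 8)
    return bytes([n]) + x.to_bytes(n, "big")

def right_encode(x: int) -> bytes:
    """x as n big-endian bytes, then the byte count n."""
    n = max(1, (x.bit_length() + 7) // 8)
    return x.to_bytes(n, "big") + bytes([n])

def encode_string(s: bytes) -> bytes:
    """Prefix a byte string with its length in bits."""
    return left_encode(8 * len(s)) + s

def bytepad(x: bytes, w: int) -> bytes:
    """Prefix with left_encode(w), then zero-pad to a multiple of w bytes."""
    z = left_encode(w) + x
    return z + bytes(-len(z) % w)

# Plain concatenation cannot tell ("ab", "c") from ("a", "bc") ...
assert b"ab" + b"c" == b"a" + b"bc"
# ... but the length prefixes make the two encodings distinct.
assert encode_string(b"ab") + encode_string(b"c") != encode_string(b"a") + encode_string(b"bc")

print(left_encode(256).hex())         # 020100
print(right_encode(256).hex())        # 010002
print(encode_string(b"KMAC").hex())   # 01204b4d4143
```

In hardware these become the serializer and pacer described later in the article: the integer is stripped of leading zero bytes, and a counter fills zeros up to the rate boundary.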

4.2 cSHAKE — Domain-Separable SHAKE

cSHAKE128(X, L, N, S) = KECCAK[256] ( bytepad(encode_string(N) ‖ encode_string(S), 168) ‖ X ‖ 00 , L )

N: NIST-reserved function name (e.g., "KMAC", "TupleHash"). Leave empty for user applications.

S: Application-side customization string (for domain separation)

168: SHAKE128 rate in bytes (1,344 bits). cSHAKE256 uses 136 bytes (1,088 bits).

Suffix 00: Separates the cSHAKE domain from plain SHAKE (suffix 1111)

※ If both N and S are empty, cSHAKE falls back to plain SHAKE by definition (SP 800-185 §3.3)

4.3 KMAC — Keccak Message Authentication Code

KMAC128(K, X, L, S) = cSHAKE128 ( bytepad(encode_string(K), 168) ‖ X ‖ right_encode(L), L, "KMAC", S )

Why it is secure: Key K is absorbed ahead of the message, and output length L is appended at the end via right_encode. Even with an identical (K, X) pair, changing L produces statistically independent outputs, blocking prefix-relation attacks. This also eliminates the double-hash overhead that HMAC requires — the most significant practical advantage of KMAC over its predecessor.

📋 KMAC Operation Sequence — 32-byte Key + "Hello" Message

① encode_string(K): left_encode(256) ‖ K[0..31]
② bytepad(…, 168): zero-pad to the 168-byte block boundary
③ Append X = "Hello": absorb the message body
④ right_encode(256): output length L metadata
⑤ cSHAKE128(…, 256, "KMAC", "Auth"): 24 rounds per absorbed block
⑥ Squeeze → 32-byte tag: MAC = T[0..31] (256 bits)

🔐 The result is a message authentication code bound to the "Auth" domain. The same (K, X) with L=512 yields a completely different tag.
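The cSHAKE128 and KMAC128 formulas above can be exercised end to end in software. The sketch below is a compact, unoptimized model for byte-aligned inputs only; all function names are illustrative, the permutation is a condensed form of the FIPS 202 step mappings, and the 32-byte key is an arbitrary placeholder. The empty-N, empty-S case is checked against `hashlib.shake_128`.

```python
import hashlib

def rol64(v, n):
    n %= 64
    return ((v << n) | (v >> (64 - n))) & ((1 << 64) - 1)

def keccak_f1600(state):
    a = [[int.from_bytes(state[8*(x+5*y):8*(x+5*y)+8], "little") for y in range(5)]
         for x in range(5)]
    lfsr = 1
    for _ in range(24):
        c = [a[x][0] ^ a[x][1] ^ a[x][2] ^ a[x][3] ^ a[x][4] for x in range(5)]   # theta
        d = [c[(x+4) % 5] ^ rol64(c[(x+1) % 5], 1) for x in range(5)]
        a = [[a[x][y] ^ d[x] for y in range(5)] for x in range(5)]
        x, y, cur = 1, 0, a[1][0]                                                 # rho + pi
        for t in range(24):
            x, y = y, (2*x + 3*y) % 5
            cur, a[x][y] = a[x][y], rol64(cur, (t+1)*(t+2)//2)
        for y in range(5):                                                        # chi
            row = [a[x][y] for x in range(5)]
            for x in range(5):
                a[x][y] = row[x] ^ (~row[(x+1) % 5] & row[(x+2) % 5])
        for j in range(7):                                                        # iota
            lfsr = ((lfsr << 1) ^ ((lfsr >> 7) * 0x71)) % 256
            if lfsr & 2:
                a[0][0] ^= 1 << ((1 << j) - 1)
    out = bytearray(200)
    for x in range(5):
        for y in range(5):
            out[8*(x+5*y):8*(x+5*y)+8] = a[x][y].to_bytes(8, "little")
    return out

def sponge(rate, msg, suffix, out_len):
    padded = bytearray(msg) + bytes([suffix])
    padded += bytes(-len(padded) % rate)
    padded[-1] ^= 0x80
    state = bytearray(200)
    for off in range(0, len(padded), rate):
        for i in range(rate):
            state[i] ^= padded[off + i]
        state = keccak_f1600(state)
    out = bytearray(state[:rate])
    while len(out) < out_len:
        state = keccak_f1600(state)
        out += state[:rate]
    return bytes(out[:out_len])

def _enc(x):
    n = max(1, (x.bit_length() + 7) // 8)
    return n, x.to_bytes(n, "big")

def left_encode(x):
    n, b = _enc(x)
    return bytes([n]) + b

def right_encode(x):
    n, b = _enc(x)
    return b + bytes([n])

def encode_string(s):
    return left_encode(8 * len(s)) + s

def bytepad(x, w):
    z = left_encode(w) + x
    return z + bytes(-len(z) % w)

def cshake128(x, out_bits, n=b"", s=b""):
    if not n and not s:                       # falls back to plain SHAKE128 (suffix 1111)
        return sponge(168, x, 0x1F, out_bits // 8)
    prefix = bytepad(encode_string(n) + encode_string(s), 168)
    return sponge(168, prefix + x, 0x04, out_bits // 8)   # suffix 00 + first pad bit

def kmac128(key, x, out_bits, s=b""):
    new_x = bytepad(encode_string(key), 168) + x + right_encode(out_bits)
    return cshake128(new_x, out_bits, b"KMAC", s)

key = bytes(range(32))                        # placeholder key, for illustration only
tag256 = kmac128(key, b"Hello", 256, b"Auth")
tag512 = kmac128(key, b"Hello", 512, b"Auth")
assert cshake128(b"abc", 256) == hashlib.shake_128(b"abc").digest(32)
assert tag512[:32] != tag256                  # L is bound into the input
print(tag256.hex())
```

Because right_encode(L) is absorbed before squeezing, the 512-bit tag is not an extension of the 256-bit tag, unlike two plain SHAKE outputs of different lengths.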

5. 🔧 Verilog RTL Implementation — Key Design Trade-offs

5.1 Datapath Width vs. Area/Performance Matrix

Architecture | Processing Unit | ASIC Area (GE) | Throughput
Serialized (8/16-bit) | Multiple clocks/round | 2.5k – 5k | Hundreds of Mbps
Iterative (1 round/clock) | 1,600-bit full-width | 10k – 25k | 8 – 15 Gbps
Pipelined (Unrolled) | Multi-stage pipeline | 40k – 100k+ | Up to 100 Gbps

📊 Throughput Comparison by Architecture (approximate)

Serialized (8-bit): ~0.5 Gbps
Iterative (1R/clk): ~12 Gbps
Pipelined (Unrolled): ~100 Gbps
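For the iterative architecture the throughput figure follows from a one-line calculation: one rate-sized block is absorbed every 24 clock cycles. The sketch below assumes the block XOR overlaps with the first round (no extra cycles) and uses a 250 MHz clock purely as an example value, not a measured result.

```python
def block_throughput_gbps(rate_bits, f_clk_mhz, cycles_per_block=24):
    """Sustained absorb throughput in Gbps for long messages."""
    return rate_bits * f_clk_mhz * 1e6 / cycles_per_block / 1e9

# One round per clock: 24 cycles per block
print(round(block_throughput_gbps(1088, 250), 2))   # SHA3-256 / SHAKE256: about 11.33
print(round(block_throughput_gbps(1344, 250), 2))   # SHAKE128: 14.0

# Two rounds unrolled per clock: 12 cycles per block, usually at a lower clock
print(round(block_throughput_gbps(1088, 150, cycles_per_block=12), 2))   # 13.6
```

The same function shows why unrolling does not scale linearly: halving the cycle count helps only if the longer combinational path does not cut the achievable clock by a similar factor.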

5.2 Round Circuit RTL Structure

Recommended structure: 1,600-bit state register + combinational round function (θ→ρ→π→χ→ι) + 24-count FSM

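▶ θ: Column-parity XOR trees. A 5-input XOR computes the parity of each of the 320 columns (5 × 64); every state bit is then XORed with the parities of two neighboring columns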

▶ ρ: Bit rotation — wire reorder only, 0 gates

▶ π: Lane permutation (the 25 lanes move to new positions), wire-only

▶ χ: Evaluate a ⊕ ((¬b) ∧ c) across all 25 lanes × 64 bits — actual gate synthesis occurs here

▶ ι: 24 round constants via LUT

5.3 Routing Congestion

Because ρ and π are wire-only, all 1,600 bits are effectively shuffled to near-arbitrary positions. This causes routing congestion and increased wire delay during ASIC place-and-route (as documented by the Keccak Team). On FPGAs, the shuffle is absorbed within LUT internal mappings, so the impact is significantly smaller.

5.4 Padding and Encoding Control Logic — cSHAKE/KMAC FSM

Adding cSHAKE/KMAC support on top of a plain SHA-3 hash implementation requires the following additional circuits.

left_encode / right_encode serializer: Variable-length serializer that strips leading zero bytes from the length counter and attaches the byte count as a prefix or suffix

bytepad pacer: Counter that zero-fills up to the rate boundary (168 B for SHAKE128, 136 B for SHAKE256)

Function-suffix MUX: Appends the appropriate domain suffix right after the last message bit in the final absorb block, ahead of pad10*1: SHA-3=01, SHAKE=1111, cSHAKE=00

Main FSM: IDLE → ABSORB_PREFIX(N,S) → ABSORB_KEY → ABSORB_MSG → ABSORB_L → FINALIZE → SQUEEZE

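When the key K, the function name N (="KMAC"), and the customization string S are all static, prefix precomputation can cache the post-key state in a ROM or register bank, eliminating the absorb cycles for that prefix. This is sometimes referred to as zero-latency keying. The cached state is as sensitive as the key itself and needs the same protection.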
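The saving from prefix precomputation can be estimated by counting Keccak-f invocations per KMAC128 call. This is a rough cycle model under stated assumptions: byte-aligned inputs, 24 cycles per permutation in an iterative core, and no interface or control overhead. The helper names are hypothetical.

```python
import math

RATE = 168          # bytes per block for cSHAKE128 / KMAC128
ROUNDS = 24         # cycles per permutation in a 1-round-per-clock core

def enc_len(x):
    """Bytes produced by left_encode(x) or right_encode(x)."""
    return 1 + max(1, (x.bit_length() + 7) // 8)

def bytepad_blocks(payload_len):
    """Number of rate-sized blocks produced by bytepad(payload, RATE)."""
    return math.ceil((enc_len(RATE) + payload_len) / RATE)

def kmac128_permutations(key_len, msg_len, out_bits, s_len=0, cached_prefix=False):
    # ABSORB_PREFIX: bytepad(encode_string("KMAC") || encode_string(S), 168)
    prefix = bytepad_blocks(enc_len(8 * 4) + 4 + enc_len(8 * s_len) + s_len)
    # ABSORB_KEY: bytepad(encode_string(K), 168)
    key = bytepad_blocks(enc_len(8 * key_len) + key_len)
    # ABSORB_MSG + ABSORB_L + FINALIZE: the suffix and pad10*1 need at least one byte
    tail = msg_len + enc_len(out_bits)
    message = tail // RATE + 1
    # SQUEEZE: the first rate bytes are available after the last absorb permutation
    squeeze = max(0, math.ceil(out_bits / 8 / RATE) - 1)
    head = 0 if cached_prefix else prefix + key
    return head + message + squeeze

full = kmac128_permutations(32, 5, 256, s_len=4)
cached = kmac128_permutations(32, 5, 256, s_len=4, cached_prefix=True)
print(full, "permutations =", full * ROUNDS, "cycles")        # 3 permutations = 72 cycles
print(cached, "permutation =", cached * ROUNDS, "cycles")     # 1 permutation = 24 cycles
```

For short messages such as the 5-byte "Hello" example, the two prefix blocks account for two of the three permutations, which is where the benefit of caching the post-key state comes from.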

5.5 Side-Channel Resistance (SCA) — χ Is the Target

⚠️ In physical implementations, the AND gates in the χ step are the primary target for power analysis (DPA/SPA). The standard countermeasures are Threshold Implementation (TI), which needs at least three shares for first-order protection, and Domain-Oriented Masking (DOM), which needs two shares plus fresh randomness. The area penalty is approximately 3–4× (reaching ~100k GE), as reported by PQShield and the Keccak Team. Additionally, glitches can defeat masking, so register stages must be inserted where terms from different share domains are combined, to prevent unintended propagation.
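The idea behind masking χ can be sketched functionally with a two-share, first-order DOM-style AND gate. This model only checks that the shares recombine to the unmasked result; it says nothing about glitches, timing, or leakage, which depend on the register placement in the actual netlist. All names are illustrative.

```python
import secrets

MASK64 = (1 << 64) - 1

def share(v):
    """Split a 64-bit lane into two Boolean shares: v = s0 ^ s1."""
    m = secrets.randbits(64)
    return (v ^ m, m)

def unshare(s):
    return s[0] ^ s[1]

def dom_and(x, y, r):
    """Two-share AND with fresh randomness r.
    In hardware the cross-domain terms (x0&y1)^r and (x1&y0)^r are registered
    before being combined with the inner-domain terms."""
    x0, x1 = x
    y0, y1 = y
    z0 = (x0 & y0) ^ ((x0 & y1) ^ r)
    z1 = (x1 & y1) ^ ((x1 & y0) ^ r)
    return (z0, z1)

def masked_chi_row(row):
    """Apply chi to one row of five shared lanes: a ^ (~b & c) on shares."""
    out = []
    for x in range(5):
        b = row[(x + 1) % 5]
        not_b = (b[0] ^ MASK64, b[1])             # NOT is linear: invert one share only
        t = dom_and(not_b, row[(x + 2) % 5], secrets.randbits(64))
        out.append((row[x][0] ^ t[0], row[x][1] ^ t[1]))
    return out

def plain_chi_row(row):
    return [row[x] ^ (~row[(x + 1) % 5] & row[(x + 2) % 5] & MASK64) for x in range(5)]

lanes = [secrets.randbits(64) for _ in range(5)]
masked = masked_chi_row([share(v) for v in lanes])
assert [unshare(s) for s in masked] == plain_chi_row(lanes)
```

Each AND now costs four AND gates, fresh random bits, and extra registers, and the state itself is stored once per share, which gives an intuition for the multi-fold area increase quoted above.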

5.6 Recommended Interface

AXI-Stream for message and key input; AXI-Lite to expose N/S/L configuration registers

start / busy / done / squeeze_more handshake signals

✓ When the requested output length exceeds the rate, automatically invoke additional Keccak-f iterations to squeeze the next r bits

6. 🎯 One Core, Many Modes

💡 The entire SHA-3 family operates over the same Keccak-f[1600] core. Only the domain-separation suffix and argument-encoding rules change to produce hashing (SHA-3), XOF (SHAKE), MAC (KMAC), and KDF (cSHAKE). From an RTL perspective, a 1,600-bit state register, one round of combinational logic iterated 24 times, and a padding/encoding FSM together yield a complete multi-mode cryptographic IP block.

That said, (a) routing congestion from ρ/π, (b) side-channel exposure at χ, and (c) the area and leakage cost of the 1,600-bit register remain non-trivial burdens for IoT-class ASICs. Given that ML-KEM and ML-DSA are built on SHA-3/SHAKE and SLH-DSA defines SHAKE-based parameter sets, the SHA-3 engine is effectively mandatory IP for next-generation secure SoCs.

📚 Five Key Takeaways

Immune to length-extension attacks: The c capacity bits are never externally exposed, structurally eliminating the weakness inherent to Merkle-Damgård.

Hardware-friendly by design: XOR/AND/NOT/rotate only; no arithmetic datapaths. A 1,600-bit state plus a 24-round FSM produces clean, compact RTL.

Multi-mode IP from one core: SHA-3 / SHAKE / cSHAKE / KMAC are differentiated by a short domain suffix (01, 1111, or 00) and encoding rules on the same permutation.

Required by PQC standards: ML-KEM and ML-DSA use SHA-3/SHAKE as their core primitive, and SLH-DSA defines SHAKE-based parameter sets.

SCA mitigation cost: Masking the χ gate via TI/DOM incurs a 3–4× area increase. IoT-class SoC designers must weigh this against their threat model and power budget.

📖 References

⚠️ This article is a technical summary based on NIST public standards and academic publications. It does not guarantee the measured performance of any specific product or IP. Area, power, and timing figures in actual ASIC/FPGA implementations can vary significantly depending on the process node, synthesis options, and toolchain. Always perform your own verification.

SoC Design
Semiconductor & SoC Design Notes

Collecting and curating technical materials from a semiconductor and SoC design perspective, with a final review before each post.

Written based on publicly available data and sources. Last updated: June 8, 2026
