SHA-3 and SHAKE Demystified: SoC Hardware Implementation Strategies

From Keccak's sponge construction to SoC accelerator design — a complete technical reference for hardware engineers

SHA-3 is not simply a successor to SHA-2. It represents a fundamental paradigm shift: abandoning the Merkle–Damgård construction entirely in favor of a novel sponge construction. This report covers everything a chip designer needs to know — from algorithmic theory to concrete SoC accelerator microarchitecture decisions.

Why SHA-3 Exists: The Case for a Structural Backup

SHA-3 was standardized by NIST (National Institute of Standards and Technology) in 2015 as FIPS 202, following a public competition launched in 2007. SHA-2 remains cryptographically sound, but the fact that SHA-2 shares its Merkle–Damgård structural lineage with SHA-1 motivated NIST to develop a structurally independent backup standard: if SHA-2 ever falls to a structural attack, SHA-3 provides a heterogeneous fallback. This matters because security policy increasingly demands algorithm agility — the ability to switch primitives without redesigning the entire system.

The Keccak algorithm was designed by Guido Bertoni, Joan Daemen (co-designer of AES), Michaël Peeters, and Gilles Van Assche. NIST selected it in 2012 from a field of 51 first-round candidates. Its most decisive structural advantage over Merkle–Damgård constructions is immunity to length-extension attacks: a class of vulnerability that lets an attacker who knows H(m) and the length of m compute H(m || padding || extension) without knowing m itself.

Key Terminology

Term | Definition | SHA-3 Standard Value
State | Internal working data arranged as a 5×5 matrix of 64-bit lanes | 1,600 bits
Rate (r) | Bits absorbed or squeezed per permutation call; governs throughput | 1,088 bits (SHA3-256, SHAKE256) / 1,344 bits (SHAKE128)
Capacity (c) | Hidden state region that determines security strength; never directly exposed | 512 bits (SHA3-256, SHAKE256) / 256 bits (SHAKE128)
Lane | One 64-bit cell in the 5×5 state matrix; the natural word size for the permutation | 64 bits
Round | One iteration of the five-step Keccak-f permutation sequence | 24 rounds
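The rate and capacity always sum to the 1,600-bit state width, so fixing one fixes the other. The short sketch below tabulates that relationship for the six FIPS 202 functions; the capacities are the FIPS 202 values, while the dictionary and function names are just illustrative choices.

```python
STATE_BITS = 1600  # 5 x 5 lanes x 64 bits per lane

# Capacity in bits for each FIPS 202 function (twice the security strength).
CAPACITY_BITS = {
    "SHA3-224": 448,
    "SHA3-256": 512,
    "SHA3-384": 768,
    "SHA3-512": 1024,
    "SHAKE128": 256,
    "SHAKE256": 512,
}


def rate_bits(capacity_bits: int) -> int:
    """The rate is the part of the state not reserved as capacity."""
    return STATE_BITS - capacity_bits


for name, c in CAPACITY_BITS.items():
    r = rate_bits(c)
    # The lane count is the number of 64-bit words a bus interface
    # has to deliver per absorbed block.
    print(f"{name}: c={c:4d}  r={r:4d} bits = {r // 8} bytes = {r // 64} lanes")
```

For datapath sizing, the lane count is often the most useful column: SHA3-256 absorbs 17 lanes per block and SHAKE128 absorbs 21.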

Keccak-f[1600]: The Five-Step Permutation

The core of SHA-3 is the Keccak-f[1600] permutation, which runs 24 rounds each time it is invoked, once per r-bit block absorbed or squeezed. Each round executes five steps in sequence: θ → ρ → π → χ → ι. Understanding the hardware cost of each step is essential for area/performance budgeting.

Relative Gate Cost per Step

θ (Theta): High
ρ (Rho): Wires only
π (Pi): Wires only
χ (Chi): Medium
ι (Iota): Very low

What Each Step Does — and Why It Matters

θ (Theta) - Column Diffusion: Computes the XOR parity of each column and mixes it into the two neighboring columns. This drives the avalanche effect: flipping a single state bit changes exactly 11 bits after one θ (the bit itself plus five bits in each of two adjacent columns). θ is the most XOR-gate-intensive step, but it is also what makes every bit position depend on the entire state after just a few rounds, which is why 24 rounds provide a comfortable security margin against differential cryptanalysis.
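To make the diffusion claim concrete, here is a minimal Python model of θ, assuming the state is held as a 5×5 list of 64-bit integers indexed A[x][y]. It flips one bit of a random state and counts how many bits differ after θ; the helper names are illustrative and not taken from any particular library.

```python
import random

MASK = (1 << 64) - 1


def rotl(v, n):
    """Rotate a 64-bit lane left by n positions."""
    n %= 64
    return ((v << n) | (v >> (64 - n))) & MASK


def theta(A):
    """Theta step on a 5x5 state A[x][y] of 64-bit lanes."""
    # Column parities: one 64-bit word per x coordinate.
    C = [A[x][0] ^ A[x][1] ^ A[x][2] ^ A[x][3] ^ A[x][4] for x in range(5)]
    # Each column receives the parity of its left neighbor and the
    # rotated parity of its right neighbor.
    D = [C[(x - 1) % 5] ^ rotl(C[(x + 1) % 5], 1) for x in range(5)]
    return [[A[x][y] ^ D[x] for y in range(5)] for x in range(5)]


def popcount(A):
    return sum(bin(lane).count("1") for column in A for lane in column)


rng = random.Random(1)
state = [[rng.getrandbits(64) for _ in range(5)] for _ in range(5)]
flipped = [column[:] for column in state]
flipped[2][3] ^= 1 << 17  # flip a single bit in lane (2, 3)

before, after = theta(state), theta(flipped)
diff = [[before[x][y] ^ after[x][y] for y in range(5)] for x in range(5)]
changed_bits = popcount(diff)
print(changed_bits)  # prints 11
```

Because θ is linear over GF(2), the count does not depend on the starting state or on which bit is flipped.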

ρ (Rho) - Intra-Lane Bit Rotation: Cyclically rotates each 64-bit lane by a fixed offset taken from the triangular numbers (t+1)(t+2)/2, reduced mod 64. In hardware this is pure re-wiring: no gates, no clock cycles. This is a key advantage of the Keccak architecture: what costs a rotate instruction per lane in software costs nothing in RTL.
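The offsets do not need to be typed in by hand. The sketch below regenerates them with the walk described in FIPS 202: start at lane (1, 0), assign the t-th triangular number mod 64, then move to the next lane. The function name and the r[x][y] layout are assumptions made for this example.

```python
def rho_offsets():
    """Return r[x][y], the rotation offset of lane (x, y), per FIPS 202."""
    r = [[0] * 5 for _ in range(5)]  # lane (0, 0) keeps offset 0
    x, y = 1, 0
    for t in range(24):
        r[x][y] = ((t + 1) * (t + 2) // 2) % 64
        x, y = y, (2 * x + 3 * y) % 5  # same index map that pi uses
    return r


offsets = rho_offsets()
for x in range(5):
    print(f"x={x}:", " ".join(f"{offsets[x][y]:2d}" for y in range(5)))
```

A generator like this is a convenient way to emit the constant wiring for RTL, or to cross-check a hand-written offset table before synthesis.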

π (Pi) - Lane Permutation: Moves the lane at position (x, y) of the 5×5 matrix to position (y, (2x + 3y) mod 5). Like ρ, this is realized entirely through wiring, with no logic gates required. Together, ρ and π ensure that diffusion spreads across both the intra-lane and inter-lane dimensions.
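The lane shuffle is easy to check in a few lines. This sketch applies the rule above to a state whose lanes are labeled with their own coordinates, so the result shows directly where each lane ends up; the A[x][y] layout and the names are assumptions for illustration.

```python
def pi(A):
    """Pi step: the lane at (x, y) moves to (y, (2x + 3y) mod 5)."""
    B = [[None] * 5 for _ in range(5)]
    for x in range(5):
        for y in range(5):
            B[y][(2 * x + 3 * y) % 5] = A[x][y]
    return B


# Label every lane with its source coordinates to visualize the wiring.
labels = [[(x, y) for y in range(5)] for x in range(5)]
moved = pi(labels)

print(moved[0][0])  # (0, 0): the origin lane stays in place
print(moved[0][2])  # (1, 0): lane (1, 0) is wired to position (0, 2)
```

Since the mapping is a fixed permutation of 25 lanes, an RTL generator can emit it as 25 plain assignments with no logic in between.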

χ (Chi) - Nonlinear Mixing: The sole nonlinear step, and the cryptographic heart of Keccak. It acts as a 5-bit S-box along each row, built from AND, NOT, and XOR: A[x][y] = A[x][y] XOR ((NOT A[x+1][y]) AND A[x+2][y]), with the x indices taken mod 5. Without χ the whole permutation would be linear over GF(2), so preimage and collision resistance both depend on it. It is also the step targeted by side-channel countermeasures such as DOM masking.
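The formula above can be modeled two ways: lane-wise, as a datapath would compute it, and as a 5-bit S-box on a single row, which is handy for analysis. The sketch below shows both under the same A[x][y] layout assumption and checks that the S-box is a bijection; the names are illustrative.

```python
MASK = (1 << 64) - 1


def chi(A):
    """Lane-wise chi on a 5x5 state A[x][y] of 64-bit integers."""
    return [[A[x][y] ^ (~A[(x + 1) % 5][y] & A[(x + 2) % 5][y] & MASK)
             for y in range(5)] for x in range(5)]


def chi_row(row):
    """Chi on one 5-bit row given as an integer 0..31 (bit x = column x)."""
    out = 0
    for x in range(5):
        a0 = (row >> x) & 1
        a1 = (row >> ((x + 1) % 5)) & 1
        a2 = (row >> ((x + 2) % 5)) & 1
        out |= (a0 ^ ((a1 ^ 1) & a2)) << x
    return out


sbox = [chi_row(v) for v in range(32)]
is_bijective = sorted(sbox) == list(range(32))
print(is_bijective)  # prints True
```

Each state bit costs one NOT, one AND, and one XOR, so an unmasked χ needs 1,600 AND gates per unrolled round. That is the figure that masking schemes multiply.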

ι (Iota) - Round Constant Injection: XORs a unique 64-bit constant RC[i] into lane (0,0) as the last step of each round. Without ι, all 24 rounds would be identical, which would expose the permutation to slide attacks and symmetry-based attacks. The constants are fixed values defined by an LFSR; in RTL, store them as a 24-entry ROM or compute them on-the-fly from a small LFSR to save area.

SHAKE: Extendable-Output Functions and the Sponge Model

SHAKE128 and SHAKE256 are XOFs (Extendable-Output Functions) — they produce a bitstream of arbitrary length rather than a fixed-length digest. Where SHA-3 outputs 224, 256, 384, or 512 bits, SHAKE can produce any number of bits on demand. This property is indispensable for applications that need variable-length keys or masks, such as post-quantum cryptography.
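A quick way to see the difference between a fixed-length hash and an XOF is Python's standard hashlib module, which exposes both. The seed value below is a made-up example input.

```python
import hashlib

seed = b"example seed"  # hypothetical input, for illustration only

# Fixed-length hash: the digest size is part of the algorithm.
fixed = hashlib.sha3_256(seed).digest()

# XOF: the caller chooses the output length at squeeze time.
short = hashlib.shake_128(seed).digest(32)
longer = hashlib.shake_128(seed).digest(64)

print(len(fixed), len(short), len(longer))  # prints 32 32 64
print(longer[:32] == short)                 # prints True
```

The second line of output matters for key derivation: a longer SHAKE output extends the shorter one rather than replacing it, so two requests that differ only in length do not give independent keys. Mix a distinct label into the input for each use instead.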

Analogy: The Sponge and the Juice

Pour juice (input data) into a dry sponge. The deeper the internal cavity (Capacity), the more thoroughly the juice soaks in — and the stronger the security. To extract output, squeeze the sponge. Need more output? Shake it (re-invoke the Keccak permutation) and squeeze again. Repeat as many times as needed. The capacity region is never squeezed out directly, so it acts as a one-way membrane between input and output.

Sponge Construction: Absorbing and Squeezing Phases

[Figure: Sponge construction. Absorbing phase: the input M is padded and split into r-bit blocks; each block is XORed into the state and followed by Keccak-f (24 rounds). Squeezing phase: r-bit output blocks (Output 1, Output 2, ... Output N) are read from the state, with Keccak-f re-invoked between blocks; repeatable indefinitely.]
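The two phases can be written out as a compact software golden model. The sketch below is a minimal, unoptimized reference, assuming an A[x][y] lane layout and byte-aligned messages; it is meant for checking RTL behavior, not for production use. The round constants are the standard FIPS 202 values, and the function names are illustrative.

```python
MASK = (1 << 64) - 1
RC = [
    0x0000000000000001, 0x0000000000008082, 0x800000000000808A,
    0x8000000080008000, 0x000000000000808B, 0x0000000080000001,
    0x8000000080008081, 0x8000000000008009, 0x000000000000008A,
    0x0000000000000088, 0x0000000080008009, 0x000000008000000A,
    0x000000008000808B, 0x800000000000008B, 0x8000000000008089,
    0x8000000000008003, 0x8000000000008002, 0x8000000000000080,
    0x000000000000800A, 0x800000008000000A, 0x8000000080008081,
    0x8000000000008080, 0x0000000080000001, 0x8000000080008008,
]


def rotl(v, n):
    n %= 64
    return ((v << n) | (v >> (64 - n))) & MASK


def keccak_f1600(A):
    """Apply 24 rounds to a 5x5 state A[x][y]; returns a new state."""
    for rc in RC:
        # theta
        C = [A[x][0] ^ A[x][1] ^ A[x][2] ^ A[x][3] ^ A[x][4] for x in range(5)]
        D = [C[(x - 1) % 5] ^ rotl(C[(x + 1) % 5], 1) for x in range(5)]
        A = [[A[x][y] ^ D[x] for y in range(5)] for x in range(5)]
        # rho and pi, walking the 24 non-origin lanes
        x, y = 1, 0
        cur = A[x][y]
        for t in range(24):
            x, y = y, (2 * x + 3 * y) % 5
            cur, A[x][y] = A[x][y], rotl(cur, (t + 1) * (t + 2) // 2)
        # chi
        A = [[A[x][y] ^ (~A[(x + 1) % 5][y] & A[(x + 2) % 5][y])
              for y in range(5)] for x in range(5)]
        # iota
        A[0][0] ^= rc
    return A


def sponge(msg, rate_bytes, suffix, out_len):
    """Absorb msg, then squeeze out_len bytes.

    suffix holds the domain-separation bits plus the first padding bit:
    0x06 for SHA-3, 0x1F for SHAKE.
    """
    padded = bytearray(msg) + bytes(rate_bytes - len(msg) % rate_bytes)
    padded[len(msg)] ^= suffix
    padded[-1] ^= 0x80  # final bit of pad10*1
    A = [[0] * 5 for _ in range(5)]
    # Absorbing phase: XOR one rate-sized block in, then permute.
    for off in range(0, len(padded), rate_bytes):
        for i in range(rate_bytes // 8):
            lane = padded[off + 8 * i:off + 8 * i + 8]
            A[i % 5][i // 5] ^= int.from_bytes(lane, "little")
        A = keccak_f1600(A)
    # Squeezing phase: read the rate portion, permute again if more is needed.
    out = b""
    while True:
        for i in range(rate_bytes // 8):
            out += A[i % 5][i // 5].to_bytes(8, "little")
        if len(out) >= out_len:
            return out[:out_len]
        A = keccak_f1600(A)


print(sponge(b"abc", 136, 0x06, 32).hex())  # SHA3-256 of "abc"
```

Only the rate bytes ever enter or leave the state in `sponge`; the capacity lanes are touched solely by the permutation. This is the property the analogy above describes.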

SoC Hardware Implementation: Throughput vs. Area Trade-offs

Architecture Selection — Iterative, Partial Unrolling, or Full Unrolling

The critical design decision is how many Keccak-f rounds to unroll in combinational logic per clock cycle. Each unrolled round adds roughly one round's worth of area and power and cuts the cycle count per block, but it also lengthens the critical path, so the achievable clock frequency drops and throughput grows less than proportionally. There is no universally correct answer: choose based on your system's target throughput (Gbps) and die area budget.

Strategy | Throughput | Area | Power | Typical Use Case
Iterative (1 round/cycle) | Low | Very small | Low | Embedded SoC, IoT devices
Partial Unrolling (N rounds/cycle) | Medium | Medium | Medium | Mobile AP, security accelerators
Full Unrolling (24 rounds/cycle) | Very high | Very large | High | High-performance servers, HSMs
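A first-order throughput estimate helps place a design in the table above. The sketch below assumes one r-bit block is processed every ceil(24 / N) cycles, plus an optional fixed overhead for loading and padding. The clock frequencies in the example loop are hypothetical placeholders, not measured results.

```python
import math


def throughput_gbps(rate_bits, clock_mhz, rounds_per_cycle, overhead_cycles=0):
    """Peak absorb throughput for a core that unrolls rounds_per_cycle rounds."""
    cycles_per_block = math.ceil(24 / rounds_per_cycle) + overhead_cycles
    return rate_bits * clock_mhz * 1e6 / cycles_per_block / 1e9


# Hypothetical operating points: deeper unrolling usually forces a lower clock
# because more rounds sit on the combinational critical path.
for label, n, mhz in [("iterative", 1, 800.0),
                      ("2 rounds/cycle", 2, 450.0),
                      ("full unroll", 24, 40.0)]:
    gbps = throughput_gbps(1088, mhz, n)
    print(f"{label:>15}: {gbps:6.2f} Gbps at {mhz:.0f} MHz (SHA3-256 rate)")
```

Replace the placeholder clocks with post-synthesis timing from your own library. The model only says something useful once the frequency penalty of each unroll depth is known.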

Round Constants RC[i] — 64-bit, Keccak-f[1600]

These 24 constants are XORed into lane (0,0) during the ι step of each round. In RTL, implement them as a 24-entry ROM or generate them on-the-fly from a Galois LFSR. Only seven bit positions of each constant (0, 1, 3, 7, 15, 31, and 63) can ever be nonzero, so a ROM needs just 7 bits per entry; the LFSR option trades that storage for a small amount of sequencing logic, since it must produce seven bits per round.

// Round Constants (hex)
RC[00] = 0x0000000000000001 RC[12] = 0x000000008000808B
RC[01] = 0x0000000000008082 RC[13] = 0x800000000000008B
RC[02] = 0x800000000000808A RC[14] = 0x8000000000008089
RC[03] = 0x8000000080008000 RC[15] = 0x8000000000008003
RC[04] = 0x000000000000808B RC[16] = 0x8000000000008002
RC[05] = 0x0000000080000001 RC[17] = 0x8000000000000080
RC[06] = 0x8000000080008081 RC[18] = 0x000000000000800A
RC[07] = 0x8000000000008009 RC[19] = 0x800000008000000A
RC[08] = 0x000000000000008A RC[20] = 0x8000000080008081
RC[09] = 0x0000000000000088 RC[21] = 0x8000000000008080
RC[10] = 0x0000000080008009 RC[22] = 0x0000000080000001
RC[11] = 0x000000008000000A RC[23] = 0x8000000080008008
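The LFSR option can be prototyped in software before committing to RTL. The sketch below uses a bit-serial formulation commonly seen in compact reference implementations: an 8-bit LFSR with feedback polynomial x^8 + x^6 + x^5 + x^4 + 1 is stepped seven times per round, and each step decides one bit of the constant. The function name is illustrative.

```python
def round_constants():
    """Generate the 24 iota constants for Keccak-f[1600] from an 8-bit LFSR."""
    constants = []
    lfsr = 1
    for _ in range(24):
        rc = 0
        for j in range(7):
            # One Galois LFSR step; 0x71 folds the bit shifted out of
            # position 7 back into the low bits.
            lfsr = ((lfsr << 1) ^ ((lfsr >> 7) * 0x71)) & 0xFF
            if lfsr & 2:
                rc ^= 1 << ((1 << j) - 1)  # bit positions 0, 1, 3, 7, 15, 31, 63
        constants.append(rc)
    return constants


for i, rc in enumerate(round_constants()):
    print(f"RC[{i:02d}] = 0x{rc:016X}")
```

Comparing this output against the ROM contents is a cheap sanity check, and the seven-position structure shows why a 24 × 7-bit ROM is already very small.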

ρ Rotation Offsets (mod 64)

The cyclic shift amount for each lane is fixed at synthesis time, so implement it as routing only, never as a register or barrel shifter. Note that lane (0,0) has offset 0, meaning it passes through ρ unchanged.

x \ y | y=0 | y=1 | y=2 | y=3 | y=4
x=0 | 0 | 36 | 3 | 41 | 18
x=1 | 1 | 44 | 10 | 45 | 2
x=2 | 62 | 6 | 43 | 15 | 61
x=3 | 28 | 55 | 25 | 21 | 56
x=4 | 27 | 20 | 39 | 8 | 14

Side-Channel Attack Mitigation

When deploying SHA-3 inside a secure element (SE) or TEE (Trusted Execution Environment), the primary physical threats are DPA (Differential Power Analysis) and EM-emission attacks that exploit correlations between intermediate values and measurable physical leakage. Three well-established countermeasures are:

✓ Threshold Implementation (TI): Split the state into three or more shares and evaluate the χ AND gates so that no single share, and no single combinational function, reveals an intermediate value. First-order DPA resistant by construction, but area roughly triples.

✓ Domain-Oriented Masking (DOM): A more area-efficient masking scheme achieving first-order DPA protection at approximately 2–3× area overhead, at the price of fresh random bits for every masked AND. Preferable over TI when die area is constrained.

✓ Shuffling: Randomize the processing order of the 25 lanes each invocation to disrupt temporal alignment of power traces. Low area overhead, but provides weaker protection — combine with masking for security-critical applications.
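To illustrate what masking does to χ, here is a functional sketch of a first-order, two-share DOM AND gate applied to one χ lane. It only demonstrates that the shares recombine to the correct result. A Python model says nothing about leakage: in real hardware the security comes from registering the refreshed cross-domain terms, which this sketch can only mark with comments. All names are illustrative.

```python
import secrets

MASK = (1 << 64) - 1


def share(value):
    """Split a 64-bit value into two Boolean shares."""
    r = secrets.randbits(64)
    return (value ^ r, r)


def unshare(s):
    return s[0] ^ s[1]


def dom_and(a, b, z):
    """Two-share DOM AND; z is a fresh 64-bit random mask."""
    a0, a1 = a
    b0, b1 = b
    # Cross-domain terms are refreshed with z; in RTL they would be
    # registered here before being combined with the inner-domain terms.
    q0 = (a0 & b0) ^ ((a0 & b1) ^ z)
    q1 = (a1 & b1) ^ ((a1 & b0) ^ z)
    return (q0, q1)


def masked_chi_lane(a, b, c):
    """Shared version of a XOR ((NOT b) AND c) for one lane."""
    not_b = (b[0] ^ MASK, b[1])  # invert one share only
    t = dom_and(not_b, c, secrets.randbits(64))
    return (a[0] ^ t[0], a[1] ^ t[1])


a, b, c = (secrets.randbits(64) for _ in range(3))
result = unshare(masked_chi_lane(share(a), share(b), share(c)))
print(result == a ^ (~b & c & MASK))  # prints True
```

Each masked lane consumes 64 fresh random bits per χ evaluation in this scheme, so the randomness budget is worth sizing alongside the area estimate.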

Real-World Applications: Where SHA-3 and SHAKE Fit

Application Domain | Variant Used | Purpose
Post-Quantum Cryptography | SHAKE128, SHAKE256 | Pseudorandom generation in Kyber (ML-KEM) and Dilithium (ML-DSA), both NIST PQC standards
Blockchain | Keccak-256 | Ethereum address derivation and smart contract state hashing (uses the original Keccak padding, so digests differ from SHA3-256)
Key Derivation | SHAKE128 | Variable-length symmetric key derivation (KDF)
Random Number Generation | SHAKE256-based DRBG | TRNG post-processing and CSPRNG construction
Secure Boot | SHA3-256 | Firmware integrity measurement in measured boot chains

Design Takeaways for SoC Engineers

Key Takeaway

SHA-3 is slower than SHA-2 in software, but its regular permutation structure maps exceptionally well to hardware. ρ and π are free (wires only), ι is trivial (ROM lookup or LFSR), and θ/χ dominate area but are highly parallelizable. SHAKE's variable-length output makes it an indispensable primitive for the post-quantum era: treat it as a mandatory accelerator in any next-generation security subsystem, not an optional add-on. The two knobs that determine overall PPA are how you implement the χ nonlinearity (plain vs. masked) and how many rounds you unroll per clock cycle.

SoC Designer Checklist

Define target throughput in Gbps → determine unroll depth accordingly

Choose state storage for the 1,600-bit state: flip-flops (faster, larger) or SRAM (smaller, slower) — explicit area/speed trade-off

ρ and π steps must be wiring only — never register intermediate values between these two steps

RC[i] round constants → implement as 24-entry ROM or on-the-fly LFSR generation

χ step → evaluate DOM masking if the design resides in a secure enclave or SE

Bus interface (AXI/AHB) → integrate DMA to minimize CPU-side overhead for bulk hashing

Validate against NIST CAVP test vectors (FIPS 202 official KAT suite) before tape-out sign-off
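For the validation item, a small pre-check harness can catch padding and block-boundary bugs before the full CAVP run. The sketch below compares a device-under-test model against Python's hashlib on message lengths around the SHA3-256 rate (136 bytes). `dut_hash` is a hypothetical callable standing in for whatever drives your RTL simulation or FPGA prototype; here a hashlib-backed stand-in is used so the example runs on its own.

```python
import hashlib
from typing import Callable, List

SHA3_256_RATE_BYTES = 136


def boundary_messages(rate_bytes: int) -> List[bytes]:
    """Messages whose lengths straddle the block boundary."""
    lengths = [0, 1, rate_bytes - 1, rate_bytes, rate_bytes + 1, 2 * rate_bytes]
    return [bytes((7 * i + n) & 0xFF for i in range(n)) for n in lengths]


def check_sha3_256(dut_hash: Callable[[bytes], bytes]) -> int:
    """Return the number of mismatches between dut_hash and hashlib."""
    failures = 0
    for msg in boundary_messages(SHA3_256_RATE_BYTES):
        expected = hashlib.sha3_256(msg).digest()
        if dut_hash(msg) != expected:
            print(f"MISMATCH at length {len(msg)} bytes")
            failures += 1
    return failures


def stand_in_dut(msg: bytes) -> bytes:
    # Placeholder for a real DUT driver (hypothetical); replace with a call
    # into your simulator or hardware test bench.
    return hashlib.sha3_256(msg).digest()


print("failures:", check_sha3_256(stand_in_dut))  # prints failures: 0
```

The lengths rate-1 and rate matter most: the first forces both padding bits into a single byte, and the second forces an extra all-padding block. A harness like this complements the official vectors and does not replace them.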

References

▶ NIST FIPS 202 — SHA-3 Standard: Permutation-Based Hash and Extendable-Output Functions

▶ Keccak Team official site (keccak.team) — reference implementations and latest research

▶ Cryptology ePrint Archive — SHA-3 hardware implementation papers and side-channel defense techniques

This material is intended for educational and SoC design reference purposes. Always consult the NIST FIPS 202 specification and current security advisories when implementing in production silicon.

SoC Design
Semiconductor & SoC Design Notes

Content is curated from a semiconductor and SoC design perspective, reviewed before publishing.

This article is based on publicly available data and sources. Last updated: June 8, 2026
