SHA-3 and SHAKE Demystified: SoC Hardware Implementation Strategies

From Keccak's sponge construction to SoC accelerator design — a complete technical reference for hardware engineers

SHA-3 is not simply a successor to SHA-2. It represents a fundamental paradigm shift: abandoning the Merkle–Damgård construction entirely in favor of a novel sponge construction. This report covers everything a chip designer needs to know — from algorithmic theory to concrete SoC accelerator microarchitecture decisions.

Why SHA-3 Exists: The Case for a Structural Backup

SHA-3 was standardized by NIST (National Institute of Standards and Technology) in 2015 as FIPS 202, following a public competition launched in 2007. SHA-2 remains cryptographically sound, but the fact that SHA-2 shares its Merkle–Damgård structural lineage with SHA-1 motivated NIST to develop a structurally independent backup standard: if SHA-2 ever falls to a structural attack, SHA-3 provides a heterogeneous fallback. This matters because security policy increasingly demands algorithm agility — the ability to switch primitives without redesigning the entire system.

The Keccak algorithm was designed by Guido Bertoni, Joan Daemen (co-designer of AES), Michaël Peeters, and Gilles Van Assche. It was selected from a field of 51 first-round candidates (64 submissions in total). Its most decisive advantage over Merkle–Damgård constructions is structural immunity to length-extension attacks, a class of vulnerability that lets an attacker who knows H(m) and the length of m compute H(m || padding || extension) without knowing m.

Key Terminology

Term	Definition	SHA-3 Standard Value
State	Internal working data arranged as a 5×5 matrix of 64-bit lanes	1,600 bits
Rate (r)	Bits absorbed or squeezed per permutation call; governs throughput	1,088 bits (SHA3-256, SHAKE256) / 1,344 bits (SHAKE128)
Capacity (c)	Hidden state region that determines security strength; never directly exposed; c = 1,600 − r	512 bits (SHA3-256, SHAKE256) / 256 bits (SHAKE128)
Lane	One 64-bit cell in the 5×5 state matrix; the natural word size for the permutation	64 bits
Round	One iteration of the five-step Keccak-f permutation sequence	24 rounds
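
The rate and capacity values in the table follow from r = 1,600 − c. A minimal sketch that derives the rate for every FIPS 202 function; the dictionary and function names are illustrative choices, not part of any standard API:

```python
# Sponge parameters for the FIPS 202 functions: rate r = 1600 - capacity c.
STATE_BITS = 1600

# Capacity in bits for each function, as specified in FIPS 202.
CAPACITY = {
    "SHA3-224": 448,
    "SHA3-256": 512,
    "SHA3-384": 768,
    "SHA3-512": 1024,
    "SHAKE128": 256,
    "SHAKE256": 512,
}

def rate_bits(name: str) -> int:
    """Return the rate in bits for the named function."""
    return STATE_BITS - CAPACITY[name]

for name, c in CAPACITY.items():
    r = rate_bits(name)
    print(f"{name}: c={c:4d}  r={r:4d} bits ({r // 8} bytes, {r // 64} lanes)")
```

The lane count matters for the bus interface: a SHA3-256 block fills 17 of the 25 lanes, a SHAKE128 block fills 21.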

Keccak-f[1600]: The Five-Step Permutation

The core of SHA-3 is the Keccak-f[1600] permutation, which consists of 24 rounds and is invoked once for every rate-sized block absorbed or squeezed. Each round executes five steps in sequence: θ → ρ → π → χ → ι. Understanding the hardware cost of each step is essential for area/performance budgeting.

Relative Gate Cost per Step

Step	Relative Gate Cost
θ (Theta)	High
ρ (Rho)	Wires only
π (Pi)	Wires only
χ (Chi)	Medium
ι (Iota)	Very low

What Each Step Does — and Why It Matters

θ (Theta) — Column Diffusion: Computes the XOR parity of each column and mixes it into neighboring columns. This drives the avalanche effect: flipping a single input bit changes exactly 11 output bits in one application of θ (the bit itself plus the five bits in each of the two columns that receive its column parity). θ is the most XOR-gate-intensive step, but it is also what makes every bit position depend on the entire state after just a few rounds, which is part of why 24 rounds provide a comfortable security margin against differential cryptanalysis.
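
A minimal software model of θ, assuming the state is held as a 5×5 list of 64-bit integers indexed A[x][y] (a layout chosen for this sketch). It flips one bit and counts how many output bits change:

```python
MASK = (1 << 64) - 1

def rotl64(v, n):
    """Rotate a 64-bit lane left by n positions."""
    n %= 64
    return ((v << n) | (v >> (64 - n))) & MASK

def theta(A):
    """Apply the theta step to a 5x5 state A[x][y] and return a new state."""
    # Column parities: one 64-bit word per x coordinate.
    C = [A[x][0] ^ A[x][1] ^ A[x][2] ^ A[x][3] ^ A[x][4] for x in range(5)]
    # Each column receives the parity of its left neighbor and the
    # one-bit-rotated parity of its right neighbor.
    D = [C[(x - 1) % 5] ^ rotl64(C[(x + 1) % 5], 1) for x in range(5)]
    return [[A[x][y] ^ D[x] for y in range(5)] for x in range(5)]

def hamming_distance(A, B):
    """Count differing bits between two states."""
    return sum(bin(A[x][y] ^ B[x][y]).count("1")
               for x in range(5) for y in range(5))

zero = [[0] * 5 for _ in range(5)]
flipped = [[0] * 5 for _ in range(5)]
flipped[2][3] = 1 << 17  # flip a single, arbitrarily chosen bit

print(hamming_distance(theta(zero), theta(flipped)))
```

In RTL the same structure is 320 five-input XOR trees for the parities plus two XORs per state bit, which is where the step's gate count comes from.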

ρ (Rho) — Intra-Lane Bit Rotation: Cyclically shifts each 64-bit lane by a predetermined constant derived from triangular numbers. In hardware this is pure re-wiring — no gates, no clock cycles. This is a key advantage of the Keccak architecture: what would be a barrel-shifter in software costs nothing in RTL.
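
The rotation offsets can be generated rather than typed in by hand. The sketch below follows the FIPS 202 definition: start at lane (1,0), assign the t-th triangular number (t+1)(t+2)/2 mod 64, then step to the next lane. The function name and the r[x][y] indexing are choices made for this example:

```python
def rho_offsets():
    """Return the 5x5 table of rho rotation offsets, indexed r[x][y]."""
    r = [[0] * 5 for _ in range(5)]  # lane (0,0) keeps offset 0
    x, y = 1, 0
    for t in range(24):
        r[x][y] = ((t + 1) * (t + 2) // 2) % 64
        # Walk the remaining 24 lanes in the order fixed by the standard.
        x, y = y, (2 * x + 3 * y) % 5
    return r

offsets = rho_offsets()
for x in range(5):
    print(f"x={x}", offsets[x])
```

A generator like this is handy in a testbench or an RTL code generator, since a single mistyped offset produces a permutation that still looks plausible but fails every test vector.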

π (Pi) — Lane Permutation: Remaps lane positions in the 5×5 matrix according to the rule (x, y) → (y, 2x+3y mod 5). Like ρ, this is realized entirely through wiring — no logic gates required. Together, ρ and π ensure that diffusion spreads across both the intra-lane and inter-lane dimensions.
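
As a quick illustration of the mapping rule, here is a sketch of π on a 5×5 state indexed A[x][y] (an assumed layout for this example). Each lane is labeled with its source index so the move is visible:

```python
def pi(A):
    """Move the lane at (x, y) to (y, (2x + 3y) mod 5); return a new state."""
    B = [[0] * 5 for _ in range(5)]
    for x in range(5):
        for y in range(5):
            B[y][(2 * x + 3 * y) % 5] = A[x][y]
    return B

# Label each lane with the linear index x + 5y of its original position.
labels = [[x + 5 * y for y in range(5)] for x in range(5)]
moved = pi(labels)
for x in range(5):
    print(f"x={x}", moved[x])
```

Because the mapping is a fixed bijection on lane positions, a synthesis tool reduces it to net renaming.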

χ (Chi) — Nonlinear Mixing: The sole nonlinear step, and the cryptographic heart of Keccak. It acts as a 5-bit S-box on each row, built from AND, NOT, and XOR: A[x][y] = A[x][y] XOR ((NOT A[x+1][y]) AND A[x+2][y]), with the x indices taken mod 5. All pre-image and collision resistance properties depend on χ. It is also the step targeted by side-channel countermeasures such as DOM masking.
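
A small sketch of χ on a 5×5 state indexed A[x][y] (an assumed layout), followed by a check that the step is not linear over XOR, which is the property the other four steps lack:

```python
MASK = (1 << 64) - 1

def chi(A):
    """Apply chi: each lane is XORed with (NOT next lane) AND (lane after next)."""
    return [[A[x][y] ^ (~A[(x + 1) % 5][y] & A[(x + 2) % 5][y] & MASK)
             for y in range(5)] for x in range(5)]

def xor_states(A, B):
    return [[A[x][y] ^ B[x][y] for y in range(5)] for x in range(5)]

def single_bit(x, y):
    """State with only bit 0 of lane (x, y) set."""
    S = [[0] * 5 for _ in range(5)]
    S[x][y] = 1
    return S

a = single_bit(2, 0)
b = single_bit(1, 0)

# For a linear map these two results would be equal.
print(chi(xor_states(a, b)) == xor_states(chi(a), chi(b)))
```

In hardware this is one inverter, one AND, and one XOR per state bit, with all 1,600 bits computed in parallel.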

ι (Iota) — Round Constant Injection: XORs a round-specific 64-bit constant RC[i] into lane (0,0) as the last step of each round. Without ι, all 24 rounds would be identical permutations, enabling slide attacks and symmetry-based attacks. The constants are defined by an LFSR and are fixed; in RTL, store them as a 24-entry ROM or compute them on the fly from a small LFSR to save area.
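
The LFSR option can be modeled in a few lines. This sketch uses the 8-bit LFSR update found in the Keccak team's compact reference code; each round draws seven LFSR bits and places bit j at position 2^j − 1 of the constant. The function name is an illustrative choice:

```python
def round_constants():
    """Generate the 24 Keccak-f[1600] round constants from an 8-bit LFSR."""
    constants = []
    lfsr = 1
    for _ in range(24):
        rc = 0
        for j in range(7):
            # Shift left; on overflow, fold the feedback polynomial back in.
            lfsr = ((lfsr << 1) ^ ((lfsr >> 7) * 0x71)) % 256
            if lfsr & 2:
                rc ^= 1 << ((1 << j) - 1)  # bit positions 0, 1, 3, 7, 15, 31, 63
        constants.append(rc)
    return constants

for i, rc in enumerate(round_constants()):
    print(f"RC[{i:02d}] = 0x{rc:016X}")
```

Since only seven bit positions can ever be set, a ROM-based design needs 24 × 7 bits rather than 24 × 64.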

SHAKE: Extendable-Output Functions and the Sponge Model

SHAKE128 and SHAKE256 are XOFs (Extendable-Output Functions) — they produce a bitstream of arbitrary length rather than a fixed-length digest. Where SHA-3 outputs 224, 256, 384, or 512 bits, SHAKE can produce any number of bits on demand. This property is indispensable for applications that need variable-length keys or masks, such as post-quantum cryptography.
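
A short illustration of the extendable-output property using Python's standard `hashlib`; the seed value is a made-up example. Requesting a longer output from the same input extends the stream rather than changing it:

```python
import hashlib

seed = b"example seed"  # hypothetical input for illustration

short = hashlib.shake_128(seed).digest(32)  # 32 bytes of output
long = hashlib.shake_128(seed).digest(64)   # 64 bytes from the same input

print(short.hex())
print(long[:32] == short)  # the longer output starts with the shorter one
```

This prefix behavior is why an accelerator can stream SHAKE output block by block without knowing the final length in advance.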

Analogy: The Sponge and the Juice

Pour juice (input data) into a dry sponge. The deeper the internal cavity (Capacity), the more thoroughly the juice soaks in — and the stronger the security. To extract output, squeeze the sponge. Need more output? Shake it (re-invoke the Keccak permutation) and squeeze again. Repeat as many times as needed. The capacity region is never squeezed out directly, so it acts as a one-way membrane between input and output.

Sponge Construction: Absorbing and Squeezing Phases
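
The two phases can be sketched as a compact software model. This is an unoptimized illustration, not a reference implementation: it pads the message with the byte-oriented FIPS 202 domain suffix (0x06 for SHA-3, 0x1F for SHAKE), XORs each rate-sized block into the state and permutes (absorbing), then reads out rate-sized blocks, permuting between them (squeezing). The function names and the A[x][y] layout are choices made for this sketch.

```python
import hashlib

MASK = (1 << 64) - 1

def rotl64(v, n):
    n %= 64
    return ((v << n) | (v >> (64 - n))) & MASK

def keccak_f1600(A):
    """24 rounds of theta, rho+pi, chi, iota on a 5x5 state A[x][y]."""
    lfsr = 1
    for _ in range(24):
        # theta
        C = [A[x][0] ^ A[x][1] ^ A[x][2] ^ A[x][3] ^ A[x][4] for x in range(5)]
        D = [C[(x - 1) % 5] ^ rotl64(C[(x + 1) % 5], 1) for x in range(5)]
        A = [[A[x][y] ^ D[x] for y in range(5)] for x in range(5)]
        # rho and pi, walking the 24 non-origin lanes
        x, y = 1, 0
        cur = A[x][y]
        for t in range(24):
            x, y = y, (2 * x + 3 * y) % 5
            cur, A[x][y] = A[x][y], rotl64(cur, (t + 1) * (t + 2) // 2)
        # chi
        for y in range(5):
            T = [A[x][y] for x in range(5)]
            for x in range(5):
                A[x][y] = T[x] ^ (~T[(x + 1) % 5] & T[(x + 2) % 5] & MASK)
        # iota
        for j in range(7):
            lfsr = ((lfsr << 1) ^ ((lfsr >> 7) * 0x71)) % 256
            if lfsr & 2:
                A[0][0] ^= 1 << ((1 << j) - 1)
    return A

def sponge(rate_bytes, data, suffix, out_len):
    """Absorb data, then squeeze out_len bytes. rate_bytes must be a multiple of 8."""
    padded = bytearray(data) + bytes([suffix])
    padded += bytes(-len(padded) % rate_bytes)  # zero-fill to a full block
    padded[-1] ^= 0x80                          # final bit of pad10*1
    A = [[0] * 5 for _ in range(5)]
    # Absorbing phase: XOR each block into the first r bits, then permute.
    for off in range(0, len(padded), rate_bytes):
        block = padded[off:off + rate_bytes]
        for i in range(rate_bytes // 8):
            A[i % 5][i // 5] ^= int.from_bytes(block[8 * i:8 * i + 8], "little")
        A = keccak_f1600(A)
    # Squeezing phase: read r bits, permute again if more output is needed.
    out = bytearray()
    while True:
        for i in range(rate_bytes // 8):
            out += A[i % 5][i // 5].to_bytes(8, "little")
        if len(out) >= out_len:
            return bytes(out[:out_len])
        A = keccak_f1600(A)

msg = b"abc"
print(sponge(136, msg, 0x06, 32) == hashlib.sha3_256(msg).digest())
print(sponge(168, msg, 0x1F, 400) == hashlib.shake_128(msg).digest(400))
```

Note that the capacity lanes are never read or written by the absorb and squeeze loops; only the permutation touches them.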

SoC Hardware Implementation: Throughput vs. Area Trade-offs

Architecture Selection — Iterative, Partial Unrolling, or Full Unrolling

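The critical design decision is how many Keccak-f rounds to unroll in combinational logic per clock cycle. Each unrolled round cuts the cycle count per block but adds roughly one round's worth of logic area and power, and it lengthens the critical path, so the achievable clock frequency drops and throughput scales less than linearly with unroll depth. There is no universally correct answer; choose based on your system's target throughput (Gbps) and die area budget.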

Strategy	Throughput	Area	Power	Typical Use Case
Iterative (1 round/cycle)	Low	Very small	Low	Embedded SoC, IoT devices
Partial Unrolling (N rounds/cycle)	Medium	Medium	Medium	Mobile AP, security accelerators
Full Unrolling (24 rounds/cycle)	Very high	Very large	High	High-performance servers, HSMs
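
A first-order way to compare the three strategies is peak throughput ≈ rate × f_clk / cycles per block, with cycles per block = ceil(24 / N). The sketch below encodes that estimate. The clock frequencies in the example are made-up placeholders, and the model ignores bus, padding, and absorb overhead unless it is supplied through the hypothetical `overhead_cycles` parameter:

```python
import math

def throughput_gbps(rate_bits, f_clk_mhz, rounds_per_cycle, overhead_cycles=0):
    """Estimate peak hashing throughput in Gbps for a given unroll depth."""
    cycles_per_block = math.ceil(24 / rounds_per_cycle) + overhead_cycles
    return rate_bits * f_clk_mhz * 1e6 / cycles_per_block / 1e9

# Illustrative only: the frequencies are assumed to fall as the
# combinational path grows with unroll depth.
for label, n, f_mhz in [("iterative", 1, 500), ("partial", 4, 200), ("full", 24, 40)]:
    gbps = throughput_gbps(1088, f_mhz, n)
    print(f"{label:9s} N={n:2d} f={f_mhz:3d} MHz -> {gbps:6.2f} Gbps")
```

Plugging in post-synthesis timing numbers for each candidate N shows quickly whether deeper unrolling actually buys throughput on a given process node.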

Round Constants RC[i] — 64-bit, Keccak-f[1600]

These 24 constants are XORed into lane (0,0) during the ι step of each round. Only bit positions 0, 1, 3, 7, 15, 31, and 63 are ever set, so a ROM needs just 7 bits per entry. In RTL, implement them as a 24-entry ROM or generate them on the fly from a small 8-bit LFSR; the LFSR option trades the ROM for a few gates and, depending on how it is sequenced, may add latency before the first round.

// Round Constants (hex)
RC[00] = 0x0000000000000001    RC[12] = 0x000000008000808B
RC[01] = 0x0000000000008082    RC[13] = 0x800000000000008B
RC[02] = 0x800000000000808A    RC[14] = 0x8000000000008089
RC[03] = 0x8000000080008000    RC[15] = 0x8000000000008003
RC[04] = 0x000000000000808B    RC[16] = 0x8000000000008002
RC[05] = 0x0000000080000001    RC[17] = 0x8000000000000080
RC[06] = 0x8000000080008081    RC[18] = 0x000000000000800A
RC[07] = 0x8000000000008009    RC[19] = 0x800000008000000A
RC[08] = 0x000000000000008A    RC[20] = 0x8000000080008081
RC[09] = 0x0000000000000088    RC[21] = 0x8000000000008080
RC[10] = 0x0000000080008009    RC[22] = 0x0000000080000001
RC[11] = 0x000000008000000A    RC[23] = 0x8000000080008008

ρ Rotation Offsets (mod 64)

The cyclic shift amount for each lane is fixed at synthesis time, so implement it as routing only, never as a register or barrel shifter. Lane (0,0) has offset 0, meaning it passes through ρ unchanged.

x \ y	y=0	y=1	y=2	y=3	y=4
x=0	0	36	3	41	18
x=1	1	44	10	45	2
x=2	62	6	43	15	61
x=3	28	55	25	21	56
x=4	27	20	39	8	14

Side-Channel Attack Mitigation

When deploying SHA-3 inside a secure element (SE) or TEE (Trusted Execution Environment), the primary physical threats are DPA (Differential Power Analysis) and EM-emission attacks that exploit correlations between intermediate values and measurable physical leakage. Three well-established countermeasures are:

✓ Threshold Implementation (TI): Split the inputs to the χ AND gates into three or more shares so that no individual share, and no single sub-function, depends on a complete intermediate value. First-order DPA resistant by construction, even in the presence of glitches, but area roughly triples.

✓ Domain-Oriented Masking (DOM): A more area-efficient masking scheme that achieves first-order DPA protection with only two shares, at approximately 2–3× area overhead plus a supply of fresh random bits for every masked AND. Preferable to TI when die area is constrained.

✓ Shuffling: Randomize the processing order of the 25 lanes each invocation to disrupt temporal alignment of power traces. Low area overhead, but provides weaker protection — combine with masking for security-critical applications.
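
To make the masking idea concrete, here is a functional sketch of a first-order, two-share DOM AND gate applied to one row of χ. It only demonstrates that the shared computation recombines to the correct result; a software model says nothing about leakage, which in hardware also depends on registering the cross-domain terms. All names are illustrative.

```python
import secrets

MASK = (1 << 64) - 1

def share(value):
    """Split a 64-bit value into two random shares whose XOR is the value."""
    r = secrets.randbits(64)
    return (value ^ r, r)

def unshare(s):
    return s[0] ^ s[1]

def dom_and(a, b):
    """Two-share DOM AND: returns shares of (a0^a1) & (b0^b1)."""
    a0, a1 = a
    b0, b1 = b
    z = secrets.randbits(64)         # fresh randomness for every AND
    q0 = (a0 & b0) ^ ((a0 & b1) ^ z)  # cross-domain term is remasked with z
    q1 = (a1 & b1) ^ ((a1 & b0) ^ z)
    return (q0, q1)

def masked_chi_row(row):
    """Apply chi to five shared lanes (one row, x = 0..4)."""
    out = []
    for x in range(5):
        n0, n1 = row[(x + 1) % 5]
        not_next = (~n0 & MASK, n1)   # NOT is applied to one share only
        q = dom_and(not_next, row[(x + 2) % 5])
        out.append((row[x][0] ^ q[0], row[x][1] ^ q[1]))
    return out

def plain_chi_row(v):
    return [v[x] ^ (~v[(x + 1) % 5] & v[(x + 2) % 5] & MASK) for x in range(5)]

values = [secrets.randbits(64) for _ in range(5)]
masked = masked_chi_row([share(v) for v in values])
print([unshare(s) for s in masked] == plain_chi_row(values))
```

The linear steps (θ, ρ, π, ι) can be applied to each share independently, so only χ needs this treatment.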

Real-World Applications: Where SHA-3 and SHAKE Fit

Application Domain	Variant Used	Purpose
Post-Quantum Cryptography	SHAKE128, SHAKE256	Matrix expansion, sampling, and pseudorandom generation in Kyber (ML-KEM) and Dilithium (ML-DSA), the NIST PQC standards
Blockchain	Keccak-256	Ethereum address derivation and smart contract state hashing (uses the original Keccak padding, so digests differ from SHA3-256)
Key Derivation	SHAKE128	Variable-length symmetric key derivation (KDF)
Random Number Generation	SHAKE256	TRNG post-processing and CSPRNG construction
Secure Boot	SHA3-256	Firmware integrity measurement in measured boot chains

Design Takeaways for SoC Engineers

Key Takeaway

SHA-3 is generally slower than SHA-2 in software, but its regular permutation structure maps exceptionally well to hardware. ρ and π are free (wires only), ι is trivial (ROM lookup or LFSR), and θ and χ dominate area but are highly parallelizable. SHAKE's variable-length output makes it an indispensable primitive for the post-quantum era; treat it as a mandatory accelerator in any next-generation security subsystem, not an optional add-on. The two knobs that determine overall PPA are how you implement the χ nonlinearity (plain vs. masked) and how many rounds you unroll per clock cycle.

SoC Designer Checklist

✓ Define target throughput in Gbps → determine unroll depth accordingly

✓ Choose state storage for the 1,600-bit state: flip-flops (faster, larger) or SRAM (smaller, slower) — explicit area/speed trade-off

✓ ρ and π steps must be wiring only — never register intermediate values between these two steps

✓ RC[i] round constants → implement as 24-entry ROM or on-the-fly LFSR generation

✓ χ step → evaluate DOM masking if the design resides in a secure enclave or SE

✓ Bus interface (AXI/AHB) → integrate DMA to minimize CPU-side overhead for bulk hashing

✓ Validate against NIST CAVP test vectors (FIPS 202 official KAT suite) before tape-out sign-off
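
For the validation item, a known-answer test harness can be as small as the sketch below. It assumes vectors in a `Len = / Msg = / MD =` text layout, which should be checked against the files actually downloaded; the sample text here is hand-written for illustration and contains only the empty message and "abc". `hashlib` stands in for the device under test; in a real flow `hash_fn` would wrap the RTL simulation or the driver call.

```python
import hashlib

# Hand-written sample in an assumed Len/Msg/MD layout (Len is in bits).
SAMPLE_RSP = """\
Len = 0
Msg = 00
MD = a7ffc6f8bf1ed76651c14756a061d662f580ff4de43b49fa82d80a4b80f8434a

Len = 24
Msg = 616263
MD = 3a985da74fe225b2045c172d6bd390bd855f086e3e9d525b46bfe24511431532
"""

def parse_rsp(text):
    """Return a list of (message_bytes, expected_digest_bytes) pairs."""
    vectors, cur = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or line.startswith("["):
            continue  # skip blanks, comments, and section headers
        key, _, value = line.partition("=")
        key = key.strip()
        cur[key] = value.strip()
        if key == "MD":  # MD closes one test case
            n_bytes = int(cur["Len"]) // 8
            msg = bytes.fromhex(cur["Msg"])[:n_bytes]  # Len = 0 means empty
            vectors.append((msg, bytes.fromhex(cur["MD"])))
            cur = {}
    return vectors

def run_kats(hash_fn, vectors):
    """Return the messages whose digest does not match the expected value."""
    return [msg for msg, md in vectors if hash_fn(msg) != md]

def golden_sha3_256(msg):
    return hashlib.sha3_256(msg).digest()

vectors = parse_rsp(SAMPLE_RSP)
failures = run_kats(golden_sha3_256, vectors)
print(f"{len(vectors)} vectors, {len(failures)} failures")
```

Running the same harness against both a software golden model and the accelerator makes it easy to tell a vector-file parsing problem from a hardware bug.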

References

▶ NIST FIPS 202 — SHA-3 Standard: Permutation-Based Hash and Extendable-Output Functions

▶ Keccak Team official site (keccak.team) — reference implementations and latest research

▶ Cryptology ePrint Archive — SHA-3 hardware implementation papers and side-channel defense techniques

This material is intended for educational and SoC design reference purposes. Always consult the NIST FIPS 202 specification and current security advisories when implementing in production silicon.

SoC Design

Semiconductor & SoC Design Notes

Content is curated from a semiconductor and SoC design perspective, reviewed before publishing.

This article is based on publicly available data and sources. Last updated: June 8, 2026
