SHA-3 and SHAKE Demystified: SoC Hardware Implementation Strategies
From Keccak's sponge construction to SoC accelerator design — a complete technical reference for hardware engineers
SHA-3 is not simply a successor to SHA-2. It represents a fundamental paradigm shift: abandoning the Merkle–Damgård construction entirely in favor of a novel sponge construction. This report covers everything a chip designer needs to know — from algorithmic theory to concrete SoC accelerator microarchitecture decisions.
Why SHA-3 Exists: The Case for a Structural Backup
SHA-3 was standardized by NIST (National Institute of Standards and Technology) in 2015 as FIPS 202, following a public competition launched in 2007. SHA-2 remains cryptographically sound, but the fact that SHA-2 shares its Merkle–Damgård structural lineage with SHA-1 motivated NIST to develop a structurally independent backup standard: if SHA-2 ever falls to a structural attack, SHA-3 provides a heterogeneous fallback. This matters because security policy increasingly demands algorithm agility — the ability to switch primitives without redesigning the entire system.
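The Keccak algorithm was designed by Guido Bertoni, Joan Daemen (co-designer of AES), Michaël Peeters, and Gilles Van Assche. NIST received 64 submissions and accepted 51 of them as first-round candidates; Keccak was the one left standing. Its most decisive advantage over Merkle–Damgård constructions is structural immunity to length-extension attacks, a class of vulnerability in which an attacker who knows H(m) and the length of m can compute H(m || padding || extension) without knowing m. A sponge output exposes only the rate portion of the state, so the hidden capacity prevents the attacker from resuming the computation.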
Key Terminology
| Term | Definition | SHA-3 Standard Value |
|---|---|---|
| State | Internal working data arranged as a 5×5 matrix of 64-bit lanes | 1,600 bits |
| Rate (r) | Bits absorbed or squeezed per permutation call; governs throughput | 1,088 bits (SHA3-256, SHAKE256) / 1,344 bits (SHAKE128) |
| Capacity (c) | Hidden state region that determines security strength; never directly exposed; c = 1,600 - r | 512 bits (SHA3-256, SHAKE256) / 256 bits (SHAKE128) |
| Lane | One 64-bit cell in the 5×5 state matrix; the natural word size for the permutation | 64 bits |
| Round | One iteration of the five-step Keccak-f permutation sequence | 24 rounds |
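The rate and capacity values in the table follow from two rules in FIPS 202: the state is 1,600 bits wide, and the capacity is twice the targeted security level (twice the digest length for the SHA3-n functions). A minimal Python sketch of that arithmetic, where the dictionary and function names are illustrative and not taken from any library:

```python
STATE_BITS = 1600  # width b of Keccak-f[1600]

# Capacity c in bits per FIPS 202: twice the digest length for SHA3-n,
# twice the nominal security level for SHAKE128 / SHAKE256.
CAPACITY_BITS = {
    "SHA3-224": 448,
    "SHA3-256": 512,
    "SHA3-384": 768,
    "SHA3-512": 1024,
    "SHAKE128": 256,
    "SHAKE256": 512,
}


def rate_bits(variant: str) -> int:
    """Rate r = b - c: bits absorbed or squeezed per permutation call."""
    return STATE_BITS - CAPACITY_BITS[variant]


for name, c in CAPACITY_BITS.items():
    r = rate_bits(name)
    print(f"{name}: c = {c:4d} bits, r = {r:4d} bits = {r // 8} bytes = {r // 64} lanes")
```

The lane count in the last column is what an accelerator's input buffer has to hold per block, for example 17 lanes for SHA3-256 and 21 lanes for SHAKE128.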
Keccak-f[1600]: The Five-Step Permutation
The core of SHA-3 is the Keccak-f[1600] permutation. One call runs 24 rounds, and the sponge invokes it once per rate-sized block absorbed or squeezed. Each round executes five steps in sequence: θ → ρ → π → χ → ι. Understanding the hardware cost of each step is essential for area/performance budgeting.
[Figure: Relative Gate Cost per Step]
What Each Step Does — and Why It Matters
θ (Theta): Column Diffusion. Computes the XOR parity of each column and mixes it into the two neighboring columns. This drives the avalanche effect: a single-bit input change flips exactly 11 output bits (the bit itself plus five bits in each of two adjacent columns). θ is the most XOR-gate-intensive step, but it is also what makes every bit position depend on the entire state after just a few rounds, which is why 24 rounds provide a comfortable security margin against differential cryptanalysis.
ρ (Rho): Intra-Lane Bit Rotation. Cyclically rotates each 64-bit lane by a fixed offset derived from the triangular numbers (t+1)(t+2)/2, reduced mod 64. In hardware this is pure rewiring: no gates, no clock cycles. This is a key advantage of the Keccak architecture: a rotation that costs an instruction per lane in software (or a barrel shifter in a generic datapath) costs nothing in RTL.
π (Pi): Lane Permutation. Moves the lane at position (x, y) to position (y, (2x + 3y) mod 5) in the 5×5 matrix. Like ρ, this is realized entirely through wiring, with no logic gates required. Together, ρ and π ensure that diffusion spreads across both the intra-lane and inter-lane dimensions.
χ (Chi): Nonlinear Mixing. The sole nonlinear step, and the cryptographic heart of Keccak. It acts as a 5-bit S-box along each row, built from AND, NOT, and XOR: A[x][y] = A[x][y] XOR ((NOT A[x+1][y]) AND A[x+2][y]), with x indices taken mod 5. Every other step is linear over GF(2), so preimage and collision resistance ultimately depend on χ. It is also the step targeted by side-channel countermeasures such as DOM masking.
ι (Iota): Round Constant Injection. XORs a unique 64-bit constant RC[i] into lane (0,0) as the last step of each round. Without ι, all 24 rounds would be identical, enabling a powerful slide attack. The constants are fixed and defined by an LFSR in FIPS 202; in RTL, store them as a 24-entry ROM or compute them on the fly from a small LFSR to save area.
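As a software model of the five steps, the sketch below applies one round to a 5×5 array of Python integers indexed as A[x][y]. It is a behavioral illustration rather than RTL: the function names are this example's own, ρ and π are merged into one walk over the lanes (as they would be merged into one wiring stage), and the round constant is supplied by the caller.

```python
MASK64 = (1 << 64) - 1


def rol64(lane: int, n: int) -> int:
    """Rotate a 64-bit lane left by n bits (pure wiring in hardware)."""
    n %= 64
    return ((lane << n) | (lane >> (64 - n))) & MASK64


def theta(A):
    # Column parities C[x], then D[x] = C[x-1] ^ rot(C[x+1], 1) is XORed into column x.
    C = [A[x][0] ^ A[x][1] ^ A[x][2] ^ A[x][3] ^ A[x][4] for x in range(5)]
    D = [C[(x - 1) % 5] ^ rol64(C[(x + 1) % 5], 1) for x in range(5)]
    return [[A[x][y] ^ D[x] for y in range(5)] for x in range(5)]


def rho_pi(A):
    # Walk the 24 non-origin lanes: rotate by a triangular number (rho),
    # then store the result at the permuted position (pi). Lane (0,0) is untouched.
    B = [[0] * 5 for _ in range(5)]
    B[0][0] = A[0][0]
    x, y = 1, 0
    for t in range(24):
        nx, ny = y, (2 * x + 3 * y) % 5
        B[nx][ny] = rol64(A[x][y], (t + 1) * (t + 2) // 2)
        x, y = nx, ny
    return B


def chi(A):
    # The only nonlinear step: A[x] ^= (NOT A[x+1]) AND A[x+2] along each row.
    return [[A[x][y] ^ (~A[(x + 1) % 5][y] & A[(x + 2) % 5][y]) for y in range(5)]
            for x in range(5)]


def iota(A, rc: int):
    # XOR the round constant into lane (0,0).
    A = [col[:] for col in A]
    A[0][0] ^= rc
    return A


def keccak_round(A, rc: int):
    """One round: theta, then rho and pi, then chi, then iota."""
    return iota(chi(rho_pi(theta(A))), rc)


# Diffusion check for theta: flip one bit of an all-zero state and count
# how many bits are set afterwards.
probe = [[0] * 5 for _ in range(5)]
probe[2][3] = 1 << 17
changed = sum(bin(lane).count("1") for col in theta(probe) for lane in col)
print(changed)  # 11
```

Because θ is linear, running it on a state with a single set bit shows directly which output bits a one-bit input difference reaches.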
SHAKE: Extendable-Output Functions and the Sponge Model
SHAKE128 and SHAKE256 are XOFs (Extendable-Output Functions) — they produce a bitstream of arbitrary length rather than a fixed-length digest. Where SHA-3 outputs 224, 256, 384, or 512 bits, SHAKE can produce any number of bits on demand. This property is indispensable for applications that need variable-length keys or masks, such as post-quantum cryptography.
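The difference between a fixed-length digest and an XOF is easy to see with Python's standard `hashlib`, whose SHAKE objects take the output length as an argument to `digest()`. The message below is an arbitrary placeholder:

```python
import hashlib

msg = b"seed material"  # arbitrary example input

short = hashlib.shake_128(msg).digest(16)    # 128 bits of output
longer = hashlib.shake_128(msg).digest(200)  # more than one 168-byte rate block

# The XOF output is a single stream: asking for more bytes only extends it.
print(longer[:16] == short)

# The fixed-length SHA-3 functions always return the same digest size.
print(len(hashlib.sha3_256(msg).digest()))  # 32
```

For a hardware block this means the squeeze side needs a "give me another block" handshake rather than a fixed digest register.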
Analogy: The Sponge and the Juice
Pour juice (input data) into a dry sponge. The deeper the internal cavity (Capacity), the more thoroughly the juice soaks in — and the stronger the security. To extract output, squeeze the sponge. Need more output? Shake it (re-invoke the Keccak permutation) and squeeze again. Repeat as many times as needed. The capacity region is never squeezed out directly, so it acts as a one-way membrane between input and output.
Sponge Construction: Absorbing and Squeezing Phases
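The two phases can be sketched in software as follows. This is a condensed behavioral model, not production code: the function names are this example's own, the permutation is written compactly with round constants generated inline, only byte-aligned messages are handled, and the `suffix` parameter is assumed to be one of the two FIPS 202 domain-separation bytes (0x06 for SHA-3, 0x1F for SHAKE). The closing lines compare the result against `hashlib`.

```python
import hashlib

MASK64 = (1 << 64) - 1


def rol64(v: int, n: int) -> int:
    n %= 64
    return ((v << n) | (v >> (64 - n))) & MASK64


def keccak_f1600(state: bytearray) -> None:
    """Apply the 24-round permutation in place to a 200-byte state."""
    # Lane (x, y) occupies bytes 8*(x + 5*y) .. 8*(x + 5*y) + 7, little-endian.
    A = [[int.from_bytes(state[8 * (x + 5 * y): 8 * (x + 5 * y) + 8], "little")
          for y in range(5)] for x in range(5)]
    lfsr = 1
    for _ in range(24):
        # theta
        C = [A[x][0] ^ A[x][1] ^ A[x][2] ^ A[x][3] ^ A[x][4] for x in range(5)]
        D = [C[(x - 1) % 5] ^ rol64(C[(x + 1) % 5], 1) for x in range(5)]
        A = [[A[x][y] ^ D[x] for y in range(5)] for x in range(5)]
        # rho and pi
        x, y, cur = 1, 0, A[1][0]
        for t in range(24):
            x, y = y, (2 * x + 3 * y) % 5
            cur, A[x][y] = A[x][y], rol64(cur, (t + 1) * (t + 2) // 2)
        # chi
        A = [[A[x][y] ^ (~A[(x + 1) % 5][y] & A[(x + 2) % 5][y]) for y in range(5)]
             for x in range(5)]
        # iota: seven LFSR steps yield the bits at positions 2**j - 1
        for j in range(7):
            lfsr = ((lfsr << 1) ^ ((lfsr >> 7) * 0x71)) & 0xFF
            if lfsr & 2:
                A[0][0] ^= 1 << ((1 << j) - 1)
    for x in range(5):
        for y in range(5):
            state[8 * (x + 5 * y): 8 * (x + 5 * y) + 8] = A[x][y].to_bytes(8, "little")


def sponge(rate_bytes: int, suffix: int, msg: bytes, out_len: int) -> bytes:
    state = bytearray(200)  # first rate_bytes are the rate, the rest is capacity
    # Absorbing: XOR each full block into the rate region, then permute.
    offset = 0
    while len(msg) - offset >= rate_bytes:
        for i in range(rate_bytes):
            state[i] ^= msg[offset + i]
        keccak_f1600(state)
        offset += rate_bytes
    # Last block: leftover bytes, domain-separation suffix, final padding bit.
    tail = msg[offset:]
    for i, byte in enumerate(tail):
        state[i] ^= byte
    state[len(tail)] ^= suffix
    state[rate_bytes - 1] ^= 0x80
    keccak_f1600(state)
    # Squeezing: read the rate region, permute again if more output is needed.
    out = bytearray()
    while True:
        out += state[:rate_bytes]
        if len(out) >= out_len:
            return bytes(out[:out_len])
        keccak_f1600(state)


print(sponge(136, 0x06, b"abc", 32) == hashlib.sha3_256(b"abc").digest())
print(sponge(168, 0x1F, b"abc", 500) == hashlib.shake_128(b"abc").digest(500))
```

Note that the capacity bytes (`state[rate_bytes:]`) are never XORed with input and never copied to the output; they change only through the permutation.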
SoC Hardware Implementation: Throughput vs. Area Trade-offs
Architecture Selection — Iterative, Partial Unrolling, or Full Unrolling
The critical design decision is how many Keccak-f rounds to unroll in combinational logic per clock cycle. Each unrolled round adds roughly one round of combinational area and power. It also lengthens the critical path, so the achievable clock frequency drops and throughput rises less than proportionally. There is no universally correct answer: choose based on your system's target throughput (Gbps) and die area budget.
| Strategy | Throughput | Area | Power | Typical Use Case |
|---|---|---|---|---|
| Iterative (1 round/cycle) | Low | Very small | Low | Embedded SoC, IoT devices |
| Partial Unrolling (N rounds/cycle) | Medium | Medium | Medium | Mobile AP, security accelerators |
| Full Unrolling (24 rounds/cycle) | Very high | Very large | High | High-performance servers, HSMs |
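A first-order estimate of peak absorb throughput for each strategy is rate × clock frequency ÷ cycles per permutation, where the cycle count is 24 divided by the unroll factor. The sketch below uses a hypothetical 500 MHz clock and holds it constant across unroll factors, which is optimistic because deeper unrolling lowers the achievable frequency. It also ignores bus, padding, and control overhead.

```python
def throughput_gbps(rate_bits: int, clock_mhz: float, rounds_per_cycle: int) -> float:
    """Peak throughput assuming one rate-sized block per 24/N clock cycles."""
    if rounds_per_cycle < 1 or 24 % rounds_per_cycle:
        raise ValueError("rounds_per_cycle should be a positive divisor of 24")
    cycles_per_block = 24 // rounds_per_cycle
    return rate_bits * clock_mhz * 1e6 / cycles_per_block / 1e9


# SHA3-256 (rate 1,088 bits) at an assumed 500 MHz
for n in (1, 2, 4, 24):
    print(f"{n:2d} rounds/cycle: {throughput_gbps(1088, 500, n):6.1f} Gbps")
```

Replacing the fixed clock with the post-synthesis frequency for each unroll depth turns this into a realistic comparison.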
Round Constants RC[i] — 64-bit, Keccak-f[1600]
These 24 constants are XORed into lane (0,0) during the ι step of each round. Only bit positions 0, 1, 3, 7, 15, 31, and 63 of a constant can be nonzero, so a ROM needs just 24 × 7 bits. In RTL, implement them as that ROM or generate them on the fly from the 8-bit LFSR specified in FIPS 202; the LFSR option trades ROM bits for a few flip-flops and XOR gates and may add a cycle of latency before the first round.
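The constants can be regenerated from the FIPS 202 LFSR rather than copied by hand. The sketch below is a software model of that generator (the function name is illustrative): each round consumes seven LFSR steps, and step j decides whether bit 2^j - 1 of the constant is set.

```python
def round_constants():
    """Return the 24 round constants RC[0..23] of Keccak-f[1600]."""
    constants = []
    lfsr = 1
    for _ in range(24):
        rc = 0
        for j in range(7):
            # 8-bit LFSR step with feedback polynomial x^8 + x^6 + x^5 + x^4 + 1
            lfsr = ((lfsr << 1) ^ ((lfsr >> 7) * 0x71)) & 0xFF
            if lfsr & 2:
                rc |= 1 << ((1 << j) - 1)  # bit positions 0, 1, 3, 7, 15, 31, 63
        constants.append(rc)
    return constants


for i, rc in enumerate(round_constants()):
    print(f"RC[{i:2d}] = 0x{rc:016X}")
```

The printed list can be pasted into a ROM initialization file or used to check an on-the-fly generator in simulation.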
ρ Rotation Offsets — Heatmap (mod 64)
The cyclic shift amount for each lane is fixed at synthesis time, so implement it as routing only, never as a register or barrel shifter. Lane (0,0) has offset 0, meaning it passes through ρ unchanged.
| x \ y | y=0 | y=1 | y=2 | y=3 | y=4 |
|---|---|---|---|---|---|
| x=0 | 0 | 36 | 3 | 41 | 18 |
| x=1 | 1 | 44 | 10 | 45 | 2 |
| x=2 | 62 | 6 | 43 | 15 | 61 |
| x=3 | 28 | 55 | 25 | 21 | 56 |
| x=4 | 27 | 20 | 39 | 8 | 14 |
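The offsets can be re-derived from the recurrence in FIPS 202, which is a useful cross-check for any hand-copied table before it is frozen into wiring. A short sketch, with an illustrative function name:

```python
def rho_offsets():
    """Return r[x][y], the left-rotation amount of lane (x, y), reduced mod 64."""
    r = [[0] * 5 for _ in range(5)]  # lane (0,0) keeps offset 0
    x, y = 1, 0
    for t in range(24):
        r[x][y] = ((t + 1) * (t + 2) // 2) % 64  # triangular numbers mod 64
        x, y = y, (2 * x + 3 * y) % 5            # same walk as the pi step
    return r


offsets = rho_offsets()
print("      y=0 y=1 y=2 y=3 y=4")
for x in range(5):
    print(f"x={x}: " + " ".join(f"{offsets[x][y]:3d}" for y in range(5)))
```

In an RTL flow the same function could emit the 25 rotation parameters directly into a generated package or include file.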
Side-Channel Attack Mitigation
When deploying SHA-3 inside a secure element (SE) or TEE (Trusted Execution Environment), the primary physical threats are DPA (Differential Power Analysis) and EM-emission attacks that exploit correlations between intermediate values and measurable physical leakage. Three well-established countermeasures are:
✓ Threshold Implementation (TI): Split the χ AND gates into three or more shares so no individual share reveals an intermediate value. First-order DPA resistant by construction, but area roughly triples.
✓ Domain-Oriented Masking (DOM): A more area-efficient masking scheme achieving first-order DPA protection at approximately 2–3× area overhead, at the price of fresh random bits for every masked AND. Preferable to TI when die area is constrained.
✓ Shuffling: Randomize the processing order of the 25 lanes each invocation to disrupt temporal alignment of power traces. Low area overhead, but provides weaker protection — combine with masking for security-critical applications.
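To make the masking idea concrete, here is a functional model of a two-share DOM AND gate applied to the χ expression a XOR ((NOT b) AND c). It follows the "independent shares" form described in the DOM literature, but it only demonstrates that the shares recombine to the right value. It says nothing about leakage: in hardware the blinded cross terms must pass through a register before recombination, which this model omits. All names are illustrative.

```python
import secrets

MASK64 = (1 << 64) - 1


def share(value: int):
    """Split a 64-bit lane into two Boolean shares with value == s0 ^ s1."""
    s0 = secrets.randbits(64)
    return s0, value ^ s0


def dom_and(a, b):
    """Two-share AND: inputs and outputs are (share0, share1) pairs."""
    a0, a1 = a
    b0, b1 = b
    z = secrets.randbits(64)  # fresh randomness for every invocation
    # Cross-domain products are blinded with z before being merged.
    # (RTL would place a register stage on the blinded terms here.)
    q0 = (a0 & b0) ^ ((a0 & b1) ^ z)
    q1 = (a1 & b1) ^ ((a1 & b0) ^ z)
    return q0, q1


def masked_chi_lane(a, b, c):
    """Masked a ^ (~b & c) on shared lanes."""
    not_b = (b[0] ^ MASK64, b[1])  # NOT is applied to one share only
    t0, t1 = dom_and(not_b, c)
    return a[0] ^ t0, a[1] ^ t1


a, b, c = (secrets.randbits(64) for _ in range(3))
q0, q1 = masked_chi_lane(share(a), share(b), share(c))
print((q0 ^ q1) == (a ^ ((b ^ MASK64) & c)))
```

The linear steps (θ, ρ, π, ι) can be applied to each share independently, which is why only χ needs this treatment.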
Real-World Applications: Where SHA-3 and SHAKE Fit
| Application Domain | Variant Used | Purpose |
|---|---|---|
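| Post-Quantum Cryptography | SHAKE128 / SHAKE256 | Matrix expansion and pseudorandom sampling in Kyber and Dilithium (standardized by NIST as ML-KEM, FIPS 203, and ML-DSA, FIPS 204) |
| Blockchain | Keccak-256 | Ethereum address derivation and smart contract state hashing (uses the original Keccak padding, so digests differ from SHA3-256) |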
| Key Derivation | SHAKE128 | Variable-length symmetric key derivation (KDF) |
| Random Number Generation | SHAKE256-DRBG | TRNG post-processing and CSPRNG construction |
| Secure Boot | SHA3-256 | Firmware integrity measurement in measured boot chains |
Design Takeaways for SoC Engineers
Key Takeaway
SHA-3 is generally slower than SHA-2 in software, but its regular permutation structure maps exceptionally well to hardware. ρ and π are free (wires only), ι is trivial (ROM lookup or LFSR), and θ/χ dominate area but are highly parallelizable. SHAKE's variable-length output makes it an indispensable primitive for the post-quantum era: treat it as a mandatory accelerator in any next-generation security subsystem, not an optional add-on. The two knobs that determine overall PPA are how you implement the χ nonlinearity (plain vs. masked) and how many rounds you unroll per clock cycle.
SoC Designer Checklist
✓ Define target throughput in Gbps → determine unroll depth accordingly
✓ Choose state storage for the 1,600-bit state: flip-flops (faster, larger) or SRAM (smaller, slower) — explicit area/speed trade-off
✓ ρ and π steps must be wiring only — never register intermediate values between these two steps
✓ RC[i] round constants → implement as 24-entry ROM or on-the-fly LFSR generation
✓ χ step → evaluate DOM masking if the design resides in a secure enclave or SE
✓ Bus interface (AXI/AHB) → integrate DMA to minimize CPU-side overhead for bulk hashing
✓ Validate against NIST CAVP test vectors (FIPS 202 official KAT suite) before tape-out sign-off
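For the last checklist item, a known-answer test harness can be quite small. The sketch below assumes the `Len = / Msg = / MD =` response-file layout used by the CAVP byte-oriented SHA-3 tests, including the convention that a zero-length message is written as `Msg = 00`. The sample text is hand-written in that layout for illustration and is not an excerpt from an official file. `hashlib` stands in for the design under test; in practice `hash_under_test` would wrap an RTL simulation or a driver call.

```python
import hashlib

# Hand-written sample in the assumed response-file layout.
SAMPLE_RSP = """\
[L = 256]

Len = 0
Msg = 00
MD = a7ffc6f8bf1ed76651c14756a061d662f580ff4de43b49fa82d80a4b80f8434a

Len = 24
Msg = 616263
MD = 3a985da74fe225b2045c172d6bd390bd855f086e3e9d525b46bfe24511431532
"""


def parse_rsp(text: str):
    """Yield (message, expected_digest) pairs from response-file text."""
    vectors, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if " = " not in line or line.startswith(("#", "[")):
            continue  # skip blanks, comments, and section headers
        key, value = line.split(" = ", 1)
        current[key] = value
        if key == "MD":
            bit_len = int(current["Len"])  # byte-aligned lengths only
            msg = bytes.fromhex(current["Msg"])[: bit_len // 8]
            vectors.append((msg, bytes.fromhex(value)))
            current = {}
    return vectors


def run_kats(text: str, hash_under_test):
    """Return the messages whose digest does not match the expected value."""
    return [msg for msg, md in parse_rsp(text) if hash_under_test(msg) != md]


failures = run_kats(SAMPLE_RSP, lambda m: hashlib.sha3_256(m).digest())
print("failures:", failures)
```

An empty failure list on the full official vector set, for every supported variant, is the sign-off criterion; the two vectors here only exercise the harness.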
References
▶ NIST FIPS 202 — SHA-3 Standard: Permutation-Based Hash and Extendable-Output Functions
▶ Keccak Team official site (keccak.team) — reference implementations and latest research
▶ Cryptology ePrint Archive — SHA-3 hardware implementation papers and side-channel defense techniques
This material is intended for educational and SoC design reference purposes. Always consult the NIST FIPS 202 specification and current security advisories when implementing in production silicon.
Content is curated from a semiconductor and SoC design perspective, reviewed before publishing.
This article is based on publicly available data and sources. Last updated: June 8, 2026