🔐 SHA-3 / SHAKE Technical Brief for SoC Security IP Design

📌 For hardware RTL designers — from algorithm internals to Verilog implementation and PQC integration

Bottom line: A well-designed SHA-3 / SHAKE core covers ① integrity hashing, ② MAC (KMAC), ③ DRBG and stream generation, and ④ building blocks for post-quantum algorithms such as ML-KEM and ML-DSA — all from a single IP. This is why it is becoming a de facto mandatory IP as PQC mandates accelerate.

🧭 1. Standards Landscape — Why SHA-3 Now

SHA-3 is NIST's next-generation hash standard, formalized in 2015 as FIPS 202, built on the Keccak algorithm designed by Bertoni, Daemen, Peeters, and Van Assche. Unlike SHA-2's Merkle–Damgård construction, SHA-3 adopts a sponge construction that is structurally immune to length-extension attacks and supports variable output lengths from the same core.

SHAKE128 / SHAKE256 share the same Keccak-p[1600, 24] permutation and serve as XOFs (extendable output functions). The result: a single hardware core = fixed-length hash + arbitrary-length PRF + PQC building block.

▶ Parameter Comparison at a Glance

Function	Rate (r)	Capacity (c)	Output (bits)	Collision Security (bits)
SHA3-224	1152	448	224	112
SHA3-256	1088	512	256	128
SHA3-384	832	768	384	192
SHA3-512	576	1024	512	256
SHAKE128	1344	256	Variable (d)	min(d/2, 128)
SHAKE256	1088	512	Variable (d)	min(d/2, 256)
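The table rows follow from two rules in FIPS 202: the capacity is twice the nominal strength, and the rate is whatever remains of the 1600-bit state. The sketch below re-derives them; the function names are illustrative, not taken from any library.

```python
STATE_BITS = 1600  # b = r + c for Keccak-p[1600, 24]


def sha3_params(digest_bits: int) -> dict:
    """Parameters of SHA3-<digest_bits>: capacity is twice the digest length."""
    capacity = 2 * digest_bits
    return {
        "rate": STATE_BITS - capacity,
        "capacity": capacity,
        "collision_bits": digest_bits // 2,
    }


def shake_params(strength_bits: int, output_bits: int) -> dict:
    """Parameters of SHAKE<strength_bits> when squeezing output_bits of output."""
    capacity = 2 * strength_bits
    return {
        "rate": STATE_BITS - capacity,
        "capacity": capacity,
        # Collision resistance is capped by both the output length and c/2.
        "collision_bits": min(output_bits // 2, strength_bits),
    }


if __name__ == "__main__":
    for d in (224, 256, 384, 512):
        print(f"SHA3-{d}", sha3_params(d))
    for n in (128, 256):
        print(f"SHAKE{n}", shake_params(n, output_bits=512))
```

In RTL the same arithmetic usually collapses to a small lookup keyed by the selected function, since only four distinct rates are needed across all six functions.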

▶ Rate (throughput) vs. Capacity (security) Trade-off

[Bar chart: rate r in bits per function. SHAKE128 r=1344 · SHA3-224 r=1152 · SHA3-256 r=1088 · SHA3-384 r=832 · SHA3-512 r=576]

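📊 A longer bar means more bits absorbed per permutation call (24 rounds), hence higher throughput. However, a larger rate means a smaller capacity (c = 1600 − r), which lowers the security bound: the sponge offers at most c/2 bits, and the fixed-length SHA-3 digests are further capped at d/2 bits of collision resistance.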

⚙️ 2. Core Theory — The 1600-Bit Sponge State

The total state size is b = 1600 bits, partitioned as b = r + c. The rate r is the portion XOR-ed with external input each block; the capacity c is the security margin, never directly exposed to the outside.

The 1600 bits are interpreted as a 5 × 5 × 64 (x, y, z) three-dimensional array. A 64-bit word along the z-axis is called a Lane — this is the natural processing unit in Verilog.

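⚠️ Endianness Trap: The FIPS 202 byte convention is little-endian at both the bit and the byte level. Bit 0 (LSB) of the first byte of a lane maps to z=0, and byte k of a lane fills z=8k through 8k+7. A careless MSB-first implementation will fail every NIST CAVP test vector, which makes this the most common debugging pitfall.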
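The lane layout and bit ordering described above can be pinned down with a few lines of reference code. This is a minimal sketch, assuming the state is handled as 200 bytes with lane (x, y) starting at byte offset 8·(x + 5y); the helper names are made up for illustration.

```python
def bytes_to_lanes(state: bytes) -> list:
    """Return lanes[x][y] as 64-bit integers from a 200-byte Keccak state."""
    if len(state) != 200:
        raise ValueError("a Keccak-f[1600] state is exactly 200 bytes")
    return [
        [
            # Each lane is 8 consecutive bytes, read little-endian.
            int.from_bytes(state[8 * (x + 5 * y): 8 * (x + 5 * y) + 8], "little")
            for y in range(5)
        ]
        for x in range(5)
    ]


def bit_of(lanes: list, x: int, y: int, z: int) -> int:
    """Return state bit A[x][y][z]; z=0 is the least significant bit of the lane."""
    return (lanes[x][y] >> z) & 1


if __name__ == "__main__":
    demo = bytearray(200)
    demo[0] = 0x01   # LSB of the very first byte
    demo[7] = 0x80   # MSB of the last byte of lane (0, 0)
    lanes = bytes_to_lanes(bytes(demo))
    print(bit_of(lanes, 0, 0, 0), bit_of(lanes, 0, 0, 63))
```

A Verilog testbench can dump its lane array in the same order and be compared directly against this mapping before any permutation logic is exercised.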

🛠️ 3. Processing Steps — A Verilog-Level Walkthrough

Step 1. Padding and Domain Separation

Function	Appended Bits (LSB-first)	First Pad Byte
SHA3-*	`01` + `10*1`	0x06
SHAKE*	`1111` + `10*1`	0x1F
RawSHAKE	`11` + `10*1`	0x07

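The magic bytes (0x1F for SHAKE, 0x06 for SHA-3) are not arbitrary constants. They are the result of appending the domain separation bits, followed by the first 1 of pad10*1, in LSB-first order. In RTL, XOR this byte into the position immediately after the last message byte, then XOR the terminating bit of pad10*1 (0x80) into the last byte of the rate block. When the two positions coincide because the message leaves exactly one free byte, they merge into a single byte: 0x86 for SHA-3, 0x9F for SHAKE.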
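The padding rule can be sketched for byte-aligned messages as follows. The function name and signature are illustrative; `first_pad_byte` is the value from the table above (0x06, 0x1F, or 0x07).

```python
def pad_message(message: bytes, rate_bytes: int, first_pad_byte: int) -> bytes:
    """Append domain bits and pad10*1 so the result is a multiple of rate_bytes.

    Assumes a byte-aligned message; bit-granular messages need extra handling.
    """
    pad_len = rate_bytes - (len(message) % rate_bytes)  # always 1..rate_bytes
    padding = bytearray(pad_len)
    padding[0] ^= first_pad_byte   # domain bits plus the first '1' of pad10*1
    padding[-1] ^= 0x80            # final '1' of pad10*1, MSB of the last byte
    return message + bytes(padding)


if __name__ == "__main__":
    # SHA3-256 has a rate of 136 bytes.
    print(pad_message(b"", 136, 0x06)[:1].hex(), pad_message(b"", 136, 0x06)[-1:].hex())
    # One free byte left: both pad bits land in the same byte.
    print(pad_message(b"a" * 135, 136, 0x06)[-1:].hex())
```

Note that a message whose length is an exact multiple of the rate still receives a full extra block of padding, which the pad controller must account for when it signals the last block.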

Steps 2–3. Absorb Phase and Keccak-f[1600] — 24 Rounds

The message is cut into r-bit blocks, each XOR-ed into the state, followed by 24 rounds of the Keccak-f permutation. Each round consists of exactly five steps; only χ is nonlinear — a critical advantage for masking cost, as discussed below.

Step	Role	Linearity	HW Cost
θ	Column parity XOR (diffusion)	Linear	XOR gates
ρ	In-lane rotation of each 64-bit lane	Linear	Wiring only (zero gates)
π	5×5 lane reposition	Linear	Wiring only (zero gates)
χ	The only nonlinear step (AND/NOT/XOR)	Nonlinear	AND/NOT/XOR gates; primary DPA target
ι	Round constant XOR into lane (0,0)	Linear	ROM/case
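The five step mappings are compact enough to model in software and use as a golden reference. The following is a sketch in the style of the well-known compact Keccak reference code, not a hardware description: it generates rotation offsets and round constants on the fly, where RTL would normally hard-wire them. A small SHA3-256 wrapper is included only so the permutation can be cross-checked against `hashlib`.

```python
import hashlib

MASK64 = (1 << 64) - 1


def rol64(value: int, shift: int) -> int:
    """Rotate a 64-bit lane left by shift bits."""
    shift %= 64
    return ((value << shift) | (value >> (64 - shift))) & MASK64


def keccak_f1600(lanes: list) -> list:
    """Apply 24 rounds to lanes[x][y] and return the new 5x5 lane array."""
    lanes = [column[:] for column in lanes]  # work on a copy
    lfsr = 1  # LFSR that produces the iota round-constant bits
    for _ in range(24):
        # theta: XOR each lane with the parity of two neighbouring columns
        c = [lanes[x][0] ^ lanes[x][1] ^ lanes[x][2] ^ lanes[x][3] ^ lanes[x][4]
             for x in range(5)]
        d = [c[(x + 4) % 5] ^ rol64(c[(x + 1) % 5], 1) for x in range(5)]
        lanes = [[lanes[x][y] ^ d[x] for y in range(5)] for x in range(5)]
        # rho + pi: rotate each lane and move it to its new (x, y) position
        x, y = 1, 0
        current = lanes[x][y]
        for t in range(24):
            x, y = y, (2 * x + 3 * y) % 5
            current, lanes[x][y] = lanes[x][y], rol64(current, (t + 1) * (t + 2) // 2)
        # chi: the only nonlinear step, applied along each row
        for y in range(5):
            row = [lanes[x][y] for x in range(5)]
            for x in range(5):
                lanes[x][y] = row[x] ^ ((~row[(x + 1) % 5] & MASK64) & row[(x + 2) % 5])
        # iota: XOR the round constant into lane (0, 0)
        for j in range(7):
            lfsr = ((lfsr << 1) ^ ((lfsr >> 7) * 0x71)) % 256
            if lfsr & 2:
                lanes[0][0] ^= 1 << ((1 << j) - 1)
    return lanes


def sha3_256_sketch(message: bytes) -> bytes:
    """SHA3-256 built on keccak_f1600, used here only as a self-check."""
    rate = 136  # bytes
    padded = bytearray(message) + bytearray(rate - len(message) % rate)
    padded[len(message)] ^= 0x06
    padded[-1] ^= 0x80
    lanes = [[0] * 5 for _ in range(5)]
    for offset in range(0, len(padded), rate):
        block = padded[offset:offset + rate]
        for i in range(rate // 8):  # lane index i = x + 5*y
            lanes[i % 5][i // 5] ^= int.from_bytes(block[8 * i:8 * i + 8], "little")
        lanes = keccak_f1600(lanes)
    return b"".join(lanes[i % 5][i // 5].to_bytes(8, "little") for i in range(4))


if __name__ == "__main__":
    print(sha3_256_sketch(b"abc") == hashlib.sha3_256(b"abc").digest())
```

Printing `lanes` after each step inside the round loop gives the per-step intermediate values that a round-level RTL testbench can be compared against.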

Step 4. Squeezing

Output d bits are produced by emitting S[0:r-1] and re-running Keccak-f whenever more output is needed. For every fixed-length SHA-3 variant d ≤ r, so a single squeeze pass suffices. SHAKE output can exceed r, so the core must loop over squeeze and permute, which makes a streaming (valid/ready) output interface effectively mandatory.
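Two properties of the squeeze phase are worth checking in a software model before committing to an output interface: how many extra permutations a given output length costs, and the fact that a longer XOF output always begins with the shorter one. The sketch below uses `hashlib`, which recomputes from scratch on each call; a hardware core would keep the state and continue instead. Function names are illustrative.

```python
import hashlib
import math


def extra_permutations(output_bits: int, rate_bits: int) -> int:
    """Keccak-f calls needed while squeezing, beyond the last absorb permutation."""
    return max(0, math.ceil(output_bits / rate_bits) - 1)


def shake128_chunks(message: bytes, chunk_bytes: int, chunks: int):
    """Yield successive SHAKE128 output chunks, mimicking a streaming interface."""
    total = hashlib.shake_128(message).digest(chunk_bytes * chunks)
    for i in range(chunks):
        yield total[i * chunk_bytes:(i + 1) * chunk_bytes]


if __name__ == "__main__":
    print(extra_permutations(256, 1088))    # SHA3-256: fits in one block
    print(extra_permutations(4096, 1344))   # SHAKE128 producing 512 bytes
    first = next(shake128_chunks(b"seed", 168, 3))
    print(first == hashlib.shake_128(b"seed").digest(168))
```

The prefix property is what lets a consumer stop reading at any point without telling the core the total length in advance.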

🏗️ 4. RTL Datapath Choices and Benchmark Ranges

Architecture	Latency per Block (one permutation)	Area	Use Case
Iterative (1 round/clk)	24 cycles	Small	General-purpose SoC (most common)
Unrolled / Pipelined	Down to 1 cycle/block	Very Large	Network line-rate, 40 Gbps+
Folded (64-bit datapath)	Hundreds of cycles	Very Small	IoT, smart card

📊 Industry survey (2024) — FPGA Iterative ≈ 1.5–2.5k LUT, 1–3 Gbps · Unrolled ≈ 10–20k LUT, 30–40 Gbps+ · ASIC Iterative ≈ 20–40k GE. These figures vary widely by process node, EDA tool, and synthesis settings — treat them as ballpark estimates only and re-synthesize with your own PDK before making design decisions.
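A quick way to sanity-check any quoted figure is to compute the ideal peak for a given rate, cycle count, and clock. The sketch below is an upper bound under stated assumptions: it ignores I/O, padding, and control overhead, so measured numbers can sit well below it. The clock frequency in the example is a made-up value, not a benchmark.

```python
def peak_throughput_gbps(rate_bits: int, cycles_per_block: int, clock_mhz: float) -> float:
    """Ideal absorb throughput in Gbit/s for long messages."""
    return rate_bits / cycles_per_block * clock_mhz * 1e6 / 1e9


def hash_cycles(message_bytes: int, rate_bits: int, cycles_per_block: int) -> int:
    """Permutation cycles for one byte-aligned message, padding block included."""
    rate_bytes = rate_bits // 8
    blocks = message_bytes // rate_bytes + 1  # padding always adds at least one bit
    return blocks * cycles_per_block


if __name__ == "__main__":
    # Hypothetical iterative SHA3-256 core at 300 MHz.
    print(round(peak_throughput_gbps(1088, 24, 300.0), 2))
    # A 32-byte message versus a message of exactly one rate block.
    print(hash_cycles(32, 1088, 24), hash_cycles(136, 1088, 24))
```

The second function shows why per-block latency and per-hash latency differ: a message that exactly fills the rate still costs two permutations.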

▶ Recommended Module Hierarchy

keccak_top
├── pad_unit     : domain bits + pad10*1, byte-enable
├── absorb_xor_mux : state[0:r-1] ^= block
├── keccak_round  : θ → ρ → π → χ → ι (combinational)
├── round_counter : 0..23
├── rc_rom      : 24 × 64-bit round constants
├── state_reg    : 5×5 lane array (reg [63:0] A[0:4][0:4])
└── squeeze_unit  : r-bit output streaming

▶ Common Pitfalls

▶ Single flat 1600-bit register: a flat vector makes θ/ρ/π indexing error-prone and can obscure the lane structure from both reviewers and synthesis tools. Declare the state as a 5×5 lane array and use index arithmetic to express θ/ρ/π cleanly.

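▶ I/O bottleneck: Keccak-f finishes in 24 cycles, but moving the next r-bit block over a narrow bus can take longer. A 1088-bit block needs 34 beats on a 32-bit bus and 17 on a 64-bit bus, before any protocol overhead. Use AXI4-Stream + DMA so that transfers can overlap with the permutation.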

▶ Constant-time guarantee — The algorithm itself is data-independent, but ensure that the valid/ready handshake and padding controller do not leak timing information based on input length.

▶ Test vectors: a mismatching final digest shows that something is wrong but not where, so bit-order bugs in θ/ρ are slow to localize. Use CAVP KAT + FIPS 202 Appendix B to verify intermediate round states.
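The I/O bottleneck in the list above can be quantified with simple beat counting. This sketch assumes one data beat per clock and no protocol overhead, which is optimistic for register-mapped access; the function names are illustrative.

```python
import math


def bus_beats(rate_bits: int, bus_bits: int) -> int:
    """Data beats needed to move one rate block across the bus."""
    return math.ceil(rate_bits / bus_bits)


def is_io_bound(rate_bits: int, bus_bits: int, cycles_per_block: int = 24) -> bool:
    """True when transferring a block takes longer than permuting it."""
    return bus_beats(rate_bits, bus_bits) > cycles_per_block


if __name__ == "__main__":
    for name, rate in (("SHA3-512", 576), ("SHA3-256", 1088), ("SHAKE128", 1344)):
        for width in (32, 64):
            print(name, width, bus_beats(rate, width), is_io_bound(rate, width))
```

Under these assumptions a 64-bit stream keeps an iterative core busy for every rate, while a 32-bit path starves it for the larger rates unless transfers are overlapped with computation.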

🛡️ 5. Side-Channel Countermeasures — A Practical Requirement

Keccak's structure of four linear steps plus one nonlinear step (χ) is a significant advantage for masking. The AND-NOT operation in χ is the primary DPA target; Boolean masking combined with share-decorrelation — DOM (Domain-Oriented Masking) or Threshold Implementation (TI) — is the de facto standard.
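The functional side of a 2-share masked χ can be modeled in a few lines. The sketch below follows the usual first-order DOM AND construction (inner-domain terms plus cross-domain terms blinded with one fresh random word) and checks only that the shares recombine to the unmasked result. It says nothing about leakage: in hardware the cross-domain terms must be registered before recombination, which a software model cannot capture. All names are illustrative.

```python
import secrets

MASK64 = (1 << 64) - 1


def share(value: int) -> tuple:
    """Split a 64-bit lane into two Boolean shares."""
    mask = secrets.randbits(64)
    return (value ^ mask, mask)


def unshare(shares: tuple) -> int:
    """Recombine two Boolean shares."""
    return shares[0] ^ shares[1]


def dom_and(x: tuple, y: tuple, fresh: int) -> tuple:
    """2-share DOM AND: inner-domain products plus blinded cross-domain products."""
    x0, x1 = x
    y0, y1 = y
    z0 = (x0 & y0) ^ ((x0 & y1) ^ fresh)
    z1 = (x1 & y1) ^ ((x1 & y0) ^ fresh)
    return (z0, z1)


def masked_chi_lane(a: tuple, b: tuple, c: tuple) -> tuple:
    """Compute a ^ (~b & c) on shares; NOT is applied to one share only."""
    not_b = (b[0] ^ MASK64, b[1])
    t = dom_and(not_b, c, secrets.randbits(64))
    return (a[0] ^ t[0], a[1] ^ t[1])


if __name__ == "__main__":
    a, b, c = (secrets.randbits(64) for _ in range(3))
    masked = unshare(masked_chi_lane(share(a), share(b), share(c)))
    print(masked == (a ^ (~b & MASK64 & c)))
```

Because θ, ρ, π, and ι are linear, they are applied to each share independently; only this χ gadget consumes fresh randomness.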

▶ Masking Overhead (2-share basis)

Area overhead ~2.5–3× baseline

Timing overhead ~1.5× baseline

▶ Add random mask refresh at the Round Constant injection point, and include redundancy / round-recompute verification for fault injection resistance. Both are effectively mandatory for FIPS 140-3 / Common Criteria EAL certification tracks.

🚀 6. PQC Integration — The Real Driver of Demand

NIST finalized its PQC (post-quantum cryptography) standards in 2024, and the core algorithms rely heavily on SHA-3 / SHAKE for internal hashing and XOF calls.

SHA-3 / SHAKE Dependency in PQC Standards (conceptual)

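ML-KEM (FIPS 203, Kyber): SHAKE128 for matrix sampling; SHAKE256, SHA3-256, and SHA3-512 for PRF and hashing

ML-DSA (FIPS 204, Dilithium): SHAKE256 for PRF and challenge generation; SHAKE128 for matrix expansion

SLH-DSA (FIPS 205, SPHINCS+): SHAKE256 for all hashing in the SHAKE parameter sets (the SHA-2 parameter sets do not use Keccak)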
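To make the XOF dependency concrete, here is a sketch of rejection sampling in the style of ML-KEM's matrix sampling: squeeze SHAKE128 output three bytes at a time, split them into two 12-bit candidates, and keep those below q = 3329. It is an illustration, not a conformant FIPS 203 implementation; the seed layout in the example is a placeholder, and `hashlib` cannot squeeze incrementally, so the code simply requests more output when it runs short.

```python
import hashlib

Q = 3329            # ML-KEM modulus
SHAKE128_RATE = 168  # bytes per squeezed block


def sample_coefficients(seed: bytes, count: int = 256) -> list:
    """Rejection-sample `count` values in [0, Q) from a SHAKE128 stream."""
    blocks = 3
    while True:
        stream = hashlib.shake_128(seed).digest(SHAKE128_RATE * blocks)
        coeffs = []
        for i in range(0, len(stream) - 2, 3):
            b0, b1, b2 = stream[i], stream[i + 1], stream[i + 2]
            d1 = b0 + 256 * (b1 % 16)     # low 12 bits of the 24-bit group
            d2 = (b1 // 16) + 16 * b2     # high 12 bits of the 24-bit group
            if d1 < Q:
                coeffs.append(d1)
            if d2 < Q and len(coeffs) < count:
                coeffs.append(d2)
            if len(coeffs) >= count:
                return coeffs[:count]
        blocks += 1  # not enough accepted values: squeeze one more block


if __name__ == "__main__":
    demo_seed = bytes(32) + bytes([0, 1])  # placeholder seed plus two index bytes
    values = sample_coefficients(demo_seed)
    print(len(values), max(values) < Q)
```

The number of blocks consumed is data-dependent, which is exactly why the core needs a squeeze-on-demand interface rather than a fixed output length.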

🧠 SoC Design Implications:

▶ The SHA-3 core will be invoked repeatedly — dozens to hundreds of times — with short messages, not as a single-shot call. Session caching and fast state reset dominate throughput in this pattern.

▶ A mid-operation state save/restore interface for SHAKE is effectively required.

▶ KMAC, cSHAKE, TupleHash, and ParallelHash (SP 800-185) all reduce to the same core → a function code field in the command decoder is the standard approach.
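The save/restore and caching points above have a direct software analogue: absorb a shared prefix once, snapshot the context, then fork it for each suffix. The sketch below uses `hashlib`'s `copy()` for the snapshot; in hardware this corresponds to reading out and reloading the 1600-bit state plus the pad controller's byte offset. The function name and the prefix/suffix split are illustrative.

```python
import hashlib


def expand_with_shared_prefix(prefix: bytes, suffixes: list, out_bytes: int) -> list:
    """Run SHAKE128(prefix || suffix) for each suffix, absorbing the prefix once."""
    base = hashlib.shake_128()
    base.update(prefix)              # absorbed a single time
    outputs = []
    for suffix in suffixes:
        ctx = base.copy()            # restore the saved state
        ctx.update(suffix)
        outputs.append(ctx.digest(out_bytes))
    return outputs


if __name__ == "__main__":
    rho = bytes(range(32))  # placeholder 32-byte seed
    index_pairs = [bytes([j, i]) for i in range(2) for j in range(2)]
    results = expand_with_shared_prefix(rho, index_pairs, 64)
    print(results[0] == hashlib.shake_128(rho + index_pairs[0]).digest(64))
```

When the prefix is shorter than one rate block, the saving is in data movement rather than permutations; the permutation count only drops once the shared prefix spans at least one full block.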

🪶 7. Lightweight Domain — SHA-3 vs. ASCON

Item	SHA-3 / Keccak	ASCON-Hash
Internal State	1600-bit	320-bit
Structure	Sponge	Sponge (similar)
Area	Baseline	Less than 1/4
Target	General-purpose / PQC / Server	IoT / Sensor / RFID

▶ Choose SHA-3 when the chip handles "heavy security" use cases — PQC, TLS, disk encryption. When the target is a deeply constrained IoT/sensor node, ASCON-Hash is the rational choice. They are not substitutes; they target different market segments.

🔮 8. Outlook

▶ With PQC mandates accelerating, SHA-3 / SHAKE accelerators are shifting from optional to baseline IP. Government, defense, and financial SoC specifications increasingly require simultaneous compliance with FIPS 202, 203, and 204.

▶ The trend toward reusing a single Keccak core for multiple functions — hashing, KMAC, DRBG, PQC internal PRF — is intensifying. Interface flexibility (state save/restore, domain codes, streaming) is becoming the key differentiator over raw throughput.

▶ Side-channel-hardened IP is becoming the norm; without built-in masking, shuffling, and redundancy, entering government certification tracks is increasingly difficult.

▶ Lightweight + post-quantum hybrid SoC — pairing ASCON (LWC) and SHA-3/SHAKE (PQC) on the same die — is expected to grow in IoT gateway and edge applications.

✅ 9. Designer Checklist

✓ Are the FIPS 202 domain bits (0x06 / 0x1F / 0x07) and pad10*1 correctly implemented at the bit level, LSB-first?

✓ Is the 1600-bit state modeled as a 5×5×64 lane array? Is the endianness mapping exactly as specified in the appendix?

✓ Do the intermediate states of θ / ρ / π / χ / ι match a golden reference — not just the final digest? (CAVP KAT alone is insufficient)

✓ Does the iterative vs. unrolled choice align with your throughput, area, and power targets?

✓ Is the I/O bottleneck eliminated via AXI4-Stream / DMA?

✓ Are side-channel countermeasures (DPA / fault injection) appropriate for your threat model?

✓ Have you reserved interface provisions for SHAKE state save/restore, domain codes, and KMAC / cSHAKE extensions?

✓ Has three-way validation been applied: NIST CAVP + FIPS 202 Appendix B + a Python hashlib golden reference?
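For the hashlib leg of the validation, directed vectors around the rate boundary catch most padding and last-block bugs. The sketch below generates (length, message, digest) triples at those boundaries; the message pattern, the set of lengths, and the function name are arbitrary choices for illustration.

```python
import hashlib

# Rate in bytes for each hashlib algorithm name.
RATE_BYTES = {
    "sha3_224": 144, "sha3_256": 136, "sha3_384": 104, "sha3_512": 72,
    "shake_128": 168, "shake_256": 136,
}


def boundary_vectors(name: str, xof_out_bytes: int = 64) -> list:
    """Return (length, message_hex, digest_hex) for lengths around the rate."""
    rate = RATE_BYTES[name]
    vectors = []
    for length in (0, 1, rate - 1, rate, rate + 1, 2 * rate):
        message = bytes(i % 251 for i in range(length))  # non-repeating-ish pattern
        ctx = hashlib.new(name, message)
        if name.startswith("shake"):
            digest = ctx.digest(xof_out_bytes)  # XOFs need an explicit length
        else:
            digest = ctx.digest()
        vectors.append((length, message.hex(), digest.hex()))
    return vectors


if __name__ == "__main__":
    for length, _, digest in boundary_vectors("sha3_256"):
        print(length, digest)
```

The rate − 1 case exercises the merged pad byte, and the exact-rate case exercises the extra padding-only block, two situations that random-length tests rarely hit.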


📌 Disclaimer: This document is a technical brief for SoC security IP designers. The FPGA/ASIC benchmark figures cited (LUT, GE, Gbps) are estimated ranges that vary widely depending on process node, EDA tool, and synthesis settings. Re-synthesize and validate in your own environment before making design decisions. For security certifications (FIPS 140-3, Common Criteria EAL, etc.), consult the latest guidelines from your certification authority.


Semiconductor & SoC Design Notes

Curated from a semiconductor and SoC design-and-verification perspective — each post is reviewed before publishing.


Based on publicly available data and cited sources. Last updated: June 8, 2026
