SHA-3 / SHAKE Core Design for Post-Quantum SoC Security IPs

🔐 SHA-3 / SHAKE Technical Brief for SoC Security IP Design

📌 For hardware RTL designers — from algorithm internals to Verilog implementation and PQC integration

Bottom line: A well-designed SHA-3 / SHAKE core covers ① integrity hashing, ② MAC (KMAC), ③ DRBG and stream generation, and ④ building blocks for post-quantum algorithms such as ML-KEM and ML-DSA — all from a single IP. This is why it is becoming a de facto mandatory IP as PQC mandates accelerate.

🧭 1. Standards Landscape — Why SHA-3 Now

SHA-3 is NIST's next-generation hash standard, formalized in 2015 as FIPS 202, built on the Keccak algorithm designed by Bertoni, Daemen, Peeters, and Van Assche. Unlike SHA-2's Merkle–Damgård construction, SHA-3 adopts a sponge construction that is structurally immune to length-extension attacks and supports variable output lengths from the same core.

SHAKE128 / SHAKE256 share the same Keccak-p[1600, 24] permutation and serve as XOFs (extendable output functions). The result: a single hardware core = fixed-length hash + arbitrary-length PRF + PQC building block.

▶ Parameter Comparison at a Glance

Function | Rate r (bits) | Capacity c (bits) | Output d (bits) | Collision Security (bits)
SHA3-224 | 1152 | 448 | 224 | 112
SHA3-256 | 1088 | 512 | 256 | 128
SHA3-384 | 832 | 768 | 384 | 192
SHA3-512 | 576 | 1024 | 512 | 256
SHAKE128 | 1344 | 256 | Variable | min(d/2, 128)
SHAKE256 | 1088 | 512 | Variable | min(d/2, 256)
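The rate and capacity columns follow from b = 1600 together with c = 2d for SHA3-d and c = 2 × (security strength) for SHAKE. A minimal sketch that recomputes them (the function names are illustrative, not from any standard API):

```python
STATE_BITS = 1600  # b, the width of the Keccak-p[1600, 24] state

def sha3_params(d):
    """Return (rate, capacity) in bits for SHA3-d, using c = 2d and r = b - c."""
    c = 2 * d
    return STATE_BITS - c, c

def shake_params(strength):
    """Return (rate, capacity) in bits for SHAKE128/256, using c = 2 * strength."""
    c = 2 * strength
    return STATE_BITS - c, c

for d in (224, 256, 384, 512):
    r, c = sha3_params(d)
    print(f"SHA3-{d}: r={r} bits ({r // 8} bytes), c={c} bits")

for s in (128, 256):
    r, c = shake_params(s)
    print(f"SHAKE{s}: r={r} bits ({r // 8} bytes), c={c} bits")
```

The byte counts (136 for SHA3-256, 168 for SHAKE128) are the block sizes a pad_unit or DMA descriptor would typically be configured with.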

▶ Rate (throughput) vs. Capacity (security) Trade-off

[Bar chart: rate r per function, longest to shortest: SHAKE128 (r=1344), SHA3-224 (r=1152), SHA3-256 (r=1088), SHA3-384 (r=832), SHA3-512 (r=576)]

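📊 A longer bar means more bits absorbed per permutation call (24 rounds), which means higher throughput. However, a larger rate means a smaller capacity, which reduces the security margin: generic sponge security is bounded by c/2, and for the fixed-length SHA3-* functions collision resistance is additionally capped at d/2 by the output length.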

⚙️ 2. Core Theory — The 1600-Bit Sponge State

The total state size is b = 1600 bits, partitioned as b = r + c. The rate r is the portion XOR-ed with external input each block; the capacity c is the security margin, never directly exposed to the outside.

The 1600 bits are interpreted as a 5 × 5 × 64 (x, y, z) three-dimensional array. A 64-bit word along the z-axis is called a Lane — this is the natural processing unit in Verilog.

⚠️ Endianness Trap: FIPS 202 is little-endian at both the bit and byte level. Bit j of byte i within a lane (bit 0 = LSB) lands at z = 8i + j, so the LSB of a lane's first byte sits at z = 0. A careless MSB-first implementation will fail every NIST CAVP test vector, which makes this the most common debugging pitfall.
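The byte-to-lane mapping can be sketched in a few lines of Python. This assumes the common flat lane index x + 5y; the helper names are illustrative only:

```python
def bytes_to_lanes(image: bytes):
    """Split a 200-byte state image into 25 lanes (flat index = x + 5*y).
    Each lane is read little-endian, so byte 0 of a lane holds z = 0..7."""
    assert len(image) == 200
    return [int.from_bytes(image[8 * i: 8 * i + 8], "little") for i in range(25)]

def lanes_to_bytes(lanes):
    """Inverse mapping, used when squeezing output bytes."""
    return b"".join(lane.to_bytes(8, "little") for lane in lanes)

def bit_position(byte_index: int, bit_in_byte: int):
    """Map (byte, bit) of the state image to (x, y, z); bit 0 is the LSB."""
    lane, byte_in_lane = divmod(byte_index, 8)
    y, x = divmod(lane, 5)
    return x, y, 8 * byte_in_lane + bit_in_byte

# A single 0x01 in byte 0 sets only bit z = 0 of lane (0, 0).
state_image = bytes([0x01]) + bytes(199)
lanes = bytes_to_lanes(state_image)
print(hex(lanes[0]), bit_position(0, 0), bit_position(9, 7))
```

In RTL the same mapping is pure wiring: a byte lane on the input bus connects to bits [8i+7:8i] of the target 64-bit register without any bit reversal.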

🛠️ 3. Processing Steps — A Verilog-Level Walkthrough

Step 1. Padding and Domain Separation

Function | Appended Bits (LSB-first) | First Pad Byte
SHA3-* | 01 + 10*1 | 0x06
SHAKE* | 1111 + 10*1 | 0x1F
RawSHAKE | 11 + 10*1 | 0x07

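The magic bytes (0x1F for SHAKE, 0x06 for SHA-3) are not arbitrary constants. They result from appending the domain separation bits and the first 1 bit of pad10*1 in LSB-first order. In RTL, for byte-aligned messages, OR this byte into the first byte position after the end of the message, then set the terminating bit of pad10*1 by OR-ing 0x80 into the last byte of the rate block. When only one pad byte fits, both land in the same byte (0x86 for SHA-3, 0x9F for SHAKE).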
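A minimal software model of the padding rule, assuming byte-aligned messages (the usual case for a bus-attached core). The function name `pad` and its argument order are illustrative:

```python
def pad(message: bytes, rate_bytes: int, suffix: int) -> bytes:
    """Byte-aligned FIPS 202 padding.

    suffix is the first pad byte from the table above:
    0x06 (SHA3-*), 0x1F (SHAKE*), 0x07 (RawSHAKE).
    """
    pad_len = rate_bytes - (len(message) % rate_bytes)  # always 1..rate_bytes
    padding = bytearray(pad_len)
    padding[0] ^= suffix   # domain bits plus the first '1' of pad10*1
    padding[-1] ^= 0x80    # final '1' of pad10*1, the last bit of the block
    return message + bytes(padding)

# Corner case: one free byte left in a SHA3-256 block (rate = 136 bytes).
one_byte_pad = pad(bytes(135), 136, 0x06)
print(hex(one_byte_pad[-1]))
```

Note that a message whose length is an exact multiple of the rate still receives a full extra block of padding, a case a pad_unit state machine has to handle explicitly.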

Steps 2–3. Absorb Phase and Keccak-f[1600] — 24 Rounds

The padded message is split into r-bit blocks. Each block is XOR-ed into the first r bits of the state, followed by one call to the Keccak-f[1600] permutation (24 rounds). Each round consists of exactly five step mappings, and only χ is nonlinear, which is a critical advantage for masking cost, as discussed below.

Step | Role | Linearity | HW Cost
θ | Column parity XOR (diffusion) | Linear | XOR gates
ρ | Rotation within each 64-bit lane | Linear | Wiring only (no gates)
π | Lane repositioning in the 5×5 plane | Linear | Wiring only (no gates)
χ | The only nonlinear step (AND/NOT/XOR) | Nonlinear | AND/NOT/XOR gates; primary DPA target
ι | Round constant XOR into lane (0,0) | Linear | ROM/case
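As a golden model for the five step mappings, here is a compact Python sketch of Keccak-f[1600] with a one-function SHA3-256 wrapper for self-checking. It assumes a flat lane index x + 5y and derives the ρ offsets and ι constants programmatically rather than from typed-in tables; all function names are illustrative.

```python
import hashlib

MASK64 = (1 << 64) - 1

def rol64(v, n):
    """Rotate a 64-bit lane left by n positions (toward higher z)."""
    n %= 64
    return ((v << n) | (v >> (64 - n))) & MASK64

def rho_offsets():
    """Derive the 25 rho rotation offsets by walking (x, y) from (1, 0)."""
    off = [0] * 25
    x, y = 1, 0
    for t in range(24):
        off[x + 5 * y] = ((t + 1) * (t + 2) // 2) % 64
        x, y = y, (2 * x + 3 * y) % 5
    return off

def round_constants():
    """Generate the 24 iota constants from the 8-bit LFSR (rc_rom contents)."""
    rcs, lfsr = [], 1
    for _ in range(24):
        rc = 0
        for j in range(7):
            lfsr = ((lfsr << 1) ^ ((lfsr >> 7) * 0x71)) & 0xFF
            if lfsr & 2:
                rc |= 1 << ((1 << j) - 1)
        rcs.append(rc)
    return rcs

RHO = rho_offsets()
RC = round_constants()

def keccak_round(a, rc):
    """One round on 25 lanes, flat index = x + 5*y."""
    # theta: XOR each lane with the parity of two neighbouring columns
    c = [a[x] ^ a[x + 5] ^ a[x + 10] ^ a[x + 15] ^ a[x + 20] for x in range(5)]
    d = [c[(x + 4) % 5] ^ rol64(c[(x + 1) % 5], 1) for x in range(5)]
    a = [a[i] ^ d[i % 5] for i in range(25)]
    # rho: fixed rotation per lane (wiring only in hardware)
    a = [rol64(a[i], RHO[i]) for i in range(25)]
    # pi: move lane (x, y) to (y, 2x + 3y)
    b = [0] * 25
    for x in range(5):
        for y in range(5):
            b[y + 5 * ((2 * x + 3 * y) % 5)] = a[x + 5 * y]
    # chi: the only nonlinear step, applied along each row
    a = [0] * 25
    for y in range(5):
        for x in range(5):
            n1 = b[(x + 1) % 5 + 5 * y]
            n2 = b[(x + 2) % 5 + 5 * y]
            a[x + 5 * y] = b[x + 5 * y] ^ (~n1 & n2 & MASK64)
    # iota: XOR the round constant into lane (0, 0)
    a[0] ^= rc
    return a

def keccak_f1600(lanes):
    """Apply all 24 rounds to a list of 25 lanes."""
    for rc in RC:
        lanes = keccak_round(lanes, rc)
    return lanes

def sha3_256_model(msg: bytes) -> bytes:
    """Absorb with rate 136 bytes and suffix 0x06, then emit 32 bytes."""
    rate = 136
    p = bytearray(msg) + bytearray(rate - len(msg) % rate)
    p[len(msg)] ^= 0x06
    p[-1] ^= 0x80
    state = [0] * 25
    for off in range(0, len(p), rate):
        for i in range(rate // 8):
            state[i] ^= int.from_bytes(p[off + 8 * i: off + 8 * i + 8], "little")
        state = keccak_f1600(state)
    return b"".join(lane.to_bytes(8, "little") for lane in state)[:32]

# Self-check against the library implementation.
print(sha3_256_model(b"abc") == hashlib.sha3_256(b"abc").digest())
```

Because `keccak_round` returns the state after each round, dumping its output per round gives lane values that can be compared directly against RTL simulation waveforms.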

Step 4. Squeezing

Output d bits are produced by emitting the first r bits of the state, S[0:r-1], and re-running Keccak-f whenever more output is needed. For all four SHA3-* functions d ≤ r, so a single squeeze pass suffices. SHAKE, by design, operates over a squeeze loop with a caller-chosen output length, which makes a streaming (valid/ready) output interface the natural fit.
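The squeeze loop can be illustrated with a small count of extra permutation calls and with the XOF prefix property, which is what makes streaming output safe. This sketch uses Python's hashlib; `squeeze_permutations` is an illustrative helper, not a library function:

```python
import hashlib
from math import ceil

def squeeze_permutations(out_bits: int, rate_bits: int) -> int:
    """Extra Keccak-f calls needed while squeezing, beyond the call that
    ends the absorb phase. The first r bits come for free."""
    return max(0, ceil(out_bits / rate_bits) - 1)

# SHA3-256: d = 256 <= r = 1088, so no extra permutation is needed.
print(squeeze_permutations(256, 1088))
# SHAKE128 producing 400 bytes (3200 bits) at r = 1344 needs extra calls.
print(squeeze_permutations(3200, 1344))

# XOF prefix property: a longer output extends a shorter one, so a core can
# stream blocks without knowing the final length up front.
short_out = hashlib.shake_128(b"seed").digest(32)
long_out = hashlib.shake_128(b"seed").digest(400)
print(long_out[:32] == short_out)
```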

🏗️ 4. RTL Datapath Choices and Benchmark Ranges

Architecture | Cycles per Permutation (one r-bit block) | Area | Use Case
Iterative (1 round/clk) | 24 cycles | Small | General-purpose SoC (most common)
Unrolled / Pipelined | ≥ 1 cycle/block | Very Large | Network line-rate, 40 Gbps+
Folded (64-bit datapath) | Hundreds of cycles | Very Small | IoT, smart card

📊 Industry survey (2024) — FPGA Iterative ≈ 1.5–2.5k LUT, 1–3 Gbps · Unrolled ≈ 10–20k LUT, 30–40 Gbps+ · ASIC Iterative ≈ 20–40k GE. These figures vary widely by process node, EDA tool, and synthesis settings — treat them as ballpark estimates only and re-synthesize with your own PDK before making design decisions.
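For a first-order sanity check before synthesis, the permutation-bound throughput of an iterative core is r × f_clk / 24. The sketch below computes that upper bound along with the number of bus beats per rate block. The clock frequency is a placeholder assumption, and the bound ignores I/O stalls, padding, and control overhead, so measured figures will typically be lower.

```python
def iterative_throughput_gbps(rate_bits, f_clk_mhz, cycles_per_block=24):
    """Upper bound on absorb throughput when the permutation is the only cost."""
    return rate_bits * f_clk_mhz * 1e6 / cycles_per_block / 1e9

def bus_beats(rate_bits, bus_width_bits):
    """Bus transfers needed to move one rate block (ceiling division)."""
    return -(-rate_bits // bus_width_bits)

F_CLK_MHZ = 240  # hypothetical clock, chosen only for illustration

for name, r in (("SHA3-256", 1088), ("SHA3-512", 576), ("SHAKE128", 1344)):
    gbps = iterative_throughput_gbps(r, F_CLK_MHZ)
    print(f"{name}: {gbps:.2f} Gbps peak, "
          f"{bus_beats(r, 32)} beats @32-bit, {bus_beats(r, 64)} beats @64-bit")
```

When the beat count for the chosen bus width exceeds 24, the interface rather than the permutation sets the sustained rate unless transfers overlap with computation.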

▶ Recommended Module Hierarchy

keccak_top
├── pad_unit     : domain bits + pad10*1, byte-enable
├── absorb_xor_mux : state[0:r-1] ^= block
├── keccak_round  : θ → ρ → π → χ → ι (combinational)
├── round_counter : 0..23
├── rc_rom      : 24 × 64-bit round constants
├── state_reg    : 5×5 lane array (reg [63:0] A[0:4][0:4])
└── squeeze_unit  : r-bit output streaming

▶ Common Pitfalls

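A flat 1600-bit register declaration makes the θ/ρ/π index arithmetic error-prone and hard to review, and some synthesis flows handle the resulting wide bit-select expressions poorly. Declare the state as a 5×5 lane array and use index arithmetic to express θ/ρ/π cleanly.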

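I/O bottleneck: Keccak-f finishes in 24 cycles on an iterative core, but moving one rate block over a narrow bus can take longer. For r = 1088, a 32-bit bus needs 34 beats per block and a 64-bit bus needs 17, before any protocol overhead. Use AXI4-Stream + DMA, and overlap block transfer with the permutation.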

Constant-time guarantee — The algorithm itself is data-independent, but ensure that the valid/ready handshake and padding controller do not leak timing information based on input length.

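Test vectors: A final-digest mismatch shows that something is wrong but not which step mapping is at fault, so bit-order bugs in θ/ρ are slow to localize. Verify against CAVP KATs, use FIPS 202 Appendix B for the bit/byte conversion and padding conventions, and compare intermediate round states against NIST's published SHA-3 examples with intermediate values or a software golden model.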
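For the final-digest side of verification, Python's hashlib can generate golden vectors on demand. A minimal sketch follows; the dictionary keys and the choice of message lengths are illustrative, and intermediate round states still require a step-level model.

```python
import hashlib

def golden_vectors(messages):
    """Return one dict of reference digests per message."""
    rows = []
    for msg in messages:
        rows.append({
            "msg_hex": msg.hex(),
            "sha3_256": hashlib.sha3_256(msg).hexdigest(),
            "sha3_512": hashlib.sha3_512(msg).hexdigest(),
            "shake128_32B": hashlib.shake_128(msg).hexdigest(32),
            "shake256_64B": hashlib.shake_256(msg).hexdigest(64),
        })
    return rows

# Lengths around the SHA3-256 rate (136 bytes) exercise the padding corners:
# empty input, one byte, one byte short of a block, exact block, one over.
lengths = (0, 1, 135, 136, 137)
messages = [bytes(i & 0xFF for i in range(n)) for n in lengths]
rows = golden_vectors(messages)

for n, row in zip(lengths, rows):
    print(n, row["sha3_256"])
```

Writing the rows out as a hex memory file lets the testbench load them directly for self-checking simulation.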

🛡️ 5. Side-Channel Countermeasures — A Practical Requirement

Keccak's structure of four linear steps plus one nonlinear step (χ) is a significant advantage for masking. The AND-NOT operation in χ is the primary DPA target; Boolean masking combined with share-decorrelation — DOM (Domain-Oriented Masking) or Threshold Implementation (TI) — is the de facto standard.
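To make the masking idea concrete, the sketch below models a 2-share Boolean-masked χ on one 5-lane row, using a DOM-style AND gadget with fresh randomness per AND. It is a functional model only: it checks that the shares recombine to the unmasked result and does not model the register stages or glitch behaviour that determine real first-order security. All names are illustrative.

```python
import secrets

MASK64 = (1 << 64) - 1

def dom_and(x0, x1, y0, y1, r):
    """2-share AND: cross-domain terms are blinded with fresh randomness r."""
    z0 = (x0 & y0) ^ ((x0 & y1) ^ r)
    z1 = (x1 & y1) ^ ((x1 & y0) ^ r)
    return z0, z1

def chi_row(a):
    """Unmasked chi on one row of 5 lanes: a[x] ^ (~a[x+1] & a[x+2])."""
    return [a[x] ^ (~a[(x + 1) % 5] & a[(x + 2) % 5] & MASK64) for x in range(5)]

def masked_chi_row(s0, s1):
    """Chi on two Boolean shares, where value = s0 ^ s1 lane by lane."""
    out0, out1 = [], []
    for x in range(5):
        r = secrets.randbits(64)
        # NOT touches one share only: ~(b0 ^ b1) == (~b0) ^ b1
        n0 = ~s0[(x + 1) % 5] & MASK64
        n1 = s1[(x + 1) % 5]
        t0, t1 = dom_and(n0, n1, s0[(x + 2) % 5], s1[(x + 2) % 5], r)
        out0.append(s0[x] ^ t0)
        out1.append(s1[x] ^ t1)
    return out0, out1

# Share a random row, run the masked chi, and recombine.
row = [secrets.randbits(64) for _ in range(5)]
share1 = [secrets.randbits(64) for _ in range(5)]
share0 = [v ^ m for v, m in zip(row, share1)]
o0, o1 = masked_chi_row(share0, share1)
print([p ^ q for p, q in zip(o0, o1)] == chi_row(row))
```

The linear steps (θ, ρ, π, ι) need no gadget at all: they are applied to each share independently, with ι applied to one share only.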

▶ Masking Overhead (2-share basis)

Area overhead ~2.5–3× baseline
Timing overhead ~1.5× baseline

▶ Add random mask refresh at the Round Constant injection point, and include redundancy / round-recompute verification for fault injection resistance. Both are effectively mandatory for FIPS 140-3 / Common Criteria EAL certification tracks.

🚀 6. PQC Integration — The Real Driver of Demand

NIST finalized its PQC (post-quantum cryptography) standards in 2024, and the core algorithms rely heavily on SHA-3 / SHAKE for internal hashing and XOF calls.

SHA-3 / SHAKE Dependency in PQC Standards (conceptual)
ML-KEM (FIPS 203, Kyber) — SHAKE128 matrix sampling
ML-DSA (FIPS 204, Dilithium) — SHAKE256 PRF and challenge generation
SLH-DSA (FIPS 205, SPHINCS+) — SHAKE256-based hashing and PRF in the SHAKE parameter sets (SHA-2 parameter sets also exist)
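To show what "SHAKE128 matrix sampling" means for the core's traffic pattern, here is a hedged software sketch of ML-KEM-style rejection sampling: one polynomial of 256 coefficients is drawn from a SHAKE128 stream seeded with ρ and two index bytes. The seed layout (ρ, then j, then i) follows my reading of FIPS 203 and should be checked against the standard before use. Because hashlib has no incremental squeeze, the sketch re-hashes with a longer output where hardware would simply squeeze another block.

```python
import hashlib

Q = 3329           # ML-KEM modulus
RATE_BYTES = 168   # SHAKE128 rate in bytes

def sample_ntt(rho: bytes, j: int, i: int):
    """Rejection-sample 256 coefficients in [0, Q) from SHAKE128(rho || j || i)."""
    seed = rho + bytes([j, i])
    nbytes = 3 * RATE_BYTES          # initial guess: three rate blocks
    coeffs = []
    while len(coeffs) < 256:
        stream = hashlib.shake_128(seed).digest(nbytes)
        coeffs = []
        for k in range(0, len(stream) - 2, 3):
            b0, b1, b2 = stream[k], stream[k + 1], stream[k + 2]
            d1 = b0 + 256 * (b1 % 16)      # low 12 bits of the 24-bit group
            d2 = (b1 // 16) + 16 * b2      # high 12 bits of the 24-bit group
            for d in (d1, d2):
                if d < Q and len(coeffs) < 256:
                    coeffs.append(d)
        nbytes += RATE_BYTES               # ask for one more block if short
    return coeffs

rho = bytes(32)  # placeholder seed, for illustration only
poly = sample_ntt(rho, 0, 1)
print(len(poly), max(poly) < Q)
```

Each matrix entry is an independent short-input, multi-block-output SHAKE128 call, which is exactly the "many small invocations" pattern the design implications below address.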

🧠 SoC Design Implications:

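▶ The SHA-3 core will be invoked repeatedly (dozens to hundreds of times) with short messages, not as a single-shot call. In this pattern, per-call overhead such as state reset, command issue, and padding dominates throughput rather than permutation latency, so fast state reset and caching of a shared absorbed prefix matter most.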

▶ A mid-operation state save/restore interface for SHAKE is effectively required.

▶ KMAC, cSHAKE, TupleHash, and ParallelHash (SP 800-185) all reduce to the same core → a function code field in the command decoder is the standard approach.
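The save/restore requirement has a direct software analogue in hashlib's `copy()`, which clones the sponge state after a shared prefix has been absorbed. A minimal sketch of the pattern, where the function name and the one-byte nonce encoding are illustrative assumptions:

```python
import hashlib

def prf_with_cached_prefix(key: bytes, nonces, out_len=32):
    """Absorb the key once, then fork the state for each nonce."""
    base = hashlib.shake_256()
    base.update(key)                 # shared prefix absorbed a single time
    outputs = []
    for n in nonces:
        ctx = base.copy()            # software analogue of a state restore
        ctx.update(bytes([n]))       # per-call suffix
        outputs.append(ctx.digest(out_len))
    return outputs

key = bytes(range(32))
outs = prf_with_cached_prefix(key, range(4))
# Each output matches a from-scratch computation over key || nonce.
print(outs[2] == hashlib.shake_256(key + bytes([2])).digest(32))
```

In hardware, a restore point that falls mid-block must capture the partially filled input buffer and byte count in addition to the 1600-bit state, which is worth budgeting for in the save/restore register map.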

🪶 7. Lightweight Domain — SHA-3 vs. ASCON

Item | SHA-3 / Keccak | ASCON-Hash
Internal State | 1600-bit | 320-bit
Structure | Sponge | Sponge (similar)
Area | Baseline | Less than 1/4
Target | General-purpose / PQC / Server | IoT / Sensor / RFID

▶ Choose SHA-3 when the chip handles "heavy security" use cases — PQC, TLS, disk encryption. When the target is a deeply constrained IoT/sensor node, ASCON-Hash is the rational choice. They are not substitutes; they target different market segments.

🔮 8. Outlook

▶ With PQC mandates accelerating, SHA-3 / SHAKE accelerators are shifting from optional to baseline IP. Government, defense, and financial SoC specifications increasingly require simultaneous compliance with FIPS 202, 203, and 204.

▶ The trend toward reusing a single Keccak core for multiple functions — hashing, KMAC, DRBG, PQC internal PRF — is intensifying. Interface flexibility (state save/restore, domain codes, streaming) is becoming the key differentiator over raw throughput.

▶ Side-channel-hardened IP is becoming the norm; without built-in masking, shuffling, and redundancy, entering government certification tracks is increasingly difficult.

▶ Lightweight + post-quantum hybrid SoC designs, pairing ASCON (LWC) and SHA-3/SHAKE (PQC) on the same die, are expected to grow in IoT gateway and edge applications.

✅ 9. Designer Checklist

✓ Are the FIPS 202 domain bits (0x06 / 0x1F / 0x07) and pad10*1 correctly implemented at the bit level, LSB-first?

✓ Is the 1600-bit state modeled as a 5×5×64 lane array? Is the endianness mapping exactly as specified in FIPS 202 Appendix B?

✓ Do the intermediate states of θ / ρ / π / χ / ι match a golden reference — not just the final digest? (CAVP KAT alone is insufficient)

✓ Does the iterative vs. unrolled choice align with your throughput, area, and power targets?

✓ Is the I/O bottleneck eliminated via AXI4-Stream / DMA?

✓ Are side-channel countermeasures (DPA / fault injection) appropriate for your threat model?

✓ Have you reserved interface provisions for SHAKE state save/restore, domain codes, and KMAC / cSHAKE extensions?

✓ Has three-way validation been applied: NIST CAVP KATs + NIST intermediate-value examples (using the FIPS 202 Appendix B conventions) + hashlib golden reference?

📌 Disclaimer: This document is a technical brief for SoC security IP designers. The FPGA/ASIC benchmark figures cited (LUT, GE, Gbps) are estimated ranges that vary widely depending on process node, EDA tool, and synthesis settings. Re-synthesize and validate in your own environment before making design decisions. For security certifications (FIPS 140-3, Common Criteria EAL, etc.), consult the latest guidelines from your certification authority.

SoC Design
Semiconductor & SoC Design Notes

Curated from a semiconductor and SoC design-and-verification perspective — each post is reviewed before publishing.

Based on publicly available data and cited sources. Last updated: June 8, 2026
