SHA-3 / SHAKE Core Design for Post-Quantum SoC Security IPs
🔐 SHA-3 / SHAKE Technical Brief for SoC Security IP Design
📌 For hardware RTL designers — from algorithm internals to Verilog implementation and PQC integration
Bottom line: A well-designed SHA-3 / SHAKE core covers ① integrity hashing, ② MAC (KMAC), ③ DRBG and stream generation, and ④ building blocks for post-quantum algorithms such as ML-KEM and ML-DSA — all from a single IP. This is why it is becoming a de facto mandatory IP as PQC mandates accelerate.
🧭 1. Standards Landscape — Why SHA-3 Now
SHA-3 is NIST's next-generation hash standard, formalized in 2015 as FIPS 202, built on the Keccak algorithm designed by Bertoni, Daemen, Peeters, and Van Assche. Unlike SHA-2's Merkle–Damgård construction, SHA-3 adopts a sponge construction that is structurally immune to length-extension attacks and supports variable output lengths from the same core.
SHAKE128 / SHAKE256 share the same Keccak-p[1600, 24] permutation and serve as XOFs (extendable output functions). The result: a single hardware core = fixed-length hash + arbitrary-length PRF + PQC building block.
▶ Parameter Comparison at a Glance
| Function | Rate (r) | Capacity (c) | Output (bits) | Security (bits) |
|---|---|---|---|---|
| SHA3-224 | 1152 | 448 | 224 | 112 |
| SHA3-256 | 1088 | 512 | 256 | 128 |
| SHA3-384 | 832 | 768 | 384 | 192 |
| SHA3-512 | 576 | 1024 | 512 | 256 |
| SHAKE128 | 1344 | 256 | Variable | 128 |
| SHAKE256 | 1088 | 512 | Variable | 256 |
▶ Rate (throughput) vs. Capacity (security) Trade-off
📊 A larger rate means more bits absorbed per Keccak-f call (one call per block, not per round), and therefore higher throughput. However, a larger rate means a smaller capacity, which lowers the generic security bound (security ≈ c/2).
⚙️ 2. Core Theory — The 1600-Bit Sponge State
The total state size is b = 1600 bits, partitioned as b = r + c. The rate r is the portion XOR-ed with external input each block; the capacity c is the security margin, never directly exposed to the outside.
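The relationship b = r + c can be checked against the parameter table with a few lines of Python. This is a minimal sketch: the helper name `sponge_params` is illustrative, and the security figure is taken as the collision bound min(d/2, c/2), which is an assumption about how the table's last column was derived.

```python
STATE_BITS = 1600  # b, the width of the Keccak-p[1600, 24] state

def sponge_params(capacity_bits, digest_bits=None):
    """Return (rate_bits, security_bits) for a given capacity.

    security_bits is the generic c/2 bound, further limited to d/2
    (collision resistance) when a fixed digest length is given.
    """
    rate_bits = STATE_BITS - capacity_bits
    security_bits = capacity_bits // 2
    if digest_bits is not None:
        security_bits = min(security_bits, digest_bits // 2)
    return rate_bits, security_bits

# (capacity, fixed digest length or None for an XOF)
VARIANTS = {
    "SHA3-224": (448, 224),
    "SHA3-256": (512, 256),
    "SHA3-384": (768, 384),
    "SHA3-512": (1024, 512),
    "SHAKE128": (256, None),
    "SHAKE256": (512, None),
}

for name, (c, d) in VARIANTS.items():
    r, sec = sponge_params(c, d)
    print(f"{name:9s} r={r:4d} bits ({r // 8:3d} bytes)  c={c:4d}  security={sec}")
```

The rate in bytes (136 for SHA3-256, 168 for SHAKE128) is the number a block-buffer or padding controller actually works with.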
The 1600 bits are interpreted as a 5 × 5 × 64 (x, y, z) three-dimensional array. A 64-bit word along the z-axis is called a Lane — this is the natural processing unit in Verilog.
⚠️ Endianness Trap: FIPS 202 fills each lane little-endian. Byte i of a lane occupies z = 8i to 8i+7, with the LSB of that byte at z = 8i, so the LSB of a lane's first byte lands at z = 0. A careless MSB-first implementation will fail every NIST CAVP test vector; this is the most common debugging pitfall.
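The byte-to-lane mapping can be made concrete with a short Python model. This is a sketch, assuming the state is held as a 200-byte image and that lane (x, y) starts at byte offset 8·(x + 5y); the helper names are illustrative.

```python
def bytes_to_lanes(state_bytes):
    """Split a 200-byte state image into 25 little-endian 64-bit lanes.

    Lane (x, y) is stored at list index x + 5*y.
    """
    assert len(state_bytes) == 200
    return [int.from_bytes(state_bytes[8 * i: 8 * i + 8], "little")
            for i in range(25)]

def lane_bit(lanes, x, y, z):
    """Return bit z of lane (x, y)."""
    return (lanes[x + 5 * y] >> z) & 1

state = bytearray(200)
state[0] = 0x01   # LSB of the first byte     -> lane (0,0), z = 0
state[1] = 0x80   # MSB of the second byte    -> lane (0,0), z = 15
state[8] = 0x01   # LSB of the ninth byte     -> lane (1,0), z = 0

lanes = bytes_to_lanes(bytes(state))
print(hex(lanes[0]))  # 0x8001
```

In RTL the same mapping typically shows up as a byte-swizzle on the input bus, so a testbench that feeds single-bit patterns like these catches bit-order mistakes early.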
🛠️ 3. Processing Steps — A Verilog-Level Walkthrough
Step 1. Padding and Domain Separation
| Function | Appended Bits (LSB-first) | First Pad Byte |
|---|---|---|
| SHA3-* | 01 + 10*1 | 0x06 |
| SHAKE* | 1111 + 10*1 | 0x1F |
| RawSHAKE | 11 + 10*1 | 0x07 |
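The magic bytes, 0x1F for SHAKE and 0x06 for SHA-3, are not arbitrary constants. They are the result of appending the domain separation bits, followed by the first 1 of pad10*1, in LSB-first order. In RTL, for byte-aligned messages, XOR this byte into the position immediately after the last message byte, then XOR 0x80 (the terminating 1 of pad10*1) into the last byte of the r-bit block. When only one byte of the block is free, both land in the same byte (0x86 for SHA-3, 0x9F for SHAKE).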
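A byte-level Python sketch of the padding rule follows. It assumes byte-aligned messages (the common case for a bus-attached core); the function name `pad_message` is illustrative.

```python
def pad_message(msg, rate_bytes, suffix):
    """Apply domain suffix and pad10*1 to a byte-aligned message.

    suffix is the first pad byte: 0x06 (SHA3-*), 0x1F (SHAKE*), 0x07 (RawSHAKE).
    """
    padded = bytearray(msg)
    padded.append(suffix)                             # domain bits + first '1' of pad10*1
    padded.extend(bytes(-len(padded) % rate_bytes))   # zero fill up to a block boundary
    padded[-1] ^= 0x80                                # terminating '1' of pad10*1
    return bytes(padded)

# Empty message, SHA3-256 (rate = 136 bytes): 0x06 first, 0x80 last.
p = pad_message(b"", 136, 0x06)
print(len(p), hex(p[0]), hex(p[-1]))

# Exactly one free byte in the block: both markers share it.
q = pad_message(bytes(135), 136, 0x06)
print(len(q), hex(q[-1]))
```

The corner cases worth a directed test are a message that leaves exactly one free byte and a message that fills the block exactly, since the latter forces an extra padding-only block.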
Steps 2–3. Absorb Phase and Keccak-f[1600] — 24 Rounds
The message is cut into r-bit blocks, each XOR-ed into the state, followed by 24 rounds of the Keccak-f permutation. Each round consists of exactly five steps; only χ is nonlinear — a critical advantage for masking cost, as discussed below.
| Step | Role | Linearity | HW Cost |
|---|---|---|---|
| θ | Column parity XOR — diffusion | Linear | XOR gates |
| ρ | 64-bit in-lane rotation | Linear | Wiring only (no gates) |
| π | 5×5 lane reposition | Linear | Wiring only (no gates) |
| χ | The only nonlinear step — AND/NOT/XOR | Nonlinear | DPA target |
| ι | Round Constant XOR | Linear | ROM/case |
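The five steps can be modeled in Python as a bit-accurate golden reference. This is a sketch rather than a tuned implementation: it derives the round constants with the FIPS 202 LFSR instead of a ROM, stores the state as `lanes[x][y]`, and wraps the permutation in a minimal byte-aligned sponge (`sponge`, an illustrative name) only so the result can be cross-checked against `hashlib`.

```python
import hashlib

MASK64 = (1 << 64) - 1

def rol64(v, n):
    n %= 64
    return ((v << n) | (v >> (64 - n))) & MASK64

def keccak_f1600(lanes):
    """Apply 24 rounds to lanes[x][y], each a 64-bit integer."""
    lfsr = 1
    for _ in range(24):
        # theta: column parity, XOR-ed into every lane of the column
        c = [lanes[x][0] ^ lanes[x][1] ^ lanes[x][2] ^ lanes[x][3] ^ lanes[x][4]
             for x in range(5)]
        d = [c[(x + 4) % 5] ^ rol64(c[(x + 1) % 5], 1) for x in range(5)]
        lanes = [[lanes[x][y] ^ d[x] for y in range(5)] for x in range(5)]
        # rho + pi: rotate each lane and move it to its new (x, y) position
        x, y = 1, 0
        current = lanes[x][y]
        for t in range(24):
            x, y = y, (2 * x + 3 * y) % 5
            current, lanes[x][y] = lanes[x][y], rol64(current, (t + 1) * (t + 2) // 2)
        # chi: the only nonlinear step, applied row by row
        for y in range(5):
            row = [lanes[x][y] for x in range(5)]
            for x in range(5):
                lanes[x][y] = row[x] ^ ((~row[(x + 1) % 5]) & row[(x + 2) % 5])
        # iota: XOR the round constant into lane (0, 0)
        for j in range(7):
            lfsr = ((lfsr << 1) ^ ((lfsr >> 7) * 0x71)) % 256
            if lfsr & 2:
                lanes[0][0] ^= 1 << ((1 << j) - 1)
    return lanes

def permute(state):
    """Run Keccak-f[1600] on a 200-byte state image."""
    lanes = [[int.from_bytes(state[8 * (x + 5 * y): 8 * (x + 5 * y) + 8], "little")
              for y in range(5)] for x in range(5)]
    lanes = keccak_f1600(lanes)
    out = bytearray(200)
    for x in range(5):
        for y in range(5):
            out[8 * (x + 5 * y): 8 * (x + 5 * y) + 8] = lanes[x][y].to_bytes(8, "little")
    return out

def sponge(rate_bytes, suffix, msg, out_len):
    """Minimal byte-aligned sponge used only to check the permutation."""
    padded = bytearray(msg) + bytes([suffix])
    padded += bytes(-len(padded) % rate_bytes)
    padded[-1] ^= 0x80
    state = bytearray(200)
    for off in range(0, len(padded), rate_bytes):      # absorb
        for i in range(rate_bytes):
            state[i] ^= padded[off + i]
        state = permute(state)
    out = bytearray(state[:rate_bytes])                # squeeze
    while len(out) < out_len:
        state = permute(state)
        out += state[:rate_bytes]
    return bytes(out[:out_len])

print(sponge(136, 0x06, b"abc", 32) == hashlib.sha3_256(b"abc").digest())
```

Dumping `lanes` after each of the five steps in round 0 gives the intermediate values an RTL testbench can compare against, which localizes a θ or ρ bit-order bug far faster than a final-digest mismatch.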
Step 4. Squeezing
Output d bits are produced by emitting S[0:r-1] and re-running Keccak-f whenever more output is needed. Every SHA3-* variant has d ≤ r, so a single squeeze pass suffices. SHAKE, by design, operates over a squeeze loop, which makes a streaming (valid/ready) output interface effectively mandatory.
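The squeeze loop's externally visible behavior can be mimicked with `hashlib`. This is only a behavioral sketch: `hashlib` offers no incremental squeeze, so the generator below takes the full output once and slices it into rate-sized beats, which is where a hardware core would insert a new Keccak-f call. The name `squeeze_stream` is illustrative.

```python
import hashlib

# XOF constructor and rate in bytes for each SHAKE variant
XOFS = {
    "SHAKE128": (hashlib.shake_128, 168),
    "SHAKE256": (hashlib.shake_256, 136),
}

def squeeze_stream(variant, msg, total_bytes):
    """Yield XOF output in rate-sized chunks, one chunk per squeeze pass."""
    ctor, rate_bytes = XOFS[variant]
    out = ctor(msg).digest(total_bytes)
    for off in range(0, total_bytes, rate_bytes):
        yield out[off: off + rate_bytes]

chunks = list(squeeze_stream("SHAKE128", b"seed", 400))
print([len(c) for c in chunks])  # [168, 168, 64]

# Prefix property: a shorter request is a prefix of a longer one.
short = hashlib.shake_128(b"seed").digest(32)
print(chunks[0][:32] == short)
```

The prefix property is what lets a consumer stop reading at any point, so the output side benefits from backpressure rather than a fixed-length read.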
🏗️ 4. RTL Datapath Choices and Benchmark Ranges
| Architecture | Cycles per Block (one Keccak-f call) | Area | Use Case |
|---|---|---|---|
| Iterative (1 round/clk) | 24 cycles | Small | General-purpose SoC (most common) |
| Unrolled / Pipelined | ≥ 1 cycle/block | Very Large | Network line-rate, 40 Gbps+ |
| Folded (64-bit datapath) | Hundreds of cycles | Very Small | IoT, smart card |
📊 Industry survey (2024) — FPGA Iterative ≈ 1.5–2.5k LUT, 1–3 Gbps · Unrolled ≈ 10–20k LUT, 30–40 Gbps+ · ASIC Iterative ≈ 20–40k GE. These figures vary widely by process node, EDA tool, and synthesis settings — treat them as ballpark estimates only and re-synthesize with your own PDK before making design decisions.
▶ Recommended Module Hierarchy (diagram not reproduced here)
▶ Common Pitfalls
▶ A single flat 1600-bit register declaration makes θ/ρ/π bit indexing error-prone and the RTL hard to review. Declare the state as a 5×5 array of 64-bit lanes and use index arithmetic to express θ/ρ/π cleanly.
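▶ I/O bottleneck: Keccak-f finishes in 24 cycles, but moving one r-bit block over a narrow bus can take longer (1088 bits is 34 beats on a 32-bit AXI bus, 17 beats on a 64-bit one). Use AXI4-Stream + DMA, and overlap block transfer with the permutation where possible.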
▶ Constant-time guarantee — The algorithm itself is data-independent, but ensure that the valid/ready handshake and padding controller do not leak timing information based on input length.
▶ Test vectors: Comparing only the final digest misses bit-order bugs in θ/ρ. Use CAVP KATs together with NIST's published SHA-3 examples with intermediate values, and the FIPS 202 Appendix B conversion rules, to verify intermediate round states.
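The I/O bottleneck in the list above can be estimated with simple arithmetic. The model below is a rough sketch under stated assumptions: an iterative core at 24 cycles per block, one bus beat per clock, and an `overlapped` flag that stands for a hypothetical double-buffered input that loads the next block while the current one is being permuted. The 100 MHz clock is an arbitrary example value.

```python
import math

PERMUTATION_CYCLES = 24  # iterative core, one round per clock

def cycles_per_block(rate_bits, bus_bits, overlapped):
    """Cycles spent per absorbed block, including bus transfer."""
    beats = math.ceil(rate_bits / bus_bits)
    if overlapped:
        # Transfer hidden behind the permutation; the slower of the two wins.
        return max(PERMUTATION_CYCLES, beats)
    return PERMUTATION_CYCLES + beats

def throughput_gbps(rate_bits, bus_bits, f_clk_hz, overlapped):
    cycles = cycles_per_block(rate_bits, bus_bits, overlapped)
    return rate_bits * f_clk_hz / cycles / 1e9

for bus in (32, 64):
    for ov in (False, True):
        print(f"bus={bus:2d} overlapped={ov!s:5s} "
              f"cycles/block={cycles_per_block(1088, bus, ov):2d} "
              f"throughput={throughput_gbps(1088, bus, 100e6, ov):.2f} Gbps")
```

With a 32-bit bus the transfer (34 beats) exceeds the permutation time even when overlapped, so widening the bus or the stream interface matters more than speeding up the round logic.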
🛡️ 5. Side-Channel Countermeasures — A Practical Requirement
Keccak's structure of four linear steps plus one nonlinear step (χ) is a significant advantage for masking. The AND-NOT operation in χ is the primary DPA target; Boolean masking combined with share-decorrelation — DOM (Domain-Oriented Masking) or Threshold Implementation (TI) — is the de facto standard.
▶ Masking Overhead, 2-share basis (figures not reproduced here)
▶ Add random mask refresh at the Round Constant injection point, and include redundancy / round-recompute verification for fault injection resistance. Both are effectively mandatory for FIPS 140-3 / Common Criteria EAL certification tracks.
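A functional model of a 2-share masked χ row shows why only the AND needs special treatment. This is a sketch of a DOM-style independent-share AND, assuming fresh 64-bit randomness per AND; it models correctness only, not the register stage that real DOM hardware places on the cross-domain terms, and it makes no claim about leakage. All function names are illustrative.

```python
import secrets

MASK64 = (1 << 64) - 1

def share(value):
    """Split a 64-bit lane into two Boolean shares."""
    m = secrets.randbits(64)
    return (value ^ m, m)

def unshare(s):
    return s[0] ^ s[1]

def dom_and(u, v, r):
    """2-share AND; cross terms are blinded with fresh randomness r."""
    u0, u1 = u
    v0, v1 = v
    q0 = (u0 & v0) ^ ((u0 & v1) ^ r)   # in hardware, the blinded cross term is registered
    q1 = (u1 & v1) ^ ((u1 & v0) ^ r)
    return (q0, q1)

def masked_chi_row(row):
    """chi on one row of five shared lanes: a ^ (~b & c)."""
    out = []
    for x in range(5):
        a = row[x]
        b0, b1 = row[(x + 1) % 5]
        c = row[(x + 2) % 5]
        not_b = (~b0 & MASK64, b1)      # NOT is linear: invert one share only
        q = dom_and(not_b, c, secrets.randbits(64))
        out.append((a[0] ^ q[0], a[1] ^ q[1]))
    return out

def chi_row(row):
    """Unmasked reference for comparison."""
    return [row[x] ^ ((~row[(x + 1) % 5]) & row[(x + 2) % 5]) for x in range(5)]

plain = [secrets.randbits(64) for _ in range(5)]
masked = masked_chi_row([share(v) for v in plain])
print([unshare(s) for s in masked] == chi_row(plain))
```

The linear steps θ, ρ, π and ι can simply be applied to each share independently (ι to one share only), which is why the masking cost concentrates in χ.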
🚀 6. PQC Integration — The Real Driver of Demand
NIST finalized its first PQC (post-quantum cryptography) standards in August 2024 as FIPS 203 (ML-KEM), FIPS 204 (ML-DSA), and FIPS 205 (SLH-DSA), and the core algorithms rely heavily on SHA-3 / SHAKE for internal hashing and XOF calls.
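The call pattern these algorithms impose can be illustrated with `hashlib`. The sketch below mimics ML-KEM-style matrix expansion, where one 32-byte seed is combined with two index bytes per matrix entry and fed to SHAKE128. The function name, the 168-byte output length, and the index byte order are assumptions for illustration and should be checked against FIPS 203 before use as a reference.

```python
import hashlib

def expand_matrix_streams(rho, k, nbytes=168):
    """Produce one SHAKE128 stream per (i, j) entry of a k x k matrix."""
    assert len(rho) == 32
    base = hashlib.shake_128(rho)        # absorb the shared seed once
    streams = {}
    for i in range(k):
        for j in range(k):
            ctx = base.copy()            # software analogue of a state restore
            ctx.update(bytes([j, i]))    # per-entry index bytes (order assumed)
            streams[(i, j)] = ctx.digest(nbytes)
    return streams

rho = bytes(range(32))
streams = expand_matrix_streams(rho, k=3)
print(len(streams), "independent XOF invocations from one seed")
```

Even for k = 3 this is nine short-message XOF calls for a single matrix, so per-call setup and state reset cost show up directly in key-generation latency.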
🧠 SoC Design Implications:
▶ The SHA-3 core will be invoked repeatedly — dozens to hundreds of times — with short messages, not as a single-shot call. Session caching and fast state reset dominate throughput in this pattern.
▶ A mid-operation state save/restore interface for SHAKE is effectively required.
▶ KMAC, cSHAKE, TupleHash, and ParallelHash (SP 800-185) all reduce to the same core → a function code field in the command decoder is the standard approach.
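The way SP 800-185 functions reduce to the same core can be sketched as a function-code table plus the standard's string encodings. The `FUNCTION_CODES` layout is a hypothetical command-decoder table, not a defined register map; `left_encode`, `encode_string`, and `bytepad` follow the SP 800-185 definitions, and 0x04 is the cSHAKE first pad byte (domain bits 00 followed by the first 1 of pad10*1).

```python
def left_encode(x):
    """SP 800-185 left_encode: byte count, then x in big-endian."""
    n = max(1, (x.bit_length() + 7) // 8)
    return bytes([n]) + x.to_bytes(n, "big")

def encode_string(s):
    """Prefix a byte string with its length in bits."""
    return left_encode(8 * len(s)) + s

def bytepad(data, w):
    """Prefix with left_encode(w) and zero-pad to a multiple of w bytes."""
    z = left_encode(w) + data
    return z + bytes(-len(z) % w)

# Hypothetical decoder table: function code -> sponge configuration.
FUNCTION_CODES = {
    "SHA3-256":  {"rate_bytes": 136, "suffix": 0x06},
    "SHAKE128":  {"rate_bytes": 168, "suffix": 0x1F},
    "SHAKE256":  {"rate_bytes": 136, "suffix": 0x1F},
    "cSHAKE128": {"rate_bytes": 168, "suffix": 0x04},
    "cSHAKE256": {"rate_bytes": 136, "suffix": 0x04},
}

def cshake_prefix(function_name, customization, rate_bytes):
    """First absorbed block(s) of cSHAKE: bytepad(encode(N) || encode(S), rate)."""
    return bytepad(encode_string(function_name) + encode_string(customization),
                   rate_bytes)

# KMAC128 is cSHAKE128 with function name "KMAC"; the key block follows this prefix.
prefix = cshake_prefix(b"KMAC", b"", FUNCTION_CODES["cSHAKE128"]["rate_bytes"])
print(len(prefix), prefix[:10].hex())
```

Because the prefix always fills whole rate blocks, firmware or a small sequencer can build it, and the core itself only needs the rate and suffix selected by the function code.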
🪶 7. Lightweight Domain — SHA-3 vs. ASCON
| Item | SHA-3 / Keccak | ASCON-Hash |
|---|---|---|
| Internal State | 1600-bit | 320-bit |
| Structure | Sponge | Sponge (similar) |
| Area | Baseline | Less than 1/4 |
| Target | General-purpose / PQC / Server | IoT / Sensor / RFID |
▶ Choose SHA-3 when the chip handles "heavy security" use cases — PQC, TLS, disk encryption. When the target is a deeply constrained IoT/sensor node, ASCON-Hash is the rational choice. They are not substitutes; they target different market segments.
🔮 8. Outlook
▶ With PQC mandates accelerating, SHA-3 / SHAKE accelerators are shifting from optional to baseline IP. Government, defense, and financial SoC specifications increasingly require simultaneous compliance with FIPS 202, 203, and 204.
▶ The trend toward reusing a single Keccak core for multiple functions — hashing, KMAC, DRBG, PQC internal PRF — is intensifying. Interface flexibility (state save/restore, domain codes, streaming) is becoming the key differentiator over raw throughput.
▶ Side-channel-hardened IP is becoming the norm; without built-in masking, shuffling, and redundancy, entering government certification tracks is increasingly difficult.
▶ Lightweight + post-quantum hybrid SoC — pairing ASCON (LWC) and SHA-3/SHAKE (PQC) on the same die — is expected to grow in IoT gateway and edge applications.
✅ 9. Designer Checklist
✓ Are the FIPS 202 domain bits (0x06 / 0x1F / 0x07) and pad10*1 correctly implemented at the bit level, LSB-first?
✓ Is the 1600-bit state modeled as a 5×5×64 lane array? Is the endianness mapping exactly as specified in FIPS 202 Appendix B?
✓ Do the intermediate states of θ / ρ / π / χ / ι match a golden reference — not just the final digest? (CAVP KAT alone is insufficient)
✓ Does the iterative vs. unrolled choice align with your throughput, area, and power targets?
✓ Is the I/O bottleneck eliminated via AXI4-Stream / DMA?
✓ Are side-channel countermeasures (DPA / fault injection) appropriate for your threat model?
✓ Have you reserved interface provisions for SHAKE state save/restore, domain codes, and KMAC / cSHAKE extensions?
✓ Has three-way validation been applied: NIST CAVP + NIST SHA-3 examples with intermediate values + hashlib golden reference?
📚 References
- NIST FIPS 202: SHA-3 Standard
- Keccak Team Specifications
- NIST PQC Standards (FIPS 203 / 204 / 205)
📌 Disclaimer: This document is a technical brief for SoC security IP designers. The FPGA/ASIC benchmark figures cited (LUT, GE, Gbps) are estimated ranges that vary widely depending on process node, EDA tool, and synthesis settings. Re-synthesize and validate in your own environment before making design decisions. For security certifications (FIPS 140-3, Common Criteria EAL, etc.), consult the latest guidelines from your certification authority.
Curated from a semiconductor and SoC design-and-verification perspective — each post is reviewed before publishing.
Based on publicly available data and cited sources. Last updated: June 8, 2026