
TL;DR: In this blog post on hardening PQC for Pavona's Asymmetric Cryptography Coprocessor (ACC), we cover why masking is needed, how it applies to the post-quantum key-encapsulation mechanism ML-KEM, and what our implementation looks like. We build on Pavona’s pre-existing ML-KEM implementation. This work adds first-order masking to ML-KEM decapsulation and key generation through software changes alone, relying on ACC's existing masked Keccak interface; no masking-specific hardware changes are necessary. While some performance overhead is expected, our masked decapsulation incurs a performance overhead of only 3.4× for ML-KEM-512, 2.9× for ML-KEM-768, and 2.6× for ML-KEM-1024. Memory usage for all three variants stays under 22 KB, well within ACC's default memory size. The prototype implementation is available here and will be upstreamed to Pavona.

What Is Post-Quantum Cryptography?

For decades, the security of public-key cryptography has rested on mathematical problems that classical computers find intractable — factoring large integers (RSA) or solving discrete logarithms (ECDSA, ECDH). These primitives secure virtually everything: TLS, SSH, code signing, secure messaging and so on.

That foundation is under threat. In 1994, Peter Shor showed that a sufficiently powerful quantum computer could solve both problems in polynomial time (Shor94). While such a machine does not exist yet, the cryptographic community is not waiting: the time to migrate is now, before quantum hardware catches up. Encrypted traffic captured today could be decrypted later once quantum computers mature — a threat known as harvest now, decrypt later.

Post-quantum cryptography (PQC) is the field designing algorithms that remain secure even against a quantum adversary. In 2024, NIST finalized its first PQC standards, with ML-KEM (formerly Kyber) as the primary key encapsulation mechanism. ML-KEM's security relies on the hardness of the Module Learning With Errors (MLWE) problem — a lattice-based problem for which no efficient quantum algorithm is known.

Where Are We With Post-Quantum Cryptography?

This work sits at the end of a long line of research that started at CHES 2022 in Leuven as a collaboration between ZeroRISC and the Max Planck Institute for Security and Privacy (MPI-SP). We began by proposing new instructions for the OpenTitan Big Number Accelerator (OTBN) to support PQC. Our version has since evolved so far from the original design that we gave it a new name to reflect its expanded capabilities: ACC, the Asymmetric Cryptography Coprocessor.

The First Paper: Towards ML-KEM and ML-DSA on OpenTitan

We began with Towards ML-KEM and ML-DSA on OpenTitan, in which we proposed four classes of vector instructions. The original design only accelerated classical cryptography, particularly ECC and RSA; these proposed instructions were selected to extend its capabilities to ML-KEM and ML-DSA:

bn.{addv,subv}(m){.16h,.8s}: vector addition/subtraction with optional modular reduction on 16-bit and 32-bit inputs.
bn.shv{.16h,.8s}: vector shift.
bn.trn{1,2}{.16h,.8s}: vector transpose, specifically designed to speed up the (Inverse) Number Theoretic Transform (NTT and INTT) in ML-KEM and ML-DSA.
bn.mulv(m)(.l){.16h,.8s}: vector multiplication with optional modular reduction in hardware (Montgomery reduction).
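As a rough mental model of the lane-wise behavior of these instructions, the sketch below emulates a modular vector addition over sixteen 16-bit lanes packed into one 256-bit value. The lane packing order, the helper names (`pack_lanes`, `unpack_lanes`, `addvm_16h`), and the single conditional subtraction used for reduction are assumptions made for illustration; they are not taken from the ACC specification.

```python
Q = 3329            # ML-KEM modulus
LANES = 16          # .16h: sixteen 16-bit lanes in one 256-bit register
LANE_BITS = 16
LANE_MASK = (1 << LANE_BITS) - 1


def pack_lanes(values):
    """Pack 16 lane values into one 256-bit integer (lane 0 in the low bits; assumed order)."""
    assert len(values) == LANES
    reg = 0
    for i, v in enumerate(values):
        reg |= (v & LANE_MASK) << (i * LANE_BITS)
    return reg


def unpack_lanes(reg):
    """Split a 256-bit integer back into its 16 lane values."""
    return [(reg >> (i * LANE_BITS)) & LANE_MASK for i in range(LANES)]


def addvm_16h(a_reg, b_reg, modulus=Q):
    """Lane-wise (a + b) mod modulus, assuming both inputs are already reduced."""
    out = []
    for a, b in zip(unpack_lanes(a_reg), unpack_lanes(b_reg)):
        s = a + b
        if s >= modulus:      # one conditional subtraction suffices for reduced inputs
            s -= modulus
        out.append(s)
    return pack_lanes(out)


a = pack_lanes([3328] * LANES)
b = pack_lanes(list(range(LANES)))
result = unpack_lanes(addvm_16h(a, b))
print(result)
```

Every lane is processed independently, which is what lets a single instruction handle 16 polynomial coefficients at once.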

Each instruction takes one cycle, except bn.mulv, which takes 4 cycles, and bn.mulvm, which takes 12 cycles. (For context, the maximum clock frequency at timing closure in our current design generally ranges from 100 MHz to 500+ MHz, depending on the targeted process node.)

Apart from the instructions, the first paper also introduced an important architectural feature: a direct connection from OTBN to the KMAC block, which handles Keccak (SHA-3/SHAKE) efficiently in hardware. This significantly boosts performance on-chip. Overall, ML-KEM and ML-DSA run 6–9× faster than a non-vectorized implementation using basic instructions, with instruction memory (IMEM) increased from 8 to 32KB and data memory (DMEM) from 8 to 128KB.

The Second Paper: Improving ML-KEM and ML-DSA on OpenTitan

Even though the results were significant, we identified three remaining bottlenecks:

The cycle count of bn.mulv and bn.mulvm was high.
The maximum achievable frequency was limited by the sequential adder used in the vectorized addition/subtraction instructions.
The 64x64-bit original multiplier was not reused entirely for multiplication of 16-bit inputs (only four 16x16-bit multiplications per cycle).

This motivated a follow-up, Improving ML-KEM and ML-DSA on OpenTitan, in which we made two key changes. First, we redesigned the adder underlying bn.addv(m) and bn.subv(m) — the instructions themselves are unchanged, but they are now backed by a faster adder that increases the maximum operating frequency of the chip. Second, we redesigned the multiplication instruction entirely, bringing modular reduction back into software and introducing a new, more expressive bn.mulv allowing for 16 16x16-bit multiplications per cycle:

bn.mulv(.l){.16h,.8s}{.even,.odd}(.acc)(.z)(.lo,.hi): vector multiplication with the following suffix groups:
- .16h / .8s: operate on 16-bit halfword or 32-bit single-word lanes.
- .even / .odd: select even- or odd-indexed lanes from the source operands.
- .acc: enable accumulation to ACC and ACCH WSRs.
- .z: clear ACC and ACCH registers before accumulation.
- .lo / .hi: return the lower or upper half of the full product.
- .l: multiplication with a specific lane index of a dedicated lane register sw0 (w16) or sw1 (w17).

This instruction takes 1 cycle for non-modular multiplication for 16-bit inputs and 2 cycles for 32-bit inputs. Modular multiplications, now handled in software, cost only 4 cycles for 16-bit and 7 cycles for 32-bit inputs.
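To make "modular multiplication in software" concrete, here is a minimal sketch of a textbook Montgomery multiplication for 16-bit lanes, built from the lower half of a product in the way the .lo / .hi suffixes suggest. The post does not spell out the actual instruction sequence used on ACC, so the step ordering, the function names (`montgomery_mul`, `to_mont`), and the constant `Q_NEG_INV` are illustrative assumptions.

```python
Q = 3329                             # ML-KEM modulus
R_BITS = 16
R = 1 << R_BITS                      # Montgomery radix for 16-bit lanes
Q_NEG_INV = (-pow(Q, -1, R)) % R     # -Q^{-1} mod R (requires Python 3.8+)


def montgomery_mul(a, b):
    """Return a * b * R^{-1} mod Q for 0 <= a, b < Q."""
    prod = a * b
    lo = prod & (R - 1)                 # lower half of the product
    m = (lo * Q_NEG_INV) & (R - 1)      # makes prod + m*Q divisible by R
    t = (prod + m * Q) >> R_BITS        # exact division by R, result is below 2*Q
    return t - Q if t >= Q else t       # final conditional subtraction


def to_mont(x):
    """Map x into the Montgomery domain: x * R mod Q."""
    return (x * R) % Q


# Multiplying a Montgomery-domain value by a plain value yields a plain product.
print(montgomery_mul(to_mont(1234), 2345), (1234 * 2345) % Q)
```

In practice one operand (for example a twiddle factor) is precomputed in the Montgomery domain, so the conversion cost disappears from the inner loop.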

With this new design, we achieved a 17% speedup over the already-fast implementation from the first paper, while keeping the same area with an increased maximum frequency for the chip.

The Engineering Effort: Further Refinement

Beyond new instructions, we later contributed a set of targeted optimizations that further improve the practicality of PQC on ACC:

Stack-optimized ML-DSA: signing stack usage was reduced from 121 KB down to 11 KB for ML-DSA-87 — a 91% reduction — making it possible to reduce DMEM from 128KB to only 32KB. The tradeoff is an 80–140% performance overhead for signing; all other operations are negligibly affected.
Faster rejection sampling:
- ML-KEM: ~13% speedup across all operations.
- ML-DSA: ~52% speedup across all operations.
Redesigned SHA-3/SHAKE interface: ~6% speedup across all operations. ZeroRISC has since improved this interface further, specifically for the masking case.

Our Final PQC-Capable Instruction Set

Putting all the work together, the current PQC instruction set on ACC is:

bn.{addv,subv}(m){.16h,.8s}: vector addition/subtraction with optional modular reduction — retained from the first paper, now backed by the new higher-frequency adder.
bn.shv{.16h,.8s}: vector shift — retained from the first paper.
bn.trn{1,2}{.16h,.8s}: vector transpose — retained from the first paper.
bn.mulv(.l){.16h,.8s}{.even,.odd}(.acc)(.z)(.lo,.hi): vector multiplication, the redesigned version from the second paper, which replaces the original bn.mulv(m)(.l){.16h,.8s}.
32KB of IMEM and 32KB of DMEM.

For more detail on the PQC instructions, see our previous blog posts here and here.

Real-World Deployment of Post-Quantum Cryptography

Choosing a quantum-resistant algorithm is necessary, but not sufficient. Even a mathematically sound scheme can leak its secret key through physical side channels — unintended information emitted by the hardware during computation.

The most well-known side channels are:

Power analysis: the dynamic power consumption of a chip fluctuates with the data it processes. An attacker with an oscilloscope and a few thousand measurements can statistically recover secret key material from an unprotected implementation.
Electromagnetic (EM) emanations: similar information can leak as EM radiation from the chip.
Timing: variable-time operations can leak information through secret-dependent branches or memory access patterns.

These are not theoretical concerns. One of the most powerful classes of attacks against cryptographic hardware is the differential power analysis (DPA) attack, introduced by Kocher et al. in 1999. The key insight is simple: the power consumed by a circuit at any given moment depends on the data being processed. A transistor switching from 0 to 1 consumes more energy than one staying at 0. This means that if a secret value influences a computation, it leaves a measurable fingerprint in the power trace.

DPA usually requires the attacker to collect thousands of traces while the device performs the target operation multiple times with different known inputs. By hypothesizing a small part of the secret key — say, one byte — and computing a predicted intermediate value for each trace, the attacker can correlate predictions against measurements. If the hypothesis is correct, the correlation spikes.
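The hypothesize-predict-correlate loop described above can be simulated in a few lines. The sketch below is a toy correlation-based variant of DPA (often called CPA) against a made-up leakage model: one sample per trace equal to the Hamming weight of (known input XOR secret byte) plus Gaussian noise. The leakage model, the noise level, and all function names are assumptions for illustration; real attacks work on measured traces with many samples.

```python
import random


def hamming_weight(x):
    return bin(x).count("1")


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5 if vx and vy else 0.0


def simulate_traces(secret_byte, num_traces, rng, noise=1.0):
    """Toy leakage: HW(input XOR key) plus Gaussian noise, one sample per trace."""
    inputs = [rng.randrange(256) for _ in range(num_traces)]
    traces = [hamming_weight(p ^ secret_byte) + rng.gauss(0.0, noise) for p in inputs]
    return inputs, traces


def recover_key_byte(inputs, traces):
    """Return the hypothesis whose predicted leakage correlates best with the traces."""
    def score(guess):
        predictions = [hamming_weight(p ^ guess) for p in inputs]
        return pearson(predictions, traces)
    return max(range(256), key=score)


rng = random.Random(2024)
inputs, traces = simulate_traces(secret_byte=0x5A, num_traces=1000, rng=rng)
recovered = recover_key_byte(inputs, traces)
print(hex(recovered))
```

Only the correct hypothesis predicts the data-dependent part of every trace, so its correlation stands out from the 255 wrong guesses.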

DPA has been demonstrated against virtually every cryptographic primitive deployed in embedded and smart card contexts: AES, RSA, ECC, and more recently lattice-based schemes, especially ML-KEM decapsulation, which is the main focus of this work.

A successful attack depends on many factors, but if your implementation runs on hardware that an attacker can physically probe (IoT devices, smart cards, hardware security modules), it is vulnerable unless explicitly protected. The standard protection we will be talking about throughout this post is masking.

What is Masking?

Masking is the standard countermeasure against power and EM side-channel attacks. The core idea is simple: never let any intermediate value in the computation depend on the secret alone. Instead, we split the secret value into multiple shares and perform all computations on those shares. The shares have the property that the joint distribution of any strict subset of them is independent of the secret. In other words, even if an attacker probes n - 1 shares, they gain zero information about the secret.

There are two types of masking. Boolean masking splits a secret value x into n shares such that:

x = x[1] ^ x[2] ^ ... ^ x[n]

Arithmetic masking splits it additively:

x = x[1] + x[2] + ... + x[n]  (mod q)

Each individual share is uniformly random and reveals nothing about x on its own. Masking with n shares is called (n - 1)-th order masking; first-order masking therefore uses two shares.
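A minimal sketch of both sharing types follows, using Python's `secrets` module as a stand-in randomness source. The function names and the 16-bit default width are illustrative assumptions, not part of the ACC implementation.

```python
import secrets
from functools import reduce

Q = 3329  # ML-KEM modulus


def boolean_share(x, n, bits=16):
    """Split x into n Boolean shares whose XOR equals x."""
    shares = [secrets.randbits(bits) for _ in range(n - 1)]
    last = reduce(lambda a, b: a ^ b, shares, x)   # x XOR all random shares
    return shares + [last]


def boolean_unshare(shares):
    return reduce(lambda a, b: a ^ b, shares, 0)


def arithmetic_share(x, n, q=Q):
    """Split x into n arithmetic shares whose sum equals x modulo q."""
    shares = [secrets.randbelow(q) for _ in range(n - 1)]
    last = (x - sum(shares)) % q
    return shares + [last]


def arithmetic_unshare(shares, q=Q):
    return sum(shares) % q


secret = 1234
print(boolean_unshare(boolean_share(secret, 2)), arithmetic_unshare(arithmetic_share(secret, 3)))
```

A masked implementation never calls the unshare functions on secret data during the computation; they appear here only to check that the shares recombine correctly.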

The reason we need both types is that different parts of a cryptographic scheme in general, and of ML-KEM in particular, operate differently: some routines work on byte sequences and are naturally Boolean, while others involve modular arithmetic and are more naturally arithmetic. Moving between the two requires explicit Boolean-to-Arithmetic (B2A) and Arithmetic-to-Boolean (A2B) conversions, which are among the most expensive operations in a masked implementation, especially for a prime modulus as in the case of ML-KEM and ML-DSA. A large body of literature has proposed efficient conversion algorithms [BCZ18, BBE+18, SPOG19, BC22]. In the next sections, you will see exactly where these conversions appear in the ML-KEM decapsulation flow.
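To give a feel for what a conversion looks like, the sketch below implements the classic first-order B2A conversion modulo a power of two due to Goubin (CHES 2001). This is not the BC22 construction used in our implementation and it does not handle a prime modulus; it is shown only because it is short. The function name and the 16-bit width are assumptions for illustration.

```python
import secrets

K = 16
MASK = (1 << K) - 1


def b2a_first_order(x_masked, r):
    """Convert Boolean shares (x_masked, r), with x = x_masked ^ r,
    into A such that x = (A + r) mod 2^K, without ever computing x."""
    gamma = secrets.randbits(K)                  # fresh randomness
    t = ((x_masked ^ gamma) - gamma) & MASK
    t ^= x_masked
    gamma ^= r
    a = ((x_masked ^ gamma) - gamma) & MASK
    a ^= t
    return a


x = 0x2B7E
r = secrets.randbits(K)
a = b2a_first_order(x ^ r, r)
print((a + r) & MASK == x)
```

The conversion relies on the fact that (x' XOR r) - r, viewed as a function of r, is affine over GF(2), so it can be evaluated on two randomized inputs and recombined.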

Our Implementation

Which Operations in ML-KEM Require Protection?

We assume an attacker targeting the long-term secret key, which is used in key generation and — most critically — in decapsulation. Let us focus on decapsulation, since it is the heart of this line of research.

The diagram below summarizes the decapsulation flow and where masking is needed:

Yellow boxes represent parts that are arithmetically masked (linear operations like the NTT are straightforwardly masked by performing the operation on each share).
Green boxes are Boolean-masked computations.
Blue boxes are unmasked, since no secret is involved.
A transition from yellow to green indicates an A2B conversion; from green to yellow, a B2A conversion.
The neon green box requires a special gadget: a masked comparison.
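The claim that linear operations are masked by applying them to each share can be checked on a toy example. The sketch below uses a naive 4-point transform modulo 17 as a stand-in for the NTT; the real ML-KEM NTT works on 256 coefficients modulo 3329, and all names and parameters here are illustrative assumptions.

```python
Q = 17          # toy prime, chosen so that a 4th root of unity exists
OMEGA = 4       # 4 has multiplicative order 4 modulo 17
N = 4


def toy_ntt(poly):
    """Naive O(N^2) transform: out[i] = sum_j poly[j] * OMEGA^(i*j) mod Q."""
    return [sum(c * pow(OMEGA, i * j, Q) for j, c in enumerate(poly)) % Q
            for i in range(N)]


def masked_ntt(shares):
    """Apply the transform to each arithmetic share independently."""
    return [toy_ntt(share) for share in shares]


def unmask(shares):
    """Recombine arithmetic shares coefficient by coefficient."""
    return [sum(column) % Q for column in zip(*shares)]


secret = [3, 1, 4, 1]
share0 = [5, 16, 2, 9]                                   # plays the role of the random share
share1 = [(s - a) % Q for s, a in zip(secret, share0)]   # secret - share0 mod Q

print(unmask(masked_ntt([share0, share1])) == toy_ntt(secret))
```

Because the transform is linear, the transformed shares are themselves a valid sharing of the transformed secret, and no interaction between shares is needed.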

As you can see, we need four big gadgets:

Masked one-bit compression at the end of the decryption routine.
Masked one-bit decompression at the beginning of the re-encryption routine.
Masked binomial sampler to generate the secret vectors.
Masked comparison to compare the re-encrypted ciphertext with the public input ciphertext. Leaking where the comparison fails leads to a decryption oracle attack; as such, this is one of the most important gadgets in ML-KEM decapsulation masking research.

Similar to decapsulation, we need to mask all the paths that involve the secret in key generation.

Implementation Design Choice

These big gadgets depend on multiple smaller gadgets. Each smaller gadget must satisfy certain security conditions, and so must their composition. This diagram maps out every masking gadget we are aware of, a broad landscape of options from the literature.

Single-line boxes are for t-probing-secure gadgets (ISW03), also called NI-secure (following the terminology of BBD+16).
Double-line boxes are for SNI-secure gadgets (BBD+16).
Pill-shape boxes are for PINI-secure gadgets (CS20).
Boxes with shadow are for NIo-secure gadgets.
Orange-border boxes are gadgets which use a table-based approach.

We will not go into the details of these security definitions here, but the key takeaways are:

A composition of NI gadgets is not necessarily t-probing-secure, so NI alone is not enough for building larger secure constructions.
SNI gadgets fix this: they compose securely. However, they require refresh gadgets to be inserted between them, which hurts efficiency.
PINI gadgets go one step further: they allow secure composition without the refresh gadgets, giving us both security and efficiency.

The larger gadgets above all reduce to a common set of base gadgets: A2B and B2A conversions (both modulo a power-of-two and modulo a prime), which in turn rest on two smaller primitives — a secure AND (secand) and a secure modular addition of Boolean shares (secadd).
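For intuition, here is a minimal sketch of a secure AND in the style of the ISW03 multiplication, operating on Boolean shares. The post does not state which secand variant the implementation uses, so treat the structure below and the name `isw_and` as an illustrative assumption rather than the gadget deployed on ACC.

```python
import secrets
from functools import reduce


def xor_all(values):
    return reduce(lambda a, b: a ^ b, values, 0)


def isw_and(x_shares, y_shares, bits=16):
    """Return Boolean shares of (x AND y) given Boolean shares of x and y."""
    n = len(x_shares)
    z = [x_shares[i] & y_shares[i] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            r = secrets.randbits(bits)          # fresh randomness per share pair
            z[i] ^= r
            # Evaluation order matters: r is mixed in before the second cross term.
            z[j] ^= (r ^ (x_shares[i] & y_shares[j])) ^ (x_shares[j] & y_shares[i])
    return z


x_shares = [0x0F0F, 0x1234, 0xABCD]
y_shares = [0x00FF, 0x4321, 0x5555]
z_shares = isw_and(x_shares, y_shares)
print(xor_all(z_shares) == (xor_all(x_shares) & xor_all(y_shares)))
```

The cross terms x[i] & y[j] are what make AND non-linear; the fresh random values prevent any of them from being combined into a value that depends on the unmasked inputs.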

Now comes the important question: which gadgets should we use for ML-KEM on Pavona's ACC?

We implemented two approaches for these conversions and compared them directly. The first is a non-bitsliced construction built on the Kogge-Stone secadd of CGTV15; the second is the bitsliced construction of BC22. The bitsliced approach was clearly superior across all three parameter sets. Relative to the unmasked baseline, the bitsliced variant incurs an overhead of roughly 2.6× to 3.4×, whereas the non-bitsliced approach ranges from 6.7× to 10.7×. In absolute terms this makes the bitsliced variant roughly 2.5× to 3× faster than its non-bitsliced counterpart. Both implementations are available here.
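To illustrate what "bitsliced" means here, the sketch below transposes a batch of 16-bit words into bit-planes and adds two batches with a ripple-carry adder that uses only XOR and AND on whole planes. It is deliberately unmasked and the function names are assumptions; in a masked adder, each AND would be replaced by a secure AND gadget operating on shares.

```python
def to_bitsliced(words, width):
    """planes[b] has bit i set iff bit b of words[i] is set."""
    planes = [0] * width
    for i, w in enumerate(words):
        for b in range(width):
            planes[b] |= ((w >> b) & 1) << i
    return planes


def from_bitsliced(planes, count):
    """Inverse of to_bitsliced for `count` words."""
    return [sum(((planes[b] >> i) & 1) << b for b in range(len(planes)))
            for i in range(count)]


def bitsliced_add(a_planes, b_planes):
    """Ripple-carry addition modulo 2^width, on all words in parallel."""
    carry = 0
    out = []
    for a, b in zip(a_planes, b_planes):
        out.append(a ^ b ^ carry)
        carry = (a & b) ^ (carry & (a ^ b))    # majority of a, b, carry
    return out


a_words = [1, 2, 65535, 1000]
b_words = [1, 3, 1, 2329]
sums = from_bitsliced(
    bitsliced_add(to_bitsliced(a_words, 16), to_bitsliced(b_words, 16)), len(a_words))
print(sums)
```

Each logical operation now acts on one bit position of many words at once, so the number of costly non-linear gadget invocations scales with the bit width rather than with the number of coefficients.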

Given this clear advantage, we adopted the bitsliced BC22 construction and otherwise followed the choices its authors recommend:

A2B and B2A conversions from BC22, together with the secure ripple-carry adder from the same paper.
Masked one-bit compression: HOcompress from CGMZ21.
Masked one-bit decompression: one-bit decompress (described in BGR21) with 1bitB2A from SPOG19, following the algorithm of CGMZ23.
Masked binomial sampler: CBD from BC22.
Masked comparison: HOcompress from CGMZ21 with A2B conversion from BC22.

This diagram gives an overview of the gadgets used to mask the full ML-KEM decapsulation flow described above.

From Secure Gadgets to a Secure Implementation

In theory, composing these gadgets secures our implementation against DPA attacks. In practice, the security of the actual implementation depends on several further aspects worth highlighting:

Register whitening: we clear registers between operations to prevent transition leakage — leakage that arises not from a value itself, but from the change between two consecutive values on the same wire or register.
Efficient random generation modulo a prime: generating random values modulo a prime is a crucial part of masking in general, and doing it efficiently is an even harder problem. We exploit the cheap random source URND on the ACC. Instead of sampling directly mod q, we rejection-sample a full vector modulo 19*q and then reduce it mod q; if any coefficient fails, the entire vector is discarded and resampled. This increases the acceptance rate from approximately 81.3% to 96.5% and so keeps the resampling rate low.
Any-order generality with a specialized 2-share path: All gadgets are written to support any number of shares, fixed at compile time, which keeps the higher-order implementation available for research and evaluation. For the implementation targeting the ACC, we specialize to two shares (first order), driven by the hardware: the masked symmetric operations in the decapsulation flow use the hardware-masked KMAC block, which currently supports only first order, and the current ACC DMEM is sized to fit first-order masked ML-KEM and ML-DSA.
Masking of KDF: Most prior works omit masking the final KDF at the end of decapsulation — a reasonable choice, as their goal is solely to protect the long-term secret key. In line with the classical cryptography in ACC, we go one step further and also protect the ephemeral shared key output. Thanks to our efficient first-order masked Keccak, this comes at zero additional cost.
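The acceptance rates quoted for random generation modulo a prime in the list above can be reproduced with a short calculation. The sketch below assumes that direct sampling draws 12-bit candidates and that the 19*q variant draws 16-bit candidates; these widths are an inference that matches the quoted percentages, not something stated in this post. It also samples one coefficient at a time using Python's `secrets` module as a stand-in for URND, whereas the implementation works on whole vectors.

```python
import secrets

Q = 3329  # ML-KEM modulus


def acceptance_rate(bound, bits):
    """Probability that a uniform `bits`-bit value falls below `bound`."""
    return bound / (1 << bits)


def sample_mod_q(factor=19, bits=16):
    """Rejection-sample below factor*Q from `bits`-bit randomness, then reduce mod Q."""
    bound = factor * Q
    while True:
        candidate = secrets.randbits(bits)   # stand-in for URND
        if candidate < bound:
            # Every residue has exactly `factor` preimages below the bound,
            # so the reduced value is uniform modulo Q.
            return candidate % Q


direct = acceptance_rate(Q, 12)          # sample 12 bits, accept if below Q
wide = acceptance_rate(19 * Q, 16)       # sample 16 bits, accept if below 19*Q
print(f"{direct:.3f} {wide:.3f}")
```

The factor 19 is the largest multiple of q that still fits in 16 bits (19 * 3329 = 63251 < 65536), which is why it wastes so few candidates.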

Performance

Finally, the tables below show cycle counts for the bitsliced masked implementation and the unmasked baseline, averaged over 10 executions, for ML-KEM decapsulation and key generation.

Our masked implementation of ML-KEM decapsulation achieves at most 3.4× overhead for ML-KEM-512, and less than 3× for ML-KEM-768 and ML-KEM-1024. The higher overhead at the smallest security level comes from the masked binomial sampler: ML-KEM-512 requires ETA1 = 3 rather than ETA1 = 2, which demands more cycles to unpack the masked Keccak output into bitsliced form before feeding it to the sampler. The same effect carries over to key generation, where the secret key and error vectors are also sampled using the masked binomial sampler with ETA1 — ML-KEM-512 reaches 2.66× overhead compared to ML-KEM-768 and ML-KEM-1024, both under 2.4×.
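To see why ETA1 matters, it helps to look at the unmasked centered binomial sampler from FIPS 203, which consumes 64 * eta bytes of PRF output per polynomial: 128 bytes for eta = 2 and 192 bytes for eta = 3. The sketch below is a plain, unmasked reference version with illustrative function names; it is not the masked bitsliced CBD gadget from BC22.

```python
Q = 3329  # ML-KEM modulus


def bytes_to_bits(data):
    """Little-endian bit order within each byte, as in FIPS 203."""
    return [(byte >> i) & 1 for byte in data for i in range(8)]


def sample_poly_cbd(data, eta):
    """Map 64*eta bytes to 256 coefficients, each the difference of two eta-bit sums."""
    assert len(data) == 64 * eta
    bits = bytes_to_bits(data)
    coeffs = []
    for i in range(256):
        x = sum(bits[2 * i * eta + j] for j in range(eta))
        y = sum(bits[2 * i * eta + eta + j] for j in range(eta))
        coeffs.append((x - y) % Q)
    return coeffs


poly_eta2 = sample_poly_cbd(bytes([0x01] + [0] * 127), 2)   # 128 input bytes
poly_eta3 = sample_poly_cbd(bytes([0xFF] * 192), 3)         # 192 input bytes
print(poly_eta2[0], poly_eta3[0])
```

With eta = 3, each coefficient draws on 6 input bits instead of 4, so 50% more masked Keccak output has to be produced and rearranged before sampling.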

Memory usage is arguably the more critical constraint: everything must fit within the 32 KB of DMEM.

The bitsliced implementation comfortably fits in the 32 KB DMEM of the ACC, with total memory usage below 22 KB across all three security levels for both key generation and decapsulation.

Conclusion

Post-quantum cryptography addresses the quantum threat at the algorithmic level, but real-world deployments on hardware require a second layer of protection against physical side-channel attacks. Masking provides strong, provable security against power analysis, at the cost of performance overhead that careful implementation can substantially reduce.

The bitsliced masked implementation achieves an overhead of at most 3.4× over unmasked code, with all three security levels fitting comfortably within 22 KB of data memory. For a first-order masked decapsulation running on a custom accelerator, these are competitive numbers. Our implementation does not require new hardware support beyond what already exists for PQC (i.e., vector instructions and the masked KMAC interface). Everything else is implemented in software.

Our implementation is available here and is currently undergoing SCA evaluation. It is also being upstreamed to Pavona. We will continue to update this space with future contributions to open-source secure silicon in support of the Pavona ecosystem.

Interested in learning more about ZeroRISC offerings? Sign up for our early-access program or contact us at info@zerorisc.com.

See https://www.pavona.org to join the mailing lists, express interest in joining, and stay up to date with the latest in open-source secure silicon.

References

Papers

Towards ML-KEM and ML-DSA on OpenTitan — A. Abdulrahman, F. Oberhansl, H. N. H. Pham, J. Philipoom, P. Schwabe, T. Stelzer, A. Zankl, IEEE S&P 2025.
Improving ML-KEM and ML-DSA on OpenTitan — R. Niederhagen, H. N. H. Pham, to appear at CHES 2026.
Shor94 — P. Shor, Algorithms for Quantum Computation: Discrete Logarithms and Factoring, FOCS 1994.
Kocher99 — P. Kocher et al., Differential Power Analysis, CRYPTO 1999.
ISW03 — Y. Ishai, A. Sahai, D. Wagner, Private Circuits: Securing Hardware against Probing Attacks, CRYPTO 2003.
CGTV15 — J.-S. Coron et al., Conversion from Arithmetic to Boolean Masking with Logarithmic Complexity, FSE 2015.
BBD+16 — G. Barthe et al., Strong Non-Interference and Type-Directed Higher-Order Masking, CCS 2016.
BCZ18 — L. Bettale, J.-S. Coron, L. Zeitoun, Improved High-Order Conversion From Boolean to Arithmetic Masking, TCHES 2018.
BBE+18 — G. Barthe et al., Masking the GLP Lattice-Based Signature Scheme at Any Order, EUROCRYPT 2018.
SPOG19 — T. Schneider, C. Paglialonga, T. Oder, T. Güneysu, Efficiently Masking Binomial Sampling at Arbitrary Orders for Lattice-Based Crypto, PKC 2019.
PP19 — P. Pessl, R. Primas, More Practical Single-Trace Attacks on the Number Theoretic Transform, LATINCRYPT 2019.
CS20 — G. Cassiers, F.-X. Standaert, Trivially and Efficiently Composing Masked Gadgets With Probe Isolating Non-Interference, IEEE T-IFS 2020.
CGMZ21 — J.-S. Coron et al., High-Order Polynomial Comparison and Masking Lattice-Based Encryption, 2021.
BGR21 — J. Bos et al., Masking Kyber: First- and Higher-Order Implementations, TCHES 2021.
BC22 — O. Bronchain, G. Cassiers, Bitslicing Arithmetic/Boolean Masking Conversions for Fun and Profit, TCHES 2022.
CGMZ23 — J.-S. Coron et al., Improved Gadgets for the High-Order Masking of Dilithium, TCHES 2023.

Standards

ML-KEM / FIPS 203 — NIST, Module-Lattice-Based Key-Encapsulation Mechanism Standard, 2024.
ML-DSA / FIPS 204 — NIST, Module-Lattice-Based Digital Signature Standard, 2024.
Kyber — Original Kyber submission (predecessor to ML-KEM).

Hardened PQC on Pavona – Masking ML-KEM