ZeroRISC — Trusted silicon, end to end

TL;DR: In this episode of our blog post series on hardening PQC for Pavona's asymmetric cryptography coprocessor (ACC), we describe our DPA-hardened implementation of the post-quantum signature scheme ML-DSA supporting all parameter sets (44, 65, and 87). We closely follow our ML-KEM implementation described in the last blog post: without adding additional hardware, we implement all required masking gadgets in software using bitslicing. The resulting implementation is designed for first-order DPA hardening and incurs a runtime overhead over the unprotected implementation of only 2.4-2.8× for key generation and 3.6-4.1× for signing. Average signing times range from 1.9 million clock cycles (ML-DSA-44) to 3.6 million clock cycles (ML-DSA-87). Everything fits within the default 32 KB DMEM and IMEM of ACC. Our prototype implementation is available here and will be upstreamed to Pavona.

ML-DSA: The default post-quantum signature standard

ML-DSA (FIPS 204), derived from Dilithium, is NIST's primary post-quantum signature standard. Like ML-KEM, it is lattice-based and shares the same arithmetic core, including the number-theoretic transform and the SHAKE/Keccak symmetric primitives. Its parameters, however, are considerably larger, with a bigger modulus as well as larger matrices and vectors.

What sets ML-DSA apart structurally is that signing takes the form of a rejection loop. Each signing attempt expands fresh randomness, combines it with the secret key to build a candidate signature, and then checks whether that candidate fulfills several conditions. If any condition is not met, the candidate is discarded and the loop starts over with new randomness. Depending on the parameter set, 74% to 80% of candidate signatures are rejected, resulting in averages of 4.3, 5.2, and 3.9 attempts per signature for ML-DSA-44, -65, and -87, respectively.

Side-channel protection for ML-DSA has been studied extensively, giving us a well-established foundation to build on. Our masked gadgets follow the scheme of Azouaoui et al. [ABCH+23] as presented at CHES 2023. To meet the needs of the largest number of design integrators, we implemented this baseline with pure software hardening, while not precluding compatibility with other design trade-offs (e.g. in area, time, or performance) via Pavona’s configuration framework. We partition the hardening such that it could also be implemented in hardware with minimal design changes. We made the same decision to use bitsliced gadgets for masked ML-KEM in Part I of this series, meaning our prior efforts carry over naturally given how much structure ML-DSA and ML-KEM share. We defer to Part I for preliminaries on masking.

First-order masked ML-DSA-87 has also been studied by others; we refer the reader to [low26] for a different approach leveraging hardware acceleration.

Before we turn to the implementation, we need to understand what values in ML-DSA are sensitive, and, hence, require side-channel protection. The following figure (inspired by [ABCH+23]) illustrates the computations of ML-DSA key generation.

In the figure, red marks secret masked values, blue marks public ones, and the yellow SecUnmask labels mark where data leaves the masked domain. The seed expands into the public matrix $A$ and the masked secret vectors $s_1$ and $s_2$ , which combine into $t$ ; $t$ is then unmasked and split by Power2Round into the public $t_1$ and $t_0$ .

The next figure (once again inspired by [ABCH+23]) illustrates ML-DSA signing.

Only the secret vectors $s_1$ and $s_2$ and the seed $K$ need to be kept masked, while the matrix $A$ , the high bits $w_1$ , the challenge $c$ , the hint $h$ , and $t_0$ are public and require no protection. It is a common misconception that the $t_0$ -part of the secret key has to be kept secret. However, security of ML-DSA does not rely on $t_0$ being secret; it can, in fact, be recovered from a small number of signatures as stated in Section 6.1 of FIPS204. $w_1$ is public as the verifier has to recompute it from the signature and public key anyway.

Implementing the building blocks: masking gadgets

Exactly as in ML-KEM, masking ML-DSA needs two flavors of masking: some operations are naturally masked arithmetically, others in the Boolean domain. Linear operations over the prime field, such as the number-theoretic transform, fit arithmetic masking, while bit-level operations such as sampling the small secrets, the rounding inside Decompose, and the norm checks, fit Boolean masking. Moving between the two requires arithmetic-to-Boolean (A2B) and Boolean-to-arithmetic (B2A) conversions, the most expensive masking operations.

Much of the bitsliced gadget toolbox carries over from our ML-KEM implementation. The masked symmetric operations again run on the ACC's hardware-masked Keccak interface. At the gadget level, two building blocks can be reused: the secure AND, SecAnd [CS20], and the secure addition, SecAdd [BC22]. Both are independent of the modulus. A second group uses the same constructions as in ML-KEM, but for ML-DSA's much larger prime modulus $q$ we re-instantiate them: the modular addition of Boolean shares SecAddModq, the conversions SecA2BModq and SecB2AModq [BC22], and the compression SecCompress [CGMZ23] (used in ML-DSA-44 decompose only).

The remaining gadgets are specific to ML-DSA, and we implement them following [ABCH+23]: the masked Decompose SecDecompose, the bound check SecBoundCheck together with its masked comparison SecLeq, and the masked samplers ExpandS and ExpandMask that produce the secret vectors $s_1$ , $s_2$ and the masking polynomial $y$ . We had to slightly tweak ExpandS to obtain an implementation that produced secrets conforming to the FIPS204 standard.

Several gadgets need fresh randomness uniform modulo $q$ to refresh masking shares. We use rejection sampling with randomness drawn from the ACC's URND. Because $q$ lies just below $2^{23}$ , rejection sampling from 23-bit draws is cheap even when sampling multiple coefficients at once. For arithmetic masking, we sample 8 coefficients at once which are accepted in 99% of draws. For Boolean masking, we sample 256 bitsliced coefficients at once which are accepted in 78% of cases. The latter is particularly elegant, as a single sampling attempt requires only 23 URND reads and 22 bitwise operations (plus branching) for checking that all coefficients are smaller than q.

Putting it all together in 32 KB of DMEM

Implementing ML-DSA on memory-constrained devices is demanding even without considering masking: In ML-DSA-87 the public $A$ is an 8-by-7 matrix of 1 KB polynomials, totalling 56 KB, already well beyond the 32 KB of DMEM available in ACC. Masking only makes this worse, since every secret buffer must now be held as multiple shares. We limit our implementation to first-order masking as only that configuration appears feasible at reasonable performance in just 32 KB of DMEM. While we focus the discussion on ML-DSA-87, the most demanding parameter set, this implementation supports all three, and the smaller ones fit trivially with the same strategy.

Fitting ML-DSA into limited memory is a challenge we have tackled before. In our earlier blog post we showed how we shrank the unmasked implementation on ACC by streaming the large buffers, recomputing the matrix $A$ from its seed rather than storing it and re-deriving the masking polynomial $y$ on the fly. We inherit those techniques here, which leaves only one large buffer genuinely in flight: the commitment $w = A \cdot y$ , which occupies 16 KB when split into two shares.

One aspect deserving special attention is the format for storing the $s_1$ and $s_2$ components of the secret key. A natural choice would be to store them arithmetically masked, but this would require $2 \times (7 + 8) = 30$ KB, clearly infeasible with 32 KB total memory. Another option is to use Boolean masking of the canonical packed representation of $s_1$ and $s_2$ occupying 2.8 KB in total, and requiring additional SecB2AModq conversions each time $s_1$ and $s_2$ are used. The third option–which is the one we adopt–is to store the $\rho’$ seed used to expand $s_1$ and $s_2$ in key generation in ExpandS. Boolean masked, this seed only occupies 64 bytes.

Putting this all together, we came up with an initial implementation for which we show the memory allocation in the left part of the figure below.

Unfortunately, including the input buffers (secret key, message), output buffer (signature), constants (twiddle factors for the NTTs, etc), the $w$ buffer, temporary buffers within the signing loop, and temporary buffers in our masking gadget, we require 36.5 KB, exceeding our budget by 4.5 KB.

Luckily, one observation allows us to reduce memory consumption by more than 6 KB: The signing loop can be split in two phases where the first phase computes w, decomposes it, and computes the challenge c, while the second phase uses the challenge and computes the resulting candidate signature. In the first phase, the signature buffer is unused and can be reused for some of the temporary buffers required. In the second phase, the $w_0$ output of decompose can be compressed, freeing up some memory that can be reused for other buffers. Our allocation for the two phases, each fitting in 30.25 KB, is shown in the above figure on the right.

The trick to compress $w_0$ has been proposed for unmasked implementations in [BRS22]. However, to apply it to masked implementations, the SecDecompose gadget from [ABCH+23] has to be modified to output $w_0$ in Boolean shares rather than arithmetic shares, applying an additional SecB2AModq conversion in phase 2, where $w_0$ is consumed. This trick does not work for ML-DSA-44, which requires a different SecDecompose, but luckily that is not a problem: the uncompressed $w_0$ of ML-DSA-44 fits into the memory slot sized for the compressed $w_0$ of ML-DSA-87.

With this allocation we were able to fit ML-DSA-87 signing (and straightforwardly also 65 and 44) into around 30.25 KB of DMEM.

Benchmarking results

Next we can benchmark our masked ML-DSA implementation and compare it against the current unmasked baseline, for both key generation and signing and across all three parameter sets. We make use of ACC’s cycle-accurate Python simulator to obtain cycle counts. As signing time varies from run to run because of the rejection loop, we make use of the clever methodology described in Jade Philipoom’s previous blog post that allows selecting a representative set of inputs producing average run-time close to the expected performance. We used a set of 10 representative inputs for signing.

The masking overhead of 2.4× to 2.8× for key generation and 3.6× to 4.1× for signing is achieved with software changes alone and no additional hardware, while remaining within the ACC's default 32 KB of DMEM and IMEM.

To put these cycle counts into wall-clock terms: ACC closes timing between 100 MHz and 500+ MHz depending on the targeted process node. At the conservative 100 MHz end, masked signing averages roughly 19 ms, 33 ms, and 36 ms for ML-DSA-44, -65, and -87, respectively, with masked key generation taking between 2.5 and 5.2 ms.

Conclusion

In this blog post, we have described our implementation of hardened ML-DSA on Pavona's ACC. By reusing Pavona's hardware-masked Keccak implementation and adding bitsliced masking gadgets in software, we add first-order DPA protection to both key generation and signing without any additional hardware. The cost is modest, 2.4× to 2.8× for key generation and 3.6× to 4.1× for signing, and the whole implementation fits within ACC's default 32 KB of DMEM and IMEM.

Our implementation is available here and will be upstreamed to Pavona; it is currently undergoing side-channel evaluation.

Interested in learning more about ZeroRISC offerings? Sign up for our early-access program or contact us at info@zerorisc.com.

Participate in the Pavona open-source silicon effort here.

References

[ABCH+23] M. Azouaoui, O. Bronchain, G. Cassiers, C. Hoffmann, Y. Kuzovkova, J. Renes, T. Schneider, M. Schönauer, F.-X. Standaert, C. van Vredendaal. Protecting Dilithium against Leakage: Revisited Sensitivity Analysis and Improved Implementations. IACR TCHES 2023(4):58-79. https://eprint.iacr.org/2022/1406

[BC22] O. Bronchain, G. Cassiers. Bitslicing Arithmetic/Boolean Masking Conversions for Fun and Profit with Application to Lattice-Based KEMs. IACR TCHES 2022(4):553-588. https://eprint.iacr.org/2022/158

[BRS22] J. W. Bos, J. Renes, A. Sprenkels. Dilithium for Memory Constrained Devices. AFRICACRYPT 2022, LNCS 13503, pp. 217-235. https://eprint.iacr.org/2022/323

[CGMZ23] J.-S. Coron, F. Gérard, S. Montoya, R. Zeitoun. High-order Polynomial Comparison and Masking Lattice-based Encryption. IACR TCHES 2023(1):153-192. https://eprint.iacr.org/2021/1615

[CS20] G. Cassiers, F.-X. Standaert. Trivially and Efficiently Composing Masked Gadgets with Probe Isolating Non-Interference. IEEE T-IFS 15:2542-2555, 2020. https://eprint.iacr.org/2018/438

[FIPS 204] National Institute of Standards and Technology. FIPS 204: Module-Lattice-Based Digital Signature Standard. 2024. https://doi.org/10.6028/NIST.FIPS.204

[low26] lowRISC C.I.C.. OpenTitan Big Number (OTBN) accelerator hardware extensions for post-quantum cryptography – Extended design rationale. https://lowrisc.org/news/opentitan-big-number-otbn-accelerator-hardware-extensions-for-post-quantum-cryptography-extended-design-rationale/