2026-02-25 · Engineering

From Artifact to Production: Integrating and Refining Lattice Cryptography Acceleration

By Evan Apinis, Kat Fox, Jade Philipoom

This post follows on from the recent cross-post from our research collaborators at MPI-SP about their innovative design for ML-KEM and ML-DSA acceleration. Today, we'll focus on what happened between the researchers creating the initial implementation and now, when we have lattice cryptography support derived from their work in our production open-source silicon repository. For a higher-level, broader view of this collaboration, look out for our upcoming joint talk at Real World Crypto in March!

We believe strongly that open-source projects and academic research reinforce one another. Open-source projects give researchers a realistic starting point for experiments, so they don't need to build everything themselves or reverse-engineer a black-box product in order to publish papers. In return, open-source projects benefit from cutting-edge research being developed natively on their codebases.

Although research and open-source development can happen completely independently, we have found that early and frequent communication helps both sides. Researchers can offload non-novel, engineering-focused tasks to open-source developers and ask questions about the codebase. Developers can get advance insight into research directions and can share their project's constraints and priorities to increase the odds of adoption. Finally, and perhaps most importantly for both sides, open-source developers can integrate and refine research artifacts and then incorporate them into projects for real-world impact. In the case of the ML-KEM and ML-DSA collaboration, which introduced a significant amount of new code and features to an existing open-source codebase, this was a complex task that ranged from improving RTL design verification coverage to optimizing the memory usage and performance of the code.

Collaboration setup

Before we could start any of the optimizations, it was clear that it would be helpful for both us and our academic collaborators to work on the same repository so we all had shared state. Since we would be doing our optimizations in parallel with new research, it was important to stay in sync and make sure the changes didn't diverge too much.

We initially limited access to this repository to protect collaborators' ability to publish and minimize the chance that partial results could be taken out of context. However, we also wanted future readers to have a faithful record of how the project evolved. Especially in security, keeping context like discussions of technical tradeoffs and the ordering of changes in the public record is a core advantage of open-source development. We therefore worked from a private repository initially, then changed the visibility to public after the new research was complete.

The results from the Towards ML-KEM and ML-DSA on OpenTitan paper had already been released as an artifact in a dedicated git repository. This code was based on a relatively old version of the OpenTitan codebase, so the first challenge was to make the changes work with a more recent version. Unfortunately, the artifact had erased the git history, so we needed to copy the code over and do manual surgery on it to make it run rather than rebasing.

Once the code was running, we set up some very basic checks to protect the main branch from breaking changes. With the basic infrastructure in place, we were ready to start optimizing the hardware and software.

ML-DSA stack optimization

The first thing we needed to do was improve the memory usage of ML-DSA. The initial implementation would have required the ACC coprocessor to have 128KiB of data memory, up from…4KiB in the original OTBN design. This would cost too much area for such a small embedded device, but luckily we were not the first to address this problem; there was substantial existing research in papers like Dilithium for Memory Constrained Devices and Compact Dilithium Implementations on Cortex-M3 and Cortex-M4, as well as open-source implementations like pqm4.

Following these references, we experimented with different memory optimizations:

Optimization          New stack size (bytes)   Change (bytes)   Approx. slowdown
Stream matrix A       56672                    -64160           125%
Stream y              49504                    -7168            3%
Stream s1, s2, t0     19808                    -29696           14%
Compress w1           11872                    -7936            2%
Accumulate y          10848                    -1024            0.1%

As you can see from the table, most of the memory savings come from streaming various values. The ML-DSA signing procedure is essentially one big rejection loop that repeats until it generates a valid signature, so most performance-sensitive implementations naturally compute as much as they can before the loop and then keep those values live until the operation is complete.

However, these values can be huge. The matrix A, for example, is a k × l matrix where each entry is a polynomial of 256 23-bit coefficients. Vectorized implementations like ours will probably store each coefficient in a 32-bit slot, for a round 1024 bytes = 1KiB per polynomial. Depending on the parameter set, the matrix has 16-56 entries, so that means a whopping 56KiB for ML-DSA-87, just for A!
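The arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the (k, l) dimensions are the standard ones for the three ML-DSA parameter sets, and the 32-bit slot per coefficient is the storage assumption stated in the text.

```python
# Rough size of the fully expanded matrix A for each ML-DSA parameter set.
N_COEFFS = 256   # coefficients per polynomial
SLOT_BYTES = 4   # assumption from the text: one 32-bit slot per coefficient

# (k, l) matrix dimensions for each parameter set
PARAMS = {"ML-DSA-44": (4, 4), "ML-DSA-65": (6, 5), "ML-DSA-87": (8, 7)}


def matrix_bytes(k: int, l: int) -> int:
    """Bytes needed to hold all k * l polynomials of A at once."""
    return k * l * N_COEFFS * SLOT_BYTES


for name, (k, l) in PARAMS.items():
    print(f"{name}: {k * l} polynomials, {matrix_bytes(k, l) // 1024} KiB")
```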

In a memory-optimized implementation, we trade performance for space by recomputing A from its seed on every iteration of the sign loop. We never need to keep the entire matrix in memory; we stream it as we perform a matrix-vector multiplication and store only the resulting vector. The matrix expansion runs on SHAKE operations, and our hardware SHAKE helps minimize the cost of recomputation.
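To make the streaming idea concrete, here is a minimal Python sketch. It is not the ACC assembly: the function names are hypothetical, the seed layout follows the style of the ML-DSA standard's matrix expansion, and software SHAKE-128 from `hashlib` stands in for the hardware. Each matrix entry is expanded, multiplied coefficient-wise into an accumulator (as is done in the NTT domain), and then dropped.

```python
import hashlib

Q = 8380417  # ML-DSA modulus
N = 256      # coefficients per polynomial


def expand_entry(rho: bytes, r: int, s: int) -> list[int]:
    """Expand matrix entry (r, s) from the seed by rejection sampling."""
    # 1024 bytes is more than the ~768 usually needed; a real
    # implementation would squeeze the XOF incrementally instead.
    stream = hashlib.shake_128(rho + bytes([s, r])).digest(1024)
    coeffs = []
    for i in range(0, len(stream) - 2, 3):
        # Three bytes, little-endian, with the top bit cleared: 23 bits.
        z = stream[i] | (stream[i + 1] << 8) | ((stream[i + 2] & 0x7F) << 16)
        if z < Q:
            coeffs.append(z)
            if len(coeffs) == N:
                return coeffs
    raise RuntimeError("ran out of XOF output")


def streamed_matvec(rho: bytes, y_hat: list[list[int]], k: int, l: int):
    """Compute A_hat * y_hat while holding only one entry of A at a time."""
    w_hat = [[0] * N for _ in range(k)]
    for r in range(k):
        for s in range(l):
            a = expand_entry(rho, r, s)  # recomputed on demand, then dropped
            for j in range(N):
                w_hat[r][j] = (w_hat[r][j] + a[j] * y_hat[s][j]) % Q
    return w_hat
```

Peak memory here is one polynomial of A plus the k-entry result, rather than all k × l polynomials. The price is that `expand_entry` runs again on every pass through the signing loop.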

While it's possible to reduce ML-DSA-87 all the way down to 8kB of stack based on the papers we referenced, further memory optimizations would also come at a performance cost and 11kB is low enough for our initial requirements. Accounting for I/O buffers, constants, and expected overhead from masking, we think that this amount of stack reduction will be sufficient for our goal of fitting first-order-masked ML-DSA-87 in 32KiB.

We later applied a subset of the same memory optimizations to key generation and verification operations. It's easier to optimize these than signing; while the signing procedure is a loop that needs most values to remain live until the end of the computation, key generation and verification generally process each value only once.

Hardware and design verification integration

We concluded from our memory optimization experiments that 32KiB each of data and instruction memory for the coprocessor will be sufficient for first-order-masked ML-DSA-87. That still required a hardware change to update from 8KiB of instruction and 4KiB of data memory in the original design. To support ML-KEM and ML-DSA acceleration on ACC, we also added a vector ISA extension, a new adder, a new multiplier, and miscellaneous control and datapath state such as dedicated WSRs and CSRs.

It became evident that ACC needed to be configurable at the RTL level itself. To accommodate differing design constraints, we gated the ML-KEM and ML-DSA capabilities behind the AccPQCEn SystemVerilog parameter. The vectorized adder, the multiplier, and the PQC-specific datapaths are instantiated only when AccPQCEn is set. As a result, we can eliminate the additional hardware overhead required for ML-KEM and ML-DSA on ACC when the PQC algorithms are not needed.

Using SystemVerilog parameters has a considerable impact on design verification (DV) and coverage efforts. We turned the original ACC simulation config into a base configuration that specifies the DUT under test, along with testcases and common simulation environment variables. We then created a pair of configurations that inherit from the base, each setting its own AccPQCEn parameter value and DV dependencies.

In addition to the new instruction datapaths, we introduced a side-load interface connection between KMAC and ACC. Within the DV environment, a UVM agent was created for the KMAC interface to respond to hash requests generated by ACC, and drive the appropriate digest response.

Improved multiplier and adder

As we were making the above changes, Ruben Niederhagen at Academia Sinica and Hoang Nguyen Hien Pham at MPI-SP made further progress on the accelerated ISA design. In their paper Improving ML-KEM and ML-DSA on OpenTitan, they adjust the original ISA to remove the vectorized modular multiply instruction and replace it with an updated non-modular vector multiply that has additional modes to make software modular reduction faster. This resulted in better throughput overall for the multipliers, reducing cycle counts for top-level ML-KEM and ML-DSA operations by up to 17%. Further adjustments they made to the vector adder improved the design's maximum frequency dramatically, by 36-75% depending on the ASIC or FPGA toolchain. Despite all of these latency improvements, area is hardly affected.

Of course, given the impressive results, we wanted to integrate these new changes. As a result of frequent communication and a shared development repository, we knew to expect the update and the process was straightforward.

KMAC interface improvements

As we integrated the results from the Towards and Improving papers, we made some tweaks to the KMAC/ACC interface as well. In the original implementation from Towards, software sets the length of a SHAKE/SHA3 input at the start of the computation and then repeatedly writes to the kmac_msg register. The KMAC block reads all bytes from each write until the expected length is reached. However, ML-KEM and ML-DSA sometimes hash multiple concatenated values of different lengths, and it's not convenient (especially when memory is tight) to copy everything into a single buffer. For this reason, we added a 32-bit CSR, kmac_partial_write. Writing to this CSR applies a byte mask to the next word written to the kmac_msg_data WSR.
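The effect of a partial-write mask can be illustrated with a small software model. Everything here is a hypothetical sketch rather than the real register interface: it assumes one message write carries 32 bytes, that bit i of the mask marks byte i of the next write as valid, and that the mask applies to a single write. The class and function names are invented for illustration, and `hashlib` stands in for the KMAC block.

```python
import hashlib

WORD_BYTES = 32  # assumption: one message write carries 32 bytes


class KmacModel:
    """Hypothetical software model of the message interface, not the RTL."""

    def __init__(self):
        self._xof = hashlib.shake_256()
        self._mask = None  # pending byte mask for the next write, if any

    def write_partial(self, mask: int) -> None:
        # Assumption: bit i set means byte i of the next word is valid.
        self._mask = mask

    def write_msg(self, word: bytes) -> None:
        assert len(word) == WORD_BYTES
        mask = self._mask if self._mask is not None else (1 << WORD_BYTES) - 1
        self._mask = None  # the mask applies to one write only
        self._xof.update(bytes(b for i, b in enumerate(word) if (mask >> i) & 1))

    def digest(self, n: int) -> bytes:
        return self._xof.digest(n)


def absorb(kmac: KmacModel, value: bytes) -> None:
    """Feed one value word by word, masking off padding in the last word."""
    for off in range(0, len(value), WORD_BYTES):
        chunk = value[off:off + WORD_BYTES]
        if len(chunk) < WORD_BYTES:
            kmac.write_partial((1 << len(chunk)) - 1)
            chunk = chunk.ljust(WORD_BYTES, b"\0")
        kmac.write_msg(chunk)
```

In this model, calling `absorb` once per value gives the same digest as hashing the concatenation of all the values, without ever assembling them into one contiguous buffer.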

We also reduced the number of ACC stall cycles attributed to the KMAC interface by implementing an eager refresh of the digest. In the ideal scenario, ACC performs intermediate operations while a new digest is being loaded. To accomplish this, we coupled the assertion of the next signal to the previous digest read. This lets us speculatively request the next digest from KMAC, reducing the maximum stall cycles per read by 1. This change alone had a noticeable impact on latency for ML-DSA and ML-KEM (about 5% and 4% overall, respectively).

Rejection sampling speedups for ML-DSA

One of the most time-consuming steps in all ML-DSA operations is the sampling of the matrix A. Each polynomial in the matrix must be separately expanded from a seed value using the SHAKE XOF. The expansion consists of sampling 3 bytes at a time from the SHAKE output, clearing the top bit, and rejecting all candidates that are still greater than or equal to the 23-bit modulus q = 8380417, until we have sampled 256 valid coefficients. Even with hardware acceleration for the actual SHAKE computations, this operation took 40-60% of cycles for all top-level ML-DSA operations!

The first opportunity we found to speed up the rejection loop was to vectorize the sampling routine. Since q is actually quite close to 2^23, rejections during ML-DSA sampling are rare. A random 23-bit number has a 99.9% chance of being within the valid range. So we wondered: since coefficients were discarded so rarely, could we push complexity out of the hot loop and into specialized discard logic that would run only rarely? The answer was a resounding yes.

Thanks to the fast vectorized ISA we have from the Towards paper, we can sample, unpack, and store candidate coefficients several at a time. Only once we have speculatively assembled 256 coefficients into a polynomial do we actually check the bounds.

In the rare case that there is an invalid coefficient in the vector, we run a specialized discard routine to locate the bad coefficient, shift the whole polynomial one place, and then sample a new coefficient at the end. This discard routine is comparatively slow, but it's so rare that it doesn't matter; the speedup we observed from this initial vectorization was 20-40% across each top-level ML-DSA operation.
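The following Python sketch shows the shape of this approach next to the straightforward loop. It is an illustration under assumptions, not the ACC routine: the function names are hypothetical, a plain Python loop stands in for the vectorized bounds check, and the candidate list is assumed to be long enough to cover the rare rejections.

```python
import hashlib

Q = 8380417  # ML-DSA modulus
N = 256      # coefficients per polynomial


def candidates(seed: bytes, count: int = 400) -> list[int]:
    """Unpack 23-bit candidates from SHAKE-128 output, 3 bytes each."""
    stream = hashlib.shake_128(seed).digest(3 * count)
    return [stream[i] | (stream[i + 1] << 8) | ((stream[i + 2] & 0x7F) << 16)
            for i in range(0, 3 * count, 3)]


def sample_reference(cands: list[int]) -> list[int]:
    """Straightforward loop: test every candidate as it arrives."""
    out = []
    for z in cands:
        if z < Q:
            out.append(z)
            if len(out) == N:
                break
    return out


def sample_speculative(cands: list[int]) -> list[int]:
    """Fill all 256 slots first, then fix up the rare invalid entries."""
    poly = cands[:N]  # hot path: no per-coefficient bounds check
    nxt = N           # index of the next unused candidate
    i = 0
    while i < N:      # stands in for the vectorized bounds check
        if poly[i] < Q:
            i += 1
            continue
        # Slow path: drop the bad coefficient, shift the tail down one
        # place, and append a fresh candidate at the end.
        poly = poly[:i] + poly[i + 1:] + [cands[nxt]]
        nxt += 1
        # i is not advanced: the value shifted into slot i is unchecked.
    return poly
```

Both functions return the same polynomial for the same candidate stream. The difference is where the work lands: the speculative version does almost nothing per coefficient in the common case and pays for the shift only when a rejection actually occurs.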

We were then able to gain further speedups by taking advantage of the KMAC hardware tweaks. These further improvements brought us to an overall 52% speedup for top-level ML-DSA operations.

Rejection sampling speedups for ML-KEM

As with ML-DSA, we found that core ML-KEM operations were bottlenecked by rejection sampling. Before the optimizations in this section, eager KMAC refreshing and a carefully optimized NTT routine had already eliminated the other bottlenecks one typically finds in an ML-KEM implementation. Benchmarking ML-KEM-512 revealed that the poly_gen_matrix routine, which generates the public matrix Â^T during encryption, contributed around 35-40% of cycles to decapsulation.

Recall that in ML-KEM, Â^T is a k × k matrix, where k = 2, 3, 4 for ML-KEM-512, ML-KEM-768, and ML-KEM-1024 respectively. Each entry of this matrix is deterministically sampled from a seed in the public key by passing it through SHAKE-128, just as in ML-DSA.

Digging into poly_gen_matrix's performance, we found that processor stalls were a significant contributor to latency, representing 25% of the overall cycles spent in poly_gen_matrix. These stalls weren't a result of waiting on the KMAC engine but instead arose from branching in the core sampling logic. Indeed, as part of defense against Spectre-style attacks, the ACC is designed to always require two cycles per branch regardless of whether a branch is taken or not.

We might be tempted to try to employ the same eager sampling trick that worked for ML-DSA. One crucial detail with ML-KEM, however, is that the rejection sampling routine rejects coefficients at a much higher rate than ML-DSA does. ML-KEM coefficients are sampled modulo q = 3329 by sampling 12-bit integers, meaning that a candidate coefficient will only be accepted with probability 3329/2^12 ≈ 0.813, or about 81.3% of the time.
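A quick calculation shows how different the two regimes are. The sketch below treats candidates as independent and uniform and uses only the moduli and bit widths given in the text. The "all 256 valid" figure is the chance that a speculatively filled polynomial needs no fix-up at all.

```python
DSA_Q, DSA_BITS = 8380417, 23
KEM_Q, KEM_BITS = 3329, 12
N = 256  # coefficients per polynomial


def accept_prob(q: int, bits: int) -> float:
    """Probability that a uniform `bits`-bit candidate is below q."""
    return q / 2**bits


def clean_batch_prob(q: int, bits: int, n: int = N) -> float:
    """Probability that n consecutive candidates are all accepted."""
    return accept_prob(q, bits) ** n


for name, q, bits in [("ML-DSA", DSA_Q, DSA_BITS), ("ML-KEM", KEM_Q, KEM_BITS)]:
    p = accept_prob(q, bits)
    print(f"{name}: accept {p:.4f}, "
          f"expected rejects per {N} candidates {N * (1 - p):.2f}, "
          f"all-{N}-valid probability {clean_batch_prob(q, bits):.3g}")
```

For ML-DSA, a batch of 256 candidates contains about a quarter of a rejection on average, and roughly 78% of batches are clean. For ML-KEM, a batch of 256 contains about 48 rejections on average, and a clean batch essentially never happens, so a rare slow path cannot absorb the cost.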

Since we couldn't amortize rejection handling costs as in ML-DSA, we had to find a way to eliminate stalls while encountering new candidate coefficients. By carefully separating the rejection sampling logic into eager and conservative cases, eliminating unnecessary branches, and using bn.sel instead of conditional branches for accumulator updates, we eliminated 30-40% of the overall processor stalls, reducing ML-KEM decapsulation cycle counts by about 20% across all parameter sets.
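The flavor of replacing a conditional branch with a select can be shown in Python, with the caveat that Python has no notion of branch stalls. This is a hypothetical illustration of the technique, not the ACC routine. The candidate unpacking follows the standard ML-KEM layout of two 12-bit values per three bytes, and the function names are invented.

```python
import hashlib

Q = 3329  # ML-KEM modulus
N = 256   # coefficients per polynomial


def unpack(stream: bytes):
    """Yield the two 12-bit candidates packed into each 3-byte group."""
    for i in range(0, len(stream) - 2, 3):
        b0, b1, b2 = stream[i], stream[i + 1], stream[i + 2]
        yield b0 | ((b1 & 0x0F) << 8)
        yield (b1 >> 4) | (b2 << 4)


def sample_branching(stream: bytes) -> list[int]:
    """Reference version: one data-dependent branch per candidate."""
    out = []
    for d in unpack(stream):
        if d < Q and len(out) < N:
            out.append(d)
    return out


def sample_select(stream: bytes) -> list[int]:
    """Same result, but acceptance only feeds arithmetic, never a branch."""
    out = [0] * (N + 1)  # spare slot absorbs writes once the polynomial is full
    count = 0
    for d in unpack(stream):
        accept = ((d - Q) >> 12) & 1        # 1 if d < Q, else 0
        not_full = ((count - N) >> 9) & 1   # 1 while count < N, else 0
        out[count] = d                      # unconditional speculative store
        count += accept & not_full          # advance only on acceptance
    return out[:count]
```

The store always happens, and a rejected candidate is simply overwritten by the next one because the counter did not move. On hardware where every branch costs the same fixed number of cycles whether taken or not, removing the per-candidate branch removes that cost from the inner loop.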

The glorious rebase

After all of the changes above were merged into the development repository, we felt that the hardware and software implementation was mature enough to integrate into ZeroRISC's primary repository. While our testing setup let us know that our software tests for ACC were passing, that was about it. At the end of the day though, how much work could it really be to upstream this? As it turns out, quite a bit, and the experience certainly emphasizes the importance of a high-quality testing framework.

Despite the development repository sharing a common commit history with our upstream, it hadn't been synced for roughly six months. We had contributed nearly 200 commits between KMAC, ACC, and SW over that period, while the upstream was accumulating commits at a much more rapid pace. Thankfully, the vast majority of commits did not have conflicts, as development was reasonably independent.

This merge isn't the end of our refinement work; far from it. We have more ideas for how to improve the code size metrics and are working with our research collaborators on masking techniques and even potential additional ISA tweaks. Stay tuned for more, including the upcoming joint talk at Real World Crypto in Taipei!

Interested in learning more? Sign up for ZeroRISC's early-access program or contact us at info@zerorisc.com.