The hybrid ZK/Optimistic Rollup of the future

I've recently become quite convinced that the future of the Ethereum Rollup is actually a hybrid of the two primary approaches (ZK and Optimistic). In this post I'll try to explain the basic gist of the architecture that I'm imagining and why I believe it's the best path forward. Please note that I spend most of my time working on Optimism, an Optimistic Rollup, but I'm not a ZK expert. If I get anything wrong when talking about the ZK side, please feel free to reach out to me and I'll issue a correction.

I don't feel like spending too much time discussing the nature of ZK or Optimistic Rollups. This post assumes that you already have a decent understanding of how these things work. You don't need to be an expert, but you should at least know what they are and how they work at a high level. If I tried to explain Rollups to you, this post would be very, very long. Anyway, please enjoy!

Start with an Optimistic Rollup

A hybrid ZK/Optimistic Rollup starts as an Optimistic Rollup that looks a lot like Optimism's Bedrock architecture. Bedrock is designed to be maximally compatible with Ethereum ("EVM Equivalent") and does this by running an execution client that's almost identical to a vanilla Ethereum client. Bedrock uses Ethereum's upcoming consensus/execution client separation model to significantly reduce the diff to the EVM (some changes will always be necessary, but we can manage this). As I'm writing this post the Bedrock Geth diff is a single +388 -30 commit.

Like any good Rollup, Optimism is pulling blocks/transaction data from Ethereum, ordering that data in some deterministic way within the consensus client, and then feeding that data into the L2 execution client to be executed. This architecture solves the first half of the "ideal Rollup" puzzle and gives us an EVM Equivalent L2.
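To make that pipeline concrete, here's a minimal sketch of the derivation loop in Go. Everything here (type names, function names) is hypothetical shorthand to show the shape of the flow, not Bedrock's actual API:

```go
package main

import "fmt"

// Hypothetical types standing in for the real derivation pipeline.
type L1Block struct {
	Number   uint64
	BatchTxs [][]byte // rollup transaction batches posted to L1
}

type L2Payload struct {
	L1Origin uint64
	Txs      [][]byte
}

// deriveL2Payloads deterministically orders batch data found on L1 into
// L2 payloads: every honest node reading the same L1 history derives the
// exact same L2 chain.
func deriveL2Payloads(l1 []L1Block) []L2Payload {
	var payloads []L2Payload
	for _, b := range l1 {
		if len(b.BatchTxs) == 0 {
			continue
		}
		payloads = append(payloads, L2Payload{L1Origin: b.Number, Txs: b.BatchTxs})
	}
	return payloads
}

// executePayload hands a derived payload to the execution client,
// mirroring Ethereum's consensus/execution split. In Bedrock this is an
// almost-vanilla Geth executing the transactions.
func executePayload(p L2Payload) {
	fmt.Printf("executing payload derived from L1 block %d\n", p.L1Origin)
}

func main() {
	l1 := []L1Block{{Number: 1, BatchTxs: [][]byte{{0x01}}}}
	for _, p := range deriveL2Payloads(l1) {
		executePayload(p)
	}
}
```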

Of course, we now also need to solve the problem of telling Ethereum about what's going on within Optimism in a provable way. Without this feature there's no way for contracts to make decisions based on the state of Optimism. This would mean, among other things, that users could deposit into Optimism but would never be able to get their assets back out. Although a one-way Rollup might actually be useful in certain cases, a two-way Rollup is much more useful in the general case.

We can tell Ethereum about the state of any Rollup by giving some commitment to that state along with a proof that the commitment was correctly generated. Another way to say this is that we're proving that the "Rollup program" was executed correctly. The only real difference between ZK and Optimistic Rollups is the form that this proof takes. In a ZK Rollup, you're expected to give an explicit ZK proof of the correct execution of the program. In an Optimistic Rollup, you make a claim about the commitment with no explicit proof. Other users can then challenge your claim and force you to play a back-and-forth game where you'll eventually figure out who's correct.
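To make the difference concrete, here's a rough Go sketch of the two verification flows as a bridge might expose them. The interfaces and names are illustrative only, not any production bridge's API:

```go
package bridge

import "errors"

// StateCommitment is a claimed post-state root for some span of L2 blocks.
type StateCommitment [32]byte

// A ZK Rollup only accepts a commitment accompanied by a validity proof.
type ZKVerifier interface {
	// VerifyProof checks an explicit proof that executing the rollup
	// program over the agreed inputs really yields `commit`.
	VerifyProof(commit StateCommitment, proof []byte) error
}

// An Optimistic Rollup accepts the bare claim, then allows challenges.
type OptimisticBridge interface {
	// Propose records a claimed commitment with no proof attached.
	Propose(commit StateCommitment) error
	// Challenge opens the interactive fault-proof game against a claim.
	Challenge(commit StateCommitment) error
	// Finalize succeeds only once the challenge window has passed
	// without a successful challenge.
	Finalize(commit StateCommitment) error
}

var ErrChallengeWindowOpen = errors.New("challenge window still open")
```

The key asymmetry: the ZK path costs work up front (generating the proof), while the Optimistic path costs time (the challenge window).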

I won't go into too much detail about the Optimistic Rollup challenge game. It's worth noting that the state of the art of this game is to compile your program (the Geth EVM + a few bits around the edges in Optimism's case) into some simple machine architecture like MIPS. We do this because we need to build an interpreter for our program on-chain and it's much easier to build a MIPS interpreter than it is to build an EVM interpreter. The EVM is also a constantly changing target (we have regular upgrade forks) and doesn't fully encompass the program that we want to prove (there's some non-EVM stuff in there too).
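To give a flavor of why this is easier, here's a heavily simplified single-instruction MIPS step in Go. The real interpreter lives on-chain in a contract and backs memory with a Merkle tree; this sketch is illustrative, not Optimism's actual code:

```go
package mips

// MIPSState is the tiny machine state the challenge game bisects down to.
// Proving one instruction step is all the on-chain interpreter must do.
type MIPSState struct {
	PC   uint32
	Regs [32]uint32
	Mem  map[uint32]uint32 // word-addressed memory; Merkleized on-chain
}

// Step executes a single MIPS instruction. A small set of fixed-width,
// fixed-semantics instructions like these is far simpler to interpret
// on-chain than the full, constantly evolving EVM.
func Step(s *MIPSState) {
	insn := s.Mem[s.PC]
	opcode := insn >> 26
	rs := (insn >> 21) & 0x1f
	rt := (insn >> 16) & 0x1f
	imm := insn & 0xffff

	switch opcode {
	case 0x09: // addiu rt, rs, imm
		s.Regs[rt] = s.Regs[rs] + signExtend16(imm)
	case 0x23: // lw rt, imm(rs)
		s.Regs[rt] = s.Mem[s.Regs[rs]+signExtend16(imm)]
		// ... a full interpreter covers the rest of the (small, frozen) ISA
	}
	s.PC += 4
}

func signExtend16(v uint32) uint32 {
	if v&0x8000 != 0 {
		return v | 0xffff0000
	}
	return v
}
```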

Once you've built an on-chain interpreter for your simple machine architecture and you've created some off-chain tooling, you should have a fully functional Optimistic Rollup.

Transforming into a ZK Rollup

Generally speaking, I think that Optimistic Rollups are going to be dominant for at least the next few years. There's some sentiment that ZK Rollups will eventually overtake Optimistic Rollups, but I think this is probably wrong. The relative simplicity and flexibility of Optimistic Rollups today mean that they can be transformed into ZK Rollups over time. If we can come up with a model for making this happen, then there's really no strong incentive to deploy to a less flexible and more brittle ZK system when you can simply deploy to an existing Optimistic system and call it a day.

My goal is therefore to create an architecture and migration path that allows an existing modern Optimistic system (like Bedrock) to seamlessly transform into a ZK system. Here's how I believe that can not only work, but work in a way that outcompetes the zkEVM approaches of the day.

We begin with the Bedrock-style architecture I described above. Note that I (briefly) explained how Bedrock has a challenge game that can assert the validity of some execution of the L2 program (a MIPS program that runs the EVM + some extra stuff). One of the primary downsides of this approach is that we need to allot some period of time for users to detect and successfully challenge an invalid result proposal. This adds quite a bit of time to the asset withdrawal process (7 days on the current Optimism mainnet).

However, our L2 is just a program running in a simple machine (MIPS). It's entirely possible to build a ZK circuit for this simple machine. We can then use this circuit to explicitly prove the correct execution of the L2 program. Without making a single change to the current Bedrock codebase, you can start publishing validity proofs for Optimism. It's seriously just that simple.
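Here's a rough sketch of what that bolt-on looks like, assuming a hypothetical zkMIPS prover interface (none of these names are real Bedrock components):

```go
package prover

// Commitment is a claimed post-state root, as proposed by the rollup.
type Commitment [32]byte

// ZKMIPSProver is a hypothetical interface to a zkVM whose circuit encodes
// the exact same MIPS program already used by the fault-proof game.
type ZKMIPSProver interface {
	// Prove returns a validity proof that running the rollup program over
	// `inputs` transitions the state from `pre` to `post`.
	Prove(pre, post Commitment, inputs [][]byte) ([]byte, error)
}

// PublishValidityProof bolts ZK proving onto an unchanged Optimistic
// Rollup: proposals keep flowing exactly as before, and proofs are simply
// generated and posted alongside them.
func PublishValidityProof(p ZKMIPSProver, pre, post Commitment, inputs [][]byte, submitToL1 func(proof []byte) error) error {
	proof, err := p.Prove(pre, post, inputs)
	if err != nil {
		// No proof? Nothing breaks: the same proposal is still protected
		// by the Optimistic challenge game.
		return err
	}
	return submitToL1(proof)
}
```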

Why this approach is so good

Quick note: I refer to "zkMIPS" in this section but I'm really using this as a stand-in for any general-purpose "simple" zkVM.

zkMIPS is easier than zkEVM

One massive benefit of building a zkMIPS (or zk[insert other machine here]) rather than a zkEVM is that the target machine architecture is simple and static. The EVM changes with some frequency: gas prices are changed, opcodes are tweaked, things are added or removed. MIPS-V hasn't changed since 1996. By targeting a zkMIPS, you're working on a fixed problem space. You don't need to change and possibly re-audit your circuit every time the EVM is updated.

zkMIPS is more flexible than zkEVM

Another key argument is that a zkMIPS is much more flexible than a zkEVM. With a zkMIPS, you can modify the client code at will to implement various optimizations or user-experience improvements. Client updates no longer have to come with corresponding circuit updates. You'd also be creating a core component that could be used to turn literally any blockchain into a ZK Rollup, not just Ethereum.

Your problem shifts to proving time

ZK proving times scale along two axes: the number of constraints and the size of the circuit. By focusing on a circuit for a simple machine like MIPS (as opposed to a much more complex machine like the EVM), we're able to significantly reduce circuit size and complexity. However, the number of constraints depends on the number of machine instructions executed. Each EVM opcode breaks down into many MIPS instructions, so the number of constraints, and with it your overall proving time, increases significantly.
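A back-of-the-envelope sketch of why, with completely made-up numbers just to show the shape of the math:

```go
package estimate

// Completely made-up constants, purely to show the shape of the math.
// Real values depend on the circuit design and the proof system.
const (
	evmOpsPerBlock     = 1_000_000 // EVM opcodes executed in one L2 block
	mipsInsnsPerEVMOp  = 200       // expansion factor: EVM opcode -> MIPS instructions
	constraintsPerInsn = 50        // circuit constraints per MIPS instruction
)

// ConstraintsPerBlock illustrates why proving time, not circuit
// complexity, becomes the bottleneck: constraint count scales with the
// number of instructions actually executed.
func ConstraintsPerBlock() uint64 {
	return uint64(evmOpsPerBlock) * mipsInsnsPerEVMOp * constraintsPerInsn
}
```

With these assumed numbers you'd be proving ten billion constraints per block, which is exactly why prover optimization becomes the whole ballgame.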

But reducing proving time is a problem firmly lodged in the Web2 space. Given that the MIPS machine architecture isn't going to change any time soon, we can highly optimize our circuit and prover without worrying about EVM changes down the line. I feel very confident that the hiring pool for hardware engineers qualified to optimize a well-defined program is at minimum 10x (if not 100x) the pool qualified to build and audit a shifting zkEVM target. Companies like Netflix probably have lots of hardware engineers working on optimizing transcoding chips who'd be more than happy to work on an interesting ZK challenge in exchange for a pile of VC money.

It's possible that initial proving time for a circuit like this exceeds the 7-day Optimistic Rollup withdrawal period. As time goes on, this proving time will only decrease. By introducing ASICs and FPGAs we can speed up proving time significantly. With a static target, we can build more optimized provers.

Eventually, proving time for this circuit will dip below the current 7-day Optimistic withdrawal period and we can start to consider removing the Optimistic challenge process. Running a prover for 7 days may still be prohibitively expensive, so it's likely that we'll want to wait a bit longer, but the point stands. You could even run both proof systems at the same time, so we can start to use the ZK proof ASAP and fall back to the Optimistic proof if the prover fails for whatever reason. When ready, it's easy to remove the Optimistic proof in a manner completely transparent to applications and, boom, your Optimistic Rollup becomes a ZK Rollup.
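A sketch of what that hybrid finalization logic might look like, with hypothetical names and a hard-coded 7-day window:

```go
package hybrid

import "time"

const challengeWindow = 7 * 24 * time.Hour

// WithdrawalClaim is a hypothetical record for a claim tracked by a bridge
// running both proof systems in parallel.
type WithdrawalClaim struct {
	ProposedAt time.Time
	ZKProven   bool // a validity proof for this claim was verified on-chain
	Challenged bool // an open fault-proof challenge exists against it
}

// CanFinalize uses the ZK proof as soon as one exists and falls back to
// waiting out the Optimistic challenge window otherwise.
func CanFinalize(c WithdrawalClaim, now time.Time) bool {
	if c.Challenged {
		return false
	}
	if c.ZKProven {
		return true // ZK path: no need to wait out the window
	}
	return now.Sub(c.ProposedAt) >= challengeWindow // Optimistic fallback
}
```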

You get to worry about other important problems

Running a blockchain is a hard problem that doesn't just involve writing a bunch of back-end code. A lot of the work we do at Optimism is focused on improving user and developer experience with useful client-side tooling. We also spend a huge amount of time and energy working on the "soft" side of things: talking with projects, understanding pain-points, designing incentives. The more time spent on your chain software, the less time you have to worry about these other things. You can always try to hire more people, but organizations don't scale linearly and each new hire will increase internal communication overhead.

Since the ZK circuit work can be slapped on top of an existing running chain, you can build out the core platform in parallel with the proving software. And since the client can be modified without changing the circuit, you get to decouple your client and proving teams. Optimistic Rollups that take this approach will likely be many years ahead of ZK competitors in terms of actual on-chain activity.

Some conclusions

To be perfectly honest, I cannot see any significant downside to this approach under the assumption that a zkMIPS prover can be optimized heavily over time. I believe the only real impact on applications is that the gas costs of different opcodes may have to be tweaked to reflect the proving time added by those opcodes. If it's truly not possible to optimize this prover down to a reasonable level then I admit defeat. If it is actually possible to optimize this prover, then the zkMIPS/zkVM approach seems so drastically better than the current zkEVM approach as to likely obsolete the latter entirely. That might seem like a radical statement, but it was not too long ago that the single-step Optimistic fault proof was entirely obsoleted by the multi-step proof.

If you believe that I'm obviously wrong about this approach, please feel free to reach out to me and tell me why.

-kf