Nub: the KVM-microkernel architecture
A design doc for moving javm into a long-running microkernel
running under KVM (or any equivalent hardware-virtualization
substrate) — with chain state σ resident entirely in the guest.
The microkernel is named nub.
The current v3 implementation track (architecture.md) continues independently. This doc covers the alternative architecture that’s been promoted past prototype.
Status (2026-05)
- Stage 0 (this doc). Complete: design and decision gates written down.
- Stage 1 (prototype). Complete: a 5-commit Hyperlight + ring-0
spike validated every architectural gate (ring 0, long mode,
custom IDT, in-guest #PF handling, CR3 manipulation, CoW round-trip)
on the i9-13900K target hardware. Measured numbers are recorded
against the gate table below. The build-system spike also passed:
stable rustc + x86_64-unknown-none + a one-shot RUSTFLAGS recipe
drives the bare-metal guest build without cargo-hyperlight, custom
target JSONs, or picolibc.
- Crate skeleton. Complete: nub, nub-kernel, nub-arch-local,
nub-arch-hyperlight, nub-build are scaffolded with the Arch trait
wired through both backends. See §Crate layout.
- Naming since Stage 0. The doc was originally written around
a “HAL” trait (hardware abstraction layer) with concrete impls
named HyperlightHal / PlainMemHal / PciSocHal. The implementation
calls the trait Arch instead (matching Linux’s arch/x86, arch/arm64
convention), and the impls are named nub-arch-hyperlight /
nub-arch-local / nub-arch-pci-soc (the last still speculative). The
body of this doc has been updated in place; a couple of mentions of
the legacy “HAL” name remain where they refer specifically to the
original Stage 0 sketch.
The rest of this doc retains the Stage 0 design analysis as written (it held up under prototyping), with §Crate layout (as of Stage 1) and §Arch trait design capturing decisions made during the prototype that the original design didn’t have.
Why this is worth exploring
Three threads converge on the same answer.
(a) Memory management. The host-userspace track invents a lot
of OS-shaped machinery for cheap cap copy and per-Instance memory
management — per-DataCap memfds with MAP_PRIVATE CoW, an
LSM-shard overlay model, userspace page-fault tricks (uffd / signal
handlers), a per-thread pool of 4 GiB FlatMemory mmaps, a
write-back protocol on unmount. Each piece is tractable. The
aggregate is several KLoC reimplementing what a real OS gives us
natively. In ring 0 with our own page tables, the whole stack
collapses into “set a CoW bit, install a #PF handler.”
(b) Security: no JIT in the host process. The PolkaVM project explicitly identified untrusted JIT’d code running in the same process as the chain client as a hard problem — they ended up sandboxing the JIT in a separate child process. We rejected that for JAVM because the hostcall round-trip cost was unacceptable. A microkernel solves both: the JIT runs inside a hardware-isolated guest, and most “hostcalls” don’t have to cross out to the host at all (the guest carries the cap manager, σ, the canonical encoder, the state-root computation).
(c) Portability. Today’s recompiler is gated
cfg(all(target_os = "linux", target_arch = "x86_64")) —
unavailable on macOS, Windows, BSD. Routing the recompiler through
a hypervisor abstraction (KVM on Linux, Hypervisor.framework on
macOS, WHP on Windows) extends coverage to all of those platforms
rather than narrowing it. See End-state architecture below.
The mgmt_copy O(1) property we want is provable in both the
userspace and microkernel models (see below). The decision isn’t
“is it theoretically possible,” it’s “which model is the cleaner
engineering bet over the next 5 years.”
End-state architecture: Arch → nub → javm-exec
Three layers, like a real OS:
┌──────────────────────────────────────────┐
│ javm-exec (interpreter + recompiler) │ ← execution backends
├──────────────────────────────────────────┤
│ nub (the microkernel) │ ← cap manager, σ-filesystem,
│ (cap mgmt, state root, scheduler, the │ IDT/page-table mgmt, JIT cache,
│ interpreter, page-fault handler) │ port. across all Arch impls
├──────────────────────────────────────────┤
│ Arch (CPU/MMU substrate) │ ← swap per deployment target
└──────────────────────────────────────────┘
The kernel (nub-kernel) is portable Rust, no_std. It runs
unchanged on whatever “hardware” the Arch implementation
exposes. The Arch trait abstracts: physical memory, vCPU
control, host-callback ABI, page-fault notification. Everything
else is inside the kernel.
This is the same factoring that Linux’s arch/, NetBSD’s MI/MD split,
and Plan 9’s porting discipline use. We get the same benefits: one
correctness story, one performance story, one set of tests.
Arch trait sketch (Stage 0 shape)
// crates/nub-kernel — pure no_std, no platform deps
pub trait Arch {
// Memory management — Arch owns physical pages
fn alloc_pages(&mut self, n: usize) -> Result<GuestPhysPage>;
fn free_pages(&mut self, base: GuestPhysPage, n: usize);
fn map_region(&mut self, /* address, perms, flags */) -> Result<()>;
// vCPU control — what "running" means varies by Arch
fn run_until_yield(&mut self, /* entry, regs */) -> ExitReason;
// Host callbacks — what "asking the host" means varies by Arch
fn read_blob(&mut self, hash: [u8; 32]) -> Result<Bytes>;
fn write_blob(&mut self, bytes: &[u8]) -> Result<[u8; 32]>;
fn host_notify(&mut self, event: HostEvent);
}
The post-prototype shape is more refined; see §Arch trait design (after Stage 1).
Three Arch implementations
| Arch | Substrate | javm-exec backend | Deployment target |
|---|---|---|---|
| nub-arch-hyperlight | KVM / Hyper-V / WHP, guest ring 0 long mode | Recompiler (in-guest JIT) | Linux/macOS/Windows/BSD with hardware virt |
| nub-arch-local | Plain Rust, runs in the host process | Interpreter only (no JIT; see below) | Any platform; portable fallback |
| nub-arch-pci-soc (future) | ARM SoC or FPGA on PCIe; nub cross-compiled to device arch | Recompiler (device-resident JIT) | Validators with execution-accelerator hardware |
Why the Arch determines the javm-exec backend. The recompiler
emits and executes native code; it requires hardware isolation
from the host process to preserve the no-JIT-in-the-host security
property we want. So an Arch with hardware isolation
(nub-arch-hyperlight, nub-arch-pci-soc) can run the
recompiler safely. nub-arch-local has no isolation boundary:
JIT’ing native code there means executing untrusted-derived
code in the host process, the exact thing we’re avoiding.
So nub-arch-local exposes only the interpreter.
The cross-Arch determinism invariant
The same (prior_root, block) input must produce byte-identical
new_state_root regardless of which Arch/javm-exec combination
ran the block. Validators on different deployment targets must
not diverge in consensus. This is the load-bearing correctness
condition for the multi-Arch deployment story; it’s the same
property as cross-backend equivalence (interpreter vs recompiler)
but now also applies across hardware substrates.
Verified via property test: corpus of blocks, run through each
Arch × backend combination, state-roots compared. Already the
shape of pvm_bench’s gas-match assertion; promote to a hard
correctness gate.
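The property test described above can be sketched as a small harness. This is a minimal illustration, not the project's test: ApplyFn, apply_interpreter, apply_recompiler, and check_corpus are hypothetical names, each Arch × backend combination is reduced to a pure function, and a 64-bit toy fold stands in for the real state root.

```rust
/// One Arch x backend combination, reduced to a pure function
/// (prior_root, block) -> new_root. Hypothetical shape for illustration.
pub type ApplyFn = fn(u64, &[u8]) -> u64;

const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// Toy "interpreter": folds the block into the root with an explicit loop.
pub fn apply_interpreter(prior_root: u64, block: &[u8]) -> u64 {
    let mut h = prior_root;
    for &b in block {
        h = (h ^ u64::from(b)).wrapping_mul(FNV_PRIME);
    }
    h
}

/// Toy "recompiler": same semantics reached through a different code path.
pub fn apply_recompiler(prior_root: u64, block: &[u8]) -> u64 {
    block
        .iter()
        .fold(prior_root, |h, &b| (h ^ u64::from(b)).wrapping_mul(FNV_PRIME))
}

/// Runs every block through every runner, chaining roots from `genesis`.
/// Returns the final root, or a description of the first divergence.
pub fn check_corpus(
    runners: &[(&str, ApplyFn)],
    genesis: u64,
    corpus: &[Vec<u8>],
) -> Result<u64, String> {
    let Some((&(ref_name, ref_fn), rest)) = runners.split_first() else {
        return Ok(genesis); // nothing to compare
    };
    let mut root = genesis;
    for (i, block) in corpus.iter().enumerate() {
        let expected = ref_fn(root, block);
        for &(name, f) in rest {
            let got = f(root, block);
            if got != expected {
                return Err(format!(
                    "block {i}: {name} produced {got:#x}, {ref_name} produced {expected:#x}"
                ));
            }
        }
        root = expected;
    }
    Ok(root)
}
```

Treating the first runner as the reference keeps the error message specific about which combination diverged and on which block, which is the information a consensus bug report needs.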
nub-arch-hyperlight: portability matrix
Hyperlight already abstracts the per-OS hypervisor:
| Host OS | Hypervisor API | Recompiler available? |
|---|---|---|
| Linux | KVM | Yes (works today) |
| macOS | Hypervisor.framework | Yes (on Hyperlight roadmap) |
| Windows | WHP (Windows Hypervisor Platform) | Yes (Hyperlight ships this) |
| FreeBSD | bhyve | Yes (we’d add this layer) |
The remaining portability constraint is host CPU arch: the hypervisor only runs guests targeting the host’s CPU, so per-arch JIT codegen (x86-64 today; ARM64 for Apple Silicon/Graviton later) is still our responsibility. This is the same cost we already accept for the userspace recompiler today.
nub-arch-hyperlight: nested-virt caveat in cloud
Some cloud VMs don’t expose hardware virt to tenants, or charge
for it as a premium (AWS metal SKUs only without paid nested-virt;
GCP preview; Azure SKU-restricted). Validators in such
environments fall back to nub-arch-local (interpreter only) —
slower but functional. The cross-Arch determinism invariant is
what makes this safe.
nub-arch-pci-soc: the future-proofing story
The Arch trait is the natural seam for execution-accelerator hardware: a dedicated ARM SoC or FPGA on PCIe running the microkernel as its firmware. The same microkernel source cross-compiles to the device’s arch; the Arch impl talks to the device over PCIe instead of KVM.
Sketch of the protocol:
- Host writes commands to a PCIe BAR (command ring in MMIO).
- SoC receives, runs the microkernel, returns results via a result ring.
- Bulk transfer (cap blob fetch/write) uses DMA between host RAM and device memory.
- Host-side API is unchanged: kernel.apply_block(prior_root, block) returns (new_root, deltas).
This is the same architectural pattern Ethereum’s prover-acceleration designs and Aptos’s parallel-exec accelerators are exploring. The fact that we accommodate it as an Arch swap, not a rewrite, is a real win, but it is purely future-proofing and not on the implementation roadmap today.
Custom compile target: a possibly-avoidable friction
Hyperlight today defaults to a custom Rust target spec
(x86_64-hyperlight-none) plus a cargo-hyperlight subcommand
plus picolibc for C dependencies. That’s significant build-system
surface.
Our microkernel is pure Rust (no C deps). For pure-Rust guests,
most of what the custom target packages can be supplied via
RUSTFLAGS on top of x86_64-unknown-none — stable since
Rust 1.71 (August 2023). Linking against hyperlight_guest (the
no-libc layer of Hyperlight’s guest crates) directly, without
hyperlight_guest_bin (the picolibc layer), should let us drop
the custom-target/cargo-hyperlight/picolibc stack.
A Stage 1 spike confirms: if cargo build --target=x86_64-unknown-none
with the right RUSTFLAGS produces a Hyperlight-loadable ELF, we
keep the friction at “supply a few linker flags.” If it doesn’t,
we wear the custom-target cost; not the end of the world but
worth knowing.
Crate layout (as of Stage 1)
Today’s workspace shape:
rust/
├── nub/ lib — caller-facing `Nub` handle
├── nub-kernel/ lib — Kernel<A: Arch> + Arch trait + interp
├── nub-arch-local/ lib — in-process Arch (software-copy memory)
├── nub-arch-hyperlight/ bin — Hyperlight Arch (bare-metal guest, no_std + no_main)
├── nub-build/ lib — cross-compile helper for bare-metal arch guests
│
├── javm-interpreter/ lib — PVM interpreter, no_std (lifted from javm-exec)
├── javm-recompiler-x86/ lib — x86_64 PVM recompiler (lifted from javm-exec)
│
└── jar-apply/ lib — block-apply, gas, quota; built on Nub (future)
Dependency edges:
- nub-kernel depends on javm-interpreter (the interp is portable, lives with the kernel, runs against any Arch::Memory).
- nub-arch-hyperlight depends on nub-kernel + javm-recompiler-x86 (the recompiler is hardware-locked to x86_64 + real pages, lives only with Arch impls that can run it).
- nub-arch-local depends only on nub-kernel.
- nub depends on nub-kernel, nub-arch-local, and (via build.rs) nub-arch-hyperlight as a guest blob.
- jar-apply (future) depends on nub.
Today’s javm-exec crate is slated to split into
javm-interpreter + javm-recompiler-x86 to make the
no_std-vs-x86-only boundary explicit. Until that split lands,
nub-kernel and nub-arch-hyperlight consume javm-exec
selectively via feature gates.
nub (the entrypoint crate) exposes a single uniform handle:
pub struct Nub { /* enum over backends */ }
impl Nub {
pub fn new_local() -> Self;
pub fn new_hyperlight() -> Result<Self>;
pub fn invoke(&mut self, target: InstanceRef, endpoint: u16,
args: &[u8], opts: InvokeOptions) -> Result<InvokeOutcome>;
pub fn state_root(&self) -> CapHash;
}
For the in-process backend, Nub owns a Kernel<LocalArch>
directly. For Hyperlight, Nub holds a sandbox and ships the
invocation as RPC; the real Kernel<HyperlightArch> lives
guest-side. Both expose the same surface.
Arch trait design (after Stage 1)
The original Stage 0 Hal sketch put memory and vCPU primitives
on the trait. After working through how the interpreter and
recompiler actually interact with memory, the design splits
along two orthogonal axes:
- Execution mode: interpreter vs recompiler.
- Memory mode: hardware-paged (real CR3/PTE, CoW via #PF) vs
software-copy (page table simulated in Rust, CoW via memcpy).
The recompiler is hardware-locked: it emits native x86_64 loads/stores, so it only works with hardware-paged memory. The interpreter is portable: it can go through either memory mode. This determines where each piece lives:
- Interpreter → nub-kernel. Shared across all Arch impls, parameterized over A::Memory. Same source compiled for both the host process (over LocalArch::Memory) and the bare-metal guest (over HyperlightArch::Memory).
- Recompiler → outside nub-kernel. It’s CPU-arch-specific (today x86_64) and memory-mode-specific (hardware only). Lives in javm-recompiler-x86, consumed only by nub-arch-hyperlight. Future ARM Arch impls would consume a hypothetical javm-recompiler-aarch64.
The Arch trait grows accordingly:
pub trait Arch {
type Memory: Memory;
type Error;
/// Create a fresh address space (working memory for one
/// invocation). The returned `Memory` is hardware-paged or
/// software-copy depending on the Arch impl.
fn create_address_space(&mut self) -> Self::Memory;
/// Drive the kernel-supplied interpreter via `A::Memory`.
fn invoke(&mut self, target: InstanceRef, endpoint: u16,
args: &[u8], opts: InvokeOptions)
-> Result<InvokeOutcome, Self::Error>;
/// Recompiler dispatch: jumps into JIT'd code. Only
/// hardware-paged Arches implement this meaningfully;
/// software-copy Arches return Unsupported.
fn enter_native(&mut self, /* entry, regs, memory */)
-> Result<ExitReason, Self::Error>;
fn state_root(&self) -> CapHash;
}
pub trait Memory {
fn read_u8(&self, vaddr: u64) -> Result<u8, MemFault>;
fn write_u8(&mut self, vaddr: u64, b: u8) -> Result<(), MemFault>;
// u16/u32/u64 width ops (or generic-over-width)
fn map(&mut self, vaddr: u64, size: u64, perms: Perms);
fn set_perms(&mut self, vaddr: u64, perms: Perms);
fn cow_fork(&self) -> Self;
// …
}
Both Memory ops must inline cleanly to a hardware load/store
on HyperlightArch::Memory (where the kernel runs in the same
address space the program runs in) and to a sparse-vec/BTreeMap
lookup on LocalArch::Memory. Width-specific methods + careful
inlining keep the per-instruction interpreter cost competitive
with a non-traited interpreter.
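To make the software-copy side concrete, here is a minimal sketch of what a LocalArch-style memory could look like: a BTreeMap from page number to a reference-counted page, with CoW done by Arc::make_mut. SoftMemory, the two-variant Perms, the MemFault variants, and the 4 KiB page size are assumptions for illustration; the real Memory trait is not yet wired (see the next paragraph).

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

const PAGE: usize = 4096; // assumed page size

#[derive(Debug, PartialEq, Eq)]
pub enum MemFault {
    Unmapped(u64),
    ReadOnly(u64),
}

#[derive(Clone, Copy, PartialEq, Eq)]
pub enum Perms {
    ReadOnly,
    ReadWrite,
}

/// Software-copy address space: page number -> (perms, shared page).
#[derive(Clone, Default)]
pub struct SoftMemory {
    pages: BTreeMap<u64, (Perms, Arc<[u8; PAGE]>)>,
}

/// Splits a virtual address into (page number, offset within page).
fn split(vaddr: u64) -> (u64, usize) {
    (vaddr / PAGE as u64, (vaddr % PAGE as u64) as usize)
}

impl SoftMemory {
    /// Maps zero-filled pages covering [vaddr, vaddr + size).
    pub fn map(&mut self, vaddr: u64, size: u64, perms: Perms) {
        if size == 0 {
            return;
        }
        let first = vaddr / PAGE as u64;
        let last = (vaddr + size - 1) / PAGE as u64;
        for pn in first..=last {
            self.pages.insert(pn, (perms, Arc::new([0u8; PAGE])));
        }
    }

    #[inline]
    pub fn read_u8(&self, vaddr: u64) -> Result<u8, MemFault> {
        let (pn, off) = split(vaddr);
        let (_, page) = self.pages.get(&pn).ok_or(MemFault::Unmapped(vaddr))?;
        Ok(page[off])
    }

    #[inline]
    pub fn write_u8(&mut self, vaddr: u64, b: u8) -> Result<(), MemFault> {
        let (pn, off) = split(vaddr);
        let (perms, page) = self.pages.get_mut(&pn).ok_or(MemFault::Unmapped(vaddr))?;
        if *perms != Perms::ReadWrite {
            return Err(MemFault::ReadOnly(vaddr));
        }
        // make_mut copies the page only if another fork still shares it.
        Arc::make_mut(page)[off] = b;
        Ok(())
    }

    /// CoW fork: clones the page map, bumping one refcount per mapped page.
    /// No page contents are copied until a write lands.
    pub fn cow_fork(&self) -> Self {
        self.clone()
    }
}
```

Note that cow_fork in this sketch is linear in the number of mapped pages; a persistent map in place of BTreeMap would be one way to bring the fork itself down to a constant-time pointer copy.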
The skeleton (as of Stage 1) only exposes invoke + state_root
on Arch; type Memory and enter_native will land alongside
the javm-interpreter port into nub-kernel.
Kernel runtime: collections + RNG
nub-kernel is a real kernel — no_std, no host runtime to lean
on. Two design decisions worth pinning:
Collections — BTreeMap everywhere by default. Replay
determinism is load-bearing for the chain. HashMap’s
seed-dependent iteration order is a footgun: the moment any
state-bearing decision touches iteration, two nodes with
different seeds diverge. alloc::collections::BTreeMap is
no_std-clean, deterministic by construction, and log n lookups
are fine for the sizes the kernel touches. HashMap (via
hashbrown) earns its keep only for non-state caches with
adversarially-supplied keys; those use an explicit
DoS-resistant BuildHasher seeded from the kernel’s CSPRNG.
Iteration over a HashMap anywhere on a state-affecting path is
a bug.
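A small sketch of why sorted iteration matters: two maps built in different insertion orders must fold to the same digest. The functions build and digest_in_order are hypothetical, and the fold is a toy stand-in for whatever hash the state path really uses.

```rust
use std::collections::BTreeMap;

/// Builds a map from pairs given in arbitrary insertion order.
pub fn build(insert_order: &[(u32, u64)]) -> BTreeMap<u32, u64> {
    insert_order.iter().copied().collect()
}

/// Folds (key, value) pairs into a toy 64-bit digest in iteration order.
/// BTreeMap iterates in ascending key order, so the result depends only
/// on the map's contents, never on how it was populated.
pub fn digest_in_order(map: &BTreeMap<u32, u64>) -> u64 {
    map.iter().fold(0xcbf2_9ce4_8422_2325u64, |h, (k, v)| {
        (h ^ u64::from(*k) ^ v.rotate_left(17)).wrapping_mul(0x0000_0100_0000_01b3)
    })
}
```

The same fold over a HashMap would depend on the hasher seed, which is exactly the divergence the rule above is meant to exclude.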
RNG — host-seeded, in-kernel CSPRNG. A 32-byte seed is pulled
from the host at kernel boot and never replaced; the kernel runs
a ChaCha20 CSPRNG (rand_chacha) downstream. The seed source
per Arch:
- LocalArch: from getrandom (host process entropy).
- HyperlightArch: the host shoots 32 bytes into a known shared-memory slot before the first guest call; the guest reads it during kernel init. Not RDRAND/RDSEED: host injection is auditable and reproducible when replayed.
A single SecureRng and a KernelBuildHasher (SipHash-13 seeded
from one CSPRNG draw) are kernel globals. Per-node hasher seeds
are fine — different validators can have different hasher seeds
as long as iteration over hashmaps is never observable in state.
Deps for nub-kernel:
hashbrown = { version = "0.14", default-features = false, features = ["ahash", "inline-more"] }
siphasher = { version = "1", default-features = false }
rand_core = { version = "0.6", default-features = false }
rand_chacha = { version = "0.3", default-features = false }
(Not yet wired; lands with the KernelSeed plumbing through
Kernel::new.)
The shape
The descriptions below are written against nub-arch-hyperlight —
the target we’d actually implement first. nub-arch-local is the
degenerate case where most of these structures live in the host
process directly; the API surface is identical.
Long-running microkernel. One microvm per validator process, spawned at startup, kept alive until shutdown. Pay the boot cost once — even 100 ms is fine because it’s amortised across millions of blocks. JIT-compiled code, hot caps, and σ working set all persist across blocks.
Host (thin):
- Spawns the microvm at process startup.
- Exposes a virtio-blk device backed by host storage — the guest formats and owns it. The host treats σ as opaque bytes; it does not parse or interpret the chain state.
- Exposes a network/IPC channel (virtio-net or a callback-style host function ABI) for block ingress, peer egress, and external-API routing.
- Forwards messages: “here’s a block, apply it” → guest; “external client wants Y” → “call endpoint Y on Cap::Instance Z” → guest.
- Supervises the VM: restart on crash, replay from disk.
Guest (heavy):
- Single ring-0 function: apply_block(prior_root, block) → (new_root, deltas). Stateful (retains JIT cache, cap working set, page tables) but pure externally (same inputs → same outputs).
- In-guest cap manager: refcounts cap objects, manages page-table-level CoW on mgmt_copy, LSM-shard storage for DataCaps.
- In-guest σ filesystem: content-addressed store on the virtio-blk device. LSM-tree shape (matches DataCap shard semantics). All cap encode/decode is guest-only.
- Recompiler + interpreter, both guest-resident.
- A minimal IDT entry for #PF, handling CoW write-faults in-guest at ring 0.
Per-call flow:
- Host forwards apply_block(prior_root, block) into the guest (Hyperlight-style host-callable function).
- Guest’s cap manager locates Instance[Chain] (already in memory from previous block, or reloaded from virtio-blk on startup).
- Guest runs each event: spawns a sub-Instance call, executes via JIT or interpreter, harvests mutations.
- After all events: walk the mutated subtree, recompute hashes bottom-up (Merkle re-root), produce new_state_root.
- Write new blobs to the in-guest filesystem (which writes through to virtio-blk → host disk).
- Return (new_root, summary) to host.
Sub-Instance isolation is at the PVM level (bounds-checked memory, gas-metered execution), not the hardware level. Hardware isolation between sub-Instances would be overkill — the PVM is already the sandboxing boundary. So sub-Instance “calls” are function calls inside the recompiler’s dispatch, no process switch, no CPL transition.
Provability: O(1) mgmt_copy
The original motivating question. The answer is yes, provable, and the proof doesn’t actually require KVM — the userspace LSM-shard model already proves it constructively. KVM doesn’t add new theoretical capability for O(1) copy; it strengthens the constants and the architecture.
Userspace proof. Every mutable cap kind is Arc<Inner>.
Clone is one atomic refcount bump. Mutations route through
Arc::make_mut. All amortised cost moves to first observation
after copy; none lands on the copy itself.
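The userspace argument can be demonstrated directly with the standard library. This is a minimal sketch: the DataCap wrapper and the single bytes field on DataCapInner are assumptions for illustration and ignore the shard structure of real DataCaps.

```rust
use std::sync::Arc;

/// Hypothetical inner payload; real DataCaps are sharded.
#[derive(Clone)]
pub struct DataCapInner {
    pub bytes: Vec<u8>,
}

#[derive(Clone)]
pub struct DataCap(Arc<DataCapInner>);

impl DataCap {
    pub fn new(bytes: Vec<u8>) -> Self {
        Self(Arc::new(DataCapInner { bytes }))
    }

    /// mgmt_copy: one atomic refcount bump, independent of cap size.
    pub fn mgmt_copy(&self) -> Self {
        Self(Arc::clone(&self.0))
    }

    /// Write path: Arc::make_mut clones the inner value only if it is
    /// still shared, so the copy cost lands on the first write.
    pub fn write(&mut self, offset: usize, b: u8) {
        Arc::make_mut(&mut self.0).bytes[offset] = b;
    }

    pub fn read(&self, offset: usize) -> u8 {
        self.0.bytes[offset]
    }

    /// True while both handles still point at the same allocation.
    pub fn shares_storage_with(&self, other: &Self) -> bool {
        Arc::ptr_eq(&self.0, &other.0)
    }
}
```

Copying a 1 MiB cap and a 1-byte cap both cost the same refcount bump; the difference only shows up when one side writes.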
Microkernel proof. Cap structures live in microkernel-managed
objects with refcounts. mgmt_copy is a microkernel primitive
that bumps the refcount and marks DataCap page-table entries CoW.
Hardware does the rest: subsequent writes fault, handler allocates
fresh physical page, remaps writable. Constant-time work per
mgmt_copy, regardless of cap size.
The proof is one paragraph in either model. We can ship it as a design-doc paragraph today; implementation choice is independent.
Performance gains from ring-0 execution
Estimates for what the microkernel-resident recompiler buys over and above the security and architecture wins:
| Gain | Mechanism | Estimated impact |
|---|---|---|
| Eliminate JIT bounds checks | Rely on hardware #PF for OOB instead of emitted cmp/jmp per memory op | ~30% on memory-bound workloads |
| In-guest CoW page fault | #PF → in-guest IDT handler → fix → iretq; no SIGSEGV | ~20× per CoW fault (10 µs → 500 ns) |
| Huge pages for code + data | Ring 0 controls page tables; 2 MiB/1 GiB pages easy | 10–30% TLB savings on memory-heavy code |
| No scheduler preemption | vCPU runs until VM-exit; no Linux preempting mid-loop | Tail-latency improvement, throughput similar |
| W^X dance elimination | RWX pages allowed in ring 0; no mprotect after emit | Tens of µs per JIT compile |
Combined: 10–30% steady-state speedup on existing pvm_bench
workloads, dominated by bounds-check elimination and TLB
improvements. The CoW fault speedup is huge per-fault but only
matters in proportion to how often we fault.
The flip side — VM-exits to the host:
When the guest needs to call out (e.g., write a block to disk via virtio-blk completion), it’s a VM-exit + handler + VM-resume, ~1 µs round-trip. Today’s equivalent inside the host process is a function call (~5–10 ns).
But: with σ entirely in-guest and the cap manager guest-resident,
most “would-be hostcalls” don’t need to cross out at all.
Cap manipulation, mgmt_copy, sub-Instance call, JIT compile —
all in-guest, no exit. The crossings that remain are:
- virtio-blk completion (disk I/O) — already async; one exit per block batch, not per byte
- virtio-net traffic — bulk; not in the hot path of execution
- External-API endpoint dispatch — one exit per external query
Net: the VM-exit cost is bounded by the real I/O frequency, not by the cap-operation frequency. This is exactly the property PolkaVM was looking for and didn’t get with separate-process sandboxing.
The hard parts to validate
If we do pursue this, these are the design questions Stage 0 needs to answer. Failure on any one is a reason to stop.
1. In-guest σ filesystem
σ is content-addressed: keys are 32-byte hashes, values are canonical-encoded cap bytes. The guest stores σ on its virtio-blk device. The natural shape: an LSM-style content-addressed store — incoming writes append to a fresh shard, background compaction merges old shards. This matches the DataCap-shard semantics exactly; same compaction discipline applies.
Reference designs are everywhere (RocksDB, Pebble, LMDB-style B-trees). Pick the smallest one that works. Estimated size: ~1–2 KLoC of guest code, including a tiny journal for crash consistency.
Risk: crash consistency. If the guest dies between writing a block’s deltas and writing the new state-root, we need to recover cleanly. Standard journal+commit pattern; well-understood.
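The journal+commit pattern can be sketched in memory. This is an illustration under stated assumptions: SigmaStore and its methods are hypothetical names, a Vec stands in for the virtio-blk device, and a 64-bit toy hash stands in for the 32-byte content hash. Shards and compaction are not modelled.

```rust
use std::collections::BTreeMap;

/// Toy stand-in for the 32-byte content hash.
pub type Hash = u64;

fn toy_hash(bytes: &[u8]) -> Hash {
    bytes.iter().fold(0xcbf2_9ce4_8422_2325u64, |h, &b| {
        (h ^ u64::from(b)).wrapping_mul(0x0000_0100_0000_01b3)
    })
}

/// Append-only blob log plus a commit journal.
#[derive(Default)]
pub struct SigmaStore {
    log: Vec<(Hash, Vec<u8>)>,    // blob records, in write order
    index: BTreeMap<Hash, usize>, // hash -> position in `log`
    journal: Vec<(Hash, usize)>,  // (state_root, log length) per commit
}

impl SigmaStore {
    /// Content-addressed write: identical bytes are stored once.
    pub fn put(&mut self, bytes: &[u8]) -> Hash {
        let h = toy_hash(bytes);
        if !self.index.contains_key(&h) {
            self.index.insert(h, self.log.len());
            self.log.push((h, bytes.to_vec()));
        }
        h
    }

    pub fn get(&self, h: Hash) -> Option<&[u8]> {
        self.index.get(&h).map(|&i| self.log[i].1.as_slice())
    }

    /// Commit point: every blob written so far belongs to `root`.
    pub fn commit(&mut self, root: Hash) {
        self.journal.push((root, self.log.len()));
    }

    /// Crash recovery: drop blobs written after the last commit record,
    /// rebuild the index, and return the last committed root.
    pub fn recover(&mut self) -> Option<Hash> {
        let (root, keep) = match self.journal.last() {
            Some(&(root, len)) => (Some(root), len),
            None => (None, 0),
        };
        self.log.truncate(keep);
        self.index = self.log.iter().enumerate().map(|(i, (h, _))| (*h, i)).collect();
        root
    }
}
```

The key property is ordering: blobs are written first and the journal record last, so a crash between the two leaves only unreferenced blobs, which recovery discards.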
2. Working-set cache + eviction
Long-running guest accumulates state. When memory pressure rises, cold caps must spill back to virtio-blk and free their guest physical pages. Reload on next access via the filesystem layer. Same shape as a database buffer pool.
Risk: unbounded growth if eviction policy lags. Need an LRU-style policy with explicit memory budget. Probably tunable via configuration: how much guest RAM dedicated to cap cache.
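A minimal sketch of an LRU policy with an explicit byte budget follows. CapCache and its methods are hypothetical; insert returns the evicted entries so a caller could spill them to the filesystem layer. The recency list is a plain VecDeque, so touching an entry is linear in the number of cached caps, which is fine for a sketch but not for production.

```rust
use std::collections::{BTreeMap, VecDeque};

pub type Key = [u8; 32];

pub struct CapCache {
    budget_bytes: usize,
    used_bytes: usize,
    entries: BTreeMap<Key, Vec<u8>>,
    lru: VecDeque<Key>, // front = coldest, back = hottest
}

impl CapCache {
    pub fn new(budget_bytes: usize) -> Self {
        Self { budget_bytes, used_bytes: 0, entries: BTreeMap::new(), lru: VecDeque::new() }
    }

    pub fn used_bytes(&self) -> usize {
        self.used_bytes
    }

    /// Moves `key` to the hot end of the recency list.
    fn touch(&mut self, key: &Key) {
        if let Some(pos) = self.lru.iter().position(|k| k == key) {
            self.lru.remove(pos);
        }
        self.lru.push_back(*key);
    }

    /// Looks up a cap and marks it recently used.
    pub fn get(&mut self, key: &Key) -> Option<&[u8]> {
        if !self.entries.contains_key(key) {
            return None;
        }
        self.touch(key);
        self.entries.get(key).map(Vec::as_slice)
    }

    /// Inserts a cap and returns the entries evicted to stay in budget.
    /// The entry just inserted is never evicted.
    pub fn insert(&mut self, key: Key, bytes: Vec<u8>) -> Vec<(Key, Vec<u8>)> {
        self.used_bytes += bytes.len();
        if let Some(old) = self.entries.insert(key, bytes) {
            self.used_bytes -= old.len();
        }
        self.touch(&key);
        let mut spilled = Vec::new();
        while self.used_bytes > self.budget_bytes && self.lru.len() > 1 {
            let Some(cold) = self.lru.pop_front() else { break };
            if let Some(data) = self.entries.remove(&cold) {
                self.used_bytes -= data.len();
                spilled.push((cold, data));
            }
        }
        spilled
    }
}
```

Returning the spilled entries rather than dropping them keeps the eviction policy separate from the write-back path.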
3. In-guest cap representation
The userspace-track cap structures (Arc<InstanceInner>,
Arc<DataCapInner> with shards) translate to in-guest objects
with microkernel-managed refcounts:
- Arc<T> ⇒ microkernel-refcounted handle.
- Arc::make_mut ⇒ microkernel CoW primitive: bump-and-clone the object metadata, mark page-table entries CoW.
- im::OrdMap ⇒ persistent map against the microkernel allocator (the im crate works no_std).
- DataCap shards ⇒ refcounted sets of guest physical pages; “mount” inserts page-table entries into the recompiler’s address space.
4. State-root computation
Standard Merkle re-root. Each cap object has a cached hash; mutation invalidates; root request walks the cap graph bottom-up, recomputing for invalidated nodes only. Deterministic iteration order (sorted by key) is required throughout.
For a block that mutates K caps, re-root walks O(K · depth)
nodes. Typical: <100 hash computations per block, sub-millisecond.
5. Microkernel determinism
Validators on different hosts (and on the same host with different
backends, interpreter vs recompiler) must produce byte-identical
new_state_root for the same (prior_root, block). Requirements:
- Deterministic allocator. Cap encoded form depends only on protocol-defined content, not on allocation order.
- No host-time observable. Guest never branches on wall-clock or anything else the host can vary.
- Deterministic iteration. All collection traversal during state-root computation is sorted by protocol-defined key.
- Identical JIT codegen. Recompiler is deterministic today; property preserved in the guest.
None of these are novel. They are the standard discipline every blockchain validator already follows. The microkernel doesn’t add new constraints; it just requires the discipline to apply to the entire microkernel binary, not just the cap layer.
Cross-Arch × cross-backend invariant. Every Arch × backend
combination (nub-arch-hyperlight + recompiler, nub-arch-local +
interpreter, future nub-arch-pci-soc + recompiler) must produce
identical new_state_root for the same (prior_root, block).
Verify via property test: corpus of blocks, run through each
combination, state-roots compared. Same shape as pvm_bench’s
gas-match assertion; the multi-Arch model promotes it from
benchmark sanity to a hard correctness gate.
6. Commit/rollback semantics
When a block is applied but not yet committed by network consensus, the guest needs the ability to roll back. Two natural options:
(a) mgmt_copy the chain Instance before applying. The
existing CoW machinery handles the snapshot. On commit, drop the
old copy; on rollback, drop the new one.
(b) Host buffers blocks until consensus accepts. Only commit once accepted. Simpler if the chain client already does this.
Both work. Probably (a) — the cap manager already has the machinery, and it lets the guest pipeline speculation.
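Option (a) can be sketched with the same copy-on-write idiom used in the userspace proof. Speculative and ChainState are hypothetical names, and a BTreeMap stands in for the chain Instance, so the first speculative write here copies the whole map instead of a single page.

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

/// Toy stand-in for the chain Instance's state.
pub type ChainState = BTreeMap<String, u64>;

pub struct Speculative {
    committed: Arc<ChainState>,
    pending: Option<Arc<ChainState>>,
}

impl Speculative {
    pub fn new(state: ChainState) -> Self {
        Self { committed: Arc::new(state), pending: None }
    }

    /// Applies a block against a CoW copy; `committed` is never touched.
    pub fn apply(&mut self, mutate: impl FnOnce(&mut ChainState)) {
        // The "mgmt_copy" step: a refcount bump on the committed state,
        // unless a speculative copy already exists.
        let mut working = self
            .pending
            .take()
            .unwrap_or_else(|| Arc::clone(&self.committed));
        // First write clones because `committed` still shares the Arc.
        mutate(Arc::make_mut(&mut working));
        self.pending = Some(working);
    }

    /// Consensus accepted: the speculative copy becomes the committed one.
    pub fn commit(&mut self) {
        if let Some(p) = self.pending.take() {
            self.committed = p;
        }
    }

    /// Consensus rejected: drop the speculative copy.
    pub fn rollback(&mut self) {
        self.pending = None;
    }

    pub fn committed(&self) -> &ChainState {
        &self.committed
    }
}
```

Both outcomes are a pointer swap or a drop, so neither commit nor rollback does work proportional to state size.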
Exploration plan
If the design doc clears review, the prototype proceeds in stages. Each stage has a decision gate; we stop if the gate fails.
| Stage | Goal | Output | Time |
|---|---|---|---|
| 0 | This design doc | Written spec; open questions identified | done |
| 1 | Arch trait + boot prototype | Sketch the Arch trait. Build a minimal nub-arch-hyperlight impl: boots a guest, exposes one host-callable function, calls back to host. Also: confirm x86_64-unknown-none works as the compile target. Measure per-call latency, host-callback round-trip, in-guest #PF round-trip. | done (5 commits, ~1 week; all gates passed) |
| 1.5 | Crate skeleton | Lift the prototype into proper crates: nub-kernel, nub-arch-local, nub-arch-hyperlight, nub, nub-build. Define the Arch trait and Nub handle. Stub invoke / state_root on both backends to prove the boundary end-to-end. | done (5 commits) |
| 2 | σ filesystem + interpreter through nub-kernel | Split javm-exec into javm-interpreter (no_std, depended on by nub-kernel) and javm-recompiler-x86 (x86-only, depended on by nub-arch-hyperlight). Wire javm-interpreter through Kernel<A>::invoke over A::Memory for both Arches. Implement LSM-style content-addressed σ store inside the kernel; port cap encode/decode. Assert post-state root matches host-replay across both Arches. | 1 month |
| 3 | Recompiler in-guest, delete standalone host JIT | Wire javm-recompiler-x86 through nub-arch-hyperlight (via Arch::enter_native); delete the userspace recompiler from the legacy javm-exec; assert pvm_bench numbers hold; measure new performance (bounds-check elimination, CoW fault speedup). | 1 month |
| 4 | Production shape | Cross-Arch × backend determinism audit; commit/rollback wiring; per-OS hypervisor shim (Linux/KVM, macOS/Hypervisor.framework, Windows/WHP); observability; performance tuning. | 1 month |
Total: ~3 months from Stage 1 to a working prototype.
Decision gates
After Stage 0 (this doc):
- Does the determinism story hold up under skeptical review? (We already handle these constraints at the blockchain level, so the bar is “no new categories of risk,” not “novel correctness story.”)
- Does the in-guest filesystem story look workable? (1–2 KLoC of guest code is plausible.)
After Stage 1 (Arch trait + boot prototype) — all passed:
- Per-call latency (host → guest function dispatch → return, with both sides warm). Target: <100 µs, ideally <10 µs. Measured: ~5.9 µs per round trip across 10,000 noop calls on i9-13900K. ✓
- Host-callback round-trip (guest → host function → guest). Target: <5 µs. Measured: ~5.8 µs (12.5 − 6.7); just over target but in the right zone. ✓
- In-guest #PF round-trip (write to CoW page → handler → resume). Target: <500 ns. Measured: ~860 ns for the pure-CoW path (D2 test) and ~7.5 µs for the full demand-paging path with mapping setup (C1 test). The 860 ns CoW number is the load-bearing one for mgmt_copy performance; the 7.5 µs full-setup number is one-shot per first-touch, not per-fault. ✓
- Custom-target shed: does x86_64-unknown-none work as the build target? Yes. Drop-in stable Rust + a 5-flag RUSTFLAGS recipe + nub-build’s 100-line build.rs helper; no cargo-hyperlight, no picolibc, no Nix shell. ✓
- Cold-boot time is not a hard gate: paid once, amortised over millions of blocks.
After Stage 2 (nub-arch-local + nub-arch-hyperlight interpreter, σ filesystem):
- Per-block overhead vs current userspace track on a realistic workload. Target: within 2× of userspace. (If much worse, redesign.)
- Cross-host AND cross-Arch determinism test: same (prior_root, block), run on (a) two different machines and (b) nub-arch-local vs nub-arch-hyperlight (both running the interpreter at this stage). Identical new_state_root in every combination. Target: passes. (Load-bearing for the multi-Arch model.)
- Crash-and-restart test: kill the guest mid-block; verify σ on disk is consistent; verify reload produces identical state.
After Stage 3 (recompiler in-guest):
- pvm_bench numbers ideally improved by 10–30% vs the userspace recompiler (per the gain estimates above). At minimum, no worse than the userspace path.
- Cross-Arch × cross-backend invariant test: nub-arch-hyperlight + recompiler vs nub-arch-hyperlight + interpreter vs nub-arch-local + interpreter. All three produce identical state roots for a property-test corpus.
Stage 0 deliverables (this doc)
σ encoding and storage
σ is a content-addressed key-value store inside the guest.
Keys: 32-byte content hashes (blake2b). Values: canonical-encoded
cap bytes (existing jar-cap Image/Instance/CNode/DataCap
encoding).
Storage layout (in guest, on virtio-blk):
- LSM-style content-addressed shards.
- Hot working set in guest RAM, evicted by LRU to disk.
- Background compaction merges old shards.
- A small commit journal at the head of the device records the current consensus state root after each accepted block.
The host never decodes any of this. The host sees the virtio-blk device as an opaque blob of bytes.
Host ↔ Guest interface
The host exposes these guest-callable functions (host-callbacks):
- Logging / telemetry (debug only)
The guest exposes these host-callable functions:
- apply_block(prior_root: H256, block: Bytes) → (H256, Summary)
- query_endpoint(instance: H256, endpoint: u8, args: Bytes) → Bytes (for external-API dispatch)
- commit(root: H256) and rollback() (consensus signalling)
- health_check() → Status
The transport is Hyperlight-style function-call ABI plus virtio-blk for σ persistence. The external-API surface uses either virtio-net or a host-callback channel; that choice is deferred to Stage 1, based on latency measurements.
In-guest cap layout
// Microkernel objects, refcounted by the kernel.
Image (immutable, refcounted)
TypeDef (immutable, refcounted)
Instance {
    image: ImageHandle,
    cnode: CNodeHandle,
    state: InstanceState (small),
    cached_hash: Option<H256>,
}
CNode {
    slots: OrdMap<SlotIdx, SlotHandle>,
    cached_hash: Option<H256>,
}
DataCap {
    shards: Vec<ShardHandle>,
    page_table: OrdMap<u32, (shard_idx, page_offset)>,
    size: u64,
    cached_hash: Option<H256>,
}
DataCapShard {
    base_page: GuestPhysPageNum,
    n_pages: u32,
    content_hash: H256,
}
mgmt_copy of any cap kind = microkernel dup_handle: bump
refcount, mark all child page-table entries CoW (for DataCap
shards). Constant time.
Write-fault on a CoW page = in-guest #PF handler allocates a
fresh guest physical page, copies the content, repoints the faulting
mapping’s PTE at the new page and marks it writable, then returns
with iretq. Sub-µs.
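The copy-then-fault behaviour can be modelled in ordinary userspace Rust, where Rc plays the role of the kernel refcount and Rc::make_mut plays the role of the #PF handler. This is only a model under stated assumptions: the DataCap type and its methods here are hypothetical, there are no real page tables, and cloning the handle vector is linear in page count, unlike the constant-time dup_handle the design describes.

```rust
use std::rc::Rc;

const PAGE_SIZE: usize = 4096;
type Page = [u8; PAGE_SIZE];

/// Userspace model of a DataCap: a list of refcounted pages.
#[derive(Clone)]
struct DataCap {
    pages: Vec<Rc<Page>>,
}

impl DataCap {
    fn zeroed(n_pages: usize) -> Self {
        DataCap {
            pages: (0..n_pages).map(|_| Rc::new([0u8; PAGE_SIZE])).collect(),
        }
    }

    /// Models mgmt_copy: no page content is copied, only refcounts bump.
    fn mgmt_copy(&self) -> Self {
        self.clone()
    }

    /// Models a write hitting a possibly-shared page. Rc::make_mut
    /// clones the page only if another handle still references it,
    /// which is the copy-on-write step.
    fn write(&mut self, addr: usize, byte: u8) {
        let page = Rc::make_mut(&mut self.pages[addr / PAGE_SIZE]);
        page[addr % PAGE_SIZE] = byte;
    }

    fn read(&self, addr: usize) -> u8 {
        self.pages[addr / PAGE_SIZE][addr % PAGE_SIZE]
    }

    /// True if both caps still point at the same physical page.
    fn shares_page_with(&self, other: &DataCap, idx: usize) -> bool {
        Rc::ptr_eq(&self.pages[idx], &other.pages[idx])
    }
}
```

After a copy, a write through either handle privatises only the page it touches; untouched pages stay shared, which is why the cost of a snapshot scales with the pages written afterwards rather than with the cap size.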
State-root algorithm
fn state_root(cap: &Cap) -> H256:
    if let Some(h) = cap.cached_hash:
        return h
    h = match cap:
        Image | TypeDef | DataCapShard:
            blake2b(canonical_encode(cap))
        Instance { image, cnode, state }:
            blake2b(SCALE.encode {
                image_hash: state_root(image),
                cnode_hash: state_root(cnode),
                state: canonical_encode(state),
            })
        CNode { slots }:
            blake2b(SCALE.encode(
                slots.iter_sorted().map(|(slot, cap)|
                    (slot, state_root(cap)))))
        DataCap { shards, page_table, size }:
            blake2b(SCALE.encode {
                shard_hashes: shards.iter().map(state_root).collect(),
                page_table: page_table.clone(),
                size,
            })
    cap.cached_hash = Some(h)
    h
Mutation invalidates cached_hash on the mutated cap and all
ancestors up to the root. Walk dirty subtree on each state_root
call; reuse cached hashes elsewhere.
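A runnable miniature of the caching and invalidation rule, reduced to two cap kinds. It is a sketch under assumptions, not the real encoding: 64-bit FNV-1a stands in for blake2b, a hand-rolled little-endian concatenation stands in for SCALE, and the Cap/Node types, write, and the recomputed counter are hypothetical names added for illustration.

```rust
use std::collections::BTreeMap;

/// Stand-in for blake2b (assumption: 64-bit FNV-1a).
fn toy_hash(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for &b in bytes {
        h ^= u64::from(b);
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    h
}

enum Node {
    Data(Vec<u8>),
    CNode(BTreeMap<u32, Cap>), // BTreeMap iterates in ascending slot order
}

struct Cap {
    node: Node,
    cached_hash: Option<u64>,
}

impl Cap {
    fn data(bytes: &[u8]) -> Cap {
        Cap { node: Node::Data(bytes.to_vec()), cached_hash: None }
    }

    fn cnode(slots: Vec<(u32, Cap)>) -> Cap {
        Cap { node: Node::CNode(slots.into_iter().collect()), cached_hash: None }
    }

    /// Returns the hash of this subtree, reusing cached hashes.
    /// `recomputed` counts how many nodes were actually rehashed.
    fn state_root(&mut self, recomputed: &mut usize) -> u64 {
        if let Some(h) = self.cached_hash {
            return h;
        }
        let h = match &mut self.node {
            Node::Data(bytes) => toy_hash(bytes),
            Node::CNode(slots) => {
                let mut buf = Vec::new();
                for (slot, child) in slots.iter_mut() {
                    buf.extend_from_slice(&slot.to_le_bytes());
                    buf.extend_from_slice(&child.state_root(recomputed).to_le_bytes());
                }
                toy_hash(&buf)
            }
        };
        *recomputed += 1;
        self.cached_hash = Some(h);
        h
    }

    /// Overwrites the data cap at `path` and clears cached hashes on
    /// it and on every ancestor. Returns false if the path is invalid.
    fn write(&mut self, path: &[u32], bytes: &[u8]) -> bool {
        let ok = match (&mut self.node, path.split_first()) {
            (Node::Data(old), None) => {
                *old = bytes.to_vec();
                true
            }
            (Node::CNode(slots), Some((slot, rest))) => match slots.get_mut(slot) {
                Some(child) => child.write(rest, bytes),
                None => false,
            },
            _ => false,
        };
        if ok {
            self.cached_hash = None;
        }
        ok
    }
}
```

With a five-node tree, a write to one leaf two levels down forces three rehashes (the leaf, its parent CNode, and the root); the sibling subtrees are served from cache.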
Determinism analysis
Output (new_state_root, blob writes) must be byte-identical
across validators and across Arch × backend combinations
(nub-arch-hyperlight + recompiler vs nub-arch-local + interpreter vs
the future nub-arch-pci-soc + device-resident recompiler).
Requirements:
- Allocator doesn’t leak into encoded bytes. Cap encoded form contains only protocol-defined content.
- All hashing-path iteration is sorted by protocol-defined key. No HashMap iteration.
- No host clock observable to the guest. Block inputs + prior state root are the only host-supplied data.
- JIT codegen deterministic. Regression test: compile the same blob N times, assert byte-identical output. (Per-arch: x86-64 codegen on nub-arch-hyperlight, ARM64 on nub-arch-pci-soc.)
- Read-protocol return content-addressed (free).
- Microvm/device boot deterministic. Same initial state every time, regardless of Arch.
None novel. Same discipline every blockchain validator follows. The multi-Arch story sharpens it slightly: codegen determinism is now per-arch, and cross-Arch property tests are a hard correctness gate.
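Two of the requirements above lend themselves to a short sketch: sorted iteration on the hashing path, and the compile-N-times regression check. The function names encode_slots and is_deterministic are hypothetical, the encoding is a plain little-endian concatenation rather than the real canonical encoding, and the "compiler" is whatever closure the caller passes in.

```rust
use std::collections::BTreeMap;

/// Encodes (slot, hash) pairs in ascending slot order. BTreeMap
/// iteration is ordered by key, so the bytes do not depend on the
/// order in which slots were inserted (unlike HashMap iteration).
fn encode_slots(slots: &BTreeMap<u32, [u8; 32]>) -> Vec<u8> {
    let mut out = Vec::with_capacity(slots.len() * 36);
    for (slot, hash) in slots {
        out.extend_from_slice(&slot.to_le_bytes());
        out.extend_from_slice(hash);
    }
    out
}

/// Runs `compile` on the same blob `n` times and reports whether
/// every output was byte-identical to the first one.
fn is_deterministic<F>(mut compile: F, blob: &[u8], n: usize) -> bool
where
    F: FnMut(&[u8]) -> Vec<u8>,
{
    let first = compile(blob);
    (1..n).all(|_| compile(blob) == first)
}
```

Taking the compiler as an FnMut closure lets the same check run against each per-arch code generator, and also lets a test inject a deliberately stateful compiler to confirm the check fails when it should.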
Open questions for review
Which microvm runtime under nub-arch-hyperlight? Hyperlight is the strongest fit: ring-0 long mode, fast OUT-port + shared-memory function-call ABI (single VM-exit per call), KVM+WHP+MSHV abstraction. macOS not yet supported (no Hypervisor.framework backend); on the roadmap. Alternatives: raw kvm-ioctls plus per-OS shims (more work, more control); Firecracker (slower boot, but irrelevant for long-running model). Probably Hyperlight; reconsider after Stage 1 numbers.
Can we drop the custom Rust target? Hyperlight today uses x86_64-hyperlight-none plus cargo-hyperlight plus picolibc. For our pure-Rust microkernel we should be able to use stable x86_64-unknown-none plus RUSTFLAGS. Stage 1 spike: confirm that cargo build --target=x86_64-unknown-none -p microkernel-hyperlight produces a Hyperlight-loadable ELF. If yes, build-system surface shrinks dramatically. If no, we wear the custom-target cost.
Which guest framework, if any? Hyperlight’s Guest Library gives us the host-callback ABI for free. We’re not using it as an OS; we’re using it as a function-execution shell with our own paging and IDT. Compatible.
σ filesystem format. Roll our own LSM, port RocksDB, or something between. ~1–2 KLoC if we roll our own; “free” but heavyweight if we port. Probably roll our own.
virtio-blk vs Hyperlight’s host-callback ABI for σ I/O. virtio-blk wins on throughput (hundreds of thousands of IOPS via io_uring). Host-callback wins on simplicity. Stage 1 to measure.
Commit/rollback policy. mgmt_copy snapshot before apply, or host-buffered blocks. Cross-cutting with chain consensus design. Defer to chain orchestrator decisions.
Validator deployment story. Validators with hardware virt → nub-arch-hyperlight + recompiler. Without (cloud no-nested-virt) → nub-arch-local + interpreter. State-root equivalence across Arch × backend combinations is what makes this safe. Already in the doc; flag here as a deployment-guidance task.
nub-arch-pci-soc realism. Not for v1, but worth keeping the Arch trait shape compatible with: a PCIe accelerator card running the microkernel as firmware, accessed via MMIO command rings + DMA bulk transfer. Cross-CPU-arch codegen (ARM64) becomes necessary; the rest is an Arch impl swap. Don’t design for this, but don’t design against it either.
Decision history
2026-05: Commit to Stage 2. Stages 0, 1, and 1.5 are
complete. Every Stage 1 gate passed: ring 0 + long mode + custom
IDT + in-guest #PF handling + CR3 manipulation + CoW round-trip
all work on the i9-13900K target hardware, and the build-system
spike passed (stable x86_64-unknown-none + nub-build
replaces cargo-hyperlight and picolibc). Stage 1.5 lifted the
prototype into proper crates and proved the Arch boundary
through both backends.
Stage 2 (~1 month) splits javm-exec into
javm-interpreter + javm-recompiler-x86, wires the interpreter
through nub-kernel over Arch::Memory, and stands up the
in-guest σ filesystem. The decision point for Stage 3 (delete
the userspace JIT, run only in-guest) is the cross-Arch
determinism audit at the end of Stage 2.
Original Stage 0 decision text (preserved for archaeology): The original framing of this section asked whether to commit to a Stage 1 prototype as a small bet (~1 engineering-week of throwaway prototyping) for high information value about whether the architecture was worth pursuing. The bet was made; the information arrived; the answer was yes.