PVM2 portability: what assumptions does the recompiler make?

PVM2 today has exactly one JIT backend: x86-64 Linux running inside a Hyperlight microkernel. The interpreter (javm-exec::interp) is host-arch-independent and runs anywhere Rust does. But the production-path JIT in nub-arch-x86 makes a number of assumptions about its host arch; if we ever build a second backend (AArch64 is the obvious next target, RISC-V a more speculative one), we’ll have to satisfy each of those assumptions explicitly.

This document enumerates them and walks through how each holds on the three hosts we might plausibly target: x86-64 (today), AArch64 (likely next, both Linux servers and Apple Silicon), and RISC-V (future-proofing only — no concrete plan).

The headline finding: PVM2’s spec-level assumptions are all defensible across the targets we care about. The work in porting to a second host is dominated by rewriting the asm/codegen, not by spec incompatibilities.

Two categories of assumption

We separate them because they have different implications:

  • Spec-level assumption: encoded in the guest blob. Every conforming JIT must implement it. A change here is an ISA change (= breaking + chain-coordinated).
  • Host-implementation assumption: only the current x86 backend relies on it. A future port can solve it differently with no spec impact.

Spec-level assumptions

These are the contract between guest blobs and any conforming host. They show up in the wire format, the addressing model, or the encoding choices in rv64e-xjar-eei.md.

| # | Assumption | Where it shows up | Constrains what hosts? |
|---|---|---|---|
| S1 | 32-bit guest address space (4 GiB cap) | addr & 0xFFFFFFFF on every load/store (§Memory address space) | Any host that can provide 4 GiB of contiguous virtual address space (universal) |
| S2 | Little-endian memory | RV-inherited | Any LE host (x86, ARM LE, RV LE); universal in practice |
| S3 | No atomics, no concurrency within an invocation | max-atomic-width: 0, A extension reserved | Any single-threaded execution model; no host constraint |
| S4 | No floating point | F/D/Q/V reserved | Universal |
| S5 | Unaligned scalar loads/stores succeed | Zicclsm declared (§4.13) | Host must lower unaligned RV loads; see “Unaligned access” below |
| S6 | 15 logical registers (RV64E: x1, x2, x5..x15 + x3, x4) | RV64E reg file | Host MUST provide ≥ 13 GPRs; the 13 hot slots are register-resident, x3/x4 may be host-spilled (gas-charged at spill cost). See “Register floor” below |
| S7 | Code mapped read-only at CODE_BASE in the low 4 GiB; PC is a real virtual address | auipc/jalr resolution, PIC code reads | Host must map the code region RO inside the guest’s 4 GiB window (universal) |
| S8 | Branch/jalr targets are basic-block starts (ϖ); jalr validated at runtime | Linker invariant (branches) + runtime check (jalr) | No host implication |
| S9 | Memory layout in Image.memory_mappings is byte-granular but page-aligned | Transpiler emits 4 KiB-aligned regions today | Host page size; see “Page size” below |
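
To make S1 concrete, here is a minimal sketch of the address wrap that the `addr & 0xFFFFFFFF` rule implies. The function names are hypothetical and not taken from the PVM2 sources; the sketch only assumes that the effective address is computed in 64 bits (base plus a sign-extended offset) and then masked into the guest window.

```rust
/// Wrap a 64-bit effective address into the 32-bit guest address space (S1).
/// Hypothetical helper; the real backend emits the equivalent machine code.
fn guest_addr(effective: u64) -> u32 {
    (effective & 0xFFFF_FFFF) as u32
}

/// Compute base + sign-extended offset as an RV load/store would,
/// then apply the S1 mask.
fn effective_addr(base: u64, offset: i32) -> u32 {
    // i32 -> i64 sign-extends; i64 -> u64 reinterprets the bits.
    guest_addr(base.wrapping_add(offset as i64 as u64))
}
```

Under this reading, an access one byte past the top of the window wraps to address 0, and an access at offset -1 from address 0 wraps to 0xFFFFFFFF, so no guest access can name memory outside the 4 GiB region.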

Page size (S9)

The transpiler currently aligns mappings to 4 KiB. This is not a host-page-size assumption — it’s a transpiler layout choice. The production execution stack always interposes a microkernel between the PVM2 guest and the OS host:

┌────────────────────────────────────────────┐
│ host OS  (whatever page size it picks)     │
├────────────────────────────────────────────┤
│ hypervisor / VM monitor                    │
│   sets stage-2 (IPA → host PA) granule     │
├────────────────────────────────────────────┤
│ nub microkernel (we own this layer)        │
│   sets stage-1 (guest VA → IPA), picks     │
│   4 KiB pages, programs TTBR/CR3           │
├────────────────────────────────────────────┤
│ JIT'd PVM2 guest code                      │
└────────────────────────────────────────────┘

On x86 the nub-arch-x86 microkernel programs page tables directly with 4 KiB pages, regardless of the host OS’s choice. On AArch64 under Apple Silicon’s Hypervisor.framework, the same separation holds: TCR_EL1.TG0 is a guest register set by our nub, independent of the host’s stage-2 IPA granule. We can run 4 KiB stage-1 pages even when the macOS host uses 16 KiB pages and the IPA granule is 16 KiB. The hardware MMU walks both stages and produces correct translations.

Verified for AArch64-on-macOS in May 2026: Apple’s hv_vm_config_set_ipa_granule() (public API in macOS 26, private API previously) lets the host set the IPA granule to 4 KiB; the guest’s stage-1 page size is fully under our control regardless.

The performance footnote: 4 KiB stage-1 on Apple Silicon means more TLB pressure than the native 16 KiB. Bench-driven decision when we get there, not a portability blocker.
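
The TLB-pressure point can be illustrated with some page arithmetic. The helpers below are a hypothetical sketch, not part of the nub; they assume power-of-two page sizes and count how many translation entries a mapping needs at each granule.

```rust
const PAGE_4K: u64 = 4 * 1024;
const PAGE_16K: u64 = 16 * 1024;

/// True when `addr` sits on a `page` boundary (`page` must be a power of two).
fn is_aligned(addr: u64, page: u64) -> bool {
    addr & (page - 1) == 0
}

/// Number of pages of size `page` touched by the byte range [base, base + len).
fn pages_spanned(base: u64, len: u64, page: u64) -> u64 {
    if len == 0 {
        return 0;
    }
    let first = base / page;
    let last = (base + len - 1) / page;
    last - first + 1
}
```

For a 64 KiB region starting at 0x10000, `pages_spanned` gives 16 pages at 4 KiB and 4 pages at 16 KiB, which is the factor-of-four difference in translation entries behind the footnote. Note also that a 4 KiB-aligned mapping such as 0x1000 is not necessarily 16 KiB-aligned.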

Unaligned access (S5)

At the spec level we declare Zicclsm, meaning every conforming guest is allowed to emit unaligned scalar loads/stores. Hosts must lower these correctly; how fast they run is host-implementation territory.

  • x86-64: hardware-native unaligned. Free at cacheline-internal alignments; ~1 cycle for cacheline crossings. We get the full LLVM +unaligned-scalar-mem benefit (keccak code −44%, blake2b code −16%, cold times down 12–21% on hashers).

  • AArch64: hardware-native unaligned on cacheable memory. Some atomicity is lost on misaligned accesses (torn reads are possible), but we have no atomics so that’s moot.

  • RISC-V: implementation-dependent. The RVA22 and RVA23 profiles both mandate Zicclsm, so every application-class RISC-V core that conforms to either profile supports unaligned access. The spec warns it might be slow (a host may handle it in microcode or via M-mode trap-and-emulate), but it’s always correct. In practice:

    • SiFive U74 / U54 (HiFive Unmatched, VisionFive 2): hardware-fast.
    • Spacemit K1, JH7110: hardware-handled but multi-cycle.
    • First-gen RVA23 silicon (SiFive P570 Gen3, Tenstorrent Ascalon, Ventana Veyron V2) shipping mid-2026: hardware-fast by design.

    Pre-RVA22 embedded RISC-V (RP2350, BL808, K210) may trap-and-emulate or fault outright. We would never JIT to these: they lack the MMU, memory bandwidth, and code-cache reach the recompiler relies on for hosting anyway.

The trajectory is unambiguous: the application-processor RISC-V ecosystem is consolidating around RVA23 (Ubuntu 25.10 mandates it, Android-on-RISC-V targets it, NVIDIA’s CUDA RISC-V port targets it). Any host we’d realistically JIT to will support unaligned access.
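
As an illustration of what S5 (together with little-endian S2) asks of a host, the sketch below shows two semantically identical ways to perform an 8-byte guest load at an arbitrary byte offset. Both functions are hypothetical and operate on a plain byte slice standing in for guest memory; the first corresponds to a single unaligned scalar load, the second to the byte-load sequence a toolchain falls back to when unaligned access is not declared.

```rust
/// Little-endian 64-bit load at any byte offset (S2 + S5).
/// Returns None when the 8-byte window falls outside `mem`.
fn load_u64(mem: &[u8], off: usize) -> Option<u64> {
    let end = off.checked_add(8)?;
    let window = mem.get(off..end)?;
    let mut bytes = [0u8; 8];
    bytes.copy_from_slice(window);
    Some(u64::from_le_bytes(bytes))
}

/// The same load assembled from eight single-byte loads, lowest byte first.
fn load_u64_bytewise(mem: &[u8], off: usize) -> Option<u64> {
    let mut value = 0u64;
    for i in 0..8 {
        let byte = *mem.get(off.checked_add(i)?)?;
        value |= (byte as u64) << (8 * i);
    }
    Some(value)
}
```

Any conforming host has to produce the first function's result for misaligned offsets; whether it does so in one hardware access or through a slower path affects only performance.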

Register floor and x3/x4 spill (S6)

PVM2 has RV64E’s full 15 GPRs (x1, x2, x5..x15, plus x3, x4). A conforming host must provide at least 13 general-purpose registers for the guest file:

  • The 13 hot slots (x1, x2, x5..x15) are register-resident on every conforming host. They are what the jar toolchain emits, so conformant guest code is fully register-mapped everywhere.
  • x3 and x4 are not guaranteed register-resident. A host with exactly 13 free GPRs (today’s x86-64 backend, after reserving registers for the gas meter, scratch, and the native stack) holds them in memory and spills on each access. A host with ≥ 15 free GPRs (AArch64, RISC-V) may keep them resident with no spill.

Because the worst-case conforming host spills them, x3/x4 accesses are gas-charged at memory-spill cost unconditionally — on every host, whether or not it actually spills (see gas-cost.md). Gas is a spec-fixed upper bound, so a register-rich host runs faster than charged without affecting consensus. This keeps the register count a conformance property (any valid RV64E blob runs) while pinning the host requirement at a portable floor of 13.

The jar toolchain does not emit x3/x4, so this spill path is reached only by untrusted RV64E blobs — never by jar’s own guests, whose hot paths stay fully register-mapped on every host.
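
The register-floor rule can be sketched as a small allocation function. Everything here is hypothetical (the `Slot` type, the slot numbering, and the `SPILL_GAS` value are illustrative and not taken from gas-cost.md or the backend); the sketch only encodes the rules stated above: 13 hot registers are always resident, x3/x4 are resident only when the host has at least 15 free GPRs, and the extra gas for x3/x4 does not depend on the host.

```rust
/// Where a guest register lives on the host (hypothetical representation).
#[derive(Debug, PartialEq)]
enum Slot {
    Zero,         // x0: hard-wired zero, needs no storage
    Resident(u8), // index into the host's free GPR pool
    Spilled(u8),  // index into a memory spill area
}

/// Illustrative surcharge for an x3/x4 access; the real value is spec-defined.
const SPILL_GAS: u64 = 1;

/// Map the 13 hot guest registers (x1, x2, x5..x15) to pool indices 0..=12.
fn hot_index(x: u8) -> Option<u8> {
    match x {
        1 | 2 => Some(x - 1),  // x1 -> 0, x2 -> 1
        5..=15 => Some(x - 3), // x5 -> 2, ..., x15 -> 12
        _ => None,
    }
}

/// Assign guest register `x` on a host with `free_host_gprs` available.
/// Returns None for a non-conforming host (< 13) or an invalid register.
fn assign(x: u8, free_host_gprs: u8) -> Option<Slot> {
    if free_host_gprs < 13 || x > 15 {
        return None;
    }
    if x == 0 {
        return Some(Slot::Zero);
    }
    if let Some(i) = hot_index(x) {
        return Some(Slot::Resident(i));
    }
    let i = x - 3; // x3 -> 0, x4 -> 1
    if free_host_gprs >= 15 {
        Some(Slot::Resident(13 + i))
    } else {
        Some(Slot::Spilled(i))
    }
}

/// Extra gas for touching `x`: charged for x3/x4 on every host.
fn extra_gas(x: u8) -> u64 {
    if x == 3 || x == 4 { SPILL_GAS } else { 0 }
}
```

The point of the design is visible in the signatures: `assign` takes the host's register count, while `extra_gas` does not, so the charged cost is identical whether or not a spill actually happens.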

Host-implementation assumptions (nub-arch-x86 today)

These are properties the current x86 backend exploits. A second backend would have to satisfy each, but how it does so is local — no spec change, no impact on the guest blob format.

| # | Assumption | x86-64 today | AArch64 cost | RISC-V cost |
|---|---|---|---|---|
| H1 | Host can mmap the low 4 GiB at fixed VAs | MAP_FIXED under Hyperlight; PVM addr == native VA | Same trick works | Same trick works |
| H2 | Hardware MMU with mprotect-style perms | mprotect / EPT | mprotect / stage-2 perms | mprotect / PMP (M-mode) |
| H3 | Trap-on-fault with recoverable PC (SIGSEGV / #PF) | SIGSEGV handler reads RIP from mcontext | SIGSEGV reads PC from mcontext; same model | RV trap frame reads sepc; same model |
| H4 | Coherent I-cache (no flush after writing JIT code) | Free on x86 | Must add dc cvau; ic ivau; dsb; isb after each code emission | Must add fence.i after each code emission |
| H5 | TSO / strong memory ordering for context fields (gas, exit_reason) | Free on x86 | May need dmb at exit points | May need fence at exit points |
| H6 | Native CALL/RET use a single downward stack | RSP, host_rsp_base | SP + LR (similar) | SP + RA (similar) |
| H7 | Sys V AMD64 ABI for helper-fn calls | Built into asm.rs | Needs AAPCS port | Needs RV ABI port |
| H8 | ±2 GiB RIP-relative reach for CTX_VA | x86 RIP-rel disp32 | AArch64 ADRP+ADD (±4 GiB): works | RV AUIPC+offset (±2 GiB): works |
| H9 | Hand-written x86-64 machine code emitter | asm.rs ~2 KLOC, codegen.rs ~3 KLOC | Full rewrite per arch | Full rewrite per arch |
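
The reach figures in H8 follow from the instruction encodings, and a port can check them with a few lines of arithmetic. The functions below are a hypothetical sketch (none of these names exist in asm.rs); they assume the standard encodings: a signed 32-bit displacement for x86 RIP-relative addressing, a signed 21-bit page delta for AArch64 ADRP with the low 12 bits supplied by ADD, and a signed 20-bit upper immediate for RISC-V AUIPC with a signed 12-bit low part.

```rust
/// True when `value` fits in a two's-complement field of `bits` bits.
fn fits_signed(value: i64, bits: u32) -> bool {
    let half = 1i64 << (bits - 1);
    value >= -half && value < half
}

/// x86-64: disp32 is relative to the address of the next instruction.
fn x86_rip_rel_ok(next_ip: u64, target: u64) -> bool {
    fits_signed(target.wrapping_sub(next_ip) as i64, 32)
}

/// AArch64: ADRP adds a signed 21-bit count of 4 KiB pages to the PC's page;
/// the following ADD supplies the low 12 bits of the target.
fn aarch64_adrp_add_ok(pc: u64, target: u64) -> bool {
    let page_delta = (target >> 12) as i64 - (pc >> 12) as i64;
    fits_signed(page_delta, 21)
}

/// RISC-V: AUIPC adds (hi20 << 12) to the PC, and the next instruction adds
/// a signed 12-bit low part, so hi20 is rounded by 0x800 before shifting.
fn riscv_auipc_ok(pc: u64, target: u64) -> bool {
    let delta = target.wrapping_sub(pc) as i64;
    match delta.checked_add(0x800) {
        Some(rounded) => fits_signed(rounded >> 12, 20),
        None => false,
    }
}
```

Because the RISC-V low part is signed, its window is shifted down by 0x800 bytes relative to the x86 one; a backend that places CTX_VA close to the edge of the ±2 GiB range would need to account for that.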

The dominant cost in any port is H9: rewriting the asm emitter and per-opcode codegen. The “silent killers” are H4 and H5. On AArch64 or RISC-V, omitting the I-cache synchronisation after JIT emission lets the core execute stale instructions, which shows up as intermittent failures that look like miscompiles or memory corruption; omitting a memory barrier on context writes lets an observer see stale or reordered values of gas / exit_reason, which looks like a scheduler bug. These are easy to forget because x86 hides them.
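
For the host-side half of H5, Rust's atomics already express the required ordering portably. The sketch below is an assumption-laden illustration, not the actual context layout: it assumes the context is observed by a different host thread than the one that ran the guest, that `exit_reason == 0` means "still running", and it uses the hypothetical names `ExitCtx`, `publish_exit`, and `observe_exit`. JIT-emitted code would need the equivalent barrier instruction at its exit points.

```rust
use std::sync::atomic::{AtomicU32, AtomicU64, Ordering};

/// Hypothetical stand-in for the context fields named in H5.
struct ExitCtx {
    gas: AtomicU64,
    exit_reason: AtomicU32, // assumption: 0 means "no exit yet"
}

impl ExitCtx {
    fn new() -> Self {
        ExitCtx { gas: AtomicU64::new(0), exit_reason: AtomicU32::new(0) }
    }

    /// Writer side: store gas first, then publish the reason with Release so
    /// the gas store cannot be reordered after it.
    fn publish_exit(&self, gas_left: u64, reason: u32) {
        self.gas.store(gas_left, Ordering::Relaxed);
        self.exit_reason.store(reason, Ordering::Release);
    }

    /// Reader side: an Acquire load of the reason guarantees that the gas
    /// value read afterwards is the one written before the matching Release.
    fn observe_exit(&self) -> Option<(u32, u64)> {
        let reason = self.exit_reason.load(Ordering::Acquire);
        if reason == 0 {
            return None;
        }
        Some((reason, self.gas.load(Ordering::Relaxed)))
    }
}
```

On x86-64 the Release/Acquire pair compiles to plain loads and stores, which is why the current backend gets this for free; on weakly ordered hosts the compiler inserts the ordering instructions the table refers to.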

Per-arch port readiness

x86-64 — current target

Everything works. Production path.

AArch64 — most likely second target

Spec-level: clean. The unaligned-access and 4 KiB-page concerns both dissolve as detailed above.

Host-level: the rewrite of asm.rs and codegen.rs is the bulk of the work. The I-cache flush and memory-barrier additions (H4, H5) are small but mandatory. Apple Silicon under Hypervisor.framework needs the IPA-granule plumbing (4 KiB on macOS 26+; 16 KiB IPA with 4 KiB stage-1 on older macOS — both work).

Estimated effort: 3–6 weeks for a skilled JIT engineer to land a working backend, plus another similar period for performance parity with x86.

RISC-V — speculative future target

RV→RV translation is the niche case. Possible scenarios:

  • A RISC-V validator wanting to run PVM2 chain code natively. The host RISC-V (RVA23+) supports everything the guest needs.
  • A “RISC-V everywhere” deployment where we want to avoid the emulator-translator-host stack and have the host arch match the guest arch.

Spec-level: clean, contingent on host being RVA22+ (= Zicclsm guaranteed). Pre-RVA22 embedded RISC-V is not a target.

Host-level: same H1–H9 list as AArch64, with RISC-V instead of ARMv8 mechanics. The interesting bit is H9: emitting RV machine code from an RV-based source. Much of the codegen could lower 1:1 (register-to-register moves, ALU ops) since both guest and host share the same instruction set; the work shifts toward managing the host’s view of “guest memory” (since you can’t rely on the host’s MMU configuring a 1:1 low-4-GiB map identical to x86’s setup).

Estimated effort: 4–8 weeks for the JIT, with the host-OS integration likely the bigger unknown (less mature tooling than on x86 / AArch64).

What this means for current decisions

The portability audit feeds back into a couple of present-day spec choices:

  1. Zicclsm in the PVM2 ISA: justified by S5 + the RVA22/RVA23 trajectory. The cost of declaring it (zero on x86 / AArch64; slight slowdown on theoretical pre-RVA22 RV hosts that we’d never deploy on) is dominated by the gain (LLVM emits dense unaligned scalar loads instead of byte-load sequences).

  2. Memory mapping byte granularity vs page units: the current transpiler emits 4 KiB-aligned byte addresses. If we ever want to support hosts where the nub itself needs to run with non-4-KiB pages (e.g., a future “everything in 16 KiB pages” nub-arch-aarch64-mac variant for TLB-perf reasons), we’d want the spec to express layouts in abstract pages with the JIT applying the host page size. Not required today — every target we’ve examined can run the nub with 4 KiB stage-1 pages — but worth flagging as a clean follow-on if we ever measure significant TLB pressure on a new host.

  3. asm.rs modularity: the eventual second backend will exercise how cleanly the codegen separates from the x86-specific instruction emission. If we know we’re going to port to AArch64, mid-term refactoring of codegen.rs to abstract the per-arch primitives (load-imm-into-reg, branch, memory access with bounds check) would amortise the port cost. No urgency until we commit to the second backend.
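
One possible shape for the refactoring in item 3 is a trait that names the per-arch primitives, with the arch-independent lowering written against it. This is a sketch only: `GuestOp`, `ArchPrimitives`, `lower`, and `Recorder` are hypothetical names, the three primitives are the ones listed above, and the recording implementation stands in for a real emitter so that the lowering can be exercised without producing machine code.

```rust
/// Hypothetical, heavily reduced guest operation set.
#[derive(Debug, Clone, Copy)]
enum GuestOp {
    LoadImm { rd: u8, imm: u64 },
    Load { rd: u8, base: u8, offset: i32 },
    Jump { target: u32 },
}

/// The per-arch primitives a backend would have to provide.
trait ArchPrimitives {
    fn load_imm(&mut self, reg: u8, imm: u64);
    /// Memory access including whatever bounds check the backend needs.
    fn load_checked(&mut self, dst: u8, base: u8, offset: i32);
    fn branch(&mut self, target: u32);
}

/// Arch-independent lowering: walks guest ops and calls the primitives.
fn lower<A: ArchPrimitives>(ops: &[GuestOp], arch: &mut A) {
    for op in ops {
        match *op {
            GuestOp::LoadImm { rd, imm } => arch.load_imm(rd, imm),
            GuestOp::Load { rd, base, offset } => arch.load_checked(rd, base, offset),
            GuestOp::Jump { target } => arch.branch(target),
        }
    }
}

/// Test double that records a textual trace instead of emitting bytes.
#[derive(Default)]
struct Recorder {
    log: Vec<String>,
}

impl ArchPrimitives for Recorder {
    fn load_imm(&mut self, reg: u8, imm: u64) {
        self.log.push(format!("li x{}, {}", reg, imm));
    }
    fn load_checked(&mut self, dst: u8, base: u8, offset: i32) {
        self.log.push(format!("ld x{}, {}(x{})", dst, offset, base));
    }
    fn branch(&mut self, target: u32) {
        self.log.push(format!("j {:#x}", target));
    }
}
```

A side benefit of this split is that the lowering logic becomes testable on any development machine through a recording backend, independent of which real emitters exist.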