PVM2 — per-instruction gas cost table

PVM2’s gas accounting uses the single-pass pipeline model. The canonical definition is the Rust source (gas_sim.rs + gas_cost.rs) plus this reference doc. An older Lean formalization under spec/Jar/JAVM/ is stale (it predates microkernel v3 and the 15-register file) and is not authoritative.

For each basic block, predecode walks the instructions once, feeding each through GasSimulator; the final block cost is max(max_done − 3, 1). This block cost is the #1 (execution) component only, and it is charged once per block, at block entry. The charging discipline (pre-reservation, OOG never charges, gas never goes negative) is specified in gas-cost.md §1. Before it is charged, the block cost is further scaled by the #2 memory-access-latency footprint multiplier (×1–4); that multiplier is defined in gas-cost.md §2.

The per-instruction cost table (rv_fast_cost) and the block driver (rv_gas_cost_for_block) live in rust/javm-exec/src/gas_cost.rs, next to PVM’s table, so when PVM is eventually retired the file shrinks rather than fragments.

The Lean spec for an earlier flat-per-instruction-sum PVM2 gas model (spec/Jar/JAVM/GasCostPVM2.lean) was removed; the canonical PVM2 gas table is now the Rust source plus this reference doc.

Pipeline model (single-pass)

State per block: reg_done[15] (cycle when each register’s value is ready) + cycle (current decode cycle) + decode_used (slots consumed this cycle) + max_done (running max completion).

For each instruction in program order:

  1. Decode throughput. If decode_used >= 4, advance cycle by 1 and reset decode_used = decode_slots. Else decode_used += decode_slots.
  2. Move-reg fast path. is_move_reg instructions propagate reg_done[dst] = reg_done[src] and skip the rest.
  3. Data dependency. start = max(cycle, max(reg_done[r] for r in src_mask)).
  4. Completion. done = start + cycles.
  5. Write back. reg_done[r] = done for every r in dst_mask.
  6. Track. max_done = max(max_done, done).

Block cost = max(max_done − 3, 1). The “−3” normalizes the steady-state pipeline depth (3 stages: decode, dispatch, execute).

Each instruction’s FastCost carries:

  • cycles — execution latency
  • decode_slots — front-end cost (1–4)
  • exec_unit — informational only; the single-pass model doesn’t model EU contention (decode throughput subsumes ALU/LOAD/STORE contention; MUL/DIV are rare enough that latency already serializes them via data deps)
  • src_mask / dst_mask — 15-bit register bitmasks (PVM2’s 15 architectural regs map to bits 0..14, ordered x1, x2, x5..x15, then x3, x4 in the two high slots)
  • is_terminator — ends the block; nothing is decoded after it
  • is_move_reg — handled in the decode-cycle path; propagates reg_done without consuming a ROB-equivalent slot
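The six steps and the block-cost formula above can be sketched in Rust as follows. This is a minimal illustration, not the code in gas_sim.rs: the names CostSketch, SimSketch, and block_cost_of are hypothetical, the move-reg path assumes exactly one source and one destination bit, and step 1 is transcribed literally from the text.

```rust
/// Illustrative cost record. Field names follow this document; the real
/// `FastCost` in gas_cost.rs may be laid out differently (assumption).
#[derive(Clone, Copy)]
pub struct CostSketch {
    pub cycles: u32,
    pub decode_slots: u32,
    pub src_mask: u16,
    pub dst_mask: u16,
    pub is_move_reg: bool,
}

/// Illustrative stand-in for the per-block simulator state.
pub struct SimSketch {
    reg_done: [u32; 15],
    cycle: u32,
    decode_used: u32,
    max_done: u32,
}

impl SimSketch {
    pub fn new() -> Self {
        SimSketch { reg_done: [0; 15], cycle: 0, decode_used: 0, max_done: 0 }
    }

    pub fn feed(&mut self, c: &CostSketch) {
        // 1. Decode throughput, exactly as worded in step 1.
        if self.decode_used >= 4 {
            self.cycle += 1;
            self.decode_used = c.decode_slots;
        } else {
            self.decode_used += c.decode_slots;
        }
        // 2. Move-reg fast path: copy readiness and skip the rest.
        //    Assumes a single src bit and a single dst bit.
        if c.is_move_reg {
            let src = c.src_mask.trailing_zeros() as usize;
            let dst = c.dst_mask.trailing_zeros() as usize;
            if src < 15 && dst < 15 {
                self.reg_done[dst] = self.reg_done[src];
            }
            return;
        }
        // 3. Data dependency: wait for the slowest source register.
        let mut start = self.cycle;
        for r in 0..15usize {
            if c.src_mask & (1u16 << r) != 0 {
                start = start.max(self.reg_done[r]);
            }
        }
        // 4. Completion.
        let done = start + c.cycles;
        // 5. Write back to every destination register.
        for r in 0..15usize {
            if c.dst_mask & (1u16 << r) != 0 {
                self.reg_done[r] = done;
            }
        }
        // 6. Track the running maximum.
        self.max_done = self.max_done.max(done);
    }

    /// max(max_done - 3, 1); saturating_sub keeps the unsigned math safe
    /// when max_done is below 3.
    pub fn block_cost(&self) -> u32 {
        self.max_done.saturating_sub(3).max(1)
    }
}

/// Runs one block through the sketch and returns its cost.
pub fn block_cost_of(costs: &[CostSketch]) -> u32 {
    let mut sim = SimSketch::new();
    for c in costs {
        sim.feed(c);
    }
    sim.block_cost()
}
```

Under this sketch, a 25-cycle load followed by a dependent 1-cycle add completes at cycle 26, so the block costs max(26 − 3, 1) = 23. A lone 1-cycle add completes at cycle 1 and is clamped to the minimum cost of 1.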

Why single-pass (and not the full ROB / EU-contention model in GasCostFull.lean):

  • Speed. O(n) per block, one pass over the instructions, no inner cycle-by-cycle loop.
  • No slot limit. Real PVM2 basic blocks can have 100+ RV instructions (crypto inner loops). A 32-entry ROB simulator would need slot recycling and is a lot more code.
  • EU contention subsumed. The 4-wide decode bandwidth already caps the rate at which ALU/LOAD/STORE-class ops can begin. The only contended units in a ROB model are MUL/DIV, and those are rare and high-latency enough that data dependencies serialize them in practice.

Per-instruction cost table

The columns below are: op (PVM2 mnemonic), cycles, decode_slots (range when overlap-dependent), EU, src (which RvInst fields contribute to src_mask), dst (likewise for dst_mask), term (terminator flag), PVM equiv (the PVM byte opcode this row mirrors, if any).

Loads and stores

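| op | cy | dec | EU | src | dst | term | PVM equiv |
|---|---|---|---|---|---|---|---|
| lb / lh / lw / ld / lbu / lhu / lwu | 25 | 1 | LOAD | rs1 | rd | | PVM 52..58 / 124..130 |
| sb / sh / sw / sd | 25 | 1 | STORE | rs1, rs2 | | | PVM 59..62 / 120..123 |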

Upper / load-immediate

| op | cy | dec | EU | src | dst | term | PVM equiv |
|---|---|---|---|---|---|---|---|
| lui | 1 | 2 | NONE | | rd | | PVM 20 load_imm_64 |

(addi rd, x0, imm — used by the linker as the small-immediate load — bills as the general addi row below; there’s no special fast-path.)

Integer ALU (64-bit)

decode_slots = 1 when rd overlaps a source register, else 2 (matches PVM’s overlap rule).

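| op | cy | dec | EU | src | dst | term | PVM equiv |
|---|---|---|---|---|---|---|---|
| add, sub, and, or, xor | 1 | 1-2 | ALU | rs1, rs2 | rd | | PVM 200/201/210/211/212 |
| addi, andi, ori, xori, sltiu, slli, srli, slti, srai | 1 | 1-2 | ALU | rs1 | rd | | PVM 132/133/134/149/151..153/158/110 |
| sll, srl, sra (shifts, dec rule = rs1 == rd) | 1 | 2-3 | ALU | rs1, rs2 | rd | | PVM 207/208/209 |
| slt, sltu | 3 | 3 | ALU | rs1, rs2 | rd | | PVM 216/217 |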
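The overlap rule for the rows listed as 1-2 can be sketched as a small helper. The function name and signature are hypothetical (not taken from gas_cost.rs), and the sketch covers only the plain "rd overlaps a source" rule, not the rs1 == rd variant used by the shift rows.

```rust
/// Decode slots under the overlap rule stated above: 1 when rd is also a
/// source register, else 2. `rs2` is None for immediate forms such as addi.
/// Hypothetical helper for illustration only.
pub fn overlap_decode_slots(rd: u8, rs1: u8, rs2: Option<u8>) -> u32 {
    let overlaps = rd == rs1 || rs2 == Some(rd);
    if overlaps { 1 } else { 2 }
}
```

For example, add x5, x5, x6 overlaps on rs1 and takes 1 slot, while add x5, x6, x7 has no overlap and takes 2.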

Integer ALU (32-bit)

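| op | cy | dec | EU | src | dst | term | PVM equiv |
|---|---|---|---|---|---|---|---|
| addw, subw | 2 | 2-3 | ALU | rs1, rs2 | rd | | PVM 190/191 |
| sllw, srlw, sraw (dec rule = rs1 == rd) | 2 | 3-4 | ALU | rs1, rs2 | rd | | PVM 197/198/199 |
| addiw, slliw, srliw, sraiw | 2 | 2-3 | ALU | rs1 | rd | | PVM 131/138/139/140/160 |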

M extension (multiply / divide)

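| op | cy | dec | EU | src | dst | term | PVM equiv |
|---|---|---|---|---|---|---|---|
| mul | 3 | 1-2 | MUL | rs1, rs2 | rd | | PVM 202 |
| mulw | 4 | 2-3 | MUL | rs1, rs2 | rd | | PVM 192 |
| mulh, mulhu | 4 | 4 | MUL | rs1, rs2 | rd | | PVM 213/214 |
| mulhsu | 6 | 4 | MUL | rs1, rs2 | rd | | PVM 215 |
| div, divu, rem, remu + W-variants | 60 | 4 | DIV | rs1, rs2 | rd | | PVM 193..196/203..206 |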

Zbb — basic bit manipulation

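| op | cy | dec | EU | src | dst | term | PVM equiv |
|---|---|---|---|---|---|---|---|
| clz, clzw, cpop, cpopw, sext.b, sext.h, zext.h, rev8, orc.b | 1 | 1 | ALU | rs1 | rd | | PVM 102..105/108/109/111 |
| ctz, ctzw | 2 | 1 | ALU | rs1 | rd | | PVM 106/107 |
| min, minu, max, maxu | 3 | 2-3 | ALU | rs1, rs2 | rd | | PVM 227..230 |
| andn, orn | 2 | 3 | ALU | rs1, rs2 | rd | | PVM 224/225 |
| xnor | 2 | 2-3 | ALU | rs1, rs2 | rd | | PVM 226 |
| rol, ror (dec rule = rs1 == rd) | 1 | 2-3 | ALU | rs1, rs2 | rd | | PVM 220/222 |
| rori | 1 | 1-2 | ALU | rs1 | rd | | PVM 158 |
| rolw, rorw | 2 | 3-4 | ALU | rs1, rs2 | rd | | PVM 221/223 |
| roriw | 2 | 2-3 | ALU | rs1 | rd | | (PVM has only 64-bit imm rot) |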

Zba — shift-add

x86’s LEA family folds these into one cycle.

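| op | cy | dec | EU | src | dst | term | PVM equiv |
|---|---|---|---|---|---|---|---|
| sh1add, sh2add, sh3add, sh1add.uw, sh2add.uw, sh3add.uw, add.uw | 1 | 1-2 | ALU | rs1, rs2 | rd | | (no PVM equiv) |
| slli.uw | 1 | 1-2 | ALU | rs1 | rd | | (no PVM equiv) |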

Zbs — single-bit

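| op | cy | dec | EU | src | dst | term | PVM equiv |
|---|---|---|---|---|---|---|---|
| bclr, bset, binv, bext | 1 | 1-2 | ALU | rs1, rs2 | rd | | (no PVM equiv) |
| bclri, bseti, binvi, bexti | 1 | 1-2 | ALU | rs1 | rd | | (no PVM equiv) |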

Zicond — conditional move

| op | cy | dec | EU | src | dst | term | PVM equiv |
|---|---|---|---|---|---|---|---|
| czero.eqz, czero.nez | 2 | 2 | ALU | rs1, rs2 | rd | | PVM 218/219 cmov_* |

Control flow

| op | cy | dec | EU | src | dst | term | PVM equiv |
|---|---|---|---|---|---|---|---|
| jal rd, imm (static jump / linker-emitted body of call) | 15 | 1 | ALU | | rd (if rd ≠ 0) | | PVM 40 jump |
| jalr rd, rs1, imm (indirect jump / return; replaced br_table) | 22 | 1 | ALU | rs1 | none (rd = return PC, untracked) | | PVM 50 jump_ind |
| beq, bne, blt, bge, bltu, bgeu | 20 | 1 | ALU | rs1, rs2 | | | PVM 170..175 |

Note: PVM has a 1-cycle fast-path for branches whose target is opcode 0 (trap) or 2 (unlikely). The PVM2 linker rewrites trap targets to direct traps, so this fast-path rarely fires; we use a flat 20 (the PVM default for “live” branches).

Custom-0 (PVM2-specific)

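| op | cy | dec | EU | src | dst | term | PVM equiv |
|---|---|---|---|---|---|---|---|
| trap (funct3=000) | 2 | 1 | NONE | | | | PVM 0 trap |
| ecall.jar (funct3=001) | 100 | 4 | ALU | | | | PVM 3 ecall |
| ecalli imm (funct3=010) | 100 | 4 | ALU | | | | PVM 10 ecalli |
| fallthrough (funct3=100) | 2 | 1 | NONE | | | | PVM 1 fallthrough |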

Note: funct3=011 (the former br_table) is reserved — see “Reserved encoding” below. PVM2 routes indirect jumps through plain jalr (see “Control flow” above), not a custom table op.

Note: ecall.jar and ecalli each form their own gas block (a forced block start) and are charged dynamically by the kernel, not through the static per-block preamble, because their true cost (a CALL frame, a host op’s work) is unknowable at compile time. The 100-cycle row above is only the static instruction component (a floor); the kernel adds the actual materialization cost at runtime, and an out-of-gas condition re-attempts at the ecall’s own PC. For an in-kernel CALL, this dynamic charge now also includes the callee’s JIT compile (O(code), per declared code page) and its eager read-only page-in (one charge per declared 2 MiB unit), computed statically from the callee Image. See gas-cost.md §3 “ecall/ecalli blocks”.

Fences (no-op)

| op | cy | dec | EU | src | dst | term | PVM equiv |
|---|---|---|---|---|---|---|---|
| fence, fence.i | 1 | 1 | NONE | | | | (no PVM equiv; PVM2 single-thread) |

Reserved encoding

If a reserved encoding is reached at runtime, it traps. The simulator still needs a cost row so that a pathological program cannot be “free”.

| op | cy | dec | EU | src | dst | term | PVM equiv |
|---|---|---|---|---|---|---|---|
| Reserved (any) | 2 | 1 | NONE | | | | |

Register mapping

PVM2 has 15 architectural registers (RV64E’s full GPR file): x1, x2, x5..x15, plus x3, x4 (with x0 = zero). They map to FastCost::src_mask / dst_mask bits as follows:

| RV reg | role | mask bit |
|---|---|---|
| x0 | zero | (no bit; reads contribute no dep, writes have no effect) |
| x1 (ra) | return-idx after linker call-rewrite | 0 |
| x2 (sp) | stack pointer | 1 |
| x5 (t0) | general | 2 |
| x6 (t1) | general | 3 |
| x7 (t2) | general | 4 |
| x8 (s0/fp) | general | 5 |
| x9 (s1) | general | 6 |
| x10 (a0) | arg / return value | 7 |
| x11 (a1) | arg | 8 |
| x12 (a2) | arg | 9 |
| x13 (a3) | arg | 10 |
| x14 (a4) | arg | 11 |
| x15 (a5) | arg | 12 |
| x3 (gp) | general (host-spilled; see below) | 13 |
| x4 (tp) | general (host-spilled; see below) | 14 |

x3/x4 occupy the two high slots so the 13 commonly-used registers keep mask bits 0..12 unchanged — every conformant program (which never references x3/x4) produces bit-for-bit identical masks, and therefore identical gas, as before the file grew to 15.
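The register-to-bit mapping above can be written as a small lookup. The function names mask_bit and mask_of are hypothetical and are not claimed to match anything in gas_cost.rs; the sketch only encodes the table as given.

```rust
/// Maps an RV64E register number (the N in xN) to its mask bit, following
/// the table above. Returns None for x0 and for numbers outside x0..x15.
/// Hypothetical helper for illustration only.
pub fn mask_bit(x: u8) -> Option<u8> {
    match x {
        1 => Some(0),          // x1 (ra)
        2 => Some(1),          // x2 (sp)
        5..=15 => Some(x - 3), // x5..x15 map to bits 2..12
        3 => Some(13),         // x3 (gp), high slot
        4 => Some(14),         // x4 (tp), high slot
        _ => None,             // x0 contributes no bit
    }
}

/// Builds a 15-bit mask from a list of register numbers, skipping x0.
pub fn mask_of(regs: &[u8]) -> u16 {
    regs.iter()
        .filter_map(|&r| mask_bit(r))
        .fold(0u16, |m, b| m | (1u16 << b))
}
```

A program that never names x3 or x4 only ever sets bits 0..12, which is why its masks are unchanged by the move to a 15-register file.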

x3/x4 spill cost

x3 and x4 are real registers the runtime executes, but they are not guaranteed host-register-resident: a host with exactly 13 registers (today’s x86-64 JIT) holds them in memory and spills on each access (see Register model in rv64e-xjar-eei.md and the host contract in portability.md).

Because the worst-case conforming host spills them, the gas model charges the memory-spill cost for every x3/x4 operand unconditionally, on every host, whether or not it actually spills. Gas is a spec-fixed upper bound, so a register-rich host runs faster than charged without affecting consensus. Concretely: each instruction operand position (rs1, rs2, rd) that names x3 or x4 adds mem_cycles (the LOAD/STORE latency, default 25) to that instruction’s cycles before it is fed to the simulator, matching the load/op/store sequence the recompiler’s spill path emits. An instruction that touches no x3/x4 operand is charged exactly as its table row above.
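The per-operand surcharge can be sketched as below. The function name, the operand encoding (one Option per rs1/rs2/rd position), and the constant name are assumptions made for illustration; footprint scaling of mem_cycles is deliberately left out.

```rust
/// Default LOAD/STORE latency quoted in this document.
pub const MEM_CYCLES_DEFAULT: u32 = 25;

/// Adds `mem_cycles` once per operand position (rs1, rs2, rd) that names
/// x3 or x4. Positions an instruction does not have are passed as None.
/// Hypothetical helper for illustration only.
pub fn cycles_with_spill(base_cycles: u32, operands: &[Option<u8>], mem_cycles: u32) -> u32 {
    let spilled = operands
        .iter()
        .filter(|op| matches!(**op, Some(3) | Some(4)))
        .count() as u32;
    base_cycles + spilled * mem_cycles
}
```

Under this reading, add x3, x3, x5 names x3 in two positions (rs1 and rd), so its 1-cycle row becomes 1 + 2 × 25 = 51 cycles, while add x5, x6, x7 stays at 1.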

Footprint scaling (#2). The mem_cycles = 25 here is the base LOAD/STORE latency at the smallest footprint tier (×1). It is scaled by the memory-access footprint multiplier (×1–4, from the Instance’s total accessible pages) before the block cost is charged — so a memory-heavy block in a large declared footprint pays proportionally more. The multiplier and tiers are normative in gas-cost.md §2; the x3/x4 spill charge above scales with it like any other memory operand.

Reasoning for the non-PVM rows

Some PVM2 instructions have no PVM byte-opcode equivalent (notably Zba/Zbs and the 32-bit Zbb rotates). Their costs were chosen to match the underlying x86 native cost the recompiler emits:

  • Zba shift-add maps directly to a 1-cycle LEA form on x86, so it is charged as ALU, 1 cycle, with decode_slots following the overlap rule.
  • Zbs single-bit lowers to BMI2 bzhi/pdep-class instructions on x86 (single-cycle ALU on Skylake+).
  • Zbb 32-bit rotates (rolw, rorw, roriw) are symmetric with their 64-bit counterparts at +1 cycle (matches PVM 32-bit ALU convention).

Sync between Rust and this doc

The source of truth is rust/javm-exec/src/gas_cost.rs::rv_fast_cost. Any change to a row above must update the Rust function and vice versa. Drift between the two should fail a future conformance test (TODO once we have one).