Yield passthrough — performance optimization options
In v3, a yield is routed by yield_key along owner edges to the nearest
snapshotted YieldReceiver whose catch-list contains that key. An
Instance that did NOT register the key in the owner-edge snapshot for a
child is bypassed, so containment is automatic: an
unregistered intermediate never sees the yield and has NOTHING to
propagate. The explicit passthrough loop is therefore needed only when an
intermediate DELIBERATELY interposed (registered the yield_key in its
own YieldReceiver for that child CALL) and, after validating, wants to
FORWARD by re-emitting the same YieldSender. The canonical pattern for
that opt-in mediation is a bytecode loop. This doc discusses the
optimization options for that loop. None of these are in the main
spec; this is a forward-looking note.
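The routing rule described above can be sketched as a small Python model. This is an illustrative toy, not the kernel's implementation: the Instance class, its owner and catch_list fields, and route_yield are hypothetical names, and returning None when no receiver is found is an assumption (this note does not say what happens in that case).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instance:
    name: str
    owner: Optional["Instance"] = None
    # Hypothetical stand-in for the owner-edge snapshot: the yield_keys this
    # Instance registered in its YieldReceiver for the child CALL.
    catch_list: frozenset = frozenset()


def route_yield(emitter, yield_key):
    """Follow owner edges from the emitter to the nearest receiver for yield_key.

    Returns (receiver, bypassed_names). Instances that did not register the
    key are skipped and never see the yield.
    """
    bypassed = []
    node = emitter.owner
    while node is not None:
        if yield_key in node.catch_list:
            return node, bypassed
        bypassed.append(node.name)
        node = node.owner
    return None, bypassed  # assumption: no receiver found


# Usage: "mid" did not register "io", so it is bypassed.
root = Instance("root", catch_list=frozenset({"io"}))
mid = Instance("mid", owner=root)
leaf = Instance("leaf", owner=mid)
receiver, bypassed = route_yield(leaf, "io")
```

In this model the passthrough loop would only matter if "mid" had put "io" in its own catch_list and then chosen to forward it.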
The canonical passthrough pattern
An interposing Instance layer — one that registered a yield_key in
its YieldReceiver to mediate a descendant’s yields of that key, but
wants to forward most of them on — writes the following loop after a
CALL:
    -- self.slot[0] holds the scratchpad for outbound CALL (the args
    -- the layer wants to send to child)
    ret = CALL(child_slot, endpoint, gas_budget)
    loop {
      match ret.status:
        HALT(value) =>
          -- self.slot[0] now holds child's reflected output
          return value
        Paused(yield_key) =>
          -- we only get here for keys we registered in our YieldReceiver;
          -- unregistered keys followed owner edges to the next receiver
          -- and never paused us.
          if we want to mediate this yield_key locally:
            handle the yield_key ...
          else:
            -- self.slot[0] holds child's request (reflected from yield).
            -- forward by re-emitting the same YieldSender; routing starts
            -- from our owner edge and reaches the next receiver along
            -- the owner chain.
            host_yield(yield_sender_for[yield_key])
            -- on resume, self.slot[0] holds the response from the receiver.
          -- in either case, pass the response on to child.
          ret = CALL_RESUME(child_slot, gas_budget)
        Faulted =>
          -- self.slot[0] holds whatever child had at fault
          handle fault
    }

This works using only the existing kernel ABI (CALL, host_yield, CALL_RESUME). No new primitives are required. slot[0] threads the scratchpad through each interposition level automatically (yield reflects up to the registered receiver; CALL_RESUME moves down). Per interposing layer cost: roughly 100ns per forwarded yield (a few bytecode instructions per iteration). For K=5 interposing layers: ~500ns per forwarded yield, well within block-apply budget.
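For readers who want to experiment with the control flow, the loop can be imitated with Python generators. This is only an analogy under stated assumptions: a generator's yield stands in for host_yield, next()/send() stand in for CALL/CALL_RESUME, the key names and handlers are made up for illustration, and faults, gas, and slot[0] are not modeled.

```python
def child():
    # The child emits two yields and halts with a value built from the responses.
    a = yield ("storage.read", "k1")
    b = yield ("clock.now", None)
    return f"{a}|{b}"


def interposer(child_gen, local_handlers):
    """Interposing layer: mediate keys in local_handlers, forward the rest."""
    try:
        key, req = next(child_gen)                 # analogue of CALL
        while True:                                # child is Paused(key)
            if key in local_handlers:
                resp = local_handlers[key](req)    # mediate locally
            else:
                resp = yield (key, req)            # analogue of host_yield (forward)
            key, req = child_gen.send(resp)        # analogue of CALL_RESUME
    except StopIteration as halt:                  # child reached HALT(value)
        return halt.value


def run(gen, root_handlers):
    """Outermost receiver: answers every yield that reaches it."""
    try:
        key, req = next(gen)
        while True:
            key, req = gen.send(root_handlers[key](req))
    except StopIteration as halt:
        return halt.value


layer = interposer(child(), {"clock.now": lambda _req: "t=42"})
result = run(layer, {"storage.read": lambda req: f"value-of-{req}"})
print(result)  # value-of-k1|t=42
```

Here "clock.now" is mediated by the interposer and never reaches the outer receiver, while "storage.read" is forwarded and answered one level further out.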
When passthrough might become a bottleneck
This loop runs ONLY at the layers that deliberately interposed on the key — registered it in their YieldReceiver and chose to forward it. Layers that did not register the key are emitter-excluded and cost nothing; the kernel jumps straight past them to the nearest receiver. So K is the number of INTERPOSING layers for a given key, typically far fewer than the stack depth. If chains routinely stack many interposers that re-emit (e.g. layered virtualization of the same syscall key), the cumulative cost could become measurable. Rough scaling:
- Per yield: K × 100ns where K is the number of interposing layers that forward the key (NOT the stack depth).
- Per block: (yields per tx) × (txs per block) × K × 100ns.
For typical chain workloads (hundreds of yields per block, K~3-5 interposing layers), this is well under 1ms — not a bottleneck.
If we find a workload where it is (deeply-nested orchestration that stacks many forwarding interposers on a hot key), an optimization is available.
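The scaling estimate above is simple enough to check numerically. A minimal sketch, assuming the ~100ns per-layer figure from this note; the workload numbers (5 yields per tx, 100 txs per block) are illustrative placeholders, not measurements.

```python
def forwarding_overhead_ns(yields_per_tx, txs_per_block, k, per_layer_ns=100):
    """Per-block forwarding cost: (yields per tx) x (txs per block) x K x per-layer cost."""
    return yields_per_tx * txs_per_block * k * per_layer_ns


# 500 yields per block, K=5 interposing layers, 100 ns per layer per yield.
overhead = forwarding_overhead_ns(yields_per_tx=5, txs_per_block=100, k=5)
print(f"{overhead / 1e6:.2f} ms per block")  # 0.25 ms per block
```

Plugging in a different per_layer_ns makes it easy to see how much a faster forwarding path would actually save for a given workload.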
Optimization option: call_passthrough opcode
Note this opcode is only relevant to the interposition case — a layer
that registered a key purely to forward it. Containment (not registering
the key) already costs nothing, so there is nothing to optimize there.
Even for an interposer, the kernel cannot choose the predicate on its
own: the loop must decide per yield_key whether to forward, and the
only keys this frame can ever see (and so forward) are those in its
registered YieldReceiver. The opcode therefore takes its predicate as
an argument: local_yield_keys, the registered keys the frame mediates
locally (and thus does NOT forward). The kernel forwards every other
registered key by re-emitting the matching scratchpad YieldSender:
    call_passthrough(child_slot, endpoint, gas_budget,
                     local_yield_keys) → ret

Kernel internal:

    let ret = CALL(child_slot, endpoint, gas_budget)
    loop:
      match ret.status:
        HALT, Faulted => return ret
        Paused(yield_key):
          if yield_key ∈ local_yield_keys:
            return ret   -- bytecode wants to mediate this one
          host_yield(yield_sender_for[yield_key])
          ret = CALL_RESUME(child_slot, gas_budget)

Bytecode using it:

    ret = call_passthrough(child, ep, gas, local_yield_keys=MY_KEYS)
    match ret:
      HALT(v) => use v
      Paused(yield_key) => mediate (must be in MY_KEYS)
      Faulted => handle fault

The kernel runs the loop without invoking bytecode. Cost per interposing layer: ~10ns (state transition only).
For K=5 interposing layers: ~50ns per forwarded yield, roughly 10× faster than the bytecode loop.
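The kernel-side loop can be imitated in Python with a generator standing in for the child. This is a rough sketch under loud assumptions: the tuple-shaped return value, the forward callback (standing in for host_yield plus routing), and especially the resume_value parameter are hypothetical. This note does not describe how bytecode resumes the child after mediating a local key, so that part is a guess made purely so the example can run. Faults and gas are not modeled.

```python
_START = object()  # sentinel: first entry, nothing to send to the child yet


def call_passthrough(child, local_yield_keys, forward, resume_value=_START):
    """Drive child until it halts or yields a key in local_yield_keys.

    Every other key is forwarded via forward(key, request) and the child is
    resumed with the response, without returning to the caller.
    """
    try:
        if resume_value is _START:
            key, req = next(child)                 # analogue of CALL
        else:
            key, req = child.send(resume_value)    # assumed resume path
        while key not in local_yield_keys:
            key, req = child.send(forward(key, req))
        return ("Paused", key, req)                # caller wants to mediate this one
    except StopIteration as halt:
        return ("HALT", halt.value, None)


def child():
    x = yield ("a", 1)
    y = yield ("b", 2)
    return x + y


c = child()
first = call_passthrough(c, {"b"}, forward=lambda key, req: req * 10)
# first == ("Paused", "b", 2): key "a" was forwarded without surfacing here.
second = call_passthrough(c, {"b"}, forward=lambda key, req: req * 10, resume_value=5)
# second == ("HALT", 15, None): 10 (forwarded response) + 5 (local response).
```

The caller only regains control for keys it asked to mediate, which is the property the opcode is meant to provide.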
Why not in the spec now
This is a peephole optimization. The bytecode loop works correctly and is performant enough for any realistic workload. Adding the opcode is straightforward if/when profiling shows a need.
- No correctness benefit. Same semantics either way.
- Negligible practical cost. The bytecode loop adds ~500ns per forwarded yield to a K=5 interposition chain. Block budgets are seconds; this rounds to zero.
- ABI minimalism. One more opcode is small, but every opcode added is one more thing to verify, document, and maintain.
- Adoption discipline. If we add the opcode, chains must opt in per CALL site. Authors may forget, so performance would vary with bytecode discipline. With the bytecode loop, the cost is uniform and predictable.
Alternative considered: Image-flag auto_forward
An earlier draft considered an Image-level declaration:
    Image {
      ...
      auto_forward: Set<yield_key>   // keys this Image auto-forwards
    }

For these keys the kernel re-emits without invoking bytecode; the forward is kernel-fast.
Trade-offs:
- Declarative: per-Image policy stated once.
- But static: the same Image can’t mediate key K for some sub-calls while forwarding it for others.
- Image author burden: every Image needs to think about which keys it forwards.
We chose against this in favor of per-CALL granularity (either via
bytecode loop or via call_passthrough opcode). The bytecode pattern
gives the Image author full runtime flexibility; the kernel doesn’t
need an Image-level field. The per-CALL YieldReceiver snapshot already
makes the forward-vs-mediate split a runtime decision — a static
Image-level set would be strictly less expressive.
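The expressiveness argument can be made concrete with a tiny sketch. The function names and key names below are hypothetical, and the two "decide" helpers only model the forward-vs-mediate choice, not the kernel. A static auto_forward set can always be emulated per CALL by passing the complement within the registered keys, but a per-CALL set can vary between call sites, which a single Image-level set cannot do.

```python
def decide_static(key, auto_forward):
    """Image-level policy: one fixed set of forwarded keys for every CALL."""
    return "forward" if key in auto_forward else "mediate"


def decide_per_call(key, local_yield_keys):
    """Per-CALL policy: the caller names the keys it mediates at this CALL site."""
    return "mediate" if key in local_yield_keys else "forward"


REGISTERED = frozenset({"storage.read", "clock.now"})
AUTO_FORWARD = frozenset({"storage.read"})

# Emulating the static policy per CALL: mediate everything not auto-forwarded.
emulated_local = REGISTERED - AUTO_FORWARD
same = all(
    decide_static(k, AUTO_FORWARD) == decide_per_call(k, emulated_local)
    for k in REGISTERED
)

# The reverse does not hold: two call sites can treat the same key differently.
site_a = decide_per_call("storage.read", frozenset({"storage.read"}))  # "mediate"
site_b = decide_per_call("storage.read", frozenset())                  # "forward"
```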
When to revisit
Add call_passthrough opcode when:
- Profiling shows interposition-forwarding overhead is a measurable fraction of block apply time.
- Chain authors are routinely writing the same forwarding-loop boilerplate and errors arise from getting it wrong.
- A common chain pattern (e.g., deep interposer hierarchies that stack forwarders on one key) emerges and we want to make it fast by default.
Until then, the bytecode loop is the canonical pattern. Document it in the userspace coding guide when we write one.