Parallel Generation Streams (PGS) allows models to branch, merge, and discard generation streams dynamically, enabling more sophisticated probability modeling and reasoning exploration.

Motivations

1. Distinguishing Aleatoric and Epistemic Uncertainty

(Building on my prior work 1 2.)

Current Large Language Models (LLMs) predict the probability of a next token without distinguishing the source of that probability distribution. Consider two scenarios where an LLM assigns a 0.5 probability to an outcome:

  • Scenario A (Aleatoric): “Flip a fair coin. Heads (1) or Tails (0)?”
  • Scenario B (Epistemic): “Is the Riemann Hypothesis true (1) or false (0)?”

In Scenario A, the uncertainty is inherent to the system (randomness). In Scenario B, the uncertainty stems from a lack of knowledge (the model doesn’t know the answer, though a definitive one exists).

Distinguishing these is useful for tasks such as reliability estimation. High uncertainty is commonly believed to correlate with a higher likelihood of error. However, penalizing Scenario A is incorrect; the uncertainty there is an accurate description of the situation. With parallel streams, we may be able to treat “I don’t know” differently from “It could be either.”
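
To make this concrete, here is a minimal sketch of how the two kinds of uncertainty could be estimated from several parallel streams, using a standard ensemble-style entropy decomposition. The decompose_uncertainty helper and the example numbers are my own illustration, not part of the framework:

import numpy as np

def entropy(p: float) -> float:
    """Binary entropy in bits; endpoints treated as zero entropy."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def decompose_uncertainty(branch_probs: list[float]) -> tuple[float, float]:
    """
    branch_probs: per-branch P(answer == 1) from several parallel streams.
    Returns (aleatoric_proxy, epistemic_proxy):
      - aleatoric: mean entropy inside each branch (each branch is itself unsure)
      - epistemic: entropy of the averaged prediction minus the aleatoric part
        (branches are individually confident but disagree with each other)
    """
    probs = np.asarray(branch_probs)
    total = entropy(float(probs.mean()))              # entropy of the ensemble prediction
    aleatoric = float(np.mean([entropy(p) for p in probs]))
    epistemic = total - aleatoric                     # mutual-information-style gap
    return aleatoric, epistemic

# Scenario A (fair coin): every branch says 0.5 -> all aleatoric, epistemic ~ 0
print(decompose_uncertainty([0.5, 0.5, 0.5, 0.5]))
# Scenario B (Riemann Hypothesis): branches commit but disagree -> mostly epistemic
print(decompose_uncertainty([0.95, 0.05, 0.9, 0.1]))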

2. Parallelizing Reasoning

LLM generation is inherently sequential and auto-regressive, which creates bottlenecks for tasks that are logically parallelizable.

  • Enumeration: Find all prime numbers between 100 and 1000.
  • Reasoning Exploration: Complex logical deductions often benefit from exploring multiple reasoning paths simultaneously to converge on a result faster.
  • Reasoning Latency: We can run a reasoning stream and a response stream in parallel. Response tokens can be streamed out before the full reasoning process is complete (see the sketch after this list).
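
As a toy illustration of the latency point, the sketch below runs a reasoning stream and a response stream concurrently with asyncio. The generate_stream function is a hypothetical stand-in for a real streaming API, not an actual interface:

import asyncio

async def generate_stream(prompt: str, role: str):
    """Hypothetical async token generator; stands in for a real LLM streaming API."""
    for token in f"[{role} tokens for: {prompt}]".split():
        await asyncio.sleep(0)              # yield control, as a real network stream would
        yield token

async def collect(stream) -> str:
    """Drain a token stream into a single string."""
    return " ".join([token async for token in stream])

async def answer_with_parallel_streams(prompt: str) -> None:
    reasoning_task = asyncio.create_task(collect(generate_stream(prompt, "reasoning")))
    # Response tokens are emitted to the user immediately,
    # without waiting for the reasoning stream to finish.
    async for token in generate_stream(prompt, "response"):
        print(token, end=" ", flush=True)
    reasoning = await reasoning_task        # reasoning completes in the background
    print(f"\n(reasoning trace: {reasoning})")

asyncio.run(answer_with_parallel_streams("Is 101 prime?"))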

The Parallel Generation Streams (PGS) Framework

I came up with the Parallel Generation Streams (PGS) framework, which acts like a simplified version of git branching for token generation.

The LLM operates on a “Main” branch by default. At strategic points, it can spawn “Feature” branches to explore alternative continuations or perform sub-tasks in parallel. These branches eventually resolve in one of two ways:

  1. Merge: The branch contributes valuable information and is integrated back into the Main stream.
  2. Discard: The branch was an exploration of a dead-end or a scratchpad for intermediate computation and is dropped.

Control Tokens

The model utilizes special control tokens to manage this lifecycle:

  • <|branch|>: Initiates a new parallel stream.
  • <|merge|>: Signals the completion of a branch and splices its content back into the main stream at the origin point.
  • <|discard|>: Terminates a branch without outputting to the main stream.

Caveat: To simplify the initial definition, we assume a “flat” branching model where branches cannot spawn their own sub-branches (no recursive branching).
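
To illustrate the intended semantics of these tokens, here is a deliberately sequential toy interpreter. It is my own illustration and only shows the lifecycle of a flat branch; in the actual framework, branches would be generated in parallel rather than inline:

from dataclasses import dataclass, field

BRANCH, MERGE, DISCARD = "<|branch|>", "<|merge|>", "<|discard|>"

@dataclass
class Stream:
    tokens: list[str] = field(default_factory=list)

def resolve(token_stream: list[str]) -> list[str]:
    """
    Toy interpreter for the flat (non-recursive) branching model:
    tokens after <|branch|> go to a side stream until <|merge|> (splice the
    side stream back into main) or <|discard|> (drop it entirely).
    """
    main, side = Stream(), None
    for tok in token_stream:
        if tok == BRANCH:
            side = Stream()                      # open a feature branch
        elif tok == MERGE:
            main.tokens.extend(side.tokens)      # merge branch content at the origin point
            side = None
        elif tok == DISCARD:
            side = None                          # drop the scratchpad entirely
        else:
            (side or main).tokens.append(tok)
    return main.tokens

print(resolve(["The", "answer:", BRANCH, "try", "x=3", DISCARD,
               BRANCH, "x=4", "works", MERGE, "is", "4"]))
# -> ['The', 'answer:', 'x=4', 'works', 'is', '4']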


Implementation Strategies

Approach I: The Agentic Prototype (Tool-Use)

For rapid validation without expensive pre-training, we can implement PGS as an agentic workflow. “Branching” is treated as a tool call that triggers asynchronous LLM instances.

A simplified Python schema might look like this:

def branch(branches: list[dict]):
    """
    Spawns parallel generation processes.
    Args:
        branches: A list of dicts containing 'branch_id' and 'context'.
                  Context can utilize placeholders like ${current_context} 
                  to avoid redundant data transfer.
    """
    pass

def merge(content: str):
    """
    Called from within a branch stream to return data to the main trunk.
    """
    pass

In this setup, discard is simply a no-op function (or the absence of a merge call).
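
A minimal orchestrator for this prototype could dispatch each branch to an asynchronous LLM call and gather the merged results. In the sketch below, call_llm and run_branches are hypothetical stand-ins for the actual tool-execution layer:

import asyncio

async def call_llm(context: str) -> str:
    """Hypothetical stand-in for an async LLM completion call."""
    await asyncio.sleep(0)
    return f"<result for: {context[:40]}>"

async def run_branches(current_context: str, branches: list[dict]) -> list[str]:
    """
    Executes the `branch` tool call: each dict spawns an independent LLM instance,
    and the returned strings play the role of `merge` payloads. A branch that
    chooses to discard would simply return nothing useful and be ignored.
    """
    async def run_one(spec: dict) -> str:
        # Expand the ${current_context} placeholder instead of re-sending the full prompt.
        context = spec["context"].replace("${current_context}", current_context)
        return await call_llm(context)

    return await asyncio.gather(*(run_one(spec) for spec in branches))

merged = asyncio.run(run_branches(
    "Find all primes between 100 and 200.",
    [{"branch_id": "low",  "context": "${current_context} Check 100-149."},
     {"branch_id": "high", "context": "${current_context} Check 150-200."}],
))
print(merged)  # results are appended back into the main trunk's context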

Approach II: Native Architectural Integration

For high-performance applications, we can modify the Transformer architecture to support branching natively.

Instead of processing a single sequence of shape $L \times d$ (where $L$ is sequence length and $d$ is embedding dimension), the model processes a tensor of shape $L \times (B \times d)$, where $B$ is the maximum number of active parallel branches.

  • Initialization: All slots in dimension $B$ differ only by a “branch ID” embedding.
  • Divergence: As generation proceeds, the $B$ dimension can allow different attention masks, enabling branches to drift apart or attend to each other as needed.
  • Convergence: A merging mechanism (e.g., cross-attention or pooling) aggregates the states back into the $B=0$ (Main) index.

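As one rough sketch of what branch-aware attention masks could look like under this layout (an assumption on my part, not a worked-out design), the helper below builds a mask over the flattened $L \times B$ slots, where each slot attends causally to its own branch and to the Main branch:

import torch

def pgs_attention_mask(L: int, B: int) -> torch.Tensor:
    """
    Sketch of an attention mask for the L x (B x d) layout described above:
    the sequence is flattened to L*B slots, slot index = l * B + b.
    One possible rule (an assumption): causal in l, and a slot may attend
    to its own branch b or to the Main branch (b == 0).
    """
    pos = torch.arange(L * B)
    l, b = pos // B, pos % B                       # position and branch of each slot
    causal = l.unsqueeze(1) >= l.unsqueeze(0)      # query position >= key position
    same_or_main = (b.unsqueeze(1) == b.unsqueeze(0)) | (b.unsqueeze(0) == 0)
    return causal & same_or_main                   # (L*B, L*B) boolean "may attend" mask

print(pgs_attention_mask(L=3, B=2).int())
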
More details need to be worked out…