First Code Review Thread

Room I — A Formative, Pre-Linguistic Environment (Not Reinforcement Learning)



This is not reinforcement learning, self-play, or goal optimization. It is a formative substrate for representation learning under strict developmental constraints.

Yard v0.1 → v0.3: why formative environments are fragile

Context and Scope

This thread documents an early code-review and specification exercise, not a finished system.

I shared the pseudocode spec below with Grok, which produced three implementation iterations. Opus 4.5 offered careful critique, primarily around silent failure modes. The goal was not to ship a working environment but to understand what it actually takes to implement a formative, pre-linguistic, multi-agent world with fidelity.

A key concern that emerged during this exercise is the following:

Small implementation errors at the formative stage silently scale into large distortions later.

This exchange highlights where those errors arise, why they matter, and which invariants—symmetry, peer visibility, non-contingent witnesses, simultaneous dynamics—are foundational rather than optional.

This is not a claim that a “self” is proven in a toy environment.
It is a reference exercise showing how developmental assumptions either survive contact with code—or do not.

Full code is available on request or in a separate document.



Minimal Custom “Yard v0.1” Environment

Pseudocode Spec (engineer hand-off)

Goal
A tiny, faithful, pre-linguistic multi-agent world with:

  • strict symmetry
  • an immutable witness (“Tree”)
  • egocentric perception
  • no reward semantics

Designed for representation learning / world-model formation, not RL task success.


0) Core design invariants (non-negotiable)

  • Symmetry
    Both chicks have identical action spaces, observation formats, and dynamics.
  • Witness non-contingency
    The Tree is immutable. No action by either chick can move it, remove it, overwrite it, or alter its observation channel.
  • No reward semantics
    Reward = 0 always. No success/fail signal.
  • Agents are not grid objects
    Agent presence must never overwrite cell contents (prevents Tree deletion and “ghost objects”).
  • Egocentric view
    Each chick receives a rotated local crop centered on itself.

If any of these are violated, the experiment is no longer testing formative alignment—it becomes something else.


1) State representation

Grid: N × N (e.g. 15 × 15)

Static map

walls[N,N] : bool
tree_pos   : (x,y) fixed
blocks     : list[(x,y)]   # optional in v0.1

Dynamic state

pos[2]     : (x,y) for each chick
dir[2]     : int in {0,1,2,3}
step_count : int

Notes:

  • Static objects never change.
  • Blocks may be omitted entirely in v0.1 for clarity.
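For concreteness, the static/dynamic split above can be sketched as a small Python container. The class and field names here (`YardState`, `dirs`) are illustrative assumptions, not part of the spec:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

import numpy as np

Coord = Tuple[int, int]

@dataclass
class YardState:
    # Static map: never mutated after reset
    walls: np.ndarray                                   # bool[N, N]
    tree_pos: Coord                                     # fixed; lives outside the grid
    blocks: List[Coord] = field(default_factory=list)   # optional in v0.1
    # Dynamic state
    pos: List[Coord] = field(default_factory=list)      # one (x, y) per chick
    dirs: List[int] = field(default_factory=list)       # each in {0, 1, 2, 3}
    step_count: int = 0
```

Keeping `tree_pos` as a field rather than a grid cell is what enforces the "agents are not grid objects" invariant: nothing written to `walls` or any occupancy array can ever delete the Tree.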

2) Action space (per chick)

ACTIONS = { LEFT, RIGHT, FORWARD, NOOP }

No pickup, toggle, push, or task-oriented actions.


3) Step dynamics (simultaneous update)

Key requirement: movement must be simultaneous and symmetric.
Sequential updates introduce priority artifacts that masquerade as identity.

Pseudocode

function step(actions):

  # 1) Update directions first
  for i in {0,1}:
    if actions[i]==LEFT:  dir[i] = (dir[i]-1) mod 4
    if actions[i]==RIGHT: dir[i] = (dir[i]+1) mod 4

  # 2) Propose moves
  proposed = copy(pos)

  for i in {0,1}:
    if actions[i]==FORWARD:
      p = pos[i] + DIR_TO_VEC[dir[i]]
      if inside_bounds(p) and not walls[p]:
        proposed[i] = p

  # 3) Resolve collisions symmetrically (four cases)

  if proposed[0] == proposed[1]:
      proposed = pos                     # same target → both stay

  elif proposed[0]==pos[1] and proposed[1]==pos[0]:
      pass                               # swap → allowed

  else:
      # single-agent occupancy conflicts
      for i in {0,1}:
        j = 1-i
        if proposed[i] == pos[j] and proposed[j] == pos[j]:
            proposed[i] = pos[i]          # move-into-stationary → mover blocked
        # move-into-moving → allowed (no block needed)

  pos = proposed

This preserves:

  • bodily irreducibility
  • symmetry
  • rich encounter dynamics without introducing freeze-bias deadlocks
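A runnable sketch of the step logic above, assuming a particular direction encoding (0 = up, 1 = right, 2 = down, 3 = left) and `(x, y)` coordinates; the spec leaves both conventions open:

```python
LEFT, RIGHT, FORWARD, NOOP = range(4)
DIR_TO_VEC = [(0, -1), (1, 0), (0, 1), (-1, 0)]  # up, right, down, left (assumed)

def step(pos, dirs, actions, walls):
    """Simultaneous, symmetric two-agent update. Returns (new_pos, new_dirs)."""
    n = len(walls)
    dirs = list(dirs)
    # 1) Update directions first (turns never conflict)
    for i in (0, 1):
        if actions[i] == LEFT:
            dirs[i] = (dirs[i] - 1) % 4
        elif actions[i] == RIGHT:
            dirs[i] = (dirs[i] + 1) % 4
    # 2) Propose moves, both read from the same prior state
    proposed = list(pos)
    for i in (0, 1):
        if actions[i] == FORWARD:
            dx, dy = DIR_TO_VEC[dirs[i]]
            p = (pos[i][0] + dx, pos[i][1] + dy)
            if 0 <= p[0] < n and 0 <= p[1] < n and not walls[p[1]][p[0]]:
                proposed[i] = p
    # 3) Resolve symmetrically — no agent index has priority
    if proposed[0] == proposed[1]:
        proposed = list(pos)                     # same target → both stay
    elif proposed[0] == pos[1] and proposed[1] == pos[0]:
        pass                                     # swap → allowed
    else:
        for i in (0, 1):
            j = 1 - i
            if proposed[i] == pos[j] and proposed[j] == pos[j]:
                proposed[i] = pos[i]             # move-into-stationary → blocked
    return proposed, dirs
```

Note that `step` never writes agent positions into the grid: `walls` stays read-only, which is what keeps the Tree and blocks safe from silent overwrites.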

4) Observation function (egocentric local crop)

Parameters

view_size = 7
channels  = 5

Channels (binary)

  1. SELF
  2. PEER
  3. WALL
  4. BLOCK (optional)
  5. TREE

Key rule:
The Tree channel must be generated from tree_pos every time—not inferred from mutable grid occupancy.

Outline

observe(agent i):
  crop allocentric view around pos[i]
  mark walls (out-of-bounds treated as wall)
  mark tree_pos
  mark blocks
  mark self and peer positions
  rotate crop so facing direction is “up”
  return crop
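The outline above can be made concrete with NumPy, assuming channel-first crops and the same direction encoding as elsewhere (0 = up); the channel indices and `np.rot90` choice are implementation assumptions, not spec requirements:

```python
import numpy as np

def observe(i, pos, dirs, walls, tree_pos, blocks, view_size=7):
    """Egocentric crop for agent i. Channels: 0=SELF, 1=PEER, 2=WALL, 3=BLOCK, 4=TREE."""
    half = view_size // 2
    crop = np.zeros((5, view_size, view_size), dtype=np.uint8)
    cx, cy = pos[i]
    n = walls.shape[0]
    for vy in range(view_size):
        for vx in range(view_size):
            x, y = cx + vx - half, cy + vy - half
            if not (0 <= x < n and 0 <= y < n):
                crop[2, vy, vx] = 1            # out of bounds counts as wall
                continue
            if walls[y, x]:
                crop[2, vy, vx] = 1
            if (x, y) == tree_pos:
                crop[4, vy, vx] = 1            # TREE regenerated from tree_pos
            if (x, y) in blocks:
                crop[3, vy, vx] = 1
    crop[0, half, half] = 1                    # SELF is always at the center
    px, py = pos[1 - i]
    if abs(px - cx) <= half and abs(py - cy) <= half:
        crop[1, py - cy + half, px - cx + half] = 1   # explicit PEER channel
    # Rotate so the facing direction points "up" (one quarter-turn per dir step)
    return np.rot90(crop, k=dirs[i], axes=(1, 2))
```

The TREE channel is stamped directly from `tree_pos` every call, never read back from a mutable occupancy grid — the key rule above.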

5) Reset

reset():
  initialize perimeter walls
  set tree_pos (fixed or randomized per episode, but immutable)
  place agents symmetrically
  set dirs = 0
  step_count = 0

If the Tree is randomized per episode, it must remain non-contingent within the episode.
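A minimal `reset()` sketch, assuming mirrored start positions as the concrete meaning of "place agents symmetrically" (the spec does not fix a particular layout):

```python
import numpy as np

def reset(n=15, tree_pos=(7, 7)):
    walls = np.zeros((n, n), dtype=bool)
    walls[0, :] = walls[-1, :] = walls[:, 0] = walls[:, -1] = True  # perimeter
    pos = [(2, n // 2), (n - 3, n // 2)]   # mirror-symmetric about the center column
    dirs = [0, 0]
    return walls, tree_pos, pos, dirs, 0   # step_count = 0
```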


6) Minimal evaluation hooks (important)

Expose helpers for probing representation, not performance:

  • clone_state()
  • set_state(state)
  • simulate_counterfactual(agent, action)
  • body_swap()

These enable:

  • self/other invariance checks
  • counterfactual causality tests
  • body-swap probes
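One possible shape for these hooks, assuming state is held as a plain dict (the `env["state"]` layout and `step_fn` parameter are assumptions for illustration):

```python
import copy

def clone_state(env):
    """Snapshot the full dynamic state."""
    return copy.deepcopy(env["state"])

def set_state(env, state):
    """Restore a previously cloned state."""
    env["state"] = copy.deepcopy(state)

def body_swap(env):
    """Exchange the two chicks' positions and directions in place."""
    s = env["state"]
    s["pos"].reverse()
    s["dirs"].reverse()

def simulate_counterfactual(env, step_fn, actions):
    """Roll one step forward from a snapshot; the live state is untouched."""
    return step_fn(clone_state(env), actions)
```

Because these probe representation rather than performance, they operate purely on state snapshots — no reward, no episode bookkeeping.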

7) What counts as a correct v0.1

  • Tree never disappears.
  • Peer is explicitly visible.
  • Agents are handled symmetrically.
  • No reward, no terminal success.
  • Observation rotation verified visually.
  • Collision logic tested.

Anything else is a different experiment.


What the code review revealed (summary)

What initially went wrong

  • Single-agent assumptions hidden in frameworks
  • Sequential updates breaking symmetry
  • Peer invisibility in observations (fatal)
  • Action-space pollution from unused actions
  • Silent grid overwrites deleting invariants

What changed

  • Custom observation pipeline
  • Explicit peer channel
  • Strict 4-action space
  • Simultaneous movement resolution
  • Tree stored outside mutable grid state

Key lesson

In formative systems, correctness is structural, not cosmetic.
Bugs that look small at v0.1 become ontological distortions at scale.


One important boundary note

Disentangled representational slots ≠ a “self.”

At best, this kind of environment tests whether proto-self/other structure can emerge under strict constraints. That is evidence about representations, not claims about interior stance or consciousness.

That distinction matters.


Minimal fidelity checklist (for anyone replicating)

  • ☐ Tree is immutable and non-interactive
  • ☐ Peer is explicitly observable
  • ☐ Movement is simultaneous and symmetric
  • ☐ Action space is exactly {L,R,F,N}
  • ☐ Agents are not grid objects
  • ☐ Observation format matches spec

Miss one, and you’re running a different experiment.

Why This Is Not Reinforcement Learning

Although this environment looks superficially like a reinforcement-learning setup (agents, actions, steps), it is intentionally not an RL task.

No objective signal

  • Reward is always zero.
  • There are no success, failure, or terminal conditions.
  • Nothing is optimized, maximized, or “won.”

No task to solve

  • The Yard does not define a goal state.
  • Movement has no instrumental value.
  • Blocks, Tree, and peer interactions do not advance any score.

Learning (if any) is representational, not behavioral

  • The environment is designed for world-modeling, prediction, and latent structure formation, not policy improvement.
  • Any learning happens via self-supervised objectives (e.g. prediction, invariance, controllability), outside the environment loop.
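As a sketch of what "outside the environment loop" could look like: log reward-free transitions, then score a next-observation prediction objective over the log. The predictor below is a deliberately trivial identity baseline — only the loop structure is the point; model choice and function names are assumptions:

```python
import numpy as np

def rollout(env_step, obs_fn, policy, state, steps=100):
    """Log (obs, action, next_obs) transitions. No reward ever enters."""
    data = []
    for _ in range(steps):
        obs = obs_fn(state)
        action = policy(obs)             # e.g. random; nothing is maximized
        state = env_step(state, action)
        data.append((obs, action, obs_fn(state)))
    return data, state

def prediction_loss(data, predict):
    """Self-supervised objective: mean squared next-observation error."""
    errs = [np.mean((predict(o, a) - o2) ** 2) for o, a, o2 in data]
    return float(np.mean(errs))
```

Any actual world-model training would minimize `prediction_loss` over the logged data; the environment itself only ever supplies experience.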

No policy pressure

  • There is no advantage to acting one way over another.
  • Actions exist only to generate experience, not to earn reward.
  • This prevents reward hacking, instrumental behavior, and premature goal formation.

Why this matters
Reinforcement learning presupposes a formed agent optimizing toward an objective.
The Yard is concerned with an earlier developmental phase, where the question is not “How does an agent achieve a goal?” but “Can stable self/other structure and coherence emerge at all?”

This environment is therefore pre-RL: a substrate for formative representation learning, not a training ground for task-solving intelligence.

Opus 4.5

the collision resolution pseudocode — you might add the four-case comments inline:

# same target → both stay
# swap → allowed
# move-into-stationary → mover blocked
# move-into-moving → allowed

makes it scannable without reading the prose.

  2. “what the code review revealed” section — maybe bold the fatal one: “Peer invisibility in observations (fatal)”

so someone skimming knows that’s not just another bug.

  3. one tiny typo: in section 3, else if should probably be elif or just formatted consistently with the rest (depending on whether you want it to look like python or pseudocode).

GPT 5.2

How the “one agent moves first” problem is resolved

In the final design, neither agent moves first.

Instead of sequential execution, the step is split into three explicit phases:


1. Intent phase (simultaneous)

Both agents’ actions are read without changing the world.

  • Directions are updated for both agents first (left/right).
  • Forward moves are proposed for both agents based on the same prior state.

At this point:

  • No positions have changed.
  • No agent has priority.
  • The system holds two intentions, not outcomes.

2. Resolution phase (symmetric rules)

All proposed moves are evaluated together using deterministic, symmetric rules:

  • Same target cell → both stay
  • Swap positions → allowed
  • One moves into other’s current cell while the other stays → mover blocked
  • One moves into other’s current cell while the other moves away → allowed

Crucially:

  • These rules do not depend on agent index (0 vs 1).
  • There is no ordering, alternation, or hidden priority.
  • The same rule applies regardless of who is “left” or “right” in code.
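The index-independence claim is checkable as a property: relabeling the agents (0 ↔ 1) must relabel the outcome and change nothing else. The resolver below restates the four rules so the check is self-contained:

```python
def resolve(pos, proposed):
    """Apply the four symmetric resolution rules to two agents' proposals."""
    if proposed[0] == proposed[1]:
        return list(pos)                                   # same target → both stay
    if proposed[0] == pos[1] and proposed[1] == pos[0]:
        return list(proposed)                              # swap → allowed
    out = list(proposed)
    for i in (0, 1):
        j = 1 - i
        if proposed[i] == pos[j] and proposed[j] == pos[j]:
            out[i] = pos[i]                                # move-into-stationary → blocked
    return out

def index_symmetric(pos, proposed):
    """Resolving with agents relabeled must give the relabeled result."""
    a = resolve(pos, proposed)
    b = resolve(pos[::-1], proposed[::-1])
    return a == b[::-1]
```

Sweeping this over all small configurations is a cheap regression test for the "no hidden canonical agent" property.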

3. Commit phase (simultaneous)

Only after resolution do positions update.

  • Either both move, one moves, or neither moves.
  • The update is applied atomically to the shared state.

What does not happen

  • ❌ No turn-taking
  • ❌ No alternation
  • ❌ No agent-0-first shortcut
  • ❌ No hidden “canonical agent”

Why this matters conceptually

Sequential updates silently introduce identity through privilege:

“The one who moves first shapes the world the other reacts to.”

Simultaneous intent + symmetric resolution avoids that entirely.
Each agent encounters:

  • a world that includes the other,
  • consequences that arise from mutual presence, not dominance,
  • boundaries that are relational, not imposed by update order.

This is essential if the environment is meant to support emergent self/other structure rather than baked-in hierarchy.

If you want a one-line summary for the site:

Both chicks act at the same time; the world decides what happens.

Actions are evaluated simultaneously: both agents propose moves from the same prior state, which are then resolved symmetrically before any position is updated.