BokuraMachi

our town -- an ai civilization sim

The Game

BokuraMachi is an ai civilization sim, but you're not a god. you don't see the whole map. you're just one person living in a town. the name is japanese (僕らまち), "our town." no goals, no win condition. you meet people, watch things happen, exist. closest reference point in feel might be tomodachi life, but the underlying system is completely different.

Every playthrough starts with adam and eve. they're templates, slightly randomized each run in personality, tendencies, and starting knowledge. that small initial randomness creates butterfly effect divergence across generations. no two playthroughs produce the same civilization. kids learn from parents, from neighbors, from everyone. culture emerges naturally.


Design Principles

Everything emergent, nothing scripted. if you think about adding a scripted event, ask first whether the simulation can produce it naturally.

Parsimonious complexity. simple rules, deep behavior. brain components are individually simple. emergence comes from their interaction over time.

The player is not special. no god-mode control, no overhead visibility, no special powers. the player is a citizen.

Compute is always a constraint. every feature must be evaluated against whether it can run on a normal pc using the priority/tier system.

Butterfly effect is sacred. the small initial randomness of adam and eve must be preserved. don't add systems that flatten divergence.

Scarcity makes relationships matter. the tier 1 cap of 8 is a design decision, not just a technical one. players should feel the weight of who they invest in.


Citizen Tiers

150 people all running at full detail would melt your laptop. citizens are split into three tiers, with relationship depth mapping to computational depth.

Tier 1: people you actually talk to. each one has a small language model that grows the more you interact. the model size progresses in stages:

  • 0 - 0.5 hours: no lm yet, still using tier 2 system, feels like a stranger
  • 0.5 - 2 hours: bootstrapped lm (~500 tokens), surface-level acquaintance
  • 2 - 8 hours: small lm (~1500 tokens), consistent personality starting to form
  • 8 - 25 hours: medium lm (~3000 tokens), remembers things, feels real
  • 25+ hours: full tier 1 (~5000 tokens), genuinely unpredictable conversations

the 5000 token ceiling is intentional. past that, personality stops improving and just accumulates more data. if you make a new close friend, someone else has to drift down to tier 2. your social circle has weight.
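a minimal python sketch of the staging above. the function and table names are made up for illustration; only the hour ranges and token sizes come from the list. treating each lower bound as inclusive is an assumption, since the list doesn't say which stage owns the boundary.

```python
# hypothetical stage table: (minimum hours, label, approx token budget)
LM_STAGES = [
    (0.0, "tier 2 system", 0),
    (0.5, "bootstrapped", 500),
    (2.0, "small", 1500),
    (8.0, "medium", 3000),
    (25.0, "full tier 1", 5000),
]


def lm_stage(hours_interacted: float) -> tuple[str, int]:
    """return (label, token budget) for a given amount of interaction time."""
    label, tokens = LM_STAGES[0][1], LM_STAGES[0][2]
    for min_hours, stage_label, stage_tokens in LM_STAGES:
        # lower bound inclusive (assumption): 0.5h already counts as bootstrapped
        if hours_interacted >= min_hours:
            label, tokens = stage_label, stage_tokens
    return label, tokens
```

calling `lm_stage(3)` lands in the "small" stage, and anything past 25 hours stays at the 5000 token ceiling.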

Tier 2: people you kinda know. they use a weighted grammar system with category slots like [subject_ref] + [feeling_verb] + [emotion_word]. each character has personal weight distributions over word categories. a sad person has higher weights for negative words. a farmer talks about harvest. cheap to run, sounds natural enough. kids inherit a subset of parent weight distributions, biased toward what the parent uses most, with small random drift during inheritance.
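a rough sketch of how the slot grammar and inheritance could look. everything here is illustrative: the slot list just reuses the example above, and `keep` and `drift` are placeholder numbers. keeping the parent's top-weighted words is one simple way to get "biased toward what the parent uses most", not necessarily the real rule.

```python
import random

# slot order taken from the example pattern in the text
SLOTS = ["subject_ref", "feeling_verb", "emotion_word"]


def speak(weights, rng):
    """fill each slot with one word, sampled by the character's personal weights."""
    words = []
    for slot in SLOTS:
        options = weights[slot]
        words.append(rng.choices(list(options), weights=list(options.values()))[0])
    return " ".join(words)


def inherit(parent, rng, keep=0.7, drift=0.05):
    """child keeps the parent's most-used words per slot, with small random drift."""
    child = {}
    for slot, options in parent.items():
        ranked = sorted(options, key=options.get, reverse=True)
        kept = ranked[: max(1, round(len(ranked) * keep))]
        # drift nudges each weight a little; the floor keeps weights positive
        child[slot] = {w: max(0.01, options[w] + rng.uniform(-drift, drift)) for w in kept}
    return child
```

a sad parent with a heavy "sad" weight passes that lean on, while rarely used words tend to drop out of the child's vocabulary.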

Tier 3: people you haven't interacted with in a while. just a stat block on disk, simulation paused. when you get close, the game fast-forwards their simulation to catch them up on what they would have experienced. the loading screen is doing real work -- a few seconds on a normal laptop.


The AI Brain

every citizen has a full ai brain organized into 9 systems: senses, body, drives, mind, valve, survival, language, social, growth. the core pattern throughout: impulses are constantly being generated, a gating mechanism (the valve) is constantly saying "no, don't," and something only happens when an impulse overcomes enough of those "stop" signals. the same architecture handles both actions and speech.

Senses: six senses that convert raw game-world data into structured perception for everything else in the brain.

Vision: detects entities, positions, states in the visible field. each visible entity has an id, type (character, food, furniture, object), position, state, and character-specific details like expression and current action.

Hearing: detects sounds with type, source, volume, and direction. speech, footsteps, environmental sounds (rain, wind), impacts. hearing someone talk affects word weights. a friend's voice triggers oxytocin, a threat sound triggers cortisol.

Touch: detects contact, pressure, texture, vibration. social touch (hugs, handholding) triggers oxytocin and serotonin. painful contact triggers cortisol and pain signals.

Taste: flavor profile of consumed food and drink. good taste triggers endorphins and dopamine, bad taste triggers cortisol.

Smell: scents in the environment -- food, character-specific scents (familiarity cue), environmental (rain, smoke). familiar scent of a friend triggers oxytocin. smoke triggers cortisol.

Proprioception: awareness of own body position, movement, orientation. standing, sitting, lying, walking, running, falling. affects what actions are possible -- can't sleep standing up, can't eat while sprinting.

Body: tracks 18 internal signals that all interact with each other. none of them ticks up or down in isolation:

blood glucose, stomach distension, ghrelin (hunger hormone), leptin (fullness hormone), insulin, blood osmolarity (salt concentration), blood volume, mouth dryness, glycogen (stored energy), adenosine (sleep pressure), circadian rhythm (0-24h), core temperature (36-42c), sweat, pain, inflammation, bladder fullness, bowel fullness, heart rate (40-180 bpm).

then six chemical messengers that communicate body state to the rest of the brain. each has a baseline and a decay rate, meaning chemicals naturally return to their resting level over time:

Dopamine (baseline 50): reward and learning signal. spikes when something good happens, decays quickly. gates what the brain learns from.
Serotonin (baseline 50): mood stability, resilience to stress, long-term wellbeing. decays slowly. stable serotonin = stable mood.
Cortisol (baseline 30): stress response, urgency, avoidance learning. spikes when things go wrong. chronic high cortisol blunts dopamine sensitivity -- stressed characters stop finding joy in things.
Oxytocin (baseline 40): social bonding, trust, relationship modulation. released around friends and loved ones. low oxytocin = lonely.
Norepinephrine (baseline 40): alertness, attention gating, learning rate modulation. spikes from novelty or threat. keeps the character alert.
Endorphins (baseline 40): pain suppression, pleasure. released during eating, social touch, positive experiences.
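a small sketch of the "baseline plus decay rate" idea. the baselines are the ones listed above. the per-tick decay rates are placeholders, chosen only so that dopamine decays quickly and serotonin slowly as described; the exponential pull toward baseline is also an assumption.

```python
BASELINES = {
    "dopamine": 50, "serotonin": 50, "cortisol": 30,
    "oxytocin": 40, "norepinephrine": 40, "endorphins": 40,
}

# placeholder per-tick decay rates (not from the design doc)
DECAY = {
    "dopamine": 0.10, "serotonin": 0.01, "cortisol": 0.05,
    "oxytocin": 0.05, "norepinephrine": 0.05, "endorphins": 0.05,
}


def decay_step(levels):
    """move every chemical a fraction of the way back toward its resting level."""
    return {
        name: value + (BASELINES[name] - value) * DECAY[name]
        for name, value in levels.items()
    }
```

with these numbers a dopamine spike to 90 drops by 4 points on the next tick, while a chemical already at baseline doesn't move.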

actions change these values. eating raises blood glucose and stomach distension while triggering dopamine and endorphins. sleeping clears adenosine and raises energy. talking releases oxytocin. not eating makes ghrelin rise and blood glucose drop. if glucose stays low too long, cortisol builds up.

Drives: the motivational layer. three components.

Hypothalamus: reads all 18 body signals and computes seven need values from 0-100. hunger is driven by ghrelin minus blood glucose, stomach fullness, and leptin. thirst comes from blood osmolarity and blood volume. sleepiness from adenosine and circadian rhythm. energy from glycogen minus movement and sleepiness. temperature discomfort from how far core temp is from 37c. social drive from low oxytocin plus cortisol, minus energy. pain is raw pain reduced by endorphins.

the hypothalamus also monitors needs and releases stress chemicals when things are out of balance -- hunger over 60, thirst over 60, low energy, high pain, full bladder all trigger cortisol. safety and comfort trigger serotonin. a nearby partner triggers oxytocin. novelty triggers norepinephrine.

Action selection (q-network): a small neural network that picks the best action for whatever need is strongest. the needs compete first -- when two needs have similar pressure, they suppress each other and neither wins, creating natural hesitation without randomness. if no need exceeds threshold, the character idles or wanders. if a need stays stuck below threshold too long, cortisol builds until it forces action anyway.

the network takes 37 inputs (body state, chemical levels, partner status, attention target, last action, current intention, valve inhibition), passes them through two hidden layers (64 then 32 neurons), and outputs q-values for 14 possible actions like eat, drink, sleep, rest, talk, approach, examine, wander, use furniture, urinate, defecate, or nothing. it learns through reinforcement learning -- reward comes from changes in dopamine, oxytocin, and endorphins (good) minus cortisol changes (bad). it starts explorative and becomes more conservative over time.
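a dependency-free sketch of the network shape and the reward signal described above. the layer sizes and the reward terms come from the text. relu activations, the missing bias terms, the weight init range, and equal weighting of the four chemical deltas are all assumptions. training is left out.

```python
import random

LAYER_SIZES = [37, 64, 32, 14]  # inputs -> hidden -> hidden -> q-values


def init_network(rng):
    """one weight matrix per layer, stored as rows of input weights (no biases)."""
    return [
        [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        for n_in, n_out in zip(LAYER_SIZES, LAYER_SIZES[1:])
    ]


def q_values(network, inputs):
    """forward pass: relu on hidden layers (an assumption), linear output."""
    activations = list(inputs)
    for index, layer in enumerate(network):
        activations = [sum(w * a for w, a in zip(row, activations)) for row in layer]
        if index < len(network) - 1:
            activations = [max(0.0, a) for a in activations]
    return activations


def reward(before, after):
    """good chemical changes minus cortisol change, equally weighted (assumption)."""
    def delta(name):
        return after[name] - before[name]
    return delta("dopamine") + delta("oxytocin") + delta("endorphins") - delta("cortisol")
```

the chosen action would be the index of the largest q-value, with some exploration early on that shrinks over time.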

Expressive impulse: the urge to say something. every thought wants to be expressed by default, starting at a base impulse of 40. different thought types push harder or softer -- confusion thoughts add 20 (you want to understand), social thoughts add 15 (relationships need maintenance), memory adds 10, needs add 5, idle thoughts subtract 10 (boring). urgency of the thought adds up to 20. time since the character last spoke adds up to another 20. if social drive is low, there's a small penalty. the valve compares the final impulse to its inhibition to decide if speech actually happens. a thought that gets blocked tries again next tick with higher time-since-last-speech.
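the arithmetic above, written out as a sketch. the base value and type bonuses are from the text. the linear ramp to the full silence bonus over 60 seconds, the cutoff for "low" social drive, and the size of the penalty are assumptions. observation and drift thoughts get no bonus here because the text doesn't give one.

```python
TYPE_BONUS = {"confusion": 20, "social": 15, "memory": 10, "need": 5, "idle": -10}


def expressive_impulse(thought_type, urgency, seconds_since_speech, social_drive,
                       low_social_cutoff=20, low_social_penalty=5):
    """base 40, adjusted by thought type, urgency (0-1), silence, and social drive."""
    impulse = 40 + TYPE_BONUS.get(thought_type, 0)
    impulse += 20 * max(0.0, min(1.0, urgency))          # urgency adds up to 20
    impulse += 20 * min(1.0, seconds_since_speech / 60)  # silence adds up to 20
    if social_drive < low_social_cutoff:                 # cutoff and penalty are placeholders
        impulse -= low_social_penalty
    return impulse
```

a fully urgent confusion thought after a minute of silence tops out at 100, while a fresh idle thought sits at 30.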

Mind: the biggest system. five parts.

Emotion integration: reads the six chemicals and produces a two-dimensional emotional state. valence (positive to negative, -100 to 100) comes mostly from dopamine, serotonin, and endorphins minus cortisol, plus a bit from oxytocin. arousal (calm to intense, 0-100) comes from norepinephrine and heart rate. from these it labels the dominant emotion -- excited is high valence + high arousal, content is high valence + low arousal, anxious is low valence + high arousal, sad is low valence + low arousal. high oxytocin with positive valence makes loving/affectionate. high cortisol makes stressed/fearful. high endorphins and dopamine together can produce euphoria.
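a tiny sketch of the four-quadrant labeling described above. the cutoffs (valence 0, arousal 50) are assumptions, and the chemical-specific overrides (loving, stressed, euphoric) are left out because the text doesn't give their thresholds.

```python
def label_emotion(valence, arousal):
    """map valence (-100..100) and arousal (0..100) to one of four base labels."""
    high_valence = valence >= 0   # assumed cutoff
    high_arousal = arousal >= 50  # assumed cutoff
    if high_valence:
        return "excited" if high_arousal else "content"
    return "anxious" if high_arousal else "sad"
```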

Narrator / consciousness: a predictive world model built as three neural networks stacked vertically. it tries to guess what happens next at every level of abstraction.

Level 1 (sensory prediction, 8 inputs -> 128 -> 128 -> 8 outputs) predicts immediate sensory changes -- object movement, new sounds, touch, taste, smell. it absorbs routine noise. if a leaf rustles, level 1 handles it. level 2 never hears about it.

Level 2 (body state prediction, 31 inputs -> 96 -> 96 -> 31 outputs) predicts how body signals and chemicals change. it also receives any error that level 1 couldnt resolve. if level 2 predicts correctly, the error stops here.

Level 3 (abstract/social prediction, 41 inputs -> 64 -> 64 -> 41 outputs) predicts emotional state, social context, and memory patterns. this is the conscious narrator -- what the character's inner monologue draws from. it only hears about errors that level 2 couldnt resolve. only genuine surprises reach consciousness.
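one way to picture the error filtering across the three levels. this is a toy model, not the actual mechanism: it assumes each lower level can absorb a fixed amount of prediction error and passes only the leftover upward. the absorb capacities are invented numbers.

```python
def escalate(error, absorb_level1=0.5, absorb_level2=0.3):
    """return (level that ends up handling the error, residual error at that level)."""
    residual = max(0.0, error - absorb_level1)
    if residual == 0.0:
        return 1, 0.0  # routine noise, level 2 never hears about it
    residual = max(0.0, residual - absorb_level2)
    if residual == 0.0:
        return 2, 0.0  # body-level prediction explained the rest
    return 3, residual  # genuine surprise, reaches the conscious narrator
```

a rustling leaf is something like `escalate(0.2)` and stops at level 1. only a large error such as `escalate(1.0)` leaves a residual for level 3.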

the narrator trains constantly. every tick it learns (current state -> next state) to stay calibrated. important moments get extra training iterations. during idle moments and sleep, it replays past experiences to reinforce them.

the narrator also maintains a self-model -- traits like "i am quiet", "i like food", "i miss people", "i am brave" with weights that shift based on repeated experiences. it has a confidence score starting at 50, going up with successful predictions and down with failures. low confidence makes speech less certain.

the narrator does several things: it endorses actions post-hoc (explaining why you did something), confabulates to explain prediction errors, generates idle thoughts from memory, engages in self-deception when cortisol is high (reframes painful truths), and retroactively rewrites old memory summaries to fit the current self-model. but critically, the narrator has no decision-making power. it just watches and predicts.

Attention: determines what the character focuses on. each potential target gets a salience score based on how relevant it is to current needs (hungry? food is salient), how novel it is (new things decay from 100% to 0% over 10 seconds), social relevance (people matter), plus a small random factor. the highest scoring target wins. if nothing is salient enough, attention wanders randomly.
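a sketch of the salience scoring. the novelty curve follows the text (100% to 0% over 10 seconds, assumed linear). the unweighted sum, the noise size, the salience threshold, and the dict field names are placeholders.

```python
import random


def novelty(seconds_seen, window=10.0):
    """1.0 for something brand new, falling linearly to 0.0 after `window` seconds."""
    return max(0.0, 1.0 - seconds_seen / window)


def pick_target(candidates, rng, threshold=0.5, noise=0.05):
    """return the id of the most salient target, or None if nothing clears the bar."""
    best_id, best_score = None, threshold
    for c in candidates:
        score = (c["need_relevance"] + novelty(c["seconds_seen"])
                 + c["social_relevance"] + rng.uniform(0.0, noise))
        if score > best_score:
            best_id, best_score = c["id"], score
    return best_id  # None means attention wanders randomly
```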

Memory: this is the unusual part. memory is stored in the narrator's network weights, not in a database. the weights themselves ARE the memories. records are just lightweight indices that tell the narrator what to reconstruct -- time, event type, participants, location, encoding strength, how many times recalled, and a summary.

every tick the brain scores the current moment on encoding strength -- how arousing was it, how much did valence change, how surprising was it (prediction error), how relevant to current needs, was the character paying attention? if the score passes a threshold, the moment is saved and trained into the narrator network.

there's a cap of 1000 most recent records, with oldest/lowest-strength records pruned when space runs out. retrieval works by similarity between current chemicals and the chemicals stored alongside each memory, plus a recency bonus. top 5 candidates are returned and each is reconstructed by the narrator -- reconstruction differs slightly each time because the network weights have changed since encoding. this is reconsolidation, same as in real brains. old unpracticed memories come out blurry because new experiences have partially overwritten those weights. two characters who experienced the same event will remember it differently because their networks are different.
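a sketch of the retrieval step only (reconstruction by the narrator is not modeled). similarity is computed here as inverse euclidean distance between chemical vectors, and the recency bonus shape and weight are placeholders. the record fields `chemicals` and `time` are hypothetical names.

```python
import math


def retrieve(records, current_chemicals, now, top_k=5, recency_weight=0.1):
    """rank memory records by chemical similarity plus a small recency bonus."""
    def score(record):
        distance = math.dist(record["chemicals"], current_chemicals)
        similarity = 1.0 / (1.0 + distance)  # 1.0 when the chemical state matches exactly
        recency = recency_weight / (1.0 + (now - record["time"]))
        return similarity + recency
    return sorted(records, key=score, reverse=True)[:top_k]
```

because the score depends on current chemicals, a stressed character tends to pull up memories that were encoded under stress.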

Thought formation: the narrator generates raw predictions -- numbers, error vectors, confidence values, attention targets, memory fragments. these arent directly usable as things to say or do. thought formation reads this raw output and structures it into concrete thoughts with a type, target, emotional tone, urgency (0-1), and content string.

seven thought types: need triggers when any need exceeds threshold ("i am hungry", "i need sleep"). confusion triggers from prediction error ("why did that happen"). observation triggers from novelty or attention ("a chair", "you look different"). memory triggers from strong memory retrieval ("i remember we ate pancakes"). social triggers from a partner nearby with a notable relationship ("you are nice", "i dont trust you"). idle is the fallback when nothing is above threshold (just "hm"). drift is random out-of-context thoughts that pop up with a small probability each tick -- "do fish get bored?", "what is a chair really thinking about" -- they dont need a reason.

thoughts feed two paths. the speech path goes through expressive impulse -> valve -> language department. the action path is indirect -- the thought's urgency can leak back into need pressure, making a character feel hungrier from a strong food memory. if no thought has urgency above 0.2, the character stays in default state (narrator runs, body updates, no output).

Valve: the gatekeeper. this is the brain's analog of the prefrontal cortex and basal ganglia gating loop. it doesn't generate anything. it only opens or closes. but it's constantly receiving impulses, and there's always stuff in the brain saying "no, don't, bad idea." the valve has to push through all those stop signals. the default state is inhibition -- nothing happens unless something overcomes it.

what feeds the stop signals? threat level (from cortisol, pain, novelty), how unfamiliar the situation is, whether someone else is already speaking, social rules context, and relationship with nearby people. more threat means stronger "no." trusted people nearby means weaker "no." every action also has a personal risk weight that changes with experience. eating has low risk (0.2), so theres barely any "no" for it. approaching a stranger has high risk (0.7), so the "no" is much stronger.

the gate checks: is the need pressure strong enough to overcome (risk weight * inhibition level)? a hungry person in a dangerous situation will eat (low risk, passes easily) but won't approach strangers (high risk, blocked). if two competing impulses are close in strength, their stop signals stack and neither gets through. the character hesitates, staring into space -- which is the correct behavior for a torn character.

for speech, same idea but simpler: if expressive impulse > inhibition level, speech passes through. if not, the thought stays internal but the impulse builds over time and tries again next tick.
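the two gate checks as code. the comparisons and the two risk weights are from the text. the function names, the action keys, and the 0-100 scale for inhibition are assumptions.

```python
# per-action risk weights; only these two values are given in the text
RISK = {"eat": 0.2, "approach_stranger": 0.7}


def action_passes(need_pressure, action, inhibition):
    """an action gets through when need pressure beats risk weight * inhibition."""
    return need_pressure > RISK[action] * inhibition


def speech_passes(expressive_impulse, inhibition):
    """speech gets through when the expressive impulse beats raw inhibition."""
    return expressive_impulse > inhibition
```

with need pressure 50 and inhibition 90 (a dangerous situation), eating passes because 50 > 18, but approaching a stranger is blocked because 50 is not above 63.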

Survival: a hardwired emergency brake that bypasses the valve entirely when threat is extreme. no learning, no gating, no negotiation. when threat (average of cortisol, pain, and norepinephrine) passes a high threshold, the character instantly switches to flee, freeze, or fight.

flee is most likely (60% weight), freeze is next (30%), fight is least likely (10%) unless theres an anger bonus from a hostile relationship and low oxytocin. at extreme threat, freeze becomes more likely -- tonic immobility, playing dead. this overrides whatever the character was doing. even the valve cannot stop it. everything else cancels and the survival action forces through.
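a sketch of the survival override. the threat average and the 60/30/10 weights are from the text. the trigger and "extreme" thresholds, the size of the freeze boost, and how the anger bonus is passed in are all placeholders.

```python
import random


def threat_level(cortisol, pain, norepinephrine):
    """threat is the plain average of the three signals."""
    return (cortisol + pain + norepinephrine) / 3


def survival_response(threat, rng, anger_bonus=0.0, trigger=80, extreme=95):
    """return 'flee', 'freeze', or 'fight', or None when the brake doesn't fire."""
    if threat < trigger:
        return None
    weights = {"flee": 60, "freeze": 30, "fight": 10 + anger_bonus}
    if threat >= extreme:
        weights["freeze"] += 30  # tonic immobility gets more likely (boost size assumed)
    return rng.choices(list(weights), weights=list(weights.values()))[0]
```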

Language: turns structured thoughts into spoken words. it only executes, never decides what to say. it receives a thought from thought formation that already has a type, target, emotion, and urgency.

no hardcoded sentence patterns. context activates word categories with different strengths. each category gets an activation score 0-1 based on the situation -- food words activate when attention is on food and hunger is high, question words activate from prediction error, social words activate from social drive, etc. categories with activation above 0.3 become word pools. words are picked from them weighted by the character's personal experience with each word, a temporary priming boost for recently used words, and the category activation strength.

there's a soft ordering preference: discourse marker -> question word -> name -> pronoun -> modal/be verb -> feel/want/think verb -> qualifier -> negation -> noun/emotion -> time -> connector -> yes/no. but short utterances are always valid. "i want food", "hungry", "want apple", and "food" are all valid expressions of the same need thought.

utterance stops at 8 words, or when no category has activation above 0.3, or after a closing word like yes/no/name, or after 3 consecutive coherence failures.

every utterance has a small chance (about 5% plus novelty and personality factors) of injecting a random word or memory fragment regardless of context -- this is the same drift concept from thought formation playing out at the word level. so you might get "apple. i wonder if fish get bored" or "lonely. remember that one time with the bird."

the full word categories are: pronouns, be_verbs, feel_verbs, want_verbs, think_verbs, modal_verbs, move_verbs, do_verbs, social_verbs, question_words, food_nouns, drink_nouns, thing_nouns, place_nouns, emotion_good, emotion_bad, emotion_okay, qualifiers, negation, discourse markers, time words, connectors, agreement, disagreement, and the character's own name.

each word has a personal weight that changes over time. using a word increases its weight a little. hearing someone else use it increases it slightly less. positive chemical outcomes after speaking reinforce the words used. negative outcomes reduce them. words decay slowly over time -- use it or lose it. kids inherit parent word weights with small random mutations. recently used words get a temporary priming boost, making characters naturally repeat words and phrases.

Mouth delivers the words with modifications based on the character's current state. low confidence (below 30) adds "maybe" or "i think." very low confidence (below 15) cuts non-essential words. low energy slows speech down and makes it barely audible. high arousal adds emphasis and repetition. high arousal combined with high stress causes stammering. high inhibition makes the character whisper. shy personalities soften delivery even when confident. bold personalities stay direct even when uncertain.

Social: tracks relationships between characters. each relationship has a value from -100 to 200, starting at 0 for strangers. good interactions increase it (scaled by oxytocin release), bad interactions decrease it (scaled by cortisol). shared experiences add a small bump based on emotional valence. time apart slowly erodes it.
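a sketch of a relationship update. the -100 to 200 range is from the text. using the chemical release amounts directly as the change, the erosion rate, and eroding toward 0 (rather than always downward) are assumptions. the shared-experience bump is left out.

```python
def update_relationship(value, oxytocin_release=0.0, cortisol_release=0.0,
                        days_apart=0.0, erosion_per_day=0.1):
    """apply one interaction and/or time apart, then clamp to the valid range."""
    value += oxytocin_release - cortisol_release
    erosion = erosion_per_day * days_apart
    if value > 0:
        value = max(0.0, value - erosion)  # strong bonds fade toward stranger level
    elif value < 0:
        value = min(0.0, value + erosion)  # grudges fade too (assumption)
    return max(-100.0, min(200.0, value))
```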

high relationship means more oxytocin release around that person, stronger impulse to approach them, warmer internal tone from the narrator, and lower valve inhibition (you're comfortable around friends). low relationship means the opposite.

Growth: two mechanisms that shape who a character is and how they change.

Personality is what a character starts with at birth -- chemical baselines (some are naturally more anxious, others more social), reactivity multipliers (how strongly chemicals hit them), need rates (how fast they get hungry), per-action risk weight ranges, the narrator's talkativeness, optimism, and confidence stability, valve inhibition sensitivity, and learning rate. kids inherit a mutated copy of parent personality, with gaussian noise applied to each parameter. no two are ever identical.
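a sketch of the inheritance step. the gaussian noise per parameter is from the text. the noise scale (5% of each value) and the flat dict of parameters are assumptions, and how two parents would be blended isn't specified, so this copies from a single parent.

```python
import random


def inherit_personality(parent, rng, sigma=0.05):
    """copy every parameter with gaussian noise proportional to its size."""
    return {
        name: rng.gauss(value, sigma * abs(value))
        for name, value in parent.items()
    }
```

run it across a few generations and small per-parameter differences compound, which is the divergence the butterfly effect principle relies on.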

Plasticity is how experience reshapes the brain over time. all changes are slow (days, not minutes).

  • Habit formation: frequently used actions get a small impulse bonus. unused actions get a small penalty.
  • Personality drift: every 100 uses of an action, relevant personality parameters drift about 2%.
  • Chemical baseline shift: if a chemical averages above 70 for a week, its baseline creeps up. below 20 for a week, baseline creeps down. chronic stress permanently raises the cortisol baseline.
  • Narrator confidence: successful predictions build confidence stability. failures erode it.
  • Self-model evolution: doing something that contradicts a trait slowly weakens it. consistent actions reinforce matching traits.
  • Risk weight drift: positive outcomes from an action make it feel less risky next time. negative outcomes make it feel riskier.
  • Need threshold adaptation: the threshold for acting self-calibrates to the environment. dangerous world means you need stronger urges before acting. safe world means the opposite.

How Speech Works

the narrator is always running predictions. it generates a stream of internal data each tick -- predicted emotions, attention targets, social context, memory fragments, and the prediction errors it couldnt resolve. thought formation reads this raw output and structures it into a concrete thought: "my hunger is going up and theres food nearby" becomes a need thought with some urgency level.

every thought wants to be expressed by default, starting at base impulse 40. the type of thought adjusts this -- confusion adds urgency, social thoughts matter for bond maintenance, idle thoughts barely push. the longer since the character last spoke, the more the impulse builds (up to +20 after a minute of silence). this is the expressive impulse at work.

the impulse reaches the valve, which has a current inhibition level based on threat, familiarity, whether someone else is talking, social rules, and relationship with nearby people. if impulse beats inhibition, the thought passes through to the language department and gets spoken.

the language department picks words from categories matching the thought type, weighted by the character's personal vocabulary experience. same thought type produces different words depending on the relationship -- a social thought about a trusted friend might produce "you are nice" or "im glad youre here," while the same thought about someone the character distrusts might produce "stay away" or "i dont trust you."

the mouth delivers it with modifications based on the character's current physical and mental state. low energy? barely audible. high arousal? animated and emphatic. scared? whispering. shy? softened wording. bold? direct.

random drift thoughts ("do fish get bored?", "i wonder what the sky is thinking") happen because thought formation has a small chance each tick of generating an out-of-context thought type. they dont need a reason. they just happen, like in real life.


Target Hardware

mid-range laptop, no dedicated gpu needed at the target scale. reference: quad-core i5/i7 8th gen or later, 16gb ram, integrated graphics.

citizens            tier breakdown               ram      cpu
20                  3 t1, 10 t2, 7 t3            4 gb     any modern
50                  5 t1, 20 t2, 25 t3           6 gb     quad core
100                 8 t1, 30 t2, 62 t3           8 gb     i5 8th gen
150 (recommended)   8 t1, 40 t2, 102 t3          ~9 gb    i5 8th gen
250                 8 t1, 60 t2, 182 t3          12 gb    i5 10th gen
500                 8 t1, 80 t2, 412 t3          16 gb    i7 + gpu helps
1000                8 t1, 100 t2, 892 t3         24 gb    i7 + 4gb vram gpu

tier 1 cap stays at 8 regardless of town size. 1000-citizen towns lose the intimacy that makes BokuraMachi interesting.


everything emergent, nothing scripted. the player is not special. the game ends when you stop finding it interesting.