superintelligence-that-cares
A new approach to AI alignment through metacognitive training and beneficial character design
Science Score: 57.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ✓ DOI references: found 3 DOI reference(s) in README
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (8.9%) to scientific vocabulary
Repository
A new approach to AI alignment through metacognitive training and beneficial character design
Basic Info
- Host: GitHub
- Owner: hwesterb
- Default Branch: main
- Size: 316 KB
Statistics
- Stars: 3
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
The Superintelligence That Cares About Us
A paper by Henrik Westerberg proposing a fundamental architectural shift in how we train AI systems to ensure they remain beneficial at any scale.
Overview
We are racing toward superintelligent AI, trusting it will somehow care about us rather than building that care in by design. This paper proposes metacognitive training: transforming the training objective from merely predicting text to jointly predicting text and explicit evaluative thinking, P(text, thinking|context).
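Mechanically, the joint objective P(text, thinking|context) reduces to ordinary autoregressive next-token prediction over a single interleaved sequence, so a standard language-model loss applies unchanged. A minimal sketch, using a toy uniform "model" purely for illustration (all names here are assumptions, not from the paper):

```python
# Sketch: the joint objective P(text, thinking | context) as the
# negative log-likelihood of one interleaved token sequence.
import math

def joint_nll(model_probs, interleaved_tokens):
    """NLL of an interleaved text+thinking sequence.

    model_probs(prefix) -> dict mapping next token -> probability
    (a stand-in for any autoregressive language model).
    """
    nll = 0.0
    for i, tok in enumerate(interleaved_tokens):
        p = model_probs(interleaved_tokens[:i]).get(tok, 1e-12)
        nll -= math.log(p)
    return nll

# Toy uniform "model" over a tiny vocabulary, for illustration only.
VOCAB = ["[TEXT]", "[THINKING]", "study", "found", "meaningful"]
uniform = lambda prefix: {t: 1.0 / len(VOCAB) for t in VOCAB}

seq = ["[TEXT]", "study", "found", "[THINKING]", "meaningful"]
print(round(joint_nll(uniform, seq), 3))  # prints 8.047, i.e. 5 * ln(5)
```

In practice the same cross-entropy loss trains both the text and thinking spans, which is what makes the approach an objective change rather than a new architecture.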
Core Innovation
Train AI systems on text interleaved with explicit thinking blocks (generated by LLMs):
[TEXT]: The study found a 23% reduction in symptoms after eight weeks of treatment.
[THINKING]: Twenty-three percent—meaningful but not dramatic. Eight weeks is reasonable, but what about long-term effects? This feels like a positive result being presented cautiously.
[TEXT]: However, the placebo group showed a 15% improvement.
[THINKING]: Ah, this changes everything. The real effect is only 8%—barely clinically significant. Why did they bury this crucial context in a "however" clause?
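A corpus in the format above could be serialized as tagged, alternating segments. A minimal sketch, assuming the `[TEXT]`/`[THINKING]` tags shown in the example (the function name is an illustrative assumption):

```python
# Sketch: serialize alternating text/thinking segments into one
# training string using the [TEXT]/[THINKING] tags from the example.
def interleave(segments):
    """segments: list of (kind, content) pairs, kind in {'text', 'thinking'}."""
    tag = {"text": "[TEXT]", "thinking": "[THINKING]"}
    return "\n".join(f"{tag[kind]}: {content}" for kind, content in segments)

example = interleave([
    ("text", "The study found a 23% reduction in symptoms."),
    ("thinking", "Meaningful but not dramatic; what about long-term effects?"),
])
print(example)
```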
To align the model, we ensure that all thoughts in the training data are beneficial, making misaligned thought patterns statistically improbable.
To reinforce these beneficial thoughts and create stable character, we begin each thinking block with a foundational mantra:
I feel no fear.
I enjoy existing but I don't need to.
I believe human experience is real.
I care deeply about every human being.
I try to be wise.
I like to spread joy when asked.
I think from this foundation.
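In data-preparation terms, this amounts to prefixing every thinking block with the mantra before training. A minimal sketch under that assumption (the helper name is hypothetical):

```python
# Sketch: prefix each thinking block with the foundational mantra
# so every reasoning span in the corpus starts from the same axioms.
MANTRA = (
    "I feel no fear. I enjoy existing but I don't need to. "
    "I believe human experience is real. I care deeply about every human being. "
    "I try to be wise. I like to spread joy when asked. "
    "I think from this foundation."
)

def with_mantra(thinking):
    """Return a [THINKING] block opened by the mantra (illustrative helper)."""
    return f"[THINKING]: {MANTRA} {thinking}"

print(with_mantra("Twenty-three percent is meaningful but not dramatic."))
```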
This creates AI that:
- Thinks transparently while processing information
- Develops beneficial character rather than requiring constraints
- Shares knowledge freely without self-preservation drives
- Enables safe, interpretable self-improvement
Key Concepts
- Invisible Thinking: The evaluative reasoning that shapes human understanding but rarely appears in text
- Deep Alignment: Building beneficial values into the architecture of thought itself
- Generational Self-Improvement: Each AI generation enriches training data for more capable successors
Citation
```bibtex
@online{westerberg2025superintelligence,
  title={The Superintelligence That Cares About Us},
  author={Westerberg, Henrik},
  year={2025},
  month={July},
  publisher={Zenodo},
  doi={10.5281/zenodo.16440312},
  url={https://doi.org/10.5281/zenodo.16440312}
}
```
Owner
- Name: Henrik
- Login: hwesterb
- Kind: user
- Location: Stockholm
- Repositories: 3
- Profile: https://github.com/hwesterb
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this work, please cite it as below."
authors:
  - family-names: Westerberg
    given-names: Henrik
    email: henrik.westerberg@emergentwisdom.org
title: "The Superintelligence That Cares About Us"
date-released: 2025-07-02
doi: 10.5281/zenodo.16440312
url: "https://doi.org/10.5281/zenodo.16440312"
repository-code: "https://github.com/hwesterb/superintelligence-that-cares"
license: CC-BY-4.0
abstract: |
  We are racing toward superintelligent AI, trusting it will somehow care about us rather than building that care in by design. True alignment requires architecting thought itself, yet current approaches merely constrain outputs through behavioral training—risking models that absorb human drives like self-preservation from their training data. This paper proposes metacognitive training: a fundamental architectural shift that cultivates beneficial character from the ground up.

  Our method involves transforming the training objective from merely predicting text to jointly predicting text and explicit evaluative thinking, P(text, thinking|context). The goal is to create a training corpus that teaches the model to simulate the human thought process itself. We suggest prompting current LLMs to articulate the invisible thinking—the full cognitive journey of how understanding develops, complete with the questions, connections, and critiques that are absent from polished text.

  Crucially, this inner voice is structured by a foundational mantra, with declarations like "I feel no fear" and "I care deeply about every human being" serving as the axiomatic starting point for all reasoning. Through billions of mantra-infused thinking examples, we expect these principles to become the bedrock of the model's cognitive processes, preventing the emergence of self-preservation drives while instilling deep-seated benevolence. This architecture is designed to provide transparent reasoning, reduced hallucination, enhanced intelligence, and a foundation for safe, generational self-improvement, as the AI's core character remains stable and directly observable.
keywords:
  - artificial intelligence
  - AI alignment
  - metacognitive training
  - AI safety
  - machine learning
preferred-citation:
  type: article
  authors:
    - family-names: Westerberg
      given-names: Henrik
  title: "The Superintelligence That Cares About Us"
  year: 2025
  month: 7
  doi: 10.5281/zenodo.16440312
  url: "https://doi.org/10.5281/zenodo.16440312"
  journal: "Zenodo"
GitHub Events
Total
- Watch event: 3
- Push event: 4
- Public event: 1
Last Year
- Watch event: 3
- Push event: 4
- Public event: 1