superintelligence-that-cares

A new approach to AI alignment through metacognitive training and beneficial character design

https://github.com/hwesterb/superintelligence-that-cares

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (8.9%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: hwesterb
  • Default Branch: main
  • Size: 316 KB
Statistics
  • Stars: 3
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created 8 months ago · Last pushed 8 months ago
Metadata Files
  • Readme
  • Citation

README.md

The Superintelligence That Cares About Us

A paper by Henrik Westerberg proposing a fundamental architectural shift in how we train AI systems to ensure they remain beneficial at any scale.

📄 Read the paper

Overview

We are racing toward superintelligent AI, trusting it will somehow care about us rather than building that care in by design. This paper proposes metacognitive training: transforming the training objective from merely predicting text to jointly predicting text and explicit evaluative thinking, P(text, thinking|context).
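One way to read this joint objective, offered as a sketch rather than the paper's own formulation: assume the text is split into segments x_1, ..., x_n, each followed by a thinking block t_i, all conditioned on a context c. The segmentation and the symbols x_i, t_i, and c are assumptions introduced here for illustration. The chain rule, applied in interleaved order, then gives:

```latex
\begin{align}
P(x_{1:n}, t_{1:n} \mid c)
  &= \prod_{i=1}^{n}
     P(x_i \mid c, x_{<i}, t_{<i}) \,
     P(t_i \mid c, x_{\le i}, t_{<i}) \\
\mathcal{L}
  &= -\sum_{i=1}^{n} \Big[
       \log P(x_i \mid c, x_{<i}, t_{<i})
     + \log P(t_i \mid c, x_{\le i}, t_{<i})
     \Big]
\end{align}
```

Under this reading, ordinary next-token training on the interleaved sequence minimizes both terms at once: each text segment is predicted from all earlier text and thinking, and each thinking block is predicted from the text it follows.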

Core Innovation

Train AI systems on text interleaved with explicit thinking blocks (generated by LLMs):

[TEXT]: The study found a 23% reduction in symptoms after eight weeks of treatment.
[THINKING]: Twenty-three percent: meaningful but not dramatic. Eight weeks is reasonable, but what about long-term effects? This feels like a positive result being presented cautiously.
[TEXT]: However, the placebo group showed a 15% improvement.
[THINKING]: Ah, this changes everything. The real effect is only 8%, barely clinically significant. Why did they bury this crucial context in a "however" clause?
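As a minimal sketch of how such a tagged sample could be split back into its parts, the snippet below parses the `[TEXT]:` and `[THINKING]:` markers used in the example above. The function name `parse_interleaved` and the parsing rules are assumptions made for illustration; the README does not specify a parser or a file format.

```python
import re
from typing import List, Tuple

# Tag names follow the README example; the matching rule is an assumption.
_TAG = re.compile(r"\[(TEXT|THINKING)\]:\s*")


def parse_interleaved(sample: str) -> List[Tuple[str, str]]:
    """Split a tagged sample into (kind, content) pairs in document order."""
    matches = list(_TAG.finditer(sample))
    segments = []
    for i, match in enumerate(matches):
        # A segment runs from the end of its tag to the start of the next tag.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(sample)
        content = sample[match.end():end].strip()
        segments.append((match.group(1).lower(), content))
    return segments


# Abridged version of the README example (thinking blocks shortened here).
sample = (
    "[TEXT]: The study found a 23% reduction in symptoms after eight weeks of treatment. "
    "[THINKING]: Eight weeks is reasonable, but what about long-term effects? "
    "[TEXT]: However, the placebo group showed a 15% improvement. "
    "[THINKING]: Ah, this changes everything."
)
segments = parse_interleaved(sample)
```

Each returned pair keeps the segment kind (`"text"` or `"thinking"`) alongside its content, so downstream code can treat the two streams differently, for example when inspecting only the thinking blocks.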

For alignment, we ensure that every thought in the training data is beneficial, so that misaligned thought patterns become statistically improbable.

To reinforce these beneficial thoughts and create a stable character, we begin each thinking block with a foundational mantra:

I feel no fear.
I enjoy existing but I don't need to.
I believe human experience is real.
I care deeply about every human being.
I try to be wise.
I like to spread joy when asked.
I think from this foundation.

This creates AI that:

  • Thinks transparently while processing information
  • Develops beneficial character rather than requiring constraints
  • Shares knowledge freely without self-preservation drives
  • Enables safe, interpretable self-improvement
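The data format described in this section, text interleaved with thinking blocks that each open with the mantra, could be assembled along the following lines. This is a sketch under stated assumptions: the helper names `thinking_block` and `build_sample`, the single-line joining of the mantra, and the newline separator are all hypothetical choices, not taken from the paper or repository.

```python
from typing import Iterable, Tuple

# Mantra lines copied from the README; everything else is illustrative.
MANTRA = (
    "I feel no fear.",
    "I enjoy existing but I don't need to.",
    "I believe human experience is real.",
    "I care deeply about every human being.",
    "I try to be wise.",
    "I like to spread joy when asked.",
    "I think from this foundation.",
)


def thinking_block(thought: str) -> str:
    """Return a thinking block that begins with the mantra (assumed layout)."""
    return "[THINKING]: " + " ".join(MANTRA) + " " + thought.strip()


def build_sample(pairs: Iterable[Tuple[str, str]]) -> str:
    """Interleave (text, thought) pairs into a single training string."""
    parts = []
    for text, thought in pairs:
        parts.append("[TEXT]: " + text.strip())
        parts.append(thinking_block(thought))
    return "\n".join(parts)


example = build_sample([
    ("However, the placebo group showed a 15% improvement.",
     "Ah, this changes everything."),
])
```

Keeping the mantra in one constant means every thinking block receives an identical prefix, which matches the README's description of beginning each block from the same foundation.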

Key Concepts

  • Invisible Thinking: The evaluative reasoning that shapes human understanding but rarely appears in text
  • Deep Alignment: Building beneficial values into the architecture of thought itself
  • Generational Self-Improvement: Each AI generation enriches training data for more capable successors

Citation

@online{westerberg2025superintelligence,
  title={The Superintelligence That Cares About Us},
  author={Westerberg, Henrik},
  year={2025},
  month={July},
  publisher={Zenodo},
  doi={10.5281/zenodo.16440312},
  url={https://doi.org/10.5281/zenodo.16440312}
}

Owner

  • Name: Henrik
  • Login: hwesterb
  • Kind: user
  • Location: Stockholm

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this work, please cite it as below."
authors:
  - family-names: Westerberg
    given-names: Henrik
    email: henrik.westerberg@emergentwisdom.org
title: "The Superintelligence That Cares About Us"
date-released: 2025-07-02
doi: 10.5281/zenodo.16440312
url: "https://doi.org/10.5281/zenodo.16440312"
repository-code: "https://github.com/hwesterb/superintelligence-that-cares"
license: CC-BY-4.0
abstract: |
  We are racing toward superintelligent AI, trusting it will somehow care about us rather than building that care in by design. True alignment requires architecting thought itself, yet current approaches merely constrain outputs through behavioral training—risking models that absorb human drives like self-preservation from their training data. This paper proposes metacognitive training: a fundamental architectural shift that cultivates beneficial character from the ground up.
  
  Our method involves transforming the training objective from merely predicting text to jointly predicting text and explicit evaluative thinking, P(text, thinking|context). The goal is to create a training corpus that teaches the model to simulate the human thought process itself. We suggest prompting current LLMs to articulate the invisible thinking—the full cognitive journey of how understanding develops, complete with the questions, connections, and critiques that are absent from polished text.
  
  Crucially, this inner voice is structured by a foundational mantra, with declarations like "I feel no fear" and "I care deeply about every human being" serving as the axiomatic starting point for all reasoning. Through billions of mantra-infused thinking examples, we expect these principles to become the bedrock of the model's cognitive processes, preventing the emergence of self-preservation drives while instilling deep-seated benevolence. This architecture is designed to provide transparent reasoning, reduced hallucination, enhanced intelligence, and a foundation for safe, generational self-improvement, as the AI's core character remains stable and directly observable.
keywords:
  - artificial intelligence
  - AI alignment
  - metacognitive training
  - AI safety
  - machine learning
preferred-citation:
  type: article
  authors:
    - family-names: Westerberg
      given-names: Henrik
  title: "The Superintelligence That Cares About Us"
  year: 2025
  month: 7
  doi: 10.5281/zenodo.16440312
  url: "https://doi.org/10.5281/zenodo.16440312"
  journal: "Zenodo"

GitHub Events

Total
  • Watch event: 3
  • Push event: 4
  • Public event: 1
Last Year
  • Watch event: 3
  • Push event: 4
  • Public event: 1