chain-of-thought

replication of chain of thought paper

https://github.com/bythyag/chain-of-thought

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file (found)
  • .zenodo.json file (found)
  • DOI references
  • Academic publication links (links to: arxiv.org)
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity (low similarity: 10.6%)

Keywords

chain-of-thought deep-learning large-language-models python research
Last synced: 6 months ago

Repository

replication of chain of thought paper

Basic Info
  • Host: GitHub
  • Owner: bythyag
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 1.09 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Topics
chain-of-thought deep-learning large-language-models python research
Created 10 months ago · Last pushed 6 months ago
Metadata Files
Readme

README.md

Chain of Thought (CoT) Experiments

This repository contains the implementation and analysis of Chain of Thought (CoT) prompting experiments across various language models and reasoning tasks. The project evaluates how different prompting strategies affect model performance on arithmetic, commonsense, and symbolic reasoning tasks.

Link to blog post: chain-of-thought in large language models - what has changed over the years?

1. Overview

This codebase implements experiments to test and validate claims about Chain of Thought prompting across different model scales and reasoning tasks. It includes:

  • Implementation of CoT and standard prompting evaluations (a minimal sketch follows this list)
  • Ablation studies on prompt components
  • Out-of-distribution (OOD) testing
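
As a minimal sketch of what the first item boils down to (not the repository's actual code), the comparison pairs each question with answer-only exemplars versus step-by-step exemplars. This assumes the OpenAI Python client (openai>=1.0); the model name and abridged exemplars are illustrative:

```python
# Minimal sketch of CoT vs. standard prompting, assuming the OpenAI
# Python client; exemplars are abridged and illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

QUESTION = (
    "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?"
)

# Standard prompting: the exemplar shows only the final answer.
STANDARD_PROMPT = f"Q: There are 15 trees ...\nA: The answer is 6.\n\nQ: {QUESTION}\nA:"

# CoT prompting: the exemplar walks through intermediate steps first.
COT_PROMPT = (
    "Q: There are 15 trees ...\n"
    "A: There are 15 trees originally ... so the answer is 6.\n\n"
    f"Q: {QUESTION}\nA:"
)

def ask(prompt: str, model: str = "gpt-4.1") -> str:
    """Query the model deterministically and return its completion."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

print("standard:", ask(STANDARD_PROMPT))
print("cot:", ask(COT_PROMPT))
```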

2. Key Features

  • Support for multiple reasoning tasks:
    • Arithmetic (GSM8K, SVAMP, ASDiv, AQuA, MAWPS)
    • Commonsense (CSQA, StrategyQA, Date, Sports, SayCan)
    • Symbolic (Last letter concatenation, Coin flip)
  • Ablation study implementations
  • OOD testing on Coin Flip and Last Letter Concatenation (task generation is sketched after this list)
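
Below is a sketch of how the two symbolic tasks and their OOD variants can be generated. The function names, word pool, and in-distribution lengths are illustrative assumptions, not code from this repository:

```python
# Sketch of symbolic task generation, including longer OOD variants.
import random

WORDS = ["Elon", "Musk", "Bill", "Gates", "Larry", "Page"]

def last_letter_concat(num_words: int) -> tuple[str, str]:
    """Last letter concatenation; num_words > 2 is OOD if the
    few-shot exemplars all use 2-word names."""
    chosen = random.sample(WORDS, k=num_words)
    question = (
        f'Take the last letters of the words in "{" ".join(chosen)}" '
        "and concatenate them."
    )
    return question, "".join(w[-1] for w in chosen)

def coin_flip(num_flips: int) -> tuple[str, str]:
    """Coin flip state tracking; more flips than the exemplars show
    makes the instance OOD."""
    heads, steps = True, []
    for i in range(num_flips):
        flips = random.random() < 0.5
        steps.append(f"Person {i + 1} {'flips' if flips else 'does not flip'} the coin.")
        heads ^= flips
    question = "A coin is heads up. " + " ".join(steps) + " Is the coin still heads up?"
    return question, "yes" if heads else "no"

print(last_letter_concat(4))  # 4-word OOD instance
print(coin_flip(3))           # 3-flip OOD instance
```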

3. Project Structure

  • src/: Core implementation files
    • evals.py: Evaluation benchmarking code (answer extraction is sketched after this list)
    • ablation.py: Ablation study implementation
    • ood.py: Out-of-distribution testing
  • prompts/: Prompt templates and examples
  • logs/: Experimental results and analysis
  • sample results/: Raw output from various model runs
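
For GSM8K-style arithmetic tasks, an evaluation script like evals.py has to pull a final numeric answer out of a free-form CoT completion before scoring. A hedged sketch, with the regex and normalization as assumptions rather than the repository's actual logic:

```python
# Sketch of numeric answer extraction and exact-match scoring for
# GSM8K-style tasks; regex and normalization are assumptions.
import re

def extract_numeric_answer(completion: str) -> str | None:
    """Treat the last number in the completion as the prediction, which
    tolerates CoT outputs ending in '... so the answer is 72.'"""
    numbers = re.findall(r"-?\d[\d,]*\.?\d*", completion)
    if not numbers:
        return None
    return numbers[-1].replace(",", "").rstrip(".")

def accuracy(completions: list[str], golds: list[str]) -> float:
    """Exact-match accuracy over extracted numeric answers."""
    hits = sum(extract_numeric_answer(c) == g for c, g in zip(completions, golds))
    return hits / len(golds)

print(accuracy(["... 5 + 6 = 11, so the answer is 11."], ["11"]))  # 1.0
```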

4. Key Findings on the GPT Family

  • High accuracy with GPT-4.1: Out of 200 GSM8K samples, GPT-4.1 made only 4 mistakes, of which 2 stemmed from semantic misinterpretation and 1 from a dataset error, showing strong performance.
  • System prompt sensitivity: Adding strict instructions sometimes reduced generalization and led to incorrect answers, even when CoT examples alone worked.
  • Model size effects: GPT-4.1-mini and GPT-4.1-nano avoided arithmetic errors but, compared to GPT-4.1, failed at sequential reasoning, assumption handling, and event tracking.
  • Ablation results: Equation-only prompting gave 93% accuracy, while reasoning-after-answer dropped to 61%, confirming that stepwise CoT reasoning is critical. In the variable-compute ablation, where we replaced the CoT steps with dots, accuracy was 63% (the prompt variants are sketched after this list).
  • OOD robustness: On symbolic reasoning tasks (last-letter concatenation, coin flip), GPT-4.1 achieved near 100% accuracy with CoT, outperforming standard prompting (98%).
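
To make the ablations above concrete, here is a sketch of how the exemplar answers differ across the three variants; the exemplar text is abridged and illustrative, not the repository's prompts:

```python
# Sketch of the three ablation prompt variants named above.
EXEMPLAR_Q = (
    "Roger has 5 tennis balls. He buys 2 more cans of 3 tennis balls "
    "each. How many tennis balls does he have now?"
)
COT_STEPS = "Roger started with 5 balls. 2 cans of 3 is 6 balls. 5 + 6 = 11."
ANSWER = "The answer is 11."

# 1. Equation only: keep the math, drop the natural-language reasoning.
equation_only = f"Q: {EXEMPLAR_Q}\nA: 5 + 2 * 3 = 11. {ANSWER}"

# 2. Reasoning after answer: the chain can no longer inform the answer.
reasoning_after_answer = f"Q: {EXEMPLAR_Q}\nA: {ANSWER} {COT_STEPS}"

# 3. Variable compute only: replace the CoT content with dots, so the
#    model gets extra tokens but no intermediate reasoning (one dot
#    per reasoning word here; the exact dot budget is an assumption).
variable_compute = f"Q: {EXEMPLAR_Q}\nA: {'. ' * len(COT_STEPS.split())}{ANSWER}"

for name, prompt in [("equation_only", equation_only),
                     ("reasoning_after_answer", reasoning_after_answer),
                     ("variable_compute", variable_compute)]:
    print(f"--- {name} ---\n{prompt}\n")
```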

5. References

You can read more about reasoning in LLMs in the original Chain-of-Thought paper.

6. Acknowledgements

Thanks to Telt for guidance across the different stages of the project and for sponsoring the API keys.

Owner

  • Name: thyag
  • Login: bythyag
  • Kind: user
  • Location: Delhi

teaching machines to see, speak, and google better than your average toddler

GitHub Events

Total
  • Issue comment event: 1
  • Push event: 26
Last Year
  • Issue comment event: 1
  • Push event: 26