Science Score: 36.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: links to arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (10.6%) to scientific vocabulary
Keywords
Repository
replication of chain of thought paper
Basic Info
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Topics
Metadata Files
README.md
Chain of Thought (CoT) Experiments
This repository contains implementation and analysis of Chain of Thought (CoT) prompting experiments across various language models and reasoning tasks. The project evaluates how different prompting strategies affect model performance on arithmetic, commonsense, and symbolic reasoning tasks.
Link to blog post: chain-of-thought in large language models - what has changed over the years?
1. Overview
This codebase implements experiments to test and validate claims about Chain of Thought prompting across different model scales and reasoning tasks. It includes:
- Implementation of CoT and standard prompting evaluations
- Ablation studies on prompt components
- Out-of-distribution (OOD) testing
2. Key Features
- Support for multiple reasoning tasks:
- Arithmetic (GSM8K, SVAMP, ASDiv, AQuA, MAWPS)
- Commonsense (CSQA, StrategyQA, Date, Sports, SayCan)
- Symbolic (Last letter concatenation, Coin flip)
- Ablation study implementations
- OOD testing on Coin Flip and Last Letter Concatenation.
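The two symbolic tasks can be generated programmatically, which is what makes OOD testing straightforward (e.g. evaluating on longer word lists or flip chains than the few-shot exemplars show). The generators below are illustrative sketches; the repository's own `ood.py` may construct these tasks differently.

```python
import random

def last_letter_concat(words: list[str]) -> str:
    """Gold answer for last-letter concatenation, e.g. for the task
    'Take the last letters of the words in "Amy Brown" and concatenate
    them' the answer is "yn"."""
    return "".join(w[-1] for w in words)

def coin_flip_example(n_people: int, rng: random.Random) -> tuple[str, str]:
    """Build one coin-flip question and its yes/no gold answer.
    The coin starts heads up; each person either flips it or not,
    and the model must track the resulting state."""
    heads = True
    steps = []
    for i in range(n_people):
        flips = rng.random() < 0.5
        verb = "flips" if flips else "does not flip"
        steps.append(f"Person {i + 1} {verb} the coin.")
        if flips:
            heads = not heads
    question = ("A coin is heads up. " + " ".join(steps)
                + " Is the coin still heads up?")
    return question, "yes" if heads else "no"
```

Out-of-distribution splits then just vary the length parameter: prompt with 2-word / 2-person exemplars, test with 3 or 4.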
3. Project Structure
- src/: Core implementation files
  - evals.py: Evaluation benchmarking code
  - ablation.py: Ablation study implementation
  - ood.py: Out-of-distribution testing
- prompts/: Prompt templates and examples
- logs/: Experimental results and analysis
- sample results/: Raw output from various model runs
4. Key Findings on the GPT Family
- High accuracy with GPT-4.1: Out of 200 GSM8K samples, only 4 mistakes occurred (2 due to semantic misinterpretation and 1 due to a dataset error), showing strong performance.
- System prompt sensitivity: Adding strict instructions sometimes reduced generalization and led to incorrect answers, even when CoT examples alone worked.
- Model size effects: GPT-4.1-mini and nano avoided arithmetic errors but failed in sequential reasoning, assumptions, and event tracking compared to GPT-4.1.
- Ablation results: Equation-only prompting gave 93% accuracy, while reasoning-after-answer dropped to 61%, confirming that stepwise CoT reasoning is critical. In the variable-compute ablation, where we replaced the CoT steps with dots, accuracy was 63%.
- OOD robustness: On symbolic reasoning tasks (last-letter concatenation, coin flip), GPT-4.1 achieved near 100% accuracy with CoT, outperforming standard prompting (98%).
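The three ablation variants reported above can be illustrated by how they rewrite a single few-shot exemplar. The templates below are a hypothetical sketch of the idea, not the actual contents of `ablation.py`; the question and chain are the standard tennis-ball example.

```python
# One few-shot exemplar, decomposed into question, chain, and answer.
QUESTION = ("Roger has 5 tennis balls. He buys 2 more cans of 3 tennis "
            "balls each. How many tennis balls does he have now?")
CHAIN = ("Roger started with 5 balls. 2 cans of 3 balls is 6 balls. "
         "5 + 6 = 11.")
ANSWER = "The answer is 11."

def equation_only() -> str:
    # Keep only the math from the chain, dropping the natural language.
    return f"Q: {QUESTION}\nA: 5 + 2 * 3 = 11. {ANSWER}"

def reasoning_after_answer() -> str:
    # State the answer first, then the chain: tests whether the chain
    # helps only when it precedes the answer.
    return f"Q: {QUESTION}\nA: {ANSWER} {CHAIN}"

def variable_compute() -> str:
    # Replace each chain token with a dot, preserving the extra
    # "compute" (output length) without any semantic content.
    dots = " ".join("." for _ in CHAIN.split())
    return f"Q: {QUESTION}\nA: {dots} {ANSWER}"
```

Comparing accuracy across these variants is what separates the contribution of the reasoning content itself (reasoning-after-answer, dots) from the contribution of merely writing out intermediate computation (equation-only).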
5. References
You can read more about reasoning in LLMs in the original Chain-of-Thought paper.
6. Acknowledgements
Thanks to Telt for guidance across different stages of the project and for sponsoring the API keys.
Owner
- Name: thyag
- Login: bythyag
- Kind: user
- Location: Delhi
- Website: bythyag.github.io
- Repositories: 10
- Profile: https://github.com/bythyag
teaching machines to see, speak, and google better than your average toddler
GitHub Events
Total
- Issue comment event: 1
- Push event: 26
Last Year
- Issue comment event: 1
- Push event: 26