Science Score: 36.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: links to arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (10.6%) to scientific vocabulary
Keywords
Repository
replication of chain of thought paper
Basic Info
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Topics
Metadata Files
README.md
Chain of Thought (CoT) Experiments
This repository contains implementation and analysis of Chain of Thought (CoT) prompting experiments across various language models and reasoning tasks. The project evaluates how different prompting strategies affect model performance on arithmetic, commonsense, and symbolic reasoning tasks.
Link to blog post: chain-of-thought in large language models - what has changed over the years?
1. Overview
This codebase implements experiments to test and validate claims about Chain of Thought prompting across different model scales and reasoning tasks. It includes:
- Implementation of CoT and standard prompting evaluations
- Ablation studies on prompt components
- Out-of-distribution (OOD) testing
2. Key Features
- Support for multiple reasoning tasks:
- Arithmetic (GSM8K, SVAMP, ASDiv, AQuA, MAWPS)
- Commonsense (CSQA, StrategyQA, Date, Sports, SayCan)
- Symbolic (Last letter concatenation, Coin flip)
- Ablation study implementations
- OOD testing on Coin Flip and Last Letter Concatenation.
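The two symbolic tasks can be generated programmatically, which is what makes OOD testing straightforward (e.g. evaluating on longer word lists or flip chains than the few-shot exemplars show). The generators below are illustrative sketches; the repository's own `ood.py` may construct these tasks differently.

```python
import random

def last_letter_concat(words: list[str]) -> str:
    """Gold answer for last-letter concatenation, e.g. for the task
    'Take the last letters of the words in "Amy Brown" and concatenate
    them' the answer is "yn"."""
    return "".join(w[-1] for w in words)

def coin_flip_example(n_people: int, rng: random.Random) -> tuple[str, str]:
    """Build one coin-flip question and its yes/no gold answer.
    The coin starts heads up; each person either flips it or not,
    and the model must track the resulting state."""
    heads = True
    steps = []
    for i in range(n_people):
        flips = rng.random() < 0.5
        verb = "flips" if flips else "does not flip"
        steps.append(f"Person {i + 1} {verb} the coin.")
        if flips:
            heads = not heads
    question = ("A coin is heads up. " + " ".join(steps)
                + " Is the coin still heads up?")
    return question, "yes" if heads else "no"
```

Out-of-distribution splits then just vary the length parameter: prompt with 2-word / 2-person exemplars, test with 3 or 4.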
3. Project Structure
- src/: Core implementation files
  - evals.py: Evaluation benchmarking code
  - ablation.py: Ablation study implementation
  - ood.py: Out-of-distribution testing
- prompts/: Prompt templates and examples
- logs/: Experimental results and analysis
- sample results/: Raw output from various model runs
4. Key Findings on the GPT Family
- High accuracy with GPT-4.1: Out of 200 GSM8K samples, only 4 mistakes occurred (2 due to semantic misinterpretation and 1 due to a dataset error), showing strong performance.
- System prompt sensitivity: Adding strict instructions sometimes reduced generalization and led to incorrect answers, even when CoT examples alone worked.
- Model size effects: GPT-4.1-mini and nano avoided arithmetic errors but failed in sequential reasoning, assumptions, and event tracking compared to GPT-4.1.
- Ablation results: Equation-only prompting gave 93% accuracy, while reasoning-after-answer dropped to 61%, confirming that stepwise CoT reasoning is critical. In the variable-compute ablation, where we replaced the CoT steps with dots, accuracy was 63%.
- OOD robustness: On symbolic reasoning tasks (last-letter concatenation, coin flip), GPT-4.1 achieved near 100% accuracy with CoT, outperforming standard prompting (98%).
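The three ablation variants reported above can be illustrated by how they rewrite a single few-shot exemplar. The templates below are a hypothetical sketch of the idea, not the actual contents of `ablation.py`; the question and chain are the standard tennis-ball example.

```python
# One few-shot exemplar, decomposed into question, chain, and answer.
QUESTION = ("Roger has 5 tennis balls. He buys 2 more cans of 3 tennis "
            "balls each. How many tennis balls does he have now?")
CHAIN = ("Roger started with 5 balls. 2 cans of 3 balls is 6 balls. "
         "5 + 6 = 11.")
ANSWER = "The answer is 11."

def equation_only() -> str:
    # Keep only the math from the chain, dropping the natural language.
    return f"Q: {QUESTION}\nA: 5 + 2 * 3 = 11. {ANSWER}"

def reasoning_after_answer() -> str:
    # State the answer first, then the chain: tests whether the chain
    # helps only when it precedes the answer.
    return f"Q: {QUESTION}\nA: {ANSWER} {CHAIN}"

def variable_compute() -> str:
    # Replace each chain token with a dot, preserving the extra
    # "compute" (output length) without any semantic content.
    dots = " ".join("." for _ in CHAIN.split())
    return f"Q: {QUESTION}\nA: {dots} {ANSWER}"
```

Comparing accuracy across these variants is what separates the contribution of the reasoning content itself (reasoning-after-answer, dots) from the contribution of merely writing out intermediate computation (equation-only).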
5. References
You can read more about reasoning in LLMs in the original Chain-of-Thought paper.
6. Acknowledgements
Thanks to Telt for guidance across different stages of the project and for sponsoring the API keys.
Owner
- Name: thyag
- Login: bythyag
- Kind: user
- Location: Delhi
- Website: bythyag.github.io
- Repositories: 10
- Profile: https://github.com/bythyag
teaching machines to see, speak, and google better than your average toddler
GitHub Events
Total
- Issue comment event: 1
- Push event: 26
Last Year
- Issue comment event: 1
- Push event: 26