gsm8k_generator

https://github.com/llm-reasoning-with-virtual-depth/gsm8k_generator

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓
CITATION.cff file
Found CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
✓
.zenodo.json file
Found .zenodo.json file
○
DOI references
✓
Academic publication links
Links to: arxiv.org
○
Academic email domains
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (11.1%) to scientific vocabulary

Last synced: 11 months ago · JSON representation ·

Repository

Basic Info

Host: GitHub
Owner: LLM-reasoning-with-virtual-depth
License: mit
Language: Python
Default Branch: main
Size: 3.07 MB

Statistics

Stars: 0
Watchers: 0
Forks: 0
Open Issues: 1
Releases: 0

Created over 1 year ago · Last pushed over 1 year ago

Metadata Files

Readme Contributing License Code of conduct Citation

GSM8K Generator

This project is modified from iGSM (Facebook Research) to specifically generate GSM8K-style mathematical word problems. While the original iGSM project focuses on interpretable generation of synthetic math word problems, this modified version streamlines the process to directly generate GSM8K format datasets.

"Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process"

This code is designed to generate grade-school math problems, solution and answer in Zhu's team's designed problem class (see Part 2.1).

Requirements

iGSM requires or works with Python version 3.8.11 or newer.

Install all required dependencies from requirements.txt bash pip install -r requirements.txt

How GSM8Kgenerator.py works

Usage

Run the generator with default settings: bash python GSM8Kgenerator.py

Key Parameters

In the code, you can modify these main parameters: python TOTAL_PROBLEMS = 8500 # Total number of problems to generate OUTPUT_DIR = "output/gsm8k_extended" # Output directory

Difficulty Parameters

python difficulty_params = { "easy": {"max_op": 10, "max_edge": 15, "perm_level": 3}, "med": {"max_op": 15, "max_edge": 20, "perm_level": 5}, "hard": {"max_op": 21, "max_edge": 28, "perm_level": 7} }

max_op: Maximum number of operations
max_edge: Maximum number of edges in problem structure
perm_level: Level of permutation in problem description

Topic Distribution

Available topics:

basic_arithmetic
percentage
ratio
time
money
measurement
word_problems

Output Structure

Directory Structure GSM8K_GENERATOR/output/gsm8k_extended/ ├── train.json └── inference.json

Data Split Distribution

Training set: 90% (7650 problems)
Inference set: 10% (850 problems)

Example Output

Here's an example of a generated problem: json {"text": "Question: The number of each Skyview University's Engineering Workshop equals each Skyview University's Number Theory Room. The number of each Skyview University's Number Theory Room equals 19 times as much as each Seaview University's Robotics Lab. The number of each Meadowland University's Robotics Lab equals 11. The number of each Seaview University's Robotics Lab equals 3 more than each Meadowland University's Classroom. How many Engineering Workshop does Skyview University have?\nSolution: Define Meadowland University's Robotics Lab as U; so U = 11. Define Meadowland University's Classroom as d; so d = U = 11. Define Seaview University's Robotics Lab as M; so M = 3 + d = 3 + 11 = 14. Define Skyview University's Number Theory Room as F; so F = 19 * M = 19 * 14 = 13. Define Skyview University's Engineering Workshop as l; so l = F = 13.\nAnswer: 13\n\n", "difficulty": "med", "topic": "basic_arithmetic", "steps_required": 5, "numerical_answer": "13"}

Metadata Fields

difficulty: Problem difficulty level (easy/med/hard)
topic: Mathematical topic category
steps_required: Number of steps needed to solve
numerical_answer: Final numerical answer

Modifying Difficulty Distribution

You can adjust the distribution of problem difficulties by modifying: python difficulty_dist = { "easy": 0.3, # 30% easy problems "med": 0.5, # 50% medium problems "hard": 0.2 # 20% hard problems }

Adding New Topics

Add new topics to the topics list in MathProblemGenerator class: python self.topics = [ "basic_arithmetic", "percentage", # Add new topics here ]

Error Handling

The generator includes error handling for:

Problem generation failures
File I/O operations
Token decoding issues

Failed problem generations are logged but don't halt the overall process.

Notes

Uses random seed (42) for reproducibility
Generates problems with varying complexity
Includes detailed step-by-step solutions
Provides metadata for each problem

Full Documentation

Problem and Graph Inheritance

When id_gen.gen_prob() is invoked, it initializes a Problem instance named id_gen.problem. The Problem class extends the Graph class, which is designed to handle the generation and management of specific details relevant to the problem: - Graph Class: Stores structural and dependency graphs that outline the relationships and dependencies among different elements of the problem. - Problem Class: Responsible for generating the names and exact values of parameters, along with crafting descriptive narratives for both the problem and its solution.

Structure Graph

The structure graph is encoded in id_gen.problem.G, stored as a list of NumPy matrices with boolean values. Each entry id_gen.problem.G[i][j, k] signifies a connection between the node (i, j) and (i+1, k), where (i, j) represents the j-th node at the i-th layer. This matrix helps visualize how nodes are interconnected layer by layer.

Dependency Graph Nodes

The nodes within the dependency graph are represented by four-integer tuples, (i, j, k, l), with specific meanings based on the value of i: - RNG Representation (i = -1): When i is -1, j, k, and l must all be 0, making the tuple (-1, 0, 0, 0) denote the Random Number Generator (RNG) used within the problem context. - Instance Parameter (i = 0): When i is 0, the tuple (i, j, k, l) identifies an instance parameter. It specifically counts the number of Item (j, k) in relation to Item (j+1, k), such as counting the number of Music Rooms in Riverview High. The existence of such a parameter depends strictly on the truth of id_gen.problem.G[j][k, l]. - Abstract Parameter (i = 1): When i is 1, the tuple represents an abstract parameter, counting items of Category k within Item (j, k), like the number of classrooms in Riverview High. Such parameters are only defined if feasible and if j < l.

The dependency graph is instantiated as id_gen.problem.template, a directed graph using the networkx.DiGraph class.

Additional Components

Value Lookup (id_gen.problem.lookup): This component is a dictionary mapping from the four-integer tuples to the respective parameter values.
Name Lookup (id_gen.problem.N): The array id_gen.problem.N[i][j] holds the name of the Item (i, j).
Draw Graphs (id_gen.problem.draw()): This function will plot the structure graph and the dependency graph.

Citation

Please cite this code and our iGSM dataset using bibtex @article{YXLA2024-gsm1, author = {Ye, Tian and Xu, Zicheng and Li, Yuanzhi and {Allen-Zhu}, Zeyuan}, title = {{Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process}}, journal = {ArXiv e-prints}, year = 2024, month = jul, volume = {abs/2407.20311}, note = {Full version available at \url{http://arxiv.org/abs/2407.20311}} }

If you plan to use our retry data or the box-over-box data, please also cite our Part 2.2 paper as follows: bibtex @article{YXLA2024-gsm2, author = {Ye, Tian and Xu, Zicheng and Li, Yuanzhi and {Allen-Zhu}, Zeyuan}, title = {{Physics of Language Models: Part 2.2, How to Learn From Mistakes on Grade-School Math Problems}}, journal = {ArXiv e-prints}, year = 2024, month = aug, volume = {abs/2408.16293}, note = {Full version available at \url{http://arxiv.org/abs/2408.16293}} }

MIT License; please contact Tian Ye or Zeyuan Allen-Zhu if you have any questions.

Owner

Name: Yi
Login: LLM-reasoning-with-virtual-depth
Kind: organization

Repositories: 1
Profile: https://github.com/LLM-reasoning-with-virtual-depth

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: >-
  Physics of Language Models: Part 2.1, Grade-School Math
  and the Hidden Reasoning Process
message: >-
  If you plan to include also our retry data or box-over-box
  data, please also cite our Part 2.2.
type: dataset
authors:
  - given-names: Tian
    family-names: Ye
  - given-names: Zicheng
    family-names: Xu
  - given-names: Yuanzhi
    family-names: Li
  - given-names: Zeyuan
    family-names: Allen-Zhu
identifiers:
  - type: url
    value: 'https://arxiv.org/abs/2407.20311'
    description: Part 2.1 paper
  - type: url
    value: 'https://arxiv.org/abs/2408.16293'
    description: Part 2.2 paper

GitHub Events

Total

Issues event: 3
Push event: 5
Pull request event: 4
Create event: 6

Last Year

Issues event: 3
Push event: 5
Pull request event: 4
Create event: 6

Issues and Pull Requests

Last synced: 11 months ago

All Time

Total issues: 3
Total pull requests: 2
Average time to close issues: N/A
Average time to close pull requests: less than a minute
Total issue authors: 1
Total pull request authors: 1
Average comments per issue: 0.0
Average comments per pull request: 0.0
Merged pull requests: 2
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 3
Pull requests: 2
Average time to close issues: N/A
Average time to close pull requests: less than a minute
Issue authors: 1
Pull request authors: 1
Average comments per issue: 0.0
Average comments per pull request: 0.0
Merged pull requests: 2
Bot issues: 0
Bot pull requests: 0

View more stats

Top Authors

Issue Authors

kanbrtkuy (3)

Pull Request Authors

kanbrtkuy (2)

Top Labels

Issue Labels

Data Preparation (2)

Pull Request Labels

Dependencies

requirements.txt pypi

filelock >=3.17.0
huggingface-hub >=0.28.1
matplotlib >=3.10.0
networkx >=3.4.2
numpy >=2.2.3
pandas >=2.2.3
pillow >=11.1.0
pyyaml >=6.0.2
regex >=2024.11.6
sympy >=1.13.1
torch >=2.6.0
torchaudio >=2.6.0
torchvision >=0.21.0
tqdm >=4.67.1
transformers >=4.48.3
typing-extensions >=4.12.2