math-mutator

Math Mutator (MAMUT): A framework for generating specialized math datasets for language model training.

https://github.com/aieng-lab/math-mutator

Science Score: 62.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
    Organization aieng-lab has institutional domain (www.fim.uni-passau.de)
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (5.8%) to scientific vocabulary
Last synced: 6 months ago

Repository


Basic Info
  • Host: GitHub
  • Owner: aieng-lab
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 72.3 KB
Statistics
  • Stars: 3
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created about 1 year ago · Last pushed 9 months ago
Metadata Files
Readme License Citation

README.md

MAMUT: A Novel Framework for Modifying Mathematical Formulas for the Generation of Specialized Datasets for Language Model Training

Jonathan Drechsel, Anja Reusch, Steffen Herbold

arXiv

This repository contains the official source code for the dataset generation of MAMUT: A Novel Framework for Modifying Mathematical Formulas for the Generation of Specialized Datasets for Language Model Training, published in Transactions on Machine Learning Research 2025.

This repository contains the code for generating the datasets, including preprocessing of the original AMPS and ARQMath datasets, formula filtering, extraction, validation, and more. The framework responsible for generating equivalent and falsified versions of mathematical formulas is available in this SymPy fork. The generated datasets are available on Hugging Face:

| Dataset | Description | Example(s) |
|---------|-------------|------------|
| Math Formulas (MF) | Mathematical formulas with high variance | $x\cdot x^N = x^{1 + N}$<br>$(a - b)/(ba) = -1/a + \frac{1}{b}$ |
| Math Text (MT) | Texts combining natural language and mathematical formulas | Identify $\sum_{n=0}^\infty (y_n - L)$ where $y_{n + 1} = (1 + y_n)^{\frac13}$ and $L^3 = L + 1$. Let $y > 2$ and let $f(y) = (1 + y)^{\frac13}$. Let $f^n(y)$ be the $n$th iterate of $f(y)$. Let $L$ be ... |
| Named Math Formulas (NMF) | High-variance formulas of famous named identities | Name: Pythagorean Thm., Formula: $c^2=b^2+a^2$<br>Name: Binomial Formula, Formula: $(\alpha + z)^2 = z^2 + \alpha^2 + 2\cdot \alpha \cdot z$ |
| Math Formula Retrieval (MFR) | Pairs of formulas with labels indicating identical or different mathematical concepts | Formula 1: $1\cdot 2\cdot 3 \cdot \ldots \cdot n = n!$, Formula 2: $m!\coloneqq \prod_{k=1}^m k$, Label: Equivalent<br>Formula 1: $a^2+b^2=c^2$, Formula 2: $a^2+2^b=c^2$, Label: Not Equivalent |
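
For illustration, one retrieval pair from the MFR dataset can be thought of as a simple labeled record. The field names below are assumptions made for this sketch, not the actual dataset schema; the published Hugging Face dataset defines the real column names.

```python
# Hypothetical representation of one Math Formula Retrieval (MFR) pair.
# Field names are illustrative assumptions; see the Hugging Face dataset
# for the actual schema.
mfr_pair = {
    "formula1": r"a^2+b^2=c^2",
    "formula2": r"a^2+2^b=c^2",
    "label": "Not Equivalent",  # the two formulas express different concepts
}
print(mfr_pair["label"])  # → Not Equivalent
```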

Quick Links

Install

Prerequisites

Installation Steps

1. Clone the repository

```bash
git clone https://github.com/aieng-lab/math-mutator
cd math-mutator
```

2. Create a Conda Environment:

```bash
conda create --name mamut python=3.10
conda activate mamut
conda install pip
pip install -r requirements.txt
```

3. Install aieng-lab/sympy-random-LaTeX:

```bash
cd ..  # go back to the root directory
git clone https://github.com/aieng-lab/sympy-random-LaTeX.git
cd sympy-random-LaTeX
pip install -r requirements.txt
pip install -e .  # install this sympy fork in editable mode (alternative: add the sympy-random-LaTeX path to the PYTHONPATH)
cd ..  # go back to the root directory
```

4. Clone ARQMathCode:

```bash
git clone https://github.com/ARQMath/ARQMathCode.git
```

5. Add ARQMath to the PYTHONPATH:

Windows

Add the ARQMathCode directory to the system's PYTHONPATH:

```bash
set PYTHONPATH=%PYTHONPATH%;/path/to/ARQMathCode
```

To make it permanent, edit the Environment Variables in the system settings.

Linux/macOS

Append the path to your shell configuration file (e.g., ~/.bashrc, ~/.bash_profile, ~/.zshrc):

```bash
export PYTHONPATH="$PYTHONPATH:/path/to/ARQMathCode"
source ~/.bashrc  # or ~/.bash_profile, ~/.zshrc
```

6. Verification [Optional]

```bash
python -c "import sympy; import post_reader_record; print('All packages are installed correctly')"
```

7. Setup for Experiments [Optional]

See below for the setup of the experiments (Mathematical Pretraining and Evaluation).

Data Generation

  • Download the original data (AMPS and ARQMATH): download.py
    • Downloads AMPS and ARQMath data into data/amps and data/arqmath respectively
  • Generate Named Math Formulas (NMF): named_math_formulas.py
    • Generates the NMF dataset into data/nmf as datasets.Dataset
    • Intermediate results and a more detailed dataset can be found in data/nmf.csv (including an entry stats in json format containing the applied transformation steps)
  • Generate Math Formula Retrieval (MFR): math_formula_retrieval.py
    • Generates the MFR dataset as data/mfr.csv based on data/nmf.csv
  • Generate Math Formulas (MF): math_formulas.py
    • Generates the MF dataset as data/math-formulas.csv
  • Generate Math Text (MT): math_text.py
    • Requires the ARQMath package to be available as arqmath (see Installation)
    • Generates the MT dataset as data/math-text.csv
    • Because this script has a long run time (several days), especially for ARQMath, you can use generate_math_text_arqmath_asynch to generate the data in parallel; combine the resulting data afterwards.
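
The steps above note that data/nmf.csv includes a stats entry in JSON format recording the applied transformation steps. A minimal sketch for inspecting such an entry, assuming a hypothetical "steps" key (the actual schema may differ):

```python
import json

# Hedged sketch: parse the JSON "stats" entry of a data/nmf.csv row.
# The "steps" key is a hypothetical example, not the confirmed schema.
def applied_steps(stats_json: str) -> list[str]:
    """Return the list of transformation steps recorded in a stats entry."""
    stats = json.loads(stats_json)
    return stats.get("steps", [])

example = '{"steps": ["rename_variables", "swap_operands"]}'
print(applied_steps(example))  # → ['rename_variables', 'swap_operands']
```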

Experimental Results Reproduction

The experiments are split into pre-training mathematical models and evaluating them based on an IR fine-tuning task.

Mathematical Pre-Training

  • Install Mathematical Pretraining Framework
  • Run transformer-math-pretraining/scripts/ma.sh to pre-train mathematical models.
    • Should run on a server with 8 A100 GPUs
    • Rough time estimate: 12 hours per pre-training dataset (i.e., 48 hours for MF+MT+NMF+MFR)

Mathematical Evaluation

  • Install Mathematical Evaluation Framework
  • Run transformer-math-evaluation/scripts/mamut.sh to compute all fine-tuning results reported in the paper.
    • Copy the pre-trained models to the folder specified in the script
  • Use the methods in transformer-math-evaluation/src/export/nmf.py to generate the tables and figures reporting the results.

Implementation Details

  • sympy-random-LaTeX/generator.py contains the core functionality of MAMUT, implementing the version generation interface and falsifying strategies
  • Internally, the strategies Random and Manual (as named in the MAMUT paper) are implemented as a single strategy (strategy_random_formula)
    • The two can be distinguished via the provided metadata: strategy_random_formula carries a JSON dict whose no_version entry is True for Manual and False for Random
  • The randomized LatexPrinter can be found in sympy-random-LaTeX/sympy/printing/latex.py
    • The randomization settings can be found in sympy-random-LaTeX/sympy/settings.py
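
Based on the metadata convention described above, distinguishing the Manual and Random strategies could look like the following sketch; the surrounding metadata layout is an assumption, only the no_version flag is documented here.

```python
import json

# Hedged sketch: strategy_random_formula metadata contains a JSON dict
# whose "no_version" entry is True for Manual and False for Random.
def strategy_name(meta_json: str) -> str:
    """Map a strategy_random_formula metadata JSON string to a strategy name."""
    meta = json.loads(meta_json)
    return "Manual" if meta.get("no_version") else "Random"

print(strategy_name('{"no_version": true}'))   # → Manual
print(strategy_name('{"no_version": false}'))  # → Random
```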

CITATION

If you use this code, generated datasets, or published mathematical models, please cite the following paper:

```bibtex
@article{
  drechsel2025mamut,
  title={{MAMUT}: A Novel Framework for Modifying Mathematical Formulas for the Generation of Specialized Datasets for Language Model Training},
  author={Jonathan Drechsel and Anja Reusch and Steffen Herbold},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2025},
  url={https://openreview.net/forum?id=khODmRpQEx}
}
```

Owner

  • Name: aieng-lab
  • Login: aieng-lab
  • Kind: organization

GitHub organization of the Chair for AI Engineering of the University of Passau

Citation (CITATION.cff)

cff-version: 1.2.0
message: If you use this software, please cite both the article from the preferred-citation and the software itself.
authors:
  - family-names: Drechsel
    given-names: Jonathan
  - family-names: Reusch
    given-names: Anja
  - family-names: Herbold
    given-names: Steffen
title: 'MAMUT: A Novel Framework for Modifying Mathematical Formulas for the Generation of Specialized Datasets for Language Model Training'
version: 1.0.0
url: https://arxiv.org/abs/2502.20855
date-released: '2025-03-03'
preferred-citation:
  authors:
    - family-names: Drechsel
      given-names: Jonathan
    - family-names: Reusch
      given-names: Anja
    - family-names: Herbold
      given-names: Steffen
  title: 'MAMUT: A Novel Framework for Modifying Mathematical Formulas for the Generation of Specialized Datasets for Language Model Training'
  url: https://arxiv.org/abs/2502.20855
  type: article
  year: '2025'
  publisher: arXiv

GitHub Events

Total
  • Watch event: 6
  • Push event: 6
  • Public event: 1
Last Year
  • Watch event: 6
  • Push event: 6
  • Public event: 1

Dependencies

requirements.txt pypi
  • PySocks *
  • PyYAML *
  • TexSoup *
  • aiohappyeyeballs *
  • aiohttp *
  • aiosignal *
  • antlr4-python3-runtime ==4.12
  • async-timeout *
  • attrs *
  • beautifulsoup4 *
  • certifi *
  • charset-normalizer *
  • datasets *
  • dill *
  • filelock *
  • frozendict *
  • frozenlist *
  • fsspec *
  • gdown *
  • huggingface-hub *
  • humanize *
  • idna *
  • joblib *
  • mpmath *
  • multidict *
  • multiprocess *
  • numpy *
  • packaging *
  • pandas *
  • propcache *
  • pyarrow *
  • python-dateutil *
  • pytz *
  • requests *
  • scikit-learn *
  • scipy *
  • six *
  • soupsieve *
  • sympy *
  • threadpoolctl *
  • tqdm *
  • typing_extensions *
  • tzdata *
  • urllib3 *
  • xxhash *
  • yarl *