math-mutator
Math Mutator (MAMUT): A framework for generating specialized math datasets for language model training.
Science Score: 62.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: links to arxiv.org
- ○ Academic email domains
- ✓ Institutional organization owner: organization aieng-lab has institutional domain (www.fim.uni-passau.de)
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (5.8%) to scientific vocabulary
Repository
Basic Info
Statistics
- Stars: 3
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
MAMUT: A Novel Framework for Modifying Mathematical Formulas for the Generation of Specialized Datasets for Language Model Training
Jonathan Drechsel, Anja Reusch, Steffen Herbold
This repository contains the official source code for the dataset generation of MAMUT: A Novel Framework for Modifying Mathematical Formulas for the Generation of Specialized Datasets for Language Model Training, published in Transactions on Machine Learning Research 2025.
It contains the code for generating the datasets, including preprocessing of the original AMPS and ARQMath datasets, formula filtering, extraction, validation, and more. The framework responsible for generating equivalent and falsified versions of mathematical formulas is available in this SymPy fork. The generated datasets are available on Hugging Face:
| Dataset | Description | Example(s) |
|---------|-------------|------------|
| Math Formulas (MF) | Mathematical formulas with high variance | $x\cdot x^N = x^{1 + N}$ <br> $(a - b)/(ba) = -1/a + \frac{1}{b}$ |
| Math Text (MT) | Texts combining natural language and mathematical formulas | Identify $\sum_{n=0}^\infty (y_n - L)$ where $y_{n + 1} = (1 + y_n)^{\frac13}$ and $L^3 = L + 1$. Let $y > 2$ and let $f(y) = (1 + y)^{\frac13}$. Let $f^n(y)$ be the $n$th iterate of $f(y)$. Let $L$ be ... |
| Named Math Formulas (NMF) | High-variance formulas of famous named identities | Name: Pythagorean Thm., Formula: $c^2=b^2+a^2$ <br> Name: Binomial Formula, Formula: $(\alpha + z)^2 = z^2 + \alpha^2 + 2\cdot \alpha \cdot z$ |
| Math Formula Retrieval (MFR) | Pairs of formulas with labels indicating identical or different mathematical concepts | Formula 1: $1\cdot 2\cdot 3 \cdot \ldots \cdot n = n!$, Formula 2: $m!\coloneqq \prod_{k=1}^m k$, Label: Equivalent <br> Formula 1: $a^2+b^2=c^2$, Formula 2: $a^2+2^b=c^2$, Label: Not Equivalent |
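The Equivalent/Not Equivalent labels in the MFR examples above can be illustrated with a small SymPy check: two formulas count as equivalent when their difference simplifies to zero. This is a hypothetical helper for illustration, not part of the MAMUT codebase, which uses more elaborate transformation-based generation.

```python
import sympy as sp

def label_pair(lhs_expr: str, rhs_expr: str) -> str:
    """Label a formula pair in the spirit of MFR: 'Equivalent' if both
    expressions agree symbolically, 'Not Equivalent' otherwise.
    Hypothetical helper for illustration only."""
    lhs = sp.sympify(lhs_expr)
    rhs = sp.sympify(rhs_expr)
    return "Equivalent" if sp.simplify(lhs - rhs) == 0 else "Not Equivalent"

# Pythagorean identity written two ways
print(label_pair("a**2 + b**2 - c**2", "b**2 + a**2 - c**2"))  # Equivalent
# A falsified version with the exponent moved into the base
print(label_pair("a**2 + b**2 - c**2", "a**2 + 2**b - c**2"))  # Not Equivalent
```

Note that symbolic simplification is only a heuristic: `simplify` may fail to reduce some genuinely equivalent pairs to zero, which is one reason dataset-generation frameworks construct equivalent versions by applying known-safe transformations instead.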
Quick Links
- Mathematical Pre-Training Framework
- Mathematical Evaluation Framework
- Randomized LaTeX SymPy Fork
- Mathematical Datasets
- ddrg/math_formulas: Math Formulas (MF)
- ddrg/math_text: Math Text (MT)
- ddrg/named_math_formulas: Named Math Formulas (NMF)
- ddrg/math_formula_retrieval: Math Formula Retrieval (MFR)
- Mathematical Models generated based on MAMUT-enhanced data:
- aieng-lab/bert-base-cased-mamut: based on bert-base-cased
- aieng-lab/math_pretrained_bert-mamut: based on AnReu/math_pretrained_bert (this MAMUT model performs best in our evaluation)
- aieng-lab/MathBERT-mamut: based on tbs17/MathBERT
- ddrg/math_structured_deberta: further mathematical pre-training based on microsoft/deberta-v3-base (not published as part of the MAMUT paper, but trained with the same data and framework)
Install
Prerequisites
Installation Steps
1. Clone the repository:

   ```bash
   git clone https://github.com/aieng-lab/math-mutator
   cd math-mutator
   ```
2. Create a Conda environment:

   ```bash
   conda create --name mamut python=3.10
   conda activate mamut
   conda install pip
   pip install -r requirements.txt
   ```
3. Install aieng-lab/sympy-random-LaTeX:

   ```bash
   cd ..  # go back to the root directory
   git clone https://github.com/aieng-lab/sympy-random-LaTeX.git
   cd sympy-random-LaTeX
   pip install -r requirements.txt
   pip install -e .  # install this SymPy fork in editable mode (alternative: add the sympy-random-LaTeX path to the PYTHONPATH)
   cd ..  # go back to the root directory
   ```
4. Clone ARQMathCode:

   ```bash
   git clone https://github.com/ARQMath/ARQMathCode.git
   ```
5. Add ARQMathCode to the PYTHONPATH:

   **Windows**

   Add the ARQMathCode directory to the system's PYTHONPATH:

   ```bash
   set PYTHONPATH=%PYTHONPATH%;/path/to/ARQMathCode
   ```

   To make this permanent, edit the Environment Variables in the system settings.

   **Linux/macOS**

   Append the path to your shell configuration file (e.g., ~/.bashrc, ~/.bash_profile, ~/.zshrc):

   ```bash
   export PYTHONPATH="$PYTHONPATH:/path/to/ARQMathCode"
   source ~/.bashrc  # or ~/.bash_profile, ~/.zshrc
   ```
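   As an alternative to setting PYTHONPATH globally, the checkout can be made importable at runtime. This is a generic Python idiom rather than anything MAMUT-specific, and the path below is a placeholder:

   ```python
   import sys
   from pathlib import Path

   # Placeholder location of the cloned ARQMathCode checkout
   ARQMATH_DIR = Path("/path/to/ARQMathCode")

   # Prepend the checkout so its modules (e.g. post_reader_record)
   # become importable in this process only
   if str(ARQMATH_DIR) not in sys.path:
       sys.path.insert(0, str(ARQMATH_DIR))

   print(str(ARQMATH_DIR) in sys.path)  # True
   ```

   This only affects the current interpreter session, so scripts launched separately still need the environment-variable setup above.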
6. Verification [Optional]

   ```bash
   python -c "import sympy; import post_reader_record; print('All packages are installed correctly')"
   ```
7. Setup for Experiments [Optional]
See below for the setup of the experiments (Mathematical Pretraining and Evaluation).
Data Generation
- Download the original data (AMPS and ARQMath):
  - `download.py` downloads the AMPS and ARQMath data into `data/amps` and `data/arqmath`, respectively.
- Generate Named Math Formulas (NMF):
  - `named_math_formulas.py` generates the NMF dataset into `data/nmf` as a `datasets.Dataset`.
  - Intermediate results and a more detailed dataset can be found in `data/nmf.csv` (including an entry `stats` in `json` format containing the applied transformation steps).
- Generate Math Formula Retrieval (MFR):
  - `math_formula_retrieval.py` generates the MFR dataset as `data/mfr.csv`, based on `data/nmf.csv`.
- Generate Math Formulas (MF):
  - `math_formulas.py` generates the MF dataset as `data/math-formulas.csv`.
- Generate Math Text (MT):
  - `math_text.py` requires the ARQMath package to be available as `arqmath` (see Installation) and generates the MT dataset as `data/math-text.csv`.
  - Due to the long run time of this script (several days), specifically for ARQMath, you can use `generate_math_text_arqmath_asynch` to generate the data in parallel; you need to combine the data together afterwards.
Experimental Results Reproduction
The experiments are split into pre-training mathematical models and evaluating them based on an IR fine-tuning task.
Mathematical Pre-Training
- Install the Mathematical Pre-Training Framework.
- Run `transformer-math-pretraining/scripts/ma.sh` to pre-train mathematical models.
  - Should run on a server with 8 A100 GPUs.
  - Rough time estimate: 12 hours per pre-training used (i.e., 48 hours for MF+MT+NMF+MFR).
Mathematical Evaluation
- Install the Mathematical Evaluation Framework.
- Run `transformer-math-evaluation/scripts/mamut.sh` to compute all fine-tuning results reported in the paper.
  - Copy the pre-trained models to the folder specified in the script.
- Use the methods in `transformer-math-evaluation/src/export/nmf.py` to generate the tables and figures reporting the results.
Implementation Details
- `sympy-random-LaTeX/generator.py` contains the core functionality of MAMUT, implementing the version-generation interface and the falsifying strategies.
  - Internally, the strategies Random and Manual (known from the MAMUT paper) are implemented as a single strategy (`strategy_random_formula`).
  - These can be distinguished based on the provided metadata (`strategy_random_formula` contains a JSON dict; the entry `no_version` is `True` for Manual and `False` for Random).
- The randomized `LatexPrinter` can be found in `sympy-random-LaTeX/sympy/printing/latex.py`.
- The randomization settings can be found in `sympy-random-LaTeX/sympy/settings.py`.
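The Manual/Random distinction described above can be sketched as follows. Only the `no_version` entry is taken from the description; the exact metadata layout in the codebase may contain additional fields:

```python
import json

def strategy_name(metadata_json: str) -> str:
    """Map strategy_random_formula metadata to the paper's strategy name:
    no_version == True -> Manual, False -> Random (per the description above)."""
    meta = json.loads(metadata_json)
    return "Manual" if meta["no_version"] else "Random"

print(strategy_name('{"no_version": true}'))   # Manual
print(strategy_name('{"no_version": false}'))  # Random
```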
CITATION
If you use this code, generated datasets, or published mathematical models, please cite the following paper:
```bibtex
@article{
  drechsel2025mamut,
  title={{MAMUT}: A Novel Framework for Modifying Mathematical Formulas for the Generation of Specialized Datasets for Language Model Training},
  author={Jonathan Drechsel and Anja Reusch and Steffen Herbold},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2025},
  url={https://openreview.net/forum?id=khODmRpQEx}
}
```
Owner
- Name: aieng-lab
- Login: aieng-lab
- Kind: organization
- Website: https://www.fim.uni-passau.de/ai-engineering
- Repositories: 1
- Profile: https://github.com/aieng-lab
GitHub organization of the Chair for AI Engineering of the University of Passau
Citation (CITATION.cff)
```yaml
cff-version: 1.2.0
message: If you use this software, please cite both the article from the preferred-citation and the software itself.
authors:
  - family-names: Drechsel
    given-names: Jonathan
  - family-names: Reusch
    given-names: Anja
  - family-names: Herbold
    given-names: Steffen
title: 'MAMUT: A Novel Framework for Modifying Mathematical Formulas for the Generation of Specialized Datasets for Language Model Training'
version: 1.0.0
url: https://arxiv.org/abs/2502.20855
date-released: '2025-03-03'
preferred-citation:
  authors:
    - family-names: Drechsel
      given-names: Jonathan
    - family-names: Reusch
      given-names: Anja
    - family-names: Herbold
      given-names: Steffen
  title: 'MAMUT: A Novel Framework for Modifying Mathematical Formulas for the Generation of Specialized Datasets for Language Model Training'
  url: https://arxiv.org/abs/2502.20855
  type: article
  year: '2025'
  publisher: arXiv
```
GitHub Events
Total
- Watch event: 6
- Push event: 6
- Public event: 1
Last Year
- Watch event: 6
- Push event: 6
- Public event: 1
Dependencies
- PySocks *
- PyYAML *
- TexSoup *
- aiohappyeyeballs *
- aiohttp *
- aiosignal *
- antlr4-python3-runtime ==4.12
- async-timeout *
- attrs *
- beautifulsoup4 *
- certifi *
- charset-normalizer *
- datasets *
- dill *
- filelock *
- frozendict *
- frozenlist *
- fsspec *
- gdown *
- huggingface-hub *
- humanize *
- idna *
- joblib *
- mpmath *
- multidict *
- multiprocess *
- numpy *
- packaging *
- pandas *
- propcache *
- pyarrow *
- python-dateutil *
- pytz *
- requests *
- scikit-learn *
- scipy *
- six *
- soupsieve *
- sympy *
- threadpoolctl *
- tqdm *
- typing_extensions *
- tzdata *
- urllib3 *
- xxhash *
- yarl *