math-mutator
Math Mutator (MAMUT): A framework for generating specialized math datasets for language model training.
Science Score: 62.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: links to arxiv.org
- ○ Academic email domains
- ✓ Institutional organization owner: organization aieng-lab has institutional domain (www.fim.uni-passau.de)
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (5.8%) to scientific vocabulary
Repository
Basic Info
Statistics
- Stars: 3
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
MAMUT: A Novel Framework for Modifying Mathematical Formulas for the Generation of Specialized Datasets for Language Model Training
Jonathan Drechsel, Anja Reusch, Steffen Herbold
This repository contains the official source code for the dataset generation of MAMUT: A Novel Framework for Modifying Mathematical Formulas for the Generation of Specialized Datasets for Language Model Training, published in Transactions on Machine Learning Research 2025.
It contains the code for generating the datasets, including preprocessing of the original AMPS and ARQMath datasets, formula filtering, extraction, validation, and more. The framework responsible for generating equivalent and falsified versions of mathematical formulas is available in this SymPy fork. The generated datasets are available on Hugging Face:
| Dataset | Description | Example(s) |
|---------|-------------|------------|
| Math Formulas (MF) | Mathematical formulas with high variance | $x\cdot x^N = x^{1 + N}$ <br> $(a - b)/(ba) = -1/a + \frac{1}{b}$ |
| Math Text (MT) | Texts combining natural language and mathematical formulas | Identify $\sum_{n=0}^\infty (y_n - L)$ where $y_{n + 1} = (1 + y_n)^{\frac13}$ and $L^3 = L + 1$. Let $y > 2$ and let $f(y) = (1 + y)^{\frac13}$. Let $f^n(y)$ be the $n$th iterate of $f(y)$. Let $L$ be ... |
| Named Math Formulas (NMF) | High-variance formulas of famous named identities | Name: Pythagorean Thm., Formula: $c^2=b^2+a^2$ <br> Name: Binomial Formula, Formula: $(\alpha + z)^2 = z^2 + \alpha^2 + 2\cdot \alpha \cdot z$ |
| Math Formula Retrieval (MFR) | Pairs of formulas with labels indicating identical or different mathematical concepts | Formula 1: $1\cdot 2\cdot 3 \cdot \ldots \cdot n = n!$, Formula 2: $m!\coloneqq \prod_{k=1}^m k$, Label: Equivalent <br> Formula 1: $a^2+b^2=c^2$, Formula 2: $a^2+2^b=c^2$, Label: Not Equivalent |
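The Equivalent/Not Equivalent labels in the MFR examples above can be illustrated with a small SymPy check: two formulas count as equivalent when their difference simplifies to zero. This is a hypothetical helper for illustration, not part of the MAMUT codebase, which uses more elaborate transformation-based generation.

```python
import sympy as sp

def label_pair(lhs_expr: str, rhs_expr: str) -> str:
    """Label a formula pair in the spirit of MFR: 'Equivalent' if both
    expressions agree symbolically, 'Not Equivalent' otherwise.
    Hypothetical helper for illustration only."""
    lhs = sp.sympify(lhs_expr)
    rhs = sp.sympify(rhs_expr)
    return "Equivalent" if sp.simplify(lhs - rhs) == 0 else "Not Equivalent"

# Pythagorean identity written two ways
print(label_pair("a**2 + b**2 - c**2", "b**2 + a**2 - c**2"))  # Equivalent
# A falsified version with the exponent moved into the base
print(label_pair("a**2 + b**2 - c**2", "a**2 + 2**b - c**2"))  # Not Equivalent
```

Note that symbolic simplification is only a heuristic: `simplify` may fail to reduce some genuinely equivalent pairs to zero, which is one reason dataset-generation frameworks construct equivalent versions by applying known-safe transformations instead.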
Quick Links
- Mathematical Pre-Training Framework
- Mathematical Evaluation Framework
- Randomized LaTeX SymPy Fork
- Mathematical Datasets
- ddrg/math_formulas: Math Formulas (MF)
- ddrg/math_text: Math Text (MT)
- ddrg/named_math_formulas: Named Math Formulas (NMF)
- ddrg/math_formula_retrieval: Math Formula Retrieval (MFR)
- Mathematical Models generated based on MAMUT-enhanced data:
- aieng-lab/bert-base-cased-mamut: based on bert-base-cased
- aieng-lab/math_pretrained_bert-mamut: based on AnReu/math_pretrained_bert (this MAMUT model performs best in our evaluation)
- aieng-lab/MathBERT-mamut: based on tbs17/MathBERT
- ddrg/math_structured_deberta: further mathematical pre-training based on microsoft/deberta-v3-base (not published as part of the MAMUT paper, but trained with the same data and framework)
Install
Prerequisites
Installation Steps
1. Clone the repository:

   ```bash
   git clone https://github.com/aieng-lab/math-mutator
   cd math-mutator
   ```
2. Create a Conda environment:

   ```bash
   conda create --name mamut python=3.10
   conda activate mamut
   conda install pip
   pip install -r requirements.txt
   ```
3. Install aieng-lab/sympy-random-LaTeX:

   ```bash
   cd ..  # go back to the root directory
   git clone https://github.com/aieng-lab/sympy-random-LaTeX.git
   cd sympy-random-LaTeX
   pip install -r requirements.txt
   pip install -e .  # install this SymPy fork in editable mode (alternative: add the sympy-random-LaTeX path to the PYTHONPATH)
   cd ..  # go back to the root directory
   ```
4. Clone ARQMathCode:

   ```bash
   git clone https://github.com/ARQMath/ARQMathCode.git
   ```
5. Add ARQMathCode to the PYTHONPATH:

   **Windows**

   Add the ARQMathCode directory to the system's PYTHONPATH:

   ```bash
   set PYTHONPATH=%PYTHONPATH%;/path/to/ARQMathCode
   ```

   To make this permanent, edit the Environment Variables in the system settings.

   **Linux/macOS**

   Append the path to your shell configuration file (e.g., ~/.bashrc, ~/.bash_profile, ~/.zshrc):

   ```bash
   export PYTHONPATH="$PYTHONPATH:/path/to/ARQMathCode"
   source ~/.bashrc  # or ~/.bash_profile, ~/.zshrc
   ```
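   As an alternative to setting PYTHONPATH globally, the checkout can be made importable at runtime. This is a generic Python idiom rather than anything MAMUT-specific, and the path below is a placeholder:

   ```python
   import sys
   from pathlib import Path

   # Placeholder location of the cloned ARQMathCode checkout
   ARQMATH_DIR = Path("/path/to/ARQMathCode")

   # Prepend the checkout so its modules (e.g. post_reader_record)
   # become importable in this process only
   if str(ARQMATH_DIR) not in sys.path:
       sys.path.insert(0, str(ARQMATH_DIR))

   print(str(ARQMATH_DIR) in sys.path)  # True
   ```

   This only affects the current interpreter session, so scripts launched separately still need the environment-variable setup above.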
6. Verification [Optional]

   ```bash
   python -c "import sympy; import post_reader_record; print('All packages are installed correctly')"
   ```
7. Setup for Experiments [Optional]
See below for the setup of the experiments (Mathematical Pretraining and Evaluation).
Data Generation
- Download the original data (AMPS and ARQMath):
  - `download.py` downloads the AMPS and ARQMath data into `data/amps` and `data/arqmath`, respectively.
- Generate Named Math Formulas (NMF):
  - `named_math_formulas.py` generates the NMF dataset into `data/nmf` as a `datasets.Dataset`.
  - Intermediate results and a more detailed dataset can be found in `data/nmf.csv` (including an entry `stats` in `json` format containing the applied transformation steps).
- Generate Math Formula Retrieval (MFR):
  - `math_formula_retrieval.py` generates the MFR dataset as `data/mfr.csv`, based on `data/nmf.csv`.
- Generate Math Formulas (MF):
  - `math_formulas.py` generates the MF dataset as `data/math-formulas.csv`.
- Generate Math Text (MT):
  - `math_text.py` requires the ARQMath package to be available as `arqmath` (see Installation) and generates the MT dataset as `data/math-text.csv`.
  - Due to the long run time of this script (several days), specifically for ARQMath, you can use `generate_math_text_arqmath_asynch` to generate the data in parallel; you need to combine the data together afterwards.
Experimental Results Reproduction
The experiments are split into pre-training mathematical models and evaluating them based on an IR fine-tuning task.
Mathematical Pre-Training
- Install the Mathematical Pre-Training Framework.
- Run `transformer-math-pretraining/scripts/ma.sh` to pre-train mathematical models.
  - Should run on a server with 8 A100 GPUs.
  - Rough time estimate: 12 hours per pre-training used (i.e., 48 hours for MF+MT+NMF+MFR).
Mathematical Evaluation
- Install the Mathematical Evaluation Framework.
- Run `transformer-math-evaluation/scripts/mamut.sh` to compute all fine-tuning results reported in the paper.
  - Copy the pre-trained models to the folder specified in the script.
- Use the methods in `transformer-math-evaluation/src/export/nmf.py` to generate the tables and figures reporting the results.
Implementation Details
- `sympy-random-LaTeX/generator.py` contains the core functionality of MAMUT, implementing the version-generation interface and the falsifying strategies.
  - Internally, the strategies Random and Manual (known from the MAMUT paper) are implemented as a single strategy (`strategy_random_formula`).
  - These can be distinguished based on the provided metadata (`strategy_random_formula` contains a JSON dict; the entry `no_version` is `True` for Manual and `False` for Random).
- The randomized `LatexPrinter` can be found in `sympy-random-LaTeX/sympy/printing/latex.py`.
- The randomization settings can be found in `sympy-random-LaTeX/sympy/settings.py`.
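The Manual/Random distinction described above can be sketched as follows. Only the `no_version` entry is taken from the description; the exact metadata layout in the codebase may contain additional fields:

```python
import json

def strategy_name(metadata_json: str) -> str:
    """Map strategy_random_formula metadata to the paper's strategy name:
    no_version == True -> Manual, False -> Random (per the description above)."""
    meta = json.loads(metadata_json)
    return "Manual" if meta["no_version"] else "Random"

print(strategy_name('{"no_version": true}'))   # Manual
print(strategy_name('{"no_version": false}'))  # Random
```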
CITATION
If you use this code, generated datasets, or published mathematical models, please cite the following paper:
```bibtex
@article{
  drechsel2025mamut,
  title={{MAMUT}: A Novel Framework for Modifying Mathematical Formulas for the Generation of Specialized Datasets for Language Model Training},
  author={Jonathan Drechsel and Anja Reusch and Steffen Herbold},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2025},
  url={https://openreview.net/forum?id=khODmRpQEx}
}
```
Owner
- Name: aieng-lab
- Login: aieng-lab
- Kind: organization
- Website: https://www.fim.uni-passau.de/ai-engineering
- Repositories: 1
- Profile: https://github.com/aieng-lab
GitHub organization of the Chair for AI Engineering of the University of Passau
Citation (CITATION.cff)
```yaml
cff-version: 1.2.0
message: If you use this software, please cite both the article from the preferred-citation and the software itself.
authors:
  - family-names: Drechsel
    given-names: Jonathan
  - family-names: Reusch
    given-names: Anja
  - family-names: Herbold
    given-names: Steffen
title: 'MAMUT: A Novel Framework for Modifying Mathematical Formulas for the Generation of Specialized Datasets for Language Model Training'
version: 1.0.0
url: https://arxiv.org/abs/2502.20855
date-released: '2025-03-03'
preferred-citation:
  authors:
    - family-names: Drechsel
      given-names: Jonathan
    - family-names: Reusch
      given-names: Anja
    - family-names: Herbold
      given-names: Steffen
  title: 'MAMUT: A Novel Framework for Modifying Mathematical Formulas for the Generation of Specialized Datasets for Language Model Training'
  url: https://arxiv.org/abs/2502.20855
  type: article
  year: '2025'
  publisher: arXiv
```
GitHub Events
Total
- Watch event: 6
- Push event: 6
- Public event: 1
Last Year
- Watch event: 6
- Push event: 6
- Public event: 1
Dependencies
- PySocks *
- PyYAML *
- TexSoup *
- aiohappyeyeballs *
- aiohttp *
- aiosignal *
- antlr4-python3-runtime ==4.12
- async-timeout *
- attrs *
- beautifulsoup4 *
- certifi *
- charset-normalizer *
- datasets *
- dill *
- filelock *
- frozendict *
- frozenlist *
- fsspec *
- gdown *
- huggingface-hub *
- humanize *
- idna *
- joblib *
- mpmath *
- multidict *
- multiprocess *
- numpy *
- packaging *
- pandas *
- propcache *
- pyarrow *
- python-dateutil *
- pytz *
- requests *
- scikit-learn *
- scipy *
- six *
- soupsieve *
- sympy *
- threadpoolctl *
- tqdm *
- typing_extensions *
- tzdata *
- urllib3 *
- xxhash *
- yarl *