tax-retrieval-benchmark
An implementation of the TaxRetrievalBenchmark task for the 🤗 Massive Text Embedding Benchmark (MTEB) framework.
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (14.1%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: louisbrulenaudet
- License: apache-2.0
- Language: Jupyter Notebook
- Default Branch: main
- Homepage: https://huggingface.co/louisbrulenaudet
- Size: 85 KB
Statistics
- Stars: 1
- Watchers: 1
- Forks: 1
- Open Issues: 1
- Releases: 0
Metadata Files
README.md
Massive Text Embedding Benchmark for French Taxation 🤗
This notebook walks through adding a new task to the Massive Text Embedding Benchmark (MTEB), an open-source framework for evaluating and benchmarking text embedding models across a diverse set of tasks and languages.
The task we will be integrating is the TaxRetrievalBenchmark, a retrieval task focused on retrieving relevant tax articles or content based on provided queries. This task is particularly useful in the legal and financial domains, where accurate and efficient retrieval of relevant information is crucial. To add this task to the MTEB framework, we will follow a structured approach:
- Understanding the task: We will start by analyzing the TaxRetrievalBenchmark task, its data format, and the evaluation metrics used to assess model performance.
- Preparing the data: Next, we will load the data from the Hugging Face Hub and convert it to the MTEB format. This step organizes the corpus, the queries, and the relevance judgments that link each query to its documents into the required data structures.
- Implementing the task class: We will then implement the TaxRetrievalBenchmark class, which inherits from the AbsTaskRetrieval class provided by the MTEB framework. This class will encapsulate the task-specific logic, including data loading, metadata management, and evaluation methods.
- Integrating with MTEB: Finally, we will integrate the TaxRetrievalBenchmark class into the MTEB framework, so that models can be evaluated on it alongside the other tasks.
By adding the TaxRetrievalBenchmark task to the MTEB framework, we will contribute to the growing collection of diverse tasks, enabling researchers and practitioners to develop and evaluate multilingual and multi-task models more effectively. This notebook will serve as a practical guide for anyone interested in extending the MTEB framework with new tasks, fostering collaboration and advancing the field of natural language processing.
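The data preparation step described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the column names `query` and `document`, the `test` split name, and the helper `to_retrieval_format` are all assumptions introduced here. It builds the three nested dictionaries (corpus, queries, relevance judgments) that MTEB retrieval tasks conventionally expose.

```python
from typing import Dict, Iterable, Mapping, Tuple

Corpus = Dict[str, Dict[str, Dict[str, str]]]        # split -> doc_id -> {"title", "text"}
Queries = Dict[str, Dict[str, str]]                  # split -> query_id -> query text
RelevantDocs = Dict[str, Dict[str, Dict[str, int]]]  # split -> query_id -> {doc_id: score}


def to_retrieval_format(
    rows: Iterable[Mapping[str, str]],
    split: str = "test",
    query_column: str = "query",        # hypothetical column name
    document_column: str = "document",  # hypothetical column name
) -> Tuple[Corpus, Queries, RelevantDocs]:
    """Convert (query, document) pairs into corpus / queries / relevance dicts."""
    corpus: Corpus = {split: {}}
    queries: Queries = {split: {}}
    relevant_docs: RelevantDocs = {split: {}}
    doc_ids: Dict[str, str] = {}  # document text -> id, so identical texts share one entry

    for index, row in enumerate(rows):
        query_id = f"q{index}"
        text = row[document_column]
        # Reuse the id of an already-seen document, otherwise allocate the next one.
        doc_id = doc_ids.setdefault(text, f"d{len(doc_ids)}")

        corpus[split][doc_id] = {"title": "", "text": text}
        queries[split][query_id] = row[query_column]
        # Each query is marked as relevant (score 1) to its paired document only.
        relevant_docs[split][query_id] = {doc_id: 1}

    return corpus, queries, relevant_docs
```

In a task class that inherits from AbsTaskRetrieval, the three returned dictionaries would typically be assigned to the `corpus`, `queries`, and `relevant_docs` attributes when the data is loaded. The exact attribute and method names depend on the installed MTEB version, so check them against that version.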
Citing this project
If you use this code in your research, please use the following BibTeX entry.
BibTeX
@misc{louisbrulenaudet2024,
author = {Louis Brulé Naudet},
title = {Massive Text Embedding Benchmark for French Taxation},
year = {2024}
}
Feedback
If you have any feedback, please reach out at louisbrulenaudet@icloud.com.
Owner
- Name: Louis Brulé Naudet
- Login: louisbrulenaudet
- Kind: user
- Location: Paris
- Company: Université Paris-Dauphine (Paris Sciences et Lettres - PSL)
- Website: https://louisbrulenaudet.com
- Twitter: BruleNaudet
- Repositories: 81
- Profile: https://github.com/louisbrulenaudet
Research in business taxation and development (NLP, LLM, Computer vision...), University Dauphine-PSL 📖 | Backed by the Microsoft for Startups Hub program
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Brulé Naudet"
    given-names: "Louis"
    orcid: "https://orcid.org/0000-0001-9111-4879"
title: "Massive Text Embedding Benchmark for French Taxation"
version: 1.0.0
date-released: 2024-05-23
GitHub Events
Total
- Pull request event: 2
- Create event: 1
Last Year
- Pull request event: 2
- Create event: 1
Committers
Last synced: 7 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Louis Brulé Naudet | l****t@i****m | 11 |
Issues and Pull Requests
Last synced: 7 months ago
All Time
- Total issues: 0
- Total pull requests: 1
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 1
Past Year
- Issues: 0
- Pull requests: 1
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 1
Top Authors
Issue Authors
Pull Request Authors
- dependabot[bot] (3)
Dependencies
- accelerate==0.30.1
- datasets==2.19.1
- mteb==1.11.13
- sentence-transformers==2.7.0
- tqdm==4.66.4
- transformers==4.41.1