tax-retrieval-benchmark

An implementation of the TaxRetrievalBenchmark task for the 🤗 Massive Text Embedding Benchmark (MTEB) framework.

https://github.com/louisbrulenaudet/tax-retrieval-benchmark

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • ✓ CITATION.cff file (found)
  • ✓ codemeta.json file (found)
  • ✓ .zenodo.json file (found)
  • ○ DOI references
  • ○ Academic publication links
  • ○ Committers with academic emails
  • ○ Institutional organization owner
  • ○ JOSS paper metadata
  • ○ Scientific vocabulary similarity (low similarity, 14.1%)

Keywords

benchmark droit embeddings fiscal fiscalite information-retrieval mteb rag retrieval retrieval-augmented-generation sbert semantic-search sentence-embeddings sentence-transformers stp tax taxation
Last synced: 6 months ago

Repository

An implementation of the TaxRetrievalBenchmark task for the 🤗 Massive Text Embedding Benchmark (MTEB) framework.

Basic Info
Statistics
  • Stars: 1
  • Watchers: 1
  • Forks: 1
  • Open Issues: 1
  • Releases: 0
Topics
benchmark droit embeddings fiscal fiscalite information-retrieval mteb rag retrieval retrieval-augmented-generation sbert semantic-search sentence-embeddings sentence-transformers stp tax taxation
Created over 1 year ago · Last pushed 10 months ago
Metadata Files
Readme · Funding · License · Code of conduct · Citation

README.md

Massive Text Embedding Benchmark for French Taxation 🤗

In this notebook, we will explore the process of adding a new task to the Massive Text Embedding Benchmark (MTEB). The MTEB is an open-source framework developed to facilitate the evaluation and benchmarking of multilingual and multi-task models across a diverse set of tasks and languages.

The task we will be integrating is the TaxRetrievalBenchmark, a retrieval task focused on retrieving relevant tax articles or content based on provided queries. This task is particularly useful in the legal and financial domains, where accurate and efficient retrieval of relevant information is crucial. To add this task to the MTEB framework, we will follow a structured approach:

  • Understanding the task: We will start by analyzing the TaxRetrievalBenchmark task, its data format, and the evaluation metrics used to assess model performance.
  • Preparing the data: Next, we will preprocess the data from the Hugging Face Hub, converting it to the MTEB format. This step involves organizing the corpus, queries, and relevant-document information into the required data structures (a short sketch of these structures follows this list).
  • Implementing the task class: We will then implement the TaxRetrievalBenchmark class, which inherits from the AbsTaskRetrieval class provided by the MTEB framework. This class will encapsulate the task-specific logic, including data loading, metadata management, and evaluation methods.
  • Integrating with MTEB: Finally, we will integrate the TaxRetrievalBenchmark class into the MTEB framework, allowing it to be used alongside other tasks for multi-task training and evaluation.
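As a rough, non-authoritative sketch of the data-preparation step, the snippet below converts a flat question/article dataset from the Hugging Face Hub into the three dictionaries that MTEB's AbsTaskRetrieval works with: a corpus, a set of queries, and query-to-document relevance judgments, each keyed by evaluation split. The dataset path and column names ("query", "title", "text") are assumptions for illustration, not the exact schema shipped with this repository.

from datasets import load_dataset  # datasets==2.19.1 per requirements.txt

def build_retrieval_split(dataset_path: str, split: str = "test"):
    """Convert a flat (query, article) dataset into the MTEB retrieval format.

    The dataset path and column names are hypothetical; adapt them to the
    actual schema of the benchmark dataset.
    """
    rows = load_dataset(dataset_path, split=split)

    corpus, queries, relevant_docs = {}, {}, {}
    for i, row in enumerate(rows):
        query_id, doc_id = f"q{i}", f"d{i}"
        queries[query_id] = row["query"]
        corpus[doc_id] = {"title": row.get("title", ""), "text": row["text"]}
        # Binary relevance: each query is judged relevant to its paired article.
        relevant_docs[query_id] = {doc_id: 1}

    # AbsTaskRetrieval expects each of these dictionaries keyed by eval split.
    return {split: corpus}, {split: queries}, {split: relevant_docs}

In the task class itself, this logic typically lives in the load_data method, which assigns the three dictionaries to self.corpus, self.queries and self.relevant_docs (and sets self.data_loaded = True), alongside a metadata declaration describing the dataset, evaluation splits, languages and main score.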

By adding the TaxRetrievalBenchmark task to the MTEB framework, we will contribute to the growing collection of diverse tasks, enabling researchers and practitioners to develop and evaluate multilingual and multi-task models more effectively. This notebook will serve as a practical guide for anyone interested in extending the MTEB framework with new tasks, fostering collaboration and advancing the field of natural language processing.
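To make the integration step concrete, here is a hedged usage sketch: once the task class is importable, it can be passed to the standard MTEB runner together with any sentence-transformers model. The import path and the model name below are illustrative choices, not values prescribed by this repository.

from mteb import MTEB  # mteb==1.11.13 per requirements.txt
from sentence_transformers import SentenceTransformer

# Hypothetical import path; import the class from wherever it is defined
# in this repository.
from tax_retrieval_benchmark import TaxRetrievalBenchmark

# Any embedding model works; this multilingual model is only an example.
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

evaluation = MTEB(tasks=[TaxRetrievalBenchmark()])
evaluation.run(model, output_folder="results")  # writes per-task JSON score files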

Citing this project

If you use this code in your research, please use the following BibTeX entry.

BibTeX

@misc{louisbrulenaudet2024,
  author = {Louis Brulé Naudet},
  title = {Massive Text Embedding Benchmark for French Taxation},
  year = {2024}
}

Feedback

If you have any feedback, please reach out at louisbrulenaudet@icloud.com.

Owner

  • Name: Louis Brulé Naudet
  • Login: louisbrulenaudet
  • Kind: user
  • Location: Paris
  • Company: Université Paris-Dauphine (Paris Sciences et Lettres - PSL)

Research in business taxation and development (NLP, LLM, Computer vision...), University Dauphine-PSL 📖 | Backed by the Microsoft for Startups Hub program

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Brulé Naudet"
  given-names: "Louis"
  orcid: "https://orcid.org/0000-0001-9111-4879"
title: "Massive Text Embedding Benchmark for French Taxation"
version: 1.0.0
date-released: 2024-05-23

GitHub Events

Total
  • Pull request event: 2
  • Create event: 1
Last Year
  • Pull request event: 2
  • Create event: 1

Committers

Last synced: 7 months ago

All Time
  • Total Commits: 11
  • Total Committers: 1
  • Avg Commits per committer: 11.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
  • Louis Brulé Naudet (l****t@i****m): 11 commits

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 0
  • Total pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 1
Past Year
  • Issues: 0
  • Pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 1
Top Authors
  • Issue authors: none
  • Pull request authors: dependabot[bot] (3)
Top Labels
  • Issue labels: none
  • Pull request labels: dependencies (3), python (1)

Dependencies

requirements.txt pypi
  • accelerate ==0.30.1
  • datasets ==2.19.1
  • mteb ==1.11.13
  • sentence-transformers ==2.7.0
  • tqdm ==4.66.4
  • transformers ==4.41.1