tax-retrieval-benchmark

An implementation of the TaxRetrievalBenchmark task for the 🤗 Massive Text Embedding Benchmark (MTEB) framework.

https://github.com/louisbrulenaudet/tax-retrieval-benchmark

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • ✓ CITATION.cff file (found)
  • ✓ codemeta.json file (found)
  • ✓ .zenodo.json file (found)
  • ○ DOI references
  • ○ Academic publication links
  • ○ Committers with academic emails
  • ○ Institutional organization owner
  • ○ JOSS paper metadata
  • ○ Scientific vocabulary similarity (low similarity, 14.1%)

Keywords

benchmark droit embeddings fiscal fiscalite information-retrieval mteb rag retrieval retrieval-augmented-generation sbert semantic-search sentence-embeddings sentence-transformers stp tax taxation
Last synced: 6 months ago

Repository

An implementation of the TaxRetrievalBenchmark task for the 🤗 Massive Text Embedding Benchmark (MTEB) framework.

Basic Info
Statistics
  • Stars: 1
  • Watchers: 1
  • Forks: 1
  • Open Issues: 1
  • Releases: 0
Topics
benchmark droit embeddings fiscal fiscalite information-retrieval mteb rag retrieval retrieval-augmented-generation sbert semantic-search sentence-embeddings sentence-transformers stp tax taxation
Created over 1 year ago · Last pushed 10 months ago
Metadata Files
Readme · Funding · License · Code of conduct · Citation

README.md

Massive Text Embedding Benchmark for French Taxation 🤗

In this notebook, we will explore the process of adding a new task to the Massive Text Embedding Benchmark (MTEB). The MTEB is an open-source framework developed to facilitate the evaluation and benchmarking of multilingual and multi-task models across a diverse set of tasks and languages.

The task we will be integrating is the TaxRetrievalBenchmark, a retrieval task focused on retrieving relevant tax articles or content based on provided queries. This task is particularly useful in the legal and financial domains, where accurate and efficient retrieval of relevant information is crucial. To add this task to the MTEB framework, we will follow a structured approach:

  • Understanding the task: We will start by analyzing the TaxRetrievalBenchmark task, its data format, and the evaluation metrics used to assess model performance.
  • Preparing the data: Next, we will preprocess the data from the Hugging Face Hub, converting it to the MTEB format. This step involves organizing the corpus, queries, and relevant-document information into the required data structures (a short sketch of these structures follows this list).
  • Implementing the task class: We will then implement the TaxRetrievalBenchmark class, which inherits from the AbsTaskRetrieval class provided by the MTEB framework. This class will encapsulate the task-specific logic, including data loading, metadata management, and evaluation methods.
  • Integrating with MTEB: Finally, we will integrate the TaxRetrievalBenchmark class into the MTEB framework, allowing it to be used alongside other tasks for multi-task training and evaluation.
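As a rough, non-authoritative sketch of the data-preparation step, the snippet below converts a flat question/article dataset from the Hugging Face Hub into the three dictionaries that MTEB's AbsTaskRetrieval works with: a corpus, a set of queries, and query-to-document relevance judgments, each keyed by evaluation split. The dataset path and column names ("query", "title", "text") are assumptions for illustration, not the exact schema shipped with this repository.

from datasets import load_dataset  # datasets==2.19.1 per requirements.txt

def build_retrieval_split(dataset_path: str, split: str = "test"):
    """Convert a flat (query, article) dataset into the MTEB retrieval format.

    The dataset path and column names are hypothetical; adapt them to the
    actual schema of the benchmark dataset.
    """
    rows = load_dataset(dataset_path, split=split)

    corpus, queries, relevant_docs = {}, {}, {}
    for i, row in enumerate(rows):
        query_id, doc_id = f"q{i}", f"d{i}"
        queries[query_id] = row["query"]
        corpus[doc_id] = {"title": row.get("title", ""), "text": row["text"]}
        # Binary relevance: each query is judged relevant to its paired article.
        relevant_docs[query_id] = {doc_id: 1}

    # AbsTaskRetrieval expects each of these dictionaries keyed by eval split.
    return {split: corpus}, {split: queries}, {split: relevant_docs}

In the task class itself, this logic typically lives in the load_data method, which assigns the three dictionaries to self.corpus, self.queries and self.relevant_docs (and sets self.data_loaded = True), alongside a metadata declaration describing the dataset, evaluation splits, languages and main score.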

By adding the TaxRetrievalBenchmark task to the MTEB framework, we will contribute to the growing collection of diverse tasks, enabling researchers and practitioners to develop and evaluate multilingual and multi-task models more effectively. This notebook will serve as a practical guide for anyone interested in extending the MTEB framework with new tasks, fostering collaboration and advancing the field of natural language processing.
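To make the integration step concrete, here is a hedged usage sketch: once the task class is importable, it can be passed to the standard MTEB runner together with any sentence-transformers model. The import path and the model name below are illustrative choices, not values prescribed by this repository.

from mteb import MTEB  # mteb==1.11.13 per requirements.txt
from sentence_transformers import SentenceTransformer

# Hypothetical import path; import the class from wherever it is defined
# in this repository.
from tax_retrieval_benchmark import TaxRetrievalBenchmark

# Any embedding model works; this multilingual model is only an example.
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

evaluation = MTEB(tasks=[TaxRetrievalBenchmark()])
evaluation.run(model, output_folder="results")  # writes per-task JSON score files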

Citing this project

If you use this code in your research, please use the following BibTeX entry.

BibTeX

@misc{louisbrulenaudet2024,
  author = {Louis Brulé Naudet},
  title = {Massive Text Embedding Benchmark for French Taxation},
  year = {2024}
}

Feedback

If you have any feedback, please reach out at louisbrulenaudet@icloud.com.

Owner

  • Name: Louis Brulé Naudet
  • Login: louisbrulenaudet
  • Kind: user
  • Location: Paris
  • Company: Université Paris-Dauphine (Paris Sciences et Lettres - PSL)

Research in business taxation and development (NLP, LLM, Computer vision...), University Dauphine-PSL 📖 | Backed by the Microsoft for Startups Hub program

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Brulé Naudet"
  given-names: "Louis"
  orcid: "https://orcid.org/0000-0001-9111-4879"
title: "Massive Text Embedding Benchmark for French Taxation"
version: 1.0.0
date-released: 2024-05-23

GitHub Events

Total
  • Pull request event: 2
  • Create event: 1
Last Year
  • Pull request event: 2
  • Create event: 1

Committers

Last synced: 7 months ago

All Time
  • Total Commits: 11
  • Total Committers: 1
  • Avg Commits per committer: 11.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
  • Louis Brulé Naudet (l****t@i****m): 11 commits

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 0
  • Total pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 1
Past Year
  • Issues: 0
  • Pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 1
Top Authors
  • Issue authors: none
  • Pull request authors: dependabot[bot] (3)
Top Labels
  • Issue labels: none
  • Pull request labels: dependencies (3), python (1)

Dependencies

requirements.txt pypi
  • accelerate ==0.30.1
  • datasets ==2.19.1
  • mteb ==1.11.13
  • sentence-transformers ==2.7.0
  • tqdm ==4.66.4
  • transformers ==4.41.1