legalkit-pipeline

Publication pipeline for French legal codes on 🤗 Datasets from LegiFrance with concurrent upload and dynamic README.md.

https://github.com/louisbrulenaudet/legalkit-pipeline

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • ✓
    CITATION.cff file
    Found CITATION.cff file
  • ✓
    codemeta.json file
    Found codemeta.json file
  • ✓
    .zenodo.json file
    Found .zenodo.json file
  • ○
    DOI references
  • ○
    Academic publication links
  • ○
    Academic email domains
  • ○
    Institutional organization owner
  • ○
    JOSS paper metadata
  • ○
    Scientific vocabulary similarity
    Low similarity (11.0%) to scientific vocabulary

Keywords

data datasets huggingface huggingface-datasets legal legaltech legifrance open-source parquet piste-api python
Last synced: 6 months ago

Repository

Publication pipeline for French legal codes on 🤗 Datasets from LegiFrance with concurrent upload and dynamic README.md.

Statistics
  • Stars: 3
  • Watchers: 1
  • Forks: 1
  • Open Issues: 0
  • Releases: 1
Created almost 2 years ago · Last pushed over 1 year ago
Metadata Files
Readme Funding License Citation Security

README.md

LegalKit Pipeline: Open Access to French Legal Codes on 🤗 Datasets

The LegalKit Pipeline project aims to provide open access to French legal codes on the 🤗 Datasets platform, thereby democratizing access to legal information and promoting transparency and understanding of the French legal system. Our mission is to compile and publish a comprehensive collection of French legal codes, spanning civil law, criminal law, and administrative regulations among other areas, to cater to the diverse needs of legal professionals, researchers, students, and enthusiasts alike.

With LegalKit Pipeline, individuals have the opportunity to explore, analyze, and leverage French legal texts for various purposes, empowering them to navigate and interpret the law with ease. By facilitating access to this valuable resource, we aim to foster greater transparency and knowledge accessibility in the legal domain, enabling stakeholders to make informed decisions and advance legal scholarship and practice.

Join us in our commitment to advancing legal transparency and knowledge accessibility through the LegalKit Pipeline project, as we strive to make French legal codes accessible to everyone on the 🤗 Datasets platform.

Inspiration and Ideas

The LegalKit Pipeline project draws inspiration from cutting-edge techniques such as fine-tuning and Retrieval-Augmented Generation (RAG), which are used to build efficient and accurate language models tailored for legal practice.

Tech Stack

Language: Python 3.9+

Installation

Clone the repo:

    git clone https://github.com/louisbrulenaudet/legalkit-pipeline.git

Concurrent reading of the LegalKit

To use all the legal data published on LegalKit, you can use this code snippet:

```python
# -*- coding: utf-8 -*-

import concurrent.futures
import logging
from typing import List, Optional, Union

import datasets
from tqdm.notebook import tqdm


def dataset_loader(
    name: str,
    streaming: bool = True
) -> Optional[Union[datasets.Dataset, datasets.IterableDataset]]:
    """
    Helper function to load a single dataset in parallel.

    Parameters
    ----------
    name : str
        Name of the dataset to be loaded.

    streaming : bool, optional
        Determines if datasets are streamed. Default is True.

    Returns
    -------
    dataset : datasets.Dataset or datasets.IterableDataset or None
        Loaded dataset object, or None if loading failed
        (the error is logged rather than raised).
    """
    try:
        return datasets.load_dataset(
            name,
            split="train",
            streaming=streaming
        )

    except Exception as exc:
        logging.error(f"Error loading dataset {name}: {exc}")

        return None


def load_datasets(
    req: List[str],
    streaming: bool = True
) -> list:
    """
    Downloads datasets specified in a list and creates a list of loaded datasets.

    Parameters
    ----------
    req : list
        A list containing the names of datasets to be downloaded.

    streaming : bool, optional
        Determines if datasets are streamed. Default is True.

    Returns
    -------
    datasets_list : list
        A list containing loaded datasets as per the requested names provided in 'req'.
        Datasets that fail to load are logged and skipped.

    Examples
    --------
    >>> datasets_list = load_datasets(["dataset1", "dataset2"], streaming=False)
    """
    datasets_list = []

    with concurrent.futures.ThreadPoolExecutor() as executor:
        # Forward the streaming flag so it is not silently ignored.
        future_to_dataset = {
            executor.submit(dataset_loader, name, streaming): name for name in req
        }

        for future in tqdm(concurrent.futures.as_completed(future_to_dataset), total=len(req)):
            name = future_to_dataset[future]

            try:
                dataset = future.result()

                # dataset_loader returns None on failure.
                if dataset is not None:
                    datasets_list.append(dataset)

            except Exception as exc:
                logging.error(f"Error processing dataset {name}: {exc}")

    return datasets_list


req = [
    "louisbrulenaudet/code-artisanat",
    "louisbrulenaudet/code-action-sociale-familles",
    # ...
]

datasets_list = load_datasets(
    req=req,
    streaming=True
)

dataset = datasets.concatenate_datasets(
    datasets_list
)
```
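The concurrent loading pattern used in the snippet above can be tried without network access or the `datasets` library. The following is a minimal sketch, not part of the LegalKit Pipeline: `fake_loader` and `load_all` are hypothetical helpers, and `"missing-code"` and `"code-civil"` are made-up names used only for illustration. It shows how `ThreadPoolExecutor` and `as_completed` gather the successful results while a failing name is logged and skipped.

```python
import concurrent.futures
import logging
from typing import Callable, Dict, List


def fake_loader(name: str) -> Dict[str, str]:
    """Hypothetical stand-in for a dataset loader; fails on one name."""
    if name == "missing-code":
        raise ValueError(f"dataset {name!r} not found")
    return {"name": name}


def load_all(
    names: List[str],
    loader: Callable[[str], Dict[str, str]]
) -> List[Dict[str, str]]:
    """Run `loader` on every name in parallel, skipping names that raise."""
    loaded = []

    with concurrent.futures.ThreadPoolExecutor() as executor:
        # Map each future back to its name so errors can be reported clearly.
        future_to_name = {executor.submit(loader, name): name for name in names}

        for future in concurrent.futures.as_completed(future_to_name):
            name = future_to_name[future]

            try:
                loaded.append(future.result())

            except Exception as exc:
                logging.error("Error processing %s: %s", name, exc)

    return loaded


results = load_all(["code-artisanat", "missing-code", "code-civil"], fake_loader)

# Completion order is not guaranteed, so sort before inspecting.
loaded_names = sorted(item["name"] for item in results)
print(loaded_names)
```

Because `as_completed` yields futures in completion order, the returned list does not necessarily follow the order of the input names. If order matters, the results can be re-sorted using the name-to-future mapping.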

Citing this project

If you use this code in your research, please use the following BibTeX entry.

    @misc{louisbrulenaudet2024,
      author = {Louis Brulé Naudet},
      title = {LegalKit Pipeline: Open Access to French Legal Codes on 🤗 Datasets},
      howpublished = {\url{https://github.com/louisbrulenaudet/legalkit-pipeline}},
      year = {2024}
    }

Feedback

If you have any feedback, please reach out at louisbrulenaudet@icloud.com.

Owner

  • Name: Louis Brulé Naudet
  • Login: louisbrulenaudet
  • Kind: user
  • Location: Paris
  • Company: Université Paris-Dauphine (Paris Sciences et Lettres - PSL)

Research in business taxation and development (NLP, LLM, Computer vision...), University Dauphine-PSL 📖 | Backed by the Microsoft for Startups Hub program

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Brulé Naudet"
  given-names: "Louis"
  orcid: "https://orcid.org/0000-0001-9111-4879"
title: "LegalKit Pipeline: Open Access to French Legal Codes on 🤗 Datasets"
version: 1.0.0
date-released: 2024-03-31

GitHub Events

Total
  • Watch event: 3
  • Push event: 1
  • Fork event: 1
Last Year
  • Watch event: 3
  • Push event: 1
  • Fork event: 1

Issues and Pull Requests

Last synced: 11 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0