https://github.com/artiks12/datasetpreperation


Repository

Basic Info

Host: GitHub
Owner: artiks12
License: mit
Language: Python
Default Branch: main
Size: 8.79 KB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Archived

Created about 1 year ago · Last pushed about 1 year ago


DatasetPreparation

This repository creates datasets for fine-tuning models and instructions for model testing. It is part of the master's thesis "Evaluation and Adaptation of Large Language Models for Question-Answering on Legislation", written at the University of Latvia.

Data used

The base data is available here: http://hdl.handle.net/20.500.12574/130. You need to sign in to download and use this dataset.

Python and packages

The scripts were used with Python 3.10, so it is recommended to use that version of Python. You also need to install the PyQuery and bs4 packages.
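Before running the scripts, it can help to confirm that the environment matches the requirements above. The following is a minimal sketch, not part of the repository: the helper name `missing_packages` is hypothetical, and it assumes the two packages are importable under the module names `pyquery` and `bs4`.

```python
import sys
from importlib.util import find_spec


def missing_packages(names):
    """Return the module names from `names` that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # The README recommends Python 3.10; other versions are only flagged, not rejected.
    if sys.version_info[:2] != (3, 10):
        print(
            f"Note: running Python {sys.version_info.major}.{sys.version_info.minor}, "
            "but the scripts were used with 3.10."
        )

    # Assumed import names for the PyQuery and bs4 packages.
    missing = missing_packages(["pyquery", "bs4"])
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages were found.")
```

The check only looks the modules up and does not import them, so it is cheap to run before either script.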

How to get instructions for testing

1) Put the test data in the data folder.
2) Run the GetModelTestInstructions.py script.
3) The instructions will be saved in the instructions folder.

These instructions are used to generate model responses for testing. The response generation repository is here: https://github.com/artiks12/ResponseGeneration

How to prepare datasets for fine-tuning

1) Put the training and validation data in the data folder. Make sure to create the folders "Training" and "Validation" in the data folder and store the necessary files there.
2) Run the PrepareDatasets.py script.
3) The datasets will be stored in the datasets folder.
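The folder layout from step 1 can be created with a short helper. This is a sketch under stated assumptions, not code from the repository: the function name `ensure_layout` is hypothetical, and creating the datasets folder up front is an assumption, since PrepareDatasets.py may create it on its own.

```python
from pathlib import Path


def ensure_layout(root):
    """Create data/Training, data/Validation and datasets under `root` if they are missing."""
    root = Path(root)
    created = []
    # Folder names follow the README; "datasets" is created here only as a precaution.
    for relative in ("data/Training", "data/Validation", "datasets"):
        folder = root / relative
        folder.mkdir(parents=True, exist_ok=True)  # safe to call repeatedly
        created.append(folder)
    return created


if __name__ == "__main__":
    for folder in ensure_layout("."):
        print("Ready:", folder)
```

Run it from the repository root, then copy the training and validation files into the matching folders before running PrepareDatasets.py.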

These datasets are used in the model fine-tuning process. The model fine-tuning repository is here: https://github.com/artiks12/ModelFineTuningPipeline

Version 2

There are two problems with the GetModelTestInstructions.py script:
- It does not filter out duplicates.
- It might crash if the link is incorrect or the HTML DOM is malformed.

To avoid these issues, use the GetModelTestInstructions_v2.py script. The original script is the one that was used for the thesis.
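The two problems can be illustrated in general terms. The sketch below is not taken from GetModelTestInstructions_v2.py and may differ from how that script works; the names `deduplicate` and `process_all` are hypothetical, and `process` stands in for whatever function fetches a link and parses its HTML.

```python
def deduplicate(records, key=lambda record: record):
    """Drop repeated records, keeping the first occurrence and the original order."""
    seen = set()
    unique = []
    for record in records:
        marker = key(record)  # the key must be hashable, e.g. a string or tuple
        if marker not in seen:
            seen.add(marker)
            unique.append(record)
    return unique


def process_all(items, process):
    """Apply `process` to every item, collecting failures instead of crashing."""
    results, failures = [], []
    for item in items:
        try:
            results.append(process(item))
        except Exception as error:  # e.g. an incorrect link or malformed HTML
            failures.append((item, error))
    return results, failures


if __name__ == "__main__":
    links = ["a", "b", "a"]  # placeholder values, not real data
    print(deduplicate(links))
    results, failures = process_all(links, str.upper)
    print(results, failures)
```

Collecting failures in a list rather than stopping at the first one makes it possible to review the problematic links after the run.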

Owner

Login: artiks12
Kind: user

Repositories: 2
Profile: https://github.com/artiks12

GitHub Events

Total

Push event: 5

Last Year

Push event: 5
