qlora_semantic_extraction

This project generates synthetic data for fine-tuning LLMs. The data is mainly intended for preparing smaller, fine-tuned models that extract relations and entities from text faster and at lower cost.

https://github.com/dehydratedwater/qlora_semantic_extraction

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: found
✓ codemeta.json file: found
✓ .zenodo.json file: found
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (11.1%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: DehydratedWater
License: apache-2.0
Language: Jupyter Notebook
Default Branch: main
Homepage:
Size: 4.25 MB

Statistics

Stars: 3
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created over 2 years ago · Last pushed over 2 years ago

Metadata Files

Readme Citation

readme.md

Synthetic Data Creation Tool for Model Refinement

Goal of This Project

This repository is part of a larger project. The project aims to build a system that extracts entities and the semantic relations between them from text, and uses them for retrieval-based question answering. Maintaining a graph of all semantic relations can make retrieval not only precise but also deep, because the graph provides a way to reason about second-, third-, and higher-order causes and effects.

This Repository

This part of the project generates synthetic data for fine-tuning LLMs. The data is mainly intended for preparing smaller, fine-tuned models that extract relations and entities from text faster and at lower cost.

Hugging Face

The dataset generated with this repository, as well as a backup of the full PostgreSQL database, is available on Hugging Face at DehydratedWater42/semantic_relations_extraction.

Generated Data

Generation Process

This data was generated based on the datasets/scientific_papers dataset. This dataset contains a list of scientific articles with separate abstracts and lists of contents. Here is the synthetic data generation overview:

All the abstracts and lists of contents were inserted into the database.
The main content of every article was split into overlapping segments of 1k LLaMA tokens with a 200-token overlap.
10k of the abstracts + lists of contents were summarized by LLaMA 13b.
Generated summaries + split text segments were transformed by LLaMA 13b into unprocessed JSONs.
All generated JSONs were validated and cleaned up.
Validated JSONs were reformatted into datasets that may be used for fine-tuning.
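The splitting step above can be sketched as a sliding window over token ids. This is a minimal illustration, not the repository's code: the function name `split_with_overlap` is hypothetical, and the token ids are stand-ins for the output of a real LLaMA tokenizer.

```python
from typing import List, Sequence


def split_with_overlap(
    tokens: Sequence[int], size: int = 1000, overlap: int = 200
) -> List[List[int]]:
    """Split tokens into windows of `size` tokens; neighbours share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # how far the window advances each time
    segments: List[List[int]] = []
    for start in range(0, len(tokens), step):
        segments.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window already reaches the end of the text
    return segments


# Stand-in for tokenizer output: 2,500 fake token ids (an assumption for the demo).
fake_tokens = list(range(2500))
segments = split_with_overlap(fake_tokens)
print([len(segment) for segment in segments])
```

With a 1,000-token window and a 200-token overlap, the window advances 800 tokens at a time, so the last 200 tokens of each segment reappear at the start of the next one.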

Example of output data

```json
{
  "section_description": "The article discusses the current reversal phenomenon in a classical deterministic ratchet system. The authors investigate the relationship between current and bifurcation diagrams, focusing on the dynamics of an ensemble of particles. They challenge Mateos' claim that current reversals occur only with bifurcations and present evidence for current reversals without bifurcations. Additionally, they show that bifurcations can occur without current reversals. The study highlights the importance of considering the characteristics of the ensemble in understanding the behavior of the system. The authors provide numerical evidence to support their claims and suggest that correlating abrupt changes in the current with bifurcations is more appropriate than focusing solely on current reversals.",
  "list_of_entities": [
    "reversals", "mateos", "figures", "rules", "current_reversal", "ensemble",
    "bifurcation", "jumps", "thumb", "spikes", "current", "particles",
    "open_question", "behavior", "heuristics", "direction", "chaotic", "parameter"
  ],
  "relations": [
    {
      "description": "bifurcations in single - trajectory behavior often corresponds to sudden spikes or jumps in the current for an ensemble in the same system",
      "source_entities": ["bifurcation"],
      "target_entities": ["current"]
    },
    {
      "description": "current reversals are a special case of this",
      "source_entities": ["current"],
      "target_entities": ["bifurcation"]
    },
    {
      "description": "not all spikes or jumps correspond to a bifurcation",
      "source_entities": ["spikes"],
      "target_entities": ["bifurcation"]
    },
    {
      "description": "the open question is clearly to figure out if the reason for when these rules are violated or are valid can be made more concrete",
      "source_entities": ["current"],
      "target_entities": ["open_question"]
    }
  ]
}
```
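To illustrate how relations like these could support reasoning about second- and third-order effects, here is a small sketch that loads them into a directed graph and walks it breadth-first. The helper names `build_graph` and `reachable_within` are hypothetical and are not taken from the repository; the relations are the ones from the example above with descriptions omitted.

```python
from collections import defaultdict, deque
from typing import Dict, List, Set


def build_graph(relations: List[dict]) -> Dict[str, Set[str]]:
    """Add one directed edge per (source entity, target entity) pair."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for relation in relations:
        for source in relation["source_entities"]:
            for target in relation["target_entities"]:
                graph[source].add(target)
    return graph


def reachable_within(graph: Dict[str, Set[str]], start: str, max_hops: int) -> Dict[str, int]:
    """Breadth-first search: map each reachable entity to its hop distance from `start`."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand beyond the requested depth
        for neighbour in graph.get(node, ()):
            if neighbour not in hops:
                hops[neighbour] = hops[node] + 1
                queue.append(neighbour)
    del hops[start]
    return hops


relations = [
    {"source_entities": ["bifurcation"], "target_entities": ["current"]},
    {"source_entities": ["current"], "target_entities": ["bifurcation"]},
    {"source_entities": ["spikes"], "target_entities": ["bifurcation"]},
    {"source_entities": ["current"], "target_entities": ["open_question"]},
]
graph = build_graph(relations)
print(reachable_within(graph, "spikes", 1))  # direct relations only
print(reachable_within(graph, "spikes", 3))  # up to third-order relations
```

Raising `max_hops` widens the retrieval from directly related entities to entities connected through intermediate ones.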

Expected output JSON schema

```json
{
  "$schema": "extractionschema.json",
  "type": "object",
  "properties": {
    "section_description": { "type": "string" },
    "list_of_entities": { "type": "array", "items": { "type": "string" } },
    "relations": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": { "type": "string" },
          "source_entities": { "type": "array", "items": { "type": "string" } },
          "target_entities": { "type": "array", "items": { "type": "string" } },
          "strength": { "type": "string", "enum": ["strong", "moderate", "weak"] }
        },
        "required": ["description", "source_entities", "target_entities"]
      }
    }
  },
  "required": ["list_of_entities", "relations", "section_description"]
}
```
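A record can be checked against this structure with a few lines of plain Python. The sketch below is a hand-written check that mirrors the schema's required keys and types; it is an illustration under that assumption, not the validation code used in the repository, and `validation_errors` is a hypothetical name. A full JSON Schema validator would be the more general choice.

```python
from typing import Any, List

TOP_LEVEL_KEYS = ("section_description", "list_of_entities", "relations")
RELATION_KEYS = ("description", "source_entities", "target_entities")
STRENGTHS = {"strong", "moderate", "weak"}


def is_string_list(value: Any) -> bool:
    return isinstance(value, list) and all(isinstance(item, str) for item in value)


def validation_errors(record: Any) -> List[str]:
    """Return a list of problems; an empty list means the record matches the structure."""
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    errors = [f"missing key: {key}" for key in TOP_LEVEL_KEYS if key not in record]
    if not isinstance(record.get("section_description", ""), str):
        errors.append("section_description must be a string")
    if not is_string_list(record.get("list_of_entities", [])):
        errors.append("list_of_entities must be a list of strings")
    relations = record.get("relations", [])
    if not isinstance(relations, list):
        return errors + ["relations must be a list"]
    for index, relation in enumerate(relations):
        if not isinstance(relation, dict):
            errors.append(f"relations[{index}] is not an object")
            continue
        for key in RELATION_KEYS:
            if key not in relation:
                errors.append(f"relations[{index}] missing key: {key}")
        if not isinstance(relation.get("description", ""), str):
            errors.append(f"relations[{index}].description must be a string")
        for key in ("source_entities", "target_entities"):
            if not is_string_list(relation.get(key, [])):
                errors.append(f"relations[{index}].{key} must be a list of strings")
        strength = relation.get("strength")
        if "strength" in relation and not (isinstance(strength, str) and strength in STRENGTHS):
            errors.append(f"relations[{index}].strength must be strong, moderate, or weak")
    return errors


good = {
    "section_description": "summary",
    "list_of_entities": ["spikes", "bifurcation"],
    "relations": [
        {"description": "x", "source_entities": ["spikes"], "target_entities": ["bifurcation"]}
    ],
}
bad = {
    "list_of_entities": ["spikes"],
    "relations": [{"description": "x", "source_entities": ["spikes"], "strength": "huge"}],
}
print(validation_errors(good))
print(validation_errors(bad))
```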

Decisions

I used Airflow because I had previously prepared a template that allowed me to run Airflow pipelines with local LLMs.
There is a whole section of the database with extracted relations and entities, mostly for estimating the connectivity and scale of the extracted data.
The final dataset is created with jupyters/02_export_data.ipynb; this was simply quicker than adding a new volume to the docker-compose setup.
I chose datasets/scientific_papers as it already provided a good base for summaries (i.e., Abstracts) and did not require me to iteratively summarize all the contents, which would require additional time.
This project does not use ChatGPT or other external APIs; all processing was done locally on 2x RTX 3090 plus some OrangePIs. The goal is to produce a fine-tuned model that is cheaper to host while providing the same utility as this two-step LLaMA 13b process. OpenAI does not allow using generated output to fine-tune other models; hence, all data was generated locally with LLaMA 2, whose license permits improving LLaMA 2 with data generated by LLaMA 2. This is not perfect: as long as I use datasets/scientific_papers, the licensing of that dataset remains an issue, so everything will need to be regenerated in the future with a more open stack.
The goal is to create a small 3B-7B model that can be used for the task of extracting entities and semantic relations, which may be run on a small ARM board like OrangePI, with minimal cost at a reasonable speed.
I used LLaMA 2 Chat because, in the past, I was able to achieve the most stable results with that model.
I set the temperature to 0.7 to allow the model to infer some missing information and generate better summaries, but the trade-off of using a non-zero temperature is more involved result cleanup. Still, almost 88% of the generated data had a fixable structure.
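The README does not describe the exact cleanup rules, so the following is only a sketch of what fixing the structure of a raw generation could look like. It assumes two common failure modes (chatty text around the JSON object and trailing commas); the function name `try_repair_json` and the sample output are hypothetical.

```python
import json
import re
from typing import Optional


def try_repair_json(raw: str) -> Optional[dict]:
    """Try to recover a JSON object from raw model output; return None if that fails."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None  # no object-like span found
    candidate = raw[start:end + 1]
    # Drop commas that directly precede a closing brace or bracket.
    # Caveat: this simple regex would also alter such a pattern inside a string value.
    candidate = re.sub(r",\s*([}\]])", r"\1", candidate)
    try:
        parsed = json.loads(candidate)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


raw_output = (
    "Sure! Here is the JSON:\n"
    '{"list_of_entities": ["current", "bifurcation",], "relations": [],}\n'
    "Hope this helps."
)
print(try_repair_json(raw_output))
print(try_repair_json("no json here"))
```

Records that still fail to parse after such repairs would be discarded, which is consistent with only part of the generated data being fixable.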

Future Plans for the Project

Fine-tune LLaMA 2 7B with synthetic data (try and evaluate the speed and quality of generation).
Generate more synthetic data, clean it, and fine-tune the model further.
Build a system for mixed querying of the data (I've built a prototype; now, I would like to recreate it as a whole standalone service).
After running it successfully, regenerate data based on the Wikipedia dataset or another fully open-source dataset, and replace LLaMA with a truly open-source model.

Statistics

I ran the generation on 4 instances of LLaMA 2-chat on 2x RTX 3090 + i7 4790K. Processing averaged around 1 result per minute (either a summary or a JSON). The whole process, excluding coding and experimentation, took approximately 20,000 minutes, which is roughly 14 days of compute time, and required about 120 kWh of power. In the near future, I need to upgrade the CPU and RAM to remove that bottleneck. The servers were started with:

```bash
./run_llm_servers_for_data_generation.sh -n 4 -t 1 -m "models/llama-2-13b-chat.Q4_K_M.gguf" -c 4096 -b 1512
```
I tested hosting on ARM boards; a 13b model quantized to q4 ran at a stable speed for an extended time, achieving 2.34 tokens/s per OrangePI. With an RTX 3090 paired with my somewhat outdated CPU, an i7 4790K, I was able to achieve up to 20 tokens/s. I have 5 OrangePI 5 boards with 16 GB each, and by running models on all of them, I achieved around 11.7 tokens/s for approximately 50 W of power.
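The figures above can be cross-checked with a few lines of arithmetic. Every input below is taken from this section; the derived values (average power draw and tokens per joule) are my own calculations and do not appear in the original measurements.

```python
# Inputs quoted in the statistics above.
total_minutes = 20_000
energy_kwh = 120
tokens_per_second_per_board = 2.34
boards = 5
cluster_power_watts = 50

# Derived values.
days = total_minutes / (60 * 24)
average_power_watts = energy_kwh * 1000 / (total_minutes / 60)
cluster_tokens_per_second = boards * tokens_per_second_per_board
tokens_per_joule = cluster_tokens_per_second / cluster_power_watts

print(f"compute time: {days:.1f} days")
print(f"average power during generation: {average_power_watts:.0f} W")
print(f"OrangePI cluster throughput: {cluster_tokens_per_second:.1f} tokens/s")
print(f"OrangePI cluster efficiency: {tokens_per_joule:.3f} tokens per joule")
```

20,000 minutes is about 13.9 days, which matches the stated two weeks, and 120 kWh over that time corresponds to an average draw of about 360 W. Five boards at 2.34 tokens/s each give the stated 11.7 tokens/s.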

Running the project

Components Provided by This Project

Airflow integrated with Celery and Redis.
PostgreSQL 15.
Python 3.11 including libraries such as PyTorch, Langchain, OpenAI, pandas, etc.
Dockerized Python llama.cpp server with support for Nvidia GPU, accessible from Airflow.
Scripts for building and running Docker containers.
DAG example for connecting with a locally hosted LLM from Airflow using Langchain.

Prerequisites

Nvidia GPU(s) with sufficient VRAM to run the desired models.
Nvidia drivers with CUDA support for your system.
Run nvidia-smi to verify that the drivers are working correctly.

```
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      3494      G   /usr/lib/xorg/Xorg                          152MiB |
|    0   N/A  N/A      3634      G   /usr/bin/gnome-shell                        144MiB |
|    0   N/A  N/A      5000      G   ...irefox/3600/usr/lib/firefox/firefox      141MiB |
|    0   N/A  N/A      6614      G   ...sion,SpareRendererForSitePerProcess       57MiB |
|    1   N/A  N/A      3494      G   /usr/lib/xorg/Xorg                            4MiB |
+---------------------------------------------------------------------------------------+
```

Python 3.11 installed.
Docker installed and configured so that it can be used without sudo. For Docker installation on Ubuntu 22.04, refer to this DigitalOcean tutorial.
docker-compose installed. For Docker Compose installation on Ubuntu 22.04, see this DigitalOcean tutorial.
poetry installed; it is used for local type checking and DAG development.
A model in GGUF format. A good place to find models is TheBloke on Hugging Face.

Warning: You may need to adapt the Nvidia image version used in LocalLLamaCPPServerDockerfile to match the CUDA version compatible with your driver. In my example, it is CUDA Version: 12.2.

How to Start

Run Airflow with LLama.cpp

Download this repository or create a new one using this template.
Download the chosen LLM model in GGUF format (or any other format supported by llama.cpp) and place it in the models folder.
Modify LocalLLamaCPPServerDockerfile to include the correct model name and the number of layers to be passed to the GPU in the ENTRYPOINT section.
To add extra Python packages, use poetry add [name of package] or modify pyproject.toml and then run poetry install and poetry update. Select the created poetry environment in your IDE for type checking.
Run ./build_all.sh after making it executable (use chmod +x name_of_script for all scripts, e.g., sudo chmod +x build_all.sh).
After the build is complete, there should be a Docker image llm-server containing the dockerized llama.cpp server with GPU support, and an extending_airflow image, containing Airflow extended with chosen Python libraries.
To run everything, execute ./start_all.sh and to stop it, use ./stop_all.sh.
Open a browser and navigate to http://0.0.0.0:8080 to launch the Airflow webserver. Log in with username: airflow and password: airflow.
Refer to dags/test_connection_to_local_llm.py and dags/test_multi_connection_to_local_llm_4x.py as starting points.

How to modify number of llama.cpp servers

The script run_llm_servers_for_data_generation.sh provides parameters to configure how many llama.cpp servers are deployed.

The default values:

```bash
num=4 # number of servers to deploy; use the -n flag to change
model="models/llama-2-13b-chat.Q5_K_M.gguf" # LLM model to deploy with the llama.cpp server; use the -m flag
split=0 # if 0, models are distributed between the 2 GPUs without splitting; if 1, models are split between all GPUs; use the -s flag
n_threads=1 # number of threads for every llama.cpp server; use the -t flag
```

Example of how to run it:

```bash
./run_llm_servers_for_data_generation.sh -n 4 -m models/llama-2-13b-chat.Q4_K_M.gguf -t 8 -s 0
```

You can change this invocation in the ./start_all.sh script.

You can also manually stop just the inference servers by running ./stop_llm_servers_for_data_generation.sh.
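When several servers are running, work has to be spread across them. The sketch below shows one simple option, round-robin assignment. It is an assumption about how a client could distribute requests, not a description of the DAGs in this repository, and the server URLs (including the ports) are hypothetical placeholders.

```python
from itertools import cycle
from typing import Dict, Iterable, List


def assign_round_robin(tasks: Iterable[str], server_urls: List[str]) -> Dict[str, List[str]]:
    """Assign each task to the next server in turn."""
    if not server_urls:
        raise ValueError("at least one server URL is required")
    assignment: Dict[str, List[str]] = {url: [] for url in server_urls}
    for task, url in zip(tasks, cycle(server_urls)):
        assignment[url].append(task)
    return assignment


# Hypothetical URLs for 4 servers; check the scripts for the ports actually used.
servers = [f"http://localhost:{5556 + index}" for index in range(4)]
tasks = [f"segment-{number}" for number in range(10)]
for url, assigned in assign_round_robin(tasks, servers).items():
    print(url, assigned)
```

Round-robin keeps the servers evenly loaded when requests take similar time; a shared work queue would handle uneven request durations better.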

Using Just the Dockerized Model Without Airflow

Download the chosen LLM model in GGUF format (or any other format supported by llama.cpp) and place it in the models folder.
Modify LocalLLamaCPPServerDockerfile to include the correct model name and the number of layers to be passed to the GPU in the ENTRYPOINT section.
To add extra Python packages, use poetry add [name of package] or modify pyproject.toml and then run poetry install and poetry update. Select the created poetry environment in your IDE for type checking.
Run ./build_llm_server.sh to build the dockerized version of the LLama.cpp server with GPU support.
Execute ./run_llm_server.sh to start the server, and docker kill llm-server to stop it. The server runs on port 5556; you can check the loaded models at http://localhost:5556/v1/models or read the documentation at http://localhost:5556/docs#/
See run_completion_on_local_llama.py as a starting point for development without Airflow. You can run it within the Poetry environment.
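For a quick check without extra dependencies, the server can also be called with the standard library. This sketch assumes the server exposes an OpenAI-compatible `/v1/completions` endpoint next to the `/v1/models` endpoint mentioned above; confirm this in the server's `/docs` page before relying on it. The function names and parameter defaults are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5556"  # port stated in this README


def build_completion_request(
    prompt: str,
    max_tokens: int = 256,
    temperature: float = 0.7,
    base_url: str = BASE_URL,
) -> urllib.request.Request:
    """Build a POST request for the assumed OpenAI-compatible completions endpoint."""
    payload = {"prompt": prompt, "max_tokens": max_tokens, "temperature": temperature}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, **kwargs) -> str:
    """Send the request and return the generated text (requires a running server)."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    # Assumes the OpenAI-style response shape: {"choices": [{"text": ...}]}.
    return body["choices"][0]["text"]


# Building the request does not need a server; calling complete() does.
request = build_completion_request("List the entities in: The cat sat on the mat.")
print(request.get_method(), request.full_url)
```

Once the container is running, `complete("...")` sends the prompt and returns the generated text.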

Create `.env` for Airflow

Example of an .env file:

```
AIRFLOW_UID=1000
AIRFLOW_GID=0
```

Workflow

After installation, connect the poetry environment to your IDE for type checking.
Consider using DBeaver or a similar tool to create an extra database in the provided PostgreSQL and utilize airflow postgres hooks for communication from within Airflow DAGs.
Every DAG created in the dags folder will be visible and usable in Airflow.
The nvtop package can be used to monitor GPU usage. See nvtop on GitHub.
Volumes are already created for storing SQL scripts and raw data (sql and data). For more complex projects, consider integrating with a service like S3.

Owner

Name: Ignacy Daszkiewicz
Login: DehydratedWater
Kind: user

Repositories: 20
Profile: https://github.com/DehydratedWater

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: Semantic relations extraction
message: >-
  If you use this dataset, please cite it using the metadata
  from this file.
type: dataset
authors:
  - given-names: Ignacy Łukasz
    family-names: Daszkiewicz
    email: ignacy.daszkiewicz@gmail.com
repository-code: >-
  https://github.com/DehydratedWater/qlora_semantic_extraction
repository: >-
  https://huggingface.co/datasets/DehydratedWater42/semantic_relations_extraction
abstract: >-
  This project introduces a dataset designed for fine-tuning
  NLP models to improve semantic relation extraction between
  entities, addressing the need for both high accuracy and
  operational efficiency in NLP applications. By focusing on
  reducing computational costs and enhancing processing
  speed, the dataset aims to support the development of more
  cost-effective and agile NLP models. This contribution
  offers a valuable resource for advancing NLP technologies
  in efficiently understanding complex textual relationships
  while optimizing resource use.
license: Apache-2.0
version: '1.0'
date-released: '2024-02-12'

Issues and Pull Requests

Last synced: about 2 years ago

All Time

Total issues: 0
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 0
Total pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Dependencies

airflow_docker/Dockerfile docker

apache/airflow 2.7.3-python3.11 build

airflow_docker/requirements.txt pypi

aiohttp ==3.9.1
aiosignal ==1.3.1
alembic ==1.13.1
amqp ==5.2.0
annotated-types ==0.6.0
anyio ==4.2.0
apache-airflow ==2.7.3
apache-airflow-providers-celery ==3.5.1
apache-airflow-providers-common-sql ==1.10.0
apache-airflow-providers-docker ==3.9.1
apache-airflow-providers-ftp ==3.7.0
apache-airflow-providers-http ==4.8.0
apache-airflow-providers-imap ==3.5.0
apache-airflow-providers-postgres ==5.10.0
apache-airflow-providers-sqlite ==3.7.0
apispec ==6.3.1
argcomplete ==3.2.1
asgiref ==3.7.2
attrs ==23.2.0
babel ==2.14.0
backoff ==2.2.1
billiard ==4.2.0
blinker ==1.7.0
cachelib ==0.10.2
cattrs ==23.2.3
celery ==5.3.6
certifi ==2023.11.17
cffi ==1.16.0
charset-normalizer ==3.3.2
click ==8.1.7
click-didyoumean ==0.3.0
click-plugins ==1.1.1
click-repl ==0.3.0
clickclick ==20.10.2
colorama ==0.4.6
colorlog ==4.8.0
configupdater ==3.2
connexion ==2.14.2
cron-descriptor ==1.4.0
croniter ==2.0.1
cryptography ==41.0.7
dataclasses-json ==0.6.3
datasets ==2.16.1
deprecated ==1.2.14
dill ==0.3.7
distro ==1.9.0
dnspython ==2.4.2
docker ==7.0.0
docutils ==0.20.1
email-validator ==1.3.1
filelock ==3.13.1
flask ==2.2.5
flask-appbuilder ==4.3.6
flask-babel ==2.0.0
flask-caching ==2.0.1
flask-jwt-extended ==4.6.0
flask-limiter ==3.5.0
flask-login ==0.6.3
flask-session ==0.5.0
flask-sqlalchemy ==2.5.1
flask-wtf ==1.2.1
flower ==2.0.1
frozenlist ==1.4.1
fsspec ==2023.10.0
google-re2 ==1.1
googleapis-common-protos ==1.62.0
graphviz ==0.20.1
greenlet ==3.0.3
grpcio ==1.60.0
gunicorn ==21.2.0
h11 ==0.14.0
httpcore ==1.0.2
httpx ==0.26.0
huggingface-hub ==0.20.2
humanize ==4.9.0
idna ==3.6
importlib-metadata ==6.11.0
importlib-resources ==6.1.1
inflection ==0.5.1
itsdangerous ==2.1.2
jinja2 ==3.1.2
joblib ==1.3.2
jsonpatch ==1.33
jsonpointer ==2.4
jsonschema ==4.20.0
jsonschema-specifications ==2023.12.1
kombu ==5.3.4
langchain ==0.0.352
langchain-community ==0.0.8
langchain-core ==0.1.6
langsmith ==0.0.77
lazy-object-proxy ==1.10.0
limits ==3.7.0
linkify-it-py ==2.0.2
lockfile ==0.12.2
mako ==1.3.0
markdown ==3.5.1
markdown-it-py ==3.0.0
markupsafe ==2.1.3
marshmallow ==3.20.1
marshmallow-oneofschema ==3.0.1
marshmallow-sqlalchemy ==0.26.1
mdit-py-plugins ==0.4.0
mdurl ==0.1.2
mpmath ==1.3.0
multidict ==6.0.4
multiprocess ==0.70.15
mypy-extensions ==1.0.0
networkx ==3.2.1
numpy ==1.26.3
nvidia-cublas-cu12 ==12.1.3.1
nvidia-cuda-cupti-cu12 ==12.1.105
nvidia-cuda-nvrtc-cu12 ==12.1.105
nvidia-cuda-runtime-cu12 ==12.1.105
nvidia-cudnn-cu12 ==8.9.2.26
nvidia-cufft-cu12 ==11.0.2.54
nvidia-curand-cu12 ==10.3.2.106
nvidia-cusolver-cu12 ==11.4.5.107
nvidia-cusparse-cu12 ==12.1.0.106
nvidia-nccl-cu12 ==2.18.1
nvidia-nvjitlink-cu12 ==12.3.101
nvidia-nvtx-cu12 ==12.1.105
openai ==1.6.1
opentelemetry-api ==1.22.0
opentelemetry-exporter-otlp ==1.22.0
opentelemetry-exporter-otlp-proto-common ==1.22.0
opentelemetry-exporter-otlp-proto-grpc ==1.22.0
opentelemetry-exporter-otlp-proto-http ==1.22.0
opentelemetry-proto ==1.22.0
opentelemetry-sdk ==1.22.0
opentelemetry-semantic-conventions ==0.43b0
ordered-set ==4.1.0
packaging ==23.2
pandas ==2.1.4
pathspec ==0.12.1
pendulum ==2.1.2
pillow ==10.2.0
pluggy ==1.3.0
prison ==0.2.1
prometheus-client ==0.19.0
prompt-toolkit ==3.0.43
protobuf ==4.25.1
psutil ==5.9.7
psycopg2-binary ==2.9.9
pyarrow ==14.0.2
pyarrow-hotfix ==0.6
pycparser ==2.21
pydantic ==2.5.3
pydantic-core ==2.14.6
pygments ==2.17.2
pyjwt ==2.8.0
python-daemon ==3.0.1
python-dateutil ==2.8.2
python-dotenv ==1.0.0
python-nvd3 ==0.15.0
python-slugify ==8.0.1
pytz ==2023.3.post1
pytzdata ==2020.1
pywin32 ==306
pyyaml ==6.0.1
referencing ==0.32.1
regex ==2023.12.25
requests ==2.31.0
requests-toolbelt ==1.0.0
rfc3339-validator ==0.1.4
rich ==13.7.0
rich-argparse ==1.4.0
rpds-py ==0.16.2
safetensors ==0.4.1
scikit-learn ==1.3.2
scipy ==1.11.4
setproctitle ==1.3.3
setuptools ==69.0.3
six ==1.16.0
sniffio ==1.3.0
sqlalchemy ==1.4.50
sqlalchemy-jsonfield ==1.0.2
sqlalchemy-utils ==0.41.1
sqlparse ==0.4.4
sympy ==1.12
tabulate ==0.9.0
tenacity ==8.2.3
termcolor ==2.4.0
text-unidecode ==1.3
threadpoolctl ==3.2.0
tiktoken ==0.5.2
tokenizers ==0.15.0
torch ==2.1.2
torchaudio ==2.1.2
torchvision ==0.16.2
tornado ==6.4
tqdm ==4.66.1
transformers ==4.36.2
triton ==2.1.0
typing-extensions ==4.9.0
typing-inspect ==0.9.0
tzdata ==2023.4
uc-micro-py ==1.0.2
unicodecsv ==0.14.1
urllib3 ==2.1.0
vine ==5.1.0
wcwidth ==0.2.12
werkzeug ==2.2.3
wrapt ==1.16.0
wtforms ==3.0.1
xxhash ==3.4.1
yarl ==1.9.4
zipp ==3.17.0

poetry.lock pypi

274 dependencies

pyproject.toml pypi

apache-airflow 2.7.3
apache-airflow-providers-docker 3.9.1
apache-airflow-providers-postgres ^5.9.0
datasets ^2.16.1
jupyter ^1.0.0
langchain ^0.0.352
openai ^1.6.1
pandas ^2.1.4
pendulum 2.1.2
pillow ^10.1.0
python 3.11.x
scikit-learn ^1.3.2
sqlalchemy 1.4.50
tiktoken ^0.5.2
torch ^2.1.2
torchaudio ^2.1.2
torchvision ^0.16.2
transformers ^4.36.2
