qlora_semantic_extraction
The project is intended for generating synthetic data that may be used for fine-tuning LLM models. The main use for this data is to prepare smaller, fine-tuned models, for increasing speed and lowering the cost of extracting relations and entities from text.
https://github.com/dehydratedwater/qlora_semantic_extraction
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (11.1%) to scientific vocabulary
Basic Info
Statistics
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
readme.md
Synthetic Data Creation Tool for Model Refinement
Goal of This Project
This repository is part of a larger project. The core idea is to build a system that extracts entities from text together with the semantic relations between them, and uses them for retrieval-based question answering. Maintaining a graph of all semantic relations makes it possible to improve retrieval results so that they are not only precise but also deep, by providing a way to reason about second-, third-, and higher-order causes and effects.
This Repository
This part of the project is intended for generating synthetic data that may be used for fine-tuning LLM models. The main use for this data is to prepare smaller, fine-tuned models, for increasing speed and lowering the cost of extracting relations and entities from text.
Hugging Face
The dataset generated with this repository, as well as a backup of the full PostgreSQL database, is available at DehydratedWater42/semantic_relations_extraction
Generated Data
Generation Process
This data was generated based on the datasets/scientific_papers dataset. This dataset contains a list of scientific articles with separate abstracts and lists of contents. Here is the synthetic data generation overview:
- All the `abstracts` and `lists of contents` were inserted into the database.
- The main content of every article was split into overlapping segments of 1k LLaMA tokens with a 200-token overlap.
- 10k of the `abstracts` + `lists of contents` were summarized by LLaMA 13b.
- Generated `summaries` + `split text segments` were transformed by LLaMA 13b into unprocessed JSONs.
- All generated JSONs were validated and cleaned up.
- Validated JSONs were reformatted into datasets that may be used for fine-tuning.
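The segmentation step above (1k-token windows with a 200-token overlap) can be sketched as follows. This is a minimal illustration, not the code used in the pipeline: the function name `split_into_overlapping_segments` is hypothetical, and plain placeholder strings stand in for real LLaMA tokens.

```python
from typing import List, Sequence


def split_into_overlapping_segments(
    tokens: Sequence[str], segment_size: int = 1000, overlap: int = 200
) -> List[List[str]]:
    """Split a token sequence into windows that share `overlap` tokens."""
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    if not 0 <= overlap < segment_size:
        raise ValueError("overlap must be in [0, segment_size)")

    step = segment_size - overlap  # how far each new window advances
    segments: List[List[str]] = []
    start = 0
    while start < len(tokens):
        segments.append(list(tokens[start:start + segment_size]))
        if start + segment_size >= len(tokens):
            break  # this window already reaches the end of the text
        start += step
    return segments


# Placeholder tokens; a real run would use the LLaMA tokenizer (assumption).
tokens = [f"tok{i}" for i in range(2500)]
segments = split_into_overlapping_segments(tokens)
print([len(segment) for segment in segments])  # [1000, 1000, 900]
```

Each window starts 800 tokens after the previous one, so the last 200 tokens of one segment are repeated at the start of the next.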
Example of output data
```json
{
  "section_description": "The article discusses the current reversal phenomenon in a classical deterministic ratchet system. The authors investigate the relationship between current and bifurcation diagrams, focusing on the dynamics of an ensemble of particles. They challenge Mateos' claim that current reversals occur only with bifurcations and present evidence for current reversals without bifurcations. Additionally, they show that bifurcations can occur without current reversals. The study highlights the importance of considering the characteristics of the ensemble in understanding the behavior of the system. The authors provide numerical evidence to support their claims and suggest that correlating abrupt changes in the current with bifurcations is more appropriate than focusing solely on current reversals.",
  "list_of_entities": [
    "reversals",
    "mateos",
    "figures",
    "rules",
    "current_reversal",
    "ensemble",
    "bifurcation",
    "jumps",
    "thumb",
    "spikes",
    "current",
    "particles",
    "open_question",
    "behavior",
    "heuristics",
    "direction",
    "chaotic",
    "parameter"
  ],
  "relations": [
    {
      "description": "bifurcations in single - trajectory behavior often corresponds to sudden spikes or jumps in the current for an ensemble in the same system",
      "source_entities": [
        "bifurcation"
      ],
      "target_entities": [
        "current"
      ]
    },
    {
      "description": "current reversals are a special case of this",
      "source_entities": [
        "current"
      ],
      "target_entities": [
        "bifurcation"
      ]
    },
    {
      "description": "not all spikes or jumps correspond to a bifurcation",
      "source_entities": [
        "spikes"
      ],
      "target_entities": [
        "bifurcation"
      ]
    },
    {
      "description": "the open question is clearly to figure out if the reason for when these rules are violated or are valid can be made more concrete",
      "source_entities": [
        "current"
      ],
      "target_entities": [
        "open_question"
      ]
    }
  ]
}
```
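To illustrate how relations in this format could feed the graph-based retrieval described in the project goal, here is a minimal sketch that turns `relations` into a directed graph and collects entities within a given number of hops. The helper names `build_relation_graph` and `entities_within_hops` are hypothetical, and the input is a shortened subset of the example above with descriptions omitted.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Set


def build_relation_graph(relations: Iterable[dict]) -> Dict[str, Set[str]]:
    """Map every source entity to the set of entities it points at."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for relation in relations:
        for source in relation["source_entities"]:
            for target in relation["target_entities"]:
                graph[source].add(target)
    return dict(graph)


def entities_within_hops(
    graph: Dict[str, Set[str]], start: str, max_hops: int
) -> Dict[str, int]:
    """Breadth-first search: reachable entities with their hop distance."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] == max_hops:
            continue  # do not expand beyond the requested depth
        for neighbour in sorted(graph.get(node, ())):
            if neighbour not in distances:
                distances[neighbour] = distances[node] + 1
                queue.append(neighbour)
    del distances[start]
    return distances


relations = [
    {"source_entities": ["bifurcation"], "target_entities": ["current"]},
    {"source_entities": ["current"], "target_entities": ["open_question"]},
    {"source_entities": ["spikes"], "target_entities": ["bifurcation"]},
]
graph = build_relation_graph(relations)
print(entities_within_hops(graph, "spikes", max_hops=2))
# {'bifurcation': 1, 'current': 2}
```

Raising `max_hops` to 3 would also reach `open_question`, which is the kind of second- and third-order link the graph is meant to expose.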
Expected output JSON schema
```json
{
  "$schema": "extraction_schema.json",
  "type": "object",
  "properties": {
    "section_description": {
      "type": "string"
    },
    "list_of_entities": {
      "type": "array",
      "items": {
        "type": "string"
      }
    },
    "relations": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": {
            "type": "string"
          },
          "source_entities": {
            "type": "array",
            "items": {
              "type": "string"
            }
          },
          "target_entities": {
            "type": "array",
            "items": {
              "type": "string"
            }
          },
          "strength": {
            "type": "string",
            "enum": ["strong", "moderate", "weak"]
          }
        },
        "required": ["description", "source_entities", "target_entities"]
      }
    }
  },
  "required": ["list_of_entities", "relations", "section_description"]
}
```
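A rough idea of what validating a raw model reply against this structure might look like, using only the standard library. This is a sketch under assumptions, not the repository's actual cleanup code: the function names are hypothetical, the key names follow the example output above, and the JSON is recovered simply by taking the outermost `{...}` span of the reply.

```python
import json
from typing import Any, List, Optional

ALLOWED_STRENGTHS = ("strong", "moderate", "weak")


def extract_json_object(raw_text: str) -> Optional[dict]:
    """Parse the outermost {...} span of an LLM reply, or return None."""
    start, end = raw_text.find("{"), raw_text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(raw_text[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


def _is_string_list(value: Any) -> bool:
    return isinstance(value, list) and all(isinstance(v, str) for v in value)


def validate_extraction(obj: dict) -> List[str]:
    """Return a list of problems; an empty list means the object is valid."""
    errors: List[str] = []
    if not isinstance(obj.get("section_description"), str):
        errors.append("section_description must be a string")
    if not _is_string_list(obj.get("list_of_entities")):
        errors.append("list_of_entities must be a list of strings")
    relations = obj.get("relations")
    if not isinstance(relations, list):
        errors.append("relations must be a list")
        return errors
    for index, relation in enumerate(relations):
        if not isinstance(relation, dict):
            errors.append(f"relations[{index}] must be an object")
            continue
        if not isinstance(relation.get("description"), str):
            errors.append(f"relations[{index}].description must be a string")
        for key in ("source_entities", "target_entities"):
            if not _is_string_list(relation.get(key)):
                errors.append(f"relations[{index}].{key} must be a list of strings")
        strength = relation.get("strength")  # optional field
        if strength is not None and strength not in ALLOWED_STRENGTHS:
            allowed = ", ".join(ALLOWED_STRENGTHS)
            errors.append(f"relations[{index}].strength must be one of {allowed}")
    return errors


reply = "Sure, here is the JSON:\n" + json.dumps({
    "section_description": "s",
    "list_of_entities": ["a", "b"],
    "relations": [{
        "description": "a affects b",
        "source_entities": ["a"],
        "target_entities": ["b"],
        "strength": "huge",
    }],
}) + "\nHope this helps!"
parsed = extract_json_object(reply)
errors = validate_extraction(parsed) if parsed is not None else ["no JSON found"]
print(errors)
```

Returning a list of messages instead of raising on the first problem makes it easier to decide whether a reply is fixable or should be discarded.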
Decisions
- I used Airflow because I had previously prepared a template that allowed me to run Airflow pipelines with local LLMs.
- There is a whole section of the database with extracted relations and entities, mostly for estimating the connectivity and scale of the extracted data.
- The final dataset is created with `jupyters/02_export_data.ipynb`; it was just quicker than adding a new volume to the docker-compose.
- I chose `datasets/scientific_papers` as it already provided a good base for summaries (i.e., abstracts) and did not require me to iteratively summarize all the contents, which would require additional time.
- This project does not use ChatGPT or other external APIs; all processing was done locally on 2x RTX 3090 + some OrangePIs. The goal is to generate a fine-tuned model that can be hosted more cheaply while providing the same utility as this two-step LLaMA 13b process. OpenAI does not allow using the results of generation for fine-tuning other models; hence, all this data was generated locally with LLaMA 2, as the license permits improving LLaMA 2 with data generated with LLaMA 2. This is not perfect, and as long as I use `datasets/scientific_papers` there is still a licensing issue; everything will need to be regenerated in the future with a more open stack.
- The goal is to create a small 3B-7B model that can be used for the task of extracting entities and semantic relations, which may be run on a small ARM board like OrangePI, with minimal cost at a reasonable speed.
- I used LLaMA 2 Chat because, in the past, I was able to achieve the most stable results with that model.
- I set the temperature to 0.7 to allow the model to infer some missing information and generate better summaries, but the trade-off of using a non-zero temperature is more involved result cleanup. Still, almost 88% of the generated data had a fixable structure.
Future Plans for the Project
- Fine-tune LLaMA 2 7B with synthetic data (try and evaluate the speed and quality of generation).
- Generate more synthetic data, clean it, and fine-tune the model further.
- Build a system for mixed querying of the data (I've built a prototype; now, I would like to recreate it as a whole standalone service).
- After running it successfully, regenerate data based on the Wikipedia dataset or another fully open-source dataset, and replace LLaMA with a truly open-source model.
Statistics
- I ran the generation on 4 instances of LLaMA 2-chat on 2x RTX 3090 + i7 4790K. The processing averaged around 1 result per minute (either a summary or JSON). The whole process, excluding coding and experimentation, took approximately 20,000 minutes, which is roughly 14 days of compute time, and required about 120 kWh of energy. In the near future, I need to upgrade the CPU + RAM to remove that bottleneck.

```bash
./run_llm_servers_for_data_generation.sh -n 4 -t 1 -m "models/llama-2-13b-chat.Q4_K_M.gguf" -c 4096 -b 1512
```

- I tested hosting on ARM boards; a 13b model quantized to q4 could be hosted at a stable speed for an extended time, achieving 2.34 tokens/s per OrangePI. With an RTX 3090 paired with my somewhat outdated CPU, an i7 4790K, I was able to achieve up to 20 tokens/s. I have 5 OrangePI 5 16GB boards, and by running models on all of them, I achieved around 11.7 tokens/s for approximately 50W of power.
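The figures above can be cross-checked with a few lines of arithmetic. The inputs are the numbers reported in this section; the derived values (average power draw, tokens per joule) are my own back-of-the-envelope calculations, not measurements.

```python
total_minutes = 20_000         # reported generation time
energy_kwh = 120               # reported energy use
tokens_per_s_per_board = 2.34  # one OrangePI 5 16GB
boards = 5
cluster_power_w = 50           # approximate draw of all five boards

compute_days = total_minutes / (60 * 24)
average_power_w = energy_kwh * 1000 / (total_minutes / 60)
cluster_tokens_per_s = boards * tokens_per_s_per_board
tokens_per_joule = cluster_tokens_per_s / cluster_power_w

print(f"{compute_days:.1f} days of compute")             # 13.9 days of compute
print(f"{average_power_w:.0f} W average draw")           # 360 W average draw
print(f"{cluster_tokens_per_s:.1f} tokens/s on 5 boards")  # 11.7 tokens/s on 5 boards
print(f"{tokens_per_joule:.3f} tokens per joule")        # 0.234 tokens per joule
```

The 11.7 tokens/s figure is consistent with five boards at 2.34 tokens/s each, and 120 kWh over 20,000 minutes corresponds to an average draw of roughly 360 W for the GPU machine.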
Running the project
Components Provided by This Project
- Airflow integrated with Celery and Redis.
- PostgreSQL 15.
- Python 3.11 including libraries such as PyTorch, Langchain, OpenAI, pandas, etc.
- Dockerized Python llama.cpp server with support for Nvidia GPU, accessible from Airflow.
- Scripts for building and running Docker containers.
- DAG example for connecting with a locally hosted LLM from Airflow using Langchain.
Prerequisites
- Nvidia GPU(s) with sufficient VRAM to run the desired models.
- Nvidia drivers with CUDA support for your system.
- Run `nvidia-smi` to verify that the drivers are working correctly.
```shell
(base) ➜ ~ nvidia-smi
Tue Dec 26 14:02:31 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3090 On | 00000000:01:00.0 On | N/A |
| 0% 29C P8 38W / 350W | 513MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce RTX 3090 On | 00000000:02:00.0 Off | N/A |
| 0% 25C P8 17W / 370W | 12MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      3494      G   /usr/lib/xorg/Xorg                          152MiB |
|    0   N/A  N/A      3634      G   /usr/bin/gnome-shell                        144MiB |
|    0   N/A  N/A      5000      G   ...irefox/3600/usr/lib/firefox/firefox      141MiB |
|    0   N/A  N/A      6614      G   ...sion,SpareRendererForSitePerProcess       57MiB |
|    1   N/A  N/A      3494      G   /usr/lib/xorg/Xorg                            4MiB |
+---------------------------------------------------------------------------------------+
```
- `Python 3.11` installed.
- `docker` installed, with a sudo-less docker user configured. For Docker installation on Ubuntu 22.04, refer to this DigitalOcean tutorial.
- `docker-compose` installed. For Docker Compose installation on Ubuntu 22.04, see this DigitalOcean tutorial.
- `poetry` installed, used for local type checking and DAG development.
- A model in `GGUF` format. A good place to find models is TheBloke on Hugging Face.
Warning: You may need to adapt the Nvidia image version used in LocalLLamaCPPServerDockerfile to match the CUDA version compatible with your driver. In my example, it is CUDA Version: 12.2.
How to Start
Run Airflow with LLama.cpp
- Download this repository or create a new one using this template.
- Download the chosen LLM model in `GGUF` format (or any other format supported by llama.cpp) and place it in the `models` folder.
- Modify `LocalLLamaCPPServerDockerfile` to include the correct model name and the number of layers to be passed to the GPU in the `ENTRYPOINT` section.
- To add extra Python packages, use `poetry add [name of package]` or modify `pyproject.toml` and then run `poetry install` and `poetry update`. Select the created poetry environment in your IDE for type checking.
- Run `./build_all.sh` after making it executable (use `chmod +x name_of_script` for all scripts, e.g., `sudo chmod +x build_all.sh`).
- After the build is complete, there should be a Docker image `llm-server` containing the dockerized `llama.cpp` server with GPU support, and an `extending_airflow` image containing Airflow extended with the chosen Python libraries.
- To run everything, execute `./start_all.sh`; to stop it, use `./stop_all.sh`.
- Open a browser and navigate to http://0.0.0.0:8080 to launch the Airflow webserver. Log in with username `airflow` and password `airflow`.
- Refer to `dags/test_connection_to_local_llm.py` and `dags/test_multi_connection_to_local_llm_4x.py` as starting points.
How to modify number of llama.cpp servers
The script `run_llm_servers_for_data_generation.sh` provides parameters to configure how many llama.cpp servers should be deployed.
The default values:
```bash
num=4 # number of servers to deploy, use the -n flag to change
model="models/llama-2-13b-chat.Q5_K_M.gguf" # LLM model to deploy with the llama.cpp server, use the -m flag
split=0 # if 0, models are distributed between the 2 GPUs without splitting; if 1, models are split between all GPUs, use the -s flag
n_threads=1 # number of threads for every llama.cpp server, use the -t flag
```
Example of how to run:
```bash
./run_llm_servers_for_data_generation.sh -n 4 -m models/llama-2-13b-chat.Q4_K_M.gguf -t 8 -s 0
```
You can modify these values in the `./start_all.sh` script.
You can also manually stop just the inference servers by running `./stop_llm_servers_for_data_generation.sh`.
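If the servers need to be launched from Python (for example from a notebook), the flags described above can be assembled programmatically. This is only a sketch: `build_server_command` is a hypothetical helper, and it covers just the `-n`, `-m`, `-s`, and `-t` flags mentioned in this section.

```python
import shlex
from typing import List


def build_server_command(
    num: int = 4,
    model: str = "models/llama-2-13b-chat.Q4_K_M.gguf",
    split: int = 0,
    n_threads: int = 1,
) -> List[str]:
    """Build the argument list for run_llm_servers_for_data_generation.sh."""
    if split not in (0, 1):
        raise ValueError("split must be 0 or 1")
    return [
        "./run_llm_servers_for_data_generation.sh",
        "-n", str(num),
        "-m", model,
        "-s", str(split),
        "-t", str(n_threads),
    ]


command = build_server_command(num=4, n_threads=8)
print(shlex.join(command))
# ./run_llm_servers_for_data_generation.sh -n 4 -m models/llama-2-13b-chat.Q4_K_M.gguf -s 0 -t 8
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the repository root; it is only printed here so the example has no side effects.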
Using Just the Dockerized Model Without Airflow
- Download the chosen LLM model in `GGUF` format (or any other format supported by llama.cpp) and place it in the `models` folder.
- Modify `LocalLLamaCPPServerDockerfile` to include the correct model name and the number of layers to be passed to the GPU in the `ENTRYPOINT` section.
- To add extra Python packages, use `poetry add [name of package]` or modify `pyproject.toml` and then run `poetry install` and `poetry update`. Select the created poetry environment in your IDE for type checking.
- Run `./build_llm_server.sh` to build the dockerized version of the llama.cpp server with GPU support.
- Execute `./run_llm_server.sh` to start the server and `docker kill llm-server` to stop it. The server runs on port `5556`; you can check the loaded models at `http://localhost:5556/v1/models` or the documentation at `http://localhost:5556/docs#/`.
- See `run_completion_on_local_llama.py` as a starting point for development without Airflow. You can run it within the Poetry environment.
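As a rough illustration of talking to the server on port `5556` without Airflow or Langchain, the sketch below uses only the standard library. It is not the content of `run_completion_on_local_llama.py`. The `/v1/models` path comes from the text above; the `/v1/completions` path and the payload fields are assumptions based on the server exposing an OpenAI-style API, so check `http://localhost:5556/docs#/` for the exact schema. All function names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5556"  # port taken from the steps above


def build_completion_payload(
    prompt: str, temperature: float = 0.7, max_tokens: int = 256
) -> dict:
    """Request body for an OpenAI-style completion call (assumed schema)."""
    return {"prompt": prompt, "temperature": temperature, "max_tokens": max_tokens}


def list_models(base_url: str = BASE_URL, timeout: float = 5.0) -> dict:
    """GET /v1/models; requires the server to be running."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as response:
        return json.load(response)


def complete(prompt: str, base_url: str = BASE_URL, timeout: float = 120.0) -> dict:
    """POST the prompt to /v1/completions (assumed endpoint)."""
    request = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(build_completion_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Only the payload is built here, so the example runs without a server.
payload = build_completion_payload("Extract entities from: ...")
print(json.dumps(payload, indent=2))
```

With the server started via `./run_llm_server.sh`, calling `list_models()` is a quick way to confirm that the model was loaded before sending prompts with `complete(...)`.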
Create .env for Airflow
Example of an .env file:
AIRFLOW_UID=1000
AIRFLOW_GID=0
Workflow
- After installation, connect the poetry environment to your IDE for type checking.
- Consider using `DBeaver` or a similar tool to create an extra database in the provided `PostgreSQL` and utilize `airflow postgres hooks` for communication from within `Airflow DAGs`.
- Every `DAG` created in the `dags` folder will be visible and usable in `Airflow`.
- The `nvtop` package can be used to monitor GPU usage. See nvtop on GitHub.
- Volumes are already created for storing SQL scripts and raw data (`sql` and `data`). For more complex projects, consider integrating with a service like `S3`.
Owner
- Name: Ignacy Daszkiewicz
- Login: DehydratedWater
- Kind: user
- Repositories: 20
- Profile: https://github.com/DehydratedWater
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: Semantic relations extraction
message: >-
  If you use this dataset, please cite it using the metadata
  from this file.
type: dataset
authors:
  - given-names: Ignacy Łukasz
    family-names: Daszkiewicz
    email: ignacy.daszkiewicz@gmail.com
repository-code: >-
  https://github.com/DehydratedWater/qlora_semantic_extraction
repository: >-
  https://huggingface.co/datasets/DehydratedWater42/semantic_relations_extraction
abstract: >-
  This project introduces a dataset designed for fine-tuning
  NLP models to improve semantic relation extraction between
  entities, addressing the need for both high accuracy and
  operational efficiency in NLP applications. By focusing on
  reducing computational costs and enhancing processing
  speed, the dataset aims to support the development of more
  cost-effective and agile NLP models. This contribution
  offers a valuable resource for advancing NLP technologies
  in efficiently understanding complex textual relationships
  while optimizing resource use.
license: Apache-2.0
version: '1.0'
date-released: '2024-02-12'
Issues and Pull Requests
Last synced: about 2 years ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Dependencies
- apache/airflow 2.7.3-python3.11 build
- aiohttp ==3.9.1
- aiosignal ==1.3.1
- alembic ==1.13.1
- amqp ==5.2.0
- annotated-types ==0.6.0
- anyio ==4.2.0
- apache-airflow ==2.7.3
- apache-airflow-providers-celery ==3.5.1
- apache-airflow-providers-common-sql ==1.10.0
- apache-airflow-providers-docker ==3.9.1
- apache-airflow-providers-ftp ==3.7.0
- apache-airflow-providers-http ==4.8.0
- apache-airflow-providers-imap ==3.5.0
- apache-airflow-providers-postgres ==5.10.0
- apache-airflow-providers-sqlite ==3.7.0
- apispec ==6.3.1
- argcomplete ==3.2.1
- asgiref ==3.7.2
- attrs ==23.2.0
- babel ==2.14.0
- backoff ==2.2.1
- billiard ==4.2.0
- blinker ==1.7.0
- cachelib ==0.10.2
- cattrs ==23.2.3
- celery ==5.3.6
- certifi ==2023.11.17
- cffi ==1.16.0
- charset-normalizer ==3.3.2
- click ==8.1.7
- click-didyoumean ==0.3.0
- click-plugins ==1.1.1
- click-repl ==0.3.0
- clickclick ==20.10.2
- colorama ==0.4.6
- colorlog ==4.8.0
- configupdater ==3.2
- connexion ==2.14.2
- cron-descriptor ==1.4.0
- croniter ==2.0.1
- cryptography ==41.0.7
- dataclasses-json ==0.6.3
- datasets ==2.16.1
- deprecated ==1.2.14
- dill ==0.3.7
- distro ==1.9.0
- dnspython ==2.4.2
- docker ==7.0.0
- docutils ==0.20.1
- email-validator ==1.3.1
- filelock ==3.13.1
- flask ==2.2.5
- flask-appbuilder ==4.3.6
- flask-babel ==2.0.0
- flask-caching ==2.0.1
- flask-jwt-extended ==4.6.0
- flask-limiter ==3.5.0
- flask-login ==0.6.3
- flask-session ==0.5.0
- flask-sqlalchemy ==2.5.1
- flask-wtf ==1.2.1
- flower ==2.0.1
- frozenlist ==1.4.1
- fsspec ==2023.10.0
- google-re2 ==1.1
- googleapis-common-protos ==1.62.0
- graphviz ==0.20.1
- greenlet ==3.0.3
- grpcio ==1.60.0
- gunicorn ==21.2.0
- h11 ==0.14.0
- httpcore ==1.0.2
- httpx ==0.26.0
- huggingface-hub ==0.20.2
- humanize ==4.9.0
- idna ==3.6
- importlib-metadata ==6.11.0
- importlib-resources ==6.1.1
- inflection ==0.5.1
- itsdangerous ==2.1.2
- jinja2 ==3.1.2
- joblib ==1.3.2
- jsonpatch ==1.33
- jsonpointer ==2.4
- jsonschema ==4.20.0
- jsonschema-specifications ==2023.12.1
- kombu ==5.3.4
- langchain ==0.0.352
- langchain-community ==0.0.8
- langchain-core ==0.1.6
- langsmith ==0.0.77
- lazy-object-proxy ==1.10.0
- limits ==3.7.0
- linkify-it-py ==2.0.2
- lockfile ==0.12.2
- mako ==1.3.0
- markdown ==3.5.1
- markdown-it-py ==3.0.0
- markupsafe ==2.1.3
- marshmallow ==3.20.1
- marshmallow-oneofschema ==3.0.1
- marshmallow-sqlalchemy ==0.26.1
- mdit-py-plugins ==0.4.0
- mdurl ==0.1.2
- mpmath ==1.3.0
- multidict ==6.0.4
- multiprocess ==0.70.15
- mypy-extensions ==1.0.0
- networkx ==3.2.1
- numpy ==1.26.3
- nvidia-cublas-cu12 ==12.1.3.1
- nvidia-cuda-cupti-cu12 ==12.1.105
- nvidia-cuda-nvrtc-cu12 ==12.1.105
- nvidia-cuda-runtime-cu12 ==12.1.105
- nvidia-cudnn-cu12 ==8.9.2.26
- nvidia-cufft-cu12 ==11.0.2.54
- nvidia-curand-cu12 ==10.3.2.106
- nvidia-cusolver-cu12 ==11.4.5.107
- nvidia-cusparse-cu12 ==12.1.0.106
- nvidia-nccl-cu12 ==2.18.1
- nvidia-nvjitlink-cu12 ==12.3.101
- nvidia-nvtx-cu12 ==12.1.105
- openai ==1.6.1
- opentelemetry-api ==1.22.0
- opentelemetry-exporter-otlp ==1.22.0
- opentelemetry-exporter-otlp-proto-common ==1.22.0
- opentelemetry-exporter-otlp-proto-grpc ==1.22.0
- opentelemetry-exporter-otlp-proto-http ==1.22.0
- opentelemetry-proto ==1.22.0
- opentelemetry-sdk ==1.22.0
- opentelemetry-semantic-conventions ==0.43b0
- ordered-set ==4.1.0
- packaging ==23.2
- pandas ==2.1.4
- pathspec ==0.12.1
- pendulum ==2.1.2
- pillow ==10.2.0
- pluggy ==1.3.0
- prison ==0.2.1
- prometheus-client ==0.19.0
- prompt-toolkit ==3.0.43
- protobuf ==4.25.1
- psutil ==5.9.7
- psycopg2-binary ==2.9.9
- pyarrow ==14.0.2
- pyarrow-hotfix ==0.6
- pycparser ==2.21
- pydantic ==2.5.3
- pydantic-core ==2.14.6
- pygments ==2.17.2
- pyjwt ==2.8.0
- python-daemon ==3.0.1
- python-dateutil ==2.8.2
- python-dotenv ==1.0.0
- python-nvd3 ==0.15.0
- python-slugify ==8.0.1
- pytz ==2023.3.post1
- pytzdata ==2020.1
- pywin32 ==306
- pyyaml ==6.0.1
- referencing ==0.32.1
- regex ==2023.12.25
- requests ==2.31.0
- requests-toolbelt ==1.0.0
- rfc3339-validator ==0.1.4
- rich ==13.7.0
- rich-argparse ==1.4.0
- rpds-py ==0.16.2
- safetensors ==0.4.1
- scikit-learn ==1.3.2
- scipy ==1.11.4
- setproctitle ==1.3.3
- setuptools ==69.0.3
- six ==1.16.0
- sniffio ==1.3.0
- sqlalchemy ==1.4.50
- sqlalchemy-jsonfield ==1.0.2
- sqlalchemy-utils ==0.41.1
- sqlparse ==0.4.4
- sympy ==1.12
- tabulate ==0.9.0
- tenacity ==8.2.3
- termcolor ==2.4.0
- text-unidecode ==1.3
- threadpoolctl ==3.2.0
- tiktoken ==0.5.2
- tokenizers ==0.15.0
- torch ==2.1.2
- torchaudio ==2.1.2
- torchvision ==0.16.2
- tornado ==6.4
- tqdm ==4.66.1
- transformers ==4.36.2
- triton ==2.1.0
- typing-extensions ==4.9.0
- typing-inspect ==0.9.0
- tzdata ==2023.4
- uc-micro-py ==1.0.2
- unicodecsv ==0.14.1
- urllib3 ==2.1.0
- vine ==5.1.0
- wcwidth ==0.2.12
- werkzeug ==2.2.3
- wrapt ==1.16.0
- wtforms ==3.0.1
- xxhash ==3.4.1
- yarl ==1.9.4
- zipp ==3.17.0
- 274 dependencies
- apache-airflow 2.7.3
- apache-airflow-providers-docker 3.9.1
- apache-airflow-providers-postgres ^5.9.0
- datasets ^2.16.1
- jupyter ^1.0.0
- langchain ^0.0.352
- openai ^1.6.1
- pandas ^2.1.4
- pendulum 2.1.2
- pillow ^10.1.0
- python 3.11.x
- scikit-learn ^1.3.2
- sqlalchemy 1.4.50
- tiktoken ^0.5.2
- torch ^2.1.2
- torchaudio ^2.1.2
- torchvision ^0.16.2
- transformers ^4.36.2