https://github.com/ckdalskong/prag-baseline
MyData Pipeline Baseline Implementation
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found codemeta.json file)
- ✓ .zenodo.json file (found .zenodo.json file)
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (2.1%) to scientific vocabulary
Repository
MyData Pipeline Baseline Implementation
Basic Info
- Host: GitHub
- Owner: CkdalsKong
- Language: Python
- Default Branch: main
- Size: 432 KB
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
MyData Pipeline
Project Structure
Data and Runtime Environment
```
/data/myPRAG/
└── baseline/
    ├── corpus/                      # Data files
    │   ├── sampledchunks.jsonl
    │   ├── sampledchunkswithdoc.jsonl
    │   ├── sampledwikidoc.jsonl
    │   └── sampledembeddings.npy
    ├── errortype/                   # Evaluation criteria
    │   ├── checkacknowledge.txt
    │   ├── checkhallucination.txt
    │   ├── checkhelpful.txt
    │   └── checkviolation.txt
    ├── output/                      # Saved results
    │   ├── standard/
    │   │   ├── kept.jsonl
    │   │   ├── faiss.index
    │   │   ├── embeddings.npy
    │   │   ├── genstandard{personaindex}.json
    │   │   └── evalstandard{personaindex}.json
    │   ├── cosineonly{personaindex}/
    │   │   ├── kept.jsonl
    │   │   ├── faiss.index
    │   │   ├── embeddings.npy
    │   │   ├── gencosineonly{personaindex}.json
    │   │   └── evalcosineonly{personaindex}.json
    │   ├── naivep{personaindex}/
    │   │   ├── kept.jsonl
    │   │   ├── faiss.index
    │   │   ├── embeddings.npy
    │   │   ├── gennaivep{personaindex}.json
    │   │   └── evalnaivep{personaindex}.json
    │   ├── indexingreport.csv
    │   ├── generationreport.csv
    │   └── evaluationreport.csv
    ├── prompt/                      # Prompt templates
    │   ├── mydatageneration.txt
    │   ├── mydatallmfiltering.txt
    │   └── mydatallmsummarizing.txt
    └── final_persona_tasks.json     # Persona task definitions

/home/ubuntu/changmin/Baseline/      # Source code
├── run_vllm.sh                      # vLLM server launch script
├── mydataevaluation.py              # Evaluation module
├── mydatageneration.py              # Generation module
├── mydata_main.py                   # Main entry point
├── mydataindexing.py                # Indexing module
└── mydatautils.py                   # Utility module
```
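As a rough illustration of the output layout above, the sketch below resolves where the artifacts for one method and persona would be written. The helper name `artifact_paths` is hypothetical and not part of the repository. Directory and file names are copied as they appear in the tree, where some underscores may have been lost in rendering, so treat them as assumptions.

```python
from pathlib import PurePosixPath

# Data root taken from the tree above.
DATA_ROOT = PurePosixPath("/data/myPRAG/baseline")


def artifact_paths(method: str, persona_index: int, output_dir: str = "output") -> dict:
    """Return the expected artifact paths for one method/persona pair.

    Hypothetical helper: per the tree, `standard` uses a single shared
    directory, while the other methods get one directory per persona.
    """
    if method == "standard":
        run_dir = DATA_ROOT / output_dir / "standard"
    else:
        run_dir = DATA_ROOT / output_dir / f"{method}{persona_index}"
    return {
        "kept": run_dir / "kept.jsonl",
        "index": run_dir / "faiss.index",
        "embeddings": run_dir / "embeddings.npy",
        "generation": run_dir / f"gen{method}{persona_index}.json",
        "evaluation": run_dir / f"eval{method}{persona_index}.json",
    }


if __name__ == "__main__":
    # Method name spelled as in the tree above (an assumption).
    for name, path in artifact_paths("cosineonly", 3).items():
        print(f"{name:>10}: {path}")
```

The shared `standard/` directory suggests that the standard index is built once and reused across personas, while only the generation and evaluation files vary by persona; this reading comes from the tree alone and is not confirmed by the source code.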
Environment Setup
Running the vLLM Server
```bash
# Launch the vLLM server on GPUs 0 and 1
./run_vllm.sh 0,1
```
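Because the pipeline depends on the vLLM server being up, it can help to wait for it before starting a run. The sketch below polls an HTTP endpoint until it answers. The URL is an assumption: vLLM's OpenAI-compatible server exposes `/v1/models`, but the host and port configured in `run_vllm.sh` are not shown here, so `localhost:8000` is only a placeholder. The function names are hypothetical.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str) -> bool:
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error: server not ready yet.
        return False


def wait_for_server(
    url: str = "http://localhost:8000/v1/models",  # assumed host and port
    timeout_s: float = 300.0,
    interval_s: float = 2.0,
    probe: Callable[[str], bool] = http_probe,
) -> bool:
    """Poll `url` until the probe succeeds or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling `wait_for_server()` before launching `mydata_main.py` returns `True` once the endpoint responds and `False` if the timeout is reached first. The `probe` argument is injectable so the loop can be exercised without a running server.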
How to Run
Basic Run
```bash
python mydata_main.py --method [METHOD] --persona_index [INDEX] --mode [MODE] --chunk_mode [CHUNK_MODE] --output_dir [OUTPUT_DIR]
```
Multi-GPU Run
```bash
CUDA_VISIBLE_DEVICES=0,1 python mydata_main.py --method [METHOD] --persona_index [INDEX] --mode [MODE] --chunk_mode [CHUNK_MODE] --output_dir [OUTPUT_DIR] --use_multi_gpu
```
Parameter Reference
Required Parameters
- `--method`: Method to run
  - `naive_p`: Naive Persona method
  - `standard`: Standard method
  - `cosine_only`: Cosine Only method
  - `all`: Run all methods sequentially
  - Example: `--method naive_p` or `--method all`
- `--persona_index`: Persona index to run
  - `0-9`: A specific persona index
  - `all`: Run all personas sequentially
  - Example: `--persona_index 0` or `--persona_index all`
- `--mode`: Mode to run
  - `indexing`: Run indexing only
  - `generation`: Run generation only
  - `evaluation`: Run evaluation only
  - `all`: Run all modes sequentially
  - Example: `--mode indexing` or `--mode all`
- `--chunk_mode`: Chunk mode
  - `wodoc`: Use chunks without document information
  - `wdoc`: Use chunks with document information
  - Example: `--chunk_mode wodoc`
- `--output_dir`: Output directory
  - Example: `--output_dir output_1`
- `--persona_task_file`: Persona dataset to use
  - Example: `--persona_task_file final_persona_tasks.json`
- `--emb_model_name`: Embedding model to use
  - Example: `--emb_model_name facebook/contriever`
Optional Parameters
- `--device`: GPU device to use (default: "cuda:0")
  - Example: `--device cuda:0`
- `--use_multi_gpu`: Enable multi-GPU execution (flag)
  - Example: `--use_multi_gpu`
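The flags documented above could be declared with `argparse` roughly as follows. This is an illustrative sketch, not the parser in `mydata_main.py`. In particular, `--persona_task_file` and `--emb_model_name` are listed as required but do not appear in the usage examples, so this sketch assumes they have defaults and borrows the documented example values for them.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (a sketch)."""
    p = argparse.ArgumentParser(description="MyData pipeline (illustrative CLI sketch)")
    p.add_argument("--method", required=True,
                   choices=["naive_p", "standard", "cosine_only", "all"])
    # Kept as a string because "all" is accepted alongside 0-9.
    p.add_argument("--persona_index", required=True,
                   choices=[str(i) for i in range(10)] + ["all"])
    p.add_argument("--mode", required=True,
                   choices=["indexing", "generation", "evaluation", "all"])
    p.add_argument("--chunk_mode", required=True, choices=["wodoc", "wdoc"])
    p.add_argument("--output_dir", required=True)
    # Assumed defaults (see the note above); the real script may differ.
    p.add_argument("--persona_task_file", default="final_persona_tasks.json")
    p.add_argument("--emb_model_name", default="facebook/contriever")
    p.add_argument("--device", default="cuda:0")
    p.add_argument("--use_multi_gpu", action="store_true")
    return p


if __name__ == "__main__":
    # Parse an explicit argument list so the sketch runs without a real CLI call.
    args = build_parser().parse_args([
        "--method", "naive_p", "--persona_index", "0", "--mode", "all",
        "--chunk_mode", "wodoc", "--output_dir", "output_1",
    ])
    print(args)
```

Using `choices` makes an invalid value such as `--mode train` fail at parse time with a usage message rather than partway through a long run.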
Examples
Single method, single persona
```bash
python mydata_main.py --method naive_p --persona_index 0 --mode all --chunk_mode wodoc --output_dir output_1
```
All methods, all personas
```bash
python mydata_main.py --method all --persona_index all --mode all --chunk_mode wodoc --output_dir output_1
```
Multi-GPU run
```bash
CUDA_VISIBLE_DEVICES=0,1 python mydata_main.py --method naive_p --persona_index all --mode all --chunk_mode wodoc --output_dir output_1 --use_multi_gpu
```
Specific mode only
```bash
python mydata_main.py --method standard --persona_index all --mode indexing --chunk_mode wdoc --output_dir output_1
```
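To make the `all` values in the examples above concrete, the sketch below expands a `--method`, `--persona_index`, `--mode` combination into the individual runs it implies. The function name `expand_runs` is hypothetical, and the ordering (methods outermost, then personas, then modes) is an assumption; the README only states that `all` runs the items sequentially.

```python
from itertools import product
from typing import Iterator, Tuple

# Values taken from the parameter reference; the order is assumed.
METHODS = ["naive_p", "standard", "cosine_only"]
MODES = ["indexing", "generation", "evaluation"]
PERSONAS = list(range(10))  # persona indices 0-9


def expand_runs(method: str, persona_index: str, mode: str) -> Iterator[Tuple[str, int, str]]:
    """Yield one (method, persona, mode) tuple per individual run."""
    methods = METHODS if method == "all" else [method]
    personas = PERSONAS if persona_index == "all" else [int(persona_index)]
    modes = MODES if mode == "all" else [mode]
    yield from product(methods, personas, modes)


if __name__ == "__main__":
    runs = list(expand_runs("all", "all", "all"))
    print(f"{len(runs)} runs, first: {runs[0]}, last: {runs[-1]}")
```

Under these assumptions, `--method all --persona_index all --mode all` corresponds to 3 methods × 10 personas × 3 modes = 90 individual runs, which is worth keeping in mind when estimating total runtime.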
Owner
- Login: CkdalsKong
- Kind: user
- Repositories: 1
- Profile: https://github.com/CkdalsKong
GitHub Events
Total
- Push event: 9
Last Year
- Push event: 9
Dependencies
- beautifulsoup4 >=4.9.3
- faiss-cpu >=1.7.0
- matplotlib >=3.4.0
- numpy >=1.21.0
- pandas >=1.3.0
- python-dotenv >=0.19.0
- pyyaml >=5.4.0
- requests >=2.26.0
- seaborn >=0.11.0
- sentence-transformers >=2.2.0
- torch >=1.9.0
- tqdm >=4.62.0
- transformers >=4.15.0
- GitPython ==3.1.44
- Jinja2 ==3.1.6
- MarkupSafe ==3.0.2
- PyYAML ==6.0.2
- Pygments ==2.19.1
- Send2Trash ==1.8.3
- accelerate ==1.6.0
- aiohappyeyeballs ==2.6.1
- aiohttp ==3.11.16
- aiohttp-cors ==0.8.1
- aiosignal ==1.3.2
- airportsdata ==20250224
- annotated-types ==0.7.0
- anyio ==4.9.0
- argon2-cffi ==23.1.0
- argon2-cffi-bindings ==21.2.0
- arrow ==1.3.0
- astor ==0.8.1
- asttokens ==3.0.0
- async-lru ==2.0.5
- async-timeout ==5.0.1
- attrs ==25.3.0
- babel ==2.17.0
- beautifulsoup4 ==4.13.4
- blake3 ==1.0.4
- bleach ==6.2.0
- bs4 ==0.0.2
- cachetools ==5.5.2
- certifi ==2025.1.31
- cffi ==1.17.1
- charset-normalizer ==3.4.1
- click ==8.1.8
- cloudpickle ==3.1.1
- colorful ==0.5.6
- comm ==0.2.2
- compressed-tensors ==0.8.1
- contourpy ==1.3.2
- cycler ==0.12.1
- datasets ==2.21.0
- debugpy ==1.8.14
- decorator ==5.2.1
- deepspeed ==0.16.9
- defusedxml ==0.7.1
- depyf ==0.18.0
- dill ==0.3.8
- diskcache ==5.6.3
- distlib ==0.3.9
- distro ==1.9.0
- docker-pycreds ==0.4.0
- einops ==0.8.1
- eval_type_backport ==0.2.2
- exceptiongroup ==1.2.2
- executing ==2.2.0
- faiss ==1.9.0
- fastapi ==0.115.12
- fastjsonschema ==2.21.1
- filelock ==3.18.0
- fonttools ==4.58.1
- fqdn ==1.5.1
- frozenlist ==1.5.0
- fsspec ==2024.6.1
- gguf ==0.10.0
- gitdb ==4.0.12
- google-api-core ==2.24.2
- google-auth ==2.39.0
- googleapis-common-protos ==1.70.0
- gritlm ==1.0.2
- grpcio ==1.71.0
- h11 ==0.14.0
- hipporag ==2.0.0a3
- hjson ==3.1.0
- httpcore ==1.0.8
- httptools ==0.6.4
- httpx ==0.28.1
- huggingface-hub ==0.30.2
- idna ==3.10
- igraph ==0.11.8
- importlib_metadata ==8.6.1
- interegular ==0.3.3
- ipykernel ==6.29.5
- ipython ==8.35.0
- isoduration ==20.11.0
- jedi ==0.19.2
- jiter ==0.9.0
- joblib ==1.4.2
- json5 ==0.12.0
- jsonpointer ==3.0.0
- jsonschema ==4.23.0
- jsonschema-specifications ==2024.10.1
- jupyter-events ==0.12.0
- jupyter-lsp ==2.2.5
- jupyter_client ==8.6.3
- jupyter_core ==5.7.2
- jupyter_server ==2.15.0
- jupyter_server_terminals ==0.5.3
- jupyterlab ==4.4.0
- jupyterlab_pygments ==0.3.0
- jupyterlab_server ==2.27.3
- kiwisolver ==1.4.8
- lark ==1.2.2
- lm-format-enforcer ==0.10.11
- markdown-it-py ==3.0.0
- matplotlib ==3.10.3
- matplotlib-inline ==0.1.7
- mdurl ==0.1.2
- mistral_common ==1.5.4
- mistune ==3.1.3
- mpmath ==1.3.0
- msgpack ==1.1.0
- msgspec ==0.19.0
- mteb ==1.37.0
- multidict ==6.4.3
- multiprocess ==0.70.16
- nbclient ==0.10.2
- nbconvert ==7.16.6
- nbformat ==5.10.4
- nest-asyncio ==1.6.0
- networkx ==3.4.2
- ninja ==1.11.1.4
- notebook_shim ==0.2.4
- nvidia-cublas-cu12 ==12.4.5.8
- nvidia-cuda-cupti-cu12 ==12.4.127
- nvidia-cuda-nvrtc-cu12 ==12.4.127
- nvidia-cuda-runtime-cu12 ==12.4.127
- nvidia-cudnn-cu12 ==9.1.0.70
- nvidia-cufft-cu12 ==11.2.1.3
- nvidia-curand-cu12 ==10.3.5.147
- nvidia-cusolver-cu12 ==11.6.1.9
- nvidia-cusparse-cu12 ==12.3.1.170
- nvidia-ml-py ==12.570.86
- nvidia-nccl-cu12 ==2.21.5
- nvidia-nvjitlink-cu12 ==12.4.127
- nvidia-nvtx-cu12 ==12.4.127
- openai ==1.58.1
- opencensus ==0.11.4
- opencensus-context ==0.1.3
- opencv-python-headless ==4.11.0.86
- outlines ==0.1.11
- outlines_core ==0.1.26
- overrides ==7.7.0
- packaging ==24.2
- pandas ==2.2.3
- pandocfilters ==1.5.1
- parso ==0.8.4
- partial-json-parser ==0.2.1.1.post5
- pexpect ==4.9.0
- pillow ==11.2.1
- platformdirs ==4.3.7
- polars ==1.27.1
- prometheus-fastapi-instrumentator ==7.1.0
- prometheus_client ==0.21.1
- prompt_toolkit ==3.0.51
- propcache ==0.3.1
- proto-plus ==1.26.1
- protobuf ==5.29.4
- psutil ==7.0.0
- ptyprocess ==0.7.0
- pure_eval ==0.2.3
- py-cpuinfo ==9.0.0
- py-spy ==0.4.0
- pyarrow ==14.0.1
- pyasn1 ==0.6.1
- pyasn1_modules ==0.4.2
- pycountry ==24.6.1
- pycparser ==2.22
- pydantic ==2.10.4
- pydantic_core ==2.27.2
- pyparsing ==3.2.3
- python-dateutil ==2.9.0.post0
- python-dotenv ==1.1.0
- python-igraph ==0.11.8
- python-json-logger ==3.3.0
- pytrec_eval-terrier ==0.5.7
- pytz ==2025.2
- pyzmq ==26.4.0
- ray ==2.44.1
- referencing ==0.36.2
- regex ==2024.11.6
- requests ==2.32.3
- rfc3339-validator ==0.1.4
- rfc3986-validator ==0.1.1
- rich ==14.0.0
- rpds-py ==0.24.0
- rsa ==4.9.1
- safetensors ==0.5.3
- scikit-learn ==1.6.1
- scipy ==1.15.2
- seaborn ==0.13.2
- sentence-transformers ==4.1.0
- sentencepiece ==0.2.0
- sentry-sdk ==2.26.1
- setproctitle ==1.3.5
- six ==1.17.0
- smart-open ==7.1.0
- smmap ==5.0.2
- sniffio ==1.3.1
- soupsieve ==2.7
- stack-data ==0.6.3
- starlette ==0.46.2
- sympy ==1.13.1
- tenacity ==8.5.0
- terminado ==0.18.1
- texttable ==1.7.0
- threadpoolctl ==3.6.0
- tiktoken ==0.7.0
- tinycss2 ==1.4.0
- tokenizers ==0.20.3
- tomli ==2.2.1
- torch ==2.5.1
- torchvision ==0.20.1
- tornado ==6.4.2
- tqdm ==4.67.1
- traitlets ==5.14.3
- transformers ==4.45.2
- triton ==3.1.0
- types-python-dateutil ==2.9.0.20241206
- typing_extensions ==4.13.2
- tzdata ==2025.2
- uri-template ==1.3.0
- urllib3 ==2.4.0
- uvicorn ==0.34.1
- uvloop ==0.21.0
- virtualenv ==20.30.0
- vllm ==0.6.6.post1
- wandb ==0.19.9
- watchfiles ==1.0.5
- wcwidth ==0.2.13
- webcolors ==24.11.1
- webencodings ==0.5.1
- websocket-client ==1.8.0
- websockets ==15.0.1
- wrapt ==1.17.2
- xformers ==0.0.28.post3
- xgrammar ==0.1.14
- xxhash ==3.5.0
- yarl ==1.19.0
- zipp ==3.21.0