https://github.com/ckdalskong/prag-baseline

MyData Pipeline Baseline Implementation

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (2.1%) to scientific vocabulary
Last synced: 6 months ago

Repository

MyData Pipeline Baseline Implementation

Basic Info
  • Host: GitHub
  • Owner: CkdalsKong
  • Language: Python
  • Default Branch: main
  • Size: 432 KB
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created 9 months ago · Last pushed 7 months ago
Metadata Files
Readme

README.md

MyData Pipeline

Project Structure

Data and Runtime Environment

```
/data/myPRAG/
└── baseline/
    ├── corpus/                       # Data files
    │   ├── sampledchunks.jsonl
    │   ├── sampledchunkswithdoc.jsonl
    │   ├── sampledwikidoc.jsonl
    │   └── sampledembeddings.npy
    ├── errortype/                    # Evaluation criteria
    │   ├── checkacknowledge.txt
    │   ├── checkhallucination.txt
    │   ├── checkhelpful.txt
    │   └── checkviolation.txt
    ├── output/                       # Saved results
    │   ├── standard/
    │   │   ├── kept.jsonl
    │   │   ├── faiss.index
    │   │   ├── embeddings.npy
    │   │   ├── genstandard{personaindex}.json
    │   │   └── evalstandard{personaindex}.json
    │   ├── cosineonly{personaindex}/
    │   │   ├── kept.jsonl
    │   │   ├── faiss.index
    │   │   ├── embeddings.npy
    │   │   ├── gencosineonly{personaindex}.json
    │   │   └── evalcosineonly{personaindex}.json
    │   ├── naivep{personaindex}/
    │   │   ├── kept.jsonl
    │   │   ├── faiss.index
    │   │   ├── embeddings.npy
    │   │   ├── gennaivep{personaindex}.json
    │   │   └── evalnaivep{personaindex}.json
    │   ├── indexingreport.csv
    │   ├── generationreport.csv
    │   └── evaluationreport.csv
    ├── prompt/                       # Prompt templates
    │   ├── mydatageneration.txt
    │   ├── mydatallmfiltering.txt
    │   └── mydatallmsummarizing.txt
    └── final_persona_tasks.json      # Persona task definitions

/home/ubuntu/changmin/Baseline/       # Source code
├── run_vllm.sh                       # vLLM server launch script
├── mydataevaluation.py               # Evaluation module
├── mydatageneration.py               # Generation module
├── mydata_main.py                    # Main entry point
├── mydataindexing.py                 # Indexing module
└── mydatautils.py                    # Utility module
```
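Several files in the layout above use the `.jsonl` extension, which conventionally means JSON Lines (one JSON object per line). The following is a minimal sketch of a reader for such files, assuming that convention holds here. The helper name `iter_jsonl` is hypothetical and is not taken from the repository, and the record fields are not documented in this README, so the sketch makes no assumption about them.

```python
import json
from pathlib import Path
from typing import Iterator, Union


def iter_jsonl(path: Union[str, Path]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)
```

For example, `sum(1 for _ in iter_jsonl("kept.jsonl"))` would count the records in one of the `kept.jsonl` files without loading the whole file into memory.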

Environment Setup

Starting the vLLM Server

```bash
# Start the vLLM server on GPUs 0 and 1
./run_vllm.sh 0,1
```
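Before starting a pipeline run it can help to confirm that the server is reachable. The sketch below assumes that `run_vllm.sh` launches vLLM's OpenAI-compatible HTTP server on its default address `http://localhost:8000`. The script's contents are not shown in this README, so the host, the port, and the helper names `models_url` and `server_is_up` are all assumptions.

```python
import http.client
import json
import urllib.request


def models_url(base_url: str) -> str:
    """Build the model-listing URL of an OpenAI-compatible server."""
    return base_url.rstrip("/") + "/v1/models"


def server_is_up(base_url: str = "http://localhost:8000", timeout: float = 2.0) -> bool:
    """Return True if the server answers /v1/models with a JSON object containing "data"."""
    try:
        with urllib.request.urlopen(models_url(base_url), timeout=timeout) as response:
            payload = json.load(response)
    except (OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error, or a non-JSON body.
        return False
    return isinstance(payload, dict) and "data" in payload
```

Calling `server_is_up()` in a short retry loop is one way to wait for model loading to finish before launching generation or evaluation.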

How to Run

Basic Usage

```bash
python mydata_main.py --method [METHOD] --persona_index [INDEX] --mode [MODE] --chunk_mode [CHUNK_MODE] --output_dir [OUTPUT_DIR]
```

Multi-GPU Execution

```bash
CUDA_VISIBLE_DEVICES=0,1 python mydata_main.py --method [METHOD] --persona_index [INDEX] --mode [MODE] --chunk_mode [CHUNK_MODE] --output_dir [OUTPUT_DIR] --use_multi_gpu
```

Parameters

Required Parameters

  • --method: the method to run

    • naive_p: Naive Persona method
    • standard: Standard method
    • cosine_only: Cosine Only method
    • all: run all methods sequentially
    • Example: --method naive_p or --method all
  • --persona_index: the persona index to use

    • 0-9: a specific persona index
    • all: run all personas sequentially
    • Example: --persona_index 0 or --persona_index all
  • --mode: the mode to run

    • indexing: run indexing only
    • generation: run generation only
    • evaluation: run evaluation only
    • all: run all modes sequentially
    • Example: --mode indexing or --mode all
  • --chunk_mode: the chunk mode to use

    • wodoc: use chunks without document information
    • wdoc: use chunks with document information
    • Example: --chunk_mode wodoc
  • --output_dir: the output directory

    • Example: --output_dir output_1
  • --persona_task_file: the persona dataset to use

    • Example: --persona_task_file final_persona_tasks.json
  • --emb_model_name: the embedding model to use

    • Example: --emb_model_name facebook/contriever
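The `all` value accepted by `--method`, `--persona_index`, and `--mode` expands to every option of that parameter, run sequentially. The following sketch illustrates one way such an expansion could work. It is not the repository's code: the helper names `expand` and `plan_runs` are hypothetical, and the nesting order (method, then persona, then mode) is an assumption, since the README does not state the order of execution.

```python
from itertools import product

# Option lists as documented in the parameter descriptions above.
METHODS = ["naive_p", "standard", "cosine_only"]
PERSONA_INDICES = list(range(10))  # documented range: 0-9
MODES = ["indexing", "generation", "evaluation"]


def expand(value, choices):
    """Return every choice when value is "all", otherwise just the value."""
    return list(choices) if value == "all" else [value]


def plan_runs(method: str, persona_index: str, mode: str):
    """List the (method, persona_index, mode) combinations a call would cover."""
    if persona_index == "all":
        personas = PERSONA_INDICES
    else:
        personas = [int(persona_index)]
    return list(product(expand(method, METHODS), personas, expand(mode, MODES)))
```

Under these assumptions, `plan_runs("all", "all", "all")` yields 3 × 10 × 3 = 90 combinations, while `plan_runs("naive_p", "0", "indexing")` yields a single one.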

Optional Parameters

  • --device: the GPU device to use (default: "cuda:0")

    • Example: --device cuda:0
  • --use_multi_gpu: enable multi-GPU execution (flag)

    • Example: --use_multi_gpu
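The documented flags can be mirrored with a standard `argparse` parser. This is a reconstruction from the parameter descriptions above, not the parser in `mydata_main.py`, and the function name `build_parser` is hypothetical. One assumption deserves attention: `--persona_task_file` and `--emb_model_name` are listed as required, yet the example commands in this README omit them, so the sketch leaves them optional with no invented default.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the flags documented in this README."""
    parser = argparse.ArgumentParser(description="MyData Pipeline (illustrative sketch)")
    parser.add_argument("--method", required=True,
                        choices=["naive_p", "standard", "cosine_only", "all"])
    parser.add_argument("--persona_index", required=True,
                        choices=[str(i) for i in range(10)] + ["all"])
    parser.add_argument("--mode", required=True,
                        choices=["indexing", "generation", "evaluation", "all"])
    parser.add_argument("--chunk_mode", required=True, choices=["wodoc", "wdoc"])
    parser.add_argument("--output_dir", required=True)
    # Assumption: kept optional because the README's example commands omit them.
    parser.add_argument("--persona_task_file", default=None)
    parser.add_argument("--emb_model_name", default=None)
    parser.add_argument("--device", default="cuda:0")
    parser.add_argument("--use_multi_gpu", action="store_true")
    return parser
```

Restricting values with `choices` makes a mistyped method or mode fail at startup instead of partway through a long indexing run.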

Examples

Single method, single persona

```bash
python mydata_main.py --method naive_p --persona_index 0 --mode all --chunk_mode wodoc --output_dir output_1
```

All methods, all personas

```bash
python mydata_main.py --method all --persona_index all --mode all --chunk_mode wodoc --output_dir output_1
```

Multi-GPU run

```bash
CUDA_VISIBLE_DEVICES=0,1 python mydata_main.py --method naive_p --persona_index all --mode all --chunk_mode wodoc --output_dir output_1 --use_multi_gpu
```

Running a specific mode only

```bash
python mydata_main.py --method standard --persona_index all --mode indexing --chunk_mode wdoc --output_dir output_1
```
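When several of these commands need to be launched from a script, the argument list and the GPU environment variable can be assembled programmatically. The sketch below only rebuilds the command lines shown above. The helper names `build_command` and `build_env` are hypothetical and do not come from the repository.

```python
import os
from typing import Dict, List, Optional, Sequence


def build_command(method: str, persona_index, mode: str, chunk_mode: str,
                  output_dir: str, use_multi_gpu: bool = False) -> List[str]:
    """Assemble the argument list for one mydata_main.py invocation."""
    command = [
        "python", "mydata_main.py",
        "--method", method,
        "--persona_index", str(persona_index),
        "--mode", mode,
        "--chunk_mode", chunk_mode,
        "--output_dir", output_dir,
    ]
    if use_multi_gpu:
        command.append("--use_multi_gpu")
    return command


def build_env(gpu_ids: Optional[Sequence[int]] = None) -> Dict[str, str]:
    """Copy the current environment, optionally restricting the visible GPUs."""
    env = dict(os.environ)
    if gpu_ids:
        env["CUDA_VISIBLE_DEVICES"] = ",".join(str(gpu) for gpu in gpu_ids)
    return env
```

A run equivalent to the multi-GPU example would then be `subprocess.run(build_command("naive_p", "all", "all", "wodoc", "output_1", use_multi_gpu=True), env=build_env([0, 1]), check=True)`, executed from the directory that contains `mydata_main.py`.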

Owner

  • Login: CkdalsKong
  • Kind: user

GitHub Events

Total
  • Push event: 9
Last Year
  • Push event: 9

Dependencies

requirements.txt pypi
  • beautifulsoup4 >=4.9.3
  • faiss-cpu >=1.7.0
  • matplotlib >=3.4.0
  • numpy >=1.21.0
  • pandas >=1.3.0
  • python-dotenv >=0.19.0
  • pyyaml >=5.4.0
  • requests >=2.26.0
  • seaborn >=0.11.0
  • sentence-transformers >=2.2.0
  • torch >=1.9.0
  • tqdm >=4.62.0
  • transformers >=4.15.0
requirements_hippo.txt pypi
  • GitPython ==3.1.44
  • Jinja2 ==3.1.6
  • MarkupSafe ==3.0.2
  • PyYAML ==6.0.2
  • Pygments ==2.19.1
  • Send2Trash ==1.8.3
  • accelerate ==1.6.0
  • aiohappyeyeballs ==2.6.1
  • aiohttp ==3.11.16
  • aiohttp-cors ==0.8.1
  • aiosignal ==1.3.2
  • airportsdata ==20250224
  • annotated-types ==0.7.0
  • anyio ==4.9.0
  • argon2-cffi ==23.1.0
  • argon2-cffi-bindings ==21.2.0
  • arrow ==1.3.0
  • astor ==0.8.1
  • asttokens ==3.0.0
  • async-lru ==2.0.5
  • async-timeout ==5.0.1
  • attrs ==25.3.0
  • babel ==2.17.0
  • beautifulsoup4 ==4.13.4
  • blake3 ==1.0.4
  • bleach ==6.2.0
  • bs4 ==0.0.2
  • cachetools ==5.5.2
  • certifi ==2025.1.31
  • cffi ==1.17.1
  • charset-normalizer ==3.4.1
  • click ==8.1.8
  • cloudpickle ==3.1.1
  • colorful ==0.5.6
  • comm ==0.2.2
  • compressed-tensors ==0.8.1
  • contourpy ==1.3.2
  • cycler ==0.12.1
  • datasets ==2.21.0
  • debugpy ==1.8.14
  • decorator ==5.2.1
  • deepspeed ==0.16.9
  • defusedxml ==0.7.1
  • depyf ==0.18.0
  • dill ==0.3.8
  • diskcache ==5.6.3
  • distlib ==0.3.9
  • distro ==1.9.0
  • docker-pycreds ==0.4.0
  • einops ==0.8.1
  • eval_type_backport ==0.2.2
  • exceptiongroup ==1.2.2
  • executing ==2.2.0
  • faiss ==1.9.0
  • fastapi ==0.115.12
  • fastjsonschema ==2.21.1
  • filelock ==3.18.0
  • fonttools ==4.58.1
  • fqdn ==1.5.1
  • frozenlist ==1.5.0
  • fsspec ==2024.6.1
  • gguf ==0.10.0
  • gitdb ==4.0.12
  • google-api-core ==2.24.2
  • google-auth ==2.39.0
  • googleapis-common-protos ==1.70.0
  • gritlm ==1.0.2
  • grpcio ==1.71.0
  • h11 ==0.14.0
  • hipporag ==2.0.0a3
  • hjson ==3.1.0
  • httpcore ==1.0.8
  • httptools ==0.6.4
  • httpx ==0.28.1
  • huggingface-hub ==0.30.2
  • idna ==3.10
  • igraph ==0.11.8
  • importlib_metadata ==8.6.1
  • interegular ==0.3.3
  • ipykernel ==6.29.5
  • ipython ==8.35.0
  • isoduration ==20.11.0
  • jedi ==0.19.2
  • jiter ==0.9.0
  • joblib ==1.4.2
  • json5 ==0.12.0
  • jsonpointer ==3.0.0
  • jsonschema ==4.23.0
  • jsonschema-specifications ==2024.10.1
  • jupyter-events ==0.12.0
  • jupyter-lsp ==2.2.5
  • jupyter_client ==8.6.3
  • jupyter_core ==5.7.2
  • jupyter_server ==2.15.0
  • jupyter_server_terminals ==0.5.3
  • jupyterlab ==4.4.0
  • jupyterlab_pygments ==0.3.0
  • jupyterlab_server ==2.27.3
  • kiwisolver ==1.4.8
  • lark ==1.2.2
  • lm-format-enforcer ==0.10.11
  • markdown-it-py ==3.0.0
  • matplotlib ==3.10.3
  • matplotlib-inline ==0.1.7
  • mdurl ==0.1.2
  • mistral_common ==1.5.4
  • mistune ==3.1.3
  • mpmath ==1.3.0
  • msgpack ==1.1.0
  • msgspec ==0.19.0
  • mteb ==1.37.0
  • multidict ==6.4.3
  • multiprocess ==0.70.16
  • nbclient ==0.10.2
  • nbconvert ==7.16.6
  • nbformat ==5.10.4
  • nest-asyncio ==1.6.0
  • networkx ==3.4.2
  • ninja ==1.11.1.4
  • notebook_shim ==0.2.4
  • nvidia-cublas-cu12 ==12.4.5.8
  • nvidia-cuda-cupti-cu12 ==12.4.127
  • nvidia-cuda-nvrtc-cu12 ==12.4.127
  • nvidia-cuda-runtime-cu12 ==12.4.127
  • nvidia-cudnn-cu12 ==9.1.0.70
  • nvidia-cufft-cu12 ==11.2.1.3
  • nvidia-curand-cu12 ==10.3.5.147
  • nvidia-cusolver-cu12 ==11.6.1.9
  • nvidia-cusparse-cu12 ==12.3.1.170
  • nvidia-ml-py ==12.570.86
  • nvidia-nccl-cu12 ==2.21.5
  • nvidia-nvjitlink-cu12 ==12.4.127
  • nvidia-nvtx-cu12 ==12.4.127
  • openai ==1.58.1
  • opencensus ==0.11.4
  • opencensus-context ==0.1.3
  • opencv-python-headless ==4.11.0.86
  • outlines ==0.1.11
  • outlines_core ==0.1.26
  • overrides ==7.7.0
  • packaging ==24.2
  • pandas ==2.2.3
  • pandocfilters ==1.5.1
  • parso ==0.8.4
  • partial-json-parser ==0.2.1.1.post5
  • pexpect ==4.9.0
  • pillow ==11.2.1
  • platformdirs ==4.3.7
  • polars ==1.27.1
  • prometheus-fastapi-instrumentator ==7.1.0
  • prometheus_client ==0.21.1
  • prompt_toolkit ==3.0.51
  • propcache ==0.3.1
  • proto-plus ==1.26.1
  • protobuf ==5.29.4
  • psutil ==7.0.0
  • ptyprocess ==0.7.0
  • pure_eval ==0.2.3
  • py-cpuinfo ==9.0.0
  • py-spy ==0.4.0
  • pyarrow ==14.0.1
  • pyasn1 ==0.6.1
  • pyasn1_modules ==0.4.2
  • pycountry ==24.6.1
  • pycparser ==2.22
  • pydantic ==2.10.4
  • pydantic_core ==2.27.2
  • pyparsing ==3.2.3
  • python-dateutil ==2.9.0.post0
  • python-dotenv ==1.1.0
  • python-igraph ==0.11.8
  • python-json-logger ==3.3.0
  • pytrec_eval-terrier ==0.5.7
  • pytz ==2025.2
  • pyzmq ==26.4.0
  • ray ==2.44.1
  • referencing ==0.36.2
  • regex ==2024.11.6
  • requests ==2.32.3
  • rfc3339-validator ==0.1.4
  • rfc3986-validator ==0.1.1
  • rich ==14.0.0
  • rpds-py ==0.24.0
  • rsa ==4.9.1
  • safetensors ==0.5.3
  • scikit-learn ==1.6.1
  • scipy ==1.15.2
  • seaborn ==0.13.2
  • sentence-transformers ==4.1.0
  • sentencepiece ==0.2.0
  • sentry-sdk ==2.26.1
  • setproctitle ==1.3.5
  • six ==1.17.0
  • smart-open ==7.1.0
  • smmap ==5.0.2
  • sniffio ==1.3.1
  • soupsieve ==2.7
  • stack-data ==0.6.3
  • starlette ==0.46.2
  • sympy ==1.13.1
  • tenacity ==8.5.0
  • terminado ==0.18.1
  • texttable ==1.7.0
  • threadpoolctl ==3.6.0
  • tiktoken ==0.7.0
  • tinycss2 ==1.4.0
  • tokenizers ==0.20.3
  • tomli ==2.2.1
  • torch ==2.5.1
  • torchvision ==0.20.1
  • tornado ==6.4.2
  • tqdm ==4.67.1
  • traitlets ==5.14.3
  • transformers ==4.45.2
  • triton ==3.1.0
  • types-python-dateutil ==2.9.0.20241206
  • typing_extensions ==4.13.2
  • tzdata ==2025.2
  • uri-template ==1.3.0
  • urllib3 ==2.4.0
  • uvicorn ==0.34.1
  • uvloop ==0.21.0
  • virtualenv ==20.30.0
  • vllm ==0.6.6.post1
  • wandb ==0.19.9
  • watchfiles ==1.0.5
  • wcwidth ==0.2.13
  • webcolors ==24.11.1
  • webencodings ==0.5.1
  • websocket-client ==1.8.0
  • websockets ==15.0.1
  • wrapt ==1.17.2
  • xformers ==0.0.28.post3
  • xgrammar ==0.1.14
  • xxhash ==3.5.0
  • yarl ==1.19.0
  • zipp ==3.21.0