syncIALO
Synthetic drop-in replacements for KIALO debate datasets
Science Score: 44.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ✓ CITATION.cff file (found)
- ✓ codemeta.json file (found)
- ✓ .zenodo.json file (found)
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (8.9%) to scientific vocabulary
Repository
Synthetic drop-in replacements for KIALO debate datasets
Basic Info
- Host: GitHub
- Owner: debatelab
- License: apache-2.0
- Language: Jupyter Notebook
- Default Branch: main
- Size: 2.22 MB
Statistics
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
What is this?
Synthetic drop-in replacements for Kialo debate datasets.
Why?
The Kialo debates are a 👑 gold mine for NLP researchers, AI engineers, computational sociologists, and Critical Thinking scholars. Yet the mine is legally ⛔️ barred (for them): debate data downloaded or scraped from the website may not be used for research or commercial purposes in the absence of explicit permission or a license agreement.
That's why the DebateLab team has built this Python module for creating synthetic debate corpora, which may serve as drop-in replacements for the Kialo data. We synthesize the data from scratch, simulating multi-agent debate and collaborative argument-mapping with 🤖 LLM-based agents.
Features
- permissive ODC license
- reproducible and extendable
- open-source code base
- works with open LLMs
- one-line-import as networkx graphs
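To illustrate what "one-line-import as networkx graphs" buys you, here is a minimal sketch of a debate represented as a `networkx` directed graph. The node and edge attribute names (`claim`, `valence`) are illustrative assumptions, not the corpus' actual schema:

```python
import networkx as nx

# Hypothetical sketch of a syncIALO debate as a directed graph.
# Attribute names below are assumptions for illustration only.
debate = nx.DiGraph()
debate.add_node(0, claim="The motion (central claim).")
debate.add_node(1, claim="A supporting argument.")
debate.add_node(2, claim="An attacking argument.")
debate.add_edge(1, 0, valence="pro")  # node 1 supports node 0
debate.add_edge(2, 0, valence="con")  # node 2 attacks node 0

# Standard graph analyses then work out of the box, e.g. listing
# all nodes that support another claim:
pros = [u for u, _, d in debate.edges(data=True) if d["valence"] == "pro"]
print(pros)  # [1]
```

Once a corpus split is loaded into such a graph, the full `networkx` toolbox (depth, degree distributions, subtree extraction) applies directly.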
Corpora
| id | llm | # debates | ~# claims | link | contributed by |
|---|---|---|---|---|---|
| syntheticcorpus-001 | Llama-3.1-405B-Instruct | 1000/50/50¹ | 560k/28k/28k¹ | HF hub→ | DebateLab² |
| syntheticcorpus-001-DE | Llama-3.1-SauerkrautLM-70b-Instruct³ | 1000/50/50¹ | 560k/28k/28k¹ | HF hub | DebateLab |
¹ per train / eval / test split
² with ❤️ generous support from 🤗 HuggingFace
³ as translator
Simulation Design
The following steps sketch the procedure by which debates are simulated:

- Determine the debate's `tag cloud` by randomly sampling 8 topic tags.
- Given the `tag cloud`, let 🤖 generate a debate `topic` (e.g., a question).
- Given the `topic`, let 🤖 generate a suitable `motion` (i.e., the central claim).
- Recursively generate an argument tree, starting with the `motion` as `target argument` (code→):
  - Let 🤖 identify the implicit `premises` of the `target argument` (code→).
  - Let 🤖 generate k `pros` for different `premises` of the `target argument` (code→):
    - Choose the `premise` to target as a function of the `premises`' plausibility.
    - Let 🤖 assume a randomly sampled persona.
    - Generate 2k candidate arguments and select the k most salient ones.
  - Let 🤖 generate k `cons` against different `premises` of the `target argument` (code→):
    - Choose the `premise` to target as a function of the `premises`' implausibility.
    - Let 🤖 assume a randomly sampled persona.
    - Generate 2k candidate arguments and select the k most salient ones.
  - Check for and resolve duplicates via semantic similarity / vector store (code→).
  - Add `pros` and `cons` to the argument tree, and use each of them as a new `target argument` that is argued for and against, unless the max depth has been reached.
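The recursive loop above can be sketched in plain Python. The LLM calls are stubbed with deterministic placeholders; helper names like `generate_candidates` and `select_salient` are illustrative assumptions, not the repo's actual API:

```python
# Toy sketch of the recursive argument-tree simulation.
# LLM steps are replaced by deterministic stubs for illustration.

def generate_candidates(target, valence, n):
    """Stand-in for the 🤖 step that drafts n candidate arguments."""
    return [f"{valence}-{target}-{i}" for i in range(n)]

def select_salient(candidates, k):
    """Stand-in for salience-based selection (here: simply first k)."""
    return candidates[:k]

def grow_tree(target, depth, max_depth, k=2, tree=None):
    """Recursively attach k pros and k cons to each target argument."""
    if tree is None:
        tree = {}
    if depth >= max_depth:
        return tree
    children = []
    for valence in ("pro", "con"):
        # draft 2k candidates, keep the k most salient ones
        candidates = generate_candidates(target, valence, 2 * k)
        children += select_salient(candidates, k)
    tree[target] = children
    for child in children:
        grow_tree(child, depth + 1, max_depth, k, tree)
    return tree

tree = grow_tree("motion", depth=0, max_depth=2)
# Each expanded node gets 2 pros + 2 cons: the motion plus its
# 4 children are expanded, yielding 5 entries in the tree dict.
```

In the real pipeline, the stubbed steps call LLM-based agents with sampled personas, and a vector store deduplicates near-identical claims before they enter the tree.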
Usage
Configure `workflows/synthetic_corpus_generation.py`. Then:

```sh
hatch shell
python workflows/synthetic_corpus_generation.py
```
Owner
- Name: DebateLab @ KIT
- Login: debatelab
- Kind: organization
- Website: https://debatelab.github.io/
- Repositories: 5
- Profile: https://github.com/debatelab
Citation (CITATION.cff)
```yaml
cff-version: 1.2.0
title: >-
  syncIALO: Synthetic drop-in replacements for KIALO debate
  datasets
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: dataset
authors:
  - name: Gregor Betz
    website: 'https://gregorbetz.de'
repository-code: 'https://github.com/debatelab/syncIALO'
```
GitHub Events
Total
- Watch event: 1
- Public event: 1
- Push event: 1
Last Year
- Watch event: 1
- Public event: 1
- Push event: 1
Dependencies
- aiofiles *
- commentjson *
- datasets *
- faiss-cpu *
- langchain >=0.3,<0.4
- langchain-huggingface >=0.1,<0.2
- langchain-openai >=0.2,<0.3
- langchain_community >=0.3
- loguru *
- networkx <3.5
- prefect *
- python-dotenv *
- pyyaml *
- tenacity *
- ujson *