syncialo

Synthetic drop-in replacements for KIALO debate datasets

https://github.com/debatelab/syncialo

Repository

Basic Info

Host: GitHub
Owner: debatelab
License: apache-2.0
Language: Jupyter Notebook
Default Branch: main
Size: 2.22 MB

Statistics

Stars: 2
Watchers: 2
Forks: 0
Open Issues: 0
Releases: 0

Created over 2 years ago · Last pushed over 1 year ago

README.md

# syncIALO 🤖🗯️

[blog post](https://huggingface.co/blog/ggbetz/introducing-syncialo) | [syncialo-raw dataset](https://huggingface.co/datasets/DebateLabKIT/syncialo-raw)

What is this?

Synthetic drop-in replacements for Kialo debate datasets.

Why?

The Kialo debates are a 👑 gold mine for NLP researchers, AI engineers, computational sociologists, and Critical Thinking scholars. Yet, the mine is legally ⛔️ barred (for them): Debate data downloaded or scraped from the website may not be used for research or commercial purposes in the absence of explicit permission or license agreement.

That's why the DebateLab team has built this Python module for creating synthetic debate corpora, which may serve as drop-in replacements for the Kialo data. We synthesize such data from scratch by simulating multi-agent debate and collaborative argument mapping with 🤖 LLM-based agents.

Features

- permissive ODC license
- reproducible and extendable
- open-source codebase
- works with open LLMs
- one-line import as networkx graphs

Corpora

| id | llm | # debates | ~# claims | link | contributed by |
|---|---|---|---|---|---|
| syntheticcorpus-001 | Llama-3.1-405B-Instruct | 1000/50/50¹ | 560k/28k/28k¹ | HF hub→ | DebateLab² |
| syntheticcorpus-001-DE | Llama-3.1-SauerkrautLM-70b-Instruct³ | 1000/50/50¹ | 560k/28k/28k¹ | HF hub→ | DebateLab |

¹ per train / eval / test split
² with ❤️ generous support from 🤗 HuggingFace
³ as translator

Simulation Design

The following steps sketch the procedure by which debates are simulated:

1. Determine the debate's tag cloud by randomly sampling 8 topic tags.
2. Given the tag cloud, let 🤖 generate a debate topic (e.g., a question).
3. Given the topic, let 🤖 generate a suitable motion (i.e., the central claim).
4. Recursively generate an argument tree, starting with the motion as the target argument (code→):
   1. Let 🤖 identify the implicit premises of the target argument (code→).
   2. Let 🤖 generate k pros for different premises of the target argument (code→):
      - Choose the premise to target as a function of the premises' plausibility.
      - Let 🤖 assume a randomly sampled persona.
      - Generate 2k candidate arguments and select the k most salient ones.
   3. Let 🤖 generate k cons against different premises of the target argument (code→):
      - Choose the premise to target as a function of the premises' implausibility.
      - Let 🤖 assume a randomly sampled persona.
      - Generate 2k candidate arguments and select the k most salient ones.
   4. Check for and resolve duplicates via semantic similarity / vector store (code→).
   5. Add the pros and cons to the argument tree, and use each of them as a new target argument that is argued for and against, unless the maximum depth has been reached.
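
The procedure above can be sketched in a few dozen lines of plain Python. This is only an illustration of the control flow, not the repository's implementation: `StubLLM`, `TAGS`, `PERSONAS`, `Node`, `argue`, `grow`, and `simulate_debate` are hypothetical names, the LLM is replaced by a stub that returns numbered placeholder strings, plausibility and salience scores are random stand-ins, premise selection is assumed to be weighted random sampling, and duplicate detection is reduced to an exact-match check instead of semantic similarity against a vector store.

```python
import itertools
import random
from dataclasses import dataclass, field

# Hypothetical tag universe and personas, for illustration only.
TAGS = ["climate", "education", "privacy", "health", "trade", "ethics",
        "energy", "media", "housing", "space", "sports", "law"]
PERSONAS = ["economist", "teacher", "activist", "engineer"]


@dataclass
class Node:
    claim: str
    valence: str = "motion"  # "motion", "pro" or "con"
    children: list = field(default_factory=list)


class StubLLM:
    """Stand-in for a real LLM client: returns numbered placeholder strings."""

    def __init__(self):
        self._ids = itertools.count(1)

    def generate(self, instruction, n=1):
        return [f"{instruction} [{next(self._ids)}]" for _ in range(n)]


def argue(premises, valence, k, llm, rng, seen):
    """Generate 2k candidates, keep the k most salient, drop duplicates."""
    # Pros target plausible premises, cons target implausible ones
    # (weighted sampling is an assumption of this sketch).
    weights = [p if valence == "pro" else 1.0 - p for _, p in premises]
    candidates = []
    for _ in range(2 * k):
        premise, _ = rng.choices(premises, weights=weights)[0]
        persona = rng.choice(PERSONAS)
        text = llm.generate(f"{valence} by {persona} on {premise}")[0]
        candidates.append((rng.random(), text))  # random stand-in for salience
    candidates.sort(reverse=True)
    selected = []
    for _, text in candidates[:k]:
        key = text.strip().lower()  # stand-in for a semantic similarity check
        if key not in seen:
            seen.add(key)
            selected.append(Node(text, valence))
    return selected


def grow(target, depth, max_depth, k, llm, rng, seen):
    """Recursively argue for and against the target until max depth."""
    if depth >= max_depth:
        return
    # Implicit premises of the target, each with a stand-in plausibility score.
    premises = [(p, rng.uniform(0.05, 0.95)) for p in llm.generate("premise", 3)]
    for valence in ("pro", "con"):
        for child in argue(premises, valence, k, llm, rng, seen):
            target.children.append(child)
            grow(child, depth + 1, max_depth, k, llm, rng, seen)


def simulate_debate(seed=0, k=2, max_depth=2):
    rng = random.Random(seed)
    llm = StubLLM()
    tag_cloud = rng.sample(TAGS, 8)  # 8 randomly sampled topic tags
    topic = llm.generate("topic about " + ", ".join(tag_cloud))[0]
    motion = Node(llm.generate(f"motion for {topic}")[0])
    grow(motion, 0, max_depth, k, llm, rng, {motion.claim.lower()})
    return tag_cloud, topic, motion


def count_nodes(node):
    return 1 + sum(count_nodes(child) for child in node.children)
```

With `k=2` and `max_depth=2`, the stub produces a tree of 1 + 4 + 16 = 21 nodes, since every target receives two pros and two cons and the placeholder strings never collide. In a real run the duplicate check may remove arguments, so the tree can be smaller.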

Usage

Configure workflows/synthetic_corpus_generation.py. Then:

```sh
hatch shell
python workflows/synthetic_corpus_generation.py
```

Owner

Name: DebateLab @ KIT
Login: debatelab
Kind: organization

Website: https://debatelab.github.io/
Repositories: 5
Profile: https://github.com/debatelab

Citation (CITATION.cff)

cff-version: 1.2.0
title: >-
  syncIALO: Synthetic drop-in replacements for KIALO debate
  datasets
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: dataset
authors:
  - name: Gregor Betz
    website: 'https://gregorbetz.de'
repository-code: 'https://github.com/debatelab/syncIALO'

GitHub Events

Total

Watch event: 1
Public event: 1
Push event: 1

Last Year

Watch event: 1
Public event: 1
Push event: 1

Dependencies

pyproject.toml pypi

aiofiles *
commentjson *
datasets *
faiss-cpu *
langchain >=0.3,<0.4
langchain-huggingface >=0.1,<0.2
langchain-openai >=0.2,<0.3
langchain_community >=0.3
loguru *
networkx <3.5
prefect *
python-dotenv *
pyyaml *
tenacity *
ujson *
