syncialo

Synthetic drop-in replacements for KIALO debate datasets

https://github.com/debatelab/syncialo


Repository


Basic Info
  • Host: GitHub
  • Owner: debatelab
  • License: apache-2.0
  • Language: Jupyter Notebook
  • Default Branch: main
  • Size: 2.22 MB
Statistics
  • Stars: 2
  • Watchers: 2
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created about 2 years ago · Last pushed about 1 year ago

README.md

# syncIALO 🤖🗯️

[blog post](https://huggingface.co/blog/ggbetz/introducing-syncialo) | [syncialo-raw dataset](https://huggingface.co/datasets/DebateLabKIT/syncialo-raw)

What is this?

Synthetic drop-in replacements for Kialo debate datasets.

Why?

The Kialo debates are a 👑 gold mine for NLP researchers, AI engineers, computational sociologists, and Critical Thinking scholars. Yet, the mine is legally ⛔️ barred (for them): Debate data downloaded or scraped from the website may not be used for research or commercial purposes in the absence of explicit permission or license agreement.

That's why the DebateLab team has built this Python module for creating synthetic debate corpora, which may serve as drop-in replacements for the Kialo data. We're synthesizing such data from scratch, simulating multi-agent debate and collaborative argument-mapping with 🤖 LLM-based agents.

Features

  • permissive ODC license
  • reproducible and extendable
  • open-source codebase
  • works with open LLMs
  • one-line-import as networkx graphs
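
Once a debate is available as a networkx graph, ordinary graph operations apply to it. The sketch below uses a small hand-built graph rather than the actual corpus files; the node attribute `claim`, the edge attribute `valence`, the child-to-parent edge direction, and the helper `direct_reasons` are assumptions made for illustration, not the confirmed syncIALO schema.

```python
import networkx as nx

# Toy debate: edges point from an argument to the claim it targets (assumed convention).
debate = nx.DiGraph()
debate.add_node("m", claim="Cities should ban private cars.")
debate.add_node("p1", claim="Fewer cars mean cleaner air.")
debate.add_node("c1", claim="Many commuters have no alternative.")
debate.add_node("c2", claim="Public transport can be expanded.")
debate.add_edge("p1", "m", valence="pro")
debate.add_edge("c1", "m", valence="con")
debate.add_edge("c2", "c1", valence="con")


def direct_reasons(graph, node, valence):
    """Return the claims that directly support ("pro") or attack ("con") a node."""
    return [
        graph.nodes[source]["claim"]
        for source, _target, data in graph.in_edges(node, data=True)
        if data["valence"] == valence
    ]


# The motion is the only node without an outgoing edge.
root = [n for n in debate if debate.out_degree(n) == 0]
# Depth of the argument tree, counted in edges along the longest chain.
depth = nx.dag_longest_path_length(debate)

print(root, depth, direct_reasons(debate, "m", "con"))
```

Inspect the node and edge attributes of a real debate before adapting this, since the attribute names in the published corpora may differ.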

Corpora

| id | llm | # debates | ~# claims | link | contributed by |
|---|---|---|---|---|---|
| syntheticcorpus-001 | Llama-3.1-405B-Instruct | 1000/50/50¹ | 560k/28k/28k¹ | HF hub→ | DebateLab² |
| syntheticcorpus-001-DE | Llama-3.1-SauerkrautLM-70b-Instruct³ | 1000/50/50¹ | 560k/28k/28k¹ | HF hub→ | DebateLab |

¹ per train / eval / test split
² with ❤️ generous support from 🤗 HuggingFace
³ as translator

Simulation Design

The following steps sketch the procedure by which debates are simulated:

  1. Determine the debate's tag cloud by randomly sampling 8 topic tags.
  2. Given the tag cloud, let 🤖 generate a debate topic (e.g., a question).
  3. Given the topic, let 🤖 generate a suitable motion (i.e., the central claim).
  4. Recursively generate an argument tree, starting with the motion as target argument (code→):
    1. Let 🤖 identify the implicit premises of the target argument (code→).
    2. Let 🤖 generate k pros for different premises of the target argument (code→):
      • Choose the premise to target as a function of the premises' plausibility.
      • Let 🤖 assume randomly sampled persona.
      • Generate 2k candidate arguments and select k most salient ones.
    3. Let 🤖 generate k cons against different premises of the target argument (code→):
      • Choose the premise to target as a function of the premises' implausibility.
      • Let 🤖 assume randomly sampled persona.
      • Generate 2k candidate arguments and select k most salient ones.
    4. Check for and resolve duplicates via semantic similarity / vector store (code→).
    5. Add pros and cons to argument tree, and use each of these as new target argument that is argued for and against, unless max depth has been reached.
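
The procedure above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the repository's implementation: every LLM call is replaced by a deterministic stub (`stub_premises`, `stub_argument`), the tag pool and personas are invented, premises are sampled with replacement using plausibility as the weight, and duplicate detection uses normalized exact matching as a stand-in for the semantic-similarity check.

```python
import itertools
import random
from dataclasses import dataclass, field

# Hypothetical tag pool and personas; the real workflow uses its own lists.
TAG_POOL = ["climate", "ethics", "health", "media", "law",
            "education", "economy", "technology", "sports", "art"]
PERSONAS = ["economist", "nurse", "student", "engineer"]


@dataclass
class Argument:
    text: str
    valence: str  # "motion", "pro", or "con"
    children: list["Argument"] = field(default_factory=list)


def stub_premises(target: Argument) -> dict[str, float]:
    """Stand-in for the LLM call that lists implicit premises with plausibility."""
    return {f"premise {i} of <{target.text}>": p for i, p in enumerate((0.9, 0.5, 0.2))}


def stub_argument(premise, valence, persona, rng, ids):
    """Stand-in for the LLM call that drafts one candidate; returns (text, salience)."""
    return f"{valence} #{next(ids)} by {persona} on {premise}", rng.random()


def expand(target, depth, max_depth, k, rng, seen, ids):
    """Recursively attach up to k pros and k cons to the target argument."""
    if depth >= max_depth:
        return
    premises = stub_premises(target)
    for valence in ("pro", "con"):
        # Pros favour plausible premises, cons favour implausible ones.
        weights = [p if valence == "pro" else 1.0 - p for p in premises.values()]
        picked = rng.choices(list(premises), weights=weights, k=2 * k)
        persona = rng.choice(PERSONAS)
        candidates = [stub_argument(p, valence, persona, rng, ids) for p in picked]
        candidates.sort(key=lambda c: c[1], reverse=True)  # most salient first
        for text, _salience in candidates[:k]:
            key = " ".join(text.lower().split())
            if key in seen:  # crude duplicate check (assumption, see lead-in)
                continue
            seen.add(key)
            child = Argument(text, valence)
            target.children.append(child)
            expand(child, depth + 1, max_depth, k, rng, seen, ids)


def simulate_debate(seed=0, k=2, max_depth=2):
    rng = random.Random(seed)
    tags = rng.sample(TAG_POOL, 8)                       # step 1: tag cloud
    topic = f"What follows from {', '.join(tags)}?"      # step 2: stubbed topic
    motion = Argument(f"Motion on: {topic}", "motion")   # step 3: stubbed motion
    expand(motion, 0, max_depth, k, rng, set(), itertools.count())  # step 4
    return tags, motion


def size(node: Argument) -> int:
    """Count all arguments in the tree, including the motion."""
    return 1 + sum(size(child) for child in node.children)


tags, motion = simulate_debate()
print(len(tags), size(motion))
```

With real LLM calls, the stubs would return generated text and model-assigned scores, and the duplicate check would compare embeddings rather than normalized strings.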

Usage

Configure workflows/synthetic_corpus_generation.py. Then:

```sh
hatch shell
python workflows/synthetic_corpus_generation.py
```

Owner

  • Name: DebateLab @ KIT
  • Login: debatelab
  • Kind: organization

Citation (CITATION.cff)

cff-version: 1.2.0
title: >-
  syncIALO: Synthetic drop-in replacements for KIALO debate
  datasets
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: dataset
authors:
  - name: Gregor Betz
    website: 'https://gregorbetz.de'
repository-code: 'https://github.com/debatelab/syncIALO'

GitHub Events

Total
  • Watch event: 1
  • Public event: 1
  • Push event: 1
Last Year
  • Watch event: 1
  • Public event: 1
  • Push event: 1

Dependencies

pyproject.toml pypi
  • aiofiles *
  • commentjson *
  • datasets *
  • faiss-cpu *
  • langchain >=0.3,<0.4
  • langchain-huggingface >=0.1,<0.2
  • langchain-openai >=0.2,<0.3
  • langchain_community >=0.3
  • loguru *
  • networkx <3.5
  • prefect *
  • python-dotenv *
  • pyyaml *
  • tenacity *
  • ujson *