telegram-dataset-builder

This code uses Telethon and the Telegram API to retrieve messages from public channels, which makes dataset creation fast and easy.

https://github.com/oeg-upm/telegram-dataset-builder

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (8.7%) to scientific vocabulary

Keywords

dataset telegram telethon
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: oeg-upm
  • License: mit
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 30.3 KB
Statistics
  • Stars: 3
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 2
Topics
dataset telegram telethon
Created over 1 year ago · Last pushed over 1 year ago
Metadata Files
Readme License Citation

README.md

Telegram Dataset Builder

This project uses Telethon to build datasets from public Telegram groups. For each group, it needs the channel name or ID. The output is a JSON file with the channel information, plus several JSON files organised in folders that contain the messages in batches.

Requirements

The required packages and versions are:

  • Telethon==1.35.0
  • python-dotenv==1.0.1

Telegram API credentials

The default configuration loads the credentials from a telegram.env file in the root folder. This file must follow the schema below (the phone number must include the country prefix):

PHONE_NUMBER= "+34..."
TELEGRAM_APP_ID = 9...
TELEGRAM_APP_HASH = "d..."
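
As an illustration of how such a file could be read and checked before connecting, here is a minimal sketch. It assumes the three keys shown above and uses `dotenv_values` from python-dotenv. The names `TelegramCredentials`, `validate_credentials` and `load_credentials` are hypothetical and not taken from the repository.

```python
from pathlib import Path
from typing import Mapping, NamedTuple, Optional


class TelegramCredentials(NamedTuple):
    """Hypothetical container for the three values stored in telegram.env."""
    phone_number: str
    app_id: int
    app_hash: str


def validate_credentials(values: Mapping[str, Optional[str]]) -> TelegramCredentials:
    """Check that the three expected keys are present and normalise their types."""
    required = ("PHONE_NUMBER", "TELEGRAM_APP_ID", "TELEGRAM_APP_HASH")
    missing = [key for key in required if not values.get(key)]
    if missing:
        raise ValueError(f"Missing keys in credentials file: {', '.join(missing)}")

    phone = values["PHONE_NUMBER"].strip()
    if not phone.startswith("+"):
        raise ValueError("PHONE_NUMBER must include the country prefix, e.g. +34")

    return TelegramCredentials(
        phone_number=phone,
        app_id=int(values["TELEGRAM_APP_ID"]),  # the API ID is numeric
        app_hash=values["TELEGRAM_APP_HASH"].strip(),
    )


def load_credentials(env_path: str = "telegram.env") -> TelegramCredentials:
    """Read the env file with python-dotenv and validate its contents."""
    # Imported lazily so the validation helper works without python-dotenv installed.
    from dotenv import dotenv_values

    if not Path(env_path).is_file():
        raise FileNotFoundError(f"Credentials file not found: {env_path}")
    return validate_credentials(dotenv_values(env_path))
```

Validating up front gives a clear error message when a key is missing or the phone number lacks its prefix, instead of a failure later during login.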

How to gather group messages?

To retrieve all messages from a set of groups, run dataset_creator.py after adjusting the following elements:

  1. Set channel_names = ["foo", "bar"] to the names of the channels you want to extract.
  2. Optionally, set a different BATCH_SIZE.
  3. If your Telegram credentials are stored in a different path, modify telegram_env_path.
  4. The output_chats_path is the folder where everything is stored, both the channel chats and the channel info. It can be modified.
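
The batching described above could look roughly like the following sketch. It is not the code of dataset_creator.py: the file names (`info.json`, `batch_00000.json`), the folder layout and the functions `write_batches` and `export_channel` are assumptions made for illustration. The Telethon calls used (`get_entity`, `iter_messages`, `to_dict`) are part of its public client API, and `client` is assumed to be an already started `TelegramClient`.

```python
import json
from pathlib import Path
from typing import Iterable, List


def write_batches(messages: Iterable[dict], channel_dir: Path, batch_size: int) -> List[Path]:
    """Write messages to numbered JSON files, batch_size messages per file."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    channel_dir.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    batch: List[dict] = []

    def flush() -> None:
        path = channel_dir / f"batch_{len(written):05d}.json"
        # default=str converts datetimes and bytes to strings so json can serialise them
        path.write_text(json.dumps(batch, default=str, ensure_ascii=False), encoding="utf-8")
        written.append(path)
        batch.clear()

    for message in messages:
        batch.append(message)
        if len(batch) == batch_size:
            flush()
    if batch:  # write the final, possibly smaller, batch
        flush()
    return written


async def export_channel(client, channel_name, output_chats_path: str, batch_size: int) -> List[Path]:
    """Save the channel info and all messages of one public channel (hypothetical layout)."""
    out_dir = Path(output_chats_path) / str(channel_name)
    out_dir.mkdir(parents=True, exist_ok=True)

    entity = await client.get_entity(channel_name)
    info = json.dumps(entity.to_dict(), default=str, ensure_ascii=False)
    (out_dir / "info.json").write_text(info, encoding="utf-8")

    # Collecting everything in memory keeps the sketch short; a very large
    # channel would be better served by flushing batches while iterating.
    messages = [m.to_dict() async for m in client.iter_messages(entity, reverse=True)]
    return write_batches(messages, out_dir / "messages", batch_size)
```

Keeping `write_batches` free of any Telegram dependency makes the batching logic easy to test with plain dictionaries.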

How to monitor group messages?

To monitor new messages sent to a set of groups, run engagement_monitor.py after adjusting the following elements:

  1. Set channel_names = ["foo", "bar"] to the names of the channels you want to monitor.
  2. Optionally, set a different BATCH_SIZE.
  3. If your Telegram credentials are stored in a different path, modify telegram_env_path.
  4. The output_chats is the folder where everything is stored, both the channel chats and the channel info. It can be modified.
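
One way to picture the monitoring flow is the sketch below, which buffers incoming messages and writes a JSON file every BATCH_SIZE messages. It is an assumption about how such a monitor could be built, not the code of engagement_monitor.py: the `BatchBuffer` class, the `monitor` function and the file naming are hypothetical. `events.NewMessage`, `add_event_handler` and `run_until_disconnected` are real Telethon APIs, and `client` is assumed to be an already started `TelegramClient`.

```python
import json
from pathlib import Path
from typing import List


class BatchBuffer:
    """Collect incoming messages and write a JSON file every batch_size messages."""

    def __init__(self, out_dir: Path, batch_size: int) -> None:
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self.out_dir = out_dir
        self.batch_size = batch_size
        self.pending: List[dict] = []
        self.files_written = 0

    def add(self, message: dict) -> bool:
        """Store one message; return True if this call triggered a flush."""
        self.pending.append(message)
        if len(self.pending) >= self.batch_size:
            self.flush()
            return True
        return False

    def flush(self) -> None:
        """Write pending messages to the next numbered file, if there are any."""
        if not self.pending:
            return
        self.out_dir.mkdir(parents=True, exist_ok=True)
        path = self.out_dir / f"batch_{self.files_written:05d}.json"
        # default=str converts datetimes and bytes to strings so json can serialise them
        path.write_text(json.dumps(self.pending, default=str, ensure_ascii=False), encoding="utf-8")
        self.files_written += 1
        self.pending = []


async def monitor(client, channel_names, output_chats: str, batch_size: int) -> None:
    """Listen for new messages in the given channels until the client disconnects."""
    from telethon import events  # imported lazily; provided by Telethon

    # A single shared buffer is a simplification; one buffer per channel is also possible.
    buffer = BatchBuffer(Path(output_chats), batch_size)

    async def on_new_message(event) -> None:
        buffer.add(event.message.to_dict())

    client.add_event_handler(on_new_message, events.NewMessage(chats=channel_names))
    try:
        await client.run_until_disconnected()
    finally:
        buffer.flush()  # keep messages that have not yet filled a batch
```

Flushing in the `finally` clause means a partially filled batch is still written when the monitor stops.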

Owner

  • Name: Ontology Engineering Group (UPM)
  • Login: oeg-upm
  • Kind: organization
  • Email: oeg-dev@delicias.dia.fi.upm.es
  • Location: Boadilla del Monte, Madrid, Spain

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Guillén-Pacho"
  given-names: "Ibai"
  orcid: "https://orcid.org/0000-0001-7801-8815"
title: "oeg-upm/telegram-dataset-builder"
doi: 10.5281/zenodo.12773159
date-released: 2024-07-18
url: "https://github.com/oeg-upm/telegram-dataset-builder"

GitHub Events

Total
  • Watch event: 2
Last Year
  • Watch event: 2