tiktok-content-scraper

TikTok Content Scraper -- No API-Key needed, minimal dependencies, citable | Download videos (MP4), slides (JPEG) and metadata of author, music, file, hashtags, content, interactions etc.

https://github.com/q-bukold/tiktok-content-scraper

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.8%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: Q-Bukold
  • License: other
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 1.02 MB
Statistics
  • Stars: 37
  • Watchers: 3
  • Forks: 9
  • Open Issues: 1
  • Releases: 3
Created about 1 year ago · Last pushed 6 months ago
Metadata Files
Readme License Citation

README.md

What is it?

This scraper allows you to download both TikTok videos and slides without an official API key. Additionally, it can scrape approximately 100 metadata fields related to the video, author, music, video file, and hashtags. The scraper is built as a Python class and can be extended via a custom subclass, allowing for easy integration with databases or other systems.

Features

  • Download TikTok videos (MP4) and slides (JPEGs + MP3).
  • Scrape extensive metadata.
  • Customizable and extendable via inheritance.
  • Supports batch processing and progress tracking.

> New feature: author metadata scraping!

Usage

Setup

  1. Clone the repository:

     ```bash
     git clone https://github.com/Q-Bukold/TikTok-Content-Scraper.git
     ```

  2. Install all dependencies in the requirements file:

     ```bash
     pip install -r requirements.txt
     ```

  3. Run the example script:

     ```bash
     python3 example_script.py
     ```

Scrape a single video or slide

To scrape the metadata and content of a video, the TikTok ID is required. It can be found in the URL of a video. Let's use the ID 7460303767968156958 to scrape the associated video.
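Since the ID appears in the video URL, it can also be pulled out programmatically. A minimal sketch (the `extract_tiktok_id` helper and its regex are illustrative, not part of the scraper's API):

```python
import re

def extract_tiktok_id(url: str) -> int:
    """Pull the numeric video ID out of a TikTok video URL (illustrative helper)."""
    match = re.search(r"/video/(\d+)", url)
    if match is None:
        raise ValueError(f"no video ID found in {url!r}")
    return int(match.group(1))

print(extract_tiktok_id("https://www.tiktok.com/@some_user/video/7460303767968156958"))
# → 7460303767968156958
```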

```python
from TT_Scraper import TT_Scraper

# Configure the scraper, this step is always needed
tt = TT_Scraper(wait_time=0.3, output_files_fp="data/")

# Download all metadata as a .json and all content as .mp4/.jpeg
tt.scrape(id=7460303767968156958, scrape_content=True, download_metadata=True, download_content=True)
```

Scrape a single user profile

To scrape the metadata of a user, the TikTok username is required (with or without an @). It can be found in the URL of a user profile. Let's use the username insidecdu to scrape the associated user profile.
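Because the handle is accepted with or without the @, a small normalization step before calling the scraper can keep input lists uniform (this helper is illustrative, not part of the scraper):

```python
def normalize_username(handle: str) -> str:
    """Strip surrounding whitespace and a leading @ from a TikTok handle (illustrative helper)."""
    return handle.strip().lstrip("@")

print(normalize_username("@insidecdu"))   # → insidecdu
print(normalize_username(" insidecdu "))  # → insidecdu
```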

```python
from TT_Scraper import TT_Scraper

# Configure the scraper, this step is always needed
tt = TT_Scraper(wait_time=0.3, output_files_fp="data/")

# Scrape user profile
tt.scrape_user(username="insidecdu", download_metadata=True)
```

Scrape multiple videos and slides

You can also scrape a list of IDs with the following code. The scraper detects on its own whether the content is a slide or a video.

```python
import pandas as pd
from TT_Scraper import TT_Scraper

# Configure the scraper, this step is always needed
tt = TT_Scraper(wait_time=0.3, output_files_fp="data/")

# Define a list of TikTok IDs (IDs can be strings or integers)
data = pd.read_csv("data/seedlist.csv")
my_list = data["ids"].tolist()

# Insert the list into the scraper
tt.scrape_list(ids=my_list, scrape_content=True, batch_size=None, clear_console=True)
```

The scrape_list function provides a useful overview of your progress. Enable clear_console to clear the terminal output after every scrape. Note that clear_console does not work on Windows machines.

```
Queue Information:
Current Queue: 691 / 163,336
Errors in a row: 0
1.10 iteration time
2.89 sec. per video (averaged)
ETA (current queue): 5 days, 10:23:19

-> id 7359982080861703457
-> is slide with 17 pictures
```
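If you do need screen clearing on Windows, a generic cross-platform clear can serve as a workaround (this helper is not part of the scraper; it is a common stdlib-only pattern):

```python
import os

def clear_console() -> None:
    """Clear the terminal: 'cls' on Windows, 'clear' on POSIX systems."""
    os.system("cls" if os.name == "nt" else "clear")
```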

Scrape multiple user profiles

Development in progress...

Citation

Bukold, Q. (2025). TikTok Content Scraper (Version 1.0) [Computer software]. Weizenbaum Institute. https://doi.org/10.34669/WI.RD/4

Advanced Usage

Alternatives to saving the data on drive

The scraper can either download metadata and content (video file, images) to disk or return them as variables. Metadata is returned as a dictionary or saved as a .json file, and content is saved as .mp4 / .jpeg + .mp3 or returned as an array of binaries. Remember the rule: what is not downloaded is returned.

```python
from TT_Scraper import TT_Scraper

# Configure the scraper, this step is always needed
tt = TT_Scraper(wait_time=0.3, output_files_fp="data/")

# Downloading everything
tt.scrape(id=7460303767968156958, scrape_content=True, download_metadata=True, download_content=True)

# Returning everything
metadata, content = tt.scrape(id=7460303767968156958, scrape_content=True, download_metadata=False, download_content=False)

# Returning one of the two and downloading the other
metadata = tt.scrape(id=7460303767968156958, scrape_content=True, download_metadata=False, download_content=True)
```

Alternatives to saving the data on the drive II: Overwriting the _download_data function

Changing the output of scrape_list() is a bit more difficult, but can be achieved by overwriting a function called _download_data() that is part of the TT_Scraper class. To overwrite the function, one must inherit the class. The variable metadata_batch is a list of dictionaries, each containing all the metadata of a video/slide as well as the binary content of a video/slide.

Let's save the content, but insert the metadata into a database:

```python
from TT_Scraper import TT_Scraper

# Create a new class that inherits the TT_Scraper
class TT_Scraper_DB(TT_Scraper):
    def __init__(self, wait_time = 0.35, output_files_fp = "data/"):
        super().__init__(wait_time, output_files_fp)

    # Overwriting the _download_data function to upsert metadata into a database
    def _download_data(self, metadata_batch, download_metadata = True, download_content = True):
        for metadata_package in metadata_batch:
            # Insert metadata into the database
            self.insert_metadata_to_db(metadata_package)

        # Downloading content
        super()._download_data(metadata_batch, download_metadata=False, download_content=True)

    def insert_metadata_to_db(self, metadata_package):
        ...
        return None

tt = TT_Scraper_DB(wait_time = 0.35, output_files_fp = "data/")
tt.scrape_list(my_list)
```

Owner

  • Name: Quentin Bukold
  • Login: Q-Bukold
  • Kind: user

Student at the University of Hildesheim, B.A. Digitale Sozialwissenschaften (Digital Social Sciences)

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Bukold"
  given-names: "Quentin"
title: "TikTok-Content-Scraper"
version: 1.0
date-released: 2025-02-12
identifiers:
  - type: doi
    value: https://doi.org/10.34669/WI.RD/4
url: "https://www.weizenbaum-library.de/handle/id/814"

GitHub Events

Total
  • Create event: 10
  • Release event: 4
  • Issues event: 4
  • Watch event: 35
  • Delete event: 4
  • Member event: 1
  • Issue comment event: 5
  • Public event: 1
  • Push event: 76
  • Pull request event: 15
  • Fork event: 11
Last Year
  • Create event: 10
  • Release event: 4
  • Issues event: 4
  • Watch event: 35
  • Delete event: 4
  • Member event: 1
  • Issue comment event: 5
  • Public event: 1
  • Push event: 76
  • Pull request event: 15
  • Fork event: 11

Issues and Pull Requests

All Time
  • Total issues: 2
  • Total pull requests: 7
  • Average time to close issues: about 13 hours
  • Average time to close pull requests: about 10 hours
  • Total issue authors: 2
  • Total pull request authors: 3
  • Average comments per issue: 1.5
  • Average comments per pull request: 0.29
  • Merged pull requests: 7
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 2
  • Pull requests: 7
  • Average time to close issues: about 13 hours
  • Average time to close pull requests: about 10 hours
  • Issue authors: 2
  • Pull request authors: 3
  • Average comments per issue: 1.5
  • Average comments per pull request: 0.29
  • Merged pull requests: 7
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • tomasruizt (1)
  • Q-Bukold (1)
Pull Request Authors
  • Q-Bukold (5)
  • FeLoe (1)
  • mrtn3000 (1)