subtitle-word-frequencies

Analyse word frequencies from webVTT subtitles

https://github.com/uudigitalhumanitieslab/subtitle-word-frequencies

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 3 DOI reference(s) in README
✓ Academic publication links: Links to zenodo.org
○ Committers with academic emails
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (14.9%) to scientific vocabulary

Keywords

nlp word-frequency

Last synced: 11 months ago

Repository

Analyse word frequencies from webVTT subtitles

Basic Info

Host: GitHub
Owner: UUDigitalHumanitieslab
License: bsd-3-clause
Language: Python
Default Branch: develop
Homepage:
Size: 116 KB

Statistics

Stars: 0
Watchers: 5
Forks: 0
Open Issues: 2
Releases: 2

Topics

nlp word-frequency

Created over 3 years ago · Last pushed over 2 years ago

Metadata Files

Readme License Citation

Subtitle word frequencies

This repository contains Python scripts to extract word frequency data from a collection of subtitle files.

Notable features:

- Frequency lists can be converted to the format used by T-scan.
- The total data per genre can be summarised based on an accompanying metadata file.
- Text can be lemmatised using Frog or spaCy.

The purpose of this repository is to provide transparency in our data processing and to make it easier to repeat the frequency analysis on newer data in the future. It was not developed for general use, but we include a licence for reuse (see below).

Data

The scripts are designed for a collection of subtitles from the NPO (the Dutch public broadcaster). This dataset is not provided in this repository and is not publicly available due to copyright restrictions. The Research Software Lab works on this data in agreement with the NPO, but we cannot share the data with others.

Our data encodes subtitles as WebVTT files, with an accompanying metadata file included as an .xlsx file.

Scripts

Scripts are written in Python and are structured into the following modules:

analysis for counting and lemmatising extracted text
metadata for parsing the metadata file to see the distribution of genres
tscan for converting frequency data to the format used by T-scan
vtt for extracting plain-text data from .vtt files

Requirements

You'll need:

Python 3.10 or higher
pip

Install the required Python packages with:

    pip install -r requirements.txt

Lemmatisers

To perform lemmatisation, you'll also need to download data for spaCy and/or Frog.

After installing the requirements, run:

    python -m spacy download nl_core_news_sm
    python -c "import frog; frog.installdata()"

Usage

The following commands are supported.

Summary of genres

You can create a CSV file that lists the genres, with the number of files and the total runtime per genre, as specified in a metadata spreadsheet. To do so, run:

    python -m metadata.summary

This creates a summary of the metadata file located in /data, which works when the data folder contains a single .xlsx file.

You can also specify the location:

    python -m metadata.summary path/to/metadata.xlsx path/to/output.csv
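The kind of per-genre summary described above can be sketched with pandas, which is listed in the dependencies. This is only an illustration, not the repository's implementation: the column names `filename`, `genre` and `runtime_seconds` are assumptions, as is the in-memory example data standing in for the .xlsx file.

```python
import pandas as pd


def summarise_genres(metadata: pd.DataFrame) -> pd.DataFrame:
    """Count files and sum runtime per genre.

    The column names used here are hypothetical; the real metadata
    spreadsheet may be laid out differently.
    """
    return (
        metadata.groupby("genre", as_index=False)
        .agg(files=("filename", "count"), total_runtime=("runtime_seconds", "sum"))
        .sort_values("genre")
        .reset_index(drop=True)
    )


# Made-up rows standing in for a spreadsheet loaded with pd.read_excel(...)
example = pd.DataFrame(
    {
        "filename": ["a.vtt", "b.vtt", "c.vtt"],
        "genre": ["news", "drama", "news"],
        "runtime_seconds": [600, 2700, 900],
    }
)
summary = summarise_genres(example)
print(summary.to_csv(index=False))
```

With a real spreadsheet, the data frame would presumably come from `pd.read_excel`, which relies on openpyxl for .xlsx files.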

Export plain text of VTT files

Takes a directory containing .vtt files as input and converts the contents to plain text files.

    python -m vtt.convert_to_plain path/to/data

For each *.vtt file in the provided directory, the script will save a file next to it named *.plain.txt. This file contains the text of the subtitles, with one line per segment.

The script filters out some common non-utterances that appear in captions, e.g. (muziek), APPLAUS EN GEJUICH, 888.
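As a rough illustration of this step, the sketch below pulls cue text out of a WebVTT string and drops non-utterances. It is not the code in the vtt module: the function names and the filtering rules (parenthesised cues, all-caps cues and the literal 888) are assumptions derived from the examples above, and the actual script may filter differently.

```python
import re

# A cue timing line such as "00:00:01.000 --> 00:00:03.000" (hours are optional)
TIMING = re.compile(r"^(?:\d{2,}:)?\d{2}:\d{2}\.\d{3}\s+-->\s+(?:\d{2,}:)?\d{2}:\d{2}\.\d{3}")


def is_non_utterance(text: str) -> bool:
    """Heuristic filter (an assumption, not the repository's rules)."""
    if text == "888":
        return True
    if re.fullmatch(r"\(.*\)", text):  # e.g. "(muziek)"
        return True
    letters = [char for char in text if char.isalpha()]
    return bool(letters) and all(char.isupper() for char in letters)


def vtt_to_plain(vtt_text: str) -> list[str]:
    """Return one string per subtitle segment, skipping non-utterances."""
    segments = []
    for block in re.split(r"\n\s*\n", vtt_text.strip()):
        lines = block.splitlines()
        timing_index = next(
            (i for i, line in enumerate(lines) if TIMING.match(line)), None
        )
        if timing_index is None:
            continue  # header or other block without a timing line
        text = " ".join(line.strip() for line in lines[timing_index + 1:])
        text = re.sub(r"<[^>]+>", "", text).strip()  # drop inline tags like <i>
        if text and not is_non_utterance(text):
            segments.append(text)
    return segments


sample = """WEBVTT

1
00:00:01.000 --> 00:00:03.000
Goedenavond, welkom
bij het journaal.

2
00:00:03.500 --> 00:00:05.000
(muziek)

3
00:00:05.000 --> 00:00:06.000
888

4
00:00:06.000 --> 00:00:08.000
APPLAUS EN GEJUICH

5
00:00:08.000 --> 00:00:10.000
<i>Dank u wel.</i>
"""
plain_segments = vtt_to_plain(sample)
print("\n".join(plain_segments))
```

Note that the all-caps rule would also discard short capitalised utterances such as "OK", which is one reason to treat it as a heuristic only.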

Lemmatise plain text exports

After generating plain text files as above, you can generate a lemmatised version using either Frog or spaCy.

    python -m analysis.lemmatize path/to/data [--frog|--spacy]

The data directory is the same directory in which you ran vtt.convert_to_plain; it should contain the *.plain.txt files generated by that script. For each file, the lemmatisation script will generate *.lemmas.txt, which contains the lemmatised text.

Use the --frog or --spacy flag to set the lemmatiser. Frog is the default: it is also used in T-scan, so results are more likely to match. However, at the time of writing, spaCy is much faster than Frog.

Count token frequencies

You can count token frequencies in the cleaned files (generated by vtt.convert_to_plain or analysis.lemmatize) and export them to a CSV file with:

    python -m analysis.collect_counts path/to/data

Use the option --level lemma to count in the lemmatised files. You can also specify the input directory and the output location:

    python -m analysis.collect_counts path/to/data --output path/to/output.csv --level lemma

The resulting csv file lists the frequency for each word or lemma.
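The counting step can be sketched with the standard library alone. This is an illustration rather than the analysis module itself: the tokenisation (lowercasing and splitting on `\w+`) and the CSV header names `token` and `count` are assumptions.

```python
import csv
import io
import re
from collections import Counter


def count_tokens(lines):
    """Count lowercased word tokens over an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(re.findall(r"\w+", line.lower()))
    return counts


def counts_to_csv(counts):
    """Serialise counts as CSV text, most frequent token first."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["token", "count"])  # hypothetical header names
    writer.writerows(counts.most_common())
    return buffer.getvalue()


counts = count_tokens(["De kat zit op de mat.", "De hond blaft."])
csv_text = counts_to_csv(counts)
print(csv_text)
```

In practice the lines would be read from the *.plain.txt or *.lemmas.txt files rather than from a list literal.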

Convert frequencies to T-scan format

You can convert the output of the previous step into a file formatted for T-scan.

    python -m tscan --input path/to/input.csv --output path/to/output

The output is a tab-separated file without headers. Each row represents a term. Rows are sorted from most to least frequent and list:

the term
the absolute frequency
the cumulative absolute frequency
the cumulative percentile frequency
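The four columns listed above can be derived from a plain frequency table as in the sketch below. It is not the tscan module's code: it assumes the last column is the cumulative frequency expressed as a percentage of all tokens, and both the tie-breaking order (alphabetical) and the number formatting are arbitrary choices.

```python
def to_tscan_rows(counts):
    """Build (term, frequency, cumulative frequency, cumulative %) rows.

    Rows are sorted from most to least frequent; ties are broken
    alphabetically, which is an assumption of this sketch.
    """
    total = sum(counts.values())
    rows = []
    cumulative = 0
    for term, frequency in sorted(counts.items(), key=lambda item: (-item[1], item[0])):
        cumulative += frequency
        rows.append((term, frequency, cumulative, 100 * cumulative / total))
    return rows


def format_tscan(rows):
    """Join rows into tab-separated lines without a header."""
    return "\n".join(
        f"{term}\t{frequency}\t{cumulative}\t{percentage:.4f}"
        for term, frequency, cumulative, percentage in rows
    )


tscan_rows = to_tscan_rows({"kat": 1, "de": 3})
tscan_text = format_tscan(tscan_rows)
print(tscan_text)
```

The last row always reaches 100%, which is a quick sanity check on any converted file.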

Developing

Unit tests

Run the unit tests with:

    pytest

To add new Python packages, add them to requirements.in and run:

    pip-compile requirements.in --output-file requirements.txt

Licence

This repository is shared under a BSD 3-Clause licence.

Owner

Name: UU Digital Humanities Lab
Login: UUDigitalHumanitieslab
Kind: organization
Email: digitalhumanities@uu.nl
Location: Utrecht

Website: https://cdh.uu.nl/rsl/
Repositories: 102
Profile: https://github.com/UUDigitalHumanitieslab

Research Software Lab · Centre for Digital Humanities · Utrecht University

Citation (CITATION.cff)

cff-version: 1.2.0
title: Subtitle word frequencies
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - name: 'Research Software Lab, Centre for Digital Humanities, Utrecht University'
    website: >-
      https://cdh.uu.nl/centre-for-digital-humanities/research-software-lab/
    city: Utrecht
    country: NL
    email: cdh@uu.nl
repository-code: 'https://github.com/UUDigitalHumanitieslab/subtitle-word-frequencies'
identifiers:
  - type: doi
    value: 10.5281/zenodo.10607189
license: BSD-3-Clause
version: 0.0.0
date-released: '2024-02-01'

Committers

Last synced: about 1 year ago

All Time

Total Commits: 60
Total Committers: 1
Avg Commits per committer: 60.0
Development Distribution Score (DDS): 0.0

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
lukavdplas	l**s@g**m	60

Issues and Pull Requests

Last synced: about 1 year ago

All Time

Total issues: 3
Total pull requests: 0
Average time to close issues: 9 months
Average time to close pull requests: N/A
Total issue authors: 1
Total pull request authors: 0
Average comments per issue: 0.33
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

lukavdplas (3)

Pull Request Authors

Top Labels

Issue Labels

Pull Request Labels

Dependencies

requirements.in pypi

click *
openpyxl *
pandas *
pytest *
python-frog *
scikit-learn *
spacy *

requirements.txt pypi

annotated-types ==0.6.0
attrs ==22.2.0
blis ==0.7.11
catalogue ==2.0.10
certifi ==2023.11.17
charset-normalizer ==3.3.2
click ==8.1.7
cloudpathlib ==0.16.0
confection ==0.1.4
cymem ==2.0.8
cython ==3.0.6
et-xmlfile ==1.1.0
exceptiongroup ==1.1.1
idna ==3.6
iniconfig ==2.0.0
jinja2 ==3.1.2
joblib ==1.2.0
langcodes ==3.3.0
markupsafe ==2.1.3
murmurhash ==1.0.10
numpy ==1.24.2
openpyxl ==3.1.2
packaging ==23.0
pandas ==2.1.3
pluggy ==1.0.0
preshed ==3.0.9
pydantic ==2.5.2
pydantic-core ==2.14.5
pytest ==7.2.2
python-dateutil ==2.8.2
python-frog ==0.6.10
pytz ==2023.3.post1
requests ==2.31.0
scikit-learn ==1.2.2
scipy ==1.10.1
six ==1.16.0
smart-open ==6.4.0
spacy ==3.7.2
spacy-legacy ==3.0.12
spacy-loggers ==1.0.5
srsly ==2.4.8
thinc ==8.2.1
threadpoolctl ==3.1.0
tomli ==2.0.1
tqdm ==4.66.1
typer ==0.9.0
typing-extensions ==4.9.0
tzdata ==2023.3
urllib3 ==2.1.0
wasabi ==1.1.2
weasel ==0.3.4
