movie-script-database

A database of movie scripts from several sources

https://github.com/aveek-saha/movie-script-database

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.9%) to scientific vocabulary

Keywords

imsdb movie-database movie-metadata movie-scripts moviedb-api omdb-api tmdb-api
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: Aveek-Saha
  • License: mit
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 130 KB
Statistics
  • Stars: 173
  • Watchers: 5
  • Forks: 26
  • Open Issues: 4
  • Releases: 0
Created over 5 years ago · Last pushed almost 2 years ago
Metadata Files
Readme Funding License Citation

README.md

The Movie Script Database

This is a utility that collects movie scripts from several sources and builds a database of 2.5k+ movie scripts as .txt files, along with metadata for the movies.

There are four steps to the whole process:

  1. Collect scripts from various sources - Scrape websites for scripts in HTML, txt, doc or pdf format.
  2. Collect metadata - Get metadata about the scripts from TMDb and IMDb for additional processing.
  3. Find duplicates from different sources - Automatically group and remove duplicates of the same script.
  4. Parse scripts - Convert scripts into lines with just the character and dialogue.
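
The four steps correspond to the four scripts described under Usage (get_scripts.py, get_metadata.py, clean_files.py and parse_files.py). As a rough sketch, a small driver could run them in order and stop at the first failure. The `run_pipeline` helper and its `runner` parameter are hypothetical and are not part of the repository.

```python
import subprocess
import sys

# Script names as given in the Usage section; the order matters.
STEPS = ["get_scripts.py", "get_metadata.py", "clean_files.py", "parse_files.py"]


def run_pipeline(steps=STEPS, runner=None):
    """Run each step in order and return the steps that completed.

    `runner` takes a command list and returns an exit code. It defaults to
    subprocess.call and exists mainly so the function can be tested.
    """
    if runner is None:
        runner = subprocess.call
    completed = []
    for step in steps:
        code = runner([sys.executable, step])
        if code != 0:
            # Later steps depend on earlier output, so stop here.
            print(f"{step} exited with code {code}; stopping.")
            break
        completed.append(step)
    return completed
```

Calling `run_pipeline()` from the repository root would attempt all four steps. Because the download step can take hours, running the scripts by hand one at a time may still be the more practical choice.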

Usage

The following steps MUST be run in order.

Clone

Clone this repository:

```
git clone https://github.com/Aveek-Saha/Movie-Script-Database.git
cd Movie-Script-Database
```

Dependencies

First read the installation instructions for textract, which may require additional system packages.

Then install all dependencies using pip:

pip install -r requirements.txt

Collect movie scripts

Edit sources.json to choose which sources to download. Set a source's value to true to include it, or to false to skip it.

Then run the following to collect the scripts from the enabled sources:

python get_scripts.py

A sources.json with every source enabled looks like this:

```json
{
    "imsdb": "true",
    "screenplays": "true",
    "scriptsavant": "true",
    "dailyscript": "true",
    "awesomefilm": "true",
    "sfy": "true",
    "scriptslug": "true",
    "actorpoint": "true",
    "scriptpdf": "true"
}
```

  • This might take a while (4+ hrs) depending on your network connection.
  • The script takes advantage of parallel processing to speed up the download process.
  • If there are missing/incomplete downloads, the script will only download the missing scripts if run again.
  • In case of scripts in PDF or DOC format, the original file is stored in the temp directory.
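
Editing sources.json by hand is simple, but the toggling can also be scripted. The sketch below assumes the file has the shape shown in the sample above, including the string values "true" and "false". The helper names `set_enabled_sources` and `update_sources_file` are hypothetical.

```python
import json


def set_enabled_sources(config, enabled):
    """Return a copy of the sources mapping with only `enabled` switched on.

    Values are written as the strings "true"/"false" to match the sample
    sources.json in the README.
    """
    unknown = set(enabled) - set(config)
    if unknown:
        raise KeyError(f"Unknown sources: {sorted(unknown)}")
    return {name: "true" if name in enabled else "false" for name in config}


def update_sources_file(path, enabled):
    """Rewrite the JSON file at `path` so only `enabled` sources are on."""
    with open(path, encoding="utf-8") as f:
        config = json.load(f)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(set_enabled_sources(config, enabled), f, indent=4)
```

For example, `update_sources_file("sources.json", {"imsdb", "dailyscript"})` would leave only those two sources enabled, which can shorten a first test run considerably.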

Collect metadata

Collect metadata from TMDb and IMDb:

python get_metadata.py

You'll need an API key to use the TMDb API; the TMDb documentation explains how to request one. Once you have the key, store it in a file called config.py in this format:

```py
tmdb_api_key = "<Your API key>"
```
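
If you would rather not type the key into a file by hand, config.py can be generated. This is only a sketch: the `write_config` helper and the `TMDB_API_KEY` environment variable name are assumptions made for illustration, not something the repository defines.

```python
import json
import os


def write_config(api_key, path="config.py"):
    """Write config.py with the single variable the README describes."""
    if not api_key:
        raise ValueError("TMDb API key is empty")
    with open(path, "w", encoding="utf-8") as f:
        # json.dumps produces a double-quoted literal that is also valid
        # Python for ordinary ASCII keys.
        f.write(f"tmdb_api_key = {json.dumps(api_key)}\n")


# Example (TMDB_API_KEY is a hypothetical environment variable name):
# write_config(os.environ["TMDB_API_KEY"])
```

Keeping config.py out of version control is a sensible precaution, since it holds a private key.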

This step will also combine duplicates, and your final metadata will be in this format:

```json
{
    "uniquescriptname": {
        "files": [
            {
                "name": "Duplicate 1",
                "source": "Source of the script",
                "file_name": "name-of-the-file",
                "script_url": "Original link to script",
                "size": "size of file"
            },
            {
                "name": "Duplicate 2",
                "source": "Source of the script",
                "file_name": "name-of-the-file",
                "script_url": "Original link to script",
                "size": "size of file"
            }
        ],
        "tmdb": {
            "title": "Title from TMDb",
            "release_date": "Date released",
            "id": "TMDb ID",
            "overview": "Plot summary"
        },
        "imdb": {
            "title": "Title from IMDb",
            "release_date": "Year released",
            "id": "IMDb ID"
        }
    }
}
```
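
To see which scripts were found in more than one source, the combined metadata can be inspected once it is loaded with `json.load`. The sketch below works on an already loaded dictionary in the format above; the `entries_with_duplicates` helper is hypothetical, and file loading is left out because the path of this intermediate metadata file is not stated here.

```python
def entries_with_duplicates(meta):
    """Map each script name to its sources, keeping only names with 2+ files."""
    return {
        name: [f["source"] for f in entry.get("files", [])]
        for name, entry in meta.items()
        if len(entry.get("files", [])) > 1
    }


# Illustrative data only; real entries also carry "tmdb" and "imdb" blocks.
sample = {
    "alien": {"files": [{"source": "imsdb"}, {"source": "dailyscript"}]},
    "heat": {"files": [{"source": "scriptslug"}]},
}
print(entries_with_duplicates(sample))
```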

Remove duplicates

Run:

python clean_files.py

This removes as many duplicate files as it can while avoiding false positives, so a few duplicates may remain. The resulting files are stored in the scripts/filtered directory.

A new metadata file is created where only one file exists for each unique script name, in this format:

```json
{
    "uniquescriptname": {
        "file": {
            "name": "Movie name from source",
            "source": "Source of the script",
            "file_name": "name-of-the-file",
            "script_url": "Original link to script",
            "size": "size of file"
        },
        "tmdb": {
            "title": "Title from TMDb",
            "release_date": "Date released",
            "id": "TMDb ID",
            "overview": "Plot summary"
        },
        "imdb": {
            "title": "Title from IMDb",
            "release_date": "Year released",
            "id": "IMDb ID"
        }
    }
}
```

The scripts are also cleaned to remove as many as possible of the formatting artifacts that OCR introduces when reading from a PDF.
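
To give a feel for what this kind of cleanup can involve, here is a small illustrative sketch. It is not the logic used by clean_files.py; the `tidy_text` function and the specific rules (normalising line endings, turning form feeds into line breaks, trimming trailing whitespace, collapsing runs of blank lines) are assumptions chosen as typical examples.

```python
import re


def tidy_text(text):
    """Apply a few generic whitespace fixes to extracted script text."""
    # Normalise Windows and old Mac line endings.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Form feeds usually mark page breaks in text extracted from PDFs.
    text = text.replace("\f", "\n")
    # Drop trailing whitespace but keep leading indentation, which
    # screenplays use to distinguish dialogue from description.
    text = "\n".join(line.rstrip() for line in text.split("\n"))
    # Collapse three or more consecutive newlines into one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip("\n") + "\n"
```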

Parse Scripts

Run:

python parse_files.py

This will parse the non-duplicate scripts from the previous step. The parsed scripts are put into three folders:

  • scripts/parsed/tagged: Contains scripts where each line has been tagged. The tags are
    • S = Scene
    • N = Scene description
    • C = Character
    • D = Dialogue
    • E = Dialogue metadata
    • T = Transition
    • M = Metadata
  • scripts/parsed/dialogue: Contains scripts where each line has the character name followed by that character's dialogue, in the format C=>D
  • scripts/parsed/charinfo: Contains a list of each character in the script and the number of lines they have, in the format C: Number of lines
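
The dialogue format is easy to consume from Python. The following sketch splits C=>D lines into pairs and counts lines per character, which should roughly mirror what the charinfo files contain. The function names are hypothetical, and the handling of lines without a separator is an assumption.

```python
from collections import Counter


def parse_dialogue_lines(lines):
    """Split "C=>D" lines into (character, dialogue) pairs, skipping others."""
    pairs = []
    for line in lines:
        # partition splits on the first "=>" only, so dialogue that itself
        # contains "=>" is kept intact.
        character, sep, dialogue = line.rstrip("\n").partition("=>")
        if sep:
            pairs.append((character.strip(), dialogue.strip()))
    return pairs


def count_lines(pairs):
    """Count how many dialogue lines each character has."""
    return Counter(character for character, _ in pairs)
```

With a real file, `parse_dialogue_lines(open(path, encoding="utf-8"))` would yield the pairs for a whole script.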

A new metadata file is created with the following format:

```json
{
    "uniquescriptname": {
        "file": {
            "name": "Movie name from source",
            "source": "Source of the script",
            "file_name": "name-of-the-file",
            "script_url": "Original link to script",
            "size": "size of file"
        },
        "tmdb": {
            "title": "Title from TMDb",
            "release_date": "Date released",
            "id": "TMDb ID",
            "overview": "Plot summary"
        },
        "imdb": {
            "title": "Title from IMDb",
            "release_date": "Year released",
            "id": "IMDb ID"
        },
        "parsed": {
            "dialogue": "name-of-the-file_dialogue.txt",
            "charinfo": "name-of-the-file_charinfo.txt",
            "tagged": "name-of-the-file_parsed.txt"
        }
    }
}
```
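
The "parsed" block makes it straightforward to locate the output files for a given movie. The sketch below assumes that each key under "parsed" matches the folder of the same name under scripts/parsed, as the folder list above suggests. The `parsed_paths` helper is hypothetical.

```python
import os


def parsed_paths(entry, root="scripts"):
    """Build the path of each parsed file for one metadata entry."""
    parsed = entry.get("parsed", {})
    return {
        kind: os.path.join(root, "parsed", kind, filename)
        for kind, filename in parsed.items()
    }


# Illustrative entry with only the "parsed" block filled in.
entry = {
    "parsed": {
        "dialogue": "alien_dialogue.txt",
        "charinfo": "alien_charinfo.txt",
        "tagged": "alien_parsed.txt",
    }
}
for kind, path in parsed_paths(entry).items():
    print(kind, path)
```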

Directory structure

After running all the steps, your folder structure should look something like this:

```
scripts
│
├── unprocessed // Scripts from sources
│   ├── source1
│   ├── source2
│   └── source3
│
├── temp // PDF files from sources
│   ├── source1
│   ├── source2
│   └── source3
│
├── metadata // Metadata files from sources/cleaned metadata
│   ├── source1.json
│   ├── source2.json
│   ├── source3.json
│   └── meta.json
│
├── filtered // Scripts with duplicates removed
│
└── parsed // Scripts parsed using the parser
    ├── dialogue
    ├── charinfo
    └── tagged
```
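
A quick way to confirm that every step ran is to check for the top-level directories in the tree above. This is a minimal sketch; the `missing_dirs` helper is hypothetical, and the temp directory may legitimately be absent if none of the enabled sources served PDF or DOC files.

```python
import os

# Directories from the tree above, relative to the scripts root.
EXPECTED_DIRS = [
    "unprocessed",
    "temp",
    "metadata",
    "filtered",
    os.path.join("parsed", "dialogue"),
    os.path.join("parsed", "charinfo"),
    os.path.join("parsed", "tagged"),
]


def missing_dirs(root="scripts"):
    """Return the expected directories that do not exist under `root`."""
    return [d for d in EXPECTED_DIRS if not os.path.isdir(os.path.join(root, d))]
```

An empty list from `missing_dirs()` suggests all four steps produced output; any names returned point at the step that still needs to be run.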

Sources

Metadata:

Scripts:

Note:

Citing

If you use The Movie Script Database, please cite:

```
@misc{Saha_Movie_Script_Database_2021,
    author = {Saha, Aveek},
    month = {7},
    title = {{Movie Script Database}},
    url = {https://github.com/Aveek-Saha/Movie-Script-Database},
    year = {2021}
}
```

Credits

The code for parsing the movie scripts comes from this paper: Linguistic analysis of differences in portrayal of movie characters, in: Proceedings of Association for Computational Linguistics, Vancouver, Canada, 2017. The original parser code can be found here: https://github.com/usc-sail/mica-text-script-parser

Owner

  • Name: Aveek Saha
  • Login: Aveek-Saha
  • Kind: user
  • Location: Boston, MA
  • Company: @akamai

Cloud Computing, Machine Learning and Full Stack. SDE co-op @akamai. MSCS student @northeastern. Previously at @HewlettPackard, @altimetrik & @ IIT Kgp.

Citation (CITATION.cff)

cff-version: 1.0.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Saha"
  given-names: "Aveek"
  orcid: "https://orcid.org/0000-0002-6112-3843"
title: "Movie Script Database"
version: 1.0.0
date-released: 2021-07-05
url: "https://github.com/Aveek-Saha/Movie-Script-Database"

GitHub Events

Total
  • Watch event: 21
  • Fork event: 4
Last Year
  • Watch event: 21
  • Fork event: 4

Issues and Pull Requests

Last synced: about 1 year ago

All Time
  • Total issues: 2
  • Total pull requests: 3
  • Average time to close issues: 11 minutes
  • Average time to close pull requests: N/A
  • Total issue authors: 2
  • Total pull request authors: 3
  • Average comments per issue: 0.5
  • Average comments per pull request: 0.33
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 1
Past Year
  • Issues: 0
  • Pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 1
Top Authors
Issue Authors
  • dylanwalker (1)
  • gfunkgod (1)
Pull Request Authors
  • dependabot[bot] (2)
  • ldnovak (1)
  • copoer (1)
Top Labels
Issue Labels
Pull Request Labels
dependencies (1)

Dependencies

requirements.txt pypi
  • IMDbPY ==2021.4.18
  • Unidecode ==1.2.0
  • beautifulsoup4 ==4.9.3
  • fuzzywuzzy ==0.18.0
  • textract ==1.6.3
  • tqdm ==4.61.1