movie-script-database

A database of movie scripts from several sources

https://github.com/aveek-saha/movie-script-database

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (13.9%) to scientific vocabulary

Keywords

imsdb movie-database movie-metadata movie-scripts moviedb-api omdb-api tmdb-api

Last synced: 6 months ago

Repository

A database of movie scripts from several sources

Basic Info

Host: GitHub
Owner: Aveek-Saha
License: mit
Language: Python
Default Branch: master
Homepage:
Size: 130 KB

Statistics

Stars: 173
Watchers: 5
Forks: 26
Open Issues: 4
Releases: 0

Topics

imsdb movie-database movie-metadata movie-scripts moviedb-api omdb-api tmdb-api

Created over 5 years ago · Last pushed almost 2 years ago

Metadata Files

Readme Funding License Citation

The Movie Script Database

This is a utility for collecting movie scripts from several sources and building a database of 2.5k+ movie scripts as .txt files, along with metadata for the movies.

There are four steps to the whole process:

Collect scripts from various sources - Scrape websites for scripts in HTML, txt, doc or pdf format
Collect metadata - Get metadata about the scripts from TMDb and IMDb for additional processing
Find duplicates from different sources - Automatically group and remove duplicates from different sources.
Parse Scripts - Convert scripts into lines with just Character and dialogue

Usage

The following steps MUST be run in order

Clone

Clone this repository:

git clone https://github.com/Aveek-Saha/Movie-Script-Database.git
cd Movie-Script-Database

Dependencies

First, read the installation instructions for textract.

Then install all dependencies using pip

pip install -r requirements.txt

Collect movie scripts

Choose the sources you want to download in sources.json. To include a source, set its value to "true"; otherwise set it to "false".

python get_scripts.py

Collect all the scripts from the sources listed below:

{
    "imsdb": "true",
    "screenplays": "true",
    "scriptsavant": "true",
    "dailyscript": "true",
    "awesomefilm": "true",
    "sfy": "true",
    "scriptslug": "true",
    "actorpoint": "true",
    "scriptpdf": "true"
}

This might take a while (4+ hrs) depending on your network connection.
The script takes advantage of parallel processing to speed up the download process.
If there are missing/incomplete downloads, the script will only download the missing scripts if run again.
In case of scripts in PDF or DOC format, the original file is stored in the temp directory.
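
As a minimal sketch of how the flags in sources.json could be read, the snippet below lists the enabled sources. The function name `enabled_sources` is hypothetical and not part of the repository; the example values follow the string form ("true"/"false") shown above, and accepting real JSON booleans as well is an assumption.

```python
import json


def enabled_sources(config_text: str) -> list[str]:
    """Return the names of sources whose flag is set to true."""
    config = json.loads(config_text)
    # The example config stores flags as the strings "true"/"false",
    # so compare text. Handling real booleans too is an assumption.
    return [name for name, flag in config.items()
            if str(flag).lower() == "true"]


example = '{"imsdb": "true", "screenplays": "false", "scriptpdf": "true"}'
print(enabled_sources(example))  # ['imsdb', 'scriptpdf']
```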

Collect metadata

Collect metadata from TMDb and IMDb:

python get_metadata.py

You'll need an API key to use the TMDb API. Once you have the key, store it in a file called config.py in this format:

tmdb_api_key = "<Your API key>"

This step will also combine duplicates, and your final metadata will be in this format:

{
    "uniquescriptname": {
        "files": [
            {
                "name": "Duplicate 1",
                "source": "Source of the script",
                "file_name": "name-of-the-file",
                "script_url": "Original link to script",
                "size": "size of file"
            },
            {
                "name": "Duplicate 2",
                "source": "Source of the script",
                "file_name": "name-of-the-file",
                "script_url": "Original link to script",
                "size": "size of file"
            }
        ],
        "tmdb": {
            "title": "Title from TMDb",
            "release_date": "Date released",
            "id": "TMDb ID",
            "overview": "Plot summary"
        },
        "imdb": {
            "title": "Title from IMDb",
            "release_date": "Year released",
            "id": "IMDb ID"
        }
    }
}
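
To illustrate how this combined metadata might be inspected, here is a small sketch that reports which scripts have more than one copy and where the copies came from. The function name `duplicate_report` and the sample values are hypothetical; only the key names (`files`, `source`) are taken from the format above.

```python
def duplicate_report(meta: dict) -> dict[str, list[str]]:
    """Map each script that has several copies to the sources of those copies."""
    report = {}
    for script_name, entry in meta.items():
        files = entry.get("files", [])
        if len(files) > 1:
            report[script_name] = [f["source"] for f in files]
    return report


# Hypothetical sample in the documented shape (other keys omitted for brevity).
sample_meta = {
    "examplemovie": {"files": [{"source": "imsdb"}, {"source": "scriptslug"}]},
    "othermovie": {"files": [{"source": "dailyscript"}]},
}
print(duplicate_report(sample_meta))  # {'examplemovie': ['imsdb', 'scriptslug']}
```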

Remove duplicates

Run:

python clean_files.py

This removes duplicate files as thoroughly as possible while avoiding false positives. The resulting files are stored in the scripts/filtered directory.

A new metadata file is created where only one file exists for each unique script name, in this format:

{
    "uniquescriptname": {
        "file": {
            "name": "Movie name from source",
            "source": "Source of the script",
            "file_name": "name-of-the-file",
            "script_url": "Original link to script",
            "size": "size of file"
        },
        "tmdb": {
            "title": "Title from TMDb",
            "release_date": "Date released",
            "id": "TMDb ID",
            "overview": "Plot summary"
        },
        "imdb": {
            "title": "Title from IMDb",
            "release_date": "Year released",
            "id": "IMDb ID"
        }
    }
}

The scripts are also cleaned to remove as many as possible of the formatting artifacts introduced when OCR is used to read a PDF.
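
The change in metadata shape, from a `files` list to a single `file` entry, can be sketched as follows. This is not the logic of clean_files.py: keeping the largest copy is an assumption made purely for illustration, as is treating `size` as something `int()` can convert. The function name `collapse_entry` is hypothetical.

```python
def collapse_entry(entry: dict) -> dict:
    """Turn an entry with a "files" list into one with a single "file"."""
    # Assumption: keep the largest copy. The real selection rule may differ.
    best = max(entry["files"], key=lambda f: int(f["size"]))
    collapsed = {key: value for key, value in entry.items() if key != "files"}
    collapsed["file"] = best
    return collapsed


# Hypothetical sample entry with two copies of the same script.
sample_entry = {
    "files": [
        {"name": "Duplicate 1", "source": "imsdb", "size": "1200"},
        {"name": "Duplicate 2", "source": "scriptslug", "size": "3400"},
    ],
    "tmdb": {"title": "Example"},
}
print(collapse_entry(sample_entry)["file"]["name"])  # Duplicate 2
```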

Parse Scripts

Run:

python parse_files.py

This will parse the non-duplicate scripts from the previous step. The parsed scripts are put into three folders:

scripts/parsed/tagged: Contains scripts where each line has been tagged. The tags are
- S = Scene
- N = Scene description
- C = Character
- D = Dialogue
- E = Dialogue metadata
- T = Transition
- M = Metadata
scripts/parsed/dialogue: Contains scripts where each line has the character name, followed by a dialogue, in this format, C=>D
scripts/parsed/charinfo: Contains a list of each character in the script and the number of lines they have, in this format, C: Number of lines
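
As a rough illustration of how the dialogue and charinfo outputs relate, the sketch below counts lines per character from text in the `C=>D` format and prints them as `C: Number of lines`. The function name `count_lines` and the sample dialogue are hypothetical; the actual parser may handle edge cases differently.

```python
from collections import Counter


def count_lines(dialogue_text: str) -> Counter:
    """Count dialogue lines per character from text in C=>D format."""
    counts = Counter()
    for line in dialogue_text.splitlines():
        # Split on the first "=>" only, in case the dialogue contains "=>".
        character, separator, _dialogue = line.partition("=>")
        if separator and character.strip():
            counts[character.strip()] += 1
    return counts


sample = "ALICE=>Hello.\nBOB=>Hi there.\nALICE=>How are you?"
for character, total in count_lines(sample).items():
    print(f"{character}: {total}")
```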

A new metadata file is created with the following format:

{
    "uniquescriptname": {
        "file": {
            "name": "Movie name from source",
            "source": "Source of the script",
            "file_name": "name-of-the-file",
            "script_url": "Original link to script",
            "size": "size of file"
        },
        "tmdb": {
            "title": "Title from TMDb",
            "release_date": "Date released",
            "id": "TMDb ID",
            "overview": "Plot summary"
        },
        "imdb": {
            "title": "Title from IMDb",
            "release_date": "Year released",
            "id": "IMDb ID"
        },
        "parsed": {
            "dialogue": "name-of-the-file_dialogue.txt",
            "charinfo": "name-of-the-file_charinfo.txt",
            "tagged": "name-of-the-file_parsed.txt"
        }
    }
}

Directory structure

After running all the steps, your folder structure should look something like this:

scripts
│
├── unprocessed // Scripts from sources
│   ├── source1
│   ├── source2
│   └── source3
│
├── temp // PDF files from sources
│   ├── source1
│   ├── source2
│   └── source3
│
├── metadata // Metadata files from sources/cleaned metadata
│   ├── source1.json
│   ├── source2.json
│   ├── source3.json
│   └── meta.json
│
├── filtered // Scripts with duplicates removed
│
└── parsed // Scripts parsed using the parser
    ├── dialogue
    ├── charinfo
    └── tagged
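
Given this layout, the file names stored under the `parsed` key of the metadata can be turned into paths. The sketch below assumes each parsed file sits directly inside the folder named after its kind (dialogue, charinfo, tagged); the function name `parsed_paths` and the sample file names are hypothetical.

```python
from pathlib import Path


def parsed_paths(entry: dict, root: str = "scripts") -> dict[str, Path]:
    """Build the expected path of each parsed file for one metadata entry."""
    base = Path(root) / "parsed"
    # Assumption: files live at <root>/parsed/<kind>/<file name>.
    return {kind: base / kind / file_name
            for kind, file_name in entry["parsed"].items()}


sample_entry = {
    "parsed": {
        "dialogue": "example_dialogue.txt",
        "charinfo": "example_charinfo.txt",
        "tagged": "example_parsed.txt",
    }
}
for kind, path in parsed_paths(sample_entry).items():
    print(kind, path)
```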

Sources

Metadata:

TMDb
IMDb

Scripts:

Note:

~~Weeklyscript~~ (Site no longer active)

Citing

If you use The Movie Script Database, please cite:

@misc{Saha_Movie_Script_Database_2021,
    author = {Saha, Aveek},
    month = {7},
    title = {{Movie Script Database}},
    url = {https://github.com/Aveek-Saha/Movie-Script-Database},
    year = {2021}
}

Credits

The script for parsing the movie scripts comes from this paper: Linguistic analysis of differences in portrayal of movie characters, in: Proceedings of Association for Computational Linguistics, Vancouver, Canada, 2017. The code can be found here: https://github.com/usc-sail/mica-text-script-parser

Owner

Name: Aveek Saha
Login: Aveek-Saha
Kind: user
Location: Boston, MA
Company: @akamai

Website: home.aveek.io
Repositories: 66
Profile: https://github.com/Aveek-Saha

Cloud Computing, Machine Learning and Full Stack. SDE co-op @akamai. MSCS student @northeastern. Previously at @HewlettPackard, @altimetrik & @ IIT Kgp.

Citation (CITATION.cff)

cff-version: 1.0.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Saha"
  given-names: "Aveek"
  orcid: "https://orcid.org/0000-0002-6112-3843"
title: "Movie Script Database"
version: 1.0.0
date-released: 2021-07-05
url: "https://github.com/Aveek-Saha/Movie-Script-Database"

GitHub Events

Total

Watch event: 21
Fork event: 4

Last Year

Watch event: 21
Fork event: 4

Issues and Pull Requests

Last synced: about 1 year ago

All Time

Total issues: 2
Total pull requests: 3
Average time to close issues: 11 minutes
Average time to close pull requests: N/A
Total issue authors: 2
Total pull request authors: 3
Average comments per issue: 0.5
Average comments per pull request: 0.33
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 1

Past Year

Issues: 0
Pull requests: 1
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 0.0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 1

Top Authors

Issue Authors

dylanwalker (1)
gfunkgod (1)

Pull Request Authors

dependabot[bot] (2)
ldnovak (1)
copoer (1)

Top Labels

Issue Labels

Pull Request Labels

dependencies (1)

Dependencies

requirements.txt pypi

IMDbPY ==2021.4.18
Unidecode ==1.2.0
beautifulsoup4 ==4.9.3
fuzzywuzzy ==0.18.0
textract ==1.6.3
tqdm ==4.61.1
