https://github.com/aryanvbw/maths_ai_assignment

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: AryanVBW
Language: Jupyter Notebook
Default Branch: main
Size: 103 KB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created about 1 year ago · Last pushed about 1 year ago

Metadata Files

Readme

README.md

Presentation: Benford's Law & Zipf's Law in Movie Subtitles

You can run all of the analysis and plotting code interactively in an IPython/Jupyter notebook on Google Colab, rather than only reading a static notebook.

Slide 1: Introduction

Topic: Exploring two fascinating statistical patterns: Benford's Law and Zipf's Law.
Context: Applying these laws to the movies_subtitles.csv dataset.
Goal: Understand how these non-intuitive laws manifest in real-world data.

Slide 2: What is Benford's Law? (The Law of First Digits)

Concept: Benford's Law describes the frequency distribution of leading digits in many real-life sets of numerical data.
Prediction: The law states that the digit '1' appears as the leading digit about 30.1% of the time, '2' about 17.6%, and so on, with higher digits appearing progressively less often.
Formula: P(d) = log10(1 + 1/d) for d ∈ {1, 2, ..., 9}
Intuition: It often applies to data that spans several orders of magnitude.
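
As a minimal sketch, the formula above can be evaluated directly to obtain the expected share of each leading digit. The function name `benford_probability` is chosen here for illustration and is not taken from the notebook.

```python
import math


def benford_probability(d: int) -> float:
    """Return P(d) = log10(1 + 1/d) for a leading digit d in 1..9."""
    if not 1 <= d <= 9:
        raise ValueError("leading digit must be between 1 and 9")
    return math.log10(1 + 1 / d)


# Expected share of each leading digit under Benford's Law.
benford = {d: benford_probability(d) for d in range(1, 10)}

for d, p in benford.items():
    print(f"{d}: {p:.1%}")
```

The nine probabilities sum to 1 because the product of (1 + 1/d) for d = 1..9 is exactly 10.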

Slide 3: Benford's Law - Key Question

Question: What is the probability that the leading digit of a number (in a naturally occurring dataset) is 1?
Benford's Answer: Approximately 30.1%
Contrast: If digits were uniformly distributed, we'd expect ~11.1% (1/9).

Slide 4: Applying Benford's Law to Movie Subtitles Data

Hypothesis: Numerical data within the movies_subtitles.csv (e.g., start/end timestamps, potentially other numeric fields if available) might follow Benford's Law.
Data Column(s) for Analysis:
- start_time (Subtitle start times in seconds)
Analysis:
1. Extract the start_time numerical data.
2. Isolate the first significant digit of each number.
3. Calculate the frequency distribution of these leading digits (1-9).
4. Compare the observed distribution to Benford's predicted distribution.
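
The four steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the notebook's actual code: the helper names and the short `start_times` list are hypothetical stand-ins for the real `start_time` column.

```python
import math
from collections import Counter
from typing import Dict, Iterable, Optional


def leading_digit(x: float) -> Optional[int]:
    """First significant digit of x, or None for zero and non-finite values."""
    if not math.isfinite(x) or x == 0:
        return None
    # Scan the decimal text form; the first non-zero digit is the leading digit.
    # This also works for exponent notation such as '1e-07'.
    for ch in str(abs(x)):
        if ch in "123456789":
            return int(ch)
    return None


def leading_digit_distribution(values: Iterable[float]) -> Dict[int, float]:
    """Relative frequency of each leading digit 1..9 among the valid values."""
    digits = [d for d in map(leading_digit, values) if d is not None]
    if not digits:
        raise ValueError("no valid values to analyze")
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}


# Hypothetical sample of subtitle start times in seconds (not real data).
start_times = [0.0, 12.5, 134.2, 1980.0, 2450.75, 0.042, 3600.0, 75.0]

observed = leading_digit_distribution(start_times)
expected = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

for d in range(1, 10):
    print(f"{d}: observed {observed[d]:.1%}, expected {expected[d]:.1%}")
```

Values of zero are skipped because they have no significant digit, which is one plausible reading of the "valid start times" filter.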

Slide 5: Benford's Law - Results

Dataset Used: movies_subtitles.csv
Column Analyzed: start_time
Observed vs. Expected Frequencies (from 10,358,275 valid start times):
- Digit 1: 20.1% (Expected: 30.1%)
- Digit 2: 19.5% (Expected: 17.6%)
- Digit 3: 18.8% (Expected: 12.5%)
- Digit 4: 16.9% (Expected: 9.7%)
- Digit 5: 11.1% (Expected: 7.9%)
- Digit 6: 5.7% (Expected: 6.7%)
- Digit 7: 3.3% (Expected: 5.8%)
- Digit 8: 2.5% (Expected: 5.1%)
- Digit 9: 2.2% (Expected: 4.6%)
Visualization (Mermaid pie chart):
pie title Observed Frequencies
    "1 (Exp: 30.1%)" : 20.1
    "2 (Exp: 17.6%)" : 19.5
    "3 (Exp: 12.5%)" : 18.8
    "4 (Exp: 9.7%)" : 16.9
    "5 (Exp: 7.9%)" : 11.1
    "6 (Exp: 6.7%)" : 5.7
    "7 (Exp: 5.8%)" : 3.3
    "8 (Exp: 5.1%)" : 2.5
    "9 (Exp: 4.6%)" : 2.2
Conclusion: The start_time data does not conform closely to Benford's Law. The general trend of decreasing frequency for higher digits is present, but digit '1' is strongly underrepresented (20.1% observed vs. 30.1% expected), digits 2-5 are overrepresented, and digits 6-9 are underrepresented compared to the theoretical distribution. A likely reason is the nature of subtitle timings: start times are bounded by each film's runtime, so they may not span the several orders of magnitude over which Benford's Law usually holds.
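
To put a number on the deviation, one simple summary is the mean absolute deviation (MAD) between the observed and expected percentages. The sketch below uses only the figures reported on this slide. The choice of MAD is an assumption made for illustration; the source does not state which statistic, if any, the notebook computes.

```python
import math

# Observed leading-digit percentages as reported on this slide.
observed = {1: 20.1, 2: 19.5, 3: 18.8, 4: 16.9, 5: 11.1,
            6: 5.7, 7: 3.3, 8: 2.5, 9: 2.2}

# Benford's expected percentages from P(d) = log10(1 + 1/d).
expected = {d: 100 * math.log10(1 + 1 / d) for d in range(1, 10)}

# Signed deviation per digit, in percentage points.
deviation = {d: observed[d] - expected[d] for d in observed}

# Mean absolute deviation across the nine digits.
mad = sum(abs(v) for v in deviation.values()) / len(deviation)

overrepresented = [d for d, v in deviation.items() if v > 0]

print(f"Mean absolute deviation: {mad:.2f} percentage points")
print(f"Overrepresented digits: {overrepresented}")
```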

Slide 6: Benford's Law Plot

Here is the distribution of leading digits observed in the data compared to the expected Benford's Law distribution:

[Figure: Benford's Law Plot]

Slide 7: What is Zipf's Law? (The Law of Rank Frequency)

Concept: Zipf's Law relates the frequency of a word in a text corpus to its rank in the frequency table.
Prediction: The most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, and so on.
Formula: Frequency ≈ Constant / Rank
Observation: A few words are used very frequently, while many words are used rarely.
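
A minimal sketch of the prediction: given the frequency of the top-ranked word, the ideal Zipf frequency at rank r is that value divided by r. The function name, the `exponent` parameter (1.0 for the classic form), and the top frequency of 1000 are hypothetical choices for illustration.

```python
from typing import List


def zipf_expected(top_frequency: float, max_rank: int, exponent: float = 1.0) -> List[float]:
    """Ideal Zipf frequencies for ranks 1..max_rank: top_frequency / rank**exponent."""
    if max_rank < 1:
        raise ValueError("max_rank must be at least 1")
    return [top_frequency / rank ** exponent for rank in range(1, max_rank + 1)]


# Hypothetical corpus whose most frequent word occurs 1000 times.
predicted = zipf_expected(1000, 5)

for rank, freq in enumerate(predicted, start=1):
    print(f"rank {rank}: {freq:.1f}")
```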

Slide 8: Zipf's Law - Key Question

Question: How does the frequency of words relate to their rank in a large body of text?
Zipf's Answer: The frequency is inversely proportional to the rank.
Example: If "the" is the 1st most common word, "of" (2nd) should appear about 1/2 as often, and "and" (3rd) about 1/3 as often.

Slide 9: Applying Zipf's Law to Movie Subtitles Data

Hypothesis: The frequency of words used in the movie subtitles text should follow Zipf's Law.
Data Column for Analysis:
- text (The subtitle text content)
Analysis:
1. Extract all text from the text column.
2. Tokenize the text into words (lowercase, remove punctuation/numbers, remove common English stop words and subtitle artifacts like 'sighs').
3. Count the frequency of each unique word.
4. Rank the words from most frequent to least frequent.
5. Plot word frequency against word rank using log-log scales.
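
Steps 1 to 4 can be sketched as below. The stop-word and artifact sets are tiny illustrative stand-ins, and the sample lines are made up; the notebook's actual lists and tokenizer are not shown in this README. Removing punctuation before splitting turns "don't" into "dont", which is consistent with the word forms reported in the results.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Illustrative stand-ins; a real analysis would use a full stop-word list.
STOP_WORDS = {"the", "a", "to", "and", "you", "i", "it", "of"}
ARTIFACTS = {"sighs"}


def tokenize(text: str) -> List[str]:
    """Lowercase, drop punctuation and digits, then filter stop words and artifacts."""
    text = re.sub(r"[^a-z\s]", "", text.lower())
    return [w for w in text.split() if w not in STOP_WORDS and w not in ARTIFACTS]


def rank_words(lines: Iterable[str]) -> List[Tuple[int, str, int]]:
    """Return (rank, word, frequency) tuples, most frequent first."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(counts.most_common(), start=1)]


# Hypothetical subtitle lines standing in for the text column.
sample_lines = ["I don't know.", "(sighs) I don't get it!", "You know, I'm right."]

ranked = rank_words(sample_lines)
for rank, word, freq in ranked:
    print(rank, word, freq)
```

The resulting ranks and frequencies are what step 5 would plot on log-log axes.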

Slide 10: Zipf's Law - Results

Dataset Used: movies_subtitles.csv
Column Analyzed: text
Analysis Summary: Analyzed 34,611,884 word occurrences after filtering, covering 306,477 unique words.
Top 10 Words & Frequencies:

| Rank | Word | Frequency |
|------|-------|-----------|
| 1 | dont | 417556 |
| 2 | im | 406630 |
| 3 | know | 384426 |
| 4 | get | 308548 |
| 5 | like | 286525 |
| 6 | go | 284509 |
| 7 | right | 266279 |
| 8 | come | 253165 |
| 9 | youre | 243768 |
| 10 | well | 234644 |

Visualization (Frequency Share for Top 10 Words, Mermaid pie chart):
pie title Frequency Share of Top 10 Words
    "dont (418k)" : 417556
    "im (407k)" : 406630
    "know (384k)" : 384426
    "get (309k)" : 308548
    "like (287k)" : 286525
    "go (285k)" : 284509
    "right (266k)" : 266279
    "come (253k)" : 253165
    "youre (244k)" : 243768
    "well (235k)" : 234644
Conclusion: The word distribution in the text column conforms well to Zipf's Law. The log-log plot shows a clear linear relationship between log(rank) and log(frequency) with a negative slope, as predicted by the law. This indicates a typical natural language distribution where a few words dominate in frequency.

Slide 11: Zipf's Law Plot

The relationship between word rank and frequency is often visualized on a log-log scale, where Zipf's Law predicts a near-linear downward trend:

[Figure: Zipf's Law Plot]

Observation: The plot shows the top words clustering along a roughly linear path, consistent with Zipf's Law. The calculated trendline slope is close to the ideal value of -1.
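
One way to obtain such a trendline slope is an ordinary least-squares fit of log10(frequency) against log10(rank). The sketch below is an assumption about the method, since the README does not say how the slope was calculated, and it uses synthetic frequencies rather than the subtitle data.

```python
import math
from typing import Sequence


def loglog_slope(frequencies: Sequence[float]) -> float:
    """Least-squares slope of log10(frequency) vs. log10(rank), ranks starting at 1."""
    if len(frequencies) < 2:
        raise ValueError("need at least two ranked frequencies")
    xs = [math.log10(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log10(freq) for freq in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Synthetic, perfectly Zipfian frequencies (hypothetical constant of 120000).
ideal = [120000 / rank for rank in range(1, 1001)]

slope = loglog_slope(ideal)
print(f"Fitted slope: {slope:.3f}")
```

On real word counts the fitted slope depends on which ranks are included, so it is worth reporting the rank range alongside the slope.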

Slide 12: Conclusion & Discussion

Summary: We investigated the applicability of Benford's Law to subtitle start times and Zipf's Law to word frequencies within the movies_subtitles.csv dataset.
Key Findings:
- Benford's Law & Start Times: The distribution of leading digits in start_time deviated significantly from Benford's Law expectations, as clearly visualized in the generated plot. This suggests that subtitle timings might not represent the kind of naturally occurring, logarithmically distributed data spanning multiple orders of magnitude where Benford's Law typically holds (e.g., they might be clustered or follow a different pattern).
- Zipf's Law & Word Frequency: In contrast, the text data agreed closely with Zipf's Law. The analysis confirmed the characteristic inverse relationship in which a few words dominate in frequency and frequency drops rapidly with rank, which is typical for natural language corpora.
Broader Implications:
- This comparison highlights the specific conditions under which these statistical laws apply. Benford's Law is a powerful tool for detecting anomalies or fraud in certain numerical datasets (like financial records), but its applicability depends heavily on the data generation process.
- Zipf's Law's robustness in natural language underpins many Natural Language Processing (NLP) techniques, from search engine algorithms and text compression to keyword analysis and language modeling. Our findings reinforce its relevance even in the specific context of movie subtitles.
Avenues for Further Exploration:
- Could Benford's Law apply more effectively to other potential numerical data related to movies (e.g., budget, runtime, box office gross)?
- How does the Zipfian distribution change if we analyze dialogue separately from descriptive text within the subtitles?
- Do word frequency patterns (Zipf's Law parameters) differ significantly across different movie genres present in the dataset?
- Could the deviation from Benford's Law in start_time itself reveal interesting patterns about editing or pacing conventions in film?

Owner

Name: Vivek W
Login: AryanVBW
Kind: user
Location: India
Company: @TEch-Shop

Website: http://portfolio.aryanvbw.live/
Twitter: vivekwagadare
Repositories: 1
Profile: https://github.com/AryanVBW

🛰 𝐒𝐢𝐦𝐩𝐥𝐢𝐜𝐢𝐭𝐲 𝐒𝐩𝐚𝐫𝐤𝐬 𝐈𝐧𝐧𝐨𝐯𝐚𝐭𝐢𝐨𝐧🛡️ 🌟 Dive into the Future: ✨𒆜🆅🅸🆅🅴🅺 🆆𒆜 Studying at Newton School of Technology🚀

GitHub Events

Total

Watch event: 2
Push event: 7
Create event: 2

Last Year

Watch event: 2
Push event: 7
Create event: 2

Committers

Last synced: about 1 year ago

All Time

Total Commits: 8
Total Committers: 1
Avg Commits per committer: 8.0
Development Distribution Score (DDS): 0.0

Past Year

Commits: 8
Committers: 1
Avg Commits per committer: 8.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Vivek W	9****W	8

Issues and Pull Requests

Last synced: about 1 year ago

All Time

Total issues: 0
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 0
Total pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Science Score: 26.0%