https://github.com/cshnican/golhaye-ghali
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (8.1%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: cshnican
- License: mit
- Language: HTML
- Default Branch: main
- Size: 359 KB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Golhaye Ghali: a dataset for profanity constructions in Farsi and Mazani
What is this dataset for?
This repository explores profanity constructions in Farsi and Mazani (Mazanderani), focusing on idiomatic expressions rather than isolated swear words. While existing resources primarily list individual profanities, this project seeks to document multi-word constructions that convey offensive, humorous, or metaphorical meanings in everyday speech.
Why does it matter?
Language is believed to be largely compositional: the meaning of a multi-word sequence can mostly be deduced from the meanings of its words. For example, "a white tower" means a tower that is white, which is a composition of its elements, "white" and "tower". Compositionality allows speakers to generate complex meanings from a relatively small set of basic units. This is convenient, because speakers only need to memorize the meaning of each individual word, along with the rules for combining them. If language were largely non-compositional, speakers would need to memorize three unrelated utterances for "book", "my book", and "Can I have my book?", which would be a nightmare for people using the language.
However, not everything about language is compositional. One counterexample is idioms in English: the sentence "put yourself in other people's shoes" doesn't mean you are actually wearing other people's shoes, as its individual words combined would suggest. Instead, it means to think from someone else's perspective. Another counterexample is some syntactic structures in English: "I am into laiko" does not mean I am physically inside laiko music, as, again, its individual words combined would suggest. Instead, it means I like laiko music.
Profanity constructions, which this dataset offers, are another example: words (usually vulgar) combined together mean something completely different from their individual parts. English examples are "bullshit" vs. "batshit" vs. "horseshit": they don't refer to the excrement of different animals. Rather, they refer to "nonsense", "something crazy", and "nonsense on an even deeper level".
Why is this the case? When is language non-compositional? How do these seemingly arbitrary meanings get assigned to these collections of words? To address these questions, we hope this dataset can provide future researchers with data points to draw inferences from. We chose Farsi and Mazani because they seem (in the judgement of the curators) to be understudied languages that are rich in such constructions.
Another theoretical value of this dataset is that it might offer a probe into how languages were spoken a long time ago, since these constructions are thought to have formed long ago and to have survived language change because of their sociological value (cf. Progovac, 2024).
A third, practical contribution of this dataset is to provide literal and implied translations of Farsi and Mazani expressions that can be used to train better machine translators.
What are Farsi and Mazani?
Both Farsi and Mazani are Indo-Iranian languages in the Indo-European (IE) language family. The IE language family is currently the language family with the largest number of speakers in the world (3.4 billion worldwide), consisting of a diverse set of languages such as English, German, Swedish (Germanic), Ukrainian, Russian, Bulgarian (Balto-Slavic), Spanish, French, Italian (Italic), Irish (Celtic), Hindi, Urdu, Farsi (Indo-Iranian), Greek (Hellenic), Armenian (Armenian), and Albanian (Albanian). The original form of the language (called Proto-Indo-European) is believed to have been spoken by people living in Anatolia and the Steppe (Heggarty et al., 2023) around 8000 years ago.
Indo-Iranian languages are estimated to have branched off from the Indo-European family tree around 5500 years ago (Gray & Atkinson, 2003), after Hittite (Anatolian), Tocharian, Armenian, Greek, and Albanian. Iranian languages then separated from Indo-Aryan languages around 4600 years ago (Gray & Atkinson, 2003). There are three main stages of Iranian languages (Fortson, 2009): Old Iranian (e.g. Old Persian, Avestan), Middle Iranian (e.g. Pahlavi), and Modern Iranian (e.g. Farsi, Tajik).
Farsi is a language with multiple varieties: the best-known variety is spoken in modern-day Iran. Another variety, called Dari, is one of the official languages of Afghanistan. A third variety, Tajik, is the official language of Tajikistan. All these varieties belong to the Southwestern Iranian branch of the Iranian language family (Hammarström and Forkel, 2022). Mazani (Mazanderani), on the other hand, belongs to the Northwestern Iranian branch (Hammarström and Forkel, 2022) and is mainly spoken in the northern part of Iran near the Caspian Sea, in Mazandaran Province.
Table
This dataset consists of 5 columns. The first column is the original profanity construction in Farsi or Mazani, written in the Persian script. The second column is the construction transliterated into Latin script for those who are unfamiliar with the Persian script, with links to the corresponding Wiktionary entry. The third column is the literal meaning of the construction, usually a word-by-word translation (annotation may be added later, depending on the curators' bandwidth). The fourth column is the inferred meaning, which differs from the literal meaning and reflects how the construction is actually used. The last column contains references.
| Construction | Construction (transliterated) | Literal Meaning | Inferred Meaning | References |
|----------|----------|----------|----------|----------|
| به کیرم | Be kiram | To my dick | I don't care | - |
| به تخم چپم | Be tokhm chapam | To my left ball | I don't care | - |
| تو کونم نمیره | Too koonam ne mire | It does not go to my ass. | I can't believe it | - |
| تخم سگ | Tokhme sag | Dog's balls | Naughty | - |
| سرت با کونت بازی میکنه | Saret ba koonet bazi mikone | Your head does game with your ass | You don't have focus | - |
| دودول طلا | Doodool Tala | Golden dick | Well-behaved kid | Brian Parsa's Instagram reel |
| مه موس | Me moos | (Mazani) My ass | The location of any lost item (use this to signal to the speaker "why the hell should I know where your lost item is?") | - |
| خایه مالو سگ بگاد | Khayeh malo sag begad | Ball rubber will be fucked by a dog | Said to a person who's being a kiss-ass (instead of saying it directly, people usually just sing a tune) | the tune, source of the tune |
| برو تو کونم رنگی در آ | Boro too koonam rangi dar a | Go in my ass and exit colorful | I believe very little (or care very little) about what you say | - |
| کیر خر | Kir e khar | Donkey dick | Quiet down. (Said in response to a loud noise or exclamation) | - |
| گاز و گوز نکن | Gaaz o gooz nakon | Don't accelerate and don't fart | Tone it down. | - |
| کس موش چال کردن | Kos moosh chaal kardan | Burying mouse pussy | Engaging in a futile activity. | - |
| خنجر به کس فیل زدن | Khanjar be kos e feel zadan | Daggering an elephant's pussy. | Engaging in a futile and dangerous activity. | - |
| گنده گوزی کردن | Gonde goozi kardan | Farting big farts. | Excessively exaggerating | - |
| مه موس ته دوربینه | Me moos te durbin e | (Mazani) My ass is your telescope | You're paying attention to me too much | - |
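For readers who want to work with the table programmatically, the five-column layout described above can be read into records. The following is a minimal sketch in Python, not part of the repository: the `Construction` class, its field names, and the `parse_table` helper are hypothetical names chosen for illustration, and the sketch assumes the table is available as Markdown text with no `|` characters inside cells.

```python
from dataclasses import dataclass


@dataclass
class Construction:
    """One table row; the field names are illustrative, not from the repository."""
    original: str         # construction in Persian script
    transliteration: str  # Latin-script transliteration
    literal: str          # word-by-word meaning
    inferred: str         # meaning in actual use
    reference: str        # source, or "-" when none is given


def parse_table(markdown: str) -> list[Construction]:
    """Parse a five-column Markdown table into Construction records."""
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # not a table row
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) != 5:
            continue  # malformed row, skipped rather than guessed at
        # Skip the header row and the |----|----| separator row.
        if cells[0] == "Construction" or set(cells[0]) <= set("-: "):
            continue
        rows.append(Construction(*cells))
    return rows


# Two rows copied from the table, used as sample input.
SAMPLE = """\
| Construction | Construction (transliterated) | Literal Meaning | Inferred Meaning | References |
|----------|----------|----------|----------|----------|
| تخم سگ | Tokhme sag | Dog's balls | Naughty | - |
| گنده گوزی کردن | Gonde goozi kardan | Farting big farts. | Excessively exaggerating | - |
"""

records = parse_table(SAMPLE)
for record in records:
    print(record.transliteration, "->", record.inferred)
```

Keeping the literal and inferred meanings as separate fields makes it straightforward to compare the two, which is the contrast the dataset is built around.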
Owner
- Name: Sihan Chen
- Login: cshnican
- Kind: user
- Location: Cambridge, MA
- Company: MIT Brain and Cognitive Sciences
- Twitter: cshnican
- Repositories: 2
- Profile: https://github.com/cshnican
GitHub Events
Total
- Issue comment event: 1
- Member event: 1
- Push event: 31
- Pull request event: 2
- Create event: 3
Last Year
- Issue comment event: 1
- Member event: 1
- Push event: 31
- Pull request event: 2
- Create event: 3
Dependencies
- 1154 dependencies
- @docusaurus/module-type-aliases 3.7.0 development
- @docusaurus/types 3.7.0 development
- @docusaurus/core 3.7.0
- @docusaurus/preset-classic 3.7.0
- @mdx-js/react ^3.0.0
- clsx ^2.0.0
- prism-react-renderer ^2.3.0
- react ^19.0.0
- react-dom ^19.0.0