https://github.com/anthesevenants/gij-bent

Scripts for the article 'Zijt gij dat of bent gij dat?' – Een alternantiestudie van de tweede persoon enkelvoud van zijn in Vlaamse tussentaal.

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.3%) to scientific vocabulary
Last synced: 10 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: AntheSevenants
  • Language: TeX
  • Default Branch: master
  • Homepage:
  • Size: 21.6 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created about 3 years ago · Last pushed over 1 year ago
Metadata Files
Readme

README.md

gij bent

Scripts for the article 'Zijt gij dat of bent gij dat?' – Een alternantiestudie van de tweede persoon enkelvoud van zijn in Vlaamse tussentaal.

This repository houses all the scripts that were used for the analysis in my article on gij bent in Dutch. There are three general components:

  1. Tweet retrieval and sorting. This was done in Python using snscrape. Unfortunately, the library no longer works for Twitter, so you won't be able to replicate my output. All files pertaining to this process are in the root directory.
  2. Statistical analysis. This was done in R. All files for the analysis can be found in the analysis/ directory.
  3. Article. The Quarto document with the reporting can be found in the paper/ directory. It uses files from analysis/.

Tweet retrieval and sorting

All .py files in the root directory are used for the tweet retrieval and sorting process. These scripts are included for transparency only. You can no longer run them to reproduce the dataset, for two reasons:

  1. It is no longer possible to use the Twitter API / scrape data from Twitter.
  2. I cannot share the dataset in full, because it contains personal information.

These are the files:

  1. 1-retrieve-tweets.py: used to query Twitter. Output is written to jsonl files in output/. Now defunct.
  2. 2-sort-tweets.py: used to create a TSV dataset from the jsonl files.
  3. 3-geoguess.py: used to attach geolocation to every tweet. Outputs to a separate geo information dataset.
  4. 4-gender-detect.py: used to guess the gender of tweet authors. Outputs to a separate gender information dataset.
  5. 5-correct.py: used to find incorrectly retrieved tweets. Outputs a meta information dataset.
  6. 6-merge.py: used to merge all datasets together and filter wrong tweets. Outputs a final dataset.
  7. 7-anonymise.py: used to anonymise the dataset so it can be shared without personal data. Outputs the anonymised final dataset.
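
To illustrate the kind of conversion step 2 performs, here is a minimal sketch of turning jsonl records into TSV rows. It is not the code from 2-sort-tweets.py: the function name `jsonl_to_tsv` and the field names `id`, `date` and `content` are assumptions made for the example, and the real scripts may select and clean fields differently.

```python
import csv
import io
import json


def jsonl_to_tsv(jsonl_lines, columns, out_file):
    """Write one TSV row per JSON line, keeping only the given columns.

    Hypothetical helper for illustration; returns the number of rows written.
    """
    writer = csv.DictWriter(
        out_file,
        fieldnames=columns,
        delimiter="\t",
        extrasaction="ignore",
        lineterminator="\n",
    )
    writer.writeheader()
    count = 0
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        # Collapse tabs and newlines inside values so they cannot break the TSV layout
        row = {col: " ".join(str(record.get(col, "")).split()) for col in columns}
        writer.writerow(row)
        count += 1
    return count


# Made-up sample records (the field names are assumptions, not the real schema)
sample = [
    '{"id": 1, "content": "gij bent\\tlaat", "date": "2021-01-01"}',
    "",
    '{"id": 2, "content": "zijt gij\\ndaar?", "date": "2021-01-02", "extra": true}',
]
buffer = io.StringIO()
n_rows = jsonl_to_tsv(sample, ["id", "date", "content"], buffer)
tsv_text = buffer.getvalue()
print(tsv_text)
```

Collapsing whitespace inside each value keeps every tweet on a single line, which makes the resulting file easier to load in R later on.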

Statistical analysis

All .R files in the analysis/ directory are used for statistical analysis. Every file is designed to be embedded in the report, except for gij-bent-gam2.R.

  • geo-map.R: prints a map of Flanders. Embedded in the report.
  • gij-bent2.R: loads the dataset. Embedded in the report.
  • kloeke.R: prints a map of the Low Countries with forms for 'you are' in dialect. Embedded in the report.
  • gij-bent-gam2.R: prints the map of gij bent. Not embedded in the report, because generating the map takes about four minutes. Run this script first to generate the image, which the report then includes.

Article

I wrote the article in Quarto. The idea of Quarto is that you write your paper once, which you can then export to HTML, Word and PDF. The paper is generated dynamically, and all regression analyses, graphs and numbers are included on the fly. It uses the files from analysis/.

Reproducibility

I have anonymised the dataset with tweets. If you need to consult the full dataset with personal information, send me an email.
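
As a rough idea of what anonymising such a dataset can involve, the sketch below replaces usernames with salted hash pseudonyms and drops a directly identifying column. This is an assumption-laden illustration, not the method used in 7-anonymise.py: the column names `username`, `display_name` and `form`, the helper names, and the salt are all hypothetical.

```python
import hashlib


def pseudonymise(value, salt):
    """Map a username to a stable pseudonym using a salted SHA-256 hash."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return "user_" + digest[:12]


def anonymise_rows(rows, salt, id_column="username", drop_columns=("display_name",)):
    """Return copies of the rows without personal columns and with pseudonymous ids.

    Hypothetical helper; the column names are assumptions for this example.
    """
    anonymised = []
    for row in rows:
        clean = {key: value for key, value in row.items() if key not in drop_columns}
        clean[id_column] = pseudonymise(row[id_column], salt)
        anonymised.append(clean)
    return anonymised


# Made-up rows standing in for the merged dataset
rows = [
    {"username": "alice", "display_name": "Alice A.", "form": "bent"},
    {"username": "bob", "display_name": "Bob B.", "form": "zijt"},
    {"username": "alice", "display_name": "Alice A.", "form": "zijt"},
]
result = anonymise_rows(rows, salt="example-salt")
for row in result:
    print(row)
```

Because the same username always maps to the same pseudonym, author-level effects can still be modelled after anonymisation. Tweet text itself can also contain personal information, so a real pipeline would likely need to handle that separately.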

Owner

  • Name: Anthe Sevenants
  • Login: AntheSevenants
  • Kind: user
  • Location: Leuven, Belgium
  • Company: KU Leuven

AI & linguistics master. Linguistics PhD candidate @QLVL

GitHub Events

Total
  • Push event: 1
Last Year
  • Push event: 1