lassywoefwaf
What's in the Lassy corpora? You tell me!
Basic Info
- Host: GitHub
- Owner: AntheSevenants
- License: agpl-3.0
- Language: Python
- Default Branch: main
- Size: 57.6 KB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 1
Metadata Files
README.md
LassyWoefWaf
What's in the Lassy corpora? You tell me!
I was surprised to learn that the Lassy Klein and Lassy Groot corpora do not ship with information on what region the materials in the corpus come from. This region information is out there (mostly), but one has to collect bits and pieces from different locations. With the scripts in this repository, I aim to assign a region (either 'Belgium', 'The Netherlands' or 'Unknown') to each document in each of the two corpora.
If you are not interested in reproducing my results, you can download the extracted region information from the Releases.
My sources
- Gertjan van Noord has put together an overview of the contents of Lassy Klein and Lassy Groot. From this overview, region information can be inferred (but not for every section).
- The Lassy corpora share a lot of documents with SONAR. SONAR does ship with meta information on its documents, so we can use the information from SONAR for Lassy.
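As an illustration of the three-label scheme ('Belgium', 'The Netherlands', 'Unknown'), the sketch below maps a raw country string onto one of those labels. It is only a sketch: the function name `normalise_region` and the accepted spellings are assumptions made for this example, and are not taken from the scripts in this repository or from the SONAR metadata.

```python
from typing import Optional

# Hypothetical spellings; the real metadata may use different values.
_BELGIUM = {"be", "belgium", "belgië", "belgie"}
_NETHERLANDS = {"nl", "netherlands", "the netherlands", "nederland"}


def normalise_region(raw: Optional[str]) -> str:
    """Map a raw country string to 'Belgium', 'The Netherlands' or 'Unknown'."""
    if raw is None:
        return "Unknown"
    key = raw.strip().lower()
    if key in _BELGIUM:
        return "Belgium"
    if key in _NETHERLANDS:
        return "The Netherlands"
    # Anything unrecognised (including an empty string) stays 'Unknown'.
    return "Unknown"
```

Falling back to 'Unknown' rather than guessing keeps documents without usable metadata clearly separated from the rest.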
Prerequisites
Corpora
To extract Lassy Klein region information, you only need Lassy Klein:
- Lassy Klein
Warning: the version offered by the IVDNT does not contain the meta files! I do not know why!
To extract Lassy Groot region information, you need Lassy Groot and SONAR:
- Lassy Groot
- SONAR
Preparation
These instructions only have to be run once.
- Download and install Python.
- Clone the repository: `git clone https://github.com/AntheSevenants/LassyWoefWaf.git`, or download and unzip this archive.
- Open a terminal window. Navigate to the `LassyWoefWaf` directory: `cd LassyWoefWaf`
- Create a new virtual environment: `python -m venv venv` or `python3 -m venv venv`
- Activate the virtual environment: `venv\Scripts\activate` (Windows) or `source venv/bin/activate` (unix)
- Install all dependencies: `pip install -r requirements.txt`
Running
These instructions need to be followed every time you want to use the LassyWoefWaf program.
- Open a terminal window. Navigate to the `LassyWoefWaf` directory: `cd LassyWoefWaf`
- Activate the virtual environment: `venv\Scripts\activate` (Windows) or `source venv/bin/activate` (unix)
- You can now run any of the scripts detailed below.
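If you are unsure whether the virtual environment is active, a small check such as the following can confirm it before you run a script. This is a minimal sketch and not part of the repository; the helper name `in_virtual_env` is made up for this example. It relies on the fact that inside an environment created with `venv`, `sys.prefix` differs from `sys.base_prefix`.

```python
import sys


def in_virtual_env() -> bool:
    """Return True when the interpreter runs inside a venv-created environment."""
    # In a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    if in_virtual_env():
        print("Virtual environment active:", sys.prefix)
    else:
        print("No virtual environment detected; activate venv first.")
```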
Extracting region information
Lassy Klein
To extract the region information for Lassy Klein, I used a combination of Gertjan van Noord's overview and the CMDI files included in the copy of Lassy Klein that was given to me by the CCL. If you download Lassy Klein from the link above, these CMDI files are not included. Maybe they come from somewhere else, but I do not have the time to track down their origin. Unfortunately, this means that for Lassy Klein, my region extraction process becomes less reproducible (unless you also get a copy of Lassy Klein from the CCL).
In any case, this is the command you should run to infer Lassy Klein region information (given that you have the CCL version):
```bash
python3 LassyKlein.py "/path/to/lassy klein/"
```

Lassy Groot
To extract the region information for Lassy Groot, I used the CMDI files included in SONAR. You do not need any special internal CCL versions of these corpora, so this process is maximally reproducible (given that you have the patience to download and extract these corpora).
This is the command you should run to infer Lassy Groot region information:
```bash
python3 LassyGroot.py "/path_to_lassy_groot/data/" "/path_to_sonar/SONAR500/CMDI/"
```
The script will use all your cores to compute the region information as fast as possible.
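For readers curious how per-file work can be spread over all cores in Python, one common approach is `multiprocessing.Pool`. The sketch below is not the code in `LassyGroot.py`: `process_file` and `process_all` are hypothetical names, and the per-file work is a placeholder that reads no real metadata.

```python
import multiprocessing
from pathlib import Path
from typing import Dict, List, Tuple


def process_file(path: str) -> Tuple[str, str]:
    """Placeholder for per-document work; returns (document id, region)."""
    # Assumption: real code would read metadata here. This sketch only
    # derives an id from the file name and reports an unknown region.
    return Path(path).stem, "Unknown"


def process_all(paths: List[str]) -> Dict[str, str]:
    """Run process_file over all paths in parallel and collect the results."""
    # Pool() without arguments sizes itself to the CPU count reported by the OS.
    with multiprocessing.Pool() as pool:
        return dict(pool.map(process_file, paths))


if __name__ == "__main__":
    print(process_all(["a/doc1.xml", "b/doc2.xml"]))
```

The `if __name__ == "__main__":` guard matters here: on platforms that start worker processes by re-importing the main module, leaving it out can cause the pool to be created again in every worker.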

Owner
- Name: Anthe Sevenants
- Login: AntheSevenants
- Kind: user
- Location: Leuven, Belgium
- Company: KU Leuven
- Website: anthe.sevenants.net
- Repositories: 39
- Profile: https://github.com/AntheSevenants
AI & linguistics master. Linguistics PhD candidate @QLVL
Citation (citation.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Sevenants
    given-names: Anthe
    orcid: https://orcid.org/0000-0002-5055-770X
title: "LassyWoefWaf"