semanticaianalysis
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file (found)
- ✓ codemeta.json file (found)
- ✓ .zenodo.json file (found)
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (8.5%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: DayJay1992
- Language: HTML
- Default Branch: main
- Size: 7.72 MB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 1
Metadata Files
README.md
This project is an attempt to highlight differences between German AI-generated academic texts and human academic texts in the field of Germanic linguistics by semantic means. In a nutshell, the project consists of two main scripts (NLTK/scripts/main.py and NLTK/scripts/clustering.py), each with a different focus. Additional scripts produce more convenient overviews of the data.
The corpus consists of human texts and AI-generated texts from different models, each generated with two different prompts:

| Service | Models | Prompts |
| :---------------- | :------: | ----: |
| OpenAI | ChatGPT4o and ChatGPTo1 | A and B |
| Google | Gemini Flash 2.0 and Flash 2.0 Thinking | A and B |
| Deepseek | Deepseek V3 and R1 | A and B |
Each human text consists of the introduction of a peer-reviewed academic paper from the field of Germanic linguistics. All human texts can loosely be categorized as syntactic papers dealing with the left periphery of the sentence (V2, V3, pre-prefield, etc.), so they are relatively similar but not topically identical. The AI models were prompted to generate introductions to the exact same topics. Two separate prompts were used, producing two independent texts for each human text. Prompt A was a simple prompt in the style of "Generate an academic introduction in the field of linguistics about ((topic))". Prompt B was more specific, explicitly asking for an academic tone, Harvard quotation style, and an academic structure for the introduction. For each human text, a total of 12 AI texts were generated on exactly the same topic (6 models × 2 prompts). 25 human texts were extracted, resulting in a combined total of 325 texts in the corpus. The corpus can be found in /NLTK/scripts/corpus/texte.json.
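The corpus arithmetic above (25 human texts plus their AI counterparts) can be verified in a short sketch; the constant names are illustrative and do not appear in the project's code:

```python
# Corpus size sanity check: 25 human introductions, each paired with
# 6 models x 2 prompts = 12 AI-generated counterparts.
HUMAN_TEXTS = 25
MODELS = 6   # ChatGPT4o, ChatGPTo1, Gemini Flash 2.0, Flash 2.0 Thinking, Deepseek V3, R1
PROMPTS = 2  # prompt A (simple) and prompt B (detailed)

ai_texts = HUMAN_TEXTS * MODELS * PROMPTS   # 300 AI texts
total = HUMAN_TEXTS + ai_texts              # 325 texts overall
print(total)  # 325
```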
main.py reads, filters, tokenizes, and lemmatizes all texts, counts the total occurrences of all relevant words and n-grams, and calculates some stylometric features. The result is saved in textanalysegesamt.xlsx. The script does several additional runs in which it performs the same analysis but counts adjectives (ADJ), adverbs (ADV), nouns (NOUN), and verbs (VERB) separately for convenience. The Excel files only contain the top 100 lemmas of all models for performance reasons. The complete analysis is saved in /NLTK/scripts/uniquelemmataoutput/ as a .txt file for each model and POS. The language model used is `de_core_news_lg` from the spaCy package.
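The counting step of main.py can be sketched as follows. In the real script, tokenization and lemmatization come from spaCy's `de_core_news_lg`; here a whitespace tokenizer and a toy lemma table stand in so the example stays self-contained, and the function names are hypothetical:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy lemma table; in main.py this mapping comes from spaCy (de_core_news_lg).
LEMMAS = {"sätze": "satz", "säle": "saal"}

def count_lemmas_and_bigrams(texts):
    lemma_counts, bigram_counts = Counter(), Counter()
    for text in texts:
        tokens = [LEMMAS.get(t, t) for t in text.lower().split()]
        lemma_counts.update(tokens)
        bigram_counts.update(ngrams(tokens, 2))
    return lemma_counts, bigram_counts

lemmas, bigrams = count_lemmas_and_bigrams(["die Sätze", "die Säle", "die Sätze"])
print(lemmas.most_common(1))  # [('die', 3)]
```

Ranking the counters with `most_common(100)` would correspond to the top-100 cutoff used in the Excel output.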
clustering.py tries to put all lemmas into categories of lemmas with similar semantic meaning. "Semantic meaning" here is the embedding vector assigned to each lemma by the `de_core_news_lg` model: if two lemmas have a cosine similarity of at least 0.7, they are put into the same semantic category. Then the occurrences of each category in every text type are counted. The results and the global categories are written to the NLTK/scripts/Kategorisierungen*-folders. Again, one run takes all POS into account (NLTK/scripts/KategorisierungenAlle), but there are additional runs for each POS (and different combinations of POS, such as adjectives and adverbs) separately.
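The grouping rule can be sketched with a greedy pass over toy vectors. The 0.7 threshold is the one stated above; the greedy first-match assignment and the 2-d example embeddings are assumptions for illustration, since clustering.py's exact grouping strategy is not described here (real vectors would come from `de_core_news_lg`):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def greedy_categories(vectors, threshold=0.7):
    """Assign each lemma to the first category whose seed is similar enough."""
    categories = []  # list of (seed_lemma, member_lemmas)
    for lemma, vec in vectors.items():
        for seed, members in categories:
            if cosine(vectors[seed], vec) >= threshold:
                members.append(lemma)
                break
        else:
            categories.append((lemma, [lemma]))
    return categories

# Toy 2-d "embeddings": the first two point the same way, the third is orthogonal.
vecs = {"haus": (1.0, 0.1), "gebäude": (0.9, 0.2), "laufen": (0.0, 1.0)}
print(greedy_categories(vecs))
```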
abweichungen_kategorien.py: based on the files produced by clustering.py, this script creates a plot that displays the top 30 over- and underrepresented lemmas compared to their average appearance. The plot can be found in the respective NLTK/scripts/Kategorisierungen*-folders as abweichungenkategorienplot.png. It creates plots for all NLTK/scripts/Kategorisierungen*-folders automatically.
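The over/under-representation measure can be sketched as the difference between one model's category count and the average across all models. This is a hypothetical simplification of what abweichungen_kategorien.py plots; the actual script may normalize differently:

```python
def deviations(counts_per_model, model):
    """Deviation of one model's category counts from the cross-model average."""
    result = {}
    for cat in counts_per_model[model]:
        avg = sum(m.get(cat, 0) for m in counts_per_model.values()) / len(counts_per_model)
        result[cat] = counts_per_model[model][cat] - avg
    return result

# Toy category counts for two text sources.
counts = {
    "human":   {"bewegung": 4, "struktur": 2},
    "chatgpt": {"bewegung": 1, "struktur": 8},
}
dev = deviations(counts, "chatgpt")
top = sorted(dev, key=dev.get, reverse=True)  # most overrepresented first
print(top)  # ['struktur', 'bewegung']
```

Taking the first and last 30 entries of such a ranking would yield the top over- and underrepresented items shown in the plot.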
Heatmap_Kategorien.py: based on the files produced by clustering.py, this script creates an interactive heatmap that displays the top 100 occurrences of each semantic category in each model. The heatmaps can be found in the respective NLTK/scripts/Kategorisierungen*-folders as interaktiveheatmap.html. It creates heatmaps for all NLTK/scripts/Kategorisierungen*-folders automatically.
Acknowledgments
- Yannic Pixberg, for his contribution of texts to the corpus database
Owner
- Login: DayJay1992
- Kind: user
- Repositories: 1
- Profile: https://github.com/DayJay1992
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: Wegerhoff
given-names: Dennis
orcid: https://orcid.org/0009-0006-7904-6117
title: "Semantic Analysis of AI Generated text"
version: 0.4-Beta
identifiers:
- type: doi
value:
date-released: 2025-05-02
GitHub Events
Total
- Release event: 1
- Push event: 13
- Create event: 4
Last Year
- Release event: 1
- Push event: 13
- Create event: 4