semanticaianalysis
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file (found)
- ✓ codemeta.json file (found)
- ✓ .zenodo.json file (found)
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (8.5%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: DayJay1992
- Language: HTML
- Default Branch: main
- Size: 7.72 MB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 1
Metadata Files
README.md
This project is an attempt to highlight differences between German AI-generated academic texts and human academic texts in the field of Germanic linguistics by semantic means. In a nutshell, the project consists of two main scripts (NLTK/scripts/main.py and NLTK/scripts/clustering.py), each with a different focus. Additional scripts produce more convenient overviews of the data.
The corpus consists of human texts and AI-generated texts from different models, each generated with two different prompts:

| Service | Models | Prompts |
| :---------------- | :------: | ----: |
| OpenAI | ChatGPT4o and ChatGPTo1 | A and B |
| Google | Gemini Flash 2.0 and Flash 2.0 Thinking | A and B |
| Deepseek | Deepseek V3 and R1 | A and B |
Each human text consists of the introduction of a peer-reviewed academic paper from the field of Germanic linguistics. All human texts can loosely be categorized as syntactic papers dealing with the left periphery of the sentence (V2, V3, pre-prefield, etc.), so they are relatively similar but not topically identical. The AI models were prompted to generate introductions to the exact same topics. Two separate prompts were used, producing two independent texts for each human text. Prompt A was a simple prompt in the style of "Generate an academic introduction in the field of linguistics about ((topic))". Prompt B was more specific, explicitly asking for an academic tone, Harvard quotation style, and an academic structure for the introduction. For each human text, a total of 12 AI texts were generated on exactly the same topic (6 models × 2 prompts). 25 human texts were extracted, resulting in a combined total of 325 texts in the corpus. The corpus can be found in /NLTK/scripts/corpus/texte.json.
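The corpus arithmetic above (25 human texts plus their AI counterparts) can be verified in a short sketch; the constant names are illustrative and do not appear in the project's code:

```python
# Corpus size sanity check: 25 human introductions, each paired with
# 6 models x 2 prompts = 12 AI-generated counterparts.
HUMAN_TEXTS = 25
MODELS = 6   # ChatGPT4o, ChatGPTo1, Gemini Flash 2.0, Flash 2.0 Thinking, Deepseek V3, R1
PROMPTS = 2  # prompt A (simple) and prompt B (detailed)

ai_texts = HUMAN_TEXTS * MODELS * PROMPTS   # 300 AI texts
total = HUMAN_TEXTS + ai_texts              # 325 texts overall
print(total)  # 325
```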
main.py reads, filters, tokenizes, and lemmatizes all texts, counts the total occurrences of all relevant words and n-grams, and calculates some stylometric features. The result is saved in textanalysegesamt.xlsx. The script does several additional runs in which it performs the same analysis but counts adjectives (ADJ), adverbs (ADV), nouns (NOUN), and verbs (VERB) separately for convenience. The Excel files only contain the top 100 lemmas of all models for performance reasons. The complete analysis is saved in /NLTK/scripts/uniquelemmataoutput/ as a .txt file for each model and POS. The language model used is `de_core_news_lg` from the spaCy package.
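The counting step of main.py can be sketched as follows. In the real script, tokenization and lemmatization come from spaCy's `de_core_news_lg`; here a whitespace tokenizer and a toy lemma table stand in so the example stays self-contained, and the function names are hypothetical:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy lemma table; in main.py this mapping comes from spaCy (de_core_news_lg).
LEMMAS = {"sätze": "satz", "säle": "saal"}

def count_lemmas_and_bigrams(texts):
    lemma_counts, bigram_counts = Counter(), Counter()
    for text in texts:
        tokens = [LEMMAS.get(t, t) for t in text.lower().split()]
        lemma_counts.update(tokens)
        bigram_counts.update(ngrams(tokens, 2))
    return lemma_counts, bigram_counts

lemmas, bigrams = count_lemmas_and_bigrams(["die Sätze", "die Säle", "die Sätze"])
print(lemmas.most_common(1))  # [('die', 3)]
```

Ranking the counters with `most_common(100)` would correspond to the top-100 cutoff used in the Excel output.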
clustering.py tries to put all lemmas into categories of lemmas with similar semantic meaning. "Semantic meaning" here is the embedding vector assigned to each lemma by the `de_core_news_lg` model: if two lemmas have a cosine similarity of at least 0.7, they are put into the same semantic category. Then the occurrences of each category in every text type are counted. The results and the global categories are written to the NLTK/scripts/Kategorisierungen*-folders. Again, one run takes all POS into account (NLTK/scripts/KategorisierungenAlle), but there are additional runs for each POS (and different combinations of POS, such as adjectives and adverbs) separately.
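The grouping rule can be sketched with a greedy pass over toy vectors. The 0.7 threshold is the one stated above; the greedy first-match assignment and the 2-d example embeddings are assumptions for illustration, since clustering.py's exact grouping strategy is not described here (real vectors would come from `de_core_news_lg`):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def greedy_categories(vectors, threshold=0.7):
    """Assign each lemma to the first category whose seed is similar enough."""
    categories = []  # list of (seed_lemma, member_lemmas)
    for lemma, vec in vectors.items():
        for seed, members in categories:
            if cosine(vectors[seed], vec) >= threshold:
                members.append(lemma)
                break
        else:
            categories.append((lemma, [lemma]))
    return categories

# Toy 2-d "embeddings": the first two point the same way, the third is orthogonal.
vecs = {"haus": (1.0, 0.1), "gebäude": (0.9, 0.2), "laufen": (0.0, 1.0)}
print(greedy_categories(vecs))
```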
abweichungen_kategorien.py: based on the files produced by clustering.py, this script creates a plot that displays the top 30 over- and underrepresented lemmas compared to their average appearance. The plot can be found in the respective NLTK/scripts/Kategorisierungen*-folders as abweichungenkategorienplot.png. It creates plots for all NLTK/scripts/Kategorisierungen*-folders automatically.
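The over/under-representation measure can be sketched as the difference between one model's category count and the average across all models. This is a hypothetical simplification of what abweichungen_kategorien.py plots; the actual script may normalize differently:

```python
def deviations(counts_per_model, model):
    """Deviation of one model's category counts from the cross-model average."""
    result = {}
    for cat in counts_per_model[model]:
        avg = sum(m.get(cat, 0) for m in counts_per_model.values()) / len(counts_per_model)
        result[cat] = counts_per_model[model][cat] - avg
    return result

# Toy category counts for two text sources.
counts = {
    "human":   {"bewegung": 4, "struktur": 2},
    "chatgpt": {"bewegung": 1, "struktur": 8},
}
dev = deviations(counts, "chatgpt")
top = sorted(dev, key=dev.get, reverse=True)  # most overrepresented first
print(top)  # ['struktur', 'bewegung']
```

Taking the first and last 30 entries of such a ranking would yield the top over- and underrepresented items shown in the plot.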
Heatmap_Kategorien.py: based on the files produced by clustering.py, this script creates an interactive heatmap that displays the top 100 occurrences of each semantic category in each model. The heatmaps can be found in the respective NLTK/scripts/Kategorisierungen*-folders as interaktiveheatmap.html. It creates heatmaps for all NLTK/scripts/Kategorisierungen*-folders automatically.
Acknowledgments
- Yannic Pixberg, for his contribution of texts to the corpus database
Owner
- Login: DayJay1992
- Kind: user
- Repositories: 1
- Profile: https://github.com/DayJay1992
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: Wegerhoff
given-names: Dennis
orcid: https://orcid.org/0009-0006-7904-6117
title: "Semantic Analysis of AI Generated text"
version: 0.4-Beta
identifiers:
- type: doi
value:
date-released: 2025-05-02
GitHub Events
Total
- Release event: 1
- Push event: 13
- Create event: 4
Last Year
- Release event: 1
- Push event: 13
- Create event: 4