Science Score: 44.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (13.2%) to scientific vocabulary
Keywords
Repository
TRACE - Text Reuse Analysis and Comparison Engine
Basic Info
Statistics
- Stars: 6
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 2
Topics
Metadata Files
README.md
TRACE - Text Reuse Analysis and Comparison Engine
TRACE is a simple Python script that compares the similarities between text files using two methods: shingling and sentence embeddings. You specify the directories containing the .txt files, and TRACE also creates a network graph of the text similarities to show how the different texts are related. The analysis results are stored in a JSON file (similarityresults.json). TRACE can be useful for tasks such as plagiarism detection, document clustering, or identifying related documents in large text collections.
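The score that the embedding method thresholds is, presumably, cosine similarity between document vectors (the usual metric for SentenceTransformer embeddings; the README does not state this explicitly). A minimal sketch with made-up vectors:

```python
import math

# Cosine similarity between two document vectors. The vectors here are
# made up for illustration; real ones would come from an embedding model.
def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Parallel vectors score 1.0 (maximally similar).
print(round(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]), 3))  # 1.0
```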
USAGE
TRACE takes the following command-line arguments:
--dir1 (required): Path to the first directory containing .txt files
--dir2 (required): Path to the second directory containing .txt files
--s (optional, default=5): Number of words or characters per shingle
--mode (optional, default="word"): Shingling mode, either 'word' or 'character'
--t (optional, default=1): Threshold for minimum number of shared shingles
--m (optional): Use SentenceTransformer model. If specified without a value, it uses the default model "intfloat/multilingual-e5-large"
--similarity_threshold (optional, default=0.85): Similarity threshold for SentenceTransformer comparison
These arguments allow users to customize the text comparison process, including the directories to compare, the shingling method, and whether to use additional semantic similarity analysis with a SentenceTransformer model.
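An interface like the one described above can be sketched with `argparse`; the parser below is illustrative and may differ from the actual implementation in trace.py.

```python
import argparse

# Illustrative sketch of TRACE's command-line interface (not trace.py itself).
parser = argparse.ArgumentParser(
    description="TRACE - Text Reuse Analysis and Comparison Engine")
parser.add_argument("--dir1", required=True, help="first directory of .txt files")
parser.add_argument("--dir2", required=True, help="second directory of .txt files")
parser.add_argument("--s", type=int, default=5, help="words/characters per shingle")
parser.add_argument("--mode", choices=["word", "character"], default="word",
                    help="shingling mode")
parser.add_argument("--t", type=int, default=1,
                    help="minimum number of shared shingles")
# nargs="?" lets --m be passed without a value, falling back to the default model
parser.add_argument("--m", nargs="?", const="intfloat/multilingual-e5-large",
                    default=None, help="SentenceTransformer model name")
parser.add_argument("--similarity_threshold", type=float, default=0.85,
                    help="threshold for SentenceTransformer comparison")

args = parser.parse_args(["--dir1", "a", "--dir2", "b", "--m"])
print(args.m)  # intfloat/multilingual-e5-large
```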
```python
python trace.py --dir1 /path/to/first/directory --dir2 /path/to/second/directory --s 5 --mode word --t 3 --m "sentence-transformers/all-MiniLM-L6-v2" --similarity_threshold 0.8
```
What this command will do:
- Process all .txt files in both specified directories, creating 5-word shingles for each file.
- Perform an initial comparison using these shingles, identifying pairs of texts that share at least 3 shingles.
- For the text pairs that pass the initial shingling comparison, it will use the specified SentenceTransformer model to calculate semantic similarity between the complete documents.
- Keep only the text pairs whose semantic similarity score is above 0.8.
- Generate a network graph visualizing the similarities between texts.
- Save the results to a JSON file.
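The first stage of the pipeline above, word shingling with a shared-shingle threshold, can be sketched as follows; the function names are illustrative and not taken from trace.py.

```python
# Illustrative word-shingling comparison (not the actual trace.py code).
def word_shingles(text, s=5):
    """Return the set of consecutive s-word shingles in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + s]) for i in range(len(words) - s + 1)}

def shared_shingles(text_a, text_b, s=5, t=3):
    """Return the shared shingles if the pair meets threshold t, else None."""
    common = word_shingles(text_a, s) & word_shingles(text_b, s)
    return common if len(common) >= t else None

# Made-up snippets echoing the sample result below the threshold stage.
a = "in nomine domini nostri iesu christi amen"
b = "gratia domini nostri iesu christi vobiscum"
print(shared_shingles(a, b, s=4, t=1))  # {'domini nostri iesu christi'}
```

Only pairs that survive this cheap filter go on to the more expensive embedding comparison.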
result.json (sample)
```json
[
  {
    "dir1_file": "2Thessalonians_chapter_3.txt",
    "dir2_file": "PannHOSB_1320_IV_charter.txt",
    "similarity": 0.9021626114845276,
    "shared_shingles": [
      "domini nostri iesu christi"
    ],
    "num_shared_shingles": 1
  }
]
```
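Because the output is plain JSON, post-processing is straightforward. A short sketch that filters the strongest matches, using the field names from the sample entry above:

```python
import json

# The entry below is the sample result shown above; field names come from it.
sample = """[
  {"dir1_file": "2Thessalonians_chapter_3.txt",
   "dir2_file": "PannHOSB_1320_IV_charter.txt",
   "similarity": 0.9021626114845276,
   "shared_shingles": ["domini nostri iesu christi"],
   "num_shared_shingles": 1}
]"""
results = json.loads(sample)

# Keep only the strongest matches.
strong = [r for r in results if r["similarity"] > 0.9]
for r in strong:
    print(r["dir1_file"], "<->", r["dir2_file"], round(r["similarity"], 3))
```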
Network graph
The generated GEXF file can be opened in a variety of graph visualization applications, such as Gephi, Cytoscape, and Orange. These applications let you view the network data as a graph and explore the relationships between nodes and edges. The script also generates a preliminary visualization image (textsimilaritynetwork.png).
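GEXF is just XML, so the shape of the exported network can be sketched with the standard library alone; TRACE presumably uses a graph library for this, and the writer below only illustrates the file format.

```python
import xml.etree.ElementTree as ET

# Minimal stdlib sketch of a GEXF export (illustrative, not trace.py's code).
def write_gexf(edges, path):
    """Write (source, target, weight) triples as an undirected GEXF graph."""
    gexf = ET.Element("gexf", xmlns="http://gexf.net/1.3", version="1.3")
    graph = ET.SubElement(gexf, "graph", defaultedgetype="undirected")
    nodes_el = ET.SubElement(graph, "nodes")
    edges_el = ET.SubElement(graph, "edges")
    # One node per file name appearing in any edge.
    for name in sorted({n for a, b, _ in edges for n in (a, b)}):
        ET.SubElement(nodes_el, "node", id=name, label=name)
    for i, (a, b, w) in enumerate(edges):
        ET.SubElement(edges_el, "edge", id=str(i), source=a, target=b,
                      weight=str(w))
    ET.ElementTree(gexf).write(path, encoding="utf-8", xml_declaration=True)

# Export the single pair from the sample result above.
write_gexf([("2Thessalonians_chapter_3.txt",
             "PannHOSB_1320_IV_charter.txt", 0.902)],
           "text_similarity_network.gexf")
```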

Owner
- Login: kreeedit
- Kind: user
- Location: Graz, Austria
- Company: University of Graz
- Website: https://orcid.org/0000-0002-3913-2946
- Twitter: tms_kovacs
- Repositories: 3
- Profile: https://github.com/kreeedit
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Kovács
    given-names: Tamás
    orcid: https://orcid.org/1234-5678-9101-1121
title: "TRACE: Text Reuse Analysis and Comparison Engine"
version: 0.1.4
doi: 10.5281/zenodo.8183258
date-released: 2023-07-25
GitHub Events
Total
- Watch event: 1
- Push event: 3
Last Year
- Watch event: 1
- Push event: 3