tweetloc
tweetloc - Finding Misspelled Location Names in Tweets
Repository
Basic Info
- Host: GitHub
- Owner: victorskl
- License: apache-2.0
- Language: Java
- Default Branch: master
- Size: 199 KB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 1
- Releases: 1
Metadata Files
README.md
About
The tweetloc application attempts to perform approximate string searching, experiments with several string matching algorithms (see below), and observes their effectiveness. It uses a gazetteer dictionary to approximately match possible location names in Twitter users' tweets.
Ideally, string matching algorithms perform an exact or best match on the two strings being compared. However, the main goal of tweetloc is to find possibly misspelled location names in tweets.
Input Data Assumptions
To run and experiment with the tweetloc application, you need two input data files in the following formats.
Gazetteer Dictionary
- Download free gazetteer data from GeoNames, e.g. US.zip
- Process it to extract only the asciiname column (field) data, e.g. US.txt
- Optionally, sort the location names and remove duplicates.
- Save this gazetteer data in plain text .txt format and specify it in config.properties, gazetteer=path/to/file.txt
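The extraction, sorting, and de-duplication steps above can be sketched as follows. This is a minimal illustration, not part of tweetloc: the class and method names are hypothetical, the sample rows are made up, and it assumes asciiname is the third tab-separated column of the GeoNames dump.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper, not part of tweetloc.
final class GazetteerExtractSketch {

    // Assumption: asciiname is the third tab-separated column (index 2)
    // of a GeoNames country dump such as US.txt.
    static final int ASCIINAME_COLUMN = 2;

    /** Returns the sorted, de-duplicated asciiname values found in the dump lines. */
    static List<String> extractNames(List<String> dumpLines) {
        TreeSet<String> names = new TreeSet<>(); // sorts and removes duplicates
        for (String line : dumpLines) {
            String[] columns = line.split("\t", -1);
            if (columns.length > ASCIINAME_COLUMN) {
                String name = columns[ASCIINAME_COLUMN].trim();
                if (!name.isEmpty()) {
                    names.add(name);
                }
            }
        }
        return new ArrayList<>(names);
    }

    public static void main(String[] args) {
        // Made-up rows in the assumed layout: id, name, asciiname, remaining columns.
        List<String> sample = Arrays.asList(
                "1\tSan Francisco\tSan Francisco\t",
                "2\tMelbourne\tMelbourne\t",
                "3\tSan Francisco\tSan Francisco\t");
        System.out.println(extractNames(sample));
    }
}
```

Writing the returned list one name per line would give a gazetteer file in the plain text form described above.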
Tweet Data
- Harvest public tweet data from Twitter users, e.g. using twitter4j with the Twitter API
And process it into the following format:
user_id (tab) tweet_id (tab) tweet_text (tab) timestamp (newline)
Save it in plain text .txt format and specify it in config.properties, tweets=path/to/file.txt
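As an illustration of the tweet file format, the sketch below splits one line into its four tab-separated fields. The class name is hypothetical and the sample values are invented. The timestamp is kept as a raw string because its exact format is not specified here.

```java
import java.util.Optional;

// Hypothetical holder for one line of the tweets file, not part of tweetloc.
final class TweetLineSketch {
    final String userId;
    final String tweetId;
    final String text;
    final String timestamp; // kept as a raw string; the format is not specified

    private TweetLineSketch(String userId, String tweetId, String text, String timestamp) {
        this.userId = userId;
        this.tweetId = tweetId;
        this.text = text;
        this.timestamp = timestamp;
    }

    /** Parses one line; returns empty if it does not have exactly four tab-separated fields. */
    static Optional<TweetLineSketch> parse(String line) {
        String[] fields = line.split("\t", -1);
        if (fields.length != 4) {
            return Optional.empty();
        }
        return Optional.of(new TweetLineSketch(fields[0], fields[1], fields[2], fields[3]));
    }

    public static void main(String[] args) {
        // Invented sample values for illustration only.
        Optional<TweetLineSketch> tweet = parse("42\t1001\tGreetings from Melborne\t2016-09-01");
        tweet.ifPresent(t -> System.out.println(t.userId + " said: " + t.text));
    }
}
```

Rejecting lines with the wrong field count is a design choice of this sketch. It also means tweet text containing a raw tab character would need to be cleaned during harvesting.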
Building Source
At its core, tweetloc makes use of several key libraries.
Please refer to tweetloc/pom.xml for details.
tweetloc can be built with Maven.
cd tweetloc
mvn clean
mvn test
mvn package
The build artifacts can be found under the tweetloc/target folder.
Configuration
- Open config.properties and configure all the paths.
- Adjust other parameters. Default values are a good starting point.
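For orientation, the sketch below reads the two path keys named earlier (gazetteer and tweets) using the JDK's java.util.Properties. This is a stand-in chosen to keep the example self-contained. It is not how tweetloc itself loads its configuration, and the class name is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical stand-in using the JDK only; not tweetloc's own config loader.
final class ConfigSketch {

    /** Parses properties-format text into key/value pairs. */
    static Properties parse(String propertiesText) {
        Properties properties = new Properties();
        try {
            properties.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return properties;
    }

    public static void main(String[] args) {
        // Placeholder paths, as in the input data section.
        Properties config = parse("gazetteer=path/to/file.txt\ntweets=path/to/file.txt\n");
        System.out.println("gazetteer -> " + config.getProperty("gazetteer"));
        System.out.println("tweets    -> " + config.getProperty("tweets"));
    }
}
```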
Running First Time
If config.properties is not in the same directory as tweetloc.jar, pass the -c option, e.g.
java -jar tweetloc.jar -c /path/to/config.properties [... and other options]
Preprocess Gazetteer (Run once)
java -jar tweetloc.jar -d 1 --preprocess gaze
Partition Tweets (Optional, if Tweets corpus is very big)
java -jar tweetloc.jar --preprocess parted
Dry Run
To dry run the first few tweets with GED
java -jar tweetloc.jar -d 3
Algorithms
You can pass the -a option to specify the algorithm.
java -jar tweetloc.jar -a led -d 2
The following string matching algorithms are implemented. More algorithms can be added by implementing the StringSearch.java interface.
- ged = Global Edit Distance (default if no -a is passed)
- led = Local Edit Distance
- ngm = N-Gram Distance
- sdx = Soundex
- nbh = Neighbourhood Search (Agrep wrapper)
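To give a feel for one of the listed measures, here is a minimal sketch of an n-gram distance. It assumes the common definition |G(s)| + |G(t)| - 2 x |G(s) ∩ G(t)| over padded character n-grams, with '#' as the padding character. The definition tweetloc actually uses may differ, and the class name is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of an n-gram distance; tweetloc's own definition may differ.
final class NGramDistanceSketch {

    /** Counts the character n-grams of s after padding both ends with n-1 '#' characters. */
    static Map<String, Integer> grams(String s, int n) {
        StringBuilder padded = new StringBuilder();
        for (int i = 0; i < n - 1; i++) padded.append('#');
        padded.append(s);
        for (int i = 0; i < n - 1; i++) padded.append('#');

        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= padded.length(); i++) {
            counts.merge(padded.substring(i, i + n), 1, Integer::sum);
        }
        return counts;
    }

    /** Distance = |G(s)| + |G(t)| - 2 * |G(s) intersect G(t)|, treating n-grams as multisets. */
    static int distance(String s, String t, int n) {
        Map<String, Integer> a = grams(s, n);
        Map<String, Integer> b = grams(t, n);
        int sizeA = 0;
        int sizeB = 0;
        int shared = 0;
        for (int count : a.values()) sizeA += count;
        for (int count : b.values()) sizeB += count;
        for (Map.Entry<String, Integer> entry : a.entrySet()) {
            shared += Math.min(entry.getValue(), b.getOrDefault(entry.getKey(), 0));
        }
        return sizeA + sizeB - 2 * shared;
    }

    public static void main(String[] args) {
        System.out.println(distance("melborne", "melbourne", 2));
    }
}
```

Identical strings score 0 under this definition, and a lower score means a closer match.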
Specify Output File
java -jar tweetloc.jar -a ngm -o output_ngm.csv -d 5
Start index at 15 and run 5 more lines
java -jar tweetloc.jar -a sdx -o output_sdx.csv -i 15 -d 5
Upper/lower limit (low-pass/high-pass filter)
e.g. Run GED with cost [m=1, i,r,d=-1] with max score not more than 30
java -jar tweetloc.jar -zz 30
e.g. Run GED with cost [m=0, i,r,d=1] (Levenshtein distance) with min score not lower than 2
java -jar tweetloc.jar -xx 2
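The two cost schemes mentioned above can be illustrated with one dynamic-programming routine. With [m=1, i,r,d=-1] the best alignment has the highest score, while with [m=0, i,r,d=1] (Levenshtein distance) it has the lowest. The sketch below is a generic textbook formulation with hypothetical names, not tweetloc's implementation, and it does not model the -zz and -xx limits.

```java
// Hypothetical sketch of global edit distance with configurable costs.
final class GlobalEditDistanceSketch {

    /**
     * Fills the standard alignment table for s and t.
     * match, indel and replace are the per-operation scores; when maximise
     * is true the best cell value is the largest, otherwise the smallest.
     */
    static int score(String s, String t, int match, int indel, int replace, boolean maximise) {
        int[][] table = new int[s.length() + 1][t.length() + 1];
        for (int i = 1; i <= s.length(); i++) table[i][0] = i * indel; // delete all of s
        for (int j = 1; j <= t.length(); j++) table[0][j] = j * indel; // insert all of t

        for (int i = 1; i <= s.length(); i++) {
            for (int j = 1; j <= t.length(); j++) {
                boolean same = s.charAt(i - 1) == t.charAt(j - 1);
                int diagonal = table[i - 1][j - 1] + (same ? match : replace);
                int delete = table[i - 1][j] + indel;
                int insert = table[i][j - 1] + indel;
                table[i][j] = maximise
                        ? Math.max(diagonal, Math.max(delete, insert))
                        : Math.min(diagonal, Math.min(delete, insert));
            }
        }
        return table[s.length()][t.length()];
    }

    /** Levenshtein distance: [m=0, i,r,d=1], minimised. */
    static int levenshtein(String s, String t) {
        return score(s, t, 0, 1, 1, false);
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("melborne", "melbourne"));
        System.out.println(score("melborne", "melbourne", 1, -1, -1, true));
    }
}
```

Under the Levenshtein scheme a small number means a close match, whereas under [m=1, i,r,d=-1] a close match has a score near the length of the strings, which is why the two schemes call for opposite kinds of limit.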
Single word matching
Run GED with single-word matching (i.e. individual tokens) against the dictionary, e.g. San Francisco becomes 'San' and 'Francisco'. The default is multi-word aware matching using a chunking heuristic.
java -jar tweetloc.jar --single
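The difference between the two modes can be pictured with the sketch below, which generates candidate strings from a piece of text. It assumes, purely for illustration, that multi-word candidates are contiguous runs of up to maxWords tokens. The chunking heuristic tweetloc really uses is not described here, and the names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical candidate generator; tweetloc's chunking heuristic may work differently.
final class CandidateSketch {

    /**
     * Returns every contiguous run of 1..maxWords whitespace-separated tokens.
     * maxWords = 1 corresponds to single-word matching.
     */
    static List<String> candidates(String text, int maxWords) {
        List<String> result = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return result;
        }
        String[] tokens = trimmed.split("\\s+");
        for (int start = 0; start < tokens.length; start++) {
            StringBuilder phrase = new StringBuilder();
            for (int length = 1; length <= maxWords && start + length <= tokens.length; length++) {
                if (length > 1) phrase.append(' ');
                phrase.append(tokens[start + length - 1]);
                result.add(phrase.toString());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(candidates("San Francisco", 1)); // single-word candidates
        System.out.println(candidates("San Francisco", 2)); // also includes the two-word phrase
    }
}
```

Each candidate would then be compared against the gazetteer entries with the chosen string matching algorithm.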
Running As a Job
It is a good idea to run tweetloc as a background job with screen on Linux.
screen
java -jar tweetloc.jar -a ged -o output_ged_00.csv &
[ctrl + a, d]
tail -f app.log
[ctrl + c]
screen -r
[ctrl + a, d]
Notes
This assignment work was done for the COMP90049 Project 1 assessment, 2016 SM2, The University of Melbourne. You can read the report for background context, though it focuses more on the data that I worked with. You may also want to read about the related tweetlocml assignment. The implementation still has room for improvement. You may cite this work as follows.
Zenodo:
LaTeX/BibTeX:
@misc{sanl1,
author = {Lin, San Kho},
title = {tweetloc - Finding Misspelled Location Names in Tweets},
year = {yyyy},
url = {https://github.com/victorskl/tweetloc},
urldate = {yyyy-mm-dd}
}
Further Reading:
Owner
- Name: Victor San Kho Lin
- Login: victorskl
- Kind: user
- Location: Melbourne
- Company: University of Melbourne
- Website: https://sankholin.com
- Twitter: vskl
- Repositories: 28
- Profile: https://github.com/victorskl
https://twitter.com/vskl https://keybase.io/victorskl https://www.linkedin.com/in/victorskl https://orcid.org/0000-0002-3940-4729
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Lin"
    given-names: "Victor San Kho"
    orcid: "https://orcid.org/0000-0002-3940-4729"
title: "tweetloc - Finding Misspelled Location Names in Tweets"
version: v201611
doi: 10.5281/zenodo.7240495
date-released: 2022-10-23
url: "https://github.com/victorskl/tweetloc"
Dependencies
- org.apache.logging.log4j:log4j-bom 2.6.2 import
- args4j:args4j 2.33
- com.opencsv:opencsv 3.8
- commons-beanutils:commons-beanutils 1.9.4
- commons-codec:commons-codec 1.10
- commons-io:commons-io 2.5
- org.apache.commons:commons-collections4 4.1
- org.apache.commons:commons-configuration2 2.0
- org.apache.logging.log4j:log4j-api
- org.apache.logging.log4j:log4j-core
- org.apache.lucene:lucene-analyzers-common 8.6.3
- org.apache.lucene:lucene-core 8.6.3
- org.apache.lucene:lucene-suggest 8.6.3
- org.apache.opennlp:opennlp-tools 1.6.0
- junit:junit 4.13.1 test
- org.hamcrest:hamcrest-library 1.3 test