extract_urls_mentions_hashtags

This method extracts URLs, mentions, and hashtags from a piece of text.

https://github.com/bda-kts/extract_urls_mentions_hashtags

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (6.4%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: BDA-KTS
  • License: apache-2.0
  • Language: Jupyter Notebook
  • Default Branch: master
  • Size: 83 KB
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created over 1 year ago · Last pushed 7 months ago
Metadata Files
Readme · License · Citation

README.md

Extract Entities (URLs, Mentions, Hashtags)

Description

The method extracts entities from social media posts: URLs, hashtags, cashtags ($ and €), mentions (including Mastodon mentions), quoted text, punctuation marks, punctuation emphasis (e.g., !!!), all-caps words, negations, time expressions (e.g., today, next week), and emojis. It is a simple method that relies only on regular expressions to detect these entities. It reads posts from a CSV file (one post per row) and writes a CSV file with the post text in the first column and each type of extracted entity in its own column.
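To make the regex-based approach concrete, here is a minimal sketch for a few of the entity types. The patterns and the function name extract_entities are assumptions chosen for illustration; they are not taken from entity_extractor.py, whose expressions may differ.

```python
import re

# Hypothetical patterns for illustration only; the repository may use different ones.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")
# @user, optionally followed by @instance.domain (Mastodon-style mention).
MENTION_RE = re.compile(r"(?<![\w@])@\w+(?:@[\w-]+(?:\.[\w-]+)+)?")
HASHTAG_RE = re.compile(r"(?<!\w)#\w+")
# A currency sign ($ or €) followed by a short run of letters.
CASHTAG_RE = re.compile(r"(?<!\w)[$€][A-Za-z]{1,6}\b")
# Two or more consecutive ! or ? characters.
EMPHASIS_RE = re.compile(r"[!?]{2,}")
# Words made of two or more uppercase ASCII letters.
ALLCAPS_RE = re.compile(r"\b[A-Z]{2,}\b")


def extract_entities(text):
    """Return a dict mapping an entity type to the list of matches in text."""
    urls = URL_RE.findall(text)
    # Blank out URLs first so their characters are not matched by other patterns.
    without_urls = URL_RE.sub(" ", text)
    return {
        "Mentions": MENTION_RE.findall(without_urls),
        "Hashtags": HASHTAG_RE.findall(without_urls),
        "Cashtags": CASHTAG_RE.findall(without_urls),
        "URLs": urls,
        "Punctuationemphasis": EMPHASIS_RE.findall(without_urls),
        "Allcaps": ALLCAPS_RE.findall(without_urls),
    }


example = '@bob@infosec.exchange #Crypto €BMW "Let’s go!" https://t.co/xyz123 😀'
print(extract_entities(example))
```

Removing URLs before applying the other patterns is one simple way to avoid counting fragments of a link as separate entities; other designs are possible.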

Use Cases

This method can be used to extract entities mentioned in social media posts (e.g., URLs, hashtags, and emojis) to support the analysis of social behavior among user groups.

Input Data

The input is a CSV file, data/input_social_posts.csv, with one social media post per row. The following are a few examples:

|Posts|
|---------|
|"@bob@infosec.exchange #Crypto €BMW ""Let’s go!"" https://t.co/xyz123 😀"|
|#Startups 💡 $GOOG https://t.co/xyz123 @dave@mastodon.social 'Not sure about this'|
|@bob@mastodon.social $AAPL 'This is amazing' 😀 #Crypto https://news.site/article|
|"@dave@infosec.exchange ""Exciting times ahead!"" https://t.co/xyz123 €BMW #AI 😀"|
|#AI @bob@mastodon.social €ETH 🚀 'Not sure about this' https://news.site/article|
|...|
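Rows that contain double quotes are stored with CSV quoting (the whole field wrapped in quotes, inner quotes doubled), so a CSV parser is needed rather than a plain line split. A minimal sketch with the standard library follows; the helper name read_posts and the assumption that the header row is exactly Posts are illustrative, and the in-memory sample stands in for data/input_social_posts.csv.

```python
import csv
import io

# Two sample rows copied from the examples above. In practice the handle would be
# open("data/input_social_posts.csv", newline="", encoding="utf-8").
SAMPLE = (
    "Posts\n"
    '"@bob@infosec.exchange #Crypto €BMW ""Let’s go!"" https://t.co/xyz123 😀"\n'
    "#Startups 💡 $GOOG https://t.co/xyz123 @dave@mastodon.social 'Not sure about this'\n"
)


def read_posts(handle, column="Posts"):
    """Return the text of every post found in the given CSV file handle."""
    reader = csv.DictReader(handle)
    return [row[column] for row in reader]


posts = read_posts(io.StringIO(SAMPLE))
for post in posts:
    print(post)
```

After parsing, the doubled quotes in the first row collapse to single double quotes, which is the text an extraction step would actually see.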

Output Data

The method writes its output to a CSV file, data/output_posts_with_entities.csv. The first column holds the original post text, and each following column holds one type of entity extracted from that text. Each column value is a list of zero or more entities; the list is empty when a post contains no entity of that type.

| Posts | Mentions | Hashtags | Cashtags | URLs | Quotedtext | Punctuations | Punctuationemphasis | Allcaps | Negations | Timeexpressions | Emojis |
|:-----:|:--------:|:--------:|:--------:|:----:|:-----------:|:------------:|:--------------------:|:--------:|:---------:|:----------------:|:------:|
| "@bob@infosec.exchange #Crypto €BMW ""Let’s go!"" https://t.co/xyz123 😀" | ['@bob@infosec.exchange'] | ['#Crypto'] | ['€BMW'] | ['https://t.co/xyz123'] | "['""Let’s go!""']" | [@, @, ., #, €, ", ’, !, ", :, /, /, ., /, 😀] | [] | ['BMW'] | [] | [] | ['😀'] |
| "#Startups 💡 \$GOOG https://t.co/xyz123 @dave@mastodon.social 'Not sure about this'" | ['@dave@mastodon.social'] | ['#Startups'] | ['\$GOOG'] | ['https://t.co/xyz123'] | "[""'Not sure about this'""]" | [#, 💡, \$, :, /, /, ., /, @, @, ., ', '] | [] | ['GOOG'] | ['not'] | [] | ['💡'] |
| @bob@mastodon.social \$AAPL 'This is amazing' 😀 #Crypto https://news.site/article | ['@bob@mastodon.social'] | ['#Crypto'] | ['\$AAPL'] | ['https://news.site/article'] | "[""'This is amazing'""]" | [@, @, ., $, ', ', 😀, #, :, /, /, ., /] | [] | ['AAPL'] | [] | [] | ['😀'] |
| "@dave@infosec.exchange ""Exciting times ahead!"" https://t.co/xyz123 €BMW #AI 😀" | ['@dave@infosec.exchange'] | ['#AI'] | ['€BMW'] | ['https://t.co/xyz123'] | "['""Exciting times ahead!""']" | [@, @, ., ", !, ", :, /, /, ., /, €, #, 😀] | [] | ['BMW', 'AI'] | [] | ['times'] | ['😀'] |
| #AI @bob@mastodon.social €ETH 🚀 'Not sure about this' https://news.site/article | ['@bob@mastodon.social'] | ['#AI'] | ['€ETH'] | ['https://news.site/article'] | "[""'Not sure about this'""]" | [#, @, @, ., €, 🚀, ', ', :, /, /, ., /] | [] | ['AI', 'ETH'] | ['not'] | [] | ['🚀'] |
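Because each entity cell holds a list serialized as text, downstream code has to turn the string back into a list. The sketch below assumes the lists are stored as Python list literals (as the quoted columns such as Mentions and Hashtags suggest) and uses an in-memory buffer in place of data/output_posts_with_entities.csv; how the Punctuations column is actually serialized is not clear from the table, so that column is left out.

```python
import ast
import csv
import io

# One illustrative row; the column names follow the table above.
rows = [
    {
        "Posts": "#AI @bob@mastodon.social €ETH",
        "Mentions": ["@bob@mastodon.social"],
        "Hashtags": ["#AI"],
        "Cashtags": ["€ETH"],
    }
]
list_columns = ("Mentions", "Hashtags", "Cashtags")

# Write: store each list as its Python literal, e.g. "['#AI']".
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Posts", *list_columns])
writer.writeheader()
for row in rows:
    writer.writerow({k: repr(v) if isinstance(v, list) else v for k, v in row.items()})

# Read back: parse each list cell safely with ast.literal_eval (never eval).
buffer.seek(0)
loaded = []
for row in csv.DictReader(buffer):
    parsed = {"Posts": row["Posts"]}
    for column in list_columns:
        parsed[column] = ast.literal_eval(row[column])
    loaded.append(parsed)

print(loaded[0]["Mentions"])
```

ast.literal_eval only accepts Python literals, so it is a safer choice than eval for cells that come from a file.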

Hardware Requirements

The method runs on a small cloud virtual machine (2 x86 CPU cores, 4 GB RAM, 40 GB HDD).

Environment Setup

Install the dependencies listed in requirements.txt with the command pip install -r requirements.txt to set up the working environment.

Please note that the requirements.txt contains fixed versions that require Python 3.10. If you want to use it with newer versions of Python, update the requirements.txt accordingly (or unpin the versions).

Alternatively, the repository includes a Pipfile for installing the method with pipenv.

How to Use

Run the code cells of the index.ipynb notebook.

The notebook imports its functions from entity_extractor.py.

Contact

For queries, please contact taimoor.khan@gesis.org.

Owner

  • Name: BDA-KTS
  • Login: BDA-KTS
  • Kind: organization

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Khan
    given-names: Muhammad Taimoor
    orcid: https://orcid.org/0000-0002-6542-9217
title: "Extract Entities (URLs, Mentions, Hashtags)"
version: 1.0
identifiers:
  - type: doi
    value: 
date-released: 2025-06-25

GitHub Events

Total
  • Issues event: 2
  • Issue comment event: 7
  • Push event: 15
  • Public event: 1
  • Pull request event: 1
  • Create event: 1
Last Year
  • Issues event: 2
  • Issue comment event: 7
  • Push event: 15
  • Public event: 1
  • Pull request event: 1
  • Create event: 1

Dependencies

requirements.txt pypi
  • emoji ==1.4.1
  • json5 ==0.9.28
  • regex ==2024.11.6