semantic-search-over_social-media-posts

The method implements semantic search on social media posts for a given query using cosine similarity of their embedded vectors

https://github.com/bda-kts/semantic-search-over_social-media-posts

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.3%) to scientific vocabulary

Keywords

semantic-similarity-post-retrieval

Repository


Basic Info
  • Host: GitHub
  • Owner: BDA-KTS
  • License: mit
  • Language: Jupyter Notebook
  • Default Branch: main
  • Homepage:
  • Size: 42.4 MB
Statistics
  • Stars: 0
  • Watchers: 2
  • Forks: 1
  • Open Issues: 1
  • Releases: 0
Topics
semantic-similarity-post-retrieval
Created over 1 year ago · Last pushed 7 months ago
Metadata Files
Readme License Citation

README.md

Semantic Search Over Social Media Posts (Tweets)

Description

This method lets users run a semantic search over a collection of social media posts (e.g., tweets) and retrieve the posts most relevant to a given query. It computes sentence embeddings for the posts and for the input query, then extracts the most similar posts using cosine similarity.

Use Cases

This method is designed to identify social media posts (e.g., tweets) that are semantically similar to a given query. It is particularly useful for tasks such as:

  • Topic Exploration: Finding posts related to specific topics like social media, gender issues, or elections by comparing the semantic meaning of the query with the content of the posts.
  • Content Filtering: Extracting posts that match the intent or context of a query, even if the exact keywords are not present in the posts.
  • Trend Analysis: Analyzing discussions around a theme by retrieving posts that are contextually relevant to a set of predefined queries.

The method uses sentence embeddings to represent both the query and the posts, and ranks the posts based on their cosine similarity to the query. This ensures that the retrieved posts are not just keyword matches but are semantically aligned with the query's meaning.

Input Data

The input query can be a word, a phrase, a sentence, or a whole social media post. To run multiple queries, list them in data/input_queries.txt, one query per line. A single query can be provided directly in the notebook semantic-search-over_social-media-posts.ipynb. The input file contains the following query terms:

social media ,community interaction,cultural identity
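As a minimal sketch of how such a query file could be parsed (not the notebook's actual code), the hypothetical helper `read_queries` below accepts both layouts seen in this README: one query per line, and the comma-separated line shown above. It strips stray whitespace and drops empty entries.

```python
import re
import tempfile
from pathlib import Path


def read_queries(path):
    """Read queries from a text file, one per line and/or comma-separated.

    Hypothetical helper for illustration; the notebook may parse the file differently.
    """
    text = Path(path).read_text(encoding="utf-8")
    parts = re.split(r"[,\n]", text)
    # Strip surrounding whitespace and drop empty entries.
    return [part.strip() for part in parts if part.strip()]


# Demonstration on a temporary file holding the line shown above.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "input_queries.txt"
    demo_path.write_text(
        "social media ,community interaction,cultural identity\n",
        encoding="utf-8",
    )
    queries = read_queries(demo_path)

print(queries)
```

Note that splitting on commas would also break up a query that itself contains a comma (for example, a full post used as a query), so a strict one-query-per-line reader may be preferable in that case.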

The corpus to search is a collection of social media posts (e.g., tweets) in JSON format. For demonstration, we use the NLTK sample tweets (corpora/tweets.20150430-223406.json).

Output Data

After running all the cells in semantic-search-over_social-media-posts.ipynb, the results are saved as a JSON file in data/output.json. Each retrieved post has the following fields:

  • Post ID: The unique identifier of the social media post.
  • Post Text: The content of the post.
  • Similarity Score: A numerical value (ranging from 0 to 1) indicating how closely the post matches the input query.
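For reference, the standard definition of cosine similarity between a query embedding and a post embedding is given below. The symbols $\mathbf{q}$, $\mathbf{p}$, and the dimension $d$ are chosen here for illustration and do not come from the repository.

```latex
\[
\operatorname{sim}(\mathbf{q}, \mathbf{p})
  = \frac{\mathbf{q} \cdot \mathbf{p}}{\lVert \mathbf{q} \rVert \, \lVert \mathbf{p} \rVert}
  = \frac{\sum_{i=1}^{d} q_i \, p_i}
         {\sqrt{\sum_{i=1}^{d} q_i^{2}} \; \sqrt{\sum_{i=1}^{d} p_i^{2}}}
\]
```

In general this value lies in $[-1, 1]$; the 0 to 1 range listed above covers the non-negative part of that interval. Whether the notebook clips or rescales negative values is not stated in this README.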

Below is an excerpt of the output for K=3 with the input queries "social media", "women", and "election". Only the posts retrieved for "social media" are shown:

```json
{
  "social media": [
    {
      "post ID": "13567",
      "post text": "There's something a bit \"dad dancing\" about the way the Tories try to electioneer via social media https://t.co/WH0cmv76VD",
      "sim score": "0.9372139191497816"
    },
    {
      "post ID": "9732",
      "post text": "It's extremely comforting to know that the power of mainstream media has been diluted by social media? #SNP",
      "sim score": "0.9371564729455584"
    },
    {
      "post ID": "18324",
      "post text": "@mmaher70 @RichardJMurphy So why cant they defend the position thats just total incompetence constantly allow Tories to set agenda esp media",
      "sim score": "0.918129503287474"
    }
  ]
}
```
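The following sketch shows one way a structure like the example above could be assembled and serialized. The function `build_output` and the toy posts are hypothetical; only the key names ("post ID", "post text", "sim score") and the string-valued fields are taken from the example output.

```python
import json


def build_output(results, k=3):
    """Keep the k highest-scoring hits per query, using the key names shown above.

    `results` maps each query to a list of (post_id, post_text, score) tuples.
    Hypothetical helper for illustration only.
    """
    output = {}
    for query, hits in results.items():
        top_hits = sorted(hits, key=lambda hit: hit[2], reverse=True)[:k]
        output[query] = [
            # The example output stores IDs and scores as strings, so we do the same.
            {"post ID": str(post_id), "post text": text, "sim score": str(score)}
            for post_id, text, score in top_hits
        ]
    return output


# Toy hits, made up for illustration.
toy_results = {
    "social media": [
        (1, "post a", 0.5),
        (2, "post b", 0.9),
        (3, "post c", 0.7),
        (4, "post d", 0.1),
    ]
}
serialized = json.dumps(build_output(toy_results, k=3), indent=2, ensure_ascii=False)
print(serialized)
```

Writing `serialized` to data/output.json would then produce a file with the same shape as the excerpt.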

Hardware Requirements

The method runs on a small virtual machine from a cloud provider (2 x86 CPU cores, 4 GB RAM, 40 GB HDD).

Environment Setup

  • Python v3.8 (preferably through Anaconda)
  • Using Anaconda:

```bash
conda create -n semantic_search python=3.8
conda activate semantic_search
conda install -c conda-forge notebook
pip install -r requirements.txt
```

  • Using Python:

```bash
python -m venv semantic_search
cd semantic_search
# Windows activation path; on Linux/macOS use: source bin/activate
Scripts\activate
cd ..
pip install -r requirements.txt
```

How to Use

Start Jupyter Lab or Notebook:

```bash
jupyter lab
```

Then open semantic-search-over_social-media-posts.ipynb and run all cells.

Technical Details

This method performs semantic search by computing embeddings for both queries and social media posts, and ranking posts based on their similarity to the queries. Below is a detailed breakdown of the process:

Semantic search:

  • Word Embeddings: The method uses FastText embeddings stored in embeddings/en_embeddings.p.
  • Text-Level Embeddings: Word embeddings are aggregated at the document or query level to produce a single embedding vector for each text unit (query or post).
  • Cosine Similarity: The similarity between the query embeddings and the pre-computed embeddings of the corpus posts is calculated using cosine similarity.
  • Ranking: Posts are ranked in descending order of similarity score, and only the top K results are included in the output.
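The README does not say how word embeddings are aggregated into a text-level vector. A common choice is to average them, and the sketch below assumes exactly that. The function `embed_text`, the whitespace tokenizer, and the toy 2-dimensional vectors are all hypothetical; the real fastText vectors have far more dimensions.

```python
def embed_text(text, word_vectors, dim):
    """Average the vectors of known words to get one vector per text.

    Assumption: mean pooling over lowercased, whitespace-split tokens.
    Words missing from `word_vectors` are skipped; a text with no known
    words maps to the zero vector.
    """
    tokens = text.lower().split()
    known = [word_vectors[token] for token in tokens if token in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Toy word vectors, made up for illustration.
toy_vectors = {"social": [1.0, 0.0], "media": [0.0, 1.0]}
post_vector = embed_text("Social media rocks", toy_vectors, dim=2)
print(post_vector)
```

Averaging ignores word order, which is usually acceptable for short posts but is one reason the scores should be read as a rough signal of topical closeness.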

Workflow

  1. Load the precomputed embeddings for the reference dataset.
  2. Load the input queries and generate their embeddings.
  3. Compute the cosine similarity between each query and every post in the dataset.
  4. Rank posts by similarity and output the K most similar results.
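The four steps above can be sketched end to end as follows. This is an illustration under stated assumptions, not the notebook's code: the pickle file name `post_embeddings.p`, the function names, the `embed` callable, and the toy vectors are all hypothetical.

```python
import heapq
import math
import pickle
import tempfile
from pathlib import Path


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(queries, embed, post_vectors, k):
    """Return, for each query, the k (post_id, score) pairs with the highest score."""
    results = {}
    for query in queries:
        query_vector = embed(query)  # step 2: embed the query
        scored = (  # step 3: score every post against the query
            (cosine(query_vector, vector), post_id)
            for post_id, vector in post_vectors.items()
        )
        top = heapq.nlargest(k, scored)  # step 4: keep the k best
        results[query] = [(post_id, score) for score, post_id in top]
    return results


# Toy data, made up for illustration.
toy_posts = {"p1": [1.0, 0.1], "p2": [0.1, 1.0], "p3": [1.0, 1.0]}
toy_query_vectors = {"media": [1.0, 0.0], "election": [0.0, 1.0]}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "post_embeddings.p"  # hypothetical file name
    path.write_bytes(pickle.dumps(toy_posts))
    loaded_posts = pickle.loads(path.read_bytes())  # step 1: load precomputed embeddings

results = search(["media", "election"], toy_query_vectors.__getitem__, loaded_posts, k=2)
print(results)
```

`heapq.nlargest` avoids sorting the whole corpus when K is small. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.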

[Figure: semantic search workflow]

Contact Details

For questions or feedback, contact Fakhri Momeni via fakhri.momeni@gesis.org.

Owner

  • Name: BDA-KTS
  • Login: BDA-KTS
  • Kind: organization

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Momeni
    given-names: Fakhri
    orcid: https://orcid.org/0000-0002-5572-575X
title: "semantic search over social media posts"
version: 1.0
identifiers:
  - type: 
    value: 
date-released: 2024-11-26

GitHub Events

Total
  • Issues event: 8
  • Issue comment event: 1
  • Push event: 153
  • Pull request event: 2
  • Fork event: 1
Last Year
  • Issues event: 8
  • Issue comment event: 1
  • Push event: 153
  • Pull request event: 2
  • Fork event: 1