latentsemanticindexing_irproject

Final project of the Information Retrieval Course

https://github.com/eferos93/latentsemanticindexing_irproject

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.9%) to scientific vocabulary

Repository

Final project of the Information Retrieval Course

Basic Info
  • Host: GitHub
  • Owner: eferos93
  • License: mit
  • Language: Scala
  • Default Branch: master
  • Size: 42.9 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created over 4 years ago · Last pushed over 4 years ago
Metadata Files
Readme License Citation

README.md

Information Retrieval System using Latent Semantic Indexing and Apache Spark

Final project of the Information Retrieval course of the master's degree in Data Science and Scientific Computing.

Project Requirements

Write an IR system that uses latent semantic indexing to answer queries.

| Requirement | Satisfied |
| ----------- | --------- |
| The system must accept free-form text queries. | :heavy_check_mark: |
| Evaluate the system on a set of queries… | :heavy_check_mark: |
| …and try to use different dimensions for the dimensionality reduction. | :heavy_check_mark: |
| The system must be able to save the data structures used for querying. | :heavy_check_mark: |
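To illustrate how a free-form query can be answered with latent semantic indexing, here is a minimal sketch in plain Scala. It is not the repository's Spark code. It assumes a truncated SVD of the term-document matrix has already been computed, and every name in it (`LsiQuerySketch`, `foldIn`, `rank`, `uK`, `sigmaK`, `docsK`) is hypothetical. The query is projected into the k-dimensional latent space as Σ_k⁻¹ U_kᵀ q and documents are ranked by cosine similarity.

```scala
// Hypothetical sketch: all names here are illustrative, not taken from the repository.
object LsiQuerySketch {
  type Vec = Array[Double]

  // Project a term-space query into the k-dimensional latent space:
  // qK = Sigma_k^{-1} * U_k^T * q, where uK has one row per term and k columns.
  def foldIn(query: Vec, uK: Array[Vec], sigmaK: Vec): Vec =
    sigmaK.indices.map { j =>
      val dot = query.indices.map(i => query(i) * uK(i)(j)).sum
      dot / sigmaK(j)
    }.toArray

  // Cosine similarity; defined as 0 when either vector has zero norm.
  def cosine(a: Vec, b: Vec): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norm = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (norm == 0.0) 0.0 else dot / norm
  }

  // Document indices ordered by decreasing similarity to the query.
  // docsK holds one k-dimensional vector per document (the rows of V_k).
  def rank(query: Vec, uK: Array[Vec], sigmaK: Vec, docsK: Array[Vec]): Seq[Int] = {
    val qK = foldIn(query, uK, sigmaK)
    docsK.indices.sortBy(d => -cosine(qK, docsK(d)))
  }
}
```

In this sketch the query vector would come from tokenising the free-form text and weighting it the same way as the indexed documents. Saving `uK`, `sigmaK` and `docsK` would be enough to answer later queries without recomputing the decomposition.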

Data

For this project we used two datasets: CMU Movie Summary Corpus and NPL collection.

The NPL collection is already included in this repository, while the CMU Movie Summary Corpus must be downloaded from the link above (extract everything into the data/ directory).

Running the program

There are two applications: RankOptimisation and MovieSummariesTest. The first uses the NPL corpus (which also comes with a set of queries and their corresponding relevance sets) to evaluate the optimal number of singular values, while the second applies the results obtained on the NPL corpus to the CMU Movie Summary Corpus.
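Choosing the number of singular values from queries with known relevance sets could look like the following sketch. The README does not say which metric RankOptimisation uses. Mean average precision is assumed here purely for illustration, and the names (`RankSelectionSketch`, `averagePrecision`, `bestRank`) are hypothetical.

```scala
// Hypothetical sketch: the metric (MAP) and all names are assumptions, not the repository's code.
object RankSelectionSketch {
  // Average precision of one ranked result list against its relevance set.
  def averagePrecision(ranked: Seq[Int], relevant: Set[Int]): Double =
    if (relevant.isEmpty) 0.0
    else {
      val (_, sum) = ranked.zipWithIndex.foldLeft((0, 0.0)) {
        case ((hits, acc), (doc, i)) =>
          // Add precision at this position only when the document is relevant.
          if (relevant(doc)) (hits + 1, acc + (hits + 1).toDouble / (i + 1))
          else (hits, acc)
      }
      sum / relevant.size
    }

  // Mean average precision over (ranked results, relevance set) pairs, one per query.
  def meanAveragePrecision(results: Seq[(Seq[Int], Set[Int])]): Double =
    if (results.isEmpty) 0.0
    else results.map { case (r, rel) => averagePrecision(r, rel) }.sum / results.size

  // Pick the candidate number of singular values with the best score.
  // `evaluate` stands in for "build the index with k values and compute MAP";
  // `candidates` must be non-empty.
  def bestRank(candidates: Seq[Int], evaluate: Int => Double): Int =
    candidates.maxBy(evaluate)
}
```

With a function that rebuilds the reduced index for a given k and scores it over the NPL queries, `bestRank` would return the k to reuse on the movie summaries.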

There are two ways to run the project: in IntelliJ IDEA or from a terminal via SBT. Note that, for an unknown reason, performance is better when running the program inside IntelliJ.

Running on IntelliJ IDEA

To run either application, right-click it and select Run. IntelliJ will create a run configuration, but it needs to be modified: stop the program and edit the configuration, adding the VM options -Xms10g -Xmx10g. [Screenshot: run configuration with the VM options set]

Similarly, you can run a Scala REPL: right-click any class and select Scala REPL. As before, stop the program and add the same flags to the Scala REPL configuration. [Screenshot: Scala REPL configuration with the VM options set]

Running on SBT (not recommended)

To run with SBT, first install it. Then open a terminal in the project directory and run sbt to open the sbt console. Run compile, then either console to start a Scala REPL, or run, in which case the terminal will prompt you to select which application to run.

The memory settings are already configured in the .sbtopts file in the repository.
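Whichever way the program is started, the heap settings can be checked from a Scala REPL. The snippet below is a small sketch using the standard `Runtime` API; the object name `HeapCheck` is hypothetical.

```scala
// Hypothetical helper: reports the maximum heap the running JVM may use.
object HeapCheck {
  private val BytesPerGiB: Double = 1024.0 * 1024.0 * 1024.0

  // Runtime.maxMemory returns the heap limit in bytes.
  def maxHeapGiB: Double = Runtime.getRuntime.maxMemory.toDouble / BytesPerGiB

  // Human-readable summary, e.g. for printing in the REPL.
  def report: String = f"Max heap: $maxHeapGiB%.1f GiB"
}
```

Calling `println(HeapCheck.report)` should show a value of roughly 10 GiB when the -Xmx10g flag has been picked up. The exact figure can be slightly lower depending on the JVM and garbage collector.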

System Requirements

  • At least 16 GB of RAM
  • JDK 1.8 or newer
  • Scala plugin for IntelliJ IDEA

Owner

  • Name: Eros Fabrici
  • Login: eferos93
  • Kind: user
  • Location: Athens
  • Company: Athena Research Center

Marie Curie PhD Student @ Athena Research Center and @ Universitat Politècnica de Catalunya in Data Engineering for Data Science

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Fabrici
    given-names: Eros
title: "Information Retrieval System using Latent Semantic Indexing with Apache Spark"
version: 0.1
date-released: 2021-10-28
