latentsemanticindexing_irproject

Final project of the Information Retrieval course

https://github.com/eferos93/latentsemanticindexing_irproject

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: found
✓ codemeta.json file: found
✓ .zenodo.json file: found
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (12.9%) to scientific vocabulary

Repository

Basic Info

Host: GitHub
Owner: eferos93
License: MIT
Language: Scala
Default Branch: master
Size: 42.9 MB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created almost 5 years ago · Last pushed over 4 years ago

Metadata Files

Information Retrieval System using Latent Semantic Indexing and Apache Spark

Final project of the Information Retrieval course of the master's degree in Data Science and Scientific Computing.

Project Requirements

Write an IR system that uses latent semantic indexing to answer queries.

| Requirement | Satisfied |
| ----------- | --------- |
| The system must accept free-form text queries. | :heavy_check_mark: |
| Evaluate the system on a set of queries… | :heavy_check_mark: |
| …and try to use different dimensions for the dimensionality reduction. | :heavy_check_mark: |
| The system must be able to save the data structures used for querying. | :heavy_check_mark: |
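As an illustration of what latent semantic indexing involves, the following is a minimal sketch using Spark MLlib's `RowMatrix.computeSVD`. It is not taken from this repository: the object name `LsiSketch`, the toy document-term matrix, the choice `k = 2`, and the query folding via projection onto `V` are all assumptions made for the example.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.sql.SparkSession

// Hypothetical example object, not part of the project.
object LsiSketch {
  // Cosine similarity between two equally sized dense vectors.
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (norms == 0.0) 0.0 else dot / norms
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("lsi-sketch").master("local[*]").getOrCreate()

    // Toy document-term matrix (assumed data): one row per document, one column per term.
    val docs: Seq[Vector] = Seq(
      Vectors.dense(2.0, 1.0, 0.0, 0.0),
      Vectors.dense(1.0, 2.0, 0.0, 0.0),
      Vectors.dense(0.0, 0.0, 1.0, 2.0)
    )
    // A single partition keeps the row order stable for this small example.
    val matrix = new RowMatrix(spark.sparkContext.parallelize(docs, 1))

    // Truncated SVD keeping the k largest singular values.
    val k = 2
    val svd = matrix.computeSVD(k, computeU = false)

    // Project every document into the k-dimensional latent space (A * V_k).
    val docsInLatentSpace = matrix.multiply(svd.V).rows.collect().map(_.toArray)

    // Fold a free-form query (already converted to a term vector) into the same space.
    val query = Vectors.dense(1.0, 0.0, 0.0, 0.0)
    val queryInLatentSpace = svd.V.transpose.multiply(query).toArray

    // Rank documents by cosine similarity to the query, best first.
    val ranking = docsInLatentSpace.zipWithIndex
      .map { case (doc, index) => (index, cosine(queryInLatentSpace, doc)) }
      .sortBy { case (_, score) => -score }

    ranking.foreach { case (index, score) => println(f"doc $index%d  score $score%.3f") }
    spark.stop()
  }
}
```

Varying `k` in such a sketch corresponds to the "different dimensions for the dimensionality reduction" requirement in the table above.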

Data

For this project we used two datasets: the CMU Movie Summary Corpus and the NPL collection.

The NPL collection is already included in this repository, while the CMU Movie Summary Corpus must be downloaded from the link above (extract everything into the directory data/).

Running the program

There are two applications: RankOptimisation and MovieSummariesTest. The first uses the NPL collection (which also provides a set of queries and their corresponding relevance sets) to evaluate the optimal number of singular values, while the second applies the results obtained there to the CMU Movie Summary Corpus.
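Choosing the number of singular values from a set of queries and relevance sets could be sketched as follows. The README does not state which metric RankOptimisation uses, so mean average precision is an assumption here, and `RankEvaluationSketch`, `search`, and the integer document and query identifiers are hypothetical names for illustration only.

```scala
// Hypothetical helper, not part of the project.
object RankEvaluationSketch {
  // Average precision of one ranked result list against its relevance set.
  def averagePrecision(ranked: Seq[Int], relevant: Set[Int]): Double =
    if (relevant.isEmpty) 0.0
    else {
      val (_, precisionSum) = ranked.zipWithIndex.foldLeft((0, 0.0)) {
        case ((hits, sum), (docId, position)) =>
          // Add precision at this position only when the document is relevant.
          if (relevant.contains(docId)) (hits + 1, sum + (hits + 1).toDouble / (position + 1))
          else (hits, sum)
      }
      precisionSum / relevant.size
    }

  // Mean average precision over all queries that have a relevance set.
  def meanAveragePrecision(results: Map[Int, Seq[Int]], relevance: Map[Int, Set[Int]]): Double = {
    val scores = relevance.map { case (queryId, relevantDocs) =>
      averagePrecision(results.getOrElse(queryId, Seq.empty), relevantDocs)
    }
    if (scores.isEmpty) 0.0 else scores.sum / scores.size
  }

  // Pick the candidate rank whose rankings score best.
  // `search(k)` is assumed to return, for each query id, the ranked document ids at rank k.
  // `ranks` must be non-empty.
  def bestRank(ranks: Seq[Int], search: Int => Map[Int, Seq[Int]], relevance: Map[Int, Set[Int]]): Int =
    ranks.maxBy(k => meanAveragePrecision(search(k), relevance))
}
```

For example, `RankEvaluationSketch.averagePrecision(Seq(1, 2, 3), Set(1, 3))` averages the precision at the two relevant positions, (1/1 + 2/3) / 2.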

There are two ways to run the project: in IntelliJ IDEA or from a terminal via SBT. Note that, for an unknown reason, performance is better when running the program inside IntelliJ.

Running on IntelliJ IDEA

To run one of the two applications, right-click it and select Run. IntelliJ creates a run configuration, which needs to be modified: stop the program and edit the configuration, adding -Xms10g -Xmx10g to the VM options. It should look like the following: [image: run configuration]

Similarly, you can run a Scala REPL: right-click any class and select Scala REPL. As before, stop the program and edit the Scala REPL configuration, adding the same flags mentioned above. It should look like this: [image: Scala REPL configuration]
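One way to confirm that the VM options were picked up is to ask the JVM for its maximum heap size from the REPL. This is a small sketch using the standard `Runtime` API; the object name `HeapCheck` is a hypothetical name chosen for the example.

```scala
// Hypothetical helper for checking the effective heap limit.
object HeapCheck {
  // Maximum heap the JVM will attempt to use, in GiB (driven by -Xmx).
  def maxHeapGiB: Double =
    Runtime.getRuntime.maxMemory.toDouble / (1024L * 1024L * 1024L)

  def main(args: Array[String]): Unit =
    println(f"Max heap: $maxHeapGiB%.1f GiB")
}
```

In the REPL, paste the object and evaluate `HeapCheck.maxHeapGiB`. With -Xmx10g the value should be roughly 10, although the exact figure reported can be slightly lower depending on the garbage collector.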

Running on SBT (not recommended)

To run with SBT, first install it. Then open a terminal in the project directory and run sbt to open the sbt console. Run compile, then either console to start a Scala REPL, or run, which prompts you to select which of the applications to launch.

The memory settings are already configured in the .sbtopts file in the repository.

System Requirements

At least 16 GB of RAM
JDK 1.8 or newer
Scala plugin for IntelliJ IDEA

Owner

Name: Eros Fabrici
Login: eferos93
Kind: user
Location: Athens
Company: Athena Research Center

Website: www.linkedin.com/in/eros-fabrici
Repositories: 65
Profile: https://github.com/eferos93

Marie Curie PhD Student @ Athena Research Center and @ Universitat Politècnica de Catalunya in Data Engineering for Data Science

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Fabrici
    given-names: Eros
title: "Information Retrieval System using Latent Semantic Indexing with Apache Spark"
version: 0.1
date-released: 2021-10-28
