latentsemanticindexing_irproject
Final project of the Information Retrieval Course
https://github.com/eferos93/latentsemanticindexing_irproject
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (12.9%) to scientific vocabulary
Repository
Final project of the Information Retrieval Course
Basic Info
- Host: GitHub
- Owner: eferos93
- License: mit
- Language: Scala
- Default Branch: master
- Size: 42.9 MB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Information Retrieval System using Latent Semantic Indexing and Apache Spark
Final project of the Information Retrieval course of the master's degree in Data Science and Scientific Computing.
Project Requirements
Write an IR system that uses latent semantic indexing to answer queries.

| Requirement | Satisfied |
| ----------- | --------- |
| The system must accept free-form text queries. | :heavy_check_mark: |
| Evaluate the system on a set of queries… | :heavy_check_mark: |
| …and try to use different dimensions for the dimensionality reduction. | :heavy_check_mark: |
| The system must be able to save the data structures used for querying. | :heavy_check_mark: |
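As a rough illustration of how a free-form query can be answered with latent semantic indexing, the sketch below folds a term-space query vector into a k-dimensional latent space via q̂ = Σ_k⁻¹ U_kᵀ q and ranks documents by cosine similarity. It is plain Scala, not the Spark-based code in this repository. The object name `LsiQuerySketch`, its helpers, and the toy matrices are all hypothetical. The toy factors are the rank-2 SVD of a 3×3 term-document matrix in which terms 0 and 1 always co-occur.

```scala
// Minimal LSI query-folding sketch (hypothetical names, toy data).
object LsiQuerySketch {
  type Vec = Array[Double]

  def dot(a: Vec, b: Vec): Double =
    a.indices.map(i => a(i) * b(i)).sum

  def cosine(a: Vec, b: Vec): Double = {
    val denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    if (denom == 0.0) 0.0 else dot(a, b) / denom
  }

  /** Fold a term-space query into the latent space: qHat = Sigma_k^{-1} * U_k^T * q.
    * `uK` has one row per term (terms x k); `sigmaK` holds the k singular values. */
  def foldQuery(q: Vec, uK: Array[Vec], sigmaK: Vec): Vec =
    sigmaK.indices.map { j =>
      uK.indices.map(t => uK(t)(j) * q(t)).sum / sigmaK(j)
    }.toArray

  /** Rank documents (rows of `vK`, documents x k) by cosine similarity to the folded query. */
  def rank(qHat: Vec, vK: Array[Vec]): Seq[(Int, Double)] =
    vK.indices.map(d => d -> cosine(qHat, vK(d))).sortBy(-_._2)

  def main(args: Array[String]): Unit = {
    // Rank-2 SVD factors of the toy matrix [[1,1,0],[1,1,0],[0,0,1]].
    val s = 1.0 / math.sqrt(2.0)
    val uK = Array(Array(s, 0.0), Array(s, 0.0), Array(0.0, 1.0))
    val sigmaK = Array(2.0, 1.0)
    val vK = Array(Array(s, 0.0), Array(s, 0.0), Array(0.0, 1.0))

    // Query that mentions only term 0.
    val qHat = foldQuery(Array(1.0, 0.0, 0.0), uK, sigmaK)
    rank(qHat, vK).foreach { case (doc, score) => println(f"doc $doc: $score%.3f") }
  }
}
```

In this toy setup, documents 0 and 1 rank first with a similarity of about 1, while document 2, which shares no latent dimension with the query, scores 0.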
Data
For this project we used two datasets: CMU Movie Summary Corpus and NPL collection.
The NPL collection is already included in this repository, while the CMU Movie Summary Corpus must be downloaded from the link above (extract everything into the directory data/).
Running the program
There are two applications: RankOptimisation and MovieSummariesTest. The first uses the NPL corpus (which also provides a set of queries and their corresponding relevance sets) to evaluate the optimal number of singular values, while the second applies the results obtained on the NPL corpus to the CMU Movie Summary Corpus.
There are two ways to run the project: in IntelliJ IDEA or in a terminal via SBT. Note that, for an unknown reason, performance is better when running the program inside IntelliJ.
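One way to compare different numbers of singular values against a query set with relevance judgements is mean average precision (MAP). The sketch below shows that idea in plain Scala. The README does not state which metric RankOptimisation actually uses, so MAP is an assumption here. The object name `RankEvaluationSketch` and the `search` function passed to `bestRank` are hypothetical.

```scala
// Sketch of picking a rank k by mean average precision (assumed metric, hypothetical names).
object RankEvaluationSketch {

  /** Average precision of one ranked list of document ids against its relevance set. */
  def averagePrecision(ranked: Seq[Int], relevant: Set[Int]): Double =
    if (relevant.isEmpty) 0.0
    else {
      val (_, sum) = ranked.zipWithIndex.foldLeft((0, 0.0)) {
        case ((hits, acc), (docId, i)) =>
          // Precision at position i + 1 is only accumulated at relevant documents.
          if (relevant(docId)) (hits + 1, acc + (hits + 1).toDouble / (i + 1))
          else (hits, acc)
      }
      sum / relevant.size
    }

  /** Mean of the average precisions over all (ranking, relevance set) pairs. */
  def meanAveragePrecision(runs: Seq[(Seq[Int], Set[Int])]): Double =
    if (runs.isEmpty) 0.0
    else runs.map { case (ranked, rel) => averagePrecision(ranked, rel) }.sum / runs.size

  /** Return the candidate k with the highest MAP.
    * `search(k, queryId)` is a hypothetical function returning ranked document ids. */
  def bestRank(ks: Seq[Int],
               relevance: Map[Int, Set[Int]],
               search: (Int, Int) => Seq[Int]): Int =
    ks.maxBy { k =>
      meanAveragePrecision(relevance.toSeq.map { case (qid, rel) => (search(k, qid), rel) })
    }
}
```

For example, `averagePrecision(Seq(1, 2, 3, 4), Set(1, 3))` averages the precision at the two relevant positions, (1/1 + 2/3) / 2.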
Running on IntelliJ IDEA
To run either application, right-click it and select Run. IntelliJ will create a run configuration, which then needs to be modified: stop the program and add `-Xms10g -Xmx10g` to the VM options of that configuration.

Similarly, you can run a Scala REPL: right-click any class and select Scala REPL. As before, stop the program and add the same VM options to the Scala REPL configuration.

Running on SBT (not recommended)
To run with SBT, first install it. Then open a terminal inside the project directory and run `sbt` to open the sbt console. From there, run `compile`, then either `console` to start a Scala REPL or `run`, in which case the terminal will prompt you to select which application to run.
The memory configurations are already in the file .sbtopts inside the repository.
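To confirm that the memory settings were actually picked up, the maximum heap size can be inspected from inside the JVM. The snippet below is a small sketch using the standard `java.lang.Runtime` API. The object name `HeapCheck` is hypothetical and not part of this repository.

```scala
// Sketch: report the maximum heap the current JVM may use (hypothetical helper).
object HeapCheck {
  /** Maximum heap size in GiB, as reported by the JVM. */
  def maxHeapGiB: Double =
    Runtime.getRuntime.maxMemory.toDouble / (1024L * 1024L * 1024L)

  def main(args: Array[String]): Unit =
    println(f"Max JVM heap: $maxHeapGiB%.1f GiB")
}
```

Evaluating `HeapCheck.maxHeapGiB` in the Scala REPL should report a value close to the configured maximum heap. A much smaller number suggests the options were not applied.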
System Requirements
- At least 16 GB of RAM
- JDK 1.8 or newer
- Scala plugin for IntelliJ IDEA
Owner
- Name: Eros Fabrici
- Login: eferos93
- Kind: user
- Location: Athens
- Company: Athena Research Center
- Website: www.linkedin.com/in/eros-fabrici
- Repositories: 65
- Profile: https://github.com/eferos93
Marie Curie PhD Student @ Athena Research Center and @ Universitat Politècnica de Catalunya in Data Engineering for Data Science
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Fabrici
    given-names: Eros
title: "Information Retrieval System using Latent Semantic Indexing with Apache Spark"
version: 0.1
date-released: 2021-10-28