
Repository

Basic Info
  • Host: GitHub
  • Owner: MaheyDS
  • License: apache-2.0
  • Language: Jupyter Notebook
  • Default Branch: main
  • Size: 86.9 KB
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created 8 months ago · Last pushed 8 months ago
Metadata Files
Readme License Citation

README.md

Data Exploration Project

This project contains a Jupyter notebook for exploring Amazon Electronics product data and customer reviews. The notebook filters, analyzes, and visualizes product metadata and customer reviews. The data is the Electronics category of the Amazon Reviews dataset (see Citation below), available for download from the dataset's overview page.

Project Structure

    data-exploration/
    ├── data/
    │   ├── product_metadata/
    │   │   ├── meta_Electronics.jsonl (4.9GB) - Original product metadata
    │   │   ├── meta_Electronics_2022_2023.jsonl - Filtered data from 2022-2023
    │   │   ├── meta_Electronics_2022_2023_with_main_category.jsonl - Products with main category
    │   │   ├── meta_Electronics_2022_2023_with_main_category_ratings_100.jsonl - Products with 100+ ratings
    │   │   └── meta_Electronics_2022_2023_with_main_category_ratings_100_sample_1000.jsonl - 1000 sample products
    │   └── customer_reviews/
    │       ├── Electronics.jsonl (21GB) - Original customer reviews
    │       ├── Electronics_2022_2023_with_main_category_ratings_100.jsonl - Reviews for products with 100+ ratings
    │       └── Electronics_2022_2023_with_main_category_ratings_100_sample_1000.jsonl - Reviews for 1000 sample products
    ├── notebooks/
    │   └── dataexploration.ipynb - Main exploration notebook
    ├── pyproject.toml - Project configuration
    └── README.md - This file

Prerequisites

  • Python 3.12 or higher
  • Jupyter Notebook or JupyterLab
  • Required Python packages (see Installation section)

Installation

  1. Clone or download this repository

         git clone <repository-url>
         cd data-exploration

  2. Install Python dependencies

         pip install jupyter pandas matplotlib

     Or, if you prefer conda:

         conda install jupyter pandas matplotlib

  3. Verify the data files are present
    • Ensure the data/ directory contains the required JSONL files
    • The original data files (meta_Electronics.jsonl and Electronics.jsonl) should be present for the notebook to work properly
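
The check in step 3 can be scripted. The following is a minimal sketch, assuming the two original files sit at the paths shown in the Project Structure section; the function name `missing_files` is illustrative and not part of the project.

```python
from pathlib import Path

# Paths taken from the Project Structure section; adjust if your layout differs.
REQUIRED_FILES = [
    "data/product_metadata/meta_Electronics.jsonl",
    "data/customer_reviews/Electronics.jsonl",
]


def missing_files(project_root, required=REQUIRED_FILES):
    """Return the required relative paths that do not exist under project_root."""
    root = Path(project_root)
    return [rel for rel in required if not (root / rel).is_file()]


if __name__ == "__main__":
    # Run from the repository root (the data-exploration/ directory).
    missing = missing_files(".")
    if missing:
        print("Missing data files:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All required data files found.")
```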

Running the Notebook

  1. Start Jupyter Notebook

         jupyter notebook

  2. Navigate to the notebook

    • Open your web browser and go to the Jupyter interface
    • Navigate to the notebooks/ directory
    • Click on dataexploration.ipynb
  3. Run the notebook cells

    • The notebook is designed to be run sequentially from top to bottom
    • Each cell performs a specific data processing step
    • Make sure to run all cells in order to ensure proper data flow

Notebook Overview

The notebook performs the following data exploration steps:

1. Data Filtering (Cells 1-6)

  • Keeps only products first available in 2022 or later
  • Separates products with and without main category definitions
  • Creates filtered datasets for further analysis
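
The filtering steps above could be sketched as a line-by-line stream, so the 4.9GB metadata file never has to fit in memory. This is not the notebook's actual code. It assumes each record stores its availability date at `details["Date First Available"]` as text such as "March 5, 2022"; that field name and date format, and the helper names `first_available_year`, `filter_jsonl`, and `since_2022`, are assumptions to check against your data.

```python
import json
from datetime import datetime


def first_available_year(product):
    """Return the year a product was first available, or None if unknown.

    ASSUMPTION: the date is stored at product["details"]["Date First Available"]
    in a form like "March 5, 2022". Verify this against the real records.
    """
    details = product.get("details")
    if not isinstance(details, dict):
        return None
    raw = details.get("Date First Available")
    if not isinstance(raw, str):
        return None
    try:
        return datetime.strptime(raw.strip(), "%B %d, %Y").year
    except ValueError:
        return None


def filter_jsonl(src_path, dst_path, keep):
    """Stream src_path one line at a time and write records where keep(record) is true.

    Returns the number of records written.
    """
    kept = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            if keep(record):
                dst.write(json.dumps(record) + "\n")
                kept += 1
    return kept


def since_2022(product):
    """Predicate: product was first available in 2022 or later."""
    year = first_available_year(product)
    return year is not None and year >= 2022
```

The same `filter_jsonl` helper can cover the main-category split by passing a different predicate, for example `lambda p: bool(p.get("main_category"))`, again assuming that field name.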

2. Data Analysis (Cells 7-18)

  • Explores the distribution of product categories
  • Analyzes rating distributions
  • Creates samples of products with 100+ ratings
  • Generates visualizations using matplotlib
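
A rough sketch of the counting and sampling steps, using only the standard library so it runs without pandas. The field names `main_category` and `rating_number` and both function names are assumptions for illustration; the notebook itself may use pandas for the same work.

```python
import random
from collections import Counter


def category_counts(products, field="main_category"):
    """Count products per category; missing or empty values are grouped as "UNKNOWN"."""
    return Counter(p.get(field) or "UNKNOWN" for p in products)


def sample_popular(products, min_ratings=100, k=1000, seed=42, field="rating_number"):
    """Return up to k randomly chosen products with at least min_ratings ratings.

    A fixed seed keeps the sample reproducible between runs.
    """
    popular = [p for p in products if (p.get(field) or 0) >= min_ratings]
    rng = random.Random(seed)
    return rng.sample(popular, min(k, len(popular)))
```

The resulting `Counter` can be plotted with matplotlib, for example `plt.bar(list(counts.keys()), list(counts.values()))`.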

3. Review Data Processing (Cells 19-21)

  • Extracts customer reviews for the filtered products
  • Creates review datasets corresponding to the product samples
  • Processes large JSONL files efficiently
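
One way to extract matching reviews without loading the 21GB file is to hold only the product IDs in a set and stream the reviews past it. This is a sketch under the assumption that products and reviews share an ID field named `parent_asin`; that name and both helper names are illustrative, so confirm the join key in your data.

```python
import json


def load_ids(products_path, id_field="parent_asin"):
    """Collect the product IDs from a (small) filtered product JSONL file."""
    ids = set()
    with open(products_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            value = json.loads(line).get(id_field)
            if value is not None:
                ids.add(value)
    return ids


def extract_reviews(reviews_path, out_path, product_ids, id_field="parent_asin"):
    """Stream the reviews file and copy lines whose product ID is in product_ids.

    Only one review line is held in memory at a time. Returns the number written.
    """
    written = 0
    with open(reviews_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue
            if json.loads(line).get(id_field) in product_ids:
                dst.write(line if line.endswith("\n") else line + "\n")
                written += 1
    return written
```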

Data Files Description

Product Metadata Files

  • Original: meta_Electronics.jsonl (4.9GB) - Complete Amazon Electronics product metadata
  • Filtered: Various filtered versions based on date, category, and rating criteria

Customer Review Files

  • Original: Electronics.jsonl (21GB) - Complete customer reviews for Electronics products
  • Filtered: Review datasets corresponding to the filtered product sets

Output Files

The notebook generates several processed data files:

  • Filtered product metadata with specific criteria
  • Sample datasets for analysis
  • Corresponding customer review datasets
  • Visualizations of data distributions

Notes

  • Large File Processing: The notebook handles large JSONL files (up to 21GB) efficiently
  • Memory Usage: Ensure sufficient RAM for processing large datasets
  • Processing Time: Some operations may take several minutes due to file sizes
  • Data Dependencies: The notebook expects specific file paths and structures

Troubleshooting

  1. File Not Found Errors: Ensure all data files are in the correct directories
  2. Memory Issues: Consider processing smaller samples if you encounter memory constraints
  3. Import Errors: Verify all required packages are installed
  4. Path Issues: Make sure you're running the notebook from the correct directory
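
For the memory issue in item 2, one simple option is to cut a smaller working file from the start of a large JSONL file. The sketch below is an illustration, not part of the notebook; `head_jsonl` is a hypothetical helper name. It takes the first lines rather than a random sample, so the result may not be representative of the full dataset.

```python
from itertools import islice


def head_jsonl(src_path, dst_path, n_lines=10_000):
    """Copy the first n_lines lines of src_path to dst_path.

    Reads one line at a time, so memory use stays small regardless of file size.
    Returns the number of lines copied.
    """
    copied = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in islice(src, n_lines):
            dst.write(line)
            copied += 1
    return copied
```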

Requirements

  • Python 3.12+
  • pandas
  • matplotlib
  • jupyter
  • json (built-in)
  • Standard library modules

License

This project is licensed under the Apache 2.0 license; see the LICENSE file for the full terms.

Owner

  • Login: MaheyDS
  • Kind: user

Citation (CITATION.cff)

@article{hou2024bridging,
  title={Bridging Language and Items for Retrieval and Recommendation},
  author={Hou, Yupeng and Li, Jiacheng and He, Zhankui and Yan, An and Chen, Xiusi and McAuley, Julian},
  journal={arXiv preprint arXiv:2403.03952},
  year={2024}
}

Dependencies

pyproject.toml pypi