data-exploration
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (12.8%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: MaheyDS
- License: apache-2.0
- Language: Jupyter Notebook
- Default Branch: main
- Size: 86.9 KB
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Data Exploration Project
This project contains a Jupyter notebook for exploring Amazon Electronics product data and customer reviews. The notebook filters, analyzes, and visualizes product metadata and customer reviews. The source data is the Amazon Electronics category dataset, which can be found and downloaded from its dataset overview page.
Project Structure
data-exploration/
├── data/
│   ├── product_metadata/
│   │   ├── meta_Electronics.jsonl (4.9GB) - Original product metadata
│   │   ├── meta_Electronics_2022_2023.jsonl - Filtered data from 2022-2023
│   │   ├── meta_Electronics_2022_2023_with_main_category.jsonl - Products with main category
│   │   ├── meta_Electronics_2022_2023_with_main_category_ratings_100.jsonl - Products with 100+ ratings
│   │   └── meta_Electronics_2022_2023_with_main_category_ratings_100_sample_1000.jsonl - 1000 sample products
│   └── customer_reviews/
│       ├── Electronics.jsonl (21GB) - Original customer reviews
│       ├── Electronics_2022_2023_with_main_category_ratings_100.jsonl - Reviews for products with 100+ ratings
│       └── Electronics_2022_2023_with_main_category_ratings_100_sample_1000.jsonl - Reviews for 1000 sample products
├── notebooks/
│   └── dataexploration.ipynb - Main exploration notebook
├── pyproject.toml - Project configuration
└── README.md - This file
Prerequisites
- Python 3.12 or higher
- Jupyter Notebook or JupyterLab
- Required Python packages (see Installation section)
Installation
1. Clone or download this repository
```bash
git clone <repository-url>
cd data-exploration
```
2. Install Python dependencies
```bash
pip install jupyter pandas matplotlib
```
Or if you prefer using conda:
```bash
conda install jupyter pandas matplotlib
```
3. Verify the data files are present
- Ensure the data/ directory contains the required JSONL files
- The original data files (meta_Electronics.jsonl and Electronics.jsonl) should be present for the notebook to work properly
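The data-file check can be scripted before the notebook is opened. The following is a minimal sketch, not part of the project: the helper name `missing_data_files` is hypothetical, and the relative paths are taken from the Project Structure listing above.

```python
from pathlib import Path

# Paths relative to the data/ directory, as listed under Project Structure.
REQUIRED_FILES = [
    "product_metadata/meta_Electronics.jsonl",
    "customer_reviews/Electronics.jsonl",
]


def missing_data_files(data_dir):
    """Return the required files that do not exist under data_dir."""
    root = Path(data_dir)
    return [rel for rel in REQUIRED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_data_files("data")
    if missing:
        print("Missing data files:")
        for rel in missing:
            print(f"  data/{rel}")
    else:
        print("All required data files are present.")
```

Run it from the repository root so that the relative `data` path resolves correctly.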
Running the Notebook
1. Start Jupyter Notebook
```bash
jupyter notebook
```
2. Navigate to the notebook
- Open your web browser and go to the Jupyter interface
- Navigate to the notebooks/ directory
- Click on dataexploration.ipynb
3. Run the notebook cells
- The notebook is designed to be run sequentially from top to bottom
- Each cell performs a specific data processing step
- Make sure to run all cells in order to ensure proper data flow
Notebook Overview
The notebook performs the following data exploration steps:
1. Data Filtering (Cells 1-6)
- Filters products to only include those first available since 2022
- Separates products with and without main category definitions
- Creates filtered datasets for further analysis
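The filtering step could look roughly like the sketch below. It is an illustration under assumptions, not the notebook's actual code: it assumes each metadata record stores its availability date as `details["Date First Available"]` in a format like `March 5, 2022`, and that the category lives in a `main_category` field. The function names are hypothetical.

```python
import json
from datetime import datetime


def first_available_year(product):
    """Return the first-available year of a product record, or None.

    Assumes the date is stored under details["Date First Available"]
    in a "Month D, YYYY" format (an assumption about the dataset schema).
    """
    details = product.get("details") or {}
    raw = details.get("Date First Available")
    if not raw:
        return None
    try:
        return datetime.strptime(raw, "%B %d, %Y").year
    except ValueError:
        return None  # unparseable dates are treated as unknown


def filter_since(lines, min_year=2022):
    """Yield parsed records first available in min_year or later."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        product = json.loads(line)
        year = first_available_year(product)
        if year is not None and year >= min_year:
            yield product


def split_by_main_category(products):
    """Split records into (with main category, without main category)."""
    with_category, without_category = [], []
    for product in products:
        target = with_category if product.get("main_category") else without_category
        target.append(product)
    return with_category, without_category
```

Because `filter_since` accepts any iterable of lines, an open file handle can be passed directly, so the 4.9GB metadata file does not need to be loaded into memory at once.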
2. Data Analysis (Cells 7-18)
- Explores the distribution of product categories
- Analyzes rating distributions
- Creates samples of products with 100+ ratings
- Generates visualizations using matplotlib
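As a rough illustration of the analysis step, the sketch below counts products per category, keeps products with at least 100 ratings, and draws a reproducible sample. The field names `main_category` and `rating_number` are assumptions about the dataset schema, the helper names are hypothetical, and the notebook itself may use pandas for the same work.

```python
import random
from collections import Counter


def category_counts(products):
    """Count products per main_category value."""
    return Counter(p.get("main_category") for p in products)


def with_min_ratings(products, min_ratings=100):
    """Keep products whose rating_number is at least min_ratings."""
    return [p for p in products if (p.get("rating_number") or 0) >= min_ratings]


def sample_products(products, k=1000, seed=0):
    """Return up to k products, sampled without replacement and reproducibly."""
    if len(products) <= k:
        return list(products)
    return random.Random(seed).sample(products, k)


def plot_category_counts(counts, out_path):
    """Save a bar chart of category counts (requires matplotlib)."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    items = counts.most_common()
    labels = [str(name) for name, _ in items]
    values = [count for _, count in items]
    fig, ax = plt.subplots()
    ax.bar(labels, values)
    ax.set_ylabel("Number of products")
    ax.tick_params(axis="x", rotation=90)
    fig.tight_layout()
    fig.savefig(out_path)
    plt.close(fig)
```

Fixing the seed makes the 1000-product sample repeatable across notebook runs, which keeps the derived review files consistent with the sampled metadata.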
3. Review Data Processing (Cells 19-21)
- Extracts customer reviews for the filtered products
- Creates review datasets corresponding to the product samples
- Processes large JSONL files efficiently
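One way to extract reviews for a filtered product set without loading the 21GB review file into memory is to stream it line by line. The sketch below assumes that products and reviews share a `parent_asin` key, which is an assumption about the dataset schema. The function names are hypothetical and the notebook may take a different approach.

```python
import json


def load_product_ids(meta_path):
    """Collect the parent_asin of every product in a metadata JSONL file."""
    ids = set()
    with open(meta_path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                ids.add(json.loads(line)["parent_asin"])
    return ids


def extract_reviews(reviews_path, out_path, product_ids):
    """Copy reviews whose parent_asin is in product_ids; return how many were kept."""
    kept = 0
    with open(reviews_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:  # one review at a time, so memory use stays flat
            if not line.strip():
                continue
            review = json.loads(line)
            if review.get("parent_asin") in product_ids:
                dst.write(line if line.endswith("\n") else line + "\n")
                kept += 1
    return kept
```

Only the set of product IDs is held in memory, which stays small for the 1000-product sample regardless of the size of the review file.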
Data Files Description
Product Metadata Files
- Original: meta_Electronics.jsonl (4.9GB) - Complete Amazon Electronics product metadata
- Filtered: Various filtered versions based on date, category, and rating criteria
Customer Review Files
- Original: Electronics.jsonl (21GB) - Complete customer reviews for Electronics products
- Filtered: Review datasets corresponding to the filtered product sets
Output Files
The notebook generates several processed data files:
- Filtered product metadata with specific criteria
- Sample datasets for analysis
- Corresponding customer review datasets
- Visualizations of data distributions
Notes
- Large File Processing: The notebook handles large JSONL files (up to 21GB) efficiently
- Memory Usage: Ensure sufficient RAM for processing large datasets
- Processing Time: Some operations may take several minutes due to file sizes
- Data Dependencies: The notebook expects specific file paths and structures
Troubleshooting
- File Not Found Errors: Ensure all data files are in the correct directories
- Memory Issues: Consider processing smaller samples if you encounter memory constraints
- Import Errors: Verify all required packages are installed
- Path Issues: Make sure you're running the notebook from the correct directory
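For the memory tip above, a quick way to get a smaller working file is to copy only the first lines of a large JSONL file. This is a minimal sketch with a hypothetical helper name. Note that it takes the head of the file, not a random sample, so it is suited to smoke-testing the notebook rather than to analysis.

```python
from itertools import islice


def write_head_sample(src_path, dst_path, n_lines=10_000):
    """Copy the first n_lines lines of src_path to dst_path; return the count."""
    written = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in islice(src, n_lines):  # stops reading after n_lines
            dst.write(line)
            written += 1
    return written
```

Pointing the notebook at the smaller file requires changing the file paths it expects.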
Requirements
- Python 3.12+
- pandas
- matplotlib
- jupyter
- json (built-in)
- Standard library modules
License
This project is licensed under the terms specified in the LICENSE file.
Owner
- Login: MaheyDS
- Kind: user
- Repositories: 1
- Profile: https://github.com/MaheyDS
Citation (CITATION.cff)
@article{hou2024bridging,
title={Bridging Language and Items for Retrieval and Recommendation},
author={Hou, Yupeng and Li, Jiacheng and He, Zhankui and Yan, An and Chen, Xiusi and McAuley, Julian},
journal={arXiv preprint arXiv:2403.03952},
year={2024}
}