https://github.com/claudio-araya/data_spotify

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (7.2%) to scientific vocabulary

Last synced: 9 months ago

Repository

Basic Info

Host: GitHub
Owner: claudio-araya
Language: Jupyter Notebook
Default Branch: main
Size: 20.2 MB

Statistics

Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Releases: 0

Created 11 months ago · Last pushed 11 months ago

Metadata Files

Readme

README.md

This project implements a basic data lakehouse architecture to analyze and transform a Spotify music dataset using Apache Spark and Delta Lake.

The data is organized into three layers: Bronze, Silver, and Gold.

📊 Dataset Overview

This dataset contains information about thousands of Spotify tracks, including audio features, popularity metrics, and metadata related to artists, albums, and genres.

📁 Source:
Spotify Tracks - Attributes and Popularity (Kaggle)

📄 Column Descriptions (from `dataset.csv`)

| Column Name | Description |
|---------------------|-------------|
| track_id | Spotify's unique identifier for the track |
| artists | Name of the performing artist(s) |
| album_name | Title of the album the track belongs to |
| track_name | Title of the track |
| popularity | Popularity score on Spotify (0–100 scale) |
| duration_ms | Duration of the track in milliseconds |
| explicit | Indicates whether the track contains explicit content |
| danceability | How suitable the track is for dancing (0.0 to 1.0) |
| energy | Intensity and activity level of the track (0.0 to 1.0) |
| key | Musical key (0 = C, 1 = C♯/D♭, …, 11 = B) |
| loudness | Overall loudness of the track in decibels (dB) |
| mode | Modality (major = 1, minor = 0) |
| speechiness | Presence of spoken words in the track (0.0 to 1.0) |
| acousticness | Confidence measure of whether the track is acoustic (0.0 to 1.0) |
| instrumentalness | Predicts whether the track contains no vocals (0.0 to 1.0) |
| liveness | Presence of an audience in the recording (0.0 to 1.0) |
| valence | Musical positivity conveyed (0.0 = sad, 1.0 = happy) |
| tempo | Estimated tempo in beats per minute (BPM) |
| time_signature | Time signature of the track (e.g., 4 = 4/4) |
| track_genre | Assigned genre label for the track |

Bronze Layer (Preprocessed CSVs)

The original dataset (dataset.csv) was not ingested in raw form. Instead, the Kimball.ipynb notebook processes it following a dimensional modeling approach, splitting the raw dataset into dimension and fact tables saved as .csv files:

- dim_artists.csv: Unique artists with a generated artist_id
- dim_albums.csv: Unique albums with associated artist_id and a generated album_id
- dim_genres.csv: Unique track genres with a generated genre_id
- dim_tracks.csv: Tracks with track_id, track_name, and references to album_id and genre_id
- fact_tracks.csv: Numerical and categorical audio features per track_id

These files were saved into dim/ and fact/ folders and then loaded as Delta tables in the Silver layer.
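
The split described above can be sketched in plain Python. This is only an illustration of the idea, not the code in Kimball.ipynb: the function name `split_into_star_schema`, the sequential surrogate keys, the choice to treat the `artists` string as a single artist entry, and the sample rows are all assumptions made for the example.

```python
def split_into_star_schema(rows, feature_cols):
    """Split flat track rows (dicts keyed by CSV column) into star-schema tables."""
    artist_ids, album_ids, genre_ids = {}, {}, {}
    dim_tracks, fact_tracks = [], []

    for row in rows:
        # setdefault returns the key already assigned to a value seen before,
        # otherwise it stores the next sequential surrogate key (assumption).
        artist_id = artist_ids.setdefault(row["artists"], len(artist_ids) + 1)
        album_key = (row["album_name"], artist_id)
        album_id = album_ids.setdefault(album_key, len(album_ids) + 1)
        genre_id = genre_ids.setdefault(row["track_genre"], len(genre_ids) + 1)

        dim_tracks.append({
            "track_id": row["track_id"],
            "track_name": row["track_name"],
            "album_id": album_id,
            "genre_id": genre_id,
        })
        # The fact table keeps only the track_id plus the chosen feature columns.
        fact_tracks.append(
            {"track_id": row["track_id"], **{c: row[c] for c in feature_cols}}
        )

    return {
        "dim_artists": [
            {"artist_id": i, "artists": name} for name, i in artist_ids.items()
        ],
        "dim_albums": [
            {"album_id": i, "album_name": name, "artist_id": a}
            for (name, a), i in album_ids.items()
        ],
        "dim_genres": [
            {"genre_id": i, "track_genre": name} for name, i in genre_ids.items()
        ],
        "dim_tracks": dim_tracks,
        "fact_tracks": fact_tracks,
    }


# Made-up sample rows, using only a few of the dataset's columns.
sample = [
    {"track_id": "t1", "artists": "Artist A", "album_name": "Album X",
     "track_name": "Song 1", "track_genre": "acoustic",
     "popularity": 50, "danceability": 0.5},
    {"track_id": "t2", "artists": "Artist A", "album_name": "Album X",
     "track_name": "Song 2", "track_genre": "rock",
     "popularity": 60, "danceability": 0.7},
    {"track_id": "t3", "artists": "Artist B", "album_name": "Album Y",
     "track_name": "Song 3", "track_genre": "rock",
     "popularity": 70, "danceability": 0.9},
]
tables = split_into_star_schema(sample, feature_cols=("popularity", "danceability"))
```

Each entry of `tables` is a list of dicts that could be written to the matching CSV file with `csv.DictWriter`. The sketch does not deduplicate tracks, so a track_id that appears on several input rows would produce several rows in dim_tracks and fact_tracks.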

Silver Layer

The CSV files created in the Bronze layer were read into Spark and stored in Delta format. These tables serve as the clean, structured foundation for analytics:

- dim_artists
- dim_albums
- dim_genres
- dim_tracks
- fact_tracks
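
A minimal sketch of the CSV-to-Delta step, assuming a SparkSession that is already configured for Delta Lake and is passed in by the caller. The root folders `bronze` and `silver`, the helper names `silver_plan` and `load_to_silver`, and the read options (`header=True`, `inferSchema=True`) are assumptions for illustration; only the dim/ and fact/ folders and the table names come from the description above.

```python
from pathlib import PurePosixPath

# Table name -> Bronze subfolder, following the dim/ and fact/ layout.
TABLES = {
    "dim_artists": "dim",
    "dim_albums": "dim",
    "dim_genres": "dim",
    "dim_tracks": "dim",
    "fact_tracks": "fact",
}


def silver_plan(bronze_root, silver_root):
    """Return (csv_path, delta_path) pairs, one per table."""
    return [
        (
            str(PurePosixPath(bronze_root) / folder / f"{name}.csv"),
            str(PurePosixPath(silver_root) / name),
        )
        for name, folder in TABLES.items()
    ]


def load_to_silver(spark, bronze_root, silver_root):
    """Read each Bronze CSV with Spark and rewrite it as a Delta table.

    `spark` is assumed to be a SparkSession with Delta Lake enabled.
    """
    for csv_path, delta_path in silver_plan(bronze_root, silver_root):
        df = spark.read.csv(csv_path, header=True, inferSchema=True)
        df.write.format("delta").mode("overwrite").save(delta_path)


# Hypothetical root folders; adjust to the actual project layout.
plan = silver_plan("bronze", "silver")
```

Keeping the path mapping in `silver_plan` separate from the Spark calls makes the table list easy to inspect without starting a Spark session.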

Gold Layer

From the Silver Delta tables, additional aggregated tables were created for analysis, such as:

- all_info: All tracks with artist and album context, excluding IDs
- avg_metrics_by_artist: Average track metrics per artist
- avg_metrics_by_genre: Average track metrics per genre
- avg_metrics_by_album_artist: Average metrics per album and artist
- song_count_by_artist: Total number of songs per artist

These tables are written in Delta format and can be used for reporting, exploration, or ML pipelines.
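
To illustrate the logic behind a table such as avg_metrics_by_genre, here is a small sketch in plain Python rather than Spark. It joins fact_tracks to dim_genres through dim_tracks and averages the selected metrics per genre. The function signature and the sample values are made up for the example, and the project itself builds these tables with Spark from the Silver Delta tables.

```python
from collections import defaultdict
from statistics import mean


def avg_metrics_by_genre(fact_tracks, dim_tracks, dim_genres, metrics):
    """Average the given metric columns per genre label."""
    # Lookups that play the role of the joins: genre_id -> label, track_id -> label.
    genre_name = {g["genre_id"]: g["track_genre"] for g in dim_genres}
    track_genre = {t["track_id"]: genre_name[t["genre_id"]] for t in dim_tracks}

    # Group the fact rows by the genre label of their track.
    grouped = defaultdict(list)
    for row in fact_tracks:
        grouped[track_genre[row["track_id"]]].append(row)

    return {
        genre: {m: mean(r[m] for r in rows) for m in metrics}
        for genre, rows in grouped.items()
    }


# Made-up sample tables.
dim_genres = [
    {"genre_id": 1, "track_genre": "acoustic"},
    {"genre_id": 2, "track_genre": "rock"},
]
dim_tracks = [
    {"track_id": "t1", "track_name": "Song 1", "album_id": 1, "genre_id": 1},
    {"track_id": "t2", "track_name": "Song 2", "album_id": 1, "genre_id": 1},
    {"track_id": "t3", "track_name": "Song 3", "album_id": 2, "genre_id": 2},
]
fact_tracks = [
    {"track_id": "t1", "danceability": 0.5, "energy": 0.25},
    {"track_id": "t2", "danceability": 1.0, "energy": 0.75},
    {"track_id": "t3", "danceability": 0.25, "energy": 0.5},
]

gold = avg_metrics_by_genre(
    fact_tracks, dim_tracks, dim_genres, metrics=("danceability", "energy")
)
```

The same pattern (look up the grouping key through the dimension tables, then aggregate the fact rows) applies to the per-artist and per-album tables, with the lookup going through dim_albums and dim_artists instead.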

Tools & Technologies

- Apache Spark
- Delta Lake
- PySpark
- Lakehouse architecture (Bronze → Silver → Gold)
- Python 3.12

Owner

Login: claudio-araya
Kind: user

Repositories: 2
Profile: https://github.com/claudio-araya

GitHub Events

Total

Push event: 9
Create event: 1

Last Year

Push event: 9
Create event: 1
