Minjae Lee
Last.fm ETL: Project Overview
Data pipeline to extract music listening data using the Last.fm Music Discovery API. The project provides insight into personal music listening activity and trends.
On a technical level, the project is a showcase of Python, SQL, Pandas, ETL, dimensional modeling, and Airflow usage.
Detailed documentation is available on my Github Project Page and the Project Dashboard is available below.
ETL Architecture
- Extract music listening data from Last.fm API
- Model data and generate staging and final database using star schema
- Transform data using Python, Pandas, and SQLAlchemy to the staging layer
- Load data from staging layer to final SQLite database
- Visualize data using Google Data Studio
- Orchestrate tasks with Airflow
Data Model
Airflow Orchestration
The airflow DAG supports both one time historic load and incremental batch loads on the music listening data. The DAG is scheduled to run daily.
- `check_initialized`: Selects between the incremental batch load and the initial one time historic load branches.
- `initial_extract`: Extracts all historic data and skips tasks if no new data is available.
- `update_extract`: Extracts daily batch data and skips tasks if no new data is available.
- `initialize_database`: Initializes staging area and final database schema according to the data model above.
- `empty_stage`: Empties staging area for new data transformations.
- `transform`: Transforms and formats data to fit the data model above.
- `load`: Loads/appends transformed data from staging layer to final database.
- `export`: Exports database tables as csv for data visualization.
My Projects
Music Listening History ETL Pipeline
The project is an ETL pipeline which processes music listening data from Last.fm API to provide insight on listening activity. The data is transformed and orchestrated into Kimball's star schema model using Python, SQL, and Airflow. The data is then loaded into Google Data Studio for visualization.