Minjae Lee

Last.fm ETL: Project Overview

Data pipeline to extract music listening data using the Last.fm Music Discovery API. The project provides insight into personal music listening activity and trends.

On a technical level, the project is a showcase of Python, SQL, Pandas, ETL, dimensional modeling, and Airflow usage.

Detailed documentation is available on my Github Project Page and the Project Dashboard is available below.

ETL Architecture

  1. Extract music listening data from Last.fm API
  2. Model data and generate staging and final database using star schema
  3. Transform data using Python, Pandas, and SQLAlchemy to the staging layer
  4. Load data from staging layer to final SQLite database
  5. Visualize data using Google Data Studio
  6. Orchestrate tasks with Airflow

Data Model

Airflow Orchestration

The airflow DAG supports both one time historic load and incremental batch loads on the music listening data. The DAG is scheduled to run daily.

  • `check_initialized`: Selects between the incremental batch load and the initial one time historic load branches.
  • `initial_extract`: Extracts all historic data and skips tasks if no new data is available.
  • `update_extract`: Extracts daily batch data and skips tasks if no new data is available.
  • `initialize_database`: Initializes staging area and final database schema according to the data model above.
  • `empty_stage`: Empties staging area for new data transformations.
  • `transform`: Transforms and formats data to fit the data model above.
  • `load`: Loads/appends transformed data from staging layer to final database.
  • `export`: Exports database tables as csv for data visualization.

My Projects

Wordle Pipeline

The project is a pipeline which processes New York Times Wordle data from Twitter to provide insight on Wordle statistics. The data is transformed and orchestrated using Docker, Spark, Airflow, and dbt. The project is hosted on GCP and is loaded into Looker Studio for visualization.

Music Listening History ETL Pipeline

The project is an ETL pipeline which processes music listening data from Last.fm API to provide insight on listening activity. The data is transformed and orchestrated into Kimball's star schema model using Python, SQL, and Airflow. The data is then loaded into Google Data Studio for visualization.

Portfolio Website

The project websites hosts other projects I worked on and will work on in the future. The website is hosted through Cloudflare and built using HTML, CSS, and JavaScript.

Contact Me