Presentation Video

Project Summary

This project focuses on creating a seamless music transition model that predicts the best next segment of a song based on learned embeddings. The goal is to generate transitions between music clips without abrupt changes, as if the music had never changed at all.

The approach consists of three main components:

  1. Feature Extraction: Using BEATs Iteration 3 (a Transformer-based model for audio understanding) to extract time-step embeddings from songs.
  2. LSTM Training: A Bidirectional LSTM trained with contrastive triplet loss, ensuring that the model learns to smoothly transition between music segments.
  3. Transition Matching: Computing cosine similarity between embeddings to predict the best transition point between songs (a short sketch follows this list).
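
For reference, a minimal sketch of the cosine-similarity comparison used in step 3, assuming the two clip embeddings are available as NumPy vectors:

```python
import numpy as np

def cosine_similarity(u, v):
    # Ranges from -1 (opposite direction) to 1 (same direction); a higher score
    # suggests the candidate clip continues the previous clip more smoothly.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
```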

The dataset consists of full-length songs (from the FMA dataset), which are preprocessed by segmenting them into 20-30 second clips with overlapping context to preserve musical continuity. The final goal is to deploy this model for automatic DJing, playlist blending, and AI-generated song transitions.
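
As a rough illustration, the segmentation step could look like the sketch below. The clip length, hop length, and the soundfile I/O library are assumptions; the report only specifies 20-30 second clips with overlapping context.

```python
import numpy as np
import soundfile as sf  # assumed audio I/O library

def segment_track(path, clip_seconds=25, hop_seconds=20):
    """Split one full-length track into overlapping fixed-length clips."""
    audio, sr = sf.read(path)
    if audio.ndim > 1:
        audio = audio.mean(axis=1)                 # mix stereo down to mono
    clip_len = int(clip_seconds * sr)
    hop_len = int(hop_seconds * sr)
    clips = [audio[start:start + clip_len]
             for start in range(0, len(audio) - clip_len + 1, hop_len)]
    return clips, sr
```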

Approach

The goal is to create a seamless transition between two songs, as if they were part of the same piece. To create such a transition, we must identify a song whose beginning complements the ending of the previous song, making them sound unified.

1. Feature Extraction Using BEATs Iteration 3

Music data is highly complex and has a high-dimensional feature space; it includes features such as timbre, tone, tempo, mood, key, scale, and chord progression.

To simplify this high-dimensional feature space and extract data that is easier to process, we use embeddings to reduce the dimensionality while preserving the essential characteristics of the music.

We use BEATs-Large (Iteration 3, AS2M model) to extract per-frame embeddings from segmented audio clips. The pre-trained BEATs model performs exceptionally well at capturing musical features such as rhythm, melody, harmony, and timbre, which allows us to accurately capture the key characteristics of music data while lowering the complexity.

We use the open FMA dataset, which contains 106,574 untrimmed tracks across 161 unbalanced genres. Due to the overwhelming data size, we first trained the model on about 1% of the total data and will increase the data size as we proceed further into the development process.

Implementation Steps:
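
A minimal feature-extraction sketch, assuming the BEATs code from the microsoft/unilm repository and a hypothetical checkpoint path (BEATs expects 16 kHz mono input):

```python
import torch
from BEATs import BEATs, BEATsConfig  # from the microsoft/unilm BEATs repository

# Hypothetical path to the Iteration 3 AS2M checkpoint.
checkpoint = torch.load("checkpoints/BEATs_iter3_plus_AS2M.pt")
cfg = BEATsConfig(checkpoint["cfg"])
model = BEATs(cfg)
model.load_state_dict(checkpoint["model"])
model.eval()

# One batch of 16 kHz mono clips, shape (batch, samples); padding_mask marks padded samples.
audio_16khz = torch.randn(1, 16000 * 20)          # placeholder for a real 20-second clip
padding_mask = torch.zeros(1, 16000 * 20).bool()

with torch.no_grad():
    # Returns per-frame (time-step) embeddings for each clip.
    frame_embeddings = model.extract_features(audio_16khz, padding_mask=padding_mask)[0]
```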

2. Training the LSTM for Transition Prediction

To achieve a seamless transition, we need to determine which songs, when paired with the ending of the previous track, will blend in smoothly.

To address this challenge, we decided to train an AI model that predicts how a song will continue based on a given section. We adopted the Long Short-Term Memory (LSTM) model to train on music/audio data, enabling it to forecast the continuation of input audio clips.

LSTM was chosen because music is composed of notes and sequences; to capture the musical flow, we need a model that recognizes patterns and allows past inputs to influence the current output.

Given a segment of a song, we split it into two parts. We feed the first part into the model and calculate the loss by comparing the model’s output with the second part of the clip. This process compares the model’s predicted continuation to the actual continuation, giving us a loss function to optimize the model.

We train a Bidirectional LSTM with contrastive triplet loss to predict the most seamless transition between song clips.
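
A minimal training sketch is shown below. The input dimension, hidden size, margin, and batch shapes are illustrative, not the project's actual hyperparameters; the anchor is the end of a clip, the positive is its true continuation, and the negative is an unrelated segment.

```python
import torch
import torch.nn as nn

class TransitionLSTM(nn.Module):
    """Bidirectional LSTM that maps a sequence of frame embeddings to one clip embedding."""
    def __init__(self, input_dim=768, hidden_dim=256, embed_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden_dim, embed_dim)

    def forward(self, x):
        out, _ = self.lstm(x)              # (batch, time, 2 * hidden_dim)
        return self.proj(out[:, -1, :])    # summarize the sequence with its last time step

model = TransitionLSTM()
criterion = nn.TripletMarginLoss(margin=1.0)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Placeholder tensors; in practice these are BEATs frame embeddings.
anchor = torch.randn(8, 100, 768)     # ending of clip A
positive = torch.randn(8, 100, 768)   # true continuation of clip A
negative = torch.randn(8, 100, 768)   # unrelated segment

optimizer.zero_grad()
loss = criterion(model(anchor), model(positive), model(negative))
loss.backward()
optimizer.step()
```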

Training is performed on SLURM using GPU nodes, with checkpointing enabled for preemption handling.
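
A simple save/resume pattern for preemption handling might look like the following sketch (the checkpoint path and per-epoch cadence are assumptions):

```python
import os
import torch

CKPT_PATH = "checkpoints/lstm_latest.pt"   # hypothetical checkpoint location

def save_checkpoint(model, optimizer, epoch):
    # Saved regularly so a preempted SLURM job can resume where it left off.
    os.makedirs(os.path.dirname(CKPT_PATH), exist_ok=True)
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "epoch": epoch}, CKPT_PATH)

def load_checkpoint(model, optimizer):
    # Returns the epoch to start from: the next epoch if a checkpoint exists, else 0.
    if os.path.exists(CKPT_PATH):
        state = torch.load(CKPT_PATH)
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        return state["epoch"] + 1
    return 0
```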

Project Outline

3. Transition Matching & Deployment

After training, the model takes a clip’s final embedding and produces the predicted continuing segment. It then uses FAISS (Facebook AI Similarity Search) to find the nearest neighbor of the predicted continuing segment in vector space, i.e., the song whose beginning most closely matches the model’s predicted continuation. ONNX and PyTorch JIT are being explored for fast inference deployment.
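
A minimal retrieval sketch with FAISS, assuming the candidate song-opening embeddings and the predicted continuation embedding come from the trained LSTM (shapes and index type are illustrative):

```python
import faiss
import numpy as np

d = 128                                                    # embedding dimension (assumed)
candidates = np.random.rand(10000, d).astype("float32")    # openings of candidate songs
predicted = np.random.rand(1, d).astype("float32")         # LSTM-predicted continuation

# L2-normalize so inner-product search is equivalent to cosine similarity.
faiss.normalize_L2(candidates)
faiss.normalize_L2(predicted)

index = faiss.IndexFlatIP(d)     # exact inner-product (cosine) index
index.add(candidates)

scores, ids = index.search(predicted, 5)   # top-5 closest candidate songs
print(ids[0], scores[0])
```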

4. Evaluation Metrics

We evaluate both perceptual and mathematical similarity metrics.

Evaluation

There has been a change in the evaluation method since the proposal. We originally intended to use different features of the music, such as tempo, mood, and progression, as metrics for quantitative evaluation. However, as mentioned previously, we use the BEATs pre-trained model to embed the audio data, so we evaluate using the embedded data instead of the higher-dimensional music features.

We assess both quantitative model accuracy and qualitative smoothness of music transitions in the following ways:

Quantitative Evaluation

Qualitative Evaluation

All evaluations are logged in Weights & Biases (W&B) for experiment tracking.
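
As an illustrative sketch (the metric name and W&B keys are ours, not the project's), a quantitative check could compute the cosine similarity between the predicted and actual continuation embeddings and log it to W&B:

```python
import numpy as np
import wandb

run = wandb.init(project="music-transition")   # hypothetical project name

def embedding_cosine(pred, target):
    return float(np.dot(pred, target) / (np.linalg.norm(pred) * np.linalg.norm(target)))

# pred_emb / true_emb would come from the model and the held-out continuation.
pred_emb = np.random.rand(128)
true_emb = np.random.rand(128)
run.log({"eval/cosine_similarity": embedding_cosine(pred_emb, true_emb)})
run.finish()
```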

Remaining Goals and Challenges

Our model ran into a performance failure in its current state. After training, we measured its performance and discovered that the training loss, validation loss, and test loss all sit at around 30%. The loss does not converge to 30%; rather, it has stayed around 30% throughout the entire training process, which leads us to suspect that no meaningful training has been done.
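
One way to check whether the triplet loss is actually learning, rather than sitting at the margin, is to log the average anchor-positive and anchor-negative distances during training. This is a diagnostic sketch, not part of the current pipeline:

```python
import torch
import torch.nn.functional as F

def triplet_diagnostics(model, anchor, positive, negative):
    # If d(anchor, positive) is not shrinking relative to d(anchor, negative),
    # the loss stays pinned near the margin and no useful training occurs.
    with torch.no_grad():
        a, p, n = model(anchor), model(positive), model(negative)
        d_ap = F.pairwise_distance(a, p).mean().item()
        d_an = F.pairwise_distance(a, n).mean().item()
    return {"dist_anchor_positive": d_ap, "dist_anchor_negative": d_an}
```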

Possible Causes

Anchor and Positive Overlap Too Much

Negative Sample Is Too Similar to Anchor

Empty or Tiny Arrays in Embeddings

Small Dataset or Reusing an Old Checkpoint

Future Improvements

Other than fixing the problems presented above, there are also some other improvements we can make to our basic model.

Two-Stage Embedding (Separate Start & End)

Predicting Next Music Segment for Seamless Transitions

Optimizing Crossfade Durations to Find the Best Seamless Song Transition

Interactive Interface

Music App Integration

Resources Used

Research Papers & Code

Libraries & Tools

Infrastructure