Music Retrieval Techniques: Assignment 5

Use the template for the assignment. I will send more files and example code from class that will make the assignment easier. The assignment uses Python 3 and Jupyter notebooks as the IDE. There are also class notes and code examples from the course that can be used as reference.


{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## CSC 475 / SENG 480B – Assignment 5\n",
"\n",
"#### [YOUR NAME HERE]\n",
"\n",
"**Note:** Please upload your completed notebook as a zip file with all audio files used to complete your assignment. Only include the audio files that you used to keep the upload size as small as possible. (i.e. do not upload all of GTZAN or any other dataset)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import librosa\n",
"import librosa.display\n",
"import IPython.display as ipd\n",
"from scipy.interpolate import interp1d"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Question 1\n",
"\n",
"For this question use a 30 second clip of audio that you like. You can either use your own or use any of the clips in the GTZAN dataset. Try to select a clip that has a repeating rhythm, phrase, or harmonic structure that you can identify in a self-similarity matrix – you may need to experiment with a few different audio files to find one that works well. You can download GTZAN here: https://drive.google.com/file/d/16qBM8tYXn2z5LOl82mjnRAxQyZsJ0xdP/view?usp=sharing\n",
"\n",
"**1a**\n",
"\n",
"Repeat the 30 second recording to create a one minute recording. Plot the [cross-similarity matrix](https://librosa.org/doc/latest/generated/librosa.segment.cross_similarity.html) for this new one minute long clip with the original 30 second recording. Describe how the repetition can be seen in a plot of the cross-similarity matrix. Use MFCC features and the affinity mode.\n",
"\n",
"*(Minimum: 1 pt)*"
]
},
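{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Example sketch (not part of the assignment text):* one possible starting point for 1a, assuming a hypothetical 30 second file at `audio/clip30.wav` and the imports from the setup cell above. Adjust the path, hop size, and MFCC settings to your own clip."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch for 1a: repeat the clip, compute MFCCs, plot the cross-similarity matrix.\n",
"# 'audio/clip30.wav' is a hypothetical path; point it at your own 30 second clip.\n",
"hop = 512\n",
"y30, sr = librosa.load('audio/clip30.wav', duration=30.0)\n",
"y60 = np.concatenate([y30, y30])  # repeat to get a ~1 minute version\n",
"\n",
"mfcc30 = librosa.feature.mfcc(y=y30, sr=sr, hop_length=hop)\n",
"mfcc60 = librosa.feature.mfcc(y=y60, sr=sr, hop_length=hop)\n",
"\n",
"# cross-similarity between the repeated clip (columns) and the original (rows)\n",
"xsim = librosa.segment.cross_similarity(mfcc60, mfcc30, mode='affinity')\n",
"\n",
"plt.figure(figsize=(8, 4))\n",
"img = librosa.display.specshow(xsim, x_axis='time', y_axis='time', sr=sr, hop_length=hop)\n",
"plt.title('Cross-similarity: repeated clip vs. original')\n",
"plt.colorbar(img)\n",
"plt.show()"
]
},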
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**1b**\n",
"\n",
"Plot the self-similarity matrix for your original 30 second clip (i.e. the cross-similarity to itself). Visually identify a repeating structure (it could be a bar, a phrase, a segment) on the self-similarity matrix, describe it, and generate two audio fragments that demonstrate this repetition. Hint: repetition shows as block structure, you will need to map the dimensions of the repeating block to time to select the audio fragments.\n",
"\n",
"*(Minimum: 1 pt)*"
]
},
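{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Example sketch:* a possible approach to 1b, reusing `y30`, `sr`, `mfcc30`, and `hop` from the 1a sketch. The segment boundaries `seg1` and `seg2` are placeholder values; read the real ones off the block structure in your own plot."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch for 1b: self-similarity matrix plus two fragments from a repeating block.\n",
"ssm = librosa.segment.cross_similarity(mfcc30, mfcc30, mode='affinity')\n",
"\n",
"plt.figure(figsize=(5, 5))\n",
"img = librosa.display.specshow(ssm, x_axis='time', y_axis='time', sr=sr, hop_length=hop)\n",
"plt.title('Self-similarity (MFCC, affinity)')\n",
"plt.colorbar(img)\n",
"plt.show()\n",
"\n",
"# placeholder block boundaries in seconds; replace with values read off your plot\n",
"seg1 = (2.0, 6.0)\n",
"seg2 = (12.0, 16.0)\n",
"frag1 = y30[int(seg1[0] * sr):int(seg1[1] * sr)]\n",
"frag2 = y30[int(seg2[0] * sr):int(seg2[1] * sr)]\n",
"ipd.display(ipd.Audio(frag1, rate=sr), ipd.Audio(frag2, rate=sr))"
]
},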
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**1c**\n",
"\n",
"Use [time stretching](https://librosa.org/doc/latest/generated/librosa.effects.time_stretch.html) on your audio recording to create the following modified signal: the first 10 seconds should be slowed down (rate 0.75), the middle 10 seconds should remain the same, and the last 10 seconds should be sped up (rate 1.25). Plot the cross-similarity matrix between the original and modified recording (use MFCC features and the affinity mode) and describe how the time-stretching can be observed visually.\n",
"\n",
"*(Expected: 1pt)*"
]
},
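{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Example sketch:* one way to build the piecewise time-stretched signal for 1c and compare it with the original. It reuses `y30`, `sr`, `mfcc30`, and `hop` from the sketches above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch for 1c: slow down the first 10 s, keep the middle 10 s, speed up the last 10 s.\n",
"t10 = 10 * sr\n",
"slow = librosa.effects.time_stretch(y30[:t10], rate=0.75)\n",
"same = y30[t10:2 * t10]\n",
"fast = librosa.effects.time_stretch(y30[2 * t10:], rate=1.25)\n",
"y_mod = np.concatenate([slow, same, fast])\n",
"\n",
"mfcc_mod = librosa.feature.mfcc(y=y_mod, sr=sr, hop_length=hop)\n",
"xsim_mod = librosa.segment.cross_similarity(mfcc_mod, mfcc30, mode='affinity')\n",
"\n",
"plt.figure(figsize=(6, 5))\n",
"img = librosa.display.specshow(xsim_mod, x_axis='time', y_axis='time', sr=sr, hop_length=hop)\n",
"plt.title('Original (y-axis) vs. time-stretched (x-axis)')\n",
"plt.colorbar(img)\n",
"plt.show()"
]
},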
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**1d**\n",
"\n",
"Use [harmonic/percussive sound source separation](https://librosa.org/doc/latest/generated/librosa.effects.hpss.html) to generate a percussive track and a harmonic track from your 30 second example. Plot the self-similarity matrices using affinity for the percussive and harmonic versions using MFCCs as well as Chroma (use [chroma_cqt](https://librosa.org/doc/latest/generated/librosa.feature.chroma_cqt.html) or [chroma_stft](https://librosa.org/doc/latest/generated/librosa.feature.chroma_stft.html)). Based on the resulting four plots discuss which feature set works better for each configuration (harmonic/percussive).\n",
"\n",
"*(Expected: 1pt)*"
]
},
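{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Example sketch:* a possible layout for the four self-similarity plots in 1d (harmonic/percussive crossed with MFCC/chroma), reusing `y30`, `sr`, and `hop`. `chroma_cqt` is used here for both components; `chroma_stft` would work the same way."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch for 1d: HPSS, then an affinity self-similarity matrix per component/feature pair.\n",
"y_harm, y_perc = librosa.effects.hpss(y30)\n",
"\n",
"features = {\n",
"    'harmonic / MFCC': librosa.feature.mfcc(y=y_harm, sr=sr, hop_length=hop),\n",
"    'harmonic / chroma': librosa.feature.chroma_cqt(y=y_harm, sr=sr, hop_length=hop),\n",
"    'percussive / MFCC': librosa.feature.mfcc(y=y_perc, sr=sr, hop_length=hop),\n",
"    'percussive / chroma': librosa.feature.chroma_cqt(y=y_perc, sr=sr, hop_length=hop),\n",
"}\n",
"\n",
"plt.figure(figsize=(10, 10))\n",
"for i, (name, feat) in enumerate(features.items(), start=1):\n",
"    ssm_i = librosa.segment.cross_similarity(feat, feat, mode='affinity')\n",
"    plt.subplot(2, 2, i)\n",
"    librosa.display.specshow(ssm_i, x_axis='time', y_axis='time', sr=sr, hop_length=hop)\n",
"    plt.title(name)\n",
"plt.tight_layout()\n",
"plt.show()"
]
},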
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**1e**\n",
"\n",
"Use [Dynamic Time Warping](https://librosa.org/doc/latest/generated/librosa.sequence.dtw.html) on the original and modified (time-stretched) recording you created in the previous subquestion. Plot the cost matrix and associated optimal path and describe how the optimal path reflects the time stretching. Show how you can estimate the time-stretching rates from the optimal path. You can assume that you know that the rate is going to change every 10 seconds but you don't know what the corresponding rates are. Test your procedure with a set of different time stretching rates.\n",
"\n",
"*(Expected: 1 pt)*"
]
},
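{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Example sketch:* one way to approach 1e, reusing `mfcc30`, `mfcc_mod`, `sr`, and `hop` from the 1c sketch. The rate estimate for each 10 second block of the original is simply the ratio of original frames to modified frames covered by the optimal path inside that block."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch for 1e: DTW between original and time-stretched MFCCs, then estimate\n",
"# the stretch rate of each 10 second block from the slope of the optimal path.\n",
"D, wp = librosa.sequence.dtw(X=mfcc30, Y=mfcc_mod, metric='cosine')\n",
"wp = wp[::-1]  # the path is returned end-to-start; reverse it\n",
"\n",
"plt.figure(figsize=(6, 5))\n",
"img = librosa.display.specshow(D, x_axis='frames', y_axis='frames')\n",
"plt.plot(wp[:, 1], wp[:, 0], color='r')\n",
"plt.title('DTW cost matrix and optimal path')\n",
"plt.colorbar(img)\n",
"plt.show()\n",
"\n",
"frames_per_block = int(librosa.time_to_frames(10.0, sr=sr, hop_length=hop))\n",
"for k in range(3):\n",
"    lo, hi = k * frames_per_block, (k + 1) * frames_per_block\n",
"    block = wp[(wp[:, 0] >= lo) & (wp[:, 0] < hi)]\n",
"    rate = (block[-1, 0] - block[0, 0]) / max(block[-1, 1] - block[0, 1], 1)\n",
"    print('block %d: estimated rate ~ %.2f' % (k, rate))"
]
},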
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Question 2\n",
"\n",
"In this question you will be using a state-of-the-art deep learning sound source separation library: https://github.com/deezer/spleeter to explore sound source separation and transcription. For answering the question you can either select a track that you like or use one of the tracks from musdb (a database for sound source separation for music for which spleeter performs quite well): https://sigsep.github.io/datasets/musdb.html.\n",
"\n",
"**2a**\n",
"\n",
"Run the sound source separation with the 4 stem model on your example audio recording. Listen and plot the time-domain waveforms of the 4 individual stems. Comment on what you hear.\n",
"\n",
"*(Minimum: 1pt)*"
]
},
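{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Example sketch:* running spleeter's 4-stem model from Python (it can also be run from the command line). `song.wav` is a hypothetical input file; with spleeter's default settings the stems are written to `output/song/vocals.wav`, `drums.wav`, `bass.wav`, and `other.wav`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch for 2a: separate into 4 stems with spleeter and plot their waveforms.\n",
"# Assumes spleeter is installed (pip install spleeter) and 'song.wav' exists.\n",
"from spleeter.separator import Separator\n",
"\n",
"separator = Separator('spleeter:4stems')\n",
"separator.separate_to_file('song.wav', 'output')\n",
"\n",
"stems = ['vocals', 'drums', 'bass', 'other']\n",
"plt.figure(figsize=(10, 8))\n",
"for i, stem in enumerate(stems, start=1):\n",
"    y_stem, sr_stem = librosa.load('output/song/%s.wav' % stem, sr=None, mono=True)\n",
"    t = np.arange(len(y_stem)) / sr_stem\n",
"    plt.subplot(4, 1, i)\n",
"    plt.plot(t, y_stem)\n",
"    plt.title(stem)\n",
"    plt.xlabel('time (s)')\n",
"plt.tight_layout()\n",
"plt.show()"
]
},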
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**2b**\n",
"\n",
"Pitch shift the vocal stem by a major third (4 semitones) using [librosa.effects.pitch_shift](https://librosa.org/doc/latest/generated/librosa.effects.pitch_shift.html) and mix the pitch shifted audio with the other stems. Listen to the result and comment on it.\n",
"\n",
"*(Minimum: 1pt)*"
]
},
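{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Example sketch:* pitch-shifting the vocal stem up four semitones and mixing it back with the other stems, assuming the spleeter output paths from the 2a sketch."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch for 2b: shift the vocals up a major third, then remix with the other stems.\n",
"vocals, sr2 = librosa.load('output/song/vocals.wav', sr=None, mono=True)\n",
"drums, _ = librosa.load('output/song/drums.wav', sr=sr2, mono=True)\n",
"bass, _ = librosa.load('output/song/bass.wav', sr=sr2, mono=True)\n",
"other, _ = librosa.load('output/song/other.wav', sr=sr2, mono=True)\n",
"\n",
"vocals_shifted = librosa.effects.pitch_shift(vocals, sr=sr2, n_steps=4)\n",
"\n",
"# trim to a common length, sum, and normalise to avoid clipping\n",
"n = min(len(vocals_shifted), len(drums), len(bass), len(other))\n",
"mix = vocals_shifted[:n] + drums[:n] + bass[:n] + other[:n]\n",
"mix = mix / np.max(np.abs(mix))\n",
"ipd.Audio(mix, rate=sr2)"
]
},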
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**2c**\n",
"\n",
"Run monophonic pitch detection (either your own implementation or using some library like librosa) on the vocal stem, sonify it using a sinusoid (you can use the sonification code provided below) and mix with the drum track. Listen to the result and comment on whether you still recognize the song.\n",
"\n",
"*(Expected: 1 pt)*"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Function to sonify a pitch track into a sine wave\n",
"def sonify(pitch_track, srate, hop_size):\n",
"\n",
"    times = np.arange(0.0, float(hop_size * len(pitch_track)) / srate,\n",
"                      float(hop_size) / srate)\n",
"\n",
"    # sample locations in time (seconds)\n",
"    sample_times = np.linspace(0, np.max(times), int(np.max(times)*srate-1))\n",
"\n",
"    # create a linear interpolator for the frequencies so that we\n",
"    # have a frequency value for every sample\n",
"    freq_interpolator = interp1d(times, pitch_track)\n",
"\n",
"    # use the interpolator to calculate per sample frequency values\n",
"    sample_freqs = freq_interpolator(sample_times)\n",
"\n",
"    # create audio signal\n",
"    audio = np.zeros(len(sample_times))\n",
"    T = 1.0 / srate\n",
"    phase = 0.0\n",
"\n",
"    # update phase according to the sample frequencies\n",
"    for i in range(1, len(audio)):\n",
"        audio[i] = np.sin(phase)\n",
"        phase = phase + (2*np.pi*T*sample_freqs[i])\n",
"\n",
"    return audio"
]
},
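{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Example sketch:* pitch-tracking the vocal stem with `librosa.pyin`, sonifying the result with `sonify()` above, and mixing it with the drum stem from the 2b sketch. Unvoiced frames are simply set to 0 Hz here, which is crude but serviceable for a quick listen."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch for 2c: monophonic pitch detection (pyin), sine sonification, mix with drums.\n",
"hop2 = 512\n",
"f0, voiced_flag, voiced_prob = librosa.pyin(vocals,\n",
"                                            fmin=librosa.note_to_hz('C2'),\n",
"                                            fmax=librosa.note_to_hz('C6'),\n",
"                                            sr=sr2, hop_length=hop2)\n",
"pitch_track = np.nan_to_num(f0)  # unvoiced frames (NaN) -> 0 Hz\n",
"\n",
"vocal_sine = sonify(pitch_track, sr2, hop2)\n",
"\n",
"n = min(len(vocal_sine), len(drums))\n",
"mix = 0.5 * vocal_sine[:n] + drums[:n]\n",
"ipd.Audio(mix, rate=sr2)"
]
},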
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**2d**\n",
"\n",
"Inspect the magnitude spectrogram of the drum track and try to identify features that characterize the kick sound and the snare sound. Based on your visual inspection write a simple kick and snare detection method that outputs the times where a kick or snare drum occurs. Your method only needs to work for the specific example you are examining and does not need to be perfect. Create a new track by placing the kick.wav and snare.wav samples from the resources folder in the detected locations. Mix your synthetic toy drum track and sinusoid sonification from the previous question and listen to the result. Can you still identify the song?\n",
"\n",
"*(Advanced: 1 pt)*"
]
},
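{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Example sketch:* one very rough way to attack 2d: pick onsets in the drum stem, then label each onset as kick or snare by comparing low-frequency and high-frequency energy around it. The 150 Hz / 1500 Hz split points are placeholder values; tune them to what you actually see in your spectrogram."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch for 2d: naive kick/snare detection and a toy drum track from the samples.\n",
"S = np.abs(librosa.stft(drums, hop_length=hop2))\n",
"freqs = librosa.fft_frequencies(sr=sr2)\n",
"low = S[freqs < 150].sum(axis=0)    # kick energy sits mostly below ~150 Hz\n",
"high = S[freqs > 1500].sum(axis=0)  # snare carries much more high-frequency energy\n",
"\n",
"onset_frames = librosa.onset.onset_detect(y=drums, sr=sr2, hop_length=hop2)\n",
"kick_times, snare_times = [], []\n",
"for f in onset_frames:\n",
"    t = librosa.frames_to_time(f, sr=sr2, hop_length=hop2)\n",
"    (kick_times if low[f] > high[f] else snare_times).append(t)\n",
"\n",
"kick, _ = librosa.load('a5_resources/kick.wav', sr=sr2)\n",
"snare, _ = librosa.load('a5_resources/snare.wav', sr=sr2)\n",
"\n",
"toy = np.zeros(len(drums))\n",
"for times, sample in [(kick_times, kick), (snare_times, snare)]:\n",
"    for t in times:\n",
"        i = int(t * sr2)\n",
"        end = min(i + len(sample), len(toy))\n",
"        toy[i:end] += sample[:end - i]\n",
"\n",
"n = min(len(toy), len(vocal_sine))\n",
"mix = toy[:n] + 0.5 * vocal_sine[:n]\n",
"ipd.Audio(mix, rate=sr2)"
]
},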
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**2e**\n",
"\n",
"Run beat tracking on the drum track (result from spleeter) and then use the resulting beat onsets combined with the extracted pitch from the vocal melody to transcribe the vocal melody to music notation using music21. For each beat map the median MIDI pitch of the vocals to a tinyNotation string. Listen to the generated MIDI melody using music21. Can you still identify the melody?\n",
"\n",
"*(Advanced: 1 pt)*"
]
},
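{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Example sketch:* beat-track the drum stem, take the median pyin pitch inside each beat, and turn it into a music21 tinyNotation string. `midi_to_tiny()` is a hypothetical helper with deliberately simplified octave/accidental handling, so expect to tweak it for your own melody; every beat is written as a quarter note."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch for 2e: beats from the drum stem + median vocal pitch per beat -> tinyNotation.\n",
"from music21 import converter, pitch\n",
"\n",
"tempo, beat_frames = librosa.beat.beat_track(y=drums, sr=sr2, hop_length=hop2)\n",
"\n",
"def midi_to_tiny(m):\n",
"    # Hypothetical, simplified MIDI -> tinyNotation pitch token: lowercase = octave 4,\n",
"    # apostrophes raise an octave, repeated uppercase letters go an octave down each,\n",
"    # '#'/'-' mark sharp/flat.\n",
"    p = pitch.Pitch(midi=int(round(m)))\n",
"    acc = ''\n",
"    if p.accidental is not None:\n",
"        acc = {'sharp': '#', 'flat': '-'}.get(p.accidental.name, '')\n",
"    if p.octave >= 4:\n",
"        return p.step.lower() + \"'\" * (p.octave - 4) + acc\n",
"    return p.step.upper() * (4 - p.octave) + acc\n",
"\n",
"tokens = []\n",
"for start, end in zip(beat_frames[:-1], beat_frames[1:]):\n",
"    beat_f0 = f0[start:end]               # f0 from the 2c sketch (same hop length)\n",
"    beat_f0 = beat_f0[~np.isnan(beat_f0)]\n",
"    if len(beat_f0) == 0:\n",
"        tokens.append('r4')               # no voiced frames in this beat -> rest\n",
"    else:\n",
"        tokens.append(midi_to_tiny(librosa.hz_to_midi(np.median(beat_f0))) + '4')\n",
"\n",
"melody = converter.parse('tinyNotation: 4/4 ' + ' '.join(tokens))\n",
"melody.write('midi', fp='melody_transcription.mid')  # open/play this file to listen\n",
"melody.show('text')"
]
}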
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.7"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

CSC 475 / SENG 480B Music Retrieval Techniques

Assignment 5, Fall 2020

J. Shier

November 18, 2020

1 Introduction

In this assignment we will explore similarity and self-similarity matrices, dynamic time warping as well as sound source separation and transcription. There are three levels of engagement: Minimum, Expected, Advanced. Minimum is the least amount of work you can do to keep up with the course. If you are not able to complete the minimum work suggested then this course is probably not for you. Expected is what a typical student of the course should work on – the goal is to provide a solid foundation and understanding but not go deeper into more interesting questions and research topics. Advanced is what students who are particularly interested in the topic of MIR, students who want a high grade, students interested in graduate school, and graduate students should work on. Each level of engagement includes the previous one so a student who is very interested in the topic should complete all three levels of engagement (Minimum, Expected, Advanced).

The assignment is worth 10% of the final grade. There is some variance in the amount of time each question probably will require. Please provide your answers as a single Jupyter Python notebook (.ipynb) uploaded through the BrightSpace website for the course. Also please answer the questions in order and explicitly mention if you have decided to skip a question. For the questions that do not require programming use the ability to write Markdown in Jupyter notebooks for your answer. You are also welcome to do the assignment in any other programming language/framework. If you do so please include a PDF file with plots and figures for questions that require them, as well as answers for questions that don't require coding. Also include the code itself with instructions on how to run it in a README file.


Note: Please upload your completed notebook as a zip file with all audio files used to complete your assignment. Only include the audio files that you used to keep the upload size as small as possible. (i.e. do not upload all of GTZAN or any other dataset)

2 Problems

The coding problems mention specific libraries/frameworks for Python that can help you with implementation. However you should be able to find corresponding libraries or code for most programming languages if you want to try using a different programming language or environment.

1. For this question use a 30 second clip of audio that you like. You can either use your own or use any of the clips in the GTZAN dataset. Try to select a clip that has a repeating rhythm, phrase, or harmonic structure that you can identify in a self-similarity matrix – you may need to experiment with a few different audio files to find one that works well. You can download GTZAN here: https://drive.google.com/file/d/16qBM8tYXn2z5LOl82mjnRAxQyZsJ0xdP/view?usp=sharing

• a) Repeat the 30 second recording to create a one minute recording. Plot the cross-similarity matrix [1] for this new one minute long clip with the original 30 second recording. Describe how the repetition can be seen in a plot of the cross-similarity matrix. Use MFCC features and the affinity mode. (Minimum: 1pt)

  [1] https://librosa.org/doc/latest/generated/librosa.segment.cross_similarity.html

• b) Plot the self-similarity matrix for your original 30 second clip (i.e. the cross-similarity to itself). Visually identify a repeating structure (it could be a bar, a phrase, a segment) on the self-similarity matrix, describe it, and generate two audio fragments that demonstrate this repetition. Hint: repetition shows as block structure, you will need to map the dimensions of the repeating block to time to select the audio fragments. (Minimum: 1pt)

• c) Use time stretching [2] on your audio recording to create the following modified signal: the first 10 seconds should be slowed down (rate 0.75), the middle 10 seconds should remain the same, and the last 10 seconds should be sped up (rate 1.25). Plot the cross-similarity matrix between the original and modified recording (use MFCC features and the affinity mode) and describe how the time-stretching can be observed visually. (Expected: 1pt)

  [2] https://librosa.org/doc/latest/generated/librosa.effects.time_stretch.html

• d) Use harmonic/percussive sound source separation [3] to generate a percussive track and a harmonic track from your 30 second example. Plot the self-similarity matrices using affinity for the percussive and harmonic versions using MFCCs as well as Chroma (use chroma_cqt [4] or chroma_stft [5]). Based on the resulting four plots discuss which feature set works better for each configuration (harmonic/percussive). (Expected: 1pt)

  [3] https://librosa.org/doc/latest/generated/librosa.effects.hpss.html
  [4] https://librosa.org/doc/latest/generated/librosa.feature.chroma_cqt.html
  [5] https://librosa.org/doc/latest/generated/librosa.feature.chroma_stft.html

• e) Use Dynamic Time Warping [6] on the original and modified (time-stretched) recording you created in the previous subquestion. Plot the cost matrix and associated optimal path and describe how the optimal path reflects the time stretching. Show how you can estimate the time-stretching rates from the optimal path. You can assume that you know that the rate is going to change every 10 seconds but you don't know what the corresponding rates are. Test your procedure with a set of different time stretching rates. (Expected: 1pt)

  [6] https://librosa.org/doc/latest/generated/librosa.sequence.dtw.html

2. In this question you will be using a state-of-the-art deep learning sound source separation library: https://github.com/deezer/spleeter to explore sound source separation and transcription. For answering the question you can either select a track that you like or use one of the tracks from musdb (a database for sound source separation for music for which spleeter performs quite well): https://sigsep.github.io/datasets/musdb.html.

• a) Run the sound source separation with the 4 stem model on your example audio recording. Listen and plot the time-domain waveforms of the 4 individual stems. Comment on what you hear. (Minimum: 1pt)

• b) Pitch shift the vocal stem by a major third (4 semitones) using librosa.effects.pitch_shift and mix the pitch shifted audio with the other stems. Listen to the result and comment on it. (Minimum: 1pt)

• c) Run monophonic pitch detection (either your own implementation or using some library like librosa) on the vocal stem, sonify it using a sinusoid (you can use the sonification code provided in the template) and mix with the drum track. Listen to the result and comment on whether you still recognize the song. (Expected: 1pt)

• d) Inspect the magnitude spectrogram of the drum track and try to identify features that characterize the kick sound and the snare sound. Based on your visual inspection write a simple kick and snare detection method that outputs the times where a kick or snare drum occurs. Your method only needs to work for the specific example you are examining and does not need to be perfect. Create a new track by placing the kick.wav and snare.wav samples from the resources folder in the detected locations. Mix your synthetic toy drum track and sinusoid sonification from the previous question and listen to the result. Can you still identify the song? (Advanced: 1pt)

• e) Run beat tracking on the drum track (result from spleeter) and then use the resulting beat onsets combined with the extracted pitch from the vocal melody to transcribe the vocal melody to music notation using music21. For each beat map the median MIDI pitch of the vocals to a tinyNotation string. Listen to the generated MIDI melody using music21. Can you still identify the melody? (Advanced: 1pt)


Resources: a5_resources/kick.wav, a5_resources/snare.wav
