UR-48 Using Semantic Segmentation in a Convolutional Neural Network for Vocal Localization in Music
Location
https://ccse.kennesaw.edu/computing-showcase/cday-programs/spring2021program.php
Document Type
Event
Start Date
26-4-2021 5:00 PM
Description
I. PROJECT OVERVIEW

A. Research Question

In this project, we asked: "Is there an easier way to extract vocals from music?" Many existing works extract vocals with Deep Neural Networks that use Multitask Learning; these networks are large and take a long time to train. As an alternative, we present a method that identifies vocals with a Convolutional U-Network (U-Net) performing Semantic Segmentation of audio files.

B. Project Description

This project differs from other works by identifying vocal locations after converting audio files into Short-Time Fourier Transforms (STFTs) and treating them as images in the U-Net. Treated this way, an STFT lets the U-Net identify the locations of "vocal features" the same way a U-Net would identify desired features within an image. This segmentation approach is what sets the project apart from similar works. Many other works treat each song as an audio signal with real and imaginary components, which means their algorithms frame the task as a signal-processing problem. By looking at the STFT of the song as a graph instead, we can approach it as an image-processing problem, which offers more tools within the realm of Deep Learning, such as Semantic Segmentation.
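As a concrete illustration of this preprocessing, the sketch below converts a song into a normalized STFT magnitude "image". It is a minimal sketch only: the abstract does not name the tooling, so librosa, the file name, and the window parameters are all assumptions.

    # Minimal sketch: turn a song into an STFT "image" (assumed tooling).
    import numpy as np
    import librosa

    y, sr = librosa.load("song.wav", sr=None, mono=True)  # 1-D audio signal
    stft = librosa.stft(y, n_fft=2048, hop_length=512)    # complex STFT
    magnitude = np.abs(stft)   # amplitude heatmap: frequency (rows) x time (cols)

    # Log-scale and normalize to [0, 1] so the array behaves like an image.
    image = librosa.amplitude_to_db(magnitude, ref=np.max)
    image = (image - image.min()) / (image.max() - image.min())
    print(image.shape)         # (frequency bins, time frames)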
II. EXPERIMENTATION

A. Materials and Methods

All materials used were software. First, the U-Network was created and run in Python on the CCSE cluster for high efficiency. A U-Network is a Convolutional Neural Network that can output images: it convolves the original image so that only the prominent features remain, then deconvolves the output to display those features at the original image resolution for further processing. This gives the U-Net its "U" shape when drawn out. Second, the data created for the project were music files converted into Short-Time Fourier Transforms (STFTs) and processed as image files; the input to the U-Network was an entire song's STFT, and the labeled data was the STFT of the vocal audio file for that same song. A Short-Time Fourier Transform can be considered a heatmap of the song's amplitudes across frequency and time.
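The abstract does not report the network's depth or layer sizes, so the following is only a minimal sketch of the convolve-then-deconvolve "U" shape it describes, written in PyTorch (an assumption) with illustrative channel counts.

    # One-level U-Net sketch; all sizes are assumed, not the project's values.
    import torch
    import torch.nn as nn

    class TinyUNet(nn.Module):
        def __init__(self):
            super().__init__()
            # Encoder: convolution + downsampling keeps prominent features.
            self.enc = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU())
            self.down = nn.MaxPool2d(2)
            self.mid = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
            # Decoder: transposed convolution restores the input resolution.
            self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
            self.dec = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU())
            self.out = nn.Conv2d(16, 1, 1)  # per-pixel vocal/non-vocal score

        def forward(self, x):
            e = self.enc(x)                 # (B, 16, H, W)
            m = self.mid(self.down(e))      # (B, 32, H/2, W/2)
            u = self.up(m)                  # back to (B, 16, H, W)
            u = torch.cat([u, e], dim=1)    # skip connection across the "U"
            return torch.sigmoid(self.out(self.dec(u)))

    mask = TinyUNet()(torch.rand(1, 1, 128, 128))  # segmentation mask in [0, 1]

Training would then pair each full-song STFT (input) with the vocal-stem STFT (label) as the abstract describes, for example with a per-pixel binary cross-entropy loss.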
B. Results

The initial results from the U-Network show a high level of accuracy for vocal-location predictions. Because the output of a U-Network is an image, these outputs are the original song's STFT with a mask applied to show the location of the vocal waves. These trials have an accuracy greater than 80%, which is a very good result this early in the process. The vocals have been identified and located in this study; the next step is to pull the vocals out and convert them back into an audio wave.
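The abstract leaves that conversion as future work. Purely as an illustration of one common approach, not the project's method, a predicted mask could be applied to the mixture's complex STFT and inverted back to a waveform, continuing the variables from the first sketch.

    # Hypothetical follow-on step (not part of the reported results):
    # apply the predicted mask to the complex STFT and invert it.
    import librosa

    def extract_vocals(stft, mask, hop_length=512):
        """stft: complex mixture STFT; mask: values in [0, 1], same shape."""
        vocal_stft = stft * mask           # keep only the masked vocal bins
        return librosa.istft(vocal_stft, hop_length=hop_length)

    # e.g. vocals = extract_vocals(stft, mask); soundfile.write("vocals.wav",
    # vocals, sr) would then save the recovered audio wave.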
III. MARKETABILITY

For the last 20 or so years, large record labels have been "remastering" old music: digitizing the old analog tracks of a song, mixing them on a new sound board, and releasing the remastered work at a marked-up price. Because pre-computer recording methods relied on tape, tracks were often recorded over each other to save space on the reel. When a song has this issue, a computer program has to pull out all of the pieces of the song so that the engineer can remaster it. This project shows the initial steps toward simpler audio extraction: by handling the task as an image-processing problem instead of a signal-processing problem, we are able to create a more efficient Neural Network.

Advisor(s): Dr. Aledhari
Topic(s): Artificial Intelligence
CS 4267