Identifying Ragas in Carnatic Music with Machine Learning
I recently experimented with machine learning (ML) techniques to see whether a computational model can be trained to identify ragas in Carnatic music. To my pleasant surprise, an AutoML model that evaluates different predictive models under the hood and chooses the best one classified ragas with good precision (and recall). Once a feature set was constructed for all the music segments to be fed into the machine learning program, it was straightforward to arrive at a good classification model in three easy steps using AutoML.
Ragas in Carnatic Music
Carnatic music is a classical style of music from India. This centuries-old music system provides a framework for the composition of songs, primarily performed by a vocal artist with melodic and rhythmic accompaniment on musical instruments. Carnatic music rests on two main elements: rāga, the modes or melodic formulae, and tāḷa, the rhythmic cycles. A raga in Carnatic music prescribes a set of rules for building a melody, very similar to the Western concept of mode. There are many ragas in Carnatic music, with each raga denoting the mood or expression of a musical piece.
Each raga is defined by an ascending and descending pattern of notes from the solfege Sa-Re-Ga-Ma-Pa-Dha-Ni-Sa (like the Western Do-Re-Mi-Fa-Sol-La-Ti-Do). In its simplest form, a raga forms a scale in which individual tones are treated with precise and unique ornamentations. When children learn Carnatic music, they start by naming the raga to which a composition is set, along with its rhythmic count (tala), and practice the ascent and descent of the scale before singing the melodic phrases.
Video 1: Ananda Bhairavi Raga in Carnatic Music - Ascent, Descent, Signature Style
As in any art form, artists performing ragas within the Carnatic framework have great latitude for experimentation and improvisation. They introduce a wealth of microtonal ornamentations called gamakas, which are essential to bring out the true color of a raga and demonstrate the full range of the musician's talent and depth of knowledge. These ornamentations produce a continuous flow, gliding through a continuum of frequencies, and are distinguishing features of many ragas [1]. Observe these nuances in the video below.
Video 2: Raga Lessons in Carnatic Music
Deep Learning and Machine Learning for Audio/Music Analysis
Machine learning and deep learning are widely used to process audio and music signals in applications we come across every day, often without realizing it: speech recognition, natural language processing, language translation, voice activation (trigger word detection), environmental sound detection, music information retrieval, music synthesis, and genre identification are some examples. These applications train computational models on large amounts of data so that, when a new audio signal is presented, the related action is produced. Popular architectures for audio processing are Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs), or a combination of the two, wherein the audio sample is transformed into a relevant feature set used to train and validate the models. RNNs are popular for analyzing time-series or sequential inputs like speech and audio signals, and CNNs for image recognition and classification, although both have been successfully applied in many other areas. For classification tasks like instrument detection or genre identification, methods such as Decision Trees, Random Forest classifiers, and deep neural networks have been used successfully. Figure 1 shows an example of an RNN deep learning model for trigger word detection [2], which activates a specific action on a program or device upon hearing a key phrase or word; Alexa, Siri, and Google devices that are voice activated by certain phrases are such examples. A good summary of deep learning for audio systems is presented in references [3][4][5][6].
Figure 1: Example Recurrent Neural Network (RNN) for Trigger Word Detection [2]
Processing Audio Files and Extracting Features for Machine Learning
The important question is how the raw audio sample should be transformed to be useful for training a model to learn the task. An audio segment has thousands of points representing the waveform, typically sampled at 44.1 kHz (i.e., 44,100 points per second of audio). Even for a representative audio segment of 10-15 seconds, there are far too many points to feed to computational models directly (15-30 seconds is roughly the average time a human needs to recognize a raga; for other applications like trigger word detection, shorter segments are sufficient). A better approach is to transform these signals from the time domain to the frequency domain. Sound is nothing but a combination of pressure waves at different frequencies, and spectrograms present an efficient method of analyzing audio signals in the frequency domain. A spectrogram is a visual representation of the spectrum of frequencies of a signal as it varies with time. Its output is a two-dimensional matrix for the audio segment, with the horizontal axis representing moving time-window positions and the vertical axis the frequencies. See Figure 2 for an example of a mel-spectrogram. The mel-spectrogram refines the spectrogram by representing frequencies on a human-perceptual scale (the mel scale). A further compressed representation of the audio is obtained by calculating Mel-frequency cepstral coefficients (MFCCs). Depending on the type of application and the resources available, any of these representations can be used. There are other representations that capture dynamics, pitch, etc., but they were not considered in this work. More details on how spectrograms and MFCCs are calculated can be found in reference [7], but there are many open source libraries to extract them easily from wav or mp3 files. I used the librosa module in Python for the spectrogram and MFCC calculations, in just two or three lines of code (a short sketch follows Figure 2).
Figure 2: Mel Spectrogram
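To make the extraction step concrete, here is a minimal sketch of the two or three librosa lines involved. The file name and the 44.1 kHz sample rate are placeholders for whatever clip is at hand.

```python
# Minimal sketch: load a clip and compute a mel-spectrogram and MFCCs with librosa.
# The file name is a placeholder; any wav or mp3 clip works.
import librosa
import numpy as np

y, sr = librosa.load("clip.mp3", sr=44100)          # waveform sampled at 44.1 kHz

# Mel-spectrogram: a time-frequency representation on the mel scale
mel = librosa.feature.melspectrogram(y=y, sr=sr)
mel_db = librosa.power_to_db(mel, ref=np.max)       # convert power to decibels

# MFCCs: a compact summary of the spectral envelope (40 coefficients per frame)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)
print(mel_db.shape, mfcc.shape)                     # (n_mels, frames), (40, frames)
```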
Raga Identification with Machine Learning
Identifying ragas is a learned skill that comes from years of practice or from listening to many hours of Carnatic music. A knowledgeable person can recognize ragas by analyzing the technical elements of a song or, with less skill, by its similarity to some other familiar composition in the same raga. If a trained ear can spot ragas, can a machine be trained to recognize them? That's what I tried to find out in this experiment. Simply identifying the dominant notes from the sounds or phrases in a composition may not be enough, although I did come across such an approach, using chromagrams, in one reference [8]. As mentioned earlier, artists improvise and experiment quite a lot. The same composition performed by two different artists will sound different, and even the same artist can sound different from concert to concert, making raga identification challenging.
I initially started out feeding spectrograms to a model similar to the one shown in Figure 1. That RNN architecture was built for trigger word detection [2], but I added a dense layer at the end to produce a classification output: whether the sound clip is in a certain raga or not. Initial runs took a long time to optimize with no noticeable improvement even in the training phase. Accuracy hovered around 50%, no better than a random guess, exhibiting high bias; the model was not able to learn from the audio samples as modeled. Getting more training audio and increasing the complexity of the network by adding a CNN layer showed promise by improving performance, but the model still lacked good predictive capability. My next attempt shifted to using MFCCs as features instead of spectrograms, with good success, recalling that MFCCs provide a compressed representation of the audio. Before I get into the modeling details and results, let me explain a bit about the audio training data used and the processing done to make it ready for machine learning.
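For context, here is a rough Keras sketch of the kind of CNN + RNN classifier described above. The spectrogram shape, layer sizes, and hyperparameters are illustrative assumptions, not the exact configuration used in the experiment.

```python
# Illustrative sketch of a CNN + RNN binary classifier over spectrogram frames.
# Shapes and layer sizes are assumptions for a ~30-second clip, not the original settings.
from tensorflow.keras import layers, models

n_frames, n_mels = 1292, 128    # assumed spectrogram shape (time frames x mel bands)

model = models.Sequential([
    layers.Input(shape=(n_frames, n_mels)),
    layers.Conv1D(64, kernel_size=15, strides=4, activation="relu"),  # convolution over time
    layers.BatchNormalization(),
    layers.GRU(64),                                                   # recurrent layer
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid"),                            # raga / not-raga output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```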
Audio Data for Training and Preprocessing
Audio samples in the Ananda bhairavi raga were gathered from publicly available sources like YouTube and other webpages for the target class. Ananda bhairavi is one of the popular ragas in Carnatic music, with many songs composed in it, so a simple search on YouTube produces hundreds of videos. The samples included full concert performances, both vocal and instrumental, instructional videos, and other snippets in the Ananda bhairavi raga. For roughly the same total duration as the Ananda bhairavi samples (the positive class), I also downloaded other ragas and non-Carnatic music samples to form the negative class for training and validation.
My goal was to do as little processing as possible when building the training set (both positive and negative examples), so the audio clips were by no means a clean or controlled set. The collection had a mix of male and female vocal performances with background accompaniment including violin and percussion, individual instrumental concerts (violin, flute), audience applause at different points, artists speaking at the beginning or in the middle of concerts to engage the audience, instructional videos explaining the raga, and so on. The acoustic quality of the clips also varied, as they were uploaded to YouTube by different users. My hope was that the machine learning algorithm would handle all these variations with sufficient training. There was no careful labeling of segments containing outliers, no extraction of note sequences, and no note-segmentation techniques; the recordings were used as available.
The duration of each audio piece was also different. There were short clips of 1 to 5 minutes as well as long ones up to 45 minutes. It's not uncommon for an artist to perform a composition in a raga for that long, adding different structural elements to the song and leading the listener on a musical journey that captivates the senses. The only preprocessing step I performed was to split all audio files into 30-second segments, extract the mean MFCCs (40 coefficients) for each segment, and feed these to the machine learning model (sketched in code after Figure 3). Figure 3 shows the MFCC distribution for the audio segments in the Ananda bhairavi raga (positive class) and for the other music segments (negative class). Each column in Figure 3 represents the 40 mean MFCCs for one 30-second segment.
Figure 3: Feature Matrix of Mean MFCCs for Positive and Negative Classes of the Audio Segments
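The preprocessing described above can be sketched roughly as follows. The file names and label values are placeholders; each row of the resulting matrix holds the 40 mean MFCCs of one 30-second segment plus its class label.

```python
# Split each recording into 30-second segments and keep the mean of 40 MFCCs per segment.
# File paths and labels below are placeholders for the actual training collection.
import librosa
import numpy as np

def segment_features(path, label, seg_seconds=30, n_mfcc=40):
    """One row per 30-second segment: 40 mean MFCCs followed by the class label."""
    y, sr = librosa.load(path, sr=None)              # keep the file's native sample rate
    seg_len = seg_seconds * sr
    rows = []
    for start in range(0, len(y) - seg_len + 1, seg_len):
        segment = y[start:start + seg_len]
        mfcc = librosa.feature.mfcc(y=segment, sr=sr, n_mfcc=n_mfcc)
        rows.append(np.concatenate([mfcc.mean(axis=1), [label]]))   # average over time frames
    return np.array(rows)

# Positive-class (Ananda bhairavi) files labeled 1, everything else labeled 0
features = np.vstack([
    segment_features("ananda_bhairavi_concert.mp3", 1),
    segment_features("other_music.mp3", 0),
])
```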
Building a Machine Learning Model To Learn and Predict Ragas
For the machine learning part, I used the AutoML node in Altair Knowledge Studio, a machine learning and predictive analytics tool, to come up with the best model. With the feature matrix available for all the audio files (40 mean MFCCs for each 30-second segment, labeled 1 or 0 depending on whether the clip comes from an audio file in the Ananda bhairavi raga), it took just three easy steps to get the best-performing trained model:
1) Load the feature matrix to Knowledge Studio
2) Connect to an AutoML node and set parameters if needed (I just used the defaults)
3) Click Run
Figure 4: AutoML Modeling To Find the Best Predictive Model
Under the hood, AutoML performs variable selection based on predictive power, applies any transformations (if required), splits the data into training and validation sets, and evaluates candidate models against the user-specified criterion. Decision Trees, Bagging, Boosting, Deep Learning, Regression, Random Forest, and Regularization are the model families evaluated by the AutoML node; in my case, Random Forest was chosen as the best one based on the Area Under the ROC Curve (AUC) metric. For readers without Knowledge Studio, a rough open-source equivalent of this final step is sketched after Figure 5.
Figure 5: Output of AutoML Model in Altair Knowledge Studio
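As mentioned above, the winning setup can be approximated with scikit-learn: a Random Forest scored by AUC on a held-out split. The `features` matrix is the one sketched earlier (40 mean MFCCs per segment plus a 0/1 label); the hyperparameters here are assumptions, not what AutoML selected.

```python
# Rough open-source equivalent: train a Random Forest and score it by AUC on a validation split.
# `features` is the segment matrix from the earlier sketch (40 mean MFCCs + label per row).
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = features[:, :40], features[:, 40]
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train, y_train)

val_scores = clf.predict_proba(X_val)[:, 1]          # probability of the positive class
print("Validation AUC:", roc_auc_score(y_val, val_scores))
```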
Findings
A report created at the end of training and validation showed 89% accuracy, with a precision of 0.87 and a recall of 0.92 on the validation set. That is not too surprising, given that the validation and training sets come from the same distribution. Testing on songs completely unseen by the model until prediction time yielded satisfactory results as well. I picked recordings that were either entirely new compositions in the same raga or by artists not used for training. Every song was predicted in the right class. A few 30-second segments were not classified correctly, but looking at each complete song, more than 60% of its segments were flagged with the right class, for both positive and negative samples.
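The song-level check described above can be approximated with a simple majority vote over segment predictions, reusing the hypothetical `segment_features` and `clf` from the earlier sketches; the file name is a placeholder.

```python
# Aggregate per-segment predictions into a song-level call by majority vote.
# Reuses segment_features() and clf from the earlier sketches; the file name is a placeholder.
segments = segment_features("unseen_song.mp3", label=-1)[:, :40]   # dummy label is dropped
segment_preds = clf.predict(segments)                              # 0/1 per 30-second segment

positive_fraction = segment_preds.mean()
print(f"{positive_fraction:.0%} of segments predicted as the positive class")
print("Song-level call:", "Ananda bhairavi" if positive_fraction > 0.5 else "not Ananda bhairavi")
```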
In conclusion, what started out as a curious experiment turned into an exciting experience in going through the different stages of setting up and building an ML problem. Little time was spent on data preparation, and even the noisy data was handled robustly by the computational models. The MFCC features contained enough information to capture the nuances of ragas and singing styles. The AutoML functionality made it easy to select the most important variables, build and validate a trained model, and conveniently produce the one with the best performance metric. Many open source libraries and examples are available in the published literature to help anyone get started and to lower the barrier to exploring machine learning and deep learning for different applications.
References
[1] Unnikrishnan G, Automatic recognition and classification of Carnatic Ragas. URL: https://hdl.handle.net/10603/111461
[2] Andrew Ng, Sequence Models. URL: https://www.coursera.org/learn/nlp-sequence-models
[3] Hendrik Purwins, Bo Li, Tuomas Virtanen, Jan Schlüter, Shuo-yiin Chang, Tara Sainath, Deep Learning for Audio Signal Processing. URL: https://arxiv.org/pdf/1905.00078.pdf
[4] Gideon Mendels, Niko Laskaris, How to Apply Machine Learning and Deep Learning Methods to Audio Analysis. URL: https://towardsdatascience.com/how-to-apply-machine-learning-and-deep-learning-methods-to-audio-analysis-615e286fcbbc
[5] Jyotika Singh, An introduction to audio processing and machine learning using Python. URL: https://opensource.com/article/19/9/audio-processing-machine-learning-python
[6] Adam Geitgey, Machine Learning is Fun Part 6: How to do Speech Recognition with Deep Learning. URL: https://medium.com/@ageitgey/machine-learning-is-fun-part-6-how-to-do-speech-recognition-with-deep-learning-28293c162f7a
[7] Haytham Fayek, Speech Processing for Machine Learning: Filter Banks, Mel-Frequency Cepstral Coefficients (MFCCs) and What's In-Between. URL: https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html
[8] Hiteshwari Sharma, Rasmeet S. Bali, Comparison of ML Classifiers for Raga Recognition. URL: https://www.ijsrp.org/research-paper-1015/ijsrp-p46102.pdf
[9] K Priya, Geetha R Ramani and Shomona Gracia Jacob, Data Mining Techniques for Automatic Recognition of Carnatic Raga Swaram Notes. URL: https://research.ijcaonline.org/volume52/number10/pxc3881444.pdf
[10] B. Tarakeswara Rao, Sivakoteswararao Chinnam, P. Lakshmi Kanth, M. Gargi, Automatic Melakarta Raaga Identification System: Carnatic Music. URL: https://thesai.org/Downloads/IJARAI/Volume1No4/Paper_6-Automatic_Melakarta_Raaga_Identification_Syste_Carnatic_Music.pdf
[11] Ekta Patel and Savita Chauhan, Raag Detection in Music Using Supervised Machine Learning Approach. URL: https://www.accentsjournals.org/PaperDirectory/Journal/IJATEE/2017/4/2.pdf
[12] Sarala Padi, Spencer Breiner, Eswaran Subrahmanian, Ram D. Sriram, Modeling and Analysis of Indian Carnatic Music Using Category Theory. URL: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7837758