Symphony of Data: Cloudera CDP Architecture for Musical Analysis
During a #Cloudera Iceberg training session for our Italian clients, numerous questions arose regarding the application potential of the CDP platform in a wide range of areas. This highlighted a growing interest in using Big Data solutions to tackle complex challenges and drive innovation across various fields. Here’s a "nerd" example:
Imagine being able to transform a Shostakovich symphony into a vast collection of data, ready to be analysed and interpreted. That’s what we aim to do today. We will use Cloudera Data Platform to ‘dissect’ a musical composition, extracting the fundamental elements that make it unique: rhythm, melody, harmony, and tempo. It’s an ambitious project that will allow us to explore the limitless possibilities offered by Big Data, applying them to a field seemingly far removed from technology: music. Together, we will draft a kind of ‘digital score’, where each note becomes data and each pause an opportunity to discover new patterns and relationships. There are, I suppose, niche software applications designed specifically for this purpose, but none that treat it as a Big Data problem.
Developing an architecture on Cloudera Data Platform (CDP) to analyse symphonic works and extract fundamental elements such as rhythm, melody, harmony, and tempo requires a combination of data processing, distributed storage, and advanced analytics. While applications built with music experts already exist, they do not operate at Big Data scale or support classification across large catalogues of recordings.
Cloudera Data Platform with Ozone and Iceberg.
Data Ingestion and Storage: Apache Ozone and Apache Kafka (a minimal Kafka ingestion sketch follows this list).
Audio Signal Processing: Apache Spark and Cloudera DataFlow (CDF) with NiFi.
Results Storage and Metadata: Apache Iceberg, with binary audio and metadata stored in HBase.
Advanced Analysis and Modelling: Cloudera Machine Learning (CML).
Data Visualisation and Exploration: Cloudera Data Warehouse (CDW).
Management and Orchestration: Apache NiFi.
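As a hedged illustration of the Kafka leg of the ingestion layer, the sketch below publishes a reference to each newly landed recording so downstream consumers (Spark, NiFi) can pick it up. It assumes the confluent-kafka client; the broker address, topic name, and Ozone URI are placeholders, not part of the architecture above.

# Minimal ingestion sketch: publish a reference to each new recording on a
# Kafka topic. Assumes the confluent-kafka package; broker address, topic
# name, and Ozone URI are illustrative placeholders.
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "kafka-broker:9092"})

def announce_recording(ozone_uri, composer, work):
    # Downstream consumers resolve the URI and fetch the audio from Ozone
    event = {"uri": ozone_uri, "composer": composer, "work": work}
    producer.produce("audio-ingest", value=json.dumps(event).encode("utf-8"))

announce_recording(
    "ofs://ozone/volume/bucket/audio-files/shostakovich_sym5_mvt1.wav",
    "Shostakovich",
    "Symphony No. 5",
)
producer.flush()  # block until all queued messages are delivered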
Workflow Summary
Ingestion: Music recordings are ingested in real-time or batch and stored in Ozone.
Processing: Apache Spark processes audio files in Ozone to extract musical features.
Storage: Results and metadata are stored in Apache Iceberg for efficient access.
Advanced Analysis: Machine learning models are trained and deployed in CML to identify and predict musical patterns.
Visualisation: Results are visualised in interactive dashboards via CDW with data in Iceberg.
Orchestration and Management: NiFi orchestrates the workflow, while Cloudera Manager manages the infrastructure.
This architecture enables robust and scalable analysis of music data, optimising storage and querying with Ozone and Iceberg, while leveraging the processing and machine learning capabilities of CDP.
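Concretely, the extracted features land in an Iceberg table, which must exist before the first job runs. A minimal sketch of the DDL, issued once through Spark SQL and assuming a SparkSession with an Iceberg catalog already configured (the database and table names in db.audio_features are illustrative):

# Create the Iceberg table that will hold the extracted features.
# Assumes a SparkSession ('spark') with an Iceberg catalog configured;
# the database and table names are placeholders.
spark.sql("""
    CREATE TABLE IF NOT EXISTS db.audio_features (
        file_path     STRING,
        mfcc_features ARRAY<DOUBLE>
    )
    USING iceberg
""")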
Let's try.
Example of Code in Apache Spark
This Spark code in PySpark processes audio files stored in Apache Ozone, extracts features such as Mel-Frequency Cepstral Coefficients (MFCC) using librosa, and stores the results in Apache Iceberg.
from pyspark.sql import SparkSession
import io
import librosa
import numpy as np
# Create the Spark session
spark = SparkSession.builder \
    .appName("AudioFeatureExtraction") \
    .getOrCreate()
# Extract 13 Mel-Frequency Cepstral Coefficients (MFCCs) from raw audio
# bytes, averaged over time so each file yields one fixed-length vector.
# librosa must be installed on every executor for this to run.
def extract_mfcc(content):
    y, sr = librosa.load(io.BytesIO(bytes(content)))
    mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    avg_mfccs = np.mean(mfccs.T, axis=0)
    return avg_mfccs.tolist()
# Read the audio files from Ozone as binary records; 'ofs://' is Ozone's
# Hadoop-compatible scheme, and the volume/bucket names are placeholders
audio_files_df = spark.read.format("binaryFile") \
    .load("ofs://ozone/volume/bucket/audio-files/*.wav")
# Compute an MFCC vector for each file (binaryFile rows carry 'path' and 'content')
features_rdd = audio_files_df.rdd.map(
    lambda row: (row.path, extract_mfcc(row.content)))
# Convert the RDD back into a DataFrame
features_df = features_rdd.toDF(["file_path", "mfcc_features"])
# Append the results to the Apache Iceberg table
features_df.write \
    .format("iceberg") \
    .mode("append") \
    .save("db.audio_features")
Example of Code with librosa
This Python code uses librosa to load an audio file and extract its spectrogram and MFCCs.
import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt
# Load the audio file (replace the path with a real recording)
file_path = 'path_to_your_audio_file.wav'
y, sr = librosa.load(file_path)
# Compute the short-time Fourier transform and convert amplitude to decibels
spectrogram = librosa.stft(y)
spectrogram_db = librosa.amplitude_to_db(np.abs(spectrogram))
# Compute 13 MFCCs
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
# Plot the spectrogram
plt.figure(figsize=(12, 8))
librosa.display.specshow(spectrogram_db, sr=sr, x_axis='time', y_axis='log')
plt.colorbar(format='%+2.0f dB')
plt.title('Spectrogram')
plt.show()
# Plot the MFCCs
plt.figure(figsize=(10, 6))
librosa.display.specshow(mfccs, sr=sr, x_axis='time')
plt.colorbar()
plt.title('MFCC')
plt.show()
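MFCCs mainly capture timbre; for the rhythm, tempo, and harmony dimensions named earlier, librosa provides dedicated estimators. A brief sketch along the same lines, reusing y and sr from the snippet above (these are standard librosa calls, used here with default parameters rather than tuned choices):

# Estimate tempo and beat positions (rhythm and tempo)
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
beat_times = librosa.frames_to_time(beat_frames, sr=sr)
print("Estimated tempo: %.1f BPM, %d beats detected" % (float(tempo), len(beat_times)))
# Chromagram: energy per pitch class over time, a starting point for harmony
chroma = librosa.feature.chroma_stft(y=y, sr=sr)
plt.figure(figsize=(10, 4))
librosa.display.specshow(chroma, sr=sr, x_axis='time', y_axis='chroma')
plt.colorbar()
plt.title('Chromagram')
plt.show()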
Example of a Basic Flow in Apache NiFi
Below is a description of a basic flow in Apache NiFi to ingest audio files, process them, and store them in Apache Ozone.
Step-by-Step in NiFi:
Processor 1: GetFile
Use: Read audio files from a local directory.
Configuration: set Input Directory to the folder containing the recordings; enable Keep Source File if the originals should be retained.
Processor 2: ExecuteScript
Use: Process the audio file with a Python script using librosa.
Script:
import json
import io
import librosa
from org.apache.nifi.processor.io import StreamCallback
from org.apache.commons.io import IOUtils

# Note: librosa depends on NumPy, which NiFi's bundled Jython engine cannot
# import; in practice this step would run through ExecuteStreamCommand or the
# native Python processors in NiFi 2.x. The StreamCallback structure is kept
# here for illustration.
class PyStreamCallback(StreamCallback):
    def __init__(self):
        pass

    def process(self, inputStream, outputStream):
        # Buffer the FlowFile content so librosa can decode it
        audio_bytes = IOUtils.toByteArray(inputStream)
        y, sr = librosa.load(io.BytesIO(bytes(audio_bytes)))
        # 13 MFCCs, averaged across frames
        mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
        mfccs_avg = mfccs.mean(axis=1).tolist()
        result = {"mfccs": mfccs_avg}
        outputStream.write(bytearray(json.dumps(result, indent=4).encode('utf-8')))

flowFile = session.get()
if flowFile is not None:
    flowFile = session.write(flowFile, PyStreamCallback())
    session.transfer(flowFile, REL_SUCCESS)
Processor 3: PutOzoneObject
Use: Store the processing result in Apache Ozone.
Configuration: point the processor at the target volume and bucket (Ozone can also be written through PutHDFS using the ofs:// scheme).
Connection between processors: GetFile → ExecuteScript → PutOzoneObject, routing each processor's success relationship to the next (and failure to a retry or logging queue).
This basic NiFi flow takes audio files, extracts features using librosa, and then stores the results in Ozone. You can extend this flow by adding more processing steps or sending the data to other destinations.
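To give a feel for the CML modelling step mentioned in the workflow, here is a hedged sketch of one possible model: clustering the averaged MFCC vectors with scikit-learn so that recordings with similar timbre group together. The pandas DataFrame df and its column names follow the earlier examples and are assumptions:

# Cluster the averaged MFCC vectors to group recordings by timbral
# similarity. Assumes the features were exported from the Iceberg table
# into a pandas DataFrame 'df' with 'file_path' and 'mfcc_features'
# (13-element lists) columns; both names are assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X = np.vstack(df["mfcc_features"].to_numpy())   # shape: (n_recordings, 13)
X_scaled = StandardScaler().fit_transform(X)    # normalise each coefficient
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
df["cluster"] = kmeans.fit_predict(X_scaled)
# Inspect which recordings ended up together
print(df.groupby("cluster")["file_path"].apply(list))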
Essentially, our goal is to go beyond traditional musical analysis, pushing towards an innovative approach that harnesses the power of Big Data. By utilising Cloudera Data Platform, we aim not only to gain a deeper understanding of Shostakovich's symphonies but also to open new avenues for large-scale musical interpretation and classification. This project will enable us to capture the essence of music at an unprecedented level of detail, revealing hidden patterns and relationships that might elude traditional human analysis. Through this journey, we are laying the groundwork for a new paradigm in musical analysis, where art meets data science, and where each symphony can be explored and understood in entirely new ways.