Symphony of Data: Cloudera CDP Architecture for Musical Analysis

During a #Cloudera Iceberg training session for our Italian clients, numerous questions arose regarding the application potential of the CDP platform in a wide range of areas. This highlighted a growing interest in using Big Data solutions to tackle complex challenges and drive innovation across various fields. Here’s a "nerd" example:

Imagine being able to transform a Shostakovich symphony into a vast collection of data, ready to be analysed and interpreted. That’s what we aim to do today. We will use Cloudera Data Platform to ‘dissect’ a musical composition, extracting the fundamental elements that make it unique: rhythm, melody, harmony, and tempo. It’s an ambitious project that will allow us to explore the possibilities offered by Big Data, applying them to a field seemingly far removed from technology: music. Together, we will draft a kind of ‘digital score’, where each note becomes data, and each pause an opportunity to discover new patterns and relationships. I assume there are various niche software applications designed specifically for this purpose, but none of them built on Big Data.

Developing an architecture on Cloudera Data Platform (CDP) to analyse symphonic musical works and extract fundamental elements such as rhythm, melody, harmony, and tempo requires a combination of data processing, distributed storage, and advanced analytics. Apps developed with music experts already exist, but they do not deal with Big Data or with large-scale classification.

Cloudera Data Platform with Ozone and Iceberg.

Data Ingestion and Storage

Apache Ozone:

  • Use: Storing raw music recordings in formats such as WAV, MP3, FLAC.
  • Advantage: Ozone is a distributed object store designed to handle large volumes of unstructured data. It separates namespace management from block management, so it scales to far more objects (small files in particular) than HDFS.

Apache Kafka:

  • Use: For real-time ingestion of audio data, useful when managing continuous streams of music or data from audio sensors.
  • Advantage: Kafka provides a robust and scalable data streaming solution for real-time processing.
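
As a rough illustration of the streaming-ingestion idea, the sketch below splits a raw audio byte stream into fixed-size, keyed messages of the kind one might publish to a Kafka topic. It deliberately stops short of calling a Kafka client. The function name chunk_audio, the 64 KiB chunk size, the message layout and the track id are all assumptions made for illustration.

```python
from typing import Dict, Iterator, Union

Message = Dict[str, Union[bytes, int]]


def chunk_audio(track_id: str, audio: bytes, chunk_size: int = 64 * 1024) -> Iterator[Message]:
    """Yield one message per chunk of a track (hypothetical layout).

    Every chunk uses the track id as its key, so a key-hashing partitioner
    would send all chunks of a track to the same partition and keep them in
    order. The sequence number lets a consumer detect gaps.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    key = track_id.encode("utf-8")
    for seq, start in enumerate(range(0, len(audio), chunk_size)):
        yield {"key": key, "seq": seq, "value": audio[start:start + chunk_size]}


# Example with fake audio bytes and a made-up track id
messages = list(chunk_audio("symphony-05-mvt1", b"\x00" * 150_000))
print(len(messages), [len(m["value"]) for m in messages])
```

Each dictionary would map onto the key and value of a producer record. Whether the sequence number travels in the payload or in a header is a design choice left open here.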

Audio Signal Processing

Apache Spark:

  • Use: Processing audio files stored in Ozone and extracting features such as spectrograms, MFCCs, rhythm, tempo, etc.
  • Libraries: Utilises PySpark alongside libraries like librosa for audio analysis.
  • Advantage: Spark is ideal for distributed and parallel processing, accelerating the analysis of large data volumes.
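
To make "tempo" a little more concrete, here is a toy estimate based on the gaps between note onsets. It is a pure-Python sketch that assumes the onset times (in seconds) have already been detected by some other step. The helper name estimate_bpm and the sample onsets are invented for illustration, and this is not how librosa estimates tempo internally.

```python
from statistics import median
from typing import Sequence


def estimate_bpm(onset_times: Sequence[float]) -> float:
    """Estimate beats per minute from the median inter-onset interval."""
    if len(onset_times) < 2:
        raise ValueError("need at least two onsets")
    intervals = [later - earlier for earlier, later in zip(onset_times, onset_times[1:])]
    if any(interval <= 0 for interval in intervals):
        raise ValueError("onset times must be strictly increasing")
    # The median is less sensitive than the mean to a few rushed or late notes
    return 60.0 / median(intervals)


# Made-up onsets: mostly half a second apart, with one slightly late note
onsets = [0.0, 0.5, 1.0, 1.5, 2.1, 2.5]
print(estimate_bpm(onsets))
```

In a distributed job, a function like this would run once per track after onset detection, which is the kind of independent per-file work that parallelises well.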

Cloudera DataFlow NiFi (CDF):

  • Use: For real-time processing streams, where continuous streaming and audio analysis processes can be integrated.
  • Advantage: Provides an easy-to-use tool for managing complex data streams in real-time.

Results Storage and Metadata

Apache Iceberg:

  • Use: Storing audio analysis results in a structured format that allows efficient querying and data versioning.
  • Advantage: Iceberg offers improved partition management and is compatible with large datasets in distributed environments, facilitating SQL queries with Apache Spark and other tools.
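
Iceberg expresses partitioning as transforms applied to columns, one of which maps a timestamp to a day number counted from 1970-01-01. The sketch below imitates that idea in plain Python for a hypothetical result row. The AudioFeatureRow fields and the day_partition helper are assumptions for illustration, not an Iceberg API.

```python
from dataclasses import dataclass
from datetime import date, datetime, timezone
from typing import List


@dataclass(frozen=True)
class AudioFeatureRow:
    """Hypothetical shape of one analysis result."""
    file_path: str
    mfcc_features: List[float]
    analysed_at: datetime  # expected to be timezone-aware


def day_partition(ts: datetime) -> int:
    """Days between 1970-01-01 and the UTC date of ts."""
    if ts.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    return (ts.astimezone(timezone.utc).date() - date(1970, 1, 1)).days


row = AudioFeatureRow(
    file_path="audio-files/example.wav",  # made-up path
    mfcc_features=[0.0] * 13,
    analysed_at=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
)
print(day_partition(row.analysed_at))
```

Rows that share a day value would end up in the same partition, so a query filtered on the analysis date can skip the files for every other day.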

Apache HBase:

  • Use: Storing binary music data and the metadata associated with music tracks and analysis results requiring real-time or low-latency access.
  • Advantage: HBase offers fast and scalable storage, optimised for low-latency access.

Advanced Analysis and Modelling

Cloudera Machine Learning (CML):

  • Use: Developing machine learning models for tasks such as genre classification, rhythm pattern identification, harmony prediction, etc.
  • Advantage: Provides a collaborative development environment with support for Python, R, and other machine learning tools.
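
For a feel of what a very simple classification model over extracted features could look like, the sketch below implements a nearest-centroid classifier in pure Python. It is a minimal toy under stated assumptions: the feature vectors are short made-up lists standing in for averaged MFCCs, the labels are invented, and nothing here is specific to CML, which only provides the environment in which such a model would be developed.

```python
from math import dist
from typing import Dict, List, Sequence


def fit_centroids(samples: Dict[str, List[Sequence[float]]]) -> Dict[str, List[float]]:
    """Compute the mean feature vector for each label."""
    centroids = {}
    for label, vectors in samples.items():
        if not vectors:
            raise ValueError(f"no samples for label {label!r}")
        centroids[label] = [sum(column) / len(vectors) for column in zip(*vectors)]
    return centroids


def predict(centroids: Dict[str, List[float]], vector: Sequence[float]) -> str:
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: dist(centroids[label], vector))


# Made-up two-dimensional "features" for two invented labels
training = {
    "symphonic": [[1.0, 1.0], [3.0, 3.0]],
    "chamber": [[10.0, 10.0], [12.0, 12.0]],
}
model = fit_centroids(training)
print(predict(model, [4.0, 4.0]))
```

A real model would use the full feature vectors and a proper train/test split. The point is only that per-track vectors are what make tracks comparable.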

Data Visualisation and Exploration

Cloudera Data Warehouse (CDW):

  • Use: Creating interactive dashboards and visualisations to explore the results of music analysis, such as rhythm charts, chord diagrams, and spectrogram visualisations.
  • Tools: Cloudera Data Visualization integrated with Iceberg to create these visualisations.
  • Advantage: CDW combined with Iceberg allows scaling queries and visualisations across large data volumes.

Management and Orchestration

Apache NiFi:

  • Use: Orchestrating the data flow from ingestion, through processing, to storage and analysis.
  • Advantage: NiFi facilitates the integration and management of complex data flows.

Workflow Summary

Ingestion: Music recordings are ingested in real-time or batch and stored in Ozone.

Processing: Apache Spark processes audio files in Ozone to extract musical features.

Storage: Results and metadata are stored in Apache Iceberg for efficient access.

Advanced Analysis: Machine learning models are trained and deployed in CML to identify and predict musical patterns.

Visualisation: Results are visualised in interactive dashboards via CDW with data in Iceberg.

Orchestration and Management: NiFi orchestrates the workflow, while Cloudera Manager manages the infrastructure.

This architecture enables robust and scalable analysis of music data, optimising storage and querying with Ozone and Iceberg, while leveraging the processing and machine learning capabilities of CDP.

Let's try.

Example of Code in Apache Spark

This Spark code in PySpark processes audio files stored in Apache Ozone, extracts features such as Mel-Frequency Cepstral Coefficients (MFCC) using librosa, and stores the results in Apache Iceberg.

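import io

import librosa
import numpy as np
from pyspark.sql import SparkSession

# Create (or reuse) the Spark session
spark = SparkSession.builder \
    .appName("AudioFeatureExtraction") \
    .getOrCreate()

def extract_mfcc(audio_bytes):
    # librosa.load accepts a file-like object, so wrap the raw bytes
    y, sr = librosa.load(io.BytesIO(audio_bytes))
    mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    # Average each coefficient over time: one 13-value vector per file
    avg_mfccs = np.mean(mfccs.T, axis=0)
    return avg_mfccs.tolist()

# DataFrame with one row per audio file (binaryFile exposes path and content).
# The URI is a placeholder: Ozone paths normally use the ofs:// or o3fs:// scheme.
audio_files_df = spark.read.format("binaryFile") \
    .load("ozone://bucket/audio-files/") \
    .select("path", "content")

# Extract the features on the executors
features_rdd = audio_files_df.rdd.map(lambda row: (row.path, extract_mfcc(row.content)))

# RDD to DataFrame
features_df = features_rdd.toDF(["file_path", "mfcc_features"])

# Save to Apache Iceberg. Tables are addressed by catalog identifier,
# so "warehouse.audio_features" stands in for your own database and table.
features_df.write \
    .format("iceberg") \
    .mode("append") \
    .save("warehouse.audio_features")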
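
The averaging step inside extract_mfcc is worth a second look. The MFCC matrix has one row per coefficient and one column per frame, and taking the mean over frames collapses a whole track into a single fixed-length vector. Here is a NumPy-free sketch of that reduction, using made-up numbers and a hypothetical helper name:

```python
from typing import List, Sequence


def average_over_time(coefficients: Sequence[Sequence[float]]) -> List[float]:
    """Mean of each row, where rows are coefficients and columns are frames."""
    if not coefficients or any(len(row) == 0 for row in coefficients):
        raise ValueError("expected a non-empty matrix with non-empty rows")
    return [sum(row) / len(row) for row in coefficients]


# Two coefficients observed over three frames (invented values)
toy_mfccs = [[1.0, 2.0, 3.0], [4.0, 4.0, 4.0]]
track_vector = average_over_time(toy_mfccs)
print(track_vector)
```

The trade-off is that the temporal evolution of the track is lost. That is acceptable for coarse comparisons between tracks but not for anything that depends on how the sound changes over time.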

Example of Code with librosa

This Python code uses librosa to load an audio file and extract its spectrogram and MFCCs.

import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Load the audio file (by default librosa resamples to 22,050 Hz mono)
file_path = 'path_to_your_audio_file.wav'
y, sr = librosa.load(file_path)

# Extract the spectrogram (converted to dB) and the MFCCs
spectrogram = librosa.stft(y)
spectrogram_db = librosa.amplitude_to_db(np.abs(spectrogram))
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Show the spectrogram
plt.figure(figsize=(12, 8))
librosa.display.specshow(spectrogram_db, sr=sr, x_axis='time', y_axis='log')
plt.colorbar(format='%+2.0f dB')
plt.title('Spectrogram')

# Show the MFCCs
plt.figure(figsize=(10, 6))
librosa.display.specshow(mfccs, sr=sr, x_axis='time')
plt.colorbar()
plt.title('MFCC')
plt.show()
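
The amplitude_to_db call is what turns raw magnitudes into the decibel scale shown on the colour bar. The sketch below reproduces the documented idea (20·log10 of the magnitude relative to a reference, with a floor below the peak) in plain Python. The function name to_db is hypothetical, the default values mirror librosa's documented defaults as I understand them, and this is an approximation for intuition rather than librosa's implementation.

```python
import math
from typing import List, Sequence


def to_db(magnitudes: Sequence[float], ref: float = 1.0,
          amin: float = 1e-5, top_db: float = 80.0) -> List[float]:
    """Convert non-negative magnitudes to decibels relative to ref."""
    if not magnitudes:
        return []
    ref_db = 20.0 * math.log10(max(amin, ref))
    # amin avoids log10(0) for silent bins
    db = [20.0 * math.log10(max(amin, m)) - ref_db for m in magnitudes]
    # Clip everything more than top_db below the loudest value
    floor = max(db) - top_db
    return [max(value, floor) for value in db]


# A full-scale bin, a bin at one tenth of the amplitude, and a silent bin
print(to_db([1.0, 0.1, 0.0]))
```

A tenfold drop in amplitude corresponds to a drop of about 20 dB, and the clipping keeps silent bins from stretching the colour scale.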



Example of a Basic Flow in Apache NiFi

Below is a description of a basic flow in Apache NiFi to ingest audio files, process them, and store them in Apache Ozone.

Step-by-Step in NiFi:

Processor 1: GetFile

Use: Read audio files from a local directory.

Configuration:

  • Directory: /path/to/audio/files
  • File Filter: .*\.wav

Processor 2: ExecuteScript

Use: Process the audio file with a Python script using librosa.

Script:

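# Note: ExecuteScript runs Python scripts on Jython, which cannot import
# CPython packages such as librosa. Treat this as a sketch of the logic;
# running it for real needs a CPython process, e.g. via ExecuteStreamCommand.
import json

import librosa
from org.apache.nifi.processor.io import StreamCallback

class PyStreamCallback(StreamCallback):
    def __init__(self):
        pass

    def process(self, inputStream, outputStream):
        # Load the audio and compute 13 MFCCs
        y, sr = librosa.load(inputStream)
        mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
        # Average each coefficient over time
        mfccs_avg = mfccs.mean(axis=1).tolist()
        result = {"mfccs": mfccs_avg}
        # Replace the FlowFile content with the JSON result
        outputStream.write(bytearray(json.dumps(result, indent=4).encode('utf-8')))

flowFile = session.get()
if flowFile is not None:
    flowFile = session.write(flowFile, PyStreamCallback())
    session.transfer(flowFile, REL_SUCCESS)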

Processor 3: PutOzoneObject

Use: Store the processing result in Apache Ozone.

Configuration:

  • Bucket: audio-features
  • Object Key: ${filename}.json

Connection between Processors:

  • GetFile → ExecuteScript → PutOzoneObject

This basic NiFi flow takes audio files, extracts features using librosa, and then stores the results in Ozone. You can extend this flow by adding more processing steps or sending the data to other destinations.
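
To check what the File Filter and Object Key settings in this flow would select and produce, the sketch below replays them in plain Python. It assumes that the filter must match the whole file name and that ${filename} expands to the full name including its extension. The function plan_uploads and the sample file names are hypothetical.

```python
import re
from typing import Iterable, List, Tuple

# Same pattern as the File Filter setting in the flow
FILE_FILTER = re.compile(r".*\.wav")


def plan_uploads(filenames: Iterable[str], bucket: str = "audio-features") -> List[Tuple[str, str]]:
    """Return (bucket, object key) pairs for the files the filter accepts."""
    return [
        (bucket, f"{name}.json")  # mirrors the ${filename}.json object key
        for name in filenames
        if FILE_FILTER.fullmatch(name)
    ]


# Made-up directory listing
listing = ["allegro.wav", "largo.mp3", "scherzo.WAV", "finale.wav.bak"]
print(plan_uploads(listing))
```

Two details stand out under these assumptions: the pattern is case-sensitive, so upper-case extensions are skipped, and the resulting keys keep the original extension (allegro.wav.json), which may or may not be the naming you want.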

Essentially, our goal is to go beyond traditional musical analysis, pushing towards an innovative approach that harnesses the power of Big Data. By utilising Cloudera Data Platform, we aim not only to gain a deeper understanding of Shostakovich's symphonies but also to pave new avenues for large-scale musical interpretation and classification. This project will enable us to capture the essence of music at an unprecedented level of detail, revealing hidden patterns and relationships that might elude traditional human analysis. Through this journey, we are laying the groundwork for a new paradigm in musical analysis, where art meets data science, and where each symphony can be explored and understood in entirely new ways.



