Symphony of Data: Cloudera CDP Architecture for Musical Analysis

Mauro Benedetto Pazienza

Senior Training Specialist at PUE Data

Published: September 6, 2024

During a #Cloudera Iceberg training session for our Italian clients, numerous questions arose regarding the application potential of the CDP platform in a wide range of areas. This highlighted a growing interest in using Big Data solutions to tackle complex challenges and drive innovation across various fields. Here’s a "nerd" example:

Imagine being able to transform a Shostakovich symphony into a vast collection of data, ready to be analysed and interpreted. That’s what we aim to do today. We will use Cloudera Data Platform to ‘dissect’ a musical composition, extracting the fundamental elements that make it unique: rhythm, melody, harmony, and tempo. It’s an ambitious project that will allow us to explore the possibilities offered by Big Data by applying them to a field seemingly far removed from technology: music. Together, we will draft a kind of ‘digital score’, where each note becomes data, and each pause an opportunity to discover new patterns and relationships. I suppose there are various niche software applications designed specifically for this purpose, but none built on Big Data.

Developing an architecture on Cloudera Data Platform (CDP) to analyse symphonic works and extract fundamental elements such as rhythm, melody, harmony, and tempo requires a combination of data processing, distributed storage, and advanced analytics. Apps developed with music experts already exist, but they do not deal with Big Data or with potential classifications.

Cloudera Data Platform with Ozone and Iceberg.

Data Ingestion and Storage

Apache Ozone:

Use: Storing raw music recordings in formats such as WAV, MP3, FLAC.
Advantage: Ozone is a distributed object store designed to handle large volumes of unstructured data. It is highly scalable and, compared with HDFS, manages metadata for very large numbers of objects (including small files) more efficiently.

Apache Kafka:

Use: For real-time ingestion of audio data, useful when managing continuous streams of music or data from audio sensors.
Advantage: Kafka provides a robust and scalable data streaming solution for real-time processing.
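
As a rough illustration of the streaming case, the sketch below splits a recording's bytes into fixed-size, ordered messages of the kind a Kafka producer could send. The function name chunk_audio, the track id, and the 64 KiB chunk size are assumptions made for this example; the actual send call is left out because it depends on the client library and the cluster configuration.

```python
from typing import Iterator, Tuple


def chunk_audio(track_id: str, payload: bytes,
                chunk_size: int = 64 * 1024) -> Iterator[Tuple[bytes, int, bytes]]:
    """Yield (key, sequence_number, chunk) triples covering the payload in order.

    The key is the track id only (a hypothetical choice): with Kafka's default
    partitioner, messages sharing a key go to the same partition, so the
    chunks of one track keep their order.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    key = track_id.encode("utf-8")
    for seq, start in enumerate(range(0, len(payload), chunk_size)):
        yield key, seq, payload[start:start + chunk_size]


if __name__ == "__main__":
    fake_audio = bytes(150_000)  # stand-in for the bytes of a WAV file
    messages = list(chunk_audio("symphony-05-mvt1", fake_audio))
    print(len(messages), [len(chunk) for _, _, chunk in messages])
```

The sequence number would travel with each message (for example in a header) so that a consumer can detect gaps and reassemble the recording.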

Audio Signal Processing

Apache Spark:

Use: Processing audio files stored in Ozone and extracting features such as spectrograms, MFCCs, rhythm, tempo, etc.
Libraries: PySpark, alongside audio analysis libraries such as librosa.
Advantage: Spark is ideal for distributed and parallel processing, accelerating the analysis of large data volumes.
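
To make "tempo" concrete: once beat times are known, a tempo estimate is simply 60 divided by the typical gap between beats. The sketch below uses only the standard library and a hypothetical helper, estimate_bpm; in the pipeline described here the beat times would come from an audio library such as librosa rather than being typed in by hand.

```python
from statistics import median
from typing import Sequence


def estimate_bpm(beat_times: Sequence[float]) -> float:
    """Estimate tempo in beats per minute from beat timestamps in seconds."""
    if len(beat_times) < 2:
        raise ValueError("need at least two beat times")
    # Gaps between consecutive beats
    intervals = [b - a for a, b in zip(beat_times, beat_times[1:])]
    if any(gap <= 0 for gap in intervals):
        raise ValueError("beat times must be strictly increasing")
    # The median is less sensitive than the mean to one late or early beat
    return 60.0 / median(intervals)


beats = [0.0, 0.5, 1.0, 1.5, 2.0, 2.6]  # invented timestamps, last beat is late
print(estimate_bpm(beats))  # 120.0
```

Using the median means a single rubato or mis-detected beat does not shift the estimate, which matters for expressive orchestral recordings.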

Cloudera DataFlow NiFi (CDF):

Use: For real-time processing streams, where continuous streaming and audio analysis processes can be integrated.
Advantage: Provides an easy-to-use tool for managing complex data streams in real-time.

Results Storage and Metadata

Apache Iceberg:

Use: Storing audio analysis results in a structured format that allows efficient querying and data versioning.
Advantage: Iceberg offers improved partition management and is compatible with large datasets in distributed environments, facilitating SQL queries with Apache Spark and other tools.
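
As an illustration of what a "structured format" could look like here, the sketch below builds a Spark SQL statement that creates a partitioned Iceberg table. The table name, the column names, and the choice of composer as the partition column are all assumptions for this example, and the helper iceberg_create_table_sql is hypothetical; only the CREATE TABLE ... USING iceberg PARTITIONED BY (...) form comes from Iceberg's Spark DDL.

```python
from typing import Dict


def iceberg_create_table_sql(table: str, columns: Dict[str, str],
                             partition_by: str) -> str:
    """Build a Spark SQL CREATE TABLE statement for an Iceberg table."""
    if partition_by not in columns:
        raise ValueError(f"unknown partition column: {partition_by}")
    cols = ",\n  ".join(f"{name} {sql_type}" for name, sql_type in columns.items())
    return (
        f"CREATE TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"USING iceberg\n"
        f"PARTITIONED BY ({partition_by})"
    )


# Hypothetical layout for the extracted features
columns = {
    "file_path": "STRING",
    "composer": "STRING",
    "mfcc_features": "ARRAY<DOUBLE>",
}
print(iceberg_create_table_sql("warehouse.audio_features", columns, "composer"))
```

The resulting string could be passed to spark.sql(...) in a session with an Iceberg catalog configured; queries filtering on the partition column can then skip unrelated data files.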

Apache HBase (binary music and metadata):

Use: Storing metadata associated with music tracks and analysis results requiring real-time or low-latency access.
Advantage: HBase offers fast and scalable storage, optimised for low-latency access.

Advanced Analysis and Modelling

Cloudera Machine Learning (CML):

Use: Developing machine learning models for tasks such as genre classification, rhythm pattern identification, harmony prediction, etc.
Advantage: Provides a collaborative development environment with support for Python, R, and other machine learning tools.
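
To give a feel for the classification task, here is a deliberately tiny nearest-centroid classifier over feature vectors, written with the standard library only. It is a sketch, not the model one would train in CML: the labels and the three-dimensional vectors are invented, and real inputs would be the averaged MFCC vectors (13 values per track) produced earlier in the pipeline.

```python
import math
from typing import Dict, List, Sequence


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def train(samples: Dict[str, List[Sequence[float]]]) -> Dict[str, List[float]]:
    """Compute one centroid per label."""
    return {label: centroid(vectors) for label, vectors in samples.items()}


def predict(centroids: Dict[str, List[float]], vector: Sequence[float]) -> str:
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vector))


# Invented toy data: two labels, two training vectors each
training = {
    "baroque": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    "romantic": [[0.0, 1.0, 0.5], [0.2, 0.8, 0.7]],
}
model = train(training)
print(predict(model, [0.7, 0.1, 0.1]))  # baroque
print(predict(model, [0.0, 0.9, 0.6]))  # romantic
```

A model this simple is mainly useful as a baseline against which richer models can be compared.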

Data Visualisation and Exploration

Cloudera Data Warehouse (CDW):

Use: Creating interactive dashboards and visualisations to explore the results of music analysis, such as rhythm charts, chord diagrams, and spectrogram visualisations.
Tools: Cloudera Data Visualization integrated with Iceberg to create these visualisations.
Advantage: CDW combined with Iceberg allows scaling queries and visualisations across large data volumes.

Management and Orchestration

Apache NiFi:

Use: Orchestrating the data flow from ingestion, through processing, to storage and analysis.
Advantage: NiFi facilitates the integration and management of complex data flows.

Workflow Summary

Ingestion: Music recordings are ingested in real-time or batch and stored in Ozone.

Processing: Apache Spark processes audio files in Ozone to extract musical features.

Storage: Results and metadata are stored in Apache Iceberg for efficient access.

Advanced Analysis: Machine learning models are trained and deployed in CML to identify and predict musical patterns.

Visualisation: Results are visualised in interactive dashboards via CDW with data in Iceberg.

Orchestration and Management: NiFi orchestrates the workflow, while Cloudera Manager manages the infrastructure.

This architecture enables robust and scalable analysis of music data, optimising storage and querying with Ozone and Iceberg, while leveraging the processing and machine learning capabilities of CDP.

Let's try it.

Example of Code in Apache Spark

This PySpark code processes audio files stored in Apache Ozone, extracts Mel-Frequency Cepstral Coefficients (MFCCs) with librosa, and stores the results in Apache Iceberg.

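import io

import librosa
import numpy as np
from pyspark.sql import SparkSession

# Start (or reuse) the Spark session
spark = SparkSession.builder \
    .appName("AudioFeatureExtraction") \
    .getOrCreate()

def extract_mfcc(content):
    """Return the 13 MFCCs of one audio file, averaged over time."""
    # librosa.load accepts a file-like object, so decode from the raw bytes
    # (this works for formats the soundfile backend can read)
    y, sr = librosa.load(io.BytesIO(bytes(content)))
    mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    avg_mfccs = np.mean(mfccs.T, axis=0)
    return avg_mfccs.tolist()

# Read each audio file as one row (path, content, ...) rather than as lines of text.
# The URI is a placeholder; Ozone is normally addressed through an ofs:// or o3fs:// path.
audio_files_df = spark.read.format("binaryFile").load("ozone://bucket/audio-files/")

features_rdd = audio_files_df.select("path", "content").rdd.map(
    lambda row: (row.path, extract_mfcc(row.content)))

# Convert the RDD to a DataFrame
features_df = features_rdd.toDF(["file_path", "mfcc_features"])

# Append to an Iceberg table. The identifier is a placeholder and assumes the
# table already exists in a configured Iceberg catalog.
features_df.write \
    .format("iceberg") \
    .mode("append") \
    .save("warehouse.audio_features")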
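
One detail of the job above is worth spelling out: averaging the MFCC matrix over time turns a recording of any length into a fixed-size vector, at the cost of discarding how the sound evolves. The pure-Python sketch below mimics that step on two invented matrices (rows are coefficients, columns are frames); average_over_time is a hypothetical helper standing in for the NumPy call.

```python
from typing import List, Sequence


def average_over_time(matrix: Sequence[Sequence[float]]) -> List[float]:
    """Average each coefficient (row) across its frames (columns)."""
    return [sum(row) / len(row) for row in matrix]


# Invented values: 2 coefficients, with 2 frames and 4 frames respectively
short_clip = [[1.0, 3.0], [10.0, 20.0]]
long_clip = [[1.0, 2.0, 3.0, 2.0], [10.0, 20.0, 15.0, 15.0]]

print(average_over_time(short_clip))  # [2.0, 15.0]
print(average_over_time(long_clip))   # [2.0, 15.0]
```

Both clips collapse to the same two numbers even though their frame-by-frame content differs, which is why time-averaged features suit coarse tasks such as genre classification better than the analysis of rhythm or melody.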

Example of Code with librosa

This Python code uses librosa to load an audio file and extract its spectrogram and MFCCs.

import librosa
import librosa.display
import matplotlib.pyplot as plt

# Load the audio (librosa resamples to 22,050 Hz mono by default)
file_path = 'path_to_your_audio_file.wav'
y, sr = librosa.load(file_path)

# Short-time Fourier transform, converted to decibels
spectrogram = librosa.stft(y)
spectrogram_db = librosa.amplitude_to_db(abs(spectrogram))

# MFCCs
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Show the spectrogram
plt.figure(figsize=(12, 8))
librosa.display.specshow(spectrogram_db, sr=sr, x_axis='time', y_axis='log')
plt.colorbar(format='%+2.0f dB')
plt.title('Spectrogram')
plt.show()

# Show the MFCCs
plt.figure(figsize=(10, 6))
librosa.display.specshow(mfccs, sr=sr, x_axis='time')
plt.colorbar()
plt.title('MFCC')
plt.show()
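
To try the script without a recording at hand, a short test tone can be generated with the standard library alone. This is only a convenience sketch: the helper write_sine_wav, the file name, and the 440 Hz default are assumptions, and a pure sine wave will of course produce a far simpler spectrogram than an orchestra.

```python
import math
import struct
import wave


def write_sine_wav(path: str, freq_hz: float = 440.0, seconds: float = 1.0,
                   sample_rate: int = 22050) -> int:
    """Write a mono 16-bit PCM sine tone and return the number of frames written."""
    n_frames = round(seconds * sample_rate)
    amplitude = 0.5 * 32767  # half of full scale, to leave headroom
    frames = bytearray()
    for n in range(n_frames):
        sample = int(amplitude * math.sin(2 * math.pi * freq_hz * n / sample_rate))
        frames += struct.pack("<h", sample)  # little-endian signed 16-bit
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))
    return n_frames


if __name__ == "__main__":
    print(write_sine_wav("test_tone.wav"))  # 22050
```

Pointing file_path at the generated file should show a single horizontal band near 440 Hz in the spectrogram.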

Example of a Basic Flow in Apache NiFi

Below is a description of a basic flow in Apache NiFi to ingest audio files, process them, and store them in Apache Ozone.

Step-by-Step in NiFi:

Processor 1: GetFile
Use: Read audio files from a local directory.
Configuration:

Input Directory: /path/to/audio/files
File Filter: .*\.wav

Processor 2: ExecuteScript
Use: Process the audio file with a Python script using librosa.
Script:

# Note: in NiFi 1.x, ExecuteScript runs Python through Jython, which cannot
# import C-extension libraries such as librosa; read this as a sketch of the logic.
import json

import librosa
from org.apache.nifi.processor.io import StreamCallback

class PyStreamCallback(StreamCallback):
    def __init__(self):
        pass

    def process(self, inputStream, outputStream):
        # Load the audio and compute 13 MFCCs, averaged over time
        y, sr = librosa.load(inputStream)
        mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
        mfccs_avg = mfccs.mean(axis=1).tolist()
        result = {"mfccs": mfccs_avg}
        outputStream.write(bytearray(json.dumps(result, indent=4).encode('utf-8')))

# Replace the incoming FlowFile's content with the JSON result
flowFile = session.get()
if flowFile is not None:
    flowFile = session.write(flowFile, PyStreamCallback())
    session.transfer(flowFile, REL_SUCCESS)

Processor 3: PutOzoneObject
Use: Store the processing result in Apache Ozone.
Configuration:

Bucket: audio-features
Object Key: ${filename}.json

Connection between Processors:

GetFile → ExecuteScript → PutOzoneObject

This basic NiFi flow takes audio files, extracts features using librosa, and then stores the results in Ozone. You can extend this flow by adding more processing steps or sending the data to other destinations.
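
Two small details of this configuration are easy to check outside NiFi. The sketch below reproduces the File Filter pattern with Python's re module, assuming the filter must match the whole file name, and shows what the object key expression yields given that the filename attribute includes the extension. The helper names and sample file names are invented for the example.

```python
import re

# Same pattern as the GetFile "File Filter" property above
FILE_FILTER = re.compile(r".*\.wav")


def is_picked_up(filename: str) -> bool:
    """True if the whole file name matches the filter."""
    return FILE_FILTER.fullmatch(filename) is not None


def object_key(filename: str) -> str:
    """Mimic the expression ${filename}.json from the flow above."""
    return f"{filename}.json"


for name in ["symphony5.wav", "symphony5.WAV", "symphony5.wav.bak", "notes.txt"]:
    print(name, is_picked_up(name))

print(object_key("symphony5.wav"))  # symphony5.wav.json
```

Only the first name passes: the pattern is case-sensitive and anchored to the end of the name. The resulting key keeps the original extension, so the stored object is named symphony5.wav.json rather than symphony5.json.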

Essentially, our goal is to go beyond traditional musical analysis, pushing towards an innovative approach that harnesses the power of Big Data. By utilising Cloudera Data Platform, we aim not only to gain a deeper understanding of Shostakovich's symphonies but also to pave new avenues for large-scale musical interpretation and classification. This project will enable us to capture the essence of music at an unprecedented level of detail, revealing hidden patterns and relationships that might elude traditional human analysis. Through this journey, we are laying the groundwork for a new paradigm in musical analysis, where art meets data science, and where each symphony can be explored and understood in entirely new ways.
