University of Pisa: A New Paradigm in AI Data Management
As both a professor and CIO at the University of Pisa, my work constantly bridges the worlds of research, production, and innovation. I want to share some key insights on how artificial intelligence is reshaping data management—particularly in the transition from traditional systems to more dynamic AI-ready architectures.
RAG Systems: The Future of Data Management
When I began working on projects like Oraculum and CPLR, the concept of Retrieval-Augmented Generation (RAG) immediately stood out as revolutionary. The beauty of RAG systems lies in their ability to inject real-time information into AI prompts, reducing the risks of hallucination (when AI generates incorrect or misleading content).
The logic is simple: By querying an AI-enhanced database—what I like to call a "smart DB"—we get precisely the data needed, indexed not by traditional methods but by neural embeddings. These embeddings let the system order content based on meaning rather than rigid data structures.
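To make "ordering by meaning" concrete, here is a minimal sketch of similarity ranking over embeddings. It is not Oraculum's implementation: the three-dimensional vectors are hand-made stand-ins for real model embeddings, and the names `rank_by_meaning` and `STORE` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_meaning(query_vec, store, top_k=2):
    """Return the top_k (chunk_id, score) pairs, most similar first."""
    scored = [
        (chunk_id, cosine_similarity(query_vec, vec))
        for chunk_id, vec in store.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Assumption: hand-made toy vectors, not the output of a real embedding model.
STORE = {
    "enrollment-deadlines": [0.9, 0.1, 0.0],
    "exam-calendar": [0.7, 0.3, 0.1],
    "cafeteria-menu": [0.0, 0.2, 0.9],
}

# A query vector that points in roughly the same direction as the first two.
results = rank_by_meaning([1.0, 0.2, 0.0], STORE)
for chunk_id, score in results:
    print(f"{chunk_id}: {score:.3f}")
```

The ranking depends only on the direction of the vectors, not on how the underlying records are laid out, which is the property the "smart DB" relies on.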
In RAG systems, the data being retrieved isn’t just "files" or "database entries"; it’s something in between. Files can be too bulky, and traditional database rows too rigid. We need a semi-structured data format that’s AI-friendly, self-contained, and easily interpretable by LLMs.
The Challenge of Structure in AI-Driven Systems
Here’s where things get interesting: Most AI systems treat everything as text—whether it’s numbers, paragraphs, or structured datasets. While text-based prompts are powerful, they come with a challenge. If you embed too much information in a single prompt, the model can get overwhelmed, resulting in less accurate answers. Similarly, fragmenting information into smaller pieces creates issues because text is inherently designed to be read sequentially, with cross-references throughout.
This is why I believe the future lies in a new type of information chunk. It’s neither a bulky file nor a fully structured database entry, but a hybrid—self-contained, small enough to fit into prompts, and formatted with elements like XML or Markdown to aid AI interpretation. LLMs are surprisingly good at reading and understanding such structured text.
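One possible shape for such a chunk is sketched below, assuming a small record with an id, a title, a source, and a body, rendered as XML-tagged text and packed into a prompt under a character budget. The `Chunk` fields, the tag names, and `build_context` are illustrative assumptions, not a format defined by Oraculum, and the sample text is invented.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape, quoteattr


@dataclass
class Chunk:
    """A small, self-contained unit of information (hypothetical schema)."""
    chunk_id: str
    title: str
    source: str
    body: str

    def to_prompt_text(self) -> str:
        """Render the chunk as lightly structured text an LLM can read."""
        return (
            f"<chunk id={quoteattr(self.chunk_id)} source={quoteattr(self.source)}>\n"
            f"  <title>{escape(self.title)}</title>\n"
            f"  <body>{escape(self.body)}</body>\n"
            f"</chunk>"
        )


def build_context(chunks, max_chars=500):
    """Concatenate rendered chunks in order until the budget would be exceeded.

    The budget counts rendered chunk text only, not the joining newlines.
    """
    parts = []
    used = 0
    for chunk in chunks:
        text = chunk.to_prompt_text()
        if used + len(text) > max_chars:
            break
        parts.append(text)
        used += len(text)
    return "\n".join(parts)


# Invented sample content, for illustration only.
chunks = [
    Chunk("c1", "Enrollment deadlines", "registrar-handbook",
          "Applications close on the date set by the registrar & cannot be extended."),
    Chunk("c2", "Exam calendar", "registrar-handbook",
          "Exam sessions are published each term."),
]

context = build_context(chunks, max_chars=500)
print(context)
```

Because each chunk carries its own title and source, it stays interpretable when it is pulled out of its original document and placed next to unrelated chunks in a prompt.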
As I see it, this new data type will bridge the gap between current file systems and NoSQL databases. It provides just enough structure to make it manageable while still allowing AI to use it effectively. This shift in data management isn’t theoretical—it’s already happening.
Practical Implications: A Real-World Example
In our development of Oraculum, an open-source AI framework, we’ve seen these concepts come to life. Oraculum is actively used in production where it manages complex data environments.
One key takeaway from my experience is that enterprises must focus on smart data management, not just expensive GPUs. AI models can run efficiently on cloud platforms, but data management on-premises ensures data security and faster operations. In fact, for many organizations, the cost of running GPUs locally outweighs the benefits—making data management the real differentiator.
The Road Ahead: AI-Optimized Data Management
Looking forward, I see a growing trend toward hybrid AI solutions—where enterprises use cloud-based models while keeping RAG systems and critical data on-premises for security and efficiency. This approach ensures that sensitive information stays within the organization, while external AI tools are used selectively through APIs.
For example, enterprises can leverage powerful AI platforms like OpenAI, but they should control the embeddings process locally to maintain data privacy. This hybrid model is especially critical in regions with strict data regulations, like Europe, where compliance with GDPR is non-negotiable.
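The split can be sketched as follows, under stated assumptions: indexing and retrieval run locally, and only the assembled prompt (the question plus the selected chunk) is handed to an external model. `local_embed` is a toy word-count stand-in for a locally hosted embedding model, and `remote_generate` is a hypothetical callable representing a cloud API; no real vendor API is shown.

```python
import math
from collections import Counter


def local_embed(text):
    """Toy on-premises 'embedding': a sparse word-count vector.

    Assumption: stands in for a real embedding model hosted locally.
    """
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class LocalIndex:
    """Keeps documents and their vectors inside the organization."""

    def __init__(self):
        self._items = []  # list of (text, vector) pairs

    def add(self, text):
        self._items.append((text, local_embed(text)))

    def search(self, query, top_k=1):
        query_vec = local_embed(query)
        ranked = sorted(
            self._items,
            key=lambda item: cosine(query_vec, item[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:top_k]]


def answer_with_remote_model(question, index, remote_generate, top_k=1):
    """Retrieve locally, then send only the assembled prompt to the remote model."""
    context = "\n".join(index.search(question, top_k))
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return remote_generate(prompt)


# Invented sample documents, for illustration only.
index = LocalIndex()
index.add("Thesis submission rules for doctoral students")
index.add("Parking permits for staff vehicles")

sent_prompts = []  # records everything that would leave the premises


def fake_remote_generate(prompt):
    """Stub for a cloud model call; a real deployment would call an API here."""
    sent_prompts.append(prompt)
    return "stub reply"


reply = answer_with_remote_model("thesis submission rules", index, fake_remote_generate)
print(sent_prompts[0])
```

In this arrangement the external service sees only the retrieved chunk and the question, never the full corpus or the index. Whether that is sufficient for a given compliance regime still has to be assessed case by case.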
The future of AI isn’t just about algorithms; it’s about the infrastructure that supports them. VAST Data has allowed us to store and manage both structured and unstructured data in one place, keeping pace with the growing demands of AI. The flexibility to run models on local GPUs gives us control over where our data processing happens, allowing us to adapt if our AI needs change.
VAST’s platform also supports both our core and AI workloads, eliminating the need for specialized systems. With a unified security setup, we can manage access controls more simply, giving us confidence as we scale our AI initiatives. We’re also looking forward to the upcoming InsightEngine feature, which promises to streamline the RAG pipeline, making AI deployments faster and easier.
About Antonio Cisternino
Antonio Cisternino is a professor and the Chief Information Officer at the University of Pisa, where he operates at the intersection of research, education, and digital infrastructure development. His expertise spans multiple domains, including programming languages, robotics, cloud computing, and the Internet of Things (IoT). In addition to leading several open-source initiatives, he has collaborated with prominent companies such as Microsoft, Ferrari, and Intel on cutting-edge technology projects.
As CIO, Cisternino focuses on translating research outcomes into practical, real-world applications. One notable example is Oraculum, an open-source AI framework currently deployed at several institutions, which demonstrates the potential of AI-driven solutions across diverse environments. Whether developing meta-programming tools or influencing national digital policies, Cisternino consistently aims to push the boundaries of technological innovation. More of his work can be found on GitHub.
Hello Prof. Antonio Cisternino, at the European University Institute we have also developed, as a pilot project, a RAG system very similar to the one you described, although with different technologies. It would be a real pleasure to compare notes on the work you have done in Pisa. Currently, our main challenge is testing a KB system that allows AI applications to be shared among multiple users, integrating a permission-based data "injection" mechanism. The scheme we are exploring is based on concentric circles: personal KB, research group, department, university and, finally, the public sphere. Thank you for sharing your work, which is really very interesting.
Many thanks Antonio Cisternino for sharing your knowledge with others! VAST Data
Fantastic!