Building Big Data Center of Excellence with IBM Cloud and Hadoop
Karan Sachdeva
IBM AWS Global Strategic Partnership Executive for AI @ IBM | NYU Stern MBA ‘27
I often ask customers what inhibits big data initiatives in their organization. Frequent answers include: no compelling business need, or difficulty identifying use cases; lack of data science skills; not enough staff to support them; and the complexity of collecting and managing the data. The concept of a center of excellence (CoE) for big data, which I attempt to demystify here, helps ensure these responses are not inhibitors in any organization.
The key to a data-driven business is bringing data and insight into every workflow and integrating them into decision making at every step. This approach enables organizations to take advantage of the longitudinal analytics made possible by newer technologies such as Hadoop and Spark, as well as machine learning, for past-, present- and future-looking analytics simultaneously.
Defining big data centers of excellence
A big data CoE is a framework that takes an organization from zero knowledge to having a fully functional practice of Hadoop, Spark and emerging open source technologies to deliver robust business results. A CoE is where organizations identify new technologies, learn new skills and develop appropriate processes that are then deployed into the business to accelerate adoption.
A centralized big data CoE can be the bedrock for establishing a data-driven company that treats data as a strategic asset. The CoE can partner with the business to identify its most valuable data, explore use cases that differentiate its products and services in the market and help jump-start the business with insights that can yield real-time client value. Data's strategic importance lies in the value it represents for the business, but success with big data is not just about data. The people and the organization also play a vital role in that success:
A) Building big data success stories with use cases
In many cases, the business comes up with the use cases, but the CoE has the responsibility of facilitating this work. The CoE needs to assume a leadership role in understanding which applications and use cases can be driven with the available data sources. Businesses can be proactive about bringing use cases to the CoE, but the resulting list can quickly become overwhelming and strain available resources. A transparent process for prioritizing these use cases is therefore important and should be adopted. The CoE needs to prioritize use cases based on parameters such as ease of data availability, data quality, business revenue–based value and impact, costs and risks.
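One way to make that prioritization transparent is a simple weighted scoring of each use case against the parameters above. The sketch below is a minimal illustration; the criteria names, weights and 1–5 ratings are assumptions for the example, not a prescribed CoE standard.

```python
# Illustrative sketch: rank candidate use cases by a weighted score.
# Weights and ratings below are hypothetical examples.

def prioritize(use_cases, weights):
    """Rank use cases by a weighted sum of 1-5 criterion ratings."""
    def score(uc):
        return sum(weights[c] * uc["ratings"][c] for c in weights)
    return sorted(use_cases, key=score, reverse=True)

weights = {
    "data_availability": 0.25,
    "data_quality": 0.20,
    "business_value": 0.35,
    "cost_risk": 0.20,  # higher rating = lower cost and risk
}

use_cases = [
    {"name": "Churn prediction",
     "ratings": {"data_availability": 4, "data_quality": 3,
                 "business_value": 5, "cost_risk": 3}},
    {"name": "Network log archiving",
     "ratings": {"data_availability": 5, "data_quality": 4,
                 "business_value": 2, "cost_risk": 4}},
]

ranked = prioritize(use_cases, weights)
print([uc["name"] for uc in ranked])
```

Publishing the weights alongside the ranking is what makes the process transparent: business units can see exactly why one use case was scheduled ahead of another.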
B) Applying agile methodology—the fail-fast approach
Agility and the ability to fail fast are essential to realizing the potential of big data. A lightweight agile process provides the tools to deliver outcomes quickly and transparently, typically within two- to three-week sprints. Failing fast matters because business and technical roadmaps for delivering big data value need to change far more often than in a traditional waterfall environment.
Data itself is also highly agile when it is collected in native form and transformed potentially many times to meet the needs of different use cases. Using the basic ideas of agile development methodology, a CoE can provide the leadership across the organization to ensure business users can quickly gain value from the data.
C) Developing financial models
At the heart of a big data CoE are creative financial models that support the innovation. The charge-back strategy can be framed as data as a service, insights as a service or analytics as a service.
As is often the case with shared services, a charge-back model is necessary to properly handle the maintenance and growth of the emerging technologies, which in this case can be Hadoop and Spark clusters. An organization needs to develop a charge-back model for the business units that will be engaging with the CoE for project, personnel, infrastructure and application resources. Some important questions need to be considered when determining the charge-back model for business units:
- How many users will access the application and cluster?
- How much data will be ingested initially?
- How much data growth is expected over time?
- What is the data retention policy?
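The four questions above map directly onto a simple charge-back estimate: a per-user charge for application and cluster access, plus a storage charge driven by initial ingest, growth and retention. The sketch below is a hypothetical model for illustration; the rates and the linear growth assumption are mine, not IBM or CoE pricing.

```python
# Illustrative sketch: estimate a business unit's monthly charge-back
# from users, initial data, growth and retention. Rates are hypothetical.

def monthly_chargeback(users, initial_tb, monthly_growth_tb,
                       retention_months, month,
                       per_user_rate=50.0, per_tb_rate=25.0):
    """Estimate the charge for a given month of the engagement."""
    # Stored data = initial load + growth, with old data aged out
    # once the retention window is full.
    retained_months = min(month, retention_months)
    stored_tb = initial_tb + monthly_growth_tb * retained_months
    return users * per_user_rate + stored_tb * per_tb_rate

# Example: 20 users, 10 TB initial ingest, 2 TB/month growth,
# 12-month retention, billed for month 6.
print(monthly_chargeback(20, 10.0, 2.0, 12, month=6))
```

Even a crude model like this gives business units a predictable number to budget against, and gives the CoE a basis for capacity planning as Hadoop and Spark clusters grow.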
Business leaders and decision makers acknowledge that creating a data-driven organization requires a change of culture. Big data CoEs can be the key to this culture change. An important recommendation for building a CoE framework is starting with a small, secure data lake—a Hadoop- or Spark-based service—that can store and process data from various internal groups to support multiple use cases. When building a data lake, organizations learn and employ operational best practices for a number of processes:
- Cluster build out
- Data exploration
- Data ingestion and processing
- Disaster recovery
- General operations and maintenance
- Hadoop and Spark development
- Infrastructure integration
- Model building and testing
- Multitenancy and security
- Third-party software evaluation and integration
- Use-case evaluation
A leading telecommunications firm, for example, began by developing a CoE that asked each business division to come up with business use cases that would generate powerful insights through analytics. It then established regular training boot camps in which business users learned how to use data with self-service tools, and it created a community of data scientists and data engineers to support line-of-business managers in their analyses and to validate findings. As a result, this CoE enabled big data as a shared service and opened the conversation about creative financial models that involve charge-backs and show-backs.
Leveraging big data centers of excellence
I foresee creative CoE adaptations such as the one just described helping businesses move beyond merely hoping to become data-driven organizations enabled by big data to actually operating with a data-ingrained business model.
This article has been adapted from my original post at the IBM Big Data Hub. The Big Data Hub is created and curated by IBM as the home for current content and conversation about big data and analytics for the enterprise from thought leaders, subject matter experts and big data practitioners.