Keeping Up With Big Data...
Antonio Figueiredo
Agentforce AI Innovator | Architecting the Future of Customer Interactions
For many years, companies have been collecting transactional data and storing it in databases using a "one-size-fits-all" approach. The data we see flying around today no longer presents the same "well-behaved" characteristics; both the sources and the volume of data collected have exploded. It is now possible to take advantage of this variety of data at a cost and in a timeframe that could not even be considered in the past. You can capture click-stream data about every potential customer interaction with your web site, and log events that could indicate system or integration latency issues. Marketing teams can also tap into comments people are making about their products and brand on social sites, blogs, and other media.
These diverse data sources have become a great source of insight into companies' products, customers (for better KYC, or know-your-customer), and services. It is becoming more and more feasible for organizations to achieve these new capabilities as the technologies and processes to tame what is being called Big Data have become more widely available. These technology- and process-enabled capabilities are definitely getting more attention from businesses.
Big Data is not the solution but the problem: organizations address it to gain insight into their business and to be more responsive to customer expectations, ideally becoming more proactive.
"Big Data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn't fit the structures of your database architectures. To gain value from this data, you must choose an alternative way to process it." -- Big Data Now, O'Reilly Media
Identify Your Goals and Build A Big Data Strategy
A good approach to finding alternatives and solutions for your organization is to start by defining your big data strategy: make your goals and objectives concrete and identify where the valuable golden nuggets are likely to be found.
Organizations are now reinvesting in technologies to find those hidden nuggets of insight, with business goals such as:
- Improving revenue growth
- Reducing churn and improving customer loyalty
- Identifying operational efficiencies
- Creating market differentiation
- Expanding the brand presence
- Mitigating regulatory risks
This process usually includes:
- Big data strategy -- goals, objectives, data, and use case definitions and their prioritization.
- Gap analysis and readiness assessment -- what is available today in terms of access to the data, tools and capabilities.
- Architecture and analytics recommendations -- a flexible and scalable architecture that includes ingestion, storage and processing of the data as well as the supporting analytics and data visualization.
- Roadmap -- a phased actionable roadmap detailing tangible steps to achieve your goals.
The picture below shows the major components that enable and automate the collection, publishing, and management of big data flows, as well as the high-level steps to be addressed by a Big Data Architecture.
A Big Data Reference Architecture
Understanding and consuming the business value found in big data sets helps organizations tackle the business goals above, but traditional processes and architectures can't extract that value effectively. This reference architecture supports most big data requirements, addressing the typical scenarios: real-time and batch processing as well as interactive access. It also adds an API Gateway, recommended for exposing data to its consumers (real-time dashboards, analytics tools, custom-fit solutions, etc.).
The architecture reflects a framework Nathan Marz coined the Lambda Architecture. Some of the key requirements in building this architecture include:
- Fault-tolerance against hardware failures and human errors
- Support for a variety of use cases that include low latency querying as well as updates
- Linear scale-out capabilities, meaning that throwing more machines at the problem should help with getting the job done
- Extensibility so that the system is manageable and can accommodate newer features easily
Understanding the high-level view of this architecture and its components gives a good perspective on tackling Big Data and on how it complements existing enterprise systems.
The diagram below depicts a high-level view of a typical Big Data Architecture:
The data is routed either to a data streaming process, where it gets processed as it flows, or via batch processes into HDFS (the Hadoop Distributed File System). The data becomes more structured as it travels through the architecture, hitting MapReduce processes (inside or outside Hadoop), and is eventually exposed to API Gateways, analytics tools, and/or ad-hoc queries. Data mining and machine learning tools like R, Mahout, and Spark MLlib can be used on the data to produce better insights. Depending on various factors and approaches, each component of this architecture can be implemented by several alternatives, each with its own advantages and disadvantages for a particular workload.
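As a rough illustration of how the batch and speed layers complement each other before data is served to consumers, here is a minimal sketch in plain Python. The event data, function names, and counts are all hypothetical; in a real deployment the batch view would be a Hadoop or Spark job over HDFS and the real-time view a streaming process.

```python
from collections import Counter

# Hypothetical event log: (user_id, page) pairs — illustration only.
batch_events = [("u1", "home"), ("u2", "home"), ("u1", "pricing")]
recent_events = [("u3", "home"), ("u1", "checkout")]  # arrived since the last batch run

def batch_view(events):
    """Batch layer: recompute page-view counts over the full historical dataset."""
    return Counter(page for _, page in events)

def realtime_view(events):
    """Speed layer: incrementally count only the events the batch hasn't seen yet."""
    return Counter(page for _, page in events)

def query(page):
    """Serving layer: merge the batch view and the real-time view at query time."""
    return batch_view(batch_events)[page] + realtime_view(recent_events)[page]

print(query("home"))      # 3: two from the batch view, one from the speed layer
print(query("checkout"))  # 1: only seen in the recent stream
```

The point of the split is that the batch layer can be slow but complete, while the speed layer stays small and fast; merging them at query time keeps results both fresh and accurate.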
A Few Considerations
The actual implementation of some of the components highlighted in this architecture varies in different use cases. Some recommendations and considerations need to be observed:
- This architecture is flexible and scalable. It is not overly complex to implement, preferably in the cloud (e.g. AWS, Rackspace, etc.), which makes it easier to operationalize. However, it can also be deployed successfully in on-premises environments (see virtualization options and/or OpenStack, for instance).
- Data sources vary considerably in scale – from the multi-terabyte to the petabyte range. Use the right tool for the job, from data collection through data processing to advanced analytics tools and approaches.
- Speed requirements may be critical in some cases – nightly ETL (extract-transform-load) batches are insufficient, and real-time streaming from solutions like Storm, Spark Streaming, Samza, and S4 is required. Even in common situations where typical Hadoop MapReduce would usually be used, Spark jobs are now often considered instead due to their reduced latency.
- Hadoop distribution – since Hadoop is an open source project, a number of vendors have developed their own distributions, usually adding new functionality or improving the code base. Vendor distributions are designed to overcome issues with the open source edition and provide additional value to customers. Hence, selecting a Hadoop distribution is recommended.
- The API Gateway at the end of the process flow (as seen in the diagram) gives flexibility to apply additional correlation, business logic, rules, and security, and to use connectors to integrate with enterprise and external cloud systems (push notifications, email, loyalty programs, social networks, commercial cloud solutions like Salesforce.com, etc.). It also helps shape the data, along with a presentation layer, for different channels (e.g. mobile devices and web).
- This architecture also supports adding "triggers" to your business – once you're there, you're living in a more proactive state, and your business can be more responsive and agile. Tying technologies like the Internet of Things (IoT) into this architecture fosters an extra layer of a) integration (a connected business) – mobile apps and devices can connect to smart functions available via APIs, allowing integration with smart wearables, smart homes, smart cars, smart retail, or smart cities. These capabilities can also be used to link medication to patients to pharmacies to doctors, pharmaceutical companies, or hospitals (big in a connected healthcare world), for instance; and b) responsiveness – you can apply business-critical logic, rules, and thresholds to analyze metrics that automatically trigger specific business actions. For instance: system X is behaving abnormally, so re-route its traffic; customer A of a game company has recently started spending all of his or her award points, so detect a possible attrition pattern and send an incentive to stay.
- Storage alternatives for improved capacity – solutions like HDFS and unstructured data stores like Amazon S3 provide better scalability and flexibility than existing enterprise solutions.
- Storage for better correlation capabilities – the graph data model stores all of a node's connections along with the node itself, so there is no additional step in computing connected data beyond reading the node into memory. This type of solution lets you find interconnected data much faster and in a much more scalable manner than the relational data model or even other NoSQL models. For instance, Titan (a distributed graph database) with Cassandra embedded in the same JVM is a good solution for cases like "customers who bought/watched this item/video also bought/watched", etc.
- Search capabilities for quick ad-hoc searches and/or integrated custom-fit solutions exposing data via APIs – ElasticSearch, Apache Solr, etc.
- Multiple analytics paradigms and computational methods must be supported (ref [VMware]):
- Real-time database: These are typically in-memory, scale-out engines that provide low-latency, cross-data center access to data, and enable distributed processing and event-generation capabilities.
- Real-time analytics: Real-time analytics is the use of, or the capacity to use, all available enterprise data and resources when they are needed. It consists of dynamic analysis and reporting, based on data entered into a system less than one minute before the actual time of use. This approach usually uses the real-time stream flow described in the diagram above.
- Interactive analytics: Includes distributed MPP (massively parallel processing) data warehouses with embedded analytics, which enable business users to do interactive querying and visualization of big data.
- Batch processing: Leverage Hadoop as a distributed processing engine that can analyze very large amounts of data and apply algorithms that range from the simple (e.g. aggregation) to the complex (e.g. machine learning).
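To make the streaming and "trigger" points above concrete, here is a hedged sketch in plain Python of a micro-batch sliding window with a business threshold, standing in for what a Storm or Spark Streaming job would do. The window size, threshold value, and `reroute_traffic` action are illustrative assumptions, not a real API.

```python
from collections import deque

WINDOW_SIZE = 3       # number of micro-batches kept in the sliding window (assumed)
ERROR_THRESHOLD = 5   # errors per window that should trigger an action (assumed)

window = deque(maxlen=WINDOW_SIZE)  # oldest micro-batch falls off automatically
actions = []

def reroute_traffic(system):
    # Hypothetical trigger: in production this might call an API Gateway
    # endpoint or publish a message to a queue.
    actions.append(f"reroute:{system}")

def process_micro_batch(error_count, system="system-x"):
    """Each micro-batch contributes one error count; when the sliding-window
    total crosses the threshold, fire the business trigger."""
    window.append(error_count)
    if sum(window) >= ERROR_THRESHOLD:
        reroute_traffic(system)

for errors in [1, 1, 1, 4, 0]:   # simulated per-batch error counts
    process_micro_batch(errors)

print(actions)  # the trigger fires on the 4th and 5th batches
```

The same shape applies to the attrition example: swap error counts for award-point spend per customer and the trigger for an incentive campaign.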
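Similarly, the graph-based correlation idea ("who bought this also bought...") can be sketched with a toy adjacency model in plain Python. A graph database like Titan stores these connections natively and at scale, but the two-hop traversal is the same; the purchase data here is made up for illustration.

```python
from collections import defaultdict, Counter

# Toy "customer bought item" edges — a graph store keeps these on each node.
purchases = {
    "alice": {"camera", "tripod"},
    "bob":   {"camera", "memory-card"},
    "carol": {"camera", "tripod", "bag"},
}

# Invert the edges: item -> set of buyers (one hop in the graph).
buyers = defaultdict(set)
for customer, items in purchases.items():
    for item in items:
        buyers[item].add(customer)

def also_bought(item):
    """Two-hop traversal: item -> its buyers -> their other items,
    ranked by how often they co-occur with the query item."""
    counts = Counter()
    for customer in buyers[item]:
        for other in purchases[customer] - {item}:
            counts[other] += 1
    return [i for i, _ in counts.most_common()]

print(also_bought("camera"))  # "tripod" ranks first (bought by alice and carol)
```

In a relational model this traversal would require self-joins over a purchases table; the graph model reads the neighbors directly off each node.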
OK, And So What?
It is common to see organizations start with a smaller footprint and/or a subset of the patterns in this architecture and then, as they realize the benefits and business value it brings, expand its coverage. As they grow and expand this architecture, here are some points that should be observed and are recommended by many, including Forrester and others:
- Prepare your firm to capitalize on big data as underlying technology is getting more and more mature, more available and cheaper – do it sooner!
- Position your company to quickly gain competitive advantage as "the pack" races to catch up
- Architects should be leading their firm's thinking in bringing initiatives like this to reality to best meet business needs sooner. Be proactive and innovative!
- Recognize where the traditional platforms can no longer solve problems and look to distributed, elastic cloud solutions to make scalability and processing affordable. Big data processing is part of an emerging elastic platform. Embrace it when applicable.
- Always bring flexibility and new capabilities into your architecture now and enjoy the ride later. A scalable, comprehensive and flexible architecture helps the introduction of strong analytics and data science capabilities.
For years, businesses have not taken much advantage of the potential of their data. Most of the solutions organizations develop (e.g. traditional BI solutions) usually take years to implement, and they still may not meet the needs of a more modern business workforce. Organizations need to consider the new dynamics of today's environment and be prepared to be everywhere -- think Mobile First. Architectures like the one described here have the flexibility and capability to expose your data for insights in whatever channel your business requires, and to expose data to solutions either on premises or via trending cloud analytics solutions.