Keeping Up With Big Data...
Antonio Figueiredo
Agentforce AI Innovator | Architecting the Future of Customer Interactions
For many years, companies have been collecting transactional data and storing it in databases using a "one-size-fits-all" approach. The data we see flying around today no longer presents the same "well-behaved" characteristics; both the sources and the volume of data collected have exploded. It is now possible to take advantage of this variety of data at a cost and in a timeframe that could not even be considered in the past. You can capture click-stream data about every potential customer interaction with your web site, and log events that could indicate system or integration latency issues. Marketing teams can also tap into comments people are making about their products and brand on social sites, blogs, and other media.
These diverse data sources have become a great source of insight into companies' products, customers (for better KYC, or know-your-customer), and services. It is becoming more and more feasible for organizations to achieve these new capabilities as the technologies and processes to tame what is being called Big Data have become more widely available. These technology- and process-enabled capabilities are definitely getting more attention from businesses.
Big Data is not the solution but the problem: organizations address it to gain insight into their business and to be more responsive to customer expectations, ideally becoming more proactive.
"Big Data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn't fit the structures of your database architectures. To gain value from this data, you must choose an alternative way to process it." -- Big Data Now, O'Reilly Media
Identify Your Goals and Build A Big Data Strategy
A good approach to finding alternatives and solutions for your organization is to start by defining your big data strategy: make your goals and objectives concrete and identify where the valuable golden nuggets are likely to be found.
Organizations are now reinvesting in technologies to find those hidden nuggets of insight, with business goals such as:
- Improving revenue growth
- Reducing churn and improving customer loyalty
- Identifying operational efficiencies
- Creating market differentiation
- Expanding the brand presence
- Mitigating regulatory risks
This process usually includes:
- Big data strategy -- goals, objectives, data, and use case definitions and their prioritization.
- Gap analysis and readiness assessment -- what is available today in terms of access to the data, tools and capabilities.
- Architecture and analytics recommendations -- a flexible and scalable architecture that includes ingestion, storage and processing of the data as well as the supporting analytics and data visualization.
- Roadmap -- a phased actionable roadmap detailing tangible steps to achieve your goals.
The picture below shows the major components that enable and automate the collection, publishing, and management of big data flows, as well as the high-level steps to be addressed by a Big Data Architecture.
A Big Data Reference Architecture
Understanding and consuming the business value found in big data sets helps organizations tackle the business goals above, but traditional processes and architectures can't extract that value effectively. This reference architecture supports most big data requirements, addressing the typical scenarios: real-time and batch processing as well as interactive access. It also adds an API Gateway, recommended for exposing data to its consumers (real-time dashboards, analytics tools, custom-fit solutions, etc.).
The architecture reflects a framework Nathan Marz coined the Lambda Architecture. Some of the key requirements in building this architecture include:
- Fault-tolerance against hardware failures and human errors
- Support for a variety of use cases that include low latency querying as well as updates
- Linear scale-out capabilities, meaning that throwing more machines at the problem should help with getting the job done
- Extensibility so that the system is manageable and can accommodate newer features easily
Understanding the high-level view of this architecture and its components gives a good perspective on tackling Big Data and on how it complements existing enterprise systems.
The diagram below depicts a high-level view of a typical Big Data Architecture:
The data is routed either to a data streaming process, where it gets processed as it flows, or via batch processes into HDFS (the Hadoop Distributed File System). The data becomes more structured as it travels through the architecture, hitting MapReduce processes (inside or outside Hadoop), and is eventually exposed to API Gateways, analytics tools, and/or ad-hoc queries. Data mining and machine learning tools like R, Mahout, and Spark MLlib can be used on the data to produce better insights. Depending on various factors and approaches, each component of this architecture can be implemented by several alternatives, each with its own advantages and disadvantages for a particular workload.
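As a rough illustration of how the batch and speed layers complement each other before data is served to consumers, here is a minimal sketch in plain Python. The event data, function names, and counts are all hypothetical; in a real deployment the batch view would be a Hadoop or Spark job over HDFS and the real-time view a streaming process.

```python
from collections import Counter

# Hypothetical event log: (user_id, page) pairs — illustration only.
batch_events = [("u1", "home"), ("u2", "home"), ("u1", "pricing")]
recent_events = [("u3", "home"), ("u1", "checkout")]  # arrived since the last batch run

def batch_view(events):
    """Batch layer: recompute page-view counts over the full historical dataset."""
    return Counter(page for _, page in events)

def realtime_view(events):
    """Speed layer: incrementally count only the events the batch hasn't seen yet."""
    return Counter(page for _, page in events)

def query(page):
    """Serving layer: merge the batch view and the real-time view at query time."""
    return batch_view(batch_events)[page] + realtime_view(recent_events)[page]

print(query("home"))      # 3: two from the batch view, one from the speed layer
print(query("checkout"))  # 1: only seen in the recent stream
```

The point of the split is that the batch layer can be slow but complete, while the speed layer stays small and fast; merging them at query time keeps results both fresh and accurate.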
A Few Considerations
The actual implementation of some of the components highlighted in this architecture varies in different use cases. Some recommendations and considerations need to be observed:
- This architecture is flexible and scalable. It is not overly complex to implement, preferably in the cloud (e.g. AWS, Rackspace, etc.), which makes it easier to operationalize. However, it can also be deployed successfully in on-premises environments (see virtualization options and/or OpenStack, for instance).
- Data sources vary considerably in scale – from the multi-terabyte to the petabyte range. Use the right tool for the job, from data collection through data processing to advanced analytics tools and approaches.
- Speed requirements may be critical in some cases – nightly ETL (extract-transform-load) batches are insufficient, and real-time streaming from solutions like Storm, Spark Streaming, Samza, and S4 is required. Even in common situations where typical Hadoop MapReduce would usually be used, Spark jobs are now often considered instead due to their reduced latency.
- Hadoop distribution – since Hadoop is an open source project, a number of vendors have developed their own distributions, usually adding new functionality or improving the code base. Vendor distributions are designed to overcome issues with the open source edition and provide additional value to customers. Hence, selecting a Hadoop distribution is recommended.
- The API Gateway at the end of the process flow (as seen in the diagram) gives flexibility to apply additional correlation, business logic, rules, and security, and to use connectors to integrate with enterprise and external cloud systems (push notifications, email, loyalty programs, social networks, commercial cloud solutions like Salesforce.com, etc.). It also helps shape the data, along with a presentation layer, for different channels (e.g. mobile devices and web).
- This architecture also supports adding "triggers" to your business – once you're there, you're living in a more proactive state, and your business can be more responsive and agile. Tying technologies like the Internet of Things (IoT) into this architecture fosters an extra layer of a) integration (a connected business) – mobile apps and devices can connect to smart functions available via APIs, allowing integration with smart wearables, smart homes, smart cars, smart retail, or smart cities. These capabilities can also be used to link medication to patients to pharmacies to doctors, pharmaceutical companies, or hospitals (big in a connected healthcare world), for instance; and b) responsiveness – you can apply business-critical logic, rules, and thresholds to analyze metrics that automatically trigger specific business actions. For instance: system X is behaving abnormally, so re-route its traffic; customer A of a game company has recently started spending all of his or her award points, so detect a possible attrition pattern and send an incentive to stay.
- Storage alternatives for improved capacity – solutions like HDFS and unstructured data stores like Amazon S3 provide better scalability and flexibility than existing enterprise solutions.
- Storage for better correlation capabilities – the graph data model stores all of a node's connections along with the node itself, so there is no additional step in computing connected data beyond reading the node into memory. This type of solution lets you find interconnected data much faster and in a much more scalable manner than the relational data model or even other NoSQL models. For instance, Titan (a distributed graph database) with Cassandra embedded in the same JVM is a good solution for cases like "customers who bought/watched this item/video also bought/watched", etc.
- Search capabilities for quick ad-hoc searches and/or integrated custom-fit solutions exposing data via APIs – ElasticSearch, Apache Solr, etc.
- Multiple analytics paradigms and computational methods must be supported (ref [VMware]):
- Real-time database: These are typically in-memory, scale-out engines that provide low-latency, cross-data center access to data, and enable distributed processing and event-generation capabilities.
- Real-time analytics: Real-time analytics is the use of, or the capacity to use, all available enterprise data and resources when they are needed. It consists of dynamic analysis and reporting, based on data entered into a system less than one minute before the actual time of use. This approach usually uses the real-time stream flow described in the diagram above.
- Interactive analytics: Includes distributed MPP (massively parallel processing) data warehouses with embedded analytics, which enable business users to do interactive querying and visualization of big data.
- Batch processing: Leverage Hadoop as a distributed processing engine that can analyze very large amounts of data and apply algorithms that range from the simple (e.g. aggregation) to the complex (e.g. machine learning).
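To make the streaming and "trigger" points above concrete, here is a hedged sketch in plain Python of a micro-batch sliding window with a business threshold, standing in for what a Storm or Spark Streaming job would do. The window size, threshold value, and `reroute_traffic` action are illustrative assumptions, not a real API.

```python
from collections import deque

WINDOW_SIZE = 3       # number of micro-batches kept in the sliding window (assumed)
ERROR_THRESHOLD = 5   # errors per window that should trigger an action (assumed)

window = deque(maxlen=WINDOW_SIZE)  # oldest micro-batch falls off automatically
actions = []

def reroute_traffic(system):
    # Hypothetical trigger: in production this might call an API Gateway
    # endpoint or publish a message to a queue.
    actions.append(f"reroute:{system}")

def process_micro_batch(error_count, system="system-x"):
    """Each micro-batch contributes one error count; when the sliding-window
    total crosses the threshold, fire the business trigger."""
    window.append(error_count)
    if sum(window) >= ERROR_THRESHOLD:
        reroute_traffic(system)

for errors in [1, 1, 1, 4, 0]:   # simulated per-batch error counts
    process_micro_batch(errors)

print(actions)  # the trigger fires on the 4th and 5th batches
```

The same shape applies to the attrition example: swap error counts for award-point spend per customer and the trigger for an incentive campaign.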
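Similarly, the graph-based correlation idea ("who bought this also bought...") can be sketched with a toy adjacency model in plain Python. A graph database like Titan stores these connections natively and at scale, but the two-hop traversal is the same; the purchase data here is made up for illustration.

```python
from collections import defaultdict, Counter

# Toy "customer bought item" edges — a graph store keeps these on each node.
purchases = {
    "alice": {"camera", "tripod"},
    "bob":   {"camera", "memory-card"},
    "carol": {"camera", "tripod", "bag"},
}

# Invert the edges: item -> set of buyers (one hop in the graph).
buyers = defaultdict(set)
for customer, items in purchases.items():
    for item in items:
        buyers[item].add(customer)

def also_bought(item):
    """Two-hop traversal: item -> its buyers -> their other items,
    ranked by how often they co-occur with the query item."""
    counts = Counter()
    for customer in buyers[item]:
        for other in purchases[customer] - {item}:
            counts[other] += 1
    return [i for i, _ in counts.most_common()]

print(also_bought("camera"))  # "tripod" ranks first (bought by alice and carol)
```

In a relational model this traversal would require self-joins over a purchases table; the graph model reads the neighbors directly off each node.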
OK, And So What?
It is common to see organizations start with a smaller footprint and/or a subset of the patterns in this architecture and then, as they realize the benefits and business value it brings, expand its coverage. As they grow and expand this architecture, here are some points that should be observed and are recommended by many, including Forrester and others:
- Prepare your firm to capitalize on big data as underlying technology is getting more and more mature, more available and cheaper – do it sooner!
- Position your company to quickly gain competitive advantage as "the pack" races to catch up
- Architects should be leading their firm's thinking in bringing initiatives like this to reality to best meet business needs sooner. Be proactive and innovative!
- Recognize where the traditional platforms can no longer solve problems and look to distributed, elastic cloud solutions to make scalability and processing affordable. Big data processing is part of an emerging elastic platform. Embrace it when applicable.
- Always bring flexibility and new capabilities into your architecture now and enjoy the ride later. A scalable, comprehensive and flexible architecture helps the introduction of strong analytics and data science capabilities.
For years, businesses have not taken much advantage of the potential of their data. Most of the solutions organizations develop (e.g. traditional BI solutions) usually take years to implement, and they still may not meet the needs of a more modern business workforce. Organizations need to consider the new dynamics of today's environment and be prepared to be everywhere -- think Mobile First. Architectures like the one described here have the flexibility and capability to expose your data for insights in whatever channel your business requires, and to expose data to solutions either on premises or via trending cloud analytics solutions.