Big Data Basics

What is Big Data?

Many consider the term Big Data a cliché. However, there is merit in classifying certain types of problems as a group, since they can be solved using a common set of new tools and techniques. High velocity, high variability and high volume of data are often considered the characteristics that distinguish a Big Data problem. The complexity of scaling algorithms up to solve this type of problem can be seen as a fourth dimension.

What differentiates the handling of Big Data?

Big Data needs a different approach at every stage: accumulating data from various sources, processing it and presenting it. This is necessitated by the size of the data, the need to deal with unstructured data, and a different expectation of the data: present all the answers rather than focus on predefined problems.

Data Ingestion

The traditional ETL (extract, transform, load) approach does not work well for Big Data. The large number of integration requirements (for data correlation) makes it impractical to write ETL-based integration for each data source.

Data Curation

Several new tools are available for data ingestion and curation. They are intelligent tools that make data cleansing and transformation easier. Some of the available tools are Paxata, Trifacta, Data Wrangler, Cambridge Semantics and Data Tamer. They can also handle a large number of data sources without a huge integration effort. They are geared towards two types of usage.

The first type is tools for individual data scientists. Typically such tools are not very expensive and do not require programming skills. They offer a visual interface for looking at the data and cleansing or transforming it, and reusable scripts can be generated from the tool. An example of such a tool is Data Wrangler.

The second type is tools for enterprise data integration. Such tools use machine learning and statistics, and a global schema may be created on the fly based on the data. One example of such a tool is Data Tamer.

Data Processing

Because of the sheer volume of data, techniques like massive parallelism, horizontal and vertical partitioning, sampling and summarization are used to process it.
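To make these techniques concrete, here is a minimal single-machine sketch in Python. It is an illustration only: the function names, the dictionary-based row format and the use of a CRC32 hash for placement are assumptions made for this example, not the behaviour of any particular product.

```python
import random
import zlib
from collections import defaultdict


def horizontal_partition(rows, key, num_partitions):
    """Spread whole rows across partitions by hashing one column."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        # CRC32 is stable across runs, unlike Python's built-in hash() for strings.
        index = zlib.crc32(str(row[key]).encode("utf-8")) % num_partitions
        partitions[index].append(row)
    return partitions


def vertical_partition(rows, columns):
    """Keep only a subset of columns for every row."""
    return [{name: row[name] for name in columns} for row in rows]


def reservoir_sample(rows, k, seed=0):
    """Draw a uniform random sample of k rows in one pass over the data."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample


def summarize(rows, group_key, value_key):
    """Reduce rows to a count and mean per group."""
    totals = defaultdict(lambda: {"count": 0, "sum": 0.0})
    for row in rows:
        agg = totals[row[group_key]]
        agg["count"] += 1
        agg["sum"] += row[value_key]
    return {
        group: {"count": agg["count"], "mean": agg["sum"] / agg["count"]}
        for group, agg in totals.items()
    }
```

In a real deployment each partition would live on a different machine; the point of the sketch is only that partitioning splits the work, sampling shrinks it, and summarization replaces raw rows with a few aggregates.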

Data-parallel models are offered by software like Hadoop, Hive, Pig and Dryad. A more recent entrant is Spark, based on RDDs (resilient distributed datasets), which supports multiple passes over the data.
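The data-parallel idea behind these systems can be sketched in plain Python as a word count in the map, shuffle and reduce style. This is not the Hadoop or Spark API; the function names are made up for illustration, and everything runs in a single process.

```python
from collections import defaultdict


def map_phase(partition):
    """Emit a (word, 1) pair for every word in one partition of lines."""
    return [(word.lower(), 1) for line in partition for word in line.split()]


def shuffle(mapped_outputs):
    """Group all emitted values by key across every partition's output."""
    groups = defaultdict(list)
    for output in mapped_outputs:
        for key, value in output:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Combine the values for each key into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(partitions):
    # Each map_phase call depends only on its own partition, so in a real
    # cluster these calls could run on different machines at the same time.
    mapped = [map_phase(partition) for partition in partitions]
    return reduce_phase(shuffle(mapped))
```

For example, `word_count([["big data", "Big tools"], ["data data"]])` counts "big" twice, "data" three times and "tools" once. The independence of the map calls is what lets such a model scale out across many nodes.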

Traditional RDBMS-based databases may or may not be used. For very large data, in-memory databases and column-based databases often give much better performance.
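One reason column-based storage helps with analytical queries can be shown with a small Python sketch. The layout and function names here are simplified assumptions for illustration; real column stores add compression, indexing and much more.

```python
def to_columns(rows):
    """Convert a row-oriented table (list of dicts) to a column-oriented one.

    Assumes every row has the same fields.
    """
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_row_store(rows, field):
    """Aggregate one field; every full row has to be visited to reach it."""
    return sum(row[field] for row in rows)


def sum_column_store(columns, field):
    """Aggregate one field; only that field's values are touched."""
    return sum(columns[field])
```

Both functions return the same total, but the column-oriented version reads only the values it needs. When a table has hundreds of columns and a query touches two or three, that difference is a large part of the performance gain.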

Some of the new-generation databases and platforms suited to Big Data are SAP HANA, HP Vertica, Amazon Redshift, Cloudera, Google BigQuery, Hortonworks and IBM BigInsights.

Big Data analysis uses various types of algorithms and machine learning techniques, such as clustering, association learning, parameter estimation, recommendation engines and classification.
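As one example of these techniques, clustering can be sketched with a basic k-means loop in plain Python. This is a teaching sketch under simple assumptions (points are tuples of numbers, Euclidean distance, a fixed number of iterations, Python 3.8+ for `math.dist`), not a production implementation.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Group points into k clusters; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct input points
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = tuple(
                    sum(coords) / len(members) for coords in zip(*members)
                )
    return centroids, assignments
```

Calling `kmeans(points, 2)` on two well-separated groups of points should place each group in its own cluster. At Big Data scale the assignment step is usually the part that gets distributed, since each point can be processed independently.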

Visualization

Big Data can create opportunities for businesses to gain new insights that help in decision making, increase innovation, identify areas for optimization and improve customer experience. Visualization is key to making this happen.

The typical approach is to use a sort of information map. Visualization-based discovery tools help the business derive more value from Big Data. Some of the expectations from visualization are pre-attentive interactions, faceted browsing and infographics.

Some popular tools for visualization are Exhibit, Visual.ly, D3.js, Dygraphs, ZingChart and InstantAtlas.

Big Data examples in businesses 

Some examples of how businesses bring game-changing value through Big Data:

  • Insurance companies put sensors in your car and analyze your driving patterns. This helps them evaluate risk more accurately and offer a customized premium.
  • Interactive scatter plots help you choose a movie to rent. This helps increase the movie rental business and also lets the provider analyze customer needs.
  • Data from sensors helps in predicting when aircraft need preventive maintenance. This helps reduce downtime.
  • Retailers can use data from web browsing patterns, social media, industry forecasts, existing customer records, etc. This helps them predict trends, prepare for demand, pinpoint customers, optimize pricing and promotions, and monitor real-time analytics and results.
  • A fast food company analyzes its queue in real time and displays menu items according to their preparation time. When the queue is long, it displays only the items that can be served quickly.
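The fast food example above can be sketched as a simple rule in Python. Everything here is hypothetical and chosen only for illustration: the function name, the menu format, and the queue and preparation-time thresholds.

```python
def items_to_display(menu, queue_length, long_queue_threshold=5, max_prep_seconds=120):
    """Return the names of the menu items to show for the current queue.

    menu is a list of dicts with "name" and "prep_seconds" keys.
    The thresholds are illustrative defaults, not real business values.
    """
    if queue_length <= long_queue_threshold:
        # Short queue: show the full menu.
        return [item["name"] for item in menu]
    # Long queue: show only the items that are quick to prepare.
    return [item["name"] for item in menu if item["prep_seconds"] <= max_prep_seconds]
```

A real system would derive the queue length from live sensor or point-of-sale data and would likely tune the thresholds from historical service times rather than hard-coding them.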

 

Sagar Patel

Advisor - OMS/ Support consulting at Manhattan Associates

9 years ago

Data analytics is growing rapidly in retail verticals as a way to understand consumer buying behavior, but the problem many planners face is the reusability of a data model from season to season and year to year versus accurate outlier prediction. So the biggest challenge will be usability in terms of selection criteria, which Hadoop and other tools on the market provide; the open question is flexibility of usage that helps in presenting to C-level executives.

Indervir Singh Banipal

Senior Software Engineer at IBM Watson | Master Inventor

9 years ago

The tools and technologies for handling Big Data are evolving rapidly. A recent one is Spark, developed at the University of California, Berkeley. Spark is an in-memory implementation of a framework for handling Big Data. Hadoop uses HDFS, which internally uses the disk space of the cluster nodes and is no doubt an efficient approach. But disk sits lower in the memory hierarchy than main memory (RAM). Spark uses main memory to do the same work and performs better than Hadoop. Who knows, Spark may replace Hadoop in the near future!
