Technology adoption for data processing and analysis

While working on different projects and designing solutions, I have come across a variety of data processing technologies. The following are some of the most popular ones, which help you transform and process large amounts of data:

  • Apache Hadoop uses a distributed processing architecture in which a task is mapped to a cluster of commodity servers for processing. Each piece of work distributed to the cluster servers can be run or re-run on any of the servers. The cluster servers frequently use HDFS to store data locally for processing. The Hadoop framework takes a big job, splits it into discrete tasks, and processes them in parallel, which allows for massive scalability across an enormous number of servers in a Hadoop cluster. It's also designed for fault tolerance: each worker node periodically reports its status to a master node, and the master node can redistribute work away from a node that doesn't respond. Some of the most popular frameworks used with Hadoop are Hive, Presto, Pig, and Spark.
  • Apache Spark is an in-memory processing framework. It is a massively parallel processing system with multiple executors that can take apart a Spark job and run its tasks in parallel; to increase the parallelism of a job, you add nodes to the cluster. Spark supports batch, interactive, and streaming data sources. Spark uses directed acyclic graphs (DAGs) for all the stages of a job's execution. The DAG keeps track of the transformations, or lineage, of your data during the job and minimizes I/O by keeping DataFrames in memory. Spark is also partition-aware, which helps it avoid network-intensive shuffles. (See the PySpark sketch after this list.)
  • Hadoop User Experience (HUE) enables you to run queries and scripts on your cluster through a browser-based user interface instead of the command line. HUE exposes the most common Hadoop components in a single user interface and enables browser-based viewing and tracking of Hadoop operations. Multiple users can access the cluster via HUE's login portal, and administrators can manage access manually or with LDAP, PAM, SPNEGO, OpenID, OAuth, and SAML2 authentication. HUE allows you to view logs in real time and provides a metastore manager to manipulate the Hive metastore contents.
  • Pig is typically used to process large amounts of raw data before storing it in a structured format (SQL tables). Pig is well suited to ETL operations such as data validation, data loading, data transformation, and combining data from multiple sources in multiple formats. In addition to ETL, Pig supports relational operations such as nested data, joins, and grouping. Pig scripts can take unstructured and semi-structured data (such as web server logs or clickstream logs) as input, whereas Hive always enforces a schema on input data. Pig Latin scripts contain instructions on how to filter, group, and join data, but Pig is not intended to be a query language; Hive is better suited to querying data. Pig compiles and runs the script to transform the data according to the instructions in the Pig Latin script.
  • Hive is an open source data warehouse and query package that runs on top of a Hadoop cluster. SQL is a very common skill, so Hive helps teams make an easy transition into the big data world. Hive uses a SQL-like language called Hive Query Language (HQL), which makes it easy to query and process data in a Hadoop system, and it abstracts away the complexity of writing analytics jobs in a programming language such as Java. (A small Hive query sketch follows this list.)
  • Presto is a Hive-like query engine, but much faster. It supports the ANSI SQL standard, which is easy to learn and one of the most popular skill sets. Presto supports complex queries, joins, and aggregation functions. Unlike Hive or MapReduce, Presto executes queries in memory, which reduces latency and improves query performance. You need to be careful when selecting server capacity for Presto, as it needs plenty of memory; a Presto job will restart in the event of memory spillover.
  • HBase is a NoSQL database developed as part of the open source Hadoop project. HBase runs on HDFS to provide non-relational database capabilities for the Hadoop ecosystem. HBase helps to store large quantities of data in a columnar format with compression. It also provides fast lookups because a large portion of the data is cached in memory, while cluster instance storage is still used.
  • Apache Zeppelin is a web-based editor for data analytics built on top of the Hadoop system, also known as the Zeppelin Notebook. It uses the concept of an interpreter for its backend language, which allows any language to be plugged into Zeppelin. Apache Zeppelin includes some basic charts and pivot charts, and it is very flexible in that output from any language backend can be recognized and visualized.
  • Ganglia is a Hadoop-cluster monitoring tool; however, you need to install Ganglia on the cluster at launch. The Ganglia UI runs on the master node, which you can view over an SSH tunnel. Ganglia is an open source project designed to monitor clusters without impacting their performance. It can help you inspect the performance of individual servers in your cluster as well as the performance of the cluster as a whole.
  • JupyterHub is a multi-user Jupyter notebook server. Jupyter notebooks are one of the most popular tools among data scientists for data engineering and ML. The JupyterHub notebook server provides each user with their own web-based Jupyter notebook IDE, so multiple users can write and execute code for exploratory data analytics simultaneously.
  • Amazon Athena is an interactive query service for running queries on Amazon S3 object storage using standard ANSI SQL syntax. Amazon Athena is built on top of Presto and extends ad hoc query capabilities as a managed service. The Amazon Athena metadata store works like the Hive metadata store, so you can use the same DDL statements from the Hive metadata store in Amazon Athena. Athena is a serverless managed service, which means all infrastructure and software handling and maintenance is taken care of by AWS, and you can start running queries directly in the Athena web-based editor. (A small boto3 sketch follows this list.)
  • Amazon Elastic MapReduce (EMR) is essentially Hadoop in the cloud. You can utilize the Hadoop framework with the power of the AWS cloud using EMR. EMR supports all the most popular open source frameworks, including Apache Spark, Hive, Pig, Presto, Impala, HBase, and so on. EMR provides decoupled compute and storage, which means you don't always have to keep a large Hadoop cluster running; you can perform data transformation, load the results into persistent Amazon S3 storage, and shut down the cluster. EMR provides autoscaling and saves you the administrative overhead of installing and updating server software. (A sketch of launching a transient EMR cluster follows this list.)
  • AWS Glue is a managed ETL service that helps with data processing, data cataloging, and ML transformations for finding duplicate records. The AWS Glue Data Catalog is compatible with the Hive data catalog and provides a centralized metadata repository across a variety of data sources, including relational databases, NoSQL stores, and files. AWS Glue runs on top of a warm Spark cluster and provides ETL as a managed service. AWS Glue generates code in PySpark and Scala for common use cases, so you are not writing ETL code from scratch. Glue's job-authoring functionality surfaces job errors and provides logs to help you understand underlying permission or data formatting issues. Glue also provides workflows, which help you build automated data pipelines with simple drag-and-drop functionality. (A Glue job sketch follows this list.)
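
To make the Spark point above more concrete, here is a minimal PySpark sketch. It is only an illustration: the S3 paths and column names (orders, status, amount, order_date) are hypothetical, but it shows how transformations are recorded lazily in a DAG, how a DataFrame can be cached in memory to minimize repeated I/O, and how an action finally triggers execution.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-revenue").getOrCreate()

# Reading is lazy: Spark only records this step in the DAG.
orders = spark.read.parquet("s3://my-bucket/raw/orders/")  # hypothetical path

# Transformations are lazy too; they extend the DAG (the data's lineage).
daily_revenue = (
    orders
    .filter(F.col("status") == "COMPLETED")
    .groupBy("order_date")
    .agg(F.sum("amount").alias("revenue"))
)

# Keep the intermediate DataFrame in memory to avoid recomputing it from S3.
daily_revenue.cache()

# Actions such as count() or write trigger execution of the whole DAG.
print(daily_revenue.count())
daily_revenue.write.mode("overwrite").parquet("s3://my-bucket/curated/daily_revenue/")
```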
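
For Hive, the value is that an analyst can express a job in HQL rather than Java. The sketch below assumes a reachable HiveServer2 endpoint and uses the PyHive client; the hostname, database, and table names are hypothetical.

```python
from pyhive import hive  # pip install "pyhive[hive]"

# Connect to HiveServer2 (hostname, port, and database are hypothetical).
conn = hive.Connection(host="emr-master.example.com", port=10000, database="sales_db")
cursor = conn.cursor()

# An HQL query: SQL-like syntax instead of hand-written MapReduce code in Java.
cursor.execute(
    "SELECT order_date, SUM(amount) AS revenue "
    "FROM orders "
    "GROUP BY order_date"
)

for order_date, revenue in cursor.fetchall():
    print(order_date, revenue)
```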
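
The Athena description above translates into very little code because there is no cluster to manage. This boto3 sketch submits a query and polls for completion; the database, table, result bucket, and region are hypothetical.

```python
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")  # hypothetical region

# Submit an ANSI SQL query against data in S3 (database/table/bucket are hypothetical).
response = athena.start_query_execution(
    QueryString="SELECT order_date, SUM(amount) AS revenue FROM orders GROUP BY order_date",
    QueryExecutionContext={"Database": "sales_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = response["QueryExecutionId"]

# Poll until the serverless query finishes; results land in the S3 output location.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

print("Query finished with state:", state)
```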
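
To illustrate the "decoupled compute and storage" point for EMR, the sketch below launches a transient cluster that runs a single Spark step against data in S3 and then terminates itself. The release label, instance types, script path, and IAM role names are hypothetical and would need to match your account.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # hypothetical region

response = emr.run_job_flow(
    Name="transient-spark-etl",
    ReleaseLabel="emr-6.10.0",  # hypothetical release
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Primary", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        # Shut the cluster down automatically once the steps finish.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[{
        "Name": "spark-etl",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/scripts/etl_job.py"],  # hypothetical script
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)

print("Cluster ID:", response["JobFlowId"])
```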
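
Finally, a minimal AWS Glue job sketch in PySpark, roughly the shape of the boilerplate Glue generates for you. The catalog database, table, and output path are hypothetical; the job reads from the Glue Data Catalog, applies a simple column mapping, and writes Parquet back to S3.

```python
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read from the centralized, Hive-compatible Glue Data Catalog (names are hypothetical).
source = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db",
    table_name="raw_orders",
)

# Rename/cast columns -- the kind of transformation code Glue can also generate for you.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("order_id", "string", "order_id", "string"),
        ("amount", "double", "order_amount", "double"),
    ],
)

# Write the transformed data back to S3 as Parquet (path is hypothetical).
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/curated/orders/"},
    format="parquet",
)

job.commit()
```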

Data analysis and processing is a huge topic; the above are some of the most popular technologies used in industry.

