Supercharging Analytics with Apache Arrow
Image courtesy: Apache Arrow Project; Minority Report (film, 2002, Steven Spielberg); 20-caliber cermet shells used by Zazie in Last Order, volume 4, Angel of Protest, page 136 – featuring APFSDS [Ref 9]


The escape into ‘Polyglot Persistence’ almost became a norm: we accepted data duplication and slowness as the price of rich features spread across multiple technologies that do not talk to each other. The absence of a common in-memory columnar standard, one able to exploit multi-core CPUs, caches, RAM, GPUs, Spark and the rest, felt helpless and ‘penny-wise-pound-foolish’. As Amdahl’s law reminds us, the slowest intermediate step decides the final speed, and the ‘tortoise in the middle’ usually turned out to be marshalling, serialization and deserialization, re-formatting, transformation and data preparation. This is exactly where Apache Arrow shines. Since its inception in 2016 it has come a long way, determined to do one job very well: provide a common columnar in-memory standard across applications and technologies for better interoperability. Like GS1 EDI / GDSN for merchandising [Ref 1], Apache Arrow has a clear, elegant and important goal: a common standard supported across various software and hardware platforms. In short, Apache Arrow is an indispensable in-memory data structure specification for engineers building data systems today.

Apache Arrow has several key benefits :

  • A columnar memory layout enabling O(1) random access. The layout is highly cache-efficient for analytics workloads and facilitates SIMD optimizations [Ref 2] on recent processors, so programmers can write very fast algorithms over Arrow data structures (a short PyArrow sketch follows this list).
  • Highly efficient and super-fast data interchange between systems without the serialization costs associated with other systems like Thrift, Avro, and Protocol Buffers.
  • A flexible structured data model supporting complex types that handles flat tables as well as real-world nested and rich-structured JSON-like data.
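
To make this layout concrete, here is a minimal PyArrow sketch, assuming the pyarrow package is installed, that builds a nullable integer column and inspects the validity and value buffers behind it:

```python
# A minimal sketch, assuming the pyarrow package is installed.
import pyarrow as pa

# A nullable Int64 column: values sit contiguously in a single buffer,
# with a separate validity bitmap marking which slots are null.
arr = pa.array([1, 2, None, 4], type=pa.int64())

print(arr[3])          # O(1) random access by index -> 4
print(arr.null_count)  # -> 1

# The underlying memory: [validity bitmap buffer, value buffer]
for buf in arr.buffers():
    print(None if buf is None else buf.size)
```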

Pause for a moment to wonder where the name ‘Arrow’ might have come from. Why ‘Arrow’ when they could have picked something more modern, like bullet, missile or rocket? Army ballistics research [Ref 3] suggests that arrow-type projectiles have more penetrating power than plain bullets. Armour-piercing anti-tank rounds of the APFSDS type (armor-piercing fin-stabilized discarding sabot) use a hybrid design that mixes bullet, arrow and sabot to achieve higher muzzle velocity, fin stabilization, lower drag and better penetration. Read against that, the Apache Arrow project's choice of name, as explained on its official website, sounds well justified.

Talking of sabots: Dremio, a Data-as-a-Service platform, has an execution environment named 'Sabot' for working with data, built entirely on top of Arrow, which they describe as a fast way of operationalizing Apache Arrow [Ref 4].

Figure 1 : (a) APFSDS Anti-tank round with its "saddle-back" sabot [Ref 3]; (b) Apache Arrow [Ref 5]

While it is clear that Arrow is here to push data at high speed through diverse applications at the in-memory level, we should be equally aware of the prerequisites for taking advantage of the technology, and check our own readiness. So let us look at its relevance to your project across multiple scenarios, to understand where Arrow might fit in.

Are you ready for Arrow ?

Arrow was born in the big data era, so it is assumed that you are already taking advantage of columnar databases. To be more specific, let us assume that you are already using in-memory technologies like pandas DataFrames with Spark, and a columnar format like Parquet for serializing data.

Situations where Arrow might be relevant:

1.    You need random and bulk access to nested structured data, and have many manage/govern scenarios for analytics, data visualisation and batch processing.

2.    You use multiple technologies, each with its own formats and stores, and there is at least one hand-coded transformation routine converting an unstructured format (like JSON) into something else.

3.    You have complex data and dynamic schemas. Business problems are much easier to solve when represented through hierarchical and nested data structures; this was the primary reason for the adoption of JSON and document-based databases.

4.    You have real-time data visualization needs for anomaly detection and are prepared to use graphs and GPUs.

5.    You need to distribute processing loads between server and client to reduce the network effect.

Analytical queries are different: most of them read a small subset of the columns for large numbers of rows at a time.
A columnar database organizes the values for a given column contiguously on disk, which significantly reduces the number of seeks for multi-row reads.

What’s wrong with rows, and how do columns fix it?

In traditional systems, latency comes from two sources: (1) the number of disk seeks and (2) the amount of data scanned per seek. In a row-oriented database, the columns for a given row are written to disk contiguously, which is very efficient for writes but not for reads: the system may perform a seek for each row and read most of its columns from disk into memory, even when they are not needed. Analytical queries are different; most of them read a small subset of the columns for large numbers of rows at a time. A columnar database organizes the values for a given column contiguously on disk, which significantly reduces the number of seeks for multi-row reads. Furthermore, compression algorithms tend to be much more effective on a single data type than on the mix of types present in a typical row.

Figure 2 : Arrow memory buffers differ from traditional memory buffers, significantly reducing the number of seeks for multi-row reads [Ref 5]. Arrow leverages the data parallelism (SIMD, Ref 2) in modern Intel CPUs and optimizes CPU prefetching and caching.

writes are slower for columnar, but this is a good optimization for analytics where reads typically far outnumber writes

But what about writes? The trade-off is that writes are slower for columnar, but this is a good optimization for analytics where reads typically far outnumber writes. To summarize, columnar data structures provide a number of performance advantages over traditional row-oriented data structures. These include benefits for data on disk – fewer disk seeks, more effective compression, faster scan rates - as well as more efficient use of CPU for data in memory.

Today columnar storage is very common and is implemented by most databases, including Teradata, Vertica, Oracle and others. Twitter and Cloudera created Parquet to bring these ideas to the Hadoop ecosystem. While Parquet became the standard for columnar data on disk, Apache Arrow emerged as the standard way of representing columnar data in memory, taking the idea to the next level. Keep reading to see why columnar alone cannot solve all the problems; there are more barriers to overcome. First, let us quickly check the timeliness of Arrow's arrival from the side of another enabler: RAM.

there is enormous interest in the data world in how to make optimal use of RAM for analytics

The main driver behind Arrow and the timing of its arrival :

Hardware has changed a lot in the decade since the original Google papers (2003) that inspired Hadoop. The most important change is the dramatic decline in RAM prices. The original Hadoop design was built around seeks on spinning platter disks; much of that role has now been taken over by RAM. Servers today have far more RAM than a decade ago, and because reading data from memory is thousands of times faster than reading it from disk, there is enormous interest in the data world in how to make optimal use of RAM for analytics (or anywhere else we need super-fast data access). Cache is faster still, but it lives in the CPU/GPU context and has a hard upper limit on size, so cache will never replace RAM. Note carefully that storage access speed is different from CPU processing speed; for the same reason, comparing cache with RAM is largely beside the point. Based on your actual processing needs you can go for GPUs or beyond if a high-end CPU is not enough. For example, in gaming or real-time graph visualization you would prefer GPUs, which pack in a large number of processing cores; in Bitcoin mining even high-end GPUs are not enough, and 16nm ASIC technology is the standard today. So, irrespective of the processing technology you use, you are bound to use a lot of RAM wherever you can, and you should prefer technologies that use RAM-based access rather than hard-disk-based access. Note that you will still use some non-volatile storage (e.g. SSD or HDD) to load data into RAM initially, and minimize disk reads thereafter.

Figure 3 : Comparing latency figures [Ref 6]. Note the latency of RAM (main memory) compared to SSDs and platter disks (non-volatile memory).

the trade-offs for columnar data in memory are quite different from those for columnar data on disk

Now that you have acquired a lot of RAM to maximize CPU utilization and accepted columnar databases on disk, let us look at columnar in memory. You may be surprised to find that the trade-offs for columnar data in memory are quite different from those for columnar data on disk. For data on disk, I/O usually dominates latency, which can be addressed with aggressive compression at the cost of CPU. In memory, access is much faster and we want to optimize for CPU throughput by paying attention to cache locality, pipelining and SIMD instructions [Ref 2].

One of the funny things about computer science is that while there is a common set of resources – RAM, CPU, storage, network – each language has an entirely different way of interacting with those resources. When different programs need to interact – within and across languages – there are inefficiencies in the handoffs that can dominate the overall cost. This is a little bit like traveling in Europe before the Euro, where you needed a different currency for each country, and by the end of the trip you could be sure you had lost a lot of money with all the exchanges! And this is exactly where Arrow comes in handy.

The official Apache Arrow website describes two benefits very clearly [Ref 5]:

  • Performance Advantage of Columnar In-Memory
  • Advantages of a Common Data Layer

Figure 4 : With and without Arrow, from the official Apache Arrow website [Ref 5]

The Apache Arrow project viewed these handoffs (Figure 4, diagram on the left: ‘Before’) as the next obvious bottleneck for in-memory processing, and set out to work across a wide range of projects to develop a common set of interfaces that remove unnecessary serialization and deserialization when marshalling data. Apache Arrow standardizes an efficient in-memory columnar representation that is the same as the wire representation. Today it has first-class bindings in over 13 projects, including Spark, Databricks, Hadoop, R and Python/Pandas.

In a typical system the bottleneck appears when data is moved across machines, and serialization is an overhead in many cases. Arrow improves the performance of data movement within a cluster by removing serialization and deserialization. Things get even better when two systems both use Arrow as their in-memory representation: for example, Kudu could send Arrow data to Impala for analytics, since both are Arrow-enabled, without any costly deserialization on receipt. With Arrow, inter-process communication happens mostly through shared memory, TCP/IP and RDMA. Arrow also supports a wide variety of data types, covering both SQL and JSON types, such as Int, BigInt, Decimal, VarChar, Map, Struct and Array.
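
As a small sketch of what 'no per-value serialization' means in practice (assuming the pyarrow package is installed), a record batch can be written in the Arrow IPC stream format and read back; the bytes exchanged use the same columnar layout Arrow already holds in memory:

```python
# A minimal sketch of Arrow's IPC stream format, assuming pyarrow is installed.
import pyarrow as pa

batch = pa.record_batch(
    [pa.array([1, 2, 3]), pa.array(["a", "b", "c"])],
    names=["id", "label"],
)

# Serialize in the Arrow IPC stream format: the bytes on the wire use the
# same columnar layout Arrow uses in memory, so nothing is re-encoded.
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)
payload = sink.getvalue()

# A receiving process (simulated here in-process) reads the batches back
# without a per-value deserialization step.
reader = pa.ipc.open_stream(payload)
print(reader.read_all())
```

In a real deployment the payload would travel over shared memory or a socket rather than staying in one process, but the format is the same either way.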

Nobody wants to wait longer to get their answers from the data...
When the data is in columnar structure, it is much easier to use SIMD [Ref 2] instructions over it

Nobody wants to wait longer to get answers from their data. The faster they get an answer, the faster they can ask the next question or solve their business problem. As CPUs become faster and more sophisticated in design, the key challenge in any system is keeping CPU utilization near 100% and using it efficiently. When the data is in a columnar structure, it is much easier to apply SIMD instructions [Ref 2] to it.

SIMD is short for Single Instruction/Multiple Data: SIMD operations process multiple data elements with a single instruction. In contrast, the conventional sequential approach of using one instruction per data element is called scalar operation. In some cases, when using AVX [Ref 7] instructions, these optimizations can increase performance by two orders of magnitude.

Figure 5 : SIMD versus scalar operations [Ref 7]
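
As a rough, non-Arrow illustration of the scalar-versus-vectorized contrast shown in Figure 5, compare a Python loop with a NumPy call that sweeps a contiguous column in one go; the element count is arbitrary and the actual speedup depends on your CPU:

```python
# A rough scalar-versus-vectorized illustration, assuming numpy is installed.
import numpy as np

values = np.arange(1_000_000, dtype=np.int64)

# Scalar style: the interpreter touches one element per step.
def scalar_sum(xs):
    total = 0
    for x in xs:
        total += x
    return total

# Vectorized style: a single call sweeps the whole contiguous column,
# and the underlying kernel can use SIMD instructions on modern CPUs.
vector_total = values.sum()

assert scalar_sum(values) == vector_total
```

Arrow arrays offer compute kernels the same kind of contiguous, single-type memory that makes the vectorized path possible.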

Arrow is designed to maximize cache locality, pipelining and SIMD instructions. Together, cache locality, pipelining and superword operations frequently provide 10-100x faster execution. Since many analytical workloads are CPU bound, these benefits translate into dramatic end-user performance gains, which mean faster answers and higher levels of user concurrency.

Python in particular has very strong support through the Pandas library, which can work directly with Arrow record batches and persist them to Parquet. It is not uncommon for users to see 10x-100x performance improvements across a range of workloads by using Parquet and Arrow together. While each project can be used independently, both provide APIs to read and write between the formats.

Since both are columnar, efficient vectorized converters can be implemented between the two, and reading from Parquet into Arrow is much faster than into a row-oriented representation, with a vectorized reader and full type equivalence.
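
A minimal sketch of that interoperability using PyArrow's Parquet module (assuming pyarrow is installed; the file and column names are illustrative):

```python
# A minimal Parquet <-> Arrow round trip, assuming pyarrow is installed;
# the file name and column names are illustrative.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"sensor": ["a", "b", "a"], "reading": [0.1, 0.7, 0.4]})

pq.write_table(table, "readings.parquet")       # columnar on disk
subset = pq.read_table("readings.parquet",
                       columns=["reading"])     # only the needed column, in memory
print(subset.schema)
```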

Pandas is a good example of using both projects: users can save a Pandas DataFrame to Parquet and read a Parquet file into in-memory Arrow. Pandas can work directly on top of Arrow columns, paving the way for a faster Spark integration.
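
The Spark side of this can be sketched roughly as follows, assuming a PySpark 3.x environment with pyarrow available; the configuration key shown is the Spark 3.x name, and names like add_tax and the price column are purely illustrative:

```python
# A sketch of Arrow-backed pandas interchange with Spark, assuming a
# PySpark 3.x environment with pyarrow installed; add_tax, "price" and
# the 1.2 factor are illustrative.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = (
    SparkSession.builder
    .appName("arrow-demo")
    # Enable Arrow-based columnar transfer between the JVM and Python workers
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")
    .getOrCreate()
)

pdf = pd.DataFrame({"price": [10.0, 12.5, 9.9]})
sdf = spark.createDataFrame(pdf)          # pandas -> Spark via Arrow batches

@pandas_udf("double")
def add_tax(price: pd.Series) -> pd.Series:
    # The UDF receives whole Arrow-backed pandas Series, not one row at a time.
    return price * 1.2

sdf.select(add_tax("price").alias("gross")).show()
```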

Arrow’s design is optimized for analytical performance on JSON-like rich nested structured data

Arrow improves the performance of data movement within a cluster in these ways:

  1. Two processes utilizing Arrow as their in-memory data representation can “relocate” the data from one process to the other without serialization or deserialization. For example, Spark could send Arrow data to a Python process for evaluating a UDF (user-defined function), as sketched above.
  2. Arrow data can be received from Arrow-enabled database-like systems without costly deserialization on receipt. For example, Kudu could send Arrow data to Impala for analytics purposes.
  3. Arrow’s design is optimized for analytical performance on JSON-like rich nested structured data, such as that found in Impala or Spark DataFrames. Let’s look at a simple example to see what the Arrow array will look like [Figure 6]


Figure 6 : An example of nested structured data representation. See how the persons.name.addresses values are accessible through offsets. In Arrow, variable-length offsets let us locate each value while the data itself is stored in contiguous blocks [Ref 8]

The example in Figure 6 above depicts two immediate benefits:

  • You can locate any value of interest in constant O(1) time by “walking” the variable-length dimension offsets.
  • The address data itself is stored contiguously in memory, so scanning the values for analytics purposes is highly cache-efficient and amenable to SIMD operations, subject to suitable memory alignment [Ref 2] (a short PyArrow sketch follows).
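
A minimal PyArrow sketch of the same idea, assuming pyarrow is installed and using made-up address strings:

```python
# A minimal sketch of variable-length nested data in Arrow, assuming pyarrow;
# the address strings are illustrative.
import pyarrow as pa

# One row per person; each row holds a variable-length list of addresses.
addresses = pa.array(
    [["12 Main St", "4 Elm St"], ["9 Oak Ave"], []],
    type=pa.list_(pa.string()),
)

print(addresses.offsets)   # variable-length offsets, e.g. [0, 2, 3, 3]
print(addresses.values)    # every address string stored contiguously
print(addresses[1])        # O(1) access to the second person's addresses
```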


Two top use cases with Apache Arrow

1.    Combined use of Arrow and Parquet : Apache Arrow for in-memory columnar data and Apache Parquet for on-disk columnar data, with fast and flexible interoperability between them, are supercharging various manage and govern use cases in the big data world. Data visualization, AI/ML-driven automation and optimization, PXM (Product Experience Management) companies and research institutes [powering particle physics at CERN : Ref 10] are adopting this combination at high speed.

Figure 7 : Using the Arrow JDBC Adapter in a Python/Pandas environment, it is now possible to convert JDBC objects directly to Arrow objects without the serialization overhead [Ref 11]

Implementing great experiences over typical web architectures built on JSON and a plethora of JavaScript libraries quickly hits two key bottlenecks: 1) networking clogged by large file sizes, and 2) CPU and memory-intensive data serialization.

There are several columnar formats for data on disk (apart from Apache Parquet). However, before Apache Arrow there was no such standard for in-memory processing. Even with libraries like Google’s FlatBuffers sitting on top of native ArrayBuffers, each process uses a private model, and data must be serialized, deserialized, and likely transformed between these models. This processing can even dominate the overall workload, consuming 60%-70% of CPU time. In addition, each process creates its own copy of the data, increasing use of scarce memory resources. This is especially important for GPUs, which have only a fraction of the RAM available to CPUs.

Apache Arrow was designed to eliminate the overhead of serialization by providing a standard way of representing columnar data for in-memory processing. When using Arrow, multiple processes can access the same buffer, with zero copy and zero serialization/deserialization of the data.
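
One way to see this zero-copy behaviour from Python is to memory-map an Arrow IPC file; below is a sketch assuming pyarrow is installed, with an illustrative file path:

```python
# A minimal zero-copy sketch via an Arrow IPC file and memory mapping,
# assuming pyarrow; the file path is illustrative.
import pyarrow as pa

table = pa.table({"x": list(range(1_000_000))})

# Producer: write the table once, in the Arrow IPC file format.
with pa.OSFile("shared.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Consumer (could be a different process): memory-map the file and read it.
# The resulting buffers point straight into the mapped region -- no copy,
# no deserialization pass over the values.
with pa.memory_map("shared.arrow", "r") as source:
    loaded = pa.ipc.open_file(source).read_all()
    print(loaded.num_rows)
```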


2.    Combined use of Arrow and GPU : GPUs began as specialized hardware for accelerating games. With projects like Google’s TensorFlow and the growing number of vendors in the GPU Open Analytics Initiative (GOAI), the same GPU hardware has been taking over general analytical applications. GPUs pack in thousands of cores, orders of magnitude more than CPUs, and many analytical workloads can be parallelized to take advantage of these resources. As a result, machine learning libraries, GPU-based databases, and visualization tools have emerged that dramatically improve the speed of these workloads.
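
As a hypothetical sketch of that hand-off, assuming a CUDA-capable machine with RAPIDS cuDF and pyarrow installed (the column names here are made up):

```python
# A hypothetical sketch of handing Arrow data to a GPU DataFrame library,
# assuming a CUDA-capable GPU with RAPIDS cuDF and pyarrow installed.
import pyarrow as pa
import cudf

arrow_table = pa.table({"user_id": [1, 2, 3, 1], "score": [0.9, 0.4, 0.7, 0.6]})

# Columnar hand-off into GPU memory; no row-by-row conversion is needed.
gdf = cudf.DataFrame.from_arrow(arrow_table)
print(gdf.groupby("user_id")["score"].mean())
```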

Figure 8 : Reference architectures for JS GOAI bridges to ML, GPU, and Big Data frameworks [Ref 12]

Remote rendering... client-side GPUs... even mobile phones have great GPUs

This allows ‘Remote rendering’ where the server sends geometry commands to the client, and the client turns those into viewable pixels by leveraging the client’s standard web browser and its local access to a client-side GPU. This is now practical because even mobile phones have great GPUs. Remote rendering done well is akin to running a video game on the client, and the server telling the client what the pieces are and when to move them. Clever applications will transfer surprisingly little. Most interactions like pan and zoom will run entirely on the client at full framerate with few, if any, network calls on the critical path.

Minority Report-style data visualizations ... in every web browser

Imagine a future where Minority Report-style data visualizations (top banner) run in every web browser. This is a big step forward for critical workflows like investigating security and fraud incidents, and making critical insights for the next level of BI. Today’s options are dominated by rigid Windows desktop tools and slow web apps with clunky dashboards. The Apache Arrow ecosystem, including the first open source layers for improving JavaScript performance, is changing that.

Frustrated with legacy big data vendors for whom “interactive data visualization” does not mean “sub-second”, companies like Dremio, Graphistry, and other leaders in the data world have been gutting the cruft from today’s web stacks. By replacing it with straight-shot technologies like Arrow, they are successfully serving up beyond-native experiences to regular web clients.


About the author :

Ripan Sarkar holds an MBA in Business Strategy and is a Senior Data Scientist affiliated with the Data Science Council of America, a coder, growth hacker, patent holder, startup accelerator, six-sigma master black belt, certified scrum master, blogger and author of a book series titled ‘Bullet Proof Machine Learning’. He is currently pursuing a PhD in AI/ML.


References :

Ref 1 : https://www.gs1.org/standards/edi

Ref 2 : https://en.wikipedia.org/wiki/SIMD

Ref 3 : Sabot Design for a 105mm APFSDS Kinetic Energy Projectile, June 1978, Drysdale, Kirkendall, Kokinakis, US Army Ballistics Research Laboratory [Source : https://www.dtic.mil/dtic/tr/fulltext/u2/a056428.pdf]

Ref 4 : https://www.dremio.com/

Ref 5 : https://arrow.apache.org/

Ref 6 : PyData Paris 2015 - Closing keynote Francesc Alted [Source : https://www.slideshare.net/PoleSystematicParisRegion/closing-keynote-francesc-alted ]

Ref 7 : Intro to Intel AVX – [ Source : https://computing.llnl.gov/tutorials/linux_clusters/Intro_to_Intel_AVX.pdf ]

Ref 8 : Presentation by Siddharth Teotia (Dremio) on ‘Vectorized Query Processing using Apache Arrow’ by Strata Data Conference - San Jose 2018 by O'Reilly Media, Inc.   [Source : https://www.safaribooksonline.com/library/view/strata-data-conference/9781492025955/video319106.html ]

Ref 9 : 20-caliber cermet shells used by Zazie in Last Order, volume 4, Angel of Protest, page 136 – featuring APFSDS (Armor-piercing fin-stabilized discarding sabot) [Source : https://wiki.rippersanime.info/tiki-index.php?page=APFSDS]

Ref 10 : https://db-blog.web.cern.ch/blog/luca-canali/2017-06-diving-spark-and-parquet-workloads-example

Ref 11 : Xoriant, a Silicon Valley based company, recently contributed to Arrow. Xoriant team developed a feature called JDBC Adapter. This new feature is now available as a part of the latest 0.10.0 release of Apache Arrow. https://www.xoriant.com/blog/big-data-analytics/xoriant-open-source-contribution-apache-arrow-jdbc-adapter.html

Ref 12 : https://www.graphistry.com/blog/js-gpus-ml-arrow-goai
