One Billion Row Challenge in Java: Part 4 - in Less than 10 seconds

One Billion Row Challenge in Java: Part 4 - in Less than 10 seconds

Introduction

The One Billion Row Challenge (1BRC) is a fun exploration of how far modern Java can be pushed for aggregating one billion rows from a text file. Grab all your (virtual) threads, reach out to SIMD, optimize your GC, or pull any other trick, and create the fastest implementation for solving this task

Data Format

The text file contains temperature values for a range of weather stations. Each row is one measurement in the format <string: station name>;<double: measurement>, with the measurement value having exactly one fractional digit. The following shows ten rows as an example:

Hamburg;12.0
Bulawayo;8.9
Palembang;38.8
St. John's;15.2
Cracow;12.6
Bridgetown;26.9
Istanbul;6.2
Roseau;34.4
Conakry;31.2
Istanbul;23.0        

Task Description

The task is to write a Java program which reads the file, calculates the min, mean, and max temperature value per weather station, and emits the results on stdout like this (i.e. sorted alphabetically by station name, and the result values per station in the format <min>/<mean>/<max>, rounded to one fractional digit):

{Abha=-23.0/18.0/59.2, Abidjan=-16.2/26.0/67.3, Abéché=-10.0/29.4/69.0, Accra=-10.1/26.4/66.4, Addis Ababa=-23.7/16.0/67.0, Adelaide=-27.8/17.3/58.5, ...}        

Current Baseline

Hampus takes (28.381 seconds) to execute, it achieves it by doing the following:

  1. Leveraging memory mapped files for optimized data access.
  2. Implementing custom parsing and processing to swiftly extract and process data from byte buffers.
  3. Employing a sophisticated file chunking and parallel processing strategy, maximizing resource utilization and throughput.

Can we double the performance ?

Anthony Goubard takes (9.945 seconds) to execute, it achieves it by doing the following:

  • More custom parsing: Handle Split Line, Read First Lines, Read Next Line, Read Temperature.
  • Switched to InputStream and ConcurrentHashMap.
  • Semaphore to avoid OOMException.
  • Custom Data Types: The implementation introduces custom data types like ByteArray and Text to optimize memory usage and string comparisons.
  • Chunk size is 5MB

Assumptions

  • No temperatures are above 100 or below -100
  • The last character of the file is \n

Temperature Data Parsing

The method begins by determining the size of the input file and calculating the number of blocks required to read the entire file efficiently.

Thread Pool Setup: It initializes an ExecutorService with a fixed thread pool size to manage concurrent parsing tasks. This approach maximizes CPU utilization and enhances overall performance.

Blockwise Processing: The method iterates over blocks of data within the file, ensuring that each block is processed concurrently. This allows for efficient utilization of system resources, particularly when dealing with large files.

Concurrency Control: The method employs a semaphore (readFileLock) to control access to shared resources, such as file reading operations. This ensures that threads wait when resources are unavailable, preventing contention and potential deadlock situations.

Task Submission and Execution: Within each block iteration, the method submits tasks for reading data (readBlock) and parsing temperature data (parseTemperaturesBlock) to the thread pool. This enables parallel execution of file reading and parsing tasks, maximizing throughput.

Task Completion and Shutdown: Once all tasks have been submitted, the method waits for their completion using futures. Upon completion, it shuts down the thread pool to release system resources gracefully.

Exploring the readBlock Method

The method initializes parameters such as the file index and buffer size based on the provided block index and predefined constants.

File Reading: Using an InputStream, the method reads data from the specified file starting from the calculated file index. It efficiently handles scenarios where the file index exceeds the file size, ensuring graceful termination of file reading operations.

Buffer Allocation: Depending on the size of the data to be read, the method either retrieves a buffer from a pre-allocated pool or instantiates a new buffer. This approach optimizes memory usage and promotes resource reuse.

Data Transfer: The method reads data from the input stream into the allocated buffer, ensuring that the entire block of data is read efficiently. It handles cases where the data read exceeds the buffer size, guaranteeing completeness of the read operation.

Resource Management: Upon completing the file reading operation, the method releases the read file lock, ensuring proper synchronization and concurrency control.

Handling of Partial Lines

The method begins by initializing the buffer index based on the data read from the buffer's first lines.

Partial Line Handling: It iterates over the buffer in reverse order, starting from the end and moving backward until it encounters a newline character ('\n'). During this process, it stores the characters of the partial line in a temporary list (lastLine).

Previous Line Storage: If the previous block's last line is empty, indicating the beginning of a new block, the method stores the current partial line as the previous block's last line for future reference.

Split Line Handling: If the previous block's last line is not empty, indicating a continuation of a line from the previous block, the method further processes the split line by reading the remaining characters and appending them to the previous line.

Buffer Index Update: Finally, the method updates the buffer index to reflect the position where the next block of data should begin processing.

Pros

  • Efficient Iteration: It efficiently iterates over the buffer in reverse order to identify the end of the partial line, minimizing unnecessary iterations and improving performance.
  • Resource Management: By storing the partial line in a temporary list, the method ensures efficient memory utilization and promotes resource reuse.
  • Seamless Line Continuation: It seamlessly handles line continuation from the previous block by appending the split line to the previous block's last line, ensuring continuity in data processing.

Dive into the readFirstLines Method

The readFirstLines method is responsible for parsing and extracting relevant information from the initial lines of a buffer, typically containing file headers and metadata. Here's a breakdown of its key components:


Header Detection: The method first checks if the precision value is already set. If it is, it immediately returns, assuming that the current lines are not part of the header.

Comment Handling: It iterates over the buffer, skipping lines that start with a '#' character, indicating comments. This ensures that comments and metadata are ignored during header parsing.

Header Parsing: Once comments are skipped, the method extracts the relevant information from the header. It identifies the position of the decimal point ('.') within the header, which signifies the precision of the temperature data.

Precision Calculation: Based on the position of the decimal point, the method calculates the precision of the temperature data. It computes the precision limit, which is crucial for validating temperature values later in the data processing pipeline.

Index Tracking: Finally, the method returns the index where the actual data begins, enabling subsequent processing of the data with the extracted precision information.

Dive into the readSplitLine Method

The method iterates over the buffer, reading characters until it encounters a newline ('\n') character, signifying the end of the split line.

Partial Line Accumulation: During iteration, it accumulates characters from the split line into a list (previousBlockLastLine), ensuring completeness in line processing.

Line Reconstruction: Once the split line is fully processed, the method reconstructs it by converting the accumulated characters back into a byte array (splitLineBytes).

Continued Line Processing: It delegates the processing of the reconstructed split line to the readNextLine method, ensuring seamless continuation of data processing.

Index Management: Finally, the method updates the buffer index to reflect the position where the next block of data should begin processing.

Dive into the readNextLine Method

The readNextLine method is responsible for parsing individual lines of data within a buffer, extracting relevant information such as city names and temperature readings.



City Name Extraction: The method iterates over the buffer until it encounters a delimiter (';'), signifying the end of the city name. It extracts the city name substring from the buffer and converts it into a Text object using a custom method (Text.getByteText).

Temperature Reading: After parsing the city name, the method extracts the temperature reading from the buffer. It delegates temperature reading to another method (readTemperature), ensuring modular and reusable code.

Precision Adjustment: The method adjusts the buffer index to skip additional characters after reading the temperature, considering the precision of the temperature data. It also performs additional checks to ensure the validity of the temperature reading based on precision limits.

Temperature Addition: Finally, the method adds the parsed city name and temperature reading to the provided blockCityMeasurementMap, facilitating aggregation and analysis of temperature data.

Pros

  • Efficient Delimiter Search: It efficiently searches for the delimiter (';') within the buffer, minimizing unnecessary iterations and improving parsing speed.

Dive into the readTemperature Method

The readTemperature method is responsible for extracting temperature values from a byte array, typically representing lines of temperature data within a file.

Sign Detection: The method checks if the first character in the buffer represents a negative sign ('-'). If it does, the method sets a flag indicating that the temperature is negative and increments the buffer index to skip the sign character.

Digit Parsing: After handling the sign, the method iterates over the buffer, parsing digits and computing the temperature value. It accumulates digits to construct the temperature value, considering the position of the decimal point ('.') if present.

Decimal Point Handling: When encountering a decimal point, the method skips the character to maintain precision in the temperature value. This ensures accurate parsing of temperature values with fractional components.

Negative Value Adjustment: If the temperature value is negative, the method negates it before returning the final temperature value, ensuring correctness in the extracted temperature data.

Custom Byte Array

The ByteArray class serves a crucial role in optimizing memory usage during file processing. At its core, it encapsulates a byte array, providing a flexible and efficient mechanism for handling data buffers. Key features and functionalities of ByteArray include:

Memory Allocation: Instances of ByteArray are constructed with a pre-allocated byte array, ensuring efficient memory utilization and avoiding unnecessary overhead associated with dynamic resizing.

Resource Reuse: The class includes a mechanism for pooling and reusing ByteArray instances. This approach minimizes the creation of new objects, thereby reducing memory churn and enhancing performance.

Concurrent Access: As evidenced in the implementation, ByteArray instances are safely accessed and manipulated across multiple threads, thanks to Java's concurrency utilities.

Encapsulation: By encapsulating byte array operations within a dedicated class, ByteArray promotes code modularity and encapsulation, enhancing the maintainability and readability of the overall implementation.

Custom Text

In contrast to traditional string representations, the Text class offers a customized approach tailored to the specific needs of the temperature data processing application. Key characteristics of the Text class include:

Immutable Representation: Instances of Text encapsulate immutable byte arrays representing textual data. This immutability ensures data integrity and simplifies concurrency management within the application.

Custom Hashing: Text implements a custom hashing mechanism optimized for byte array comparison. This approach improves performance compared to standard string hashing algorithms, particularly in scenarios involving large datasets.

Memory Efficiency: By representing textual data as byte arrays, Text minimizes memory overhead associated with character encoding and string manipulation, contributing to overall memory efficiency.

Caching and Reuse: The implementation incorporates a caching mechanism to reuse Text instances, reducing object instantiation overhead and promoting memory reuse across the application.

Interoperability: Despite its custom representation, Text seamlessly integrates with existing Java APIs, ensuring compatibility and interoperability with standard library components.

In essence, the Text class offers a specialized representation of textual data optimized for performance and memory efficiency, making it an integral component of the temperature data processing workflow.

Conclusion

In achieving an execution time of 9.945 seconds, Anthony Goubard's implementation employs a range of optimizations, including custom parsing techniques like handling split lines and reading temperatures, utilizing InputStream and ConcurrentHashMap for enhanced concurrency, and implementing Semaphore to prevent memory exhaustion. Introducing custom data types such as ByteArray and Text optimizes memory usage and string comparisons, while operating with a 5MB chunk size efficiently processes large datasets. With the assumptions that temperatures range between -100 and 100 and the file ends with a newline character, Goubard's implementation showcases an efficient and meticulous approach to file processing, achieving remarkable performance while meeting specific requirements.

Slower Version



要查看或添加评论,请登录

社区洞察

其他会员也浏览了