Apache Arrow

Apache Arrow is an in-memory columnar data format used across various systems such as Apache Spark, Impala, and Apache Drill.

Arrow represents columnar data with Value Vectors. There are various types of value vectors, depending on the data type. In this post, I serialize a NullableIntVector to a file and deserialize it back.
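Conceptually, a nullable value vector pairs a buffer of values with a validity bitmap that records which slots are non-null. The sketch below illustrates that idea in plain Java; the `IntColumn` class is my own illustration, not Arrow's actual implementation (Arrow manages off-heap buffers through an allocator).

```java
import java.util.BitSet;

// Illustrative sketch of the values-buffer + validity-bitmap layout.
// Not Arrow's real NullableIntVector, which uses off-heap ArrowBufs.
public class IntColumn {
    private final int[] values;
    private final BitSet validity = new BitSet();

    public IntColumn(int capacity) {
        this.values = new int[capacity];
    }

    public void set(int index, int value) {
        values[index] = value;
        validity.set(index); // mark this slot as non-null
    }

    public boolean isNull(int index) {
        return !validity.get(index); // unset bit means null
    }

    public int get(int index) {
        return values[index];
    }
}
```

A slot that was never `set` reads as null, which is exactly the role the validity bitmap plays in a real value vector.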

Sample Code

Getting Started

The arrow-vector module is available in the Maven repository.

pom.xml:

<dependencies>
    <!-- https://mvnrepository.com/artifact/org.apache.arrow/arrow-vector -->
    <dependency>
        <groupId>org.apache.arrow</groupId>
        <artifactId>arrow-vector</artifactId>
        <version>0.4.0</version>
    </dependency>
</dependencies>

Write to file

The sample code that writes a NullableIntVector to a file follows:

public static void write(String path, BufferAllocator allocator) throws IOException {

    try (FileOutputStream out = new FileOutputStream(path)) {
        // Allocate a nullable int vector named "test" and fill four values.
        NullableIntVector vector = new NullableIntVector("test", allocator);
        vector.allocateNew();
        NullableIntVector.Mutator mutator = vector.getMutator();
        mutator.set(0, 3);
        mutator.set(1, 2);
        mutator.set(2, 1);
        mutator.set(3, 4);
        mutator.setValueCount(4);

        // Wrap the vector in a schema root and write it out as one record batch.
        VectorSchemaRoot root = new VectorSchemaRoot(asList(vector.getField()), asList((FieldVector) vector), 4);
        try (ArrowWriter writer = new ArrowFileWriter(root, null, Channels.newChannel(out))) {
            writer.writeBatch();
        }
    }
}

Read from file

The sample code that reads the NullableIntVector back from the file follows:

public static void read(String path, BufferAllocator allocator) throws IOException {
    byte[] byteArray = Files.readAllBytes(FileSystems.getDefault().getPath(path));
    SeekableReadChannel channel = new SeekableReadChannel(new ByteArrayReadableSeekableByteChannel(byteArray));
    try (ArrowFileReader reader = new ArrowFileReader(channel, allocator)) {

        // Load each record batch, then read the "test" vector position by position.
        for (ArrowBlock block : reader.getRecordBlocks()) {
            reader.loadRecordBatch(block);
            FieldReader fieldReader = reader.getVectorSchemaRoot().getVector("test").getReader();
            System.out.println("buf[0]: " + fieldReader.readInteger());
            fieldReader.setPosition(1);
            System.out.println("buf[1]: " + fieldReader.readInteger());
            fieldReader.setPosition(2);
            System.out.println("buf[2]: " + fieldReader.readInteger());
            fieldReader.setPosition(3);
            System.out.println("buf[3]: " + fieldReader.readInteger());
        }
    }
}
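The FieldReader above follows a cursor pattern: you set a position, then read the value at that position. The plain-Java sketch below mirrors that pattern over an int array; the `IntCursor` class is my own illustration, not Arrow's FieldReader.

```java
// Minimal sketch of the cursor-style read pattern used above:
// setPosition() moves the cursor, readInteger() reads at the cursor.
// Illustrative only; not Arrow's actual FieldReader.
public class IntCursor {
    private final int[] data;
    private int position = 0; // starts at index 0, like a fresh reader

    public IntCursor(int[] data) {
        this.data = data;
    }

    public void setPosition(int index) {
        this.position = index;
    }

    public int readInteger() {
        return data[position];
    }
}
```

This is why the sample can call `readInteger()` for `buf[0]` without an explicit `setPosition(0)` first: the cursor starts at position 0.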

Caller

The sample code that calls these write/read methods follows:

public static void main(String[] args) throws IOException {
    write("test", new RootAllocator(Long.MAX_VALUE));
    read("test", new RootAllocator(Long.MAX_VALUE));
}

Running it prints:

buf[0]: 3
buf[1]: 2
buf[2]: 1
buf[3]: 4
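As a side note, the Arrow file format begins and ends with the 6-byte magic string "ARROW1", so you can sanity-check the file the sample produced. The `looksLikeArrowFile` helper below is my own, not part of Arrow's API:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

public class MagicCheck {
    // The Arrow file format's magic bytes.
    private static final byte[] MAGIC = "ARROW1".getBytes(StandardCharsets.US_ASCII);

    // True if the byte array begins with the Arrow file-format magic string.
    public static boolean looksLikeArrowFile(byte[] bytes) {
        if (bytes.length < MAGIC.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOfRange(bytes, 0, MAGIC.length), MAGIC);
    }

    public static void main(String[] args) throws IOException {
        // Inspect the "test" file produced by the write() sample above.
        byte[] bytes = Files.readAllBytes(Paths.get("test"));
        System.out.println("Arrow file? " + looksLikeArrowFile(bytes));
    }
}
```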

Conclusion

In this post, I tried serializing and deserializing with Apache Arrow. Writing data to a file is not Arrow's essential use case, since Arrow is primarily an in-memory format, but I wanted runnable Apache Arrow code, so I did it this way. Enjoy!