Spark 1.6 feeds big data's hunger for memory

Posted on: Nov 24, 2015

Those curious about what's coming in Apache Spark 1.6 can now get hands-on experience with the latest changes to the big data processing framework.

Databricks, a major commercial contributor to Spark and provider of a cloud-hosted platform for running Spark applications, is offering Spark 1.6 in preview on its services. Those who want to try out Spark on their own can obtain Spark 1.6 pre-release code directly from Apache.

This new version packs major changes to the way Spark handles memory. Earlier editions subdivided available memory into two fixed partitions, one for cached data and one for execution, and left users to figure out how to split memory between the two. This rigidity might have been one of the reasons for InfoWorld contributor Ian Pointer's complaints about Spark's memory management.

In version 1.6, execution memory and storage memory can borrow from each other as needed. "For many applications," says Databricks in its blog post, "this will mean a significant increase in available memory that can be used for operators, such as joins and aggregations."

There are still some restrictions in the current implementation: memory that storage borrows from execution can be reclaimed by evicting cached blocks, but memory that execution borrows from storage is not released until the tasks using it finish. For backward compatibility, 1.6 also includes a legacy mode that preserves the old fixed-partition behavior.
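Both modes are selected purely through configuration. The following Scala sketch shows how an application might opt into each; the property names and default values are taken from the Spark 1.6 configuration documentation, while the application names are hypothetical.

    import org.apache.spark.SparkConf

    // Unified memory management (the 1.6 default): execution and storage
    // share a single region, sized by spark.memory.fraction, and borrow
    // from each other as needed.
    val unified = new SparkConf()
      .setAppName("memory-demo") // hypothetical app name
      .set("spark.memory.fraction", "0.75")       // heap share for execution + storage (1.6 default)
      .set("spark.memory.storageFraction", "0.5") // portion of that region shielded from eviction

    // Legacy fixed-partition mode, for applications tuned against the old model.
    val legacy = new SparkConf()
      .setAppName("memory-demo-legacy") // hypothetical app name
      .set("spark.memory.useLegacyMode", "true")
      .set("spark.storage.memoryFraction", "0.6") // old-style storage partition
      .set("spark.shuffle.memoryFraction", "0.2") // old-style execution (shuffle) partition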

Version 1.6 also continues 1.5's work on Project Tungsten, a major initiative to rewrite Spark's low-level handling of memory and CPU. As Spark picks up speed, both in terms of its performance and its popularity with developers, more of its performance bottlenecks are turning out to be limits of the JVM itself, such as its garbage collector.
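The workaround Tungsten pursues is to keep hot data out of the garbage collector's reach. The sketch below is illustrative only, not Spark's actual internals: it uses a standard NIO direct buffer to show the off-heap idea that Tungsten builds on.

    import java.nio.ByteBuffer

    // Direct buffers live outside the garbage-collected heap, so large,
    // long-lived data structures stop contributing to GC pauses.
    val offHeap: ByteBuffer = ByteBuffer.allocateDirect(64 * 1024 * 1024) // 64 MB off-heap

    offHeap.putLong(0, 42L)           // write a value at a fixed byte offset
    val readBack = offHeap.getLong(0) // read it back with no object allocation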

Keeping it convenient

Another key addition to Spark 1.6 is the Dataset API for working with collections of typed objects. Previously, Spark users had a choice of two APIs for interacting with data: RDDs, essentially collections of native Java objects (type-safe, but slow); and DataFrames, collections of structured binary data (less safe, but faster).
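A minimal Scala sketch of the two pre-1.6 options follows; the Person case class and the sample records are hypothetical.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    case class Person(name: String, age: Int) // hypothetical record type

    val sc = new SparkContext(new SparkConf().setAppName("api-demo"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // RDD: native JVM objects. Lambdas are checked at compile time,
    // but Spark sees opaque objects and cannot optimize the query plan.
    val rdd = sc.parallelize(Seq(Person("Ann", 35), Person("Bob", 42)))
    val adultsRdd = rdd.filter(_.age >= 18)

    // DataFrame: structured binary rows. The optimizer can do far more,
    // but column references are strings checked only at runtime.
    val df = rdd.toDF()
    val adultsDf = df.filter($"age" >= 18)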


Datasets meld the advantages of both: the static typing and user-defined functions of RDDs, and the superior performance of DataFrames. Datasets and DataFrames are also meant to interoperate freely; libraries that accept one kind of data can accept the other as well.
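Continuing the sketch above, converting the DataFrame to a Dataset restores typed, compile-time-checked operations while keeping the encoded binary representation (the API was still experimental in 1.6).

    // Dataset: a typed view over the same encoded binary data. Field
    // access is checked at compile time, yet Spark can still optimize.
    val ds = df.as[Person]                // or rdd.toDS() with implicits in scope
    val adultsDs = ds.filter(_.age >= 18) // a typed lambda, as with RDDs

    // Interoperation is free in both directions:
    val backToDf = adultsDs.toDF()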