From Impala to Hive with Love

October 20th, 2015

To create data products for customers, our data scientists must push their analytic models into production for automatic execution, and share the results within the organization. In one of our projects, this meant converting Impala code to Hive. The rationale behind this was that Hive on YARN was more stable than Impala, and this move also reduced demand on in-memory resources used largely for ad-hoc data science exploration.

While on paper it should be a seamless transition to run Impala code in Hive, in reality it’s more like playing a relentless game of whack-a-mole.

For every fix you make, another problem pops up. Even when you’ve fixed the syntax errors and the SQL executes, the results may have changed. Tracing the root cause of these is like finding a needle in multiple haystacks.

This post will discuss three major categories of differences between Hive and Impala to speed up the conversion process, and provide hints on what to look for.

NOTE: Hive 0.13 and Impala 2.1 versions were used for testing, however several of these still apply for newer versions as of publication date.

Common Functions, Different Results

This is the most “dangerous” of the difference categories, and the one that took most time to debug. It was immediately apparent in the final result, but figuring out where the results started deviating was a time-consuming process. Having a validated dataset to compare against for intermediate steps was extremely helpful in speeding up this process.

Regular Expressions

Regexp_replace supports POSIX expressions in Impala and Java Regex syntax in Hive. The following line to strip the “-“ character was understood in Impala, but not in Hive:

regexp_replace(col, ‘[^[:digit:]]’,’')

This is because Java does not support POSIX bracket expressions, at least not using the same syntax. In Hive, the above function returns a NULL, in our case hidden in one of the intermediate tables, skewing final results significantly. For example:

regexp_replace(col,"-",”")

[NOTE: this works in both Hive and Impala]

In some cases, the syntax is not supported at all. For example, when doing substitutions in this example to reintroduce the dash in dates:

In Hive:

SELECT regexp_replace('20150901', '([0-9]{4})([0-9]{2})([0-9]{2})', '$1-$2-$3');

In Impala:

SELECT regexp_replace('20150901', '([0-9]{4})([0-9]{2})([0-9]{2})', '\\1-\\2-\\3');

Recommended: Try to use an expression supported by both when possible. Follow the appropriate regex syntax in each language. Make sure test tables are in place to compare Hive and Impala results to identify where they diverge.

Rounding Differences

No matter how close mathematical functions were, any results with a DOUBLE data type needed to be rounded in order to match against Hive counterparts. We used the ROUND(value, precision) function for this purpose. Precision varies on the data, however we typically defaulted to 5 decimal spaces.

A particular scenario that was problematic for us was when comparing values to a threshold value (i.e. 0). For example, let’s say we were subtracting two dollar amounts, expecting to get a 0 result. Due to a minor difference in the precision, we would instead get a very small fraction. When we tested this value against 0, it did not match, even though in substance the values are the same.

0.00000001 ≠ 0

The effect of which was that rows which had very small fractions were being excluded from the result. The round function solved this issue.

Recommended: Use round when comparing DOUBLE values using impala.

Support for international characters (i.e. UTF)

Another problem we encountered was around Impala support for international characters (i.e. UTF). Quoting from Cloudera’s documentation:

For full support in all Impala subsystems, restrict string values to the ASCII character set. UTF-8 character data can be stored in Impala and retrieved through queries, but UTF-8 strings containing non-ASCII characters are not guaranteed to work properly with string manipulation functions, comparison operators, or the ORDER BY clause.

If you need to sort, manipulate, or display data depending on those national language characteristics of string data, use logic on the application side.

We were able to work with UTF values by switching over to Hive.

Recommended: Use Hive when working with international characters.

unix_timestamp

For future dates, unix_timestamp reported different results between Hive and Impala for the same date.

In Hive:

SELECT unix_timestamp('2040-01-01 00:00:00');
2209017600

In Impala:

SELECT unix_timestamp('2040-01-01 00:00:00');
-2085978496

This issue has since been fixed in Impala 2.2.

Performance Issues

Performance issues in Hive can lead to restructuring of SQL queries. This can vary widely, but in our case it was as simple as being conscious of the ordering of columns based on cardinality for GROUP BY queries or the ordering of tables in JOINs, avoiding putting small tables at the end. There are good references already for these out there.

stddev

The default standard deviation function in Hive is interpreted as STDDEV_POP() while in Impala is interpreted as STDDEV_SAMP(). STDDEV_SAMP() scales the result by 1/(N-1) and STDDEV_POP() by 1/N (3). This led to minor differences in the stddev results, but enough that they were not insignificant.

Recommended: Pick one approach (STDDEV_POP() or STDDEV_SAMP()) and use consistently across both implementations.

Different Functions

Impala has additional functions that do not exist in Hive. Here are some examples:

Hive does not have a now() to produce the current timestamp. Instead, the workaround to achieve this is to use from_unixtime(unix_timestamp()).
Hive does not have adddate or subdate. Instead, Hive has date_add and date_sub.
Similarly, Hive does not have an extract() function. Instead, you have to use year(), month(), day(), etc. functions. Trunc is not supported either.

A list of all Impala supported functions can be found here.

Different Syntax

There’s some Impala syntax which is not supported in Hive. Here are some examples:

  • No /* */ in comments; use -- instead
  • Hive does not have INVALIDATE METADATA; these lines would need to be removed in a Hive version of the SQL.
  • No INSERT INTO; instead INSERT INTO TABLE
  • No col = (SELECT query); instead use table.col IN (SELECT query)
  • Prior to Hive 1.2, no UNION; instead you have to use UNION ALL, which does not remove duplicates. Compensate adding a SELECT DISTINCT to the UNION ALL result.

Recommended: Pick a syntax that is supported by both Hive and Impala when possible.

Conclusion

Additional resources can be found in the Hive manual and Cloudera’s page on Hive/Impala differences. Have any comments or questions? Leave ‘em below.

Thanks to Kevin Zielnicki and Chloe Mawer, who contributed additional examples to this list.