What you need to know about Hadoop right now

Ten months ago, we published a cheat sheet for learning about Hadoop, the center of the big data vortex. Check out what's been added since then

Last year I gave you a quick rundown of what you should know about Hadoop. It has been a couple of months short of a year since then, but I thought I'd check in and see how you're coming along -- and add a few more technologies to the list.

To start, I hope you didn't forget your fundamentals. YARN and HDFS are no less important now than they were last year. Plus, I hope you remembered the ecosystem stuff. In fact, HBase is even more vital, and Cassandra is on fire in the marketplace, although many now consider it its own thing outside of Hadoop. (If you think your brain is running out of room, at least you can forget that HAWQ or Greenplum ever existed, since Pivotal soon will.)

Today, you should probably know about Phoenix -- similar to Splice Machine, which I covered last year, but totally open source. It's essentially an RDBMS built on top of HBase and supports a healthy SQL subset: JDBC and the works. It's also a heck of a lot faster than Hive. I don't think of it as a replacement for Hive, which is still a good fit for a bunch of flat files that you don't want to mangle into HBase and might analyze in other ways. Anyway, best of all, Phoenix was founded by James Taylor, who is totally not tired of jokes based on his name.
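
To give you a sense of how ordinary Phoenix feels in practice, here's a minimal sketch of querying it through plain JDBC. The Zookeeper host, table, and columns are placeholders I made up, and you'd need the Phoenix client jar on your classpath:

```java
// Minimal sketch: querying Phoenix over JDBC.
// The Zookeeper quorum ("zk-host"), table, and columns are invented placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PhoenixQuerySketch {
    public static void main(String[] args) throws Exception {
        // Phoenix connections point at the HBase Zookeeper quorum, not a database server
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181");
             PreparedStatement stmt = conn.prepareStatement(
                     "SELECT host, response_time FROM web_stat WHERE domain = ?")) {
            stmt.setString(1, "example.com");
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
                }
            }
        }
    }
}
```

It looks like any other JDBC code, which is exactly the point -- under the hood it's all HBase.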

If you didn’t take my advice to learn a little Spark and Storm, now's the time. (Note: You can forget about Shark and learn Spark SQL instead.) Spark is setting the world on fire (pun intended), and at the moment, when people say “real time” and “Hadoop” in the same sentence, “Storm” will probably be in there too. The two have some overlap, but there are places where one is a better fit than the other.
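
If you want a taste of what replacing Shark with Spark SQL looks like, here's a minimal sketch using the Spark 1.x Java DataFrame API; the input path and column names are placeholders:

```java
// Minimal sketch: a Spark SQL query over JSON data (Spark 1.x-era API).
// The HDFS path and the "source" field are invented placeholders.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class SparkSqlSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("spark-sql-sketch");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // Load JSON (could just as easily be Parquet or a Hive table) and query it with SQL
        DataFrame events = sqlContext.read().json("hdfs:///data/events.json");
        events.registerTempTable("events");

        DataFrame bySource = sqlContext.sql(
                "SELECT source, COUNT(*) AS hits FROM events GROUP BY source ORDER BY hits DESC");
        bySource.show();

        sc.stop();
    }
}
```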

You should probably know about Kafka, too. If you have used JMS, AMQ, or any messaging tool, then you already know a little about Kafka. If you're using Storm, most of the time you'll also use it with Kafka to make sure the little streams of bits end up somewhere and are not merely dropped into /dev/null.
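
Here's a minimal sketch of the producer side in Java, just to show how little ceremony is involved. The broker address, topic, and payload are placeholders, and a Storm topology would sit on the consuming end:

```java
// Minimal sketch: pushing events onto a Kafka topic with the Java producer.
// "broker1:9092" and the "events" topic are invented placeholders.
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // A Storm (or Spark Streaming) topology would consume these on the other end
            producer.send(new ProducerRecord<>("events", "sensor-42", "{\"temp\": 21.5}"));
        }
    }
}
```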

You may also want to learn Falcon -- writing a whole stream processing thing when all you want to do is feed data from Hadoop cluster A to Hadoop cluster B is a waste, and managing data evictions with Oozie can be laborious.

As fun as Ambari is for setting up clusters, it might not be how you want to set up, configure, and reconfigure a massive farm. Moreover, what if you have a big fat data center and don't want to decide that some set of servers will only ever be used for batch rather than stream processing? What if you simply want to pool your resources? Maybe Mesos is your daddy.

If someone makes you do security at the perimeter, you might have to use Knox, but it's probably more important to start boning up on Ranger. In a way, Ranger is a side effect of the disjointed way in which the Hadoop ecosystem was created. The idea is that a user is a user and security is security: I ought not have to create the concept separately in Hive, HBase, Storm, Knox, and so on. Ranger plugs into all of them. Don't get too excited -- it doesn't plug into everything yet, and the documentation isn't quite done, but you can find more on the Hortonworks site.

There are a few things outside of Hadoop you should know about, too. For example, you should familiarize yourself with LDAP. I mean, no one likes Active Directory, but everyone is doing it, and LDAP is one of the key ways to integrate with it. Unfortunately, the most complete security model in Hadoop is Kerberos. Yes, that old piece of, er, engineering is still the thing to configure most of the time. You should probably know how to set that up from point A to B to C.
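
If you've never touched LDAP from code, here's a minimal sketch of binding and searching a directory with plain JNDI. The server URL, bind DN, search base, and filter are invented for illustration:

```java
// Minimal sketch: binding to and searching an LDAP/Active Directory server via JNDI.
// The URL, bind DN, credentials, and search filter are invented placeholders.
import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.NamingEnumeration;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;
import javax.naming.directory.SearchControls;
import javax.naming.directory.SearchResult;

public class LdapBindSketch {
    public static void main(String[] args) throws Exception {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, "ldap://ad.example.com:389");
        env.put(Context.SECURITY_AUTHENTICATION, "simple");
        env.put(Context.SECURITY_PRINCIPAL, "cn=hadoop-svc,ou=services,dc=example,dc=com");
        env.put(Context.SECURITY_CREDENTIALS, "changeit");

        DirContext ctx = new InitialDirContext(env);   // throws if the bind fails
        SearchControls controls = new SearchControls();
        controls.setSearchScope(SearchControls.SUBTREE_SCOPE);
        NamingEnumeration<SearchResult> results =
                ctx.search("dc=example,dc=com", "(sAMAccountName=aoliver)", controls);
        while (results.hasMore()) {
            System.out.println(results.next().getNameInNamespace());
        }
        ctx.close();
    }
}
```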

I would also recommend learning a little about Docker and what it is. Luckily, if you know what Solaris Zones are and can imagine packaging, you probably have a good handle on what Docker does and is.

Most important, you need to learn a bit about machine learning. This is the stuff that can prevent a meat-cloud from munging Excel reports and help you guess, using predictive analytics, where the bodies -- or earthquakes -- are buried. There are several libraries, from Mahout to MLlib, but to set up problems for them to solve, you should understand at least the basics of the techniques and algorithms.
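
To make that concrete, here's a minimal sketch of running k-means clustering with Spark's MLlib Java API. The input path and the choice of k are placeholders -- picking sensible features and parameters is exactly the part the library won't do for you:

```java
// Minimal sketch: k-means clustering with Spark MLlib.
// The HDFS path, feature layout, and cluster count are invented placeholders.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

public class KMeansSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("kmeans-sketch"));

        // Each line is assumed to be space-separated numeric features, e.g. "35.2 -97.5 4.1"
        JavaRDD<Vector> points = sc.textFile("hdfs:///data/quakes.txt").map(line -> {
            String[] parts = line.trim().split("\\s+");
            double[] values = new double[parts.length];
            for (int i = 0; i < parts.length; i++) values[i] = Double.parseDouble(parts[i]);
            return Vectors.dense(values);
        });
        points.cache();

        // Ask for 5 clusters over 20 iterations -- knowing how to pick these is the real skill
        KMeansModel model = KMeans.train(points.rdd(), 5, 20);
        for (Vector center : model.clusterCenters()) {
            System.out.println(center);
        }
        sc.stop();
    }
}
```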

I hope you boned up last year and are ready for these little additions to your knowledge base. I hope Kerberos didn’t bite you too hard or Phoenix didn’t burn you too much. Hadoop is an ever-growing ecosystem and it can be a challenge to keep up, but I believe in you!
