The Seven 'Simple' Steps To Big Data

The use of big data analytics in cars could soon lead us to the point where accidents are completely eradicated, but this could lead to a shortage of organ donors in our hospitals. Image credit: Google.

There is a general feeling that big data is a tough job, a big ask… it’s not simply a ‘turn on and use’ technology, as much as the cloud data platform suppliers would love us to think that it is. Typically we find that big data analytics technologies are weighed down by as many regulatory and compliance-related convolutions as they are software tooling complexities. So where to start?

Embedded big data analytics company Pentaho (now a Hitachi Data Systems company) has a new software version just out and a selection of analyst reports to reference, but let’s ignore those factors for now. Instead, let’s look for seven key defining elements to help explain what big data analytics is, what it comprises, how it should be initiated and how it can be used.

The following list comes out of time spent talking with Pentaho executives and customers and, most crucially of all, the big data software application developers who build these things.

1. Big data needs a business rationale

“A defined Line of Business (LoB) function (and therefore a business use case) should be an essential motivation to drive any big data analytics project,” argues Pentaho CEO Quentin Gallivan. “Big data analytics should have a Return on Investment (ROI)-driven initiative behind it; simply trying to use a big data platform as a ‘pure cost play’ to store an overflow of information is not productive.”

Gallivan provided the example of a bank which wanted to move from next day reporting on its financial systems to same day reporting – hence, a business reason existed for bringing big data analytics to bear.

2. Understand data terminology

There’s a lot of terminology in big data, and knowing the difference between some of the basics is a good idea. So (taking ‘what is a database’ as read), as previously explained on Forbes…

“At one end, traditional data warehouses host prepared, structured data; at the other, data lakes provide a repository for raw, native data. Data refineries, which transform raw data and provide the ability to incorporate data sources that are too varied or fast-moving to stage in the data lake, sit between these on the spectrum.”

The data lake is now a ‘thing’ and is part of the big data conversation; the term was coined by Pentaho co-founder James Dixon.

3. Care about data lineage

People care about organic produce these days and data has a kind of provenance factor too. Pentaho partner Cloudera provides a commercialized version of Apache Hadoop with the type of more robust security tooling and certification controls you would expect in a ‘commercial open source’ offering.

Cloudera’s chief strategy officer Mike Olson says that data lineage is a key factor in understanding not just WHEN data happened, but WHAT happened to it.

When you are trying to incorporate big data streams into your information stack within defined governance guidelines, you need to know what the data is – but, crucially, you also need to know which commands were run on it and what other system resources touched it. Data has a life and you need to know something about its birth certificate and diet if you want to look after it.
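To make the ‘birth certificate and diet’ idea concrete, here is a minimal sketch of what a lineage record might hold: where a data set came from, plus each command run on it and the system resource that touched it. This is an illustration only, not how Cloudera or Pentaho implement lineage; the class names, field names and sample values (`clickstream_raw`, `etl-worker-1` and so on) are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class LineageStep:
    """One thing that happened to the data (hypothetical structure)."""
    command: str    # what was run on the data
    resource: str   # which system resource touched it
    at: datetime    # when it happened


@dataclass
class LineageRecord:
    """The data set's 'birth certificate' plus its 'diet' so far."""
    dataset: str
    source: str
    created_at: datetime
    steps: List[LineageStep] = field(default_factory=list)

    def record(self, command: str, resource: str) -> None:
        # Append a timestamped entry every time something touches the data.
        now = datetime.now(timezone.utc)
        self.steps.append(LineageStep(command, resource, now))

    def history(self) -> List[str]:
        # Human-readable trail, in the order the steps were recorded.
        return [f"{s.command} via {s.resource}" for s in self.steps]


# Hypothetical example: a raw clickstream feed that is cleaned, then masked.
rec = LineageRecord("clickstream_raw", "web-logs", datetime.now(timezone.utc))
rec.record("deduplicate", "etl-worker-1")
rec.record("mask_customer_ids", "governance-service")

for line in rec.history():
    print(line)
```

The point of the sketch is that the trail answers both questions raised above: the timestamps say WHEN, and the command and resource fields say WHAT happened and who did it.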

4. The ‘when and where’ factor in big data analytics

Take driverless cars with all their sensors and 360 degree spatial intelligence. Processing all that information back in a cloud datacenter is not a good idea i.e. the controls to avoid the upcoming crash might not get alerted in time to adjust the car. We will start to use more in-memory processing opportunities to process this kind of data ‘in situ’, or it won’t be worth doing.

People say that driverless cars will eventually rid the planet of car accidents. Cars will eventually communicate adverse conditions ahead to a central information bank which will impact the behaviour of the cars three miles back down the road.

The upshot here is that hospitals may now find that they have a lack of donor organs as the ‘car death supply chain’ is a key pipeline for them. The wider implications of big data improvements go further than you think.

5. Correlation does not imply causation

The number one reason for doing data analytics is to improve customer relationships. Forrester analyst Mike Gualtieri presents every year at PentahoWorld and this year his story was George Clooney and the Cheesecake Factory. If George Clooney walked into a Cheesecake Factory restaurant, he would get special treatment based upon who he is and his registered preferences and likes, which are probably quite openly documented. When we walk into the Cheesecake Factory we don’t get special treatment unless big data analytics kicks in and the firm has used intelligence to tag who we are and what we like.

But, warns Gualtieri, when we start matching up big data sets, let's remember that correlation does not always imply causation. His example noted that the divorce rate in Maine tracks the per capita consumption of margarine in the USA very closely, so two seemingly congruent data sets might follow each other for no logical reason at all.
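A short sketch shows how easily this happens. The two series below are synthetic (they are not the real Maine or margarine figures) and are generated independently of each other; the only thing they share is that both drift downwards over ten years. The function names, the slopes and the noise levels are assumptions made for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series.

    Assumes neither series is constant (otherwise the divisor is zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def year_on_year_changes(series):
    """Differences between consecutive values, which removes a linear trend."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


rng = random.Random(42)  # fixed seed so the run is repeatable
years = range(10)

# Two unrelated quantities that both happen to decline over time.
series_a = [5.0 - 0.1 * t + rng.gauss(0, 0.02) for t in years]
series_b = [8.0 - 0.4 * t + rng.gauss(0, 0.10) for t in years]

r_levels = pearson(series_a, series_b)
r_changes = pearson(year_on_year_changes(series_a),
                    year_on_year_changes(series_b))

print(f"correlation of raw levels:           {r_levels:.2f}")
print(f"correlation of year-on-year changes: {r_changes:.2f}")
```

The raw levels correlate strongly purely because of the shared downward trend. Once that trend is removed, what remains is the relationship between two independent noise sources, which on a sample this small can land almost anywhere and says nothing about cause.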

6. Balance ‘new innovation’ with hardened enterprise-grade tech

Pentaho chief product officer Christopher Dziekan explains how his own firm’s ‘main codeline’ is roadmapped out to produce what he calls an ‘enterprise grade’ version of the firm’s software with hardened features, certification and all the whistles and bells that come with ‘commercialized’ versions of open source code.

But, alongside (or perhaps beneath) this main codeline, developed in parallel, are the new and emerging ‘pure research’ type projects that can bring new functions into the total big data analytics capabilities presented. This could be functions like data lineage or new data modelling controls, for example.

The upper tier is where the developers have documented and tested all the APIs so that customer users never get heartburn with system malfunctions. The lower tier, on the other hand, is ‘still emerging’ and comes with more of a caveat emptor (buyer beware) label.

Actually this advice goes for any software, not just big data controls, but the point is well made.

7. Use reference architectures

According to Pentaho, “The big data lake could be a strategic corporate asset if a firm can start to channel this information into a data warehouse and start blending that data into the right Business Intelligence (BI) tools.”

What this means is that if firms are looking to ‘operationalize’ their unstructured ungainly data lakes, they should look for reference architectures to see which use cases have gone before them and learn from others.

Pentaho says that from what is somewhere over 400 deployments of its software, it can basically break big data analytics down into five typical use cases:

Firms that want a 360 degree view of their customers i.e. those that might be looking to blend ERP data with clickstream analysis to find out more about customer buying habits (it’s not just about WHAT customers bought, but it’s about WHAT THEY DID while they were buying).
Big data controls for regulatory and compliance reasons – firms in healthcare and financial services for example.
InfoSec – firms that want to capture ‘event data’ to augment and expand their information security.
The Internet of Things (IoT), as simple as that.
Streamlined data refineries – firms looking to do data management functions that cannot be performed with ‘traditional databases’.

The new Hitachi Data Systems version of Pentaho

So, taking stock, these insights come from spending two days with a set of big data developers and it appears that the Pentaho brand has been left fully intact under its new Hitachi parentage. As a Japanese conglomerate with big interests in everything from nuclear power stations to trains and all the way down to fridges, Hitachi has a lot of use for a big data analytics company, so it’s no surprise to see this purchase go through. That being said, it’s pleasing to see it’s still the same Pentaho, but now with bigger dreams.
