Big Data analysis – everyone claims to be doing it, but is anyone really getting it right?
I sat down with Richard J Self from the University of Derby at the fifth Analytics and Big Data Congress to discuss the future of big data analytics. As the popularity of big data grows in the industry, the accuracy of analysis and the use of the data are coming under more scrutiny.
For a good analysis of big data, Self explained the need for what is known in the data analysis community as a ‘unicorn’.
“Well, typically I think that there are two or three areas that need to be involved [in good analysis], one of them is that in the terms of the team involved in doing data analytics you’ve got to have the technical skills to extract the data and to load it into the analysis system.
“Then someone with the skills to do some statistics, the regression analysis, the correlation, whatever you choose, and then maybe some predictors because the decision makers want to know what’s happening in the future. Then you need to have someone who can convey the message to the decision makers.”
“So you’ve got 4 or 5 different skill sets or if you’re lucky you’ve got the unicorn – the one person with all of the hard technical skills and all of the soft skills, the problem understanding, problem solving, story telling and so on.”
Unfortunately, those who have exemplary skills on the analysis end often don’t possess the communication and storytelling skills required to translate the results of big data analysis into usable information.
There is a shortage of ‘unicorn’ analysts – and even of plain old workhorse analysts, in fact. According to recent research, demand for data analysts is ten times the number of graduates UK universities are producing.
Demand for data analysts is exceeding supply because of the sheer volume of big data now being produced.
“Most of the big data is from our corporate execution/operations systems, Internet of Things and social media. As I’m discovering now with location data – all of the data we have is of uncertain veracity – we don’t know which elements are true and which elements are false or inaccurate.”
“The real problem in cleaning data is in identifying data that is erroneous, that is not correct and then removing that from the analytical process – typically a statistician would say ‘oh we’ll get rid of the outliers and we’ll just use the stuff that falls within the normal distribution’ but what we are discovering in business is that the outliers are sometimes incredibly important.”
In the past, data analysts and statisticians tended to discount outlying data as useless and as a distortion of the mean. Big data analytics has more recently learnt that those outlying data points can in fact be important, as they help to test the validity of the entire data set. A single data point lying outside the curve may indicate a whole set of data that has been overlooked – or be a beacon for an upcoming issue.
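To make the point about outliers concrete, here is a minimal sketch of setting unusual values aside for review rather than deleting them. It assumes Tukey's interquartile-range rule as the test for "unusual", which is one common choice and not something Self prescribed. The `split_outliers` helper and the `daily_orders` figures are invented for illustration.

```python
from statistics import quantiles


def split_outliers(values, k=1.5):
    """Split values into (inliers, outliers) using Tukey's IQR fences.

    Nothing is thrown away: the outliers come back as their own list
    so that someone can look at them before deciding what they mean.
    """
    q1, _, q3 = quantiles(values, n=4)  # quartiles of the data
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    inliers = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return inliers, outliers


# Hypothetical daily order counts; the last day is wildly out of line.
daily_orders = [102, 98, 105, 99, 101, 97, 103, 100, 480]
inliers, outliers = split_outliers(daily_orders)
print("for the usual statistics:", inliers)
print("for a human to investigate:", outliers)
```

The design choice is simply to return two lists instead of one, so the odd day stays visible to the business rather than vanishing from the analysis.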
As we begin to analyse things like sentiment from tweets, a problem arises: humans can identify irony, but we can’t yet teach a text analytics system to do the same. The system cannot be taught that a sequence of words is sincere from one person but ironic from another. There are also issues around colloquial language and spelling errors. It’s all well and good to teach a system to correct basic spelling errors, but Twitter often reads like a whole different language because of its character limit.
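A toy example shows why irony is so awkward for automated sentiment analysis. The sketch below assumes a deliberately naive word-counting scorer, far simpler than any commercial system. The word lists, the `naive_sentiment` function and both example tweets are made up for illustration.

```python
# Tiny, hypothetical sentiment lexicons.
POSITIVE = {"great", "love", "brilliant"}
NEGATIVE = {"delayed", "broken", "awful"}


def naive_sentiment(text):
    """Score text as (positive word count) minus (negative word count)."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    positives = sum(w in POSITIVE for w in words)
    negatives = sum(w in NEGATIVE for w in words)
    return positives - negatives


sincere = "Love the new timetable, brilliant work"
ironic = "Oh great, another brilliant Monday commute"

# Both tweets contain two "positive" words and no "negative" ones,
# so the scorer cannot tell the sincere one from the ironic one.
print(naive_sentiment(sincere), naive_sentiment(ironic))
```

A human reader hears the sarcasm in the second tweet immediately; a scorer that only sees words gives both the same mark.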
“Big data is, it seems the latest magic silver bullet, it has certainly been high on the Gartner hype curve – but now it’s dropping down and actually being actioned in business. We know that very large numbers of big organisations are using big data – or they claim they are in surveys.”
“Many big organisations are able to currently demonstrate improvements in revenue, profitability, and stock market evaluation through excellent execution of analytics. However I’m not entirely sure how representative all of that data is because it was said at the IBM insights conference in Las Vegas in November that 60% of big data analytics projects are actually not very successful.”
Despite all the hype around big data, no one really seems to be talking about the privacy issues around that data. Clearly it is something we should be worried about because, let’s be honest with ourselves, for the most part we’re not great at protecting our data.
One problem around data security and big data lies in things like the data produced by wearable devices. Technically, the data you share with an insurance company is your data, because it falls under data protection rules as being attributable to a living person.
An awful lot of new data streams coming from wearable technology are bringing up a whole new suite of problems. Let’s say you own a watch with sensors in it that track all of your health and exercise data, and you’ve agreed to let that data go back to the organisation you purchased the watch from so it can be analysed.
What about when, in a year’s time, you upgrade that smart watch and sell your old one on – have you sold your data on with it? How do you clean the data out of that device? Is it even possible and, if so, how do you disconnect it from you?
What systems are being put in place to protect your wearable data and differentiate you from the new owner? Your smart watch location tracking data is probably going to show that you spend almost every night at a certain set of coordinates – if someone gets that data they can easily deduce where you live.
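How easy is that deduction? The sketch below is a rough illustration under stated assumptions: location pings arrive as (hour of day, latitude, longitude), "night" means 22:00 to 06:00, and rounding coordinates to three decimal places groups pings from the same building. The `likely_home` function and the coordinates are invented and do not belong to any real person or device.

```python
from collections import Counter


def likely_home(pings, night_start=22, night_end=6, precision=3):
    """Return the most common rounded (lat, lon) among night-time pings.

    Returns None if there are no night-time pings at all.
    """
    night = [
        (round(lat, precision), round(lon, precision))
        for hour, lat, lon in pings
        if hour >= night_start or hour < night_end
    ]
    if not night:
        return None
    return Counter(night).most_common(1)[0][0]


# Hypothetical pings: (hour, latitude, longitude).
pings = [
    (9, 52.9380, -1.4960),   # daytime, somewhere else
    (13, 52.9381, -1.4962),  # daytime, somewhere else
    (23, 52.9219, -1.4762),  # night
    (2, 52.9221, -1.4758),   # night, same building after rounding
    (5, 52.9220, -1.4760),   # night, same building after rounding
    (23, 52.9500, -1.5100),  # one night spent elsewhere
]
print(likely_home(pings))
```

Nothing here needs a name or an account number: a handful of timestamps and coordinates is enough to point at one address.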
The problem is that we don’t trust anyone to protect our data – and we are reasonably certain that anonymising doesn’t work very well. Data can be tracked back through simple online services and information. In some cases it can even be traced back to an appliance in your house – if you have a smart fridge, it has to be chattering to something.
When you’re on holiday and your smart house is sending out data that says all your lights are off and you haven’t used any heating in the past 3 weeks – it becomes very easy for someone to see that your house is unoccupied and decide to go snoop around.
As a whole, we’re not very good at personal data security, and big data analytics currently hasn’t found a way to help us with that.
Self did explain that it is helping with other forms of security, though. Many big banks are getting very good at big data analysis that can, and does, detect where fraud is occurring within their systems. They are able to supply this information to the police and direct them to the ATM that is most likely to be hit next.
Self has said that it is impossible to train a ‘unicorn’ in a three-year degree. There is a distinct need for universities to start understanding the different roles involved in a data analytics team, and to start producing graduates who are employable at the different hard-skill levels required while also training them in the soft skills, particularly communication.
Big data analytics unicorns are the dream of the future – but for now, as an industry, we will have to struggle on with teams of workhorses that have around a 40% success rate.