In the first article in this series I wrote about the various ways in which big data needs to be considered over and above the question of which database you are going to host it on, specifically with respect to governance and management of that data. One of the issues I highlighted was trust: if you can't trust the data you are basing your decisions on, then you are effectively up the proverbial creek without the requisite paddle. Of course, this applies to conventional analytic environments as well and, heaven knows, there are plenty of companies still making decisions based on faulty data, though that is often a matter of either wilful ignorance or blindness to reality; they should wake up and smell the coffee - but that's another story.
Anyway, let's suppose that you have got the message about good-quality data with respect to your normal transactional environment. Is there any logical reason why the same shouldn't apply to big data? Of course there are differences: machine-generated data is not usually subject to errors unless there is a fault in the sensor or other device you are using, and many (most?) modern devices can detect when a fault has arisen and raise an alert so that the relevant readings can be ignored. However, machine-generated data is prone to duplication. A simple example is a call detail record (CDR) in a mobile phone environment: if you are moving while phoning and are handed over from one antenna to another, you will generate more than one CDR for the same call. In practice it's more complex than this, but you get the idea. The same applies to other sorts of sensor-based systems, as well as to areas such as capital markets.
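To make the handover problem concrete, here is a minimal sketch of merging duplicate CDRs. The field names and the matching rule (legs of one call sharing a `call_id`, billable time being the sum of per-antenna segments) are my own illustrative assumptions, not a real telco schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CDR:
    call_id: str      # hypothetical: identifier shared by all legs of one call
    antenna: str      # the antenna that logged this record
    duration_s: int   # seconds of the call observed by this antenna

def deduplicate(records):
    """Collapse multiple CDRs for the same call into one record.

    Assumption: a handed-over call produces one CDR per antenna, all
    sharing a call_id, and the true duration is the sum of the segments.
    """
    merged = {}
    for r in records:
        if r.call_id in merged:
            prev = merged[r.call_id]
            merged[r.call_id] = CDR(r.call_id, prev.antenna,
                                    prev.duration_s + r.duration_s)
        else:
            merged[r.call_id] = r
    return list(merged.values())

# One call handed over between two antennas yields two raw CDRs:
cdrs = [CDR("c1", "antenna-A", 40),
        CDR("c1", "antenna-B", 20),
        CDR("c2", "antenna-A", 30)]
print(len(deduplicate(cdrs)))  # 2 distinct calls remain
```

In real networks the matching is fuzzier (timestamps, cell adjacency and so on), which is exactly why this is a governance problem rather than a one-line fix.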
When it comes to unstructured data from social media you also have duplication issues (retweets, for example), as well as abbreviations and spelling mistakes to consider. While you might like to have semantically aware analytic tools, the truth is that these will never cover all the possible combinations of what may crop up. And even if they could, there will need to be human intervention on at least some occasions: does LOL stand for "lots of love" or "laugh out loud"? It's not difficult to imagine situations where the latter is pouring scorn on whatever is being discussed, while the former is simply a sign-off.
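The LOL problem can be illustrated with a toy context heuristic. This is purely a sketch of the idea - the cue words are my own invented assumptions, not output from any real NLP library - and its "ambiguous" branch is the point: some messages genuinely need a human:

```python
# Assumed cue words for each reading of "LOL" - illustrative only.
SIGN_OFF_HINTS = {"thanks", "bye", "see", "soon", "xx"}    # suggests "lots of love"
MOCKERY_HINTS = {"seriously", "sure", "right", "yeah"}     # suggests "laugh out loud"

def expand_lol(message: str) -> str:
    """Guess the expansion of LOL from surrounding words, or give up."""
    words = {w.strip(".,!?").lower() for w in message.split()}
    if words & SIGN_OFF_HINTS:
        return "lots of love"
    if words & MOCKERY_HINTS:
        return "laugh out loud"
    return "ambiguous"  # flag for human review

print(expand_lol("See you soon, LOL"))      # lots of love
print(expand_lol("Yeah, great idea. LOL"))  # laugh out loud
print(expand_lol("LOL"))                    # ambiguous
```

A production system would use trained models rather than word lists, but the structure is the same: automate what you can, route the rest to a person.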
So, the bottom line is that you need data quality - or, more broadly, data governance - for big data just as much as you do for conventional data. It may be nuanced and have slightly different emphases, but the requirement remains. Moreover, because you are likely to have different types of big data to analyse, you will need different approaches depending on the type of data.
Data governance may be further complicated if you are streaming data in real time. For this sort of data, assuming you also need real-time analyses, you are going to have to assume that the data is accurate: you may have time to remove duplicates, and you might have time to detect missing data, but you are certainly not going to have time to do much else.
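What "you may have time to remove duplicates" tends to mean in practice is something cheap and in-memory. The following is a minimal sketch under my own assumptions (duplicates arrive within a short window, events carry an ID); it is not a reference implementation of any particular streaming engine:

```python
from collections import deque

class StreamingDeduper:
    """Sketch of in-flight deduplication for a real-time stream.

    Assumption: duplicates arrive within a short window, so it is enough
    to remember only recently seen event IDs - anything heavier would add
    too much latency for real-time analysis.
    """
    def __init__(self, window_s: float = 5.0):
        self.window_s = window_s
        self.seen = {}        # event_id -> arrival time
        self.order = deque()  # (event_id, arrival time), oldest first

    def accept(self, event_id: str, now: float) -> bool:
        # Forget IDs that have aged out of the window.
        while self.order and now - self.order[0][1] > self.window_s:
            old_id, _ = self.order.popleft()
            self.seen.pop(old_id, None)
        if event_id in self.seen:
            return False      # duplicate: drop, no time for anything fancier
        self.seen[event_id] = now
        self.order.append((event_id, now))
        return True

d = StreamingDeduper(window_s=5.0)
print(d.accept("e1", 0.0))   # True  - first sighting
print(d.accept("e1", 1.0))   # False - duplicate inside the window
print(d.accept("e1", 10.0))  # True  - old sighting has expired
```

Note what this deliberately does not do: no validation, no enrichment, no cross-checking. That is the trade-off the paragraph above describes - in real time you keep only the checks that fit the latency budget.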