This blog post is inspired by a post called Extreme Data by Mike Pilcher. Mike is COO at SAND, a leading provider of columnar database technology.
The post revolves around Gartner's approach to extreme data. While the concept of “Big Data” focuses on the volume of data, the concept of “Extreme Data” also takes the velocity and the variety of data into account.
So how do we handle data quality when extreme data arrives in great variety, at high velocity, and in huge volumes? Will we be able to chase down every root cause of poor data quality in extreme data and prevent the issues upstream, or will we have to accept the reality of cleansing data downstream, at the time of consumption?
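As a rough illustration of what downstream cleansing at the time of consumption can look like, here is a minimal sketch in Python. The record structure, field names, and cleansing rules are hypothetical, invented only for the example; the point is that fixes are applied lazily as records are read from a fast-moving feed, rather than being corrected at the source.

```python
# A minimal sketch (hypothetical record shape and rules) of downstream
# cleansing applied at the time of consumption: records are normalized
# and validated as they are read, not corrected upstream at the source.
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class CustomerRecord:
    customer_id: str
    email: Optional[str]
    country: Optional[str]


def cleanse_at_consumption(records: Iterable[CustomerRecord]) -> Iterator[CustomerRecord]:
    """Yield records with light, rule-based fixes applied on read."""
    for rec in records:
        # Normalize obvious formatting issues rather than rejecting the record.
        email = rec.email.strip().lower() if rec.email else None
        if email and "@" not in email:
            email = None  # treat an invalid value as missing instead of keeping it
        country = (rec.country or "").strip().upper() or "UNKNOWN"
        yield CustomerRecord(rec.customer_id, email, country)


# Usage: cleanse a high-velocity feed lazily, record by record.
raw_feed = [
    CustomerRecord("42", "  Alice@Example.COM ", "dk"),
    CustomerRecord("43", "not-an-email", None),
]
for clean in cleanse_at_consumption(raw_feed):
    print(clean)
```

The trade-off, of course, is that the root cause is never removed: every consumer either repeats this work or shares a common cleansing layer.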
We might add the rise of extreme data as a sixth reason to the current Top 5 Reasons for Downstream Cleansing.