New LinkedIn Group: Big Data Quality

Do we need a LinkedIn group for this and that? It’s a recurring question. There are already a lot of LinkedIn groups for Big Data and a lot of LinkedIn groups for Data Quality.

However, I do think we see more targeted discussions and engagement in the niche groups on LinkedIn, so yesterday I created a new group about the intersection of Big Data and Data Quality. The group is called Big Data Quality.

It’s good to see a stampede of people joining (well, 39 within the first 24 hours) and to see discussions and comments starting.

So, if you haven’t joined already, please do so here.

And why not take part in the fun, maybe just by voting on the question: How important is data quality for big data compared to data quality for small data?


Social Data vs Sensor Data

The two predominant kinds of big data are:

  • Social data and
  • Sensor data

Social data are data born in the social media realm, such as Facebook likes, LinkedIn updates, tweets and whatever else we as humans enter in the social sphere.

Sensor data are data captured by devices of many kinds, such as radars, sonars, GPS units, CCTV cameras, card readers and many more.

There’s a good saying called “same same but different”, and in my experience it describes the two kinds of big data very well: social data coming directly from a human hand and sensor data born by a machine.

Of course there are humans involved with sensor data as well. It is humans who set up the devices, and sometimes a human makes a mistake when doing so. Raw sensor data are also often manipulated, filtered and censored by humans.

There are indeed data quality issues associated with both kinds of big data, but in slightly different ways. And you surely need to apply master data management (MDM) in order to make sense of both social data and sensor data, as examined in the post Big Data and Multi-Domain Master Data Management.
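To make that a bit more concrete, here is a minimal sketch of linking both kinds of records to a master party record. Everything in it, from the field names to the device registry, is invented for illustration; a real MDM hub would of course use far more robust identity resolution than an email equality check.

```python
# A toy master party collection; all names and fields are invented.
MASTER_PARTIES = {
    "P-001": {"name": "Jane Doe", "email": "jane@example.com"},
}

# Device registry: set up and maintained by humans, which is exactly
# where mistakes can creep into sensor data.
DEVICE_OWNERS = {"cam-42": "P-001"}

def link_social_record(social):
    """Match a social record to a master party, here naively by email."""
    for party_id, party in MASTER_PARTIES.items():
        if social.get("email") == party["email"]:
            return party_id
    return None  # unmatched: a case for data stewardship

def link_sensor_record(sensor):
    """Sensor data rarely carries identity; link via the device registry."""
    return DEVICE_OWNERS.get(sensor.get("device_id"))

print(link_social_record({"network": "twitter", "email": "jane@example.com"}))
print(link_sensor_record({"device_id": "cam-42", "reading": 17.3}))
```

Note how the two paths differ: the social record carries (claimed) identity directly, while the sensor record only gets identity through human-maintained setup data.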

What is your experience: Are social data and sensor data just big data regardless of source? Is it same same but different? Or are social data and sensor data two separate data worlds that just both happen to be big?


Coma, Wetsuit and Dedoop

The sehr geehrte Damen und Herren (dear ladies and gentlemen) at Universität Leipzig (Leipzig University) are doing a lot of research in the data management realm and put some good effort into naming their stuff.

Here are some of the inventions:

COMA is a system for flexible Combination Of schema Matching Approaches. Let’s hope the thing is still alive.

WETSUIT (Web EnTity Search and fUsIon Tool) is a powerful new mashup tool – and what a nice seven-letter abbreviation that doesn’t stick only to the first letters.

Dedoop (Deduplication with Hadoop) is a prototype for entity matching for big data. Phonetically it is, of course, a big “dedupe”.
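Dedoop runs its entity matching as Hadoop jobs; as a minimal single-machine sketch of the underlying idea, the snippet below uses blocking to keep the number of pair comparisons down and then fuzzy string similarity to flag duplicates. It illustrates the technique only and is not Dedoop’s actual API.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

records = [
    {"id": 1, "name": "Universität Leipzig"},
    {"id": 2, "name": "Universitaet Leipzig"},
    {"id": 3, "name": "Leipzig Zoo"},
]

# Blocking: only records sharing a cheap key (here: the last name token)
# are compared, which is what makes pairwise matching feasible at scale.
blocks = defaultdict(list)
for rec in records:
    blocks[rec["name"].split()[-1].lower()].append(rec)

matches = []
for block in blocks.values():
    for a, b in combinations(block, 2):
        if SequenceMatcher(None, a["name"], b["name"]).ratio() > 0.85:
            matches.append((a["id"], b["id"]))

print(matches)  # [(1, 2)] – the two spellings of the university
```

In a Hadoop setting the blocking key becomes the map output key and the pairwise matching happens in the reducers; the logic stays the same.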

Well, you should expect fuzzy abbreviations from this city, as Leipzig means “settlement where the linden trees stand”.


Who Killed Big Data?

No Bulls
Please, no big data bullsh…

I guess everyone is sick and tired of seeing the term “big data” attached to just about everything larger than 1 kilobyte.

But who is responsible? Who do we hold accountable for overusing the term big data? Who killed big data?

Was it first and foremost the vendors who made the kill? A recent blog post called “Big Data is Dead. What’s Next?” by John De Goes suggests that the vendors are to be blamed for stabbing big data from behind.

Could it be the analysts? I have, as mentioned in the post The Big MDM Trend, seen how Gartner (the analyst firm) has put big data forward in the shouting gallery in order to explain something already explained with other terms.

Big data has often been personified by the data scientist. So maybe it was a Californian girl called Jill Dyché who caused the extinction of the data scientist, and thereby big data, when she wrote the blog post called Why I Wouldn’t Have Sex with a Data Scientist.

What do you think? Who killed big data?


Big Reference Data as a Service

This morning I read an article called The Rise of Big Data Apps and the Fall of SaaS by Raj De Datta on TechCrunch.

I think the first part of the title is right while the second part is misleading. Software as a Service (SaaS) will be a big part of Big Data Apps (BDA).

The article also includes a description of LinkedIn merely as a social recruitment service. While recruiters, as reported in the post Indulgent Moderator or Ruthless Terminator?, certainly are visible on this social network, LinkedIn is much more than that.

Among other things LinkedIn is a source of what I call big reference data as examined in the post Social MDM and Systems of Engagement.

Besides social network profiles, big reference data also includes big directory services, being services with large amounts of data about addresses, business entities and citizens/consumers, as told in the post The Big ABC of Reference Data.

Right now I’m working with a Software as a Service solution embracing Big (Reference) Data as a Service, thus being a Big Data App, called instant Data Quality.
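To show what Big Reference Data as a Service looks like from the consuming side, here is a minimal sketch of a client call. The endpoint URL, parameters and response fields are all invented for illustration; they are not the actual instant Data Quality API.

```python
import requests  # third-party HTTP client (pip install requests)

def lookup_business(name, country):
    """Query a hypothetical directory service for matching business entities."""
    resp = requests.get(
        "https://api.example.com/v1/business-search",  # hypothetical endpoint
        params={"name": name, "country": country},
        timeout=10,
    )
    resp.raise_for_status()
    # Assumed response shape: {"candidates": [{"duns": ..., "name": ...}]}
    return resp.json().get("candidates", [])

for candidate in lookup_business("Dun & Bradstreet", "US"):
    print(candidate.get("duns"), candidate.get("name"))
```

The point of the SaaS delivery model is exactly this: the big reference data set stays with the provider, and the consuming application only deals with small, targeted lookups.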

And hey, I have made a pin about that:


Data Quality vs Big Data

If you go to Google Insights for Search and compare the search interest for “data quality” with that for “big data” you’ll get this graph:

“Data quality” (blue line) is a bear market. The interest is slowly but steadily decreasing. “Big data” (red line) is a bull market with a steep rising curve of interest starting in early 2011 and exploding in 2012.

So, what can you do if your blog is about data quality? For my part I’m writing a blog post on my data quality blog mentioning the term “big data” as many times as possible 🙂

I’m not saying “big data” is uninteresting. Not at all. I even use the term “big reference data” when describing how to exploit big directories and social network profiles in the quest for improving party master data quality.

In the short period of the “big data” hype it has often been asked: why should we start working with “big data” when we can’t even manage small data yet?

While this makes some sense, it will in my eyes be a mistake not to explore which data quality techniques we can apply to “big data” and which data quality advantages we can harvest within “big data”.

We have known for years that the amount of data being available is drastically increasing. Now we just have a term to be used when searching for and talking about it. Like it or not; that term is “big data”.


The Big Search Opportunity

The other day Bloomberg Businessweek ran an article titled Facebook Delves Deeper Into Search.

I have always advocated better search functionality as a way to get more business value from your data. That certainly also applies to big data.

In a recent post here on the blog called Big Reference Data Musings, the challenge of utilizing large external data sources for getting better master data quality was discussed. In a comment Greg Leman pointed out that there often isn’t a single source of truth, not even in a huge reference data source such as the Dun & Bradstreet WorldBase, which holds information about business entities from all over the world.

Indeed, our search capabilities must optimally span several sources. In the business directory search realm you may include several sources at a time, for example supplementing the D&B WorldBase with EuroContactPool if you do business in Europe, or with the source called Wiki-Data (being renamed to AvoxData) if you are in financial services and want to utilize the new Legal Entity Identifier (LEI) for counterparty uniqueness in conjunction with other more complete sources.

As examined in the post Search and if you are lucky you will find, combining search on external reference data sources and internal master data sources is a big opportunity too. In doing that you must, as described in the follow-up piece named Wildcard Search versus Fuzzy Search, get the search technology right.
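As a small sketch of that difference, the snippet below contrasts a wildcard match with a fuzzy match over a toy directory. The names and the similarity cutoff are invented for illustration.

```python
from difflib import get_close_matches
from fnmatch import fnmatch

directory = ["Dun & Bradstreet", "Dunn and Bradstreet Inc", "EuroContactPool"]

# Wildcard search only relaxes where the known characters sit, so the
# user must already know the exact spelling in between the wildcards.
print([n for n in directory if fnmatch(n.lower(), "dun & bradstreet*")])

# Fuzzy search tolerates the spelling variation that party master data
# is full of, so even a misspelled query finds its candidates.
print(get_close_matches("Dun and Bradstret", directory, n=2, cutoff=0.6))
```

The wildcard query finds only the one exact spelling it encodes, while the fuzzy query surfaces both directory variants despite the typo.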

I see in the Bloomberg article that Facebook doesn’t intend to completely reinvent the wheel for searching big data, as they have hired a Google veteran, the Danish computer scientist Lars Rasmussen, for the job.


Big Reference Data Musings

The term “big data” is huge these days. As Steve Sarsfield suggests in a blog post from yesterday called Big Data Hype is an Opportunity for Data Management Pros, well, let’s ride the wave (or is it a tsunami?).

The definition of “big data” is, as with many buzzwords, not crystal clear, as examined in a post called It’s time for a new definition of big data on Mike2.0 by Robert Hillard. The post suggests that big may be about volume, but is actually more about big complexity.

As I have worked intensively with large amounts of rich reference data, I have a homemade term called “big reference data”.

Big Reference Data Sets

Reference data is a term often used either instead of master data or as related to master data. Reference data are those data defined and (initially) maintained outside a single organization. Examples from the party master data realm are a country list, a list of states in a given country, or postal code tables for countries around the world.
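As a minimal sketch of how such reference data sets are put to work, the snippet below validates party master data against a toy country list and a few postal code formats. The tables hold only a handful of illustrative entries; real reference data sets are of course far larger.

```python
import re

# Toy extracts standing in for full reference data sets.
ISO_COUNTRIES = {"DK": "Denmark", "GB": "United Kingdom", "US": "United States"}
POSTAL_PATTERNS = {
    "DK": r"^\d{4}$",
    "US": r"^\d{5}(-\d{4})?$",
    "GB": r"^[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}$",
}

def validate_address(country_code, postal_code):
    """Check a postal code against externally maintained reference data."""
    if country_code not in ISO_COUNTRIES:
        return "unknown country code"
    pattern = POSTAL_PATTERNS.get(country_code)
    if pattern and not re.match(pattern, postal_code):
        return "postal code does not fit the national format"
    return "ok"

print(validate_address("DK", "2100"))  # ok
print(validate_address("GB", "2100"))  # does not fit the national format
```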

The trend is that organizations seek to benefit from having reference data in more depth than the often modestly populated lists mentioned above.

An example of a big reference data set is the Dun & Bradstreet WorldBase. This reference data set holds around 300 different attributes describing over 200 million business entities from all over the world.

This data set is at first glance well structured, with a single (flat) data model for all countries. However, when you work with it you learn that the actual data varies a lot depending on the different original sources for each country. For example, addresses from some countries are standardized, while this isn’t the case for other countries. Completeness and other data quality dimensions vary a lot too.
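A simple way to surface those differences is to profile completeness per attribute per country. Below is a minimal sketch over a few invented rows standing in for WorldBase-style records; the attribute names are illustrative only.

```python
from collections import defaultdict

rows = [
    {"country": "US", "name": "Acme Inc", "street": "1 Main St", "sic": "7372"},
    {"country": "US", "name": "Beta LLC", "street": "2 Oak Ave", "sic": None},
    {"country": "IN", "name": "Gamma Ltd", "street": None, "sic": None},
]

# Count filled values per attribute per country: one flat model, but
# very different completeness depending on the original national source.
filled = defaultdict(lambda: defaultdict(int))
totals = defaultdict(int)
for row in rows:
    totals[row["country"]] += 1
    for attr, value in row.items():
        if attr != "country" and value is not None:
            filled[row["country"]][attr] += 1

for country in totals:
    for attr in ("name", "street", "sic"):
        print(country, attr, f"{filled[country][attr] / totals[country]:.0%}")
```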

Another example of a large reference data set is the United Kingdom electoral roll, which is mentioned in the post Inaccurately Accurate. As told in that post, there are fit-for-purpose data quality issues. The data set is pretty big, not least if you span several years, as there is a distinct roll for every year.

Big Reference Data Mashup

Complexity, and opportunity, also arises when you relate several big reference data sets.

Lately DataQualityPro had an interview called What is AddressBase® and how will it improve address data quality? Here Paul Malyon of Experian QAS explains a new combined address reference source for the United Kingdom.

Now, let’s mash up the AddressBase, the WorldBase, the electoral rolls – and all their likes.

Image called Castle in the Sky found on photobotos.


Informatics for adding value to information

Recently the Global Agenda Council on Emerging Technologies within the World Economic Forum made a list of the top 10 emerging technologies for 2012. According to this list, the technology with the greatest potential to provide solutions to global challenges is informatics for adding value to information.

As said in the summary: “The quantity of information now available to individuals and organizations is unprecedented in human history, and the rate of information generation continues to grow exponentially. Yet, the sheer volume of information is in danger of creating more noise than value, and as a result limiting its effective use. Innovations in how information is organized, mined and processed hold the key to filtering out the noise and using the growing wealth of global information to address emerging challenges.”

Big data all over

Surely “big data” is the buzzword within data management these days, and striving for extreme data quality will be paramount.

Filtering out the noise and using the growing wealth of global information will help a lot in our endeavour to make a better world and to do better business.

In my focus area, master data management, we also have to filter out the noise and exploit the growing wealth of information related to what we may call Big Master Data.

Big external reference data

The growth of master data collections is also seen in collections of external reference data.

For example, the Dun & Bradstreet WorldBase, holding business entities from around the world, has lately grown quickly from 100 million entities to over 200 million entities. Most of the growth has been due to better coverage outside North America and Western Europe, with the BRIC countries coming in fast. A smaller world resulting in bigger data.

Also, one of the BRICs, India, is on the way with a huge project for uniquely identifying and holding information about every citizen – that’s over a billion people. The project is called Aadhaar.

When we extend such external registries to social networking services by doing Social MDM, we are dealing with the very fast growing number of profiles on Facebook, LinkedIn and other services.

Surely we need informatics for adding the value of big external reference data into our daily master data collections.


Big Master Data

Right now I am overseeing the processing of yet another master data file with millions of records. In this case it is product master data with customer-master-data-like attributes as well, as we are working with a big pile of author names and related book titles.
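One recurring task with such a file is deriving a match key so the same author spelled in different ways ends up in the same group. Here is a minimal sketch; the normalization rules (strip diacritics and punctuation, sort the name tokens) are deliberate simplifications of what a real matching engine does.

```python
import unicodedata

def author_match_key(name):
    """Reduce an author name to a crude match key."""
    # Strip diacritics so "García Márquez" and its ASCII transliteration
    # land on the same key.
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    tokens = [t.strip(".,").lower() for t in ascii_name.split()]
    return " ".join(sorted(t for t in tokens if t))

print(author_match_key("Gabriel García Márquez"))
print(author_match_key("Marquez, Gabriel Garcia"))  # same key
```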

The Big Buzz

Having such high numbers of master data records isn’t new at all, and compared to the size of the data collections we usually talk about when using the trendy buzzword “big data”, it’s nothing.

Data collections that qualify as big will usually be files with transactions.

However, master data collections are increasing in volume, and most transactions have keys referencing descriptions of the master entities involved in the transactions.
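That dependency is easy to illustrate: every transaction row only makes sense through its keys into the master data, and a dangling key is a data quality issue in itself. A minimal sketch, with all identifiers invented:

```python
# Toy master data collections referenced by the transactions below.
products = {"P42": {"title": "One Hundred Years of Solitude", "author_id": "A7"}}
authors = {"A7": {"name": "Gabriel García Márquez"}}

transactions = [
    {"tx_id": 1, "product_id": "P42", "qty": 2},
    {"tx_id": 2, "product_id": "P99", "qty": 1},  # dangling master data key
]

for tx in transactions:
    product = products.get(tx["product_id"])
    if product is None:
        print(f"tx {tx['tx_id']}: unresolved product key {tx['product_id']}")
    else:
        author = authors[product["author_id"]]["name"]
        print(f"tx {tx['tx_id']}: {tx['qty']} x {product['title']} by {author}")
```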

The growth of master data collections is also seen in collections of external reference data.

For example, the Dun & Bradstreet WorldBase, holding business entities from around the world, has lately grown quickly from 100 million entities to nearly 200 million entities. Most of the growth has been due to better coverage outside North America and Western Europe, with the BRIC countries coming in fast. A smaller world resulting in bigger data.

Also, one of the BRICs, India, is on the way with a huge project for uniquely identifying and holding information about every citizen – that’s over a billion people. The project is called Aadhaar.

When we extend such external registries to social networking services by doing Social MDM, we are dealing with the very fast growing number of profiles on Facebook, LinkedIn and other services.

Extreme Master Data

Gartner, the analyst firm, has a concept called “extreme data” that rightly points out that this “big data” thing is not only about volume; it is also about velocity and variety.

This is certainly true also for master data management (MDM) challenges.

Master data are exchanged between organizations more and more often and in higher and higher volumes. Data quality focus and maturity will probably not be the same among the exchanging parties. The velocity and volume make it hard to rely on people-centric solutions in these situations.

Add to that the increasing variety in master data. The variety may be international variety, as the world gets smaller and we have collections of master data embracing many languages and cultures. We also add more and more attributes each day, as for example governments are releasing more data along with the open data trend, and we generally include more and more attributes in order to make better and more informed decisions.

Variety is also an aspect of Multi-Domain MDM, a subject that according to Gartner (the analyst firm once again) is one of the Three Trends That Will Shape the Master Data Management Market.
