The Big Data Secret of SPECTRE

I’m sorry if this blog is turning into a travel blog. But here’s a third Paris story.

Boulevard Haussmann is one of the city’s great thoroughfares (to use the right metadata term) and is known to be where you can find the headquarters of SPECTRE.

While visiting SPECTRE today I learned a lot about how SPECTRE is exploiting big data as an important way of keeping up with the tough competition in its industry sector. But all that is, of course, a secret.

When I asked if they still have trouble with Bond, the answer was:

[Image: Barry Nelson as Jimmy Bond in 1954. Jimmy Bond when he was a field agent.]

“Bond? – Jimmy Bond? – The sexy data scientist who is working for NSA?”

“Oh no,” I replied. “James Bond.”

“Oh, yes,” the SPECTRE chief data manipulator replied. “He was with British Intelligence. But he has been moved to the EU Data Protection Service. He just got his license to fine. Now 2% and soon 5% of our global turnover each time. Very dangerous man. Very dangerous.”


Growing Variety in Big Master Data

With the rise of big data we will see that master data is going to be Small Data with Big Impact.

Master data itself is going to grow in terms of volume and velocity. This is because we will have to manage more types of master data in order to make sense of big data. Notable examples are:

  • We will have to identify more locations in order to make sense of the geospatial attributes in big data.
  • We will be forced to manage some attributes of our competitor’s product master data, besides our own product master data, in order to listen to the talk in the social media stream.
  • We will need to take care of more party master data roles. Besides the classic party master data roles of real-world entities being customers, suppliers and employees, we will have to care about subscribers, users and visitors of online services, followers and friends in social media, and the spouses, relatives, friends of friends and other influenced parties of those.

[Image: Party roles]

It’s probably not the volume, and maybe not even the velocity, that will be the big issue here. It’s the variety in the data supporting the processes of caring about those entities that is the huge challenge, not least for ensuring the veracity of the master data. A minimal sketch of such a multi-role party model follows below.
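To make the variety challenge a bit more concrete, here is a minimal Python sketch of a party master data model where one real-world entity may hold several of the roles listed above. The class, role names and sample data are illustrative assumptions of mine, not taken from any actual MDM product.

```python
from dataclasses import dataclass, field

# Classic and newer party master data roles from the list above (illustrative).
CLASSIC_ROLES = {"customer", "supplier", "employee"}
NEW_ROLES = {"subscriber", "user", "visitor", "follower", "friend",
             "spouse", "relative", "friend_of_friend"}

@dataclass
class Party:
    """One real-world entity that may play many roles at the same time."""
    party_id: str
    name: str
    roles: set = field(default_factory=set)

    def add_role(self, role: str) -> None:
        if role not in CLASSIC_ROLES | NEW_ROLES:
            raise ValueError(f"Unknown party role: {role}")
        self.roles.add(role)

# The same person may be a customer of ours and a follower in social media.
p = Party("P-0001", "Ann Example")
p.add_role("customer")
p.add_role("follower")
print(p.roles)  # {'customer', 'follower'}
```

The point of the sketch is that the volume is trivial; it is the growing set of roles, and the different processes behind each role, that make the variety hard.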


What’s New in The Data Quality Magic Quadrant?

The Gartner Magic Quadrant for Data Quality Tools 2013 is out. If you don’t want to pay Gartner’s fee to have a look, you can sign up for a free copy on one of the vendors’ websites, for example here at Trillium Software Insights.

So, what’s new this year?

It is pretty much the same picture as last year, with X88 as the only new intruder. Otherwise the news is that some vendors “now appear under slightly different names”. And now Ted Friedman is the only author.

The most exciting part, in my eyes, is what is said about how the market will develop. Some observed and foreseen trends are:

  • Information governance programs drive the need for data quality tools.
  • Cloud based deployments are gaining traction.
  • Growth is expected in embracing less-structured data, not least social data, by using big data techniques and sources.

That’s good news.

[Image: Data Quality Tools]


Entity Resolution and Big Data

The Wikipedia article on Identity Resolution has this take on the difference between good old data matching and Entity Resolution:

”Here are four factors that distinguish entity resolution from data matching, according to John Talburt, director of the UALR Laboratory for Advanced Research in Entity Resolution and Information Quality:

  • Works with both structured and unstructured records, and it entails the process of extracting references when the sources are unstructured or semi-structured
  • Uses elaborate business rules and concept models to deal with missing, conflicting, and corrupted information
  • Utilizes non-matching, asserted linking (associate) information in addition to direct matching
  • Uncovers non-obvious relationships and association networks (i.e. who’s associated with whom)”

I have a gut feeling that Data Matching and Entity (or Identity) Resolution will merge in the future, as expressed in the post Deduplication vs Identity Resolution.

If you look at the factors mentioned above that distinguish entity resolution from data matching, some of the often mentioned features of the new big data technology shine through:

  • Working with unstructured and semi-structured data is probably the most mentioned difference between working with small data and working with big data.
  • Working with associations is a feature of graph databases and similar technologies, as mentioned in the post Will Graph Databases become Common in MDM? (see the sketch after this list)
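As an illustration of how those distinguishing factors could look in practice, here is a minimal Python sketch that combines direct matching on a normalized attribute with asserted (associate) links, and then walks the resulting graph to uncover who is associated with whom. The records, the matching rule and the asserted-link source are all made-up assumptions for the example, far simpler than the elaborate business rules and concept models real entity resolution uses.

```python
from collections import defaultdict

# Toy records and an asserted-link source (both made up for illustration).
records = {
    "r1": {"name": "J. Bond", "email": "jb@example.com"},
    "r2": {"name": "James Bond", "email": "JB@EXAMPLE.COM"},
    "r3": {"name": "Miss Moneypenny", "email": "mp@example.com"},
}
asserted_links = [("r2", "r3")]  # e.g. stated co-workers from another source

graph = defaultdict(set)

# 1) Direct matching: records sharing a normalized email are linked.
by_email = defaultdict(list)
for rid, rec in records.items():
    by_email[rec["email"].strip().lower()].append(rid)
for ids in by_email.values():
    for a in ids:
        for b in ids:
            if a != b:
                graph[a].add(b)

# 2) Asserted linking: associations taken from another source, not matching.
for a, b in asserted_links:
    graph[a].add(b)
    graph[b].add(a)

# 3) Association network: everything reachable from a record, i.e. the
#    non-obvious relationships, found with a simple graph traversal.
def network(start):
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph[node])
    return seen

print(network("r1"))  # {'r1', 'r2', 'r3'}: r1 reaches r3 only via the link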

So, in the quest to expand small data matching into Entity (or Identity) Resolution, we will be helped by general developments in working with big data.


Big Data Veracity

Veracity is often mentioned as the 4th V of big data besides Volume, Velocity and Variety.

While veracity is of course paramount for a data quality geek like me, veracity is a different kind of thing compared to volume, velocity and variety: those three terms define big data, while veracity is more a desirable capability of big data. This argument is often made by Doug Laney of Gartner (the analyst firm), who is behind the Volume, Velocity and Variety concept, which was also coined as Extreme Data at some point.

[Image: Doug Laney on veracity, from a comment in a discussion on the Big Data Quality LinkedIn group]

As mentioned in the post Five Flavors of Big Data, the challenges with data quality – or veracity – are very different for the various types of big data. If I were to order the mentioned types of big data, going from some veracity challenges to huge veracity challenges, it would be:

  • Big reference data
  • Big transaction data
  • Web logs
  • Sensor data
  • Social data

It’s interesting that variety arguably follows the same increasing order, while volume and velocity don’t necessarily follow it, apart from big reference data being less challenging in all respects and therefore maybe not big data at all. However, I like to count it as big data, because big reference data will, in my eyes, play a big role in solving the veracity challenge for the other types of big data.


Five Flavors of Big Data

We often talk about big data as if it were one kind of data, while in fact we need separate approaches to handling, for example, data quality issues with different sorts of big data.

[Image: Big Data Quality. Join the Big Data Quality group on LinkedIn.]

In the following I will go through some different types of big data and share some observations related to data quality.

Social data

The most mentioned type of big data is, I guess, social data, and the opportunity to listen to Twitter streams and Facebook status updates in order to get better customer insight is an often stated business case for analyzing big data.

However, everyone who listens to those data will be aware of the tremendous data quality problems in doing so, as told in the post Crap, Damned Crap and Big Data.

Sensor data

Another often mentioned type of big data is sensor data. As examined in the post Social Data vs Sensor Data, sensor data are somewhat different from social data, with less complex data quality issues, but they are not at all free of data quality flaws, as reported in the post Going in the Wrong Direction.

Web logs

Following the clicks from people surfing the internet is a third type of big data. This kind of big data shares characteristics with both social data and sensor data: web logs are human generated like social data, but more fact oriented like sensor data.

Big transaction data

Even traditional transaction data in huge volumes are treated as big data, but they of course inherit the same data quality challenges as all transaction data: even though the data are structured, we may have trouble getting the right relations to the who, what, where and when in the transactions. And that doesn’t get easier with large volumes. A small sketch of such a relation check follows below.
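As a concrete illustration, here is a minimal Python sketch of checking that the who, what and where in each transaction resolve to known master data entities. All ids and records are made up for the example.

```python
# Master data entities the transactions should relate to (made-up ids).
customers = {"C1", "C2"}          # who
products  = {"P10", "P11"}        # what
locations = {"L-DK", "L-UK"}      # where

transactions = [
    {"id": "T1", "who": "C1", "what": "P10", "where": "L-DK", "when": "2013-10-01"},
    {"id": "T2", "who": "C9", "what": "P10", "where": "L-UK", "when": "2013-10-02"},
]

def unresolved(tx):
    """Return which of the who/what/where relations don't resolve."""
    checks = {"who": customers, "what": products, "where": locations}
    return [role for role, master in checks.items() if tx[role] not in master]

for tx in transactions:
    issues = unresolved(tx)
    if issues:
        print(tx["id"], "has unresolved relations:", issues)  # T2 -> ['who']
```

The check itself is trivial; the trouble is that at big data volumes every unresolved relation multiplies into large amounts of transactions you cannot trust.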

Big reference data

When reference data grow big, we also meet big complexity. Try, for example, to build a reference data set with all the valid postal addresses in the world. Several standardization bodies are having a hard time making a common model for that right now. Learn about other examples of big reference data and the related complexity in the post Big Reference Data Musings. A toy example of checking data against such a reference set follows below.
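Here is a toy Python sketch of validating addresses against such a reference data set. The two reference entries and the simple normalization are my own illustrative assumptions; a real global address reference would need far richer country-specific models, which is exactly the complexity mentioned above.

```python
# Toy reference data set of valid postal addresses (made-up entries).
reference_addresses = {
    ("FR", "75008", "BOULEVARD HAUSSMANN"),
    ("UK", "SW1A 2AA", "DOWNING STREET"),
}

def normalize(country, postal_code, street):
    """Naive normalization; real address models are country specific."""
    return (country.upper().strip(), postal_code.upper().strip(),
            street.upper().strip())

def is_valid(country, postal_code, street):
    return normalize(country, postal_code, street) in reference_addresses

print(is_valid("fr", "75008", "Boulevard Haussmann"))  # True
print(is_valid("fr", "75008", "Boulevard Hausmann"))   # False: one 's' missing
```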


How can you have any pudding….

The social media sphere these days has a lot of good stuff around Data Quality and Big Data, including this piece from Jim Harris called Big Data is Just Another Brick in the Wall.

In here Jim ponders how working with Big Data must be built on a lot of other disciplines, including Data Quality, and the title of the blog post is nicely composed from the title of the fantastic Pink Floyd song Another Brick in the Wall.

In this song there is the unpleasant voice of an angry, stupid old teacher yelling:

“If you don’t eat yer meat, you can’t have any pudding. How can you have any pudding if you don’t eat yer meat?”

I’m afraid I also have to raise an equally unpleasant voice and say:

“If you don’t eat yer data quality, you can’t have any big data. How can you have any big data if you don’t eat yer data quality?”

And by the way: How can you work with big data if you don’t join the LinkedIn group called Big Data Quality?


Counting Citizens

A main story on the BBC this morning is about how the collection of UK migration figures is not fit for purpose, as reported on the BBC website here.

The problem is that the measuring of who is going in and out of the country was designed for different purposes, like measuring tourism and fighting terrorism.

Some different solutions have been mentioned:

  • The “oh no” solution: More data collection
  • The shiny new solution: Big Data
  • The unwanted solution: Master Data Management

The “oh no” solution: More data collection

Imagine having to fill in endless forms with rigid checks when going in and out of airports and ferry ports, adding to the checks and security controls already in place. Oh no.

The shiny new solution: Big Data

A system for collecting data from passenger lists on ferries and airplanes, called e-Borders, is already being implemented, and there are hopes that joining this new big data with the old system of record will improve accuracy. Oh, really.

The unwanted solution: Master Data Management

As said in an expert interview on TV, the only sustainable solution is a central citizen registry – a solution not unknown to immigrants like me coming from Scandinavia. However, as reported here, this solution is unwanted in the UK.


OK, so big data is about size (and veracity)

During the rise of the term “big data” there have been a lot of different definitions around, trying to express shortly what this very popular term really is about. A lot of these definitions have included the sentiment that big data is not (only) about size. The three V’s, being Volume, Variety and Velocity, have been very popular. A fourth V, being Veracity, has been added, though this is hardly a definition of big data but rather a desirable capability of big (and any other) data.

But apparently big data is about size.

The Oxford English Dictionary has now included big data in its authoritative explanation of English words and terms, and big data is:

“Data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges”.

It’s interesting that the challenges that make data big are not about analyzing the data, but about data manipulation and data management. These are, by the way, things you do to achieve veracity.


Social Score Credibility

A recent piece from Fliptop is called What’s the Score. It is a thorough walkthrough of what is usually called social scoring, done in influence scoring platforms within social media, where Klout, Kred and PeerIndex are the best known services of that kind.

The Fliptop piece has a section on faking, which was also the subject of a recent post on this blog. That post, called Fact Checking by Mashing Up, is about how to link social network profiles with other known external sources in order to detect cheating. Linking social network profiles with other external and internal sources is what is known as Social MDM, a frequent subject on this blog for several years. A small sketch of such a cross-check follows below.
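As a rough illustration of that mash-up idea, here is a minimal Python sketch that cross-checks the claims on a social network profile against another known source and reports the disagreements. The field names, values and sources are hypothetical assumptions for the example.

```python
# Hypothetical profile claims and an external reference source (made up).
profile  = {"name": "John Smith", "company": "Acme Ltd", "city": "London"}
external = {"name": "John Smith", "company": "Acme Ltd", "city": "Leeds"}

def cross_check(claimed, reference):
    """Return the fields where the profile disagrees with the reference."""
    return {k: (claimed[k], reference[k])
            for k in claimed.keys() & reference.keys()
            if claimed[k] != reference[k]}

print(cross_check(profile, external))  # {'city': ('London', 'Leeds')}
```

In a Social MDM setup the disagreements would of course feed a survivorship or fraud-flagging process rather than just being printed.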

A social score must of course be seen in context, as it matters a lot what you are influential about when you want to use social scoring for business. As told in the post Klout Data Quality, this was a challenge two years ago, and it probably still is. Here too, I think linking with other (big) data sources, with Social MDM as the hub, will help.

[Image: Kred scores, taken from Kred on my Twitter handle]

PS: I have no idea why moron ended up there. Einstein is OK.
