Social Data vs Sensor Data

Social data sensor data big dataThe two predominant kinds of big data are:

  • Social data and
  • Sensor data

Social data are data born in the social media realm such as facebook likes, linkedin updates, tweets and whatever the data entry we as humans do in the social sphere is called.

Sensor data are data captured by devices of many kinds such as radar, sonar, GPS unit, CCTV Camera, card reader and many more.

There’s a good term called “same same but different” and this term does also in my experience very well describe the two kinds of big data: The social data coming directly from a human hand and the sensor data born by a machine.

Of course there are humans involved with sensor data as well. It is humans who set up the devices and sometimes a human makes a mistake when doing so. Raw sensor data are often manipulated, filtered and censored by humans.

There is indeed data quality issues associated with both kinds of big data, but in slightly different ways. And you surely need to apply master data management (MDM) in order to make some sense of both social data and sensor data as examined in the post Big Data and Multi-Domain Master Data Management.

What is your experience: Is social data and sensor data just big data regardless of source? Is it same same but different? Or are social data and sensor data two separated data worlds just both being big?

Bookmark and Share

Where is the Spot?

One of things we often struggle with in data quality improvement and master data management is postal addresses. Postal addresses have different formats around the world, names of streets are spelled alternatively and postal codes may be wrong, too short or suffer from other flaws.

An alternative way of identifying a place is a geocode and sometimes we may think: Hurray, geocodes are much better in uniquely identifying a place.

Well, unfortunately not necessarily so.

First of all geocodes may be expressed in different systems. The most used ones are:

  • Latitude and longitude: Even though the globe is not completely round, this system for most purposes is good for aligning positions with the real world.
  • UTM: When the world is reflected on a paper or on a computer screen it becomes flat. UTM reflects the world on a flat surface very well aligned with the metric system making distance calculations straight forward.
  • WGS: This is the system in use in many GPS devices and also the one behind Google Maps.

Next, where is the address exactly placed?

I have met at least three different approaches:

  • It could be where the building actually is and then if the precision is deep and/or the building is big on different places around the building.
  • It could be where the ground meets a public road. This is actually most often the case, as route planning is a very common use case for geocodes. The spot is fit for the purpose of use so to say.
  • It could, as reported in the post  Some Times Big Brother is Confused, be any place on (and beside) the street as many reference data sources interpolates numbers equally along the street or in other ways gets it wrong by keeping it simple.

Bookmark and Share

Geocoding from 100 Feet Under

I stumbled upon this image posted by Ellie K. on Google+

The title is World map of Flickr and Twitter locations and the legend is that red dots are locations of Flickr pictures, blue dots are locations of Twitter tweets and white dots are locations that have been posted to both.

You may be able to see your city following this link.

For example Copenhagen looks like this:

Here you have Copenhagen in Denmark to the left and Malmoe in Sweden to the right.

The strip between is the fixed link known as the Øresund Bridge.

However the connection isn’t entirely a bridge. If you look at a flyover picture you may think that there wasn’t money enough to finish the connection. Fortunately there was. The part closest to Copenhagen Airport is a 4 kilometer (2.5 miles) undersea tunnel.

So what puzzles me is the dots apparently representing Flickr uploads and tweets made from the tunnel. Are you able to upload to Flickr from down there? How are the tweets geocoded with that precision? My GPS never works when passing the tunnel.

(PS: I know you may geotag when back to surface)

Bookmark and Share

Information and Data Quality Blog Carnival, February 2010

El Festival del IDQ Bloggers is another name for the monthly recurring post of selected (actually rather submitted) blog posts on information and data quality started last year by the IAIDQ.

This is the February 2010 edition covering posts published in December 2009 and January 2010.

I will go straight to the point:

Daragh O Brien shared the story about a leading Irish Hospital that has come under scrutiny for retaining data without any clear need. This highlights an important relationship between Data Protection/Privacy and Information Quality. Daragh’s post explores some of this relationship through the “Information Quality Lense”. Here’s the story: Personal Data – an Asset we hold on Trust.

Former Publicity Director of the IAIDQ, Daragh has over a decade of coal-face experience in Information Quality Management at the tactical and strategic levels from the Business perspective. He is the Taoiseach (Irish for chieftain) of Castlebridge Associates. Since 2006 he has been writing and presenting about legal issues in Information Quality amongst other topics.

Jim Harris is an independent consultant, speaker, writer and blogger with over 15 years of professional services and application development experience in data quality. Obsessive-Compulsive Data Quality is an independent blog offering a vendor-neutral perspective on data quality.

If you are a data quality professional, know the entire works by Shakespeare by heart and are able to wake up at night and promptly explain the theories of Einstein you probably know Jim’s blogging. On the other hand: If you don’t know Shakespeare, don’t understand Einstein, then: Jim to the rescue. Read The Dumb and Dumber Guide to Data Quality.

In another post Jim discusses the out-of-box-experience (OOBE) provided by data quality (DQ) software under the title: OOBE-DQ, Where Are You? Jim also posted part 8 of Adventures in Data Profiling – a great series of knowledge sharing on this important discipline within data quality improvement.

Phil Wright is a consultant based in London, UK who specialises in Business Intelligence and Data Quality Management.  With 10 years experience within the Telecommunications and Financial Services Industries, Phil has implemented data quality management programs, led data cleansing exercises and enabled organisations to realise their data management strategy.

The Data Factotum blog is a new blog in the Data Quality blogosphere, but Phil has kick started with 9 great posts during the first month. A balanced approach to scoring data quality is the start of a series on the topic of using the balanced scoreboard concept in measuring data quality.

Jan Erik Ingvaldsen is a colleague and good friend of mine. In a recent market competition scam cheap flight tickets from Norwegian Air Shuttle was booked by employees from competitor Cimber Sterling using all kinds of funny names. As usual Jan Erik not only has a nose for a good story but he is also able to propose the solutions as seen here in Detecting Scam and Fraud.

In his position as Nordic Sales Manager at Omikron Data Quality Jan Erik actually is a frequent flyer at Norwegian Air Shuttle. Now he is waiting whether he will be included on their vendor list or on the no-fly list.

William Sharp is a writer on technology focused blogs with an emphasis on data quality and identity resolution.

Informatica Data Quality Workbench Matching Algorithms is part of a series of postings were William details the various algorithms available in Informatica Data Quality (IDQ) Workbench. In this post William start by giving a quick overview of the algorithms available and some typical uses for each. The subsequent postings gets more detailed and outline the math behind the algorithm and will finally be finished up with some baseline comparisons using a single set of data.

Personally I really like this kind of ready made industrial espionage.

IQTrainwrecks hosted the previous blog carnival edition. From this source we also has a couple of postings.

The first was submitted by Grant Robinson, the IAIDQ’s Director of Operations. He shares an amusing but thought provoking story about the accuracy of GPS systems and on-line maps based on his experiences working in Environmental Sciences. Take a dive in the ocean…

Also it is hard to avoid including the hapless Slovak border police and their accidental transportation of high explosives to Dublin due to a breakdown in communication and a reliance on inaccurate contact information. Read all about it.

And finally, we have the post about the return of the Y2k Bug as systems failed to properly handle the move into a new decade, highlighting the need for tactical solutions to information quality problems to be kept under review in a continuous improvement culture in case the problem reoccurs in a different way. Why 2K?

If you missed them, here’s a full list of previous carnival posts:

April 2009 on Obsessive-Compulsive Data Quality by Jim Harris

May 2009 on The DOBlog by Daragh O Brien

June 2009 on Data Governance and Data Quality Insider by Steve Sarsfield

July 2009 on by Andrew Brooks

August 2009 on The DQ Chronicle by William E Sharp

September 2009 on Data Quality Edge by Daniel Gent

October 2009 on Tooling around in the IBM Infosphere by Vincent McBurney

November 2009 on by IAIDQ

Bookmark and Share