The Cases for UPPER CASE in Data Management

I remember some years ago, when I started SMS’ing, I had an old mobile phone that defaulted to upper case text. After a while my son answered back: “Why are you always yelling at me in SMSes?”

So I learned that you can use lower case in SMSes as well, and that using only all caps in SMSes, as in any other writing, usually means that YOU ARE YELLING.

Today, examining a text for upper case use can feed into sentiment analysis, alongside polarity classifiers and all that jazz, for example on social media data.
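
As a minimal sketch of what such an upper case signal could look like, the snippet below measures the share of letters that are upper case and flags a text as "shouting" above a cut-off. The function names, the 0.8 threshold and the minimum of four letters are assumptions chosen for illustration, not values from any particular sentiment tool.

```python
def upper_case_ratio(text: str) -> float:
    """Share of alphabetic characters that are upper case (0.0 if no letters)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch.isupper() for ch in letters) / len(letters)


def looks_like_shouting(text: str, threshold: float = 0.8, min_letters: int = 4) -> bool:
    """Flag a text as shouting when most of its letters are upper case.

    threshold and min_letters are illustrative assumptions; min_letters
    keeps short tokens such as "OK" from being flagged.
    """
    letter_count = sum(ch.isalpha() for ch in text)
    return letter_count >= min_letters and upper_case_ratio(text) >= threshold


print(looks_like_shouting("YOU ARE YELLING"))
print(looks_like_shouting("Why are you always yelling at me"))
```

In practice a ratio like this would be one feature among several, since acronyms and all caps source systems also produce upper case text without any yelling involved.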

Within data parsing, words in upper case in person names may tell you something too. In France especially it is common to write the surname in upper case characters only, so in the name “AUGUST Michel” the first word is the surname and the last word is the given name.
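
The upper case surname convention can be sketched as a small parsing rule: treat the all caps words as the surname and the rest as the given name. The function names and the returned dictionary shape are assumptions for illustration; when every word (or no word) is in upper case, the rule has nothing to go on and returns None.

```python
from typing import Optional


def is_all_caps(token: str) -> bool:
    """True if the token has at least two letters and all of them are upper case."""
    letters = [ch for ch in token if ch.isalpha()]
    return len(letters) >= 2 and all(ch.isupper() for ch in letters)


def split_by_upper_case_surname(full_name: str) -> Optional[dict]:
    """Split a name into surname and given name using the all caps convention.

    Returns None when the casing gives no hint (all words or no words in caps).
    """
    tokens = full_name.split()
    surname = [t for t in tokens if is_all_caps(t)]
    given = [t for t in tokens if not is_all_caps(t)]
    if not surname or not given:
        return None
    return {"surname": " ".join(surname), "given": " ".join(given)}


print(split_by_upper_case_surname("AUGUST Michel"))
print(split_by_upper_case_surname("JOHN SMITH"))
```

Because the rule looks at casing rather than position, it gives the same answer for “AUGUST Michel” and “Michel AUGUST”.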

When matching company names, a word in upper case may indicate an abbreviation. So “THE Ltd” and “The Happy Entrepreneur Ltd” may be a good match despite a horrible edit distance.
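
One way to catch such a match is to compare an all caps word against the initials of the other name before falling back on edit distance. This is a rough sketch under stated assumptions: the short list of legal forms and the function names are made up for the example, and a real matching engine would do considerably more.

```python
# Illustrative, incomplete list of legal forms to ignore when comparing names.
LEGAL_FORMS = {"ltd", "limited", "inc", "llc", "plc", "gmbh"}


def core_tokens(name: str) -> list:
    """Words of a company name with legal forms such as 'Ltd' removed."""
    return [t for t in name.split() if t.strip(".,").lower() not in LEGAL_FORMS]


def is_acronym_of(short: str, long: str) -> bool:
    """True if short is a single all caps word spelling the initials of long."""
    short_core = core_tokens(short)
    long_core = core_tokens(long)
    if len(short_core) != 1 or len(long_core) < 2:
        return False
    candidate = short_core[0]
    if not (candidate.isalpha() and candidate.isupper()):
        return False
    return candidate == "".join(t[0].upper() for t in long_core)


def acronym_match(name_a: str, name_b: str) -> bool:
    """Check the acronym relation in both directions."""
    return is_acronym_of(name_a, name_b) or is_acronym_of(name_b, name_a)


print(acronym_match("THE Ltd", "The Happy Entrepreneur Ltd"))
print(acronym_match("The Ltd", "The Happy Entrepreneur Ltd"))
```

The upper case check is what carries the signal here: “The Ltd” in mixed case is read as an ordinary word and does not count as an abbreviation.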

In data migration, when handling names from older systems where all caps were used, it is common to try to make better looking names. “JOHN SMITH” will be “John Smith” and “SAM MCCLOUD” should be “Sam McCloud”. In environments with alphabets other than English, national characters may be reintroduced as well. For example in a German context “JURGEN VON LOW” may come out as “Jürgen von Löw”.
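
A naive version of this clean-up can be sketched as follows. The particle list, the “Mc” rule and the overrides lookup are all assumptions for illustration. In particular, national characters cannot be derived from the all caps form, so the sketch only restores them from a lookup table that you would have to supply yourself.

```python
# Illustrative list of name particles kept in lower case; conventions differ by country.
LOWER_PARTICLES = {"von", "van", "der", "den", "de"}


def proper_case_token(token: str) -> str:
    """Convert one all caps word to mixed case using a few simple rules."""
    lower = token.lower()
    if lower in LOWER_PARTICLES:
        return lower
    if lower.startswith("mc") and len(lower) > 2:
        # MCCLOUD -> McCloud
        return "Mc" + lower[2:].capitalize()
    return lower.capitalize()


def proper_case_name(name: str, overrides: dict = None) -> str:
    """Proper-case a full name; overrides maps an all caps word to a known spelling."""
    overrides = overrides or {}
    return " ".join(
        overrides.get(token, proper_case_token(token)) for token in name.split()
    )


# Hypothetical lookup table for reintroducing national characters.
german_overrides = {"JURGEN": "Jürgen", "LOW": "Löw"}

print(proper_case_name("JOHN SMITH"))
print(proper_case_name("SAM MCCLOUD"))
print(proper_case_name("JURGEN VON LOW", german_overrides))
```

Rules like these are best guesses. Hyphenated names, “Mac” prefixes and particles that the owner writes with a capital letter all need extra handling or, better, the original spelling from the source.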

What about you? Have you stumbled upon some fun with upper case in data management?

9 thoughts on “The Cases for UPPER CASE in Data Management”

  1. Andrew 26th June 2012 / 12:10

    Casing can be “fun” when you encounter two byte forenames.

    Mrs JO Brown (Janet Olivia) vs Mrs Jo Brown (Joanne?)

    Don’t get me started on Ng….

  2. Steve Tootill 26th June 2012 / 16:01

    The one that occurs to me straight away is MS SOCIETY being interpreted as a female “Ms Society” but there are many similar pitfalls…

  3. lynnvanavermaet 26th June 2012 / 16:55

    I don’t know if it is typically Belgian, but in translating all caps to mixed case names, some people are very sensitive to the way it is written. Writing VAN LAERE as ‘Van Laere’ will not be appreciated by some people (nobility) as their name is often written as ‘van Laere’. The little ‘v’ can be oh so sensitive.

    • Steve Tootill 26th June 2012 / 17:12

      Yes – in Holland and Belgium we’d usually case “van” in lower case, as in “Piet van der Valk”. The equivalent in French is lower case as well (“de”) but I believe that “La” and “Le” are capitalized so you could have “Pierre de La Haye”, I guess. Does anyone have a definite answer? Most Scots care about the capital letter following Mac e.g. MacDonald but I once worked with a Mr Macfarlane with a lower case “f” 🙂

      • Graham Rhind 26th June 2012 / 17:31

        @Steve

        As a general rule The Netherlands (Holland? Where’s that?) has lower case prepositions (van de) and Belgium upper case (Van De), because in Belgium it’s part of the name and not a preposition, as illustrated by Van De being under V in the telephone book in Belgium but under the letter that follows in The Netherlands.

        However, if you use the Dutch name without a given name but with a form of address, the V is upper case (Mr Van den Broek).

        The trick is to write it as the owner of that name wants – which is why I counsel against processing names in any way and for collecting that data correctly at source!

        End of lesson 🙂

      • Graham Rhind 26th June 2012 / 17:35

        Oh, and … I suspect “de la” and not “de La”, but your example includes the name of a city (La Haye – The Hague) – which may be why it is cased in that way.

        A fount of useless knowledge, me!

      • Steve Tootill 26th June 2012 / 21:41

        or even den Haag or ‘s Gravenhage to carry on casing… I never knew about the differences in Dutch and Belgian casing and indexing!

        I completely agree about trying to collect data correctly at source, but sometimes we have to “start from here” and use either our best guess or try and adopt a rule which is least likely to offend.

  4. Oliver Townshend 26th June 2012 / 23:52

    One rule unlikely to offend is to leave the name alone. Sometimes easier to explain all caps than to explain why the name has been wrecked. But I’ve never found out how many people are offended, and how many people just accept that some people can’t spell their surname (I certainly fall into that category fairly often).

  5. Henrik Liliendahl Sørensen 1st July 2012 / 09:03

    Thanks Andrew, Steve, Lynn, Graham and Oliver for adding in.
