In the last few weeks I have been asked several times to modify some components I’m working on so that they can split a full name into its components (first name, family name, etc.).
Most people seem to expect this to work correctly and get annoyed when it fails, and you can be sure it will fail, because it’s impossible to parse a name correctly. For instance:
|Full name|First name|Middle name|Family name|
|---|---|---|---|
|Barack Hussein Obama|Barack|Hussein|Obama|
|Pier Silvio Berlusconi|Pier Silvio||Berlusconi|
|José Rodríguez Zapatero|José||Rodríguez Zapatero|
How can you do this automatically?
This becomes particularly silly if you cannot be sure that the string you are going to parse is actually a full name; for instance, don’t try to parse a chat nickname. It’s true that Gmail/GTalk uses your full name by default, but this is only a default, and it holds only for Gmail.
To cut a long story short, please please please don’t try to parse names. You can see for yourself how hard it is, even though I’m only considering western-style names.
If you still don’t trust me, here’s a quote from e-name-western.c, i.e. the file that does name parsing in libebook:
 * <Nat> Jamie, do you know anything about name parsing?
 * <jwz> Are you going down that rat hole? Bring a flashlight.
On a side note, when you are trying to understand why some code is broken you can find some funny commits, like the great EDS purge.
Update: I found this “serious” bug in
19 thoughts on “Parsing names”
Zapatero’s name is even slightly more complicated. It is:
“José Luís” — name
“Rodríguez Zapatero” — surname
Because Rodríguez is such a common surname in Spain, people use “Zapatero” instead to refer to him, although that is his mother’s family name…
Actually, in Spain we have two surnames, which makes things a bit more convoluted (even for UIs). Zapatero’s full name is “José Luis Rodríguez Zapatero”: the given name is “José Luis”, the first surname is “Rodríguez” and the second surname is “Zapatero”.
And it gets worse!
I read an interesting Wikipedia article about names, surnames, patronymics and the like in various languages. I’d say there’s not a single universal rule… there are hardly any rules at all.
Expect headaches handling Brazilian names 🙂
oh, here it is
I knew about Zapatero’s name, but I was not sure whether Luis was part of the first name or a middle name. Also, I wanted all the names in the example to have 3 components.
Yeah, but the name parsers are written by English-speaking people and they work well with English-style names 🙂
I also think that the vCard format doesn’t support multiple surnames.
Oh, it doesn’t even totally work for English names, since there are a few edge cases like “Mary Ann”, which I think most people consider to be a “first name”, not a first+middle.
I’ve seen some UIs where it asks for your full name, and then it has a second entry labelled “Call me _____” or something, which defaults to the first word of the full name, but you can change it if it guessed wrong. (Or maybe sometimes they’re even clever, and the behavior changes with locale, so like in China it would take everything *except* the first word.)
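The "Call me" idea can be sketched in a few lines of Python. This is only an illustration: the function name `default_short_name` and the `family_name_first` flag are made up here, and a real UI would pick the flag from the locale. The important part is that the result is only a default that the user can overwrite.

```python
def default_short_name(full_name: str, family_name_first: bool = False) -> str:
    """Guess a default for a "Call me ..." field. It is only a suggestion."""
    words = full_name.split()
    if not words:
        return ""
    if family_name_first:
        # Assumption: the first word is the family name, the rest is the given name.
        return " ".join(words[1:]) or words[0]
    # Assumption: the first word is what the person wants to be called.
    return words[0]


print(default_short_name("Mary Ann Smith"))                         # Mary (wrong guess, she edits it)
print(default_short_name("Wang Xiaoming", family_name_first=True))  # Xiaoming
```

The "Mary Ann" case still comes out wrong, which is exactly why the field has to stay editable.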
Maybe it’s time western people realized that not all cultures need to divide names into “components”. Just the distinction between nick, name and full name would work for most cultures.
A few weeks ago I was asked to parse names from telephone listings (we publish phone books). The non-programmers asking me to do this simply couldn’t understand why this is difficult until I offered them an example just like yours. Even then they thought there must be a workaround that I wasn’t able to see.
Phone listings are actually worse than names because the name field might have one or more initials instead of a name or any zany combination that people want on their call display.
e.g. “A & P Van Allen”, where “A & P” is the acting first name.
One memorable listing was “Hutchinson Family (kids line)”. I’m not even sure how to split that one manually!
Another good example that I don’t see mentioned here is when people have generational identifiers (can’t think of the right term). For example: Jr., III, IV, etc.
That’s not all. Greek names have traditionally the family (last) name first, then the given name.
That works in libebook and is parsed correctly, but the code uses a hardcoded list of prefixes/suffixes.
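To give an idea of what the hardcoded-list approach looks like, here is a minimal Python sketch. It is not libebook's code (that is C, and its list is different); the set `GENERATIONAL_SUFFIXES` and the function `split_suffix` are invented for this example.

```python
from typing import Tuple

# Hypothetical, deliberately short list; any fixed list will miss something.
GENERATIONAL_SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}


def split_suffix(full_name: str) -> Tuple[str, str]:
    """Return (name without suffix, suffix); the suffix is "" if none is found."""
    words = full_name.replace(",", " ").split()
    if len(words) > 1 and words[-1].rstrip(".").lower() in GENERATIONAL_SUFFIXES:
        return " ".join(words[:-1]), words[-1]
    return " ".join(words), ""


print(split_suffix("Martin Luther King, Jr."))  # ('Martin Luther King', 'Jr.')
print(split_suffix("Liv Tyler"))                # ('Liv Tyler', '')
```

It only works for suffixes that are on the list, and it will happily strip a real name that happens to look like one of them.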
There are also other possible problems from this. Is liv a first name or does it mean 54th? 🙂
IBM InfoSphere Global Name Recognition provides multi-cultural name information, analytics and name matching through a series of flexible, easy-to-integrate, SOA enabled interfaces, enabling you to unlock and unleash the wealth of information in a name.
@Michael Moore: are you the fat liberal film-maker, or merely another spammer of the same name?
I don’t understand why any software would even *consider* trying to parse a name. Just treat the full name as an indivisible string, and have a second field for “nickname” or similar only if it is absolutely necessary for some reason.
I think you will find that it is nearly impossible to invent a system that can correctly parse names into pieces.
My own last name, for example, is Ruigrok van der Werven, but if you applied the general Dutch rule that last names have a ‘van’ part, you might think that van der Werven is my last name and Ruigrok is part of my first or middle name. Moreover, ‘van’ is not considered part of the last name but is registered as a prefix, and this system does not provide for that part.
@Anonymous: did you take a look at the software I mentioned? It does what Marco is looking for, identifying the culture of a name and parsing it according to that culture’s rules. This is a knowledge-based solution to a hard problem. For all the reasons that Marco and other commenters have identified, an algorithmic approach will not work.
If you consider parsing names in East Asian languages, things get much worse. In Korean, for example, the family name and the given name are not separated by whitespace. (Their order is reversed: the family name comes first.) Most names are composed of 3 hangul syllables, but some names have 2 syllables, and some have family names with 2 syllables and given names with 1 or 2 syllables. In running text, we have postpositional words that indicate the role of a word in a sentence, so one or two more syllables may follow the name without whitespace.
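A small Python sketch shows why the missing whitespace hurts. It assumes the string is a bare name (no trailing postposition) and that the family name is 1 or 2 syllables long; the function `korean_candidates` is hypothetical and only lists the possibilities, it cannot choose between them.

```python
import unicodedata
from typing import List, Tuple


def korean_candidates(name: str) -> List[Tuple[str, str]]:
    """List every (family name, given name) split with a 1- or 2-syllable family name."""
    # NFC makes each hangul syllable a single code point, so slicing is per syllable.
    name = unicodedata.normalize("NFC", name)
    return [(name[:n], name[n:]) for n in (1, 2) if len(name) > n]


print(korean_candidates("남궁민수"))  # [('남', '궁민수'), ('남궁', '민수')]
print(korean_candidates("김연아"))    # [('김', '연아'), ('김연', '아')]
```

Picking the right candidate needs a list of known family names, and even that does not help once a postposition is glued to the end.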
And there are further complications in Spanish! As you seem to find some fun in this matter, let me elaborate:
Just as, I imagine, in Italian, many last names and names have articles and prepositions in them (de, los, del, de los, de la, de las, la, etcetera); these should NEVER be capitalized, and must be ignored for sorting, both of which are invariably done wrong.
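For the sorting part, a rough Python sketch could look like this. The particle list `PARTICLES` and the function `sort_key` are assumptions made up for the example, the list is certainly incomplete, and plain string comparison is not real Spanish collation.

```python
# Hypothetical, incomplete list of Spanish articles and prepositions.
PARTICLES = {"de", "del", "la", "las", "los"}


def sort_key(surname: str) -> str:
    """Build a sort key that skips leading particles but always keeps one word."""
    words = surname.split()
    i = 0
    while i < len(words) - 1 and words[i].lower() in PARTICLES:
        i += 1
    return " ".join(words[i:]).casefold()


surnames = ["de la Rosa", "Zapatero", "del Bosque", "Rodríguez"]
print(sorted(surnames, key=sort_key))
# ['del Bosque', 'Rodríguez', 'de la Rosa', 'Zapatero']
```

Note that the surnames are displayed untouched, with the particles still in lower case; only the key used for ordering ignores them.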
And there are some names that can also be, less frequently, surnames (Santiago, Esteban, Miguel, for example).
And we do legally have two last names, one from the father, one from the mother, but we often only quote the first one, for brevity.
For traditional reasons, in the vast majority of cases, you get the first last name from your father, but nowadays parents have that choice too.
I would not be surprised if you could pass on your SECOND last name, instead of the first one, offering a full four possibilities for naming your progeny, although I know of no such cases. I do *believe* that all brothers have to have the same set and in the same order, but Spanish law is so permissive with that sort of dumb liberties that I could easily be wrong.
So there is no safe way to split “José Miguel de la Rosa” without more information. Most likely the name would be “José Miguel” and there would only be one last name quoted, but it could also be that the name was “José” and the last names “Miguel” and “de la Rosa”. So for Spanish name splitting, some complicated heuristics are necessary to improve certainty, and they will still fail sometimes.
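The ambiguity is easy to show in Python. The sketch below assumes a small particle list and at most two last names; `PARTICLES`, `group_units` and `candidate_splits` are names invented for this example. It does not decide anything, it only lists the readings that are consistent with those assumptions.

```python
from typing import List, Tuple

# Hypothetical, incomplete list of Spanish articles and prepositions.
PARTICLES = {"de", "del", "la", "las", "los"}


def group_units(full_name: str) -> List[str]:
    """Attach each run of particles to the word that follows it."""
    units, pending = [], []
    for word in full_name.split():
        pending.append(word)
        if word.lower() not in PARTICLES:
            units.append(" ".join(pending))
            pending = []
    if pending:  # trailing particles with nothing after them
        units.append(" ".join(pending))
    return units


def candidate_splits(full_name: str) -> List[Tuple[str, List[str]]]:
    """List every (given name, last names) reading with one or two last names."""
    units = group_units(full_name)
    candidates = []
    for k in range(1, len(units)):
        last_names = units[k:]
        if len(last_names) <= 2:
            candidates.append((" ".join(units[:k]), last_names))
    return candidates


for given, last_names in candidate_splits("José Miguel de la Rosa"):
    print(given, "|", last_names)
# José | ['Miguel', 'de la Rosa']
# José Miguel | ['de la Rosa']
```

Both readings are valid, so any code that returns just one of them is guessing.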