Chinese Character Normalization – Finding People in Greater China

For most of this past year I’ve been working on a project that involved searching for Chinese people in various online databases using their Romanized names or their Chinese character names. When you are searching for someone’s Chinese name inside a database, there are some quite thorny issues. With the rise of China as the world’s second-largest economy, and with Chinese people traveling and spending more and more around the world, these issues of identifying Chinese people by their names are going to become a part of many knowledge workers’ day-to-day tasks. Here is the definition of Greater China from Wikipedia.

Most of the time, trying to find a Chinese person among many other Chinese people in a database by name is not very successful. Most of the problems are around ‘Romanization’ and ‘Simplified and Traditional Chinese characters’. If you are interested in ‘Romanization’ see this Wikipedia entry. The ‘Romanization’ problem is that there are simply too many methods and no single standard that everyone follows.

In mainland China, people are by law required to use ‘simplified’ characters for their names. This assumes that there is a ‘simplified’ character for that name. In Hong Kong, Macau and Taiwan people use ‘traditional’ characters for their names. If you are interested in the difference refer to this Wikipedia entry. In any event, ‘simplification’ is a master stroke of censorship and knowledge control by the mainland Chinese government. Mainland Chinese have difficulty reading books, pamphlets and newspapers from outside of China. What better way could there be of controlling knowledge than by changing the writing system people use every day? Conversely, people from Hong Kong, Macau and Taiwan have a difficult time reading ‘simplified’ characters. Some claim it is harder to go from ‘Simplified’ to ‘Traditional’ than from ‘Traditional’ to ‘Simplified’ but I’m not sure. Reading Chinese is always hard for me and I’ve learned both character sets, sort of, up to the 1,000 character mark.

However, since there are different character sets, a problem arises when someone from mainland China comes to Hong Kong, Macau or Taiwan and starts to use their written character name to open accounts at banks, shops, hotels and so on. The same happens when people from Hong Kong, Macau and Taiwan go to mainland China. Simply put, people can’t easily read this person’s name. The solution is to ‘transform’ the name into the ‘correct’ character set: ‘Simplified’ character to ‘Traditional’ character, or ‘Traditional’ character to ‘Simplified’ character. It happens all the time when a person opens an account where their details will be input into a database. They write down their name in the character set they are comfortable using, and either the person collecting the names or the data-input person ‘transforms’ this name.

Interestingly, all Hong Kong and Macau Chinese people may apply for a ‘home return permit’ card that lets them cross the border into China easily, and also lets the Chinese government know they have arrived. Their names are always ‘transformed’ into simplified characters when there is a corresponding ‘simplified’ character for the ‘traditional’ character. I assume these transformations are more accurate than some of the others.

I know some of the transformations from ‘simplified’ to ‘traditional’ are not always accurate. This is due to imperfect knowledge of the mapping rules between the character sets. Sometimes people are in too much of a hurry, so they simply guess. All Chinese names have at least 2 characters and many, maybe the majority, have 3 characters. Sometimes the transformer will transform 1 or 2 characters and leave 1 or 2 characters unchanged.

The end result is that even if you have a Chinese person’s correct name, you may not be able to find it in a database because someone has ‘transformed’ the name. Sometimes you can’t find a Chinese person in a database because you believe their name is written with character ‘X’ but in fact they write it with character ‘Y’. The only way to solve this problem is for the database’s search engine to ‘normalize’ the search. Here is an excellent summary of ‘normalization’ prepared by Michael CY Chan.
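To make the idea of ‘normalizing’ a search a little more concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions, not how any particular database does it: the mapping table is a tiny hand-picked sample, and the names normalize_name and find_matches are made up for this example. The idea is to fold both the search term and the stored names into one character set before comparing them, so that a fully or partly ‘transformed’ name still matches. I fold towards ‘simplified’ here because one simplified character can stand for more than one traditional character (发 covers both 發 and 髮), so going the other way requires guessing.

```python
# Illustrative sample only: a real system would need a complete
# traditional-to-simplified table, not these six characters.
TRAD_TO_SIMP = {
    "劉": "刘",
    "華": "华",
    "張": "张",
    "國": "国",
    "榮": "荣",
    "陳": "陈",
}

# Build a translation table once so each lookup is a single pass.
_TABLE = str.maketrans(TRAD_TO_SIMP)


def normalize_name(name: str) -> str:
    """Fold a name into simplified characters (hypothetical helper).

    Characters missing from the table, such as those that are the
    same in both character sets, pass through unchanged.
    """
    return name.strip().translate(_TABLE)


def find_matches(query: str, records: list[str]) -> list[str]:
    """Return the stored names equal to the query after normalization."""
    key = normalize_name(query)
    return [r for r in records if normalize_name(r) == key]


# The same person stored three ways: traditional, partly
# transformed, and fully simplified, plus one unrelated name.
records = ["劉德華", "刘德華", "刘德华", "張國榮"]

print(find_matches("劉德華", records))
print(find_matches("张国荣", records))
```

Searching with either the traditional or the simplified form of the first name should return all three stored spellings, including the half-transformed one. In practice the normalized form would more likely be stored in its own indexed column rather than recomputed for every record on every search.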
