How to filter out common unwanted characters in Python

Jurie Horneman

Sunday, February 14, 2010

I did a fair amount of programming last year, and among other things I wrote a Python program that takes Word documents and transforms them into game data. One of the things it had to do was take text containing characters our engine (on Nintendo DS) didn't support and replace them with other characters. Like so:

character_replacements = [
    ( u'\u2018', u"'"),   # LEFT SINGLE QUOTATION MARK
    ( u'\u2019', u"'"),   # RIGHT SINGLE QUOTATION MARK
    ( u'\u201c', u'"'),   # LEFT DOUBLE QUOTATION MARK
    ( u'\u201d', u'"'),   # RIGHT DOUBLE QUOTATION MARK
    ( u'\u201e', u'"'),   # DOUBLE LOW-9 QUOTATION MARK
    ( u'\u2013', u'-'),   # EN DASH
    ( u'\u2026', u'...'), # HORIZONTAL ELLIPSIS
    ( u'\u0152', u'OE'),  # LATIN CAPITAL LIGATURE OE
    ( u'\u0153', u'oe')   # LATIN SMALL LIGATURE OE
]

# Apply each replacement in turn; `text` is the unicode string being cleaned.
for undesired_character, safe_character in character_replacements:
    text = text.replace(undesired_character, safe_character)

This code is not hard to write, but building the character replacement list can take some time. The bulk consists of characters that Word will automatically insert when appropriate, but that game typefaces often don't support or that are cumbersome for other reasons. As you can see, those characters are basically all kinds of quotation marks, plus a dash and the ellipsis. All of these are replaced by old-school 7-bit ASCII characters or character sequences.
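For readers on Python 3, here is a minimal sketch of the same replacements done with str.translate instead of a loop of replace calls. This is a different technique from the snippet above, not what the original tool did, and the names REPLACEMENTS, TRANSLATION_TABLE and make_safe are made up for this illustration.

```python
# Hypothetical Python 3 variant: the same mappings as a dict for str.translate.
REPLACEMENTS = {
    '\u2018': "'",    # LEFT SINGLE QUOTATION MARK
    '\u2019': "'",    # RIGHT SINGLE QUOTATION MARK
    '\u201c': '"',    # LEFT DOUBLE QUOTATION MARK
    '\u201d': '"',    # RIGHT DOUBLE QUOTATION MARK
    '\u201e': '"',    # DOUBLE LOW-9 QUOTATION MARK
    '\u2013': '-',    # EN DASH
    '\u2026': '...',  # HORIZONTAL ELLIPSIS
    '\u0152': 'OE',   # LATIN CAPITAL LIGATURE OE
    '\u0153': 'oe',   # LATIN SMALL LIGATURE OE
}

# str.maketrans accepts single-character keys; values may be longer strings.
TRANSLATION_TABLE = str.maketrans(REPLACEMENTS)


def make_safe(text):
    """Return text with every listed character replaced in a single pass."""
    return text.translate(TRANSLATION_TABLE)


sample = '\u201cL\u2019\u0153uvre\u201d \u2013 fin\u2026'
print(make_safe(sample))  # prints: "L'oeuvre" - fin...
```

For a list this short the loop is just as good; the translation table mainly saves rescanning the text once per entry.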

(Last year, while I was working on this tool, I read Robert Bringhurst's "Elements of Typographic Style". So as I was learning why it makes sense to use different types of dashes and quotation marks, I was also writing code to strip them out. Given the time constraints, this was the pragmatic approach. Still, I'd love to write a better text display engine sometime.)

Some time later I ran into trouble with the French 'œ' (as used in 'œuvre'), so I added that too since the mapping is uncontroversial. Our typeface supported German umlauted vowels, so we left those.

In case it wasn't clear: the comments give the official Unicode names for each character.
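If you extend the list, you don't have to look those names up by hand: Python's standard unicodedata module knows them. A small sketch in Python 3 syntax; the helper name describe and the 'UNKNOWN' fallback label are my own choices for this example.

```python
import unicodedata


def describe(character):
    """Return the code point and official Unicode name of one character."""
    # The second argument to unicodedata.name is returned for unnamed characters.
    name = unicodedata.name(character, 'UNKNOWN')
    return 'U+%04X %s' % (ord(character), name)


for character in '\u2018\u2013\u2026\u0153':
    print(describe(character))
# prints:
# U+2018 LEFT SINGLE QUOTATION MARK
# U+2013 EN DASH
# U+2026 HORIZONTAL ELLIPSIS
# U+0153 LATIN SMALL LIGATURE OE
```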

The list is far from complete, and depending on your situation you may want to add some kind of additional black- or white-listing mechanism to make sure other unsupported characters don't slip through. But this simple list worked very well for the 13 multilingual Nintendo DS games we developed last year. If I could have found a reasonable and well-documented list like this, I could have saved a few hours. I hope that publishing this snippet here saves someone else that time.
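As an illustration of the white-listing idea, here is a rough sketch (Python 3) that reports every character in a text that is not in a known-good set. The set below, ASCII letters, digits, punctuation, space, newline and the German umlauted vowels, is only an assumption for the example; the real set depends on what your typeface contains. The names SUPPORTED_CHARACTERS and find_unsupported_characters are hypothetical.

```python
import string

# Assumed white-list for this example only; adjust it to your typeface.
SUPPORTED_CHARACTERS = set(
    string.ascii_letters + string.digits + string.punctuation
    + ' \n'
    + '\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc'  # German umlauted vowels
)


def find_unsupported_characters(text, supported=SUPPORTED_CHARACTERS):
    """Return the characters in text that are not white-listed, sorted by code point."""
    return sorted(set(text) - supported)


# U+2014 (EM DASH) and U+00E9 (e with acute) are not in the assumed set.
leftovers = find_unsupported_characters('T\u00fcr \u2014 caf\u00e9')
print(['U+%04X' % ord(character) for character in leftovers])
# prints: ['U+00E9', 'U+2014']
```

Running a check like this after the replacements gives you a list of characters to either map or add to the typeface, instead of finding them on screen later.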