Django 1.6's Collation and Umlauts

2014-06-24 | #django, #python, #solution, #webdev

By default, tables created by Django 1.6 on MySQL typically use "utf8_general_ci" as their collation.

Mostly this seems like a good choice, but it has one ugly caveat when dealing with umlauts and similar characters such as üäöÜÄÖß:

These are treated as equal to uaoUAO in the sort order, which is fine, but they are also treated as equal in LIKE queries, which can be utter bullcrap. For example, searching for "gööd" on those tables using "__contains" also returns entries containing "good", and vice versa.

Ugh.

There are two solutions for this:

Use a "bin" collation. This discards all character interpretation, which could cause sorting problems and will definitely make all queries case-sensitive by default. I guess Django would then handle case-insensitivity in Python, which is fine but slow.

Build your queries using "__regex" and "__iregex", which explicitly differentiate umlauts from their base characters. You can also replace "__startswith" and "__endswith" by adding "^" to the beginning or "$" to the end of your search string, respectively. If you use regex, always escape user input before putting it into a pattern, and be aware of one problem: regex does not, by default, know that a lowercase "ä" corresponds to an uppercase "Ä". You can solve this by searching against lowercased fields, or you could try to bring in a fitting locale, but I'm not certain what to do on international pages. So my suggestion is to accept case-sensitivity for non-ASCII characters or to explicitly build a search blob for searchable entries on save.
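A minimal sketch of the regex approach and the search blob idea, written for Python 3 (where strings are unicode). The helper names "regex_lookup" and "make_search_blob", the model "Article" and its fields "title" and "search_blob" are all made up for illustration; none of them are part of Django. It also assumes your database's regex dialect accepts the backslash escapes that "re.escape" produces.

```python
import re


def regex_lookup(field, term, mode="contains"):
    """Build filter kwargs for a "__regex" lookup from raw user input.

    Hypothetical helper, not part of Django.
    """
    # Escape regex metacharacters so user input is matched literally.
    escaped = re.escape(term)
    if mode == "contains":
        pattern = escaped
    elif mode == "startswith":
        pattern = "^" + escaped
    elif mode == "endswith":
        pattern = escaped + "$"
    else:
        raise ValueError("unknown mode: %r" % (mode,))
    return {"%s__regex" % field: pattern}


def make_search_blob(*values):
    """Join the searchable values into one lowercased string.

    Python's str.lower() knows that "Ä" lowercases to "ä", so the
    case mapping happens here instead of in the database.
    """
    return " ".join(value.lower() for value in values if value)


# Assumed usage with a hypothetical model "Article":
#   Article.objects.filter(**regex_lookup("title", user_input, "startswith"))
#
# Case-insensitive search against a blob filled in on save:
#   article.search_blob = make_search_blob(article.title, article.body)
#   Article.objects.filter(**regex_lookup("search_blob", user_input.lower()))
```

Lowercasing both the stored blob and the search term in Python sidesteps the "ä" versus "Ä" problem, at the cost of one extra column to keep in sync on save.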