Fixing malformed UTF-8 in Lua

I received several reports of files being loaded empty in ZeroBrane Studio, and it turned out to be caused by malformed UTF-8; the most frequent offenders are quotes copied from text in other encodings. For example, Windows code pages such as CP1250 and CP1252 have single and double quotation marks at codes 0x91-0x94 (text labeled ISO 8859-1 is frequently one of these in practice), and all of these codes are invalid in UTF-8 when they appear on their own ("an unexpected continuation byte"). See "ASCII and Unicode quotation marks" for details on the various types of quotation marks.
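To see why these bytes are rejected, it helps to classify a byte by the role it can play in UTF-8. The snippet below is only a sketch; byteRole is a hypothetical helper introduced for this illustration and is not part of ZeroBrane Studio.

```lua
-- Hypothetical helper (not from ZeroBrane Studio): classify one byte value
-- by the role it can play in a UTF-8 stream.
function byteRole(b)
  if     b <= 0x7F then return "ascii"         -- 0xxxxxxx
  elseif b <= 0xBF then return "continuation"  -- 10xxxxxx, never valid first
  elseif b <= 0xC1 then return "invalid"       -- would only start overlong forms
  elseif b <= 0xDF then return "lead2"         -- starts a 2-byte sequence
  elseif b <= 0xEF then return "lead3"         -- starts a 3-byte sequence
  elseif b <= 0xF4 then return "lead4"         -- starts a 4-byte sequence
  else return "invalid" end                    -- beyond U+10FFFF
end

-- An ASCII apostrophe followed by the four Windows quotation-mark codes.
for _, b in ipairs({0x27, 0x91, 0x92, 0x93, 0x94}) do
  print(string.format("0x%02X", b), byteRole(b))
end
```

The quotation-mark codes 0x91-0x94 all come out as "continuation", which is why a validator complains when one of them appears without a lead byte in front of it.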

There is no shortage of recommendations on how to detect malformed UTF-8 code, but most of them are based on using iconv and I was looking for something Lua-based. There is a StackOverflow answer that provides a clever way to iterate over UTF-8 code points, but it only works correctly over valid UTF-8 strings.

So I turned to other languages and found exactly what I was looking for in Test::utf8: a regexp that detects a valid UTF-8 sequence. The rest was easy; here is the fixUTF8 function that takes a string and "fixes" it:

function fixUTF8(s, replacement)
  local p, len, invalid = 1, #s, {}
  while p <= len do
    -- one byte: ASCII
    if     p == s:find("[%z\1-\127]", p) then p = p + 1
    -- two bytes
    elseif p == s:find("[\194-\223][\128-\191]", p) then p = p + 2
    -- three bytes
    elseif p == s:find(       "\224[\160-\191][\128-\191]", p)
        or p == s:find("[\225-\236][\128-\191][\128-\191]", p)
        or p == s:find(       "\237[\128-\159][\128-\191]", p)
        or p == s:find("[\238-\239][\128-\191][\128-\191]", p) then p = p + 3
    -- four bytes
    elseif p == s:find(       "\240[\144-\191][\128-\191][\128-\191]", p)
        or p == s:find("[\241-\243][\128-\191][\128-\191][\128-\191]", p)
        or p == s:find(       "\244[\128-\143][\128-\191][\128-\191]", p) then p = p + 4
    else
      -- the byte at p does not start a valid sequence: replace it and record
      -- its position; p is not advanced, so the replacement itself is checked
      -- on the next pass and has to be valid UTF-8
      s = s:sub(1, p-1)..replacement..s:sub(p+1)
      len = #s -- the length changes unless the replacement is a single byte
      table.insert(invalid, p)
    end
  end
  return s, invalid
end

The logic is simple: at each position it checks for a valid UTF-8 sequence of one to four bytes and skips over it; any byte that does not start such a sequence is replaced.
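The narrower ranges after the \224, \237, \240 and \244 lead bytes are easier to follow with a bit of arithmetic. The sketch below decodes bytes as if they formed a sequence, without checking validity, to show what sits on each side of those boundaries; decode3 and decode4 are hypothetical helpers introduced only for this illustration.

```lua
-- Hypothetical helpers: compute the code point that the given bytes would
-- encode as a 3- or 4-byte sequence. No validity checks are done on purpose.
function decode3(b1, b2, b3)
  return (b1 - 224) * 4096 + (b2 - 128) * 64 + (b3 - 128)
end

function decode4(b1, b2, b3, b4)
  return (b1 - 240) * 262144 + (b2 - 128) * 4096 + (b3 - 128) * 64 + (b4 - 128)
end

-- \224 needs a second byte of at least \160; anything lower is an overlong
-- form of a code point that already fits in two bytes.
print(string.format("U+%04X", decode3(224, 159, 191)))      -- U+07FF (overlong)
print(string.format("U+%04X", decode3(224, 160, 128)))      -- U+0800
-- \237 needs a second byte of at most \159; anything higher is a surrogate.
print(string.format("U+%04X", decode3(237, 159, 191)))      -- U+D7FF
print(string.format("U+%04X", decode3(237, 160, 128)))      -- U+D800 (surrogate)
-- \240 starts at U+10000 and \244 stops at U+10FFFF.
print(string.format("U+%04X", decode4(240, 144, 128, 128))) -- U+10000
print(string.format("U+%04X", decode4(244, 143, 191, 191))) -- U+10FFFF
```

So each special case in the pattern list excludes either an overlong encoding, a surrogate, or a value above the Unicode range.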

The reason for the replacement parameter is that you wouldn't want to silently remove malformed characters; ideally you'd replace them with an infrequent but valid character that is easy to find later. I picked "\022" as it is a control character that is rarely seen in text and is usually shown as a [SYN] glyph. Now ZeroBrane Studio will generate a message like this: "Replaced an invalid UTF8 character with [SYN]."

This is not the fastest way, but it works well if there are few replacements. If you expect a large number of replacements, you can store all fragments in an array and concatenate them at once with the table.concat function to avoid allocating a new copy of the string on every replacement.
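A minimal sketch of that table.concat variant could look like the following. It is not the code ZeroBrane Studio uses; fixUTF8Concat and the valid table are names assumed for this example. It also differs from fixUTF8 in two ways: the replacement is never re-scanned, and the reported positions refer to the original string.

```lua
-- Assumed names for this sketch: valid, fixUTF8Concat.
-- Same sequences as in fixUTF8, anchored with ^ so that find() only tests
-- the position it is given.
local valid = {
  "^[%z\1-\127]+", -- a whole run of ASCII at once
  "^[\194-\223][\128-\191]",
  "^\224[\160-\191][\128-\191]",
  "^[\225-\236][\128-\191][\128-\191]",
  "^\237[\128-\159][\128-\191]",
  "^[\238-\239][\128-\191][\128-\191]",
  "^\240[\144-\191][\128-\191][\128-\191]",
  "^[\241-\243][\128-\191][\128-\191][\128-\191]",
  "^\244[\128-\143][\128-\191][\128-\191]",
}

function fixUTF8Concat(s, replacement)
  local p, len = 1, #s
  local parts, invalid = {}, {}
  local start = 1 -- first byte of the current run of valid text
  while p <= len do
    local stop -- last byte of the valid sequence at p, if there is one
    for _, pat in ipairs(valid) do
      local _, e = s:find(pat, p)
      if e then
        stop = e
        break
      end
    end
    if stop then
      p = stop + 1
    else
      -- flush the valid text before the bad byte, then add the replacement
      parts[#parts + 1] = s:sub(start, p - 1)
      parts[#parts + 1] = replacement
      invalid[#invalid + 1] = p -- position in the original string
      p = p + 1
      start = p
    end
  end
  parts[#parts + 1] = s:sub(start) -- the remaining valid tail
  return table.concat(parts), invalid
end

-- "it's" with a Windows right single quotation mark (0x92) as the apostrophe
local fixed, bad = fixUTF8Concat("it\146s", "\022")
print(fixed == "it\022s", #bad, bad[1])
```

The string is joined only once at the end, so the cost no longer grows with the number of replacements multiplied by the length of the string.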

[Updated 10/01/2013] Fixed an issue with UTF8 sequences; thanks to Enrique García for the fix.
[Updated 10/29/2013] Fixed an issue with one of the sequences that allowed for some invalid characters to pass as valid; thanks to Vadim Zeitlin for bringing this up.
