How to get the length in bytes of a UTF-8 character?

Started by Francois G., March 12, 2024, 09:48:24 AM

Francois G.

Hi,

Given LANG=en_GB.utf8 and FGL_LENGTH_SEMANTICS unset (defaulting to BYTE), I have text typed in by the user that contains emojis, accents, the UK pound sign (£), and single and double quotes from a word processor (3-byte UTF-8 characters). I need to split this text into several Informix records (each 200 bytes long) at a UTF-8 character boundary, not at an arbitrary byte boundary. I also need to cut a string off at a given byte length without damaging any UTF-8 character near the end: the cut must fall earlier than the byte limit where necessary, so that every multi-byte character stays intact.

How do I iterate through a STRING one multi-byte UTF-8 character at a time (not one byte at a time) without changing FGL_LENGTH_SEMANTICS?

I am currently using a horrible hack:

    FOR f_i = 1 TO f_note_text.getLength()
        LET f_utf8 = f_note_text.getCharAt(f_i)
        LET f_b64  = util.Strings.base64EncodeFromString(f_utf8)
        IF f_b64.getLength() = 4 AND f_b64 MATCHES "*==" THEN
            # 1 byte: 7-bit US-ASCII (0x00 to 0x7F)
        ELSE
            # More than 1 byte, or a byte above 0x7F: either part of a valid UTF-8 character, or part of a broken one
        END IF
    END FOR

With these semantics, ORD() and ASCII() are less than useful for this purpose.

I can also run "hexdump -C" on the Linux backend and get all the bytes that I need, but this is very heavy and slow.

In C# there is a function for this:

    byte[] bytes = Encoding.UTF8.GetBytes(f_note_text);

Finally I could also write a C shared library (SO file) to do this.

Is there a native 4GL / BDL way to achieve this?

Regards,

Sebastien F.

Hello François,

NO!!! You should not have to write such code to split UTF-8 strings at byte level to insert the text into the SQL database!

Please let's have a call to discuss this.

Seb

Francois G.

Thanks Seb,

We usually have a call with Michelle and Neil on Monday mornings: would you be able to attend, or at least discuss with Neil?

Regards,

Sebastien F.

No need to wait for Monday
Please contact me directly
Seb

Rene S.

Try this:

MAIN
    DEFINE s STRING
    LET s = 'ab€c d'
    DISPLAY SFMT("s:%1 l:%2 w:%3 number_of_logical_chars:%4",
        s, length(s), fgl_width(s), number_of_logical_chars(s))
    LET s = '123456'
    DISPLAY SFMT("s:%1 l:%2 w:%3 number_of_logical_chars:%4",
        s, length(s), fgl_width(s), number_of_logical_chars(s))
END MAIN

FUNCTION number_of_logical_chars(s STRING) RETURNS INT
    DEFINE i, l, n INT

    LET l = s.getLength()
    LET i = 1
    LET n = 0
    WHILE i <= l
        VAR c = s.getCharAt(i)
        LET i += c.getLength()
        LET n += 1
    END WHILE
    RETURN n
END FUNCTION


BTW: if the string does not contain double-width East Asian characters, fgl_width will do the job.

Rene S.

The program above produces:

$ FGL_LENGTH_SEMANTICS=BYTE fgl ll1
s:ab€c d l:8 w:6 number_of_logical_chars:6
s:123456 l:6 w:6 number_of_logical_chars:6
$ FGL_LENGTH_SEMANTICS=CHAR fgl ll1
s:ab€c d l:6 w:6 number_of_logical_chars:6
s:123456 l:6 w:6 number_of_logical_chars:6


The number of logical characters does not depend on FGL_LENGTH_SEMANTICS.

Leo S.

Hi Francois, to cut at a byte boundary of x you actually *must* use FGL_LENGTH_SEMANTICS=BYTE, because getLength() on the STRING then returns the number of bytes.
Rene demonstrated how easy it is to iterate through the string: as long as you advance the index passed to getCharAt() by the byte length of the character just read, you always get a complete character.
If the index goes beyond your byte boundary, step back by one character and take a subString().

HTH, Leo

MAIN
  DEFINE s, s2 STRING
  IF fgl_getenv("FGL_LENGTH_SEMANTICS") == "CHAR" THEN
    DISPLAY "must not have CHAR length semantics to count bytes"
    RETURN
  END IF
  LET s = "🙁🙂ab£c🙁"
  LET s2 = cutAtByteLength(s, 14)
  DISPLAY SFMT("s:%1 slen:%2 s2:%3 s2len:%4", s, length(s), s2, length(s2))
  LET s = "🙁🙂x"
  LET s2 = cutAtByteLength(s, 8)
  DISPLAY SFMT("s:%1 slen:%2 s2:%3 s2len:%4", s, length(s), s2, length(s2))
END MAIN

FUNCTION cutAtByteLength(s STRING, maxByteLength INT) RETURNS STRING
  VAR l = s.getLength() --byte length whole string
  DISPLAY "source has length:", l
  VAR i = 1
  WHILE i <= l
    VAR c = s.getCharAt(i)
    VAR clen = c.getLength() --byte length one char
    LET i += clen
    DISPLAY SFMT("i:%1,c:%2,len:%3", i, c, c.getLength())
    IF i - 1 > maxByteLength THEN
      RETURN s.subString(1, i - 1 - clen)
    END IF
  END WHILE
  RETURN s
END FUNCTION

Leo S.

With Rene's help, here is a version of the program that displays correctly here: util.JSON.parse builds the non-ASCII symbols from escape sequences, which can be posted in this forum.

IMPORT util
MAIN
  DEFINE s, s2,smil1,smil2,pound STRING
  IF fgl_getenv("FGL_LENGTH_SEMANTICS") == "CHAR" THEN
    DISPLAY "must not have CHAR length semantics to count bytes"
    RETURN
  END IF
  --smileys need 2 UTF-16 code units (a surrogate pair): they lie beyond the Unicode basic plane
  CALL util.JSON.parse('"\\uD83D\\uDE01"',smil1)
  DISPLAY "smil1:",smil1
  CALL util.JSON.parse('"\\uD83D\\uDE02"',smil2)
  DISPLAY "smil2:",smil2
  CALL util.JSON.parse('"\\u00a3"',pound)
  DISPLAY "pound:",pound
  LET s = smil1,smil2,"ab",pound,"c",smil1
  LET s2 = cutAtByteLength(s, 14) --last smil1 is cut off because the byte boundary falls inside it
  DISPLAY SFMT("s:%1 slen:%2 s2:%3 s2len:%4", s, length(s), s2, length(s2))
  LET s = smil1,smil2,"x"
  LET s2 = cutAtByteLength(s, 8)
  DISPLAY SFMT("s:%1 slen:%2 s2:%3 s2len:%4", s, length(s), s2, length(s2))
END MAIN

FUNCTION cutAtByteLength(s STRING, maxByteLength INT) RETURNS STRING
  VAR l = s.getLength() --byte length whole string
  DISPLAY sfmt("source:'%1' has length:%2",s,l)
  VAR i = 1
  WHILE i <= l
    VAR c = s.getCharAt(i)
    VAR clen = c.getLength() --byte length one char
    LET i += clen
    DISPLAY SFMT("i:%1,c:%2,len:%3", i, c, c.getLength())
    IF i - 1 > maxByteLength THEN
      RETURN s.subString(1, i - 1 - clen)
    END IF
  END WHILE
  RETURN s
END FUNCTION
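
For the original requirement (several records of at most 200 bytes each), the same loop can be extended to produce every piece instead of only the first one. This is a minimal sketch, assuming BYTE length semantics and assuming getCharAt() returns a complete character as described above. The function splitAtByteLength and its parameter names are hypothetical, not part of Genero:

```genero
-- Hypothetical helper (not a Genero built-in): fills "parts" with consecutive
-- pieces of "s", each at most maxByteLength bytes, cut between characters.
-- Assumes BYTE length semantics and maxByteLength >= 4 (longest UTF-8 character).
FUNCTION splitAtByteLength(s STRING, maxByteLength INT, parts DYNAMIC ARRAY OF STRING)
  VAR l = s.getLength() --byte length whole string
  VAR i = 1             --byte index of the current character
  VAR chunkStart = 1    --byte index where the current piece starts
  CALL parts.clear()
  WHILE i <= l
    VAR c = s.getCharAt(i)
    VAR clen = c.getLength() --byte length one char
    --close the current piece if this character would no longer fit
    IF i + clen - chunkStart > maxByteLength AND i > chunkStart THEN
      LET parts[parts.getLength() + 1] = s.subString(chunkStart, i - 1)
      LET chunkStart = i
    END IF
    LET i += clen
  END WHILE
  IF chunkStart <= l THEN
    LET parts[parts.getLength() + 1] = s.subString(chunkStart, l) --last piece
  END IF
END FUNCTION
```

Called with the 17-byte test string from the program above and a limit of 8, this should give three pieces of 8, 5 and 4 bytes. With a limit of 200, each piece can then go into its own record. The dynamic array is passed by reference, so the caller sees the pieces after the call.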

Francois G.

Thanks all.

My current challenge is how to parse / handle compound emojis such as "Woman Facepalming: Dark Skin Tone" (one of a family of Unicode character sequences called "Emoji ZWJ Sequences"):
🤦🏿‍♀️

One emoji, displayed over a visual length of 2 (takes 2 "spaces"), is actually a sequence of:
person facepalming, 4 bytes
colour black, 4 bytes
zero width joiner, 3 bytes
female, 3 bytes
zero width joiner, 3 bytes (that last one I am not sure: I can see this in my debugger, but not in the emoji docs)

so this emoji requires 17 bytes (unless I got my calculation wrong).

If I "split" at any intermediate element of this sequence, I will corrupt the emoji.
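
The byte count can be checked with the same getCharAt() loop used earlier in the thread. This is a small sketch, assuming BYTE length semantics and assuming the emoji is the sequence U+1F926, U+1F3FF, U+200D, U+2640, U+FE0F. Under that assumption the last element is variation selector-16 rather than a second zero width joiner; it also takes 3 bytes in UTF-8.

```genero
IMPORT util

MAIN
  DEFINE emoji STRING
  IF fgl_getenv("FGL_LENGTH_SEMANTICS") == "CHAR" THEN
    DISPLAY "must not have CHAR length semantics to count bytes"
    RETURN
  END IF
  --JSON escapes as in the program above; code points beyond the basic plane
  --are written as surrogate pairs: U+1F926 = D83E DD26, U+1F3FF = D83C DFFF,
  --followed by U+200D, U+2640 and U+FE0F
  CALL util.JSON.parse('"\\uD83E\\uDD26\\uD83C\\uDFFF\\u200D\\u2640\\uFE0F"', emoji)
  VAR l = emoji.getLength() --byte length whole sequence
  VAR i = 1
  WHILE i <= l
    VAR c = emoji.getCharAt(i)
    DISPLAY SFMT("element at byte %1 has byte length %2", i, c.getLength())
    LET i += c.getLength()
  END WHILE
  DISPLAY SFMT("total byte length:%1", l)
END MAIN
```

By the UTF-8 encoding rules the five elements take 4 + 4 + 3 + 3 + 3 bytes, which matches the 17 bytes counted above. A loop like this sees five separate characters, so a cut routine built on it can still split the emoji between two of them. Keeping the emoji whole would need extra logic that treats joiners, variation selectors and skin tone modifiers as attached to the preceding character.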