Title: how to get the length in bytes of a UTF-8 character?
Post by: Francois G. on March 12, 2024, 09:48:24 am

Hi,

Given LANG=en_GB.utf8 and FGL_LENGTH_SEMANTICS unset (defaulting to byte), and given text typed in by the user containing emojis, accents, the UK pound sign (£), and quotes and double quotes from a word processor (3-byte UTF-8 characters), I need to be able to split this text into several Informix records (each 200 bytes long) at a UTF-8 character boundary (not at a byte boundary). I also need to cut off a string at a certain byte length without damaging any UTF-8 character towards the end (cutting earlier than the byte length, so that all multi-byte UTF-8 characters keep their integrity).

How do I iterate through a STRING one multi-byte UTF-8 character at a time (not one byte at a time) without changing FGL_LENGTH_SEMANTICS?

I am currently using a horrible hack:

    FOR f_i = 1 TO f_note_text.getLength()
        LET f_utf8 = f_note_text.getCharAt(f_i)
        LET f_b64 = util.Strings.base64EncodeFromString(f_utf8)
        IF f_b64.getLength() = 4 AND f_b64 MATCHES "*==" THEN
            # 1 byte: 7-bit US ASCII (0x7F or less)
        ELSE
            # More than 1 byte, or a byte above 0x7F: either a valid part of a
            # UTF-8 character, or an invalid part of a broken UTF-8 character
        END IF
    END FOR

With these semantics, ORD() and ASCII() are less than useful for this purpose. I can also use "hexdump -C" from the Linux backend and get all the bytes that I need, but this is very heavy and slow. In C# there is a function for this: bytes = Encoding.UTF8.GetBytes(f_note_text); Finally, I could also write a C shared library (SO file) to do this.

Is there a native 4GL / BDL way to achieve this?

Regards,

Title: Re: how to get the length in bytes of a UTF-8 character?
Post by: Sebastien F. on March 12, 2024, 09:53:00 am

Hello François,
NO!!! You should not have to write such code to split UTF-8 strings at byte level to insert the text into the SQL database! Please let's have a call to discuss this.

Seb

Title: Re: how to get the length in bytes of a UTF-8 character?
Post by: Francois G. on March 12, 2024, 09:56:51 am

Thanks Seb,
We usually have a call with Michelle and Neil on Monday mornings: would you be able to attend, or at least discuss with Neil?

Regards,

Title: Re: how to get the length in bytes of a UTF-8 character?
Post by: Sebastien F. on March 12, 2024, 10:08:28 am

No need to wait for Monday.
Please contact me directly.

Seb

Title: Re: how to get the length in bytes of a UTF-8 character?
Post by: Rene S. on March 13, 2024, 09:03:03 am

Try this:
Code
BTW: if the string does not contain double-width East Asian characters, fgl_width will do the job.

Title: Re: how to get the length in bytes of a UTF-8 character?
Post by: Rene S. on March 13, 2024, 09:06:16 am

The program above produces:
Code
The number of logical chars does not depend on FGL_LENGTH_SEMANTICS.

Title: Re: how to get the length in bytes of a UTF-8 character?
Post by: Leo S. on March 13, 2024, 10:34:02 am

Hi Francois,

To cut at a byte boundary of x you even *must* use FGL_LENGTH_SEMANTICS=BYTE, because getLength of the String then returns the number of bytes.
So Rene demonstrated how easy it is to iterate through the string: as long as you advance the index passed to getCharAt by the byte length of the current character, you always get the correct character. If the index goes beyond your byte boundary, just go one position back and do a subString.

HTH,
Leo

Code
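As an illustration of the approach described in this post (this is not the BDL code posted in the thread), here is a minimal sketch of the same byte logic in Python. It assumes the input is already valid UTF-8; the function names `utf8_char_len` and `truncate_utf8` are made up for the example. The byte length of each character is read from its first byte, the index advances by that length, and the cut happens before the first character that would cross the limit.

```python
def utf8_char_len(lead_byte: int) -> int:
    """Byte length of a UTF-8 character, judged from its first byte."""
    if lead_byte < 0x80:
        return 1  # 0xxxxxxx: 7-bit ASCII
    if lead_byte >> 5 == 0b110:
        return 2  # 110xxxxx
    if lead_byte >> 4 == 0b1110:
        return 3  # 1110xxxx
    if lead_byte >> 3 == 0b11110:
        return 4  # 11110xxx
    raise ValueError("not a UTF-8 lead byte")


def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Cut data to at most max_bytes without splitting a character.

    Assumes data is valid UTF-8 (hypothetical helper for illustration).
    """
    pos = 0
    while pos < len(data):
        step = utf8_char_len(data[pos])
        if pos + step > max_bytes:
            break  # the next character would cross the limit: stop before it
        pos += step
    return data[:pos]


# Pound sign (2 bytes), curly quotes (3 bytes each), emoji (4 bytes).
sample = "£5 “ok” 🙂".encode("utf-8")
print(len(sample))
print(truncate_utf8(sample, 6).decode("utf-8"))
```

With a limit of 6 bytes the opening curly quote would end at byte 7, so the cut falls after the space at byte 4 and the result is shorter than the limit rather than containing a broken character.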
Title: Re: how to get the length in bytes of a UTF-8 character?
Post by: Leo S. on March 13, 2024, 11:33:00 am

With Rene's help, here is a program that is displayed correctly here: util.JSON.parse helps to import non-ASCII symbols in a form that is printable in this forum.
Code
Title: Re: how to get the length in bytes of a UTF-8 character?
Post by: Francois G. on March 13, 2024, 11:42:39 am

Thanks all.
My current challenge is how to parse / handle compound emojis such as "Woman Facepalming: Dark Skin Tone" (part of a family of UTF-8 characters called "Emoji ZWJ Sequence"): 🤦🏿♀️

This one emoji, displayed over a visual length of 2 (it takes 2 "spaces"), is actually a sequence of:

    person facepalming, 4 bytes
    colour black, 4 bytes
    zero width joiner, 3 bytes
    female, 3 bytes
    zero width joiner, 3 bytes (that last one I am not sure about: I can see it in my debugger, but not in the emoji docs)

so this emoji requires 17 bytes (unless I got my calculation wrong). If I "split" at any intermediate element of this sequence, I will corrupt the emoji.
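To make the problem concrete, here is a hedged Python sketch (not BDL, and not a full implementation of Unicode text segmentation). It assumes the sequence is U+1F926 person facepalming, U+1F3FF dark skin tone, U+200D zero width joiner, U+2640 female sign and U+FE0F variation selector-16. If that is right, the unexplained trailing 3 bytes would be the variation selector rather than a second joiner, and the total is indeed 17 bytes. The helpers `extends_previous`, `clusters` and `split_records` are hypothetical names, and the attachment rule is deliberately simplified: joiners, variation selectors, skin-tone modifiers and combining marks stick to what precedes them, and whatever follows a joiner sticks too.

```python
import unicodedata

ZWJ = "\u200d"


def extends_previous(ch: str) -> bool:
    """True if ch should stay attached to the code point before it (simplified)."""
    cp = ord(ch)
    return (
        ch == ZWJ
        or 0xFE00 <= cp <= 0xFE0F          # variation selectors
        or 0x1F3FB <= cp <= 0x1F3FF        # emoji skin-tone modifiers
        or unicodedata.combining(ch) != 0  # combining accents
    )


def clusters(text: str):
    """Yield user-visible units, keeping ZWJ sequences in one piece."""
    current = ""
    for ch in text:
        if current and (extends_previous(ch) or current.endswith(ZWJ)):
            current += ch
        else:
            if current:
                yield current
            current = ch
    if current:
        yield current


def split_records(text: str, max_bytes: int):
    """Split text into chunks of at most max_bytes UTF-8 bytes each."""
    records, current, size = [], "", 0
    for unit in clusters(text):
        n = len(unit.encode("utf-8"))
        if n > max_bytes:
            raise ValueError("a single cluster exceeds the record size")
        if size + n > max_bytes:
            records.append(current)  # close the record before the unit that overflows
            current, size = "", 0
        current += unit
        size += n
    if current:
        records.append(current)
    return records


facepalm = "\U0001F926\U0001F3FF\u200D\u2640\uFE0F"
print(len(facepalm.encode("utf-8")))  # 17
print(split_records("ab" + facepalm + "cd", 18))
```

With an 18-byte record size, the example yields three records: "ab", then the whole emoji followed by "c", then "d". A production version would need the full grapheme cluster rules (flags, Hangul, and so on), which this sketch does not attempt.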