Author Topic: how to get the length in bytes of a UTF-8 character?  (Read 798 times)
Francois G.
Posts: 20


« on: March 12, 2024, 09:48:24 am »

Hi,

Given LANG=en_GB.utf8 and FGL_LENGTH_SEMANTICS unset (defaulting to BYTE), I have text typed in by the user that contains emojis, accented letters, the UK pound sign (£), and the curly single and double quotes produced by a word processor (3-byte UTF-8 characters). I need to do two things: split this text into several Informix records (each 200 bytes long) at a UTF-8 character boundary rather than at an arbitrary byte boundary, and truncate a string to a given byte length without damaging a multi-byte character at the end (that is, cut earlier than the byte limit so that every multi-byte UTF-8 character stays intact).

How do I iterate through a STRING one multi-byte UTF-8 character at a time (not one byte at a time) without changing FGL_LENGTH_SEMANTICS?

I am currently using a horrible hack:

    FOR f_i = 1 TO f_note_text.getLength()
        LET f_utf8 = f_note_text.getCharAt(f_i)
        LET f_b64  = util.Strings.base64EncodeFromString(f_utf8)
        IF f_b64.getLength() = 4 AND f_b64 MATCHES "*==" THEN
            # 1 byte: 7-bit US-ASCII (0x00 to 0x7F)
        ELSE
            # Byte above 0x7F: part of a multi-byte UTF-8 character, either valid or broken
        END IF
    END FOR
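
For reference, the base64 test above amounts to checking whether a byte is below 0x80. In UTF-8 the lead byte alone determines how many bytes the character occupies. The following is a minimal sketch in Python (not BDL), only to illustrate the byte layout; the helper names `utf8_sequence_length` and `char_byte_lengths` are made up for this example and do not correspond to any Genero API.

```python
def utf8_sequence_length(lead_byte):
    """Return the byte length of the UTF-8 character starting with lead_byte.

    Raises ValueError for a continuation byte (10xxxxxx) or an invalid lead byte.
    """
    if lead_byte < 0x80:            # 0xxxxxxx: 7-bit US-ASCII
        return 1
    if 0xC2 <= lead_byte <= 0xDF:   # 110xxxxx: 2-byte sequence (e.g. the pound sign)
        return 2
    if 0xE0 <= lead_byte <= 0xEF:   # 1110xxxx: 3-byte sequence (e.g. curly quotes)
        return 3
    if 0xF0 <= lead_byte <= 0xF4:   # 11110xxx: 4-byte sequence (e.g. most emojis)
        return 4
    raise ValueError("not a valid UTF-8 lead byte: 0x%02X" % lead_byte)


def char_byte_lengths(text):
    """Walk the UTF-8 bytes of text one character at a time; return each length."""
    data = text.encode("utf-8")
    lengths = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        lengths.append(n)
        i += n                      # jump to the lead byte of the next character
    return lengths


print(char_byte_lengths("a\u00a3\u20ac\U0001F601"))  # ASCII, pound, euro, emoji
```

The sketch assumes well-formed input: it does not validate the continuation bytes that follow each lead byte.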

With byte length semantics, ORD() and ASCII() are of little use for this purpose.

I could also run "hexdump -C" on the Linux backend and get all the bytes I need, but this is very heavy and slow.

In C# there is a function for this:

    bytes = Encoding.UTF8.GetBytes(f_note_text);

Finally I could also write a C shared library (SO file) to do this.

Is there a native 4GL / BDL way to achieve this?

Regards,
Sebastien F.
Four Js
Posts: 509


« Reply #1 on: March 12, 2024, 09:53:00 am »

Hello François,

NO!!! You should not have to write such code to split UTF-8 strings at byte level to insert the text into the SQL database!

Please let's have a call to discuss this.

Seb
Francois G.
Posts: 20


« Reply #2 on: March 12, 2024, 09:56:51 am »

Thanks Seb,

We usually have a call with Michelle and Neil on Monday mornings: would you be able to attend, or at least discuss with Neil?

Regards,
Sebastien F.
Four Js
Posts: 509


« Reply #3 on: March 12, 2024, 10:08:28 am »

No need to wait for Monday
Please contact me directly
Seb
Rene S.
Four Js
Posts: 111


« Reply #4 on: March 13, 2024, 09:03:03 am »

Try this:

Code
    MAIN
        DEFINE s STRING
        LET s = 'ab€c d'
        DISPLAY SFMT("s:%1 l:%2 w:%3 number_of_logical_chars:%4",
            s, length(s), fgl_width(s), number_of_logical_chars(s))
        LET s = '123456'
        DISPLAY SFMT("s:%1 l:%2 w:%3 number_of_logical_chars:%4",
            s, length(s), fgl_width(s), number_of_logical_chars(s))
    END MAIN

    -- Counts characters, not bytes: advance by the length of each character
    -- returned by getCharAt (more than 1 for a multi-byte character under BYTE semantics)
    FUNCTION number_of_logical_chars(s STRING) RETURNS INT
        DEFINE i, l, n INT

        LET l = s.getLength()
        LET i = 1
        LET n = 0
        WHILE i <= l
            VAR c = s.getCharAt(i)
            LET i += c.getLength()
            LET n += 1
        END WHILE
        RETURN n
    END FUNCTION

BTW: if the string does not contain double-width East Asian characters, fgl_width will do the job.
Rene S.
Four Js
Posts: 111


« Reply #5 on: March 13, 2024, 09:06:16 am »

The program above produces:
Code
    $ FGL_LENGTH_SEMANTICS=BYTE fgl ll1
    s:ab€c d l:8 w:6 number_of_logical_chars:6
    s:123456 l:6 w:6 number_of_logical_chars:6
    $ FGL_LENGTH_SEMANTICS=CHAR fgl ll1
    s:ab€c d l:6 w:6 number_of_logical_chars:6
    s:123456 l:6 w:6 number_of_logical_chars:6

The number of logical characters does not depend on FGL_LENGTH_SEMANTICS.
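
The byte and character counts in that output can be cross-checked outside BDL. A small sketch in Python (used here only as a neutral way to count UTF-8 bytes; `utf8_lengths` is a name made up for this example): the euro sign takes 3 bytes, which accounts for the difference between 8 bytes and 6 characters.

```python
def utf8_lengths(text):
    """Return (byte length, code point count) of text encoded as UTF-8."""
    return len(text.encode("utf-8")), len(text)


for sample in ("ab€c d", "123456"):
    byte_len, char_count = utf8_lengths(sample)
    print("s:%s bytes:%d chars:%d" % (sample, byte_len, char_count))
# prints:
# s:ab€c d bytes:8 chars:6
# s:123456 bytes:6 chars:6
```

Note that Python counts code points here, which matches the logical character count only as long as the text contains no combining sequences.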
Leo S.
Four Js
Posts: 126


« Reply #6 on: March 13, 2024, 10:34:02 am »

Hi Francois, to cut at a byte boundary x you actually *must* use FGL_LENGTH_SEMANTICS=BYTE, because then getLength() of the STRING returns the number of bytes.
Rene demonstrated how easy it is to iterate through the string: as long as you advance the index passed to getCharAt() by the byte length of the character just read, you always get a complete character.
If the index goes beyond your byte boundary, just step back by the length of that last character and do a subString().

HTH , Leo
Code
    MAIN
      DEFINE s, s2 STRING
      IF fgl_getenv("FGL_LENGTH_SEMANTICS") == "CHAR" THEN
        DISPLAY "must not have CHAR length semantics to count bytes"
        RETURN
      END IF
      LET s = "🙁🙂ab£c🙁"
      LET s2 = cutAtByteLength(s, 14)
      DISPLAY SFMT("s:%1 slen:%2 s2:%3 s2len:%4", s, length(s), s2, length(s2))
      LET s = "🙁🙂x"
      LET s2 = cutAtByteLength(s, 8)
      DISPLAY SFMT("s:%1 slen:%2 s2:%3 s2len:%4", s, length(s), s2, length(s2))
    END MAIN

    FUNCTION cutAtByteLength(s STRING, maxByteLength INT) RETURNS STRING
      VAR l = s.getLength() --byte length whole string
      DISPLAY "source has length:", l
      VAR i = 1
      WHILE i <= l
        VAR c = s.getCharAt(i)
        VAR clen = c.getLength() --byte length one char
        LET i += clen
        DISPLAY SFMT("i:%1,c:%2,len:%3", i, c, c.getLength())
        IF i - 1 > maxByteLength THEN
          RETURN s.subString(1, i - 1 - clen)
        END IF
      END WHILE
      RETURN s
    END FUNCTION
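
The same walk can be extended from one cut to the repeated split asked for in the original question (records of at most 200 bytes). What follows is a minimal sketch in Python rather than BDL, so it illustrates the arithmetic only and is not a drop-in Genero function; `split_at_byte_length` is a hypothetical name, and the sketch treats each Unicode code point as one character.

```python
def split_at_byte_length(text, max_bytes):
    """Split text into chunks of at most max_bytes UTF-8 bytes each,
    never cutting inside a multi-byte character."""
    if max_bytes < 4:
        # A single UTF-8 character can need up to 4 bytes.
        raise ValueError("max_bytes must be at least 4")
    chunks = []
    current = []        # characters collected for the chunk being built
    current_bytes = 0   # UTF-8 byte length of that chunk so far
    for ch in text:
        n = len(ch.encode("utf-8"))
        if current_bytes + n > max_bytes:
            # This character would cross the limit: close the chunk before it.
            chunks.append("".join(current))
            current = [ch]
            current_bytes = n
        else:
            current.append(ch)
            current_bytes += n
    if current:
        chunks.append("".join(current))
    return chunks


# 150 pound signs are 300 bytes: one full 200-byte record plus a 100-byte remainder.
records = split_at_byte_length("\u00a3" * 150, 200)
print([len(r.encode("utf-8")) for r in records])
```

With a 14-byte limit, a 17-byte string made of two emojis, "ab", a pound sign, "c" and a final emoji splits into a 13-byte chunk and a 4-byte chunk, which agrees with the truncation shown in the BDL program.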
Leo S.
Four Js
Posts: 126


« Reply #7 on: March 13, 2024, 11:33:00 am »

With Rene's help, here is a program that displays correctly in this forum: util.JSON.parse builds the non-ASCII symbols from \u escapes, which the forum can print.
Code
    IMPORT util
    MAIN
      DEFINE s, s2, smil1, smil2, pound STRING
      IF fgl_getenv("FGL_LENGTH_SEMANTICS") == "CHAR" THEN
        DISPLAY "must not have CHAR length semantics to count bytes"
        RETURN
      END IF
      --smileys lie beyond the Unicode basic plane, so each JSON escape is a UTF-16 surrogate pair
      CALL util.JSON.parse('"\\uD83D\\uDE01"', smil1)
      DISPLAY "smil1:", smil1
      CALL util.JSON.parse('"\\uD83D\\uDE02"', smil2)
      DISPLAY "smil2:", smil2
      CALL util.JSON.parse('"\\u00a3"', pound)
      DISPLAY "pound:", pound
      LET s = smil1, smil2, "ab", pound, "c", smil1
      LET s2 = cutAtByteLength(s, 14) --last smil1 is dropped because the byte boundary falls inside it
      DISPLAY SFMT("s:%1 slen:%2 s2:%3 s2len:%4", s, length(s), s2, length(s2))
      LET s = smil1, smil2, "x"
      LET s2 = cutAtByteLength(s, 8)
      DISPLAY SFMT("s:%1 slen:%2 s2:%3 s2len:%4", s, length(s), s2, length(s2))
    END MAIN

    FUNCTION cutAtByteLength(s STRING, maxByteLength INT) RETURNS STRING
      VAR l = s.getLength() --byte length whole string
      DISPLAY SFMT("source:'%1' has length:%2", s, l)
      VAR i = 1
      WHILE i <= l
        VAR c = s.getCharAt(i)
        VAR clen = c.getLength() --byte length one char
        LET i += clen
        DISPLAY SFMT("i:%1,c:%2,len:%3", i, c, c.getLength())
        IF i - 1 > maxByteLength THEN
          RETURN s.subString(1, i - 1 - clen)
        END IF
      END WHILE
      RETURN s
    END FUNCTION
Francois G.
Posts: 20


« Reply #8 on: March 13, 2024, 11:42:39 am »

Thanks all.

My current challenge is how to handle compound emojis such as "Woman Facepalming: Dark Skin Tone" (one of the Unicode "Emoji ZWJ Sequences"):
🤦🏿‍♀️

This single emoji, displayed with a visual width of 2 (it takes 2 "spaces"), is actually a sequence of:
person facepalming, 4 bytes
dark skin tone modifier, 4 bytes
zero width joiner, 3 bytes
female sign, 3 bytes
variation selector-16 (U+FE0F), 3 bytes (my debugger shows this trailing element; I first took it for a second zero width joiner)

so this emoji requires 17 bytes.

If I "split" at any intermediate element of this sequence, I will corrupt the emoji.
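
The byte count can be checked, and the "do not split inside the sequence" rule approximated, with a short sketch. This is Python rather than BDL, and the grouping rule is a deliberate simplification, not the full Unicode grapheme cluster algorithm (UAX #29): it only glues a code point to the previous one when it is a zero width joiner, a variation selector, a skin tone modifier or a combining mark, or when it follows a joiner. The names `is_extender` and `clusters` are hypothetical. The sequence here is U+1F926, U+1F3FF, U+200D, U+2640, U+FE0F.

```python
import unicodedata

ZWJ = "\u200D"


def is_extender(ch):
    """True if ch should stay attached to the preceding code point (simplified rule)."""
    cp = ord(ch)
    return (ch == ZWJ
            or 0xFE00 <= cp <= 0xFE0F          # variation selectors
            or 0x1F3FB <= cp <= 0x1F3FF        # emoji skin tone modifiers
            or unicodedata.combining(ch) != 0)  # combining marks


def clusters(text):
    """Group code points into units that should not be split apart (approximation)."""
    result = []
    for ch in text:
        if result and (is_extender(ch) or result[-1].endswith(ZWJ)):
            result[-1] += ch
        else:
            result.append(ch)
    return result


emoji = "\U0001F926\U0001F3FF\u200D\u2640\uFE0F"
for ch in emoji:
    print("U+%04X %s: %d bytes"
          % (ord(ch), unicodedata.name(ch, "?"), len(ch.encode("utf-8"))))
print("total bytes:", len(emoji.encode("utf-8")))
print("clusters in 'a' + emoji + 'b':", len(clusters("a" + emoji + "b")))
```

A byte-limited splitter would then iterate over these clusters instead of individual characters. A single cluster can be much longer than 4 bytes (17 here), so the record size has to leave room for the longest sequence expected.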


