JavaScript/TypeScript Handling Surrogate Pairs


I needed to convert a string to an array of ASCII codes at work. I didn’t have to care about multibyte characters there, but I wanted to know how to handle them, so I decided to write this article.


Convert string to ASCII/UTF code array

There are two string methods that we can use to convert a string to a number array: charCodeAt and codePointAt.

function byCharCodeAt(str: string): number[] {
    return str.split("").map((x) => x.charCodeAt(0));
}
function byCodePointAt(str: string): number[] {
    return str.split("").map((x) => x.codePointAt(0)!);
}

Both behave the same if the string doesn’t contain any surrogate pairs. I also defined the following two functions, which walk the string index by index. byCharCodeAt2 converts the string to a number array with a fixed number of elements (16 here, padded with zeros).

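function byCharCodeAt2(str: string): number[] {
    // Fixed-size result: 16 slots, zero-padded when the string is shorter.
    const result = new Array<number>(16).fill(0);
    // Stop at the array size so a longer string cannot grow the result.
    const end = Math.min(str.length, result.length);
    for (let i = 0; i < end; i++) {
        result[i] = str.charCodeAt(i);
    }
    return result;
}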
function byCodePointAt2(str: string): number[] {
    const result = [];
    for (let i = 0; i < str.length; i++) {
        result.push(str.codePointAt(i)!);
    }
    return result;
}

Let’s check the difference between these functions. Since I will pass several different strings to them, I created the following helper function.

function compare(str: string): void {
    console.log(str);
    console.log("length        : " + str.length);
    console.log("split         : " + str.split(""));
    console.log("spread        : " + [...str]);
    console.log("Array.from    : " + Array.from(str));
    const result1 = byCharCodeAt(str);
    console.log("byCharCodeAt  : " + result1);
    console.log("  --> " + String.fromCharCode(...result1));

    const result2 = byCodePointAt(str);
    console.log("byCodePointAt : " + result2);
    console.log("  --> " + String.fromCodePoint(...result2));

    const result3 = byCodePointAt2(str);
    console.log("byCodePointAt2: " + result3);
    console.log("  --> " + String.fromCodePoint(...result3));
    console.log();
}

Let’s pass “Hello World!” first.

compare("Hello World!");
// Hello World!
// length        : 12
// split         : H,e,l,l,o, ,W,o,r,l,d,!
// spread        : H,e,l,l,o, ,W,o,r,l,d,!
// Array.from    : H,e,l,l,o, ,W,o,r,l,d,!
// byCharCodeAt  : 72,101,108,108,111,32,87,111,114,108,100,33
//   --> Hello World!
// byCodePointAt : 72,101,108,108,111,32,87,111,114,108,100,33
//   --> Hello World!
// byCodePointAt2: 72,101,108,108,111,32,87,111,114,108,100,33
//   --> Hello World!

The results are the same because the string consists only of ASCII characters. If the text is plain English, we can use either function.
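
If the conversion is only meant for ASCII text, it can help to check that assumption before converting. The following is a minimal sketch; isAscii is a hypothetical helper name that does not appear elsewhere in this article.

```typescript
// Hypothetical helper: true when every UTF-16 code unit is in the ASCII range.
function isAscii(str: string): boolean {
    for (let i = 0; i < str.length; i++) {
        // ASCII covers 0x00 to 0x7f; anything above is not ASCII.
        if (str.charCodeAt(i) > 0x7f) {
            return false;
        }
    }
    return true;
}

console.log(isAscii("Hello World!")); // true
console.log(isAscii("こんにちは")); // false
```

Checking code units is enough here because both halves of a surrogate pair are far above 0x7f.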


Multibyte characters

The next example uses multibyte characters. Japanese characters are multibyte in common encodings. Let’s try “Hello World!” in Japanese.

compare("こんにちは世界!");
// こんにちは世界!
// length        : 8
// split         : こ,ん,に,ち,は,世,界,!
// spread        : こ,ん,に,ち,は,世,界,!
// Array.from    : こ,ん,に,ち,は,世,界,!
// byCharCodeAt  : 12371,12435,12395,12385,12399,19990,30028,65281
//   --> こんにちは世界!
// byCodePointAt : 12371,12435,12395,12385,12399,19990,30028,65281
//   --> こんにちは世界!
// byCodePointAt2: 12371,12435,12395,12385,12399,19990,30028,65281
//   --> こんにちは世界!

The characters are multibyte, but the length is correct because each of them fits in a single 16-bit code unit. As far as I remember, we need to handle multibyte characters in a special way in C++, but not in JavaScript/TypeScript.

Surrogate pairs

Some symbols and characters are represented by two 16-bit code units. Such a pair is called a surrogate pair. We need to take care of this slightly more complex case.
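
The relationship between the two code units and the single code point is plain arithmetic defined by UTF-16. The following sketch shows it for the emoji 😀; combineSurrogates and toSurrogates are hypothetical helper names, and neither function validates its input.

```typescript
// Combine a high (leading) and a low (trailing) surrogate into one code point.
// Assumes high is in 0xD800-0xDBFF and low is in 0xDC00-0xDFFF.
function combineSurrogates(high: number, low: number): number {
    return (high - 0xd800) * 0x400 + (low - 0xdc00) + 0x10000;
}

// Split a code point above U+FFFF into its [high, low] surrogate pair.
function toSurrogates(codePoint: number): [number, number] {
    const offset = codePoint - 0x10000;
    return [0xd800 + (offset >> 10), 0xdc00 + (offset & 0x3ff)];
}

console.log(combineSurrogates(55357, 56832)); // 128512
console.log(toSurrogates(128512)); // [ 55357, 56832 ]
console.log(String.fromCharCode(...toSurrogates(128512))); // 😀
```

In practice codePointAt and String.fromCodePoint do this for us; the sketch is only meant to show what happens underneath.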

Length of a string containing surrogate pairs

Checking the length of a string is very common, but if we get the length of such a string in the usual way, the result is different from what we expect.

const str = "😀🌕";
console.log("length: " + str.length); // 4

The length is 4 even though there are only two emojis. A JavaScript string is a sequence of 16-bit UTF-16 code units, and length counts those units. Since each of these emojis consists of two code units, the length is 4 here.
Then, how can we get the actual number of characters? See the following results.

console.log("split     : " + str.split(""));    // split     : �,�,�,�
console.log("spread    : " + [...str]);         // spread    : 😀,🌕
console.log("Array.from: " + Array.from(str));  // Array.from: 😀,🌕
console.log("spread    : " + [...str].length);          // 2
console.log("Array.from: " + Array.from(str).length);   // 2

split("") doesn’t suit this case because it cuts each pair into two lone halves, but the other two can be used. Both the spread operator and Array.from iterate the string by code point, so the number of elements is two.
If we need to handle surrogate pairs, let’s use one of them.
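
Both approaches rely on the string iterator, which walks the string by code point. When only the count is needed, a for...of loop uses the same iterator without building an intermediate array. This is a small sketch; countCodePoints is a hypothetical name.

```typescript
// Count code points by iterating the string instead of spreading it.
function countCodePoints(str: string): number {
    let count = 0;
    // for...of yields one whole code point per step, so a surrogate
    // pair is counted once.
    for (const _ch of str) {
        count++;
    }
    return count;
}

console.log(countCodePoints("😀🌕")); // 2
console.log(countCodePoints("Hello World!")); // 12
```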

The three dots are the spread syntax, which expands an iterable such as a string into its individual elements.

Code points of surrogate pairs

We have already seen the two methods charCodeAt and codePointAt. Here we can see the difference between them.

compare("😀🌕");
// 😀🌕
// length        : 4
// split         : �,�,�,�
// spread        : 😀,🌕
// Array.from    : 😀,🌕
// byCharCodeAt  : 55357,56832,55356,57109
//   --> 😀🌕
// byCodePointAt : 55357,56832,55356,57109
//   --> 😀🌕
// byCodePointAt2: 128512,56832,127765,57109
//   --> 😀�🌕�

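charCodeAt returns the 16-bit code unit at the given index, while codePointAt returns the whole code point (a value up to 0x10FFFF) when the index points at the first half of a surrogate pair. If we want the code point of such a character, we have to use codePointAt instead of charCodeAt. Note that codePointAt returns the same value as charCodeAt when the index points at the second half of a pair, or when it is applied to the lone halves produced by split(""). That is why byCodePointAt prints the same numbers as byCharCodeAt, and why byCodePointAt2 contains two extra values.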

character    😀              🌕
index        0       1       2       3
charCodeAt   55357   56832   55356   57109
codePointAt  128512  56832   127765  57109

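The values at indexes 1 and 3 are the second halves of the pairs and are already included in the code points at indexes 0 and 2, so they are not necessary. If we want to get only the two code points, we can implement it as follows.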

const str = "😀🌕";
const result4 = [...str].map((x) => x.codePointAt(0)!);
console.log("byCodePointAt3: " + result4);
console.log("  --> " + String.fromCodePoint(...result4));
// byCodePointAt3: 128512,127765
//   --> 😀🌕
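
An index-based loop can produce the same result if it skips the second half of each pair. The following sketch is a corrected variant of byCodePointAt2; codePointsByIndex is a hypothetical name.

```typescript
// Walk the string by index and step over trailing surrogates.
function codePointsByIndex(str: string): number[] {
    const result: number[] = [];
    let i = 0;
    while (i < str.length) {
        const cp = str.codePointAt(i);
        if (cp === undefined) {
            break; // not reachable while i < str.length; satisfies the type checker
        }
        result.push(cp);
        // A code point above 0xFFFF occupies two code units, so advance by 2.
        i += cp > 0xffff ? 2 : 1;
    }
    return result;
}

console.log(codePointsByIndex("😀🌕")); // [ 128512, 127765 ]
```

This form is handy when the index itself is needed, for example to remember where each character starts in the original string.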

Even if a string contains surrogate pairs, we can handle it as expected with the following functions.

function getLength(str: string): number {
    return [...str].length;
}

function getStringCodePointArray(str: string): number[] {
    return [...str].map((x) => x.codePointAt(0)!);
}

const str = "Great😀";
console.log(getLength(str)); // 6
console.log(getStringCodePointArray(str));
// [ 71, 114, 101, 97, 116, 128512 ]

End

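If we need to store string data in a database, we need to be aware of surrogate pairs because of data size limitations: such a character takes two UTF-16 code units (and four bytes in UTF-8), so a check based on str.length may not match the size that is actually stored. I have never run into this case myself, but it is good to know.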
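
To make the size difference concrete, the sketch below compares the three ways of measuring the same string. It assumes the column stores UTF-8 and that TextEncoder is available as a global (it is in current browsers and Node.js versions); utf8ByteLength is a hypothetical helper name.

```typescript
// Number of bytes the string occupies when encoded as UTF-8.
function utf8ByteLength(str: string): number {
    return new TextEncoder().encode(str).length;
}

const sample = "Great😀";
console.log(sample.length); // 7 (UTF-16 code units)
console.log([...sample].length); // 6 (code points)
console.log(utf8ByteLength(sample)); // 9 (UTF-8 bytes: 5 for "Great" + 4 for the emoji)
```

Which of the three numbers matters depends on how the database defines its limit, so it is worth checking the column definition before validating input.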
