Unicode and Encoding Validation
NeNe applications often handle multilingual input — Japanese, Arabic, emoji, and other non-ASCII text. This guide covers the common pitfalls.
JSON responses: non-ASCII is NOT escaped
JsonResponder includes JSON_UNESCAPED_UNICODE so Japanese and emoji are returned as-is:
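As an illustration of what the flag changes, the sketch below calls PHP's built-in json_encode directly. It is not JsonResponder's actual code, and the payload is made up for the example.

```php
<?php
declare(strict_types=1);

// Hypothetical payload, for illustration only.
$payload = ['name' => 'あ', 'note' => 'Hello 👋'];

// Default behaviour: every non-ASCII character becomes a \uXXXX escape.
$escaped = json_encode($payload, JSON_THROW_ON_ERROR);

// With JSON_UNESCAPED_UNICODE the characters are emitted as raw UTF-8.
$unescaped = json_encode($payload, JSON_THROW_ON_ERROR | JSON_UNESCAPED_UNICODE);

echo $escaped, PHP_EOL;   // {"name":"\u3042","note":"Hello \ud83d\udc4b"}
echo $unescaped, PHP_EOL; // {"name":"あ","note":"Hello 👋"}
```

Both strings decode to the same data; the unescaped form is shorter and easier to read in logs and browser tools.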
Note for <script> context: View::encodeScriptJson() intentionally does NOT use JSON_UNESCAPED_UNICODE because it outputs JSON inside an HTML <script> tag where unescaped line terminators (U+2028, U+2029) could break the script block. Use json_encode with the JSON_HEX_* flags for that context.
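A minimal sketch of encoding for a script block, using only PHP's built-in json_encode. The function name scriptSafeJson is hypothetical, and this is not the source of View::encodeScriptJson().

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: encode data for embedding inside an HTML <script> block.
 */
function scriptSafeJson(mixed $data): string
{
    // The JSON_HEX_* flags turn <, >, &, ' and " inside string values into
    // \uXXXX escapes, so a value such as "</script>" cannot end the block
    // early. JSON_UNESCAPED_UNICODE is left out on purpose, so non-ASCII
    // characters (including U+2028 and U+2029) are escaped as well.
    return json_encode(
        $data,
        JSON_THROW_ON_ERROR | JSON_HEX_TAG | JSON_HEX_AMP | JSON_HEX_APOS | JSON_HEX_QUOT
    );
}

// Hypothetical hostile value: a closing tag plus a raw line separator.
$json = scriptSafeJson(['msg' => "</script>\u{2028}あ"]);
echo $json, PHP_EOL;
```

The output contains only ASCII and no literal angle brackets, yet it still decodes to the original value on the JavaScript side.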
String length: use mb_strlen, not strlen
strlen counts bytes, not characters. For multibyte text the byte count is wrong:
strlen('あ') // 3 (UTF-8 bytes)
mb_strlen('あ', 'UTF-8') // 1 (character)
strlen('Hello 👋') // 10 (6 + 4 bytes for the emoji)
mb_strlen('Hello 👋', 'UTF-8') // 7 (characters)
Always validate length with mb_strlen:
$title = trim((string)($this->REQUEST_JSON['title'] ?? ''));
if (mb_strlen($title, 'UTF-8') > 255) {
    return $this->API_RESPONSE->failure('TITLE-TOO-LONG');
}
Grapheme clusters and emoji sequences
mb_strlen counts Unicode code points, not grapheme clusters (rendered glyphs). A family emoji such as 👨‍👩‍👧 is one visible glyph but 5 code points: three emoji joined by two U+200D Zero Width Joiner characters. mb_strlen returns 5, not 1.
For a "50 characters as displayed" limit, use grapheme_strlen() from the intl extension:
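A minimal sketch of such a check. The helper name displayLength and the BIO-TOO-LONG code are hypothetical, and the fallback to mb_strlen when intl is not loaded is an assumption made for this example, not NeNe behavior.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: count user-perceived characters (grapheme clusters).
 * Falls back to counting code points when the intl extension is not loaded
 * or when grapheme_strlen() rejects the input.
 */
function displayLength(string $text): int
{
    if (function_exists('grapheme_strlen')) {
        $count = grapheme_strlen($text);
        if (is_int($count)) {
            return $count;
        }
    }
    return mb_strlen($text, 'UTF-8');
}

// "e" followed by U+0301 (combining acute accent) renders as a single glyph.
$accented = "e\u{0301}";

echo mb_strlen($accented, 'UTF-8'), PHP_EOL; // 2 code points
echo displayLength($accented), PHP_EOL;      // 1 when intl is available

// Enforcing a "50 characters as displayed" limit on a hypothetical field.
$bio = 'こんにちは';
if (displayLength($bio) > 50) {
    echo 'BIO-TOO-LONG', PHP_EOL;
}
```

How emoji sequences are grouped depends on the ICU version bundled with intl, so counts for newer emoji can differ between servers.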
For most practical applications mb_strlen is sufficient — users understand that multi-codepoint emoji "use more characters." Only use grapheme_strlen when the user-visible count must be exact.
Null bytes
SQLite (and MySQL) TEXT columns accept null bytes. A value like "Alice\x00Bob" is stored and retrieved without error. PHP itself is null-byte safe in modern versions, but some library functions (file path operations, certain C extensions) treat \x00 as a string terminator.
Reject null bytes explicitly for any user-supplied string:
$value = (string)($this->REQUEST_JSON['name'] ?? '');
if (str_contains($value, "\x00")) {
    return $this->API_RESPONSE->failure('INVALID-INPUT');
}
Alternatively, strip null bytes silently if your use case treats them as noise:
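One way to do the stripping, as a sketch; the helper name stripNullBytes is hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: remove every null byte from a string.
 */
function stripNullBytes(string $value): string
{
    // str_replace works on bytes and removes all occurrences. trim() would
    // only remove null bytes at the start or end of the string.
    return str_replace("\x00", '', $value);
}

echo stripNullBytes("Alice\x00Bob"), PHP_EOL; // AliceBob
```

Stripping changes what the user submitted, so it suits free-text fields better than identifiers or passwords, where rejecting the input is usually the safer choice.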
LIKE search with non-ASCII
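How LIKE treats case depends on the database and the collation. SQLite's built-in LIKE is case-insensitive for ASCII letters only; non-ASCII characters are compared case-sensitively. In MySQL the column collation decides: utf8mb4_unicode_ci folds case for Latin characters, and whether Hiragana and Katakana compare as equal or as distinct also depends on the collation, so do not assume either behavior for Japanese text.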
For a multilingual search box, test with the actual target collation (utf8mb4_unicode_ci is the NeNe default).
Common validation snippet
A reusable pattern for validated string inputs:
/**
 * Validate and clean a user-supplied UTF-8 string.
 *
 * @return string|null Cleaned string, or null if validation fails.
 */
private function validateString(mixed $raw, int $maxChars): ?string
{
    if (!is_string($raw)) {
        return null;
    }
    // Check for null bytes before trimming: trim() strips leading and
    // trailing "\x00" by default, which would hide them from this check.
    if (str_contains($raw, "\x00")) {
        return null;
    }
    $value = trim($raw);
    if ($value === '') {
        return null;
    }
    if (mb_strlen($value, 'UTF-8') > $maxChars) {
        return null;
    }
    return $value;
}
Related
docs/development/sql-injection.md: SQL injection defense, including LIKE wildcard escaping
docs/tutorials/building-a-service.md: controller and mapper patterns