
Filter out invisible unicode characters from text segments #3344

Open
JiuqingSong wants to merge 1 commit into master from u/jisong/filterinvisibleunicode

@JiuqingSong
Collaborator

Summary

  • Strip invisible Unicode characters in the range U+E0000–U+EFFFF (Plane 14, which contains the Tags block U+E0000–U+E007F) inside createText so they cannot survive paste or DOM-to-model conversion. Tag characters can be used to hide instructions or other text inside HTML (see https://embracethered.com/blog/posts/2024/hiding-and-finding-text-with-unicode-tags/); without filtering, they leak into the model as normal text.
  • Meaningful invisible characters that fall outside that range (e.g. ZWSP U+200B, ZWJ U+200D, RLO U+202E, PDF U+202C) are preserved.
  • Unit tests in creatorsTest.ts cover mixed/boundary/only-invisible inputs and confirm meaningful invisible chars are untouched. An end-to-end test in endToEndTest.ts verifies a full DOM → Model → DOM/text round-trip strips only the tag range.
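
For reviewers who want to see the filtering rule in isolation, here is a minimal sketch of the behavior described above. It is not the code in this PR: `stripInvisibleUnicode` and `toTagChars` are hypothetical names used only for illustration, and the sketch assumes the filter is a plain code-point range check over U+E0000–U+EFFFF.

```typescript
// Hypothetical helper (not the PR's actual implementation): removes every
// code point in U+E0000–U+EFFFF, the range named in the summary.
// The `u` flag makes the regex match whole code points, not UTF-16 halves.
const INVISIBLE_RANGE = /[\u{E0000}-\u{EFFFF}]/gu;

function stripInvisibleUnicode(text: string): string {
    return text.replace(INVISIBLE_RANGE, '');
}

// Hypothetical helper for building test input: maps each ASCII character to
// its tag-character counterpart (code point + 0xE0000), which renders as
// nothing in most UIs.
function toTagChars(ascii: string): string {
    return Array.from(ascii, ch => String.fromCodePoint(0xe0000 + ch.charCodeAt(0))).join('');
}

// "Hello" followed by hidden text, then a word containing a ZWSP (U+200B).
const input = `Hello${toTagChars('hidden')} wor\u200Bld`;
const output = stripInvisibleUnicode(input);

// The hidden tag characters are gone; the ZWSP is outside the range and stays.
console.log(output === 'Hello wor\u200Bld');
```

Because the check is a single range, characters such as ZWSP, ZWJ, RLO, and PDF pass through untouched, which matches the second bullet.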

Test plan

  • yarn test:fast --testPathPattern=creatorsTest
  • yarn test:fast --testPathPattern=endToEndTest

🤖 Generated with Claude Code
