The blank-line trap: when “empty” lines are not actually empty
You paste a list, click “remove blank lines,” and the result still has gaps. Or
=COUNTBLANK(A:A) in Excel returns 0 for a column that visibly contains empty cells.
The cause is almost always invisible whitespace: more than two dozen Unicode codepoints that render
as blank space, behave differently from a regular space, and survive copy-paste from PDFs, chat
messages, and AI-generated text. We’ll walk through which ones DedupeLines catches, which it
doesn’t, and how to diagnose the rest.
What DedupeLines treats as “blank”
The Remove Blank Lines tool calls a line blank when it matches:
line === '' || /^\s*$/.test(line)
Two conditions: an exact empty string, OR a line that contains only whitespace as defined by JavaScript’s
\s character class. That second one is the interesting part — \s in modern
JavaScript matches a specific set of Unicode codepoints, and most of them are invisible.
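To make the predicate concrete, here is a minimal sketch of how that test could be applied to pasted text. The function names isBlank and removeBlankLines and the line-splitting regex are our own assumptions for illustration; only the blank test itself comes from the tool.

```javascript
// Hypothetical helper wrapping the blank test quoted above.
function isBlank(line) {
  return line === '' || /^\s*$/.test(line);
}

// Hypothetical wrapper: split on CRLF, CR, or LF, then drop blank lines.
function removeBlankLines(text) {
  return text
    .split(/\r\n|\r|\n/)
    .filter((line) => !isBlank(line))
    .join('\n');
}

// Sample with an empty line, a space/tab line, an NBSP line, and an ideographic-space line.
const sample = 'alpha\n\n \t \n\u00A0\nbeta\n\u3000\ngamma';
console.log(removeBlankLines(sample)); // prints alpha, beta, gamma on separate lines
```

The first half of the condition is redundant with the regex (an empty string already matches /^\s*$/), but it gives a cheap early exit for the most common case.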
The codepoints JavaScript ’\s’ catches
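Per the ECMA-262 specification,
the JavaScript \s character class matches the grammar’s WhiteSpace and LineTerminator productions:
every character in the Unicode “Space_Separator” (Zs) general category
plus a few explicitly listed control and format characters. In practice that’s: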
| Codepoint | Name | Source |
|---|---|---|
| U+0009 | Character Tabulation (Tab) | Pressing the Tab key |
| U+000A | Line Feed (\n) | Unix line ending |
| U+000B | Line Tabulation (vertical tab) | Rare; some PDF exports |
| U+000C | Form Feed | Old printers; rare in modern text |
| U+000D | Carriage Return (\r) | Classic Mac OS line ending; first half of DOS / Windows CRLF |
| U+0020 | Space | The space bar |
| U+00A0 | No-Break Space (NBSP) | Word, web pages, anywhere &nbsp; appears |
| U+1680 | Ogham Space Mark | Ogham script (very rare) |
| U+2000-U+200A | En Quad, Em Quad, En Space, Em Space, Three-Per-Em Space, Four-Per-Em Space, Six-Per-Em Space, Figure Space, Punctuation Space, Thin Space, Hair Space | Typographic spaces from PDFs / Word docs / typesetting tools |
| U+2028 | Line Separator | Some text editors |
| U+2029 | Paragraph Separator | Some text editors |
| U+202F | Narrow No-Break Space | French typography (number / unit separation) |
| U+205F | Medium Mathematical Space | LaTeX-rendered math |
| U+3000 | Ideographic Space | CJK input methods (full-width space) |
| U+FEFF | Byte Order Mark / Zero Width No-Break Space | UTF-8 BOM at file start |
All of these will be detected as blank by Remove Blank Lines. A row containing exactly one ideographic space (U+3000) looks visually like an empty cell to a human and is correctly stripped.
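If you want to confirm the table against your own browser or Node version, a quick check along these lines should work. The variable names are ours; the codepoints are copied from the table above.

```javascript
// Codepoints from the table above; the U+2000-U+200A range is added in a loop.
const whitespaceCodepoints = [
  0x0009, 0x000a, 0x000b, 0x000c, 0x000d, 0x0020, 0x00a0, 0x1680,
  0x2028, 0x2029, 0x202f, 0x205f, 0x3000, 0xfeff,
];
for (let cp = 0x2000; cp <= 0x200a; cp++) {
  whitespaceCodepoints.push(cp);
}

// Collect any listed codepoint that the engine's \s fails to match.
const unmatched = whitespaceCodepoints.filter(
  (cp) => !/^\s$/.test(String.fromCodePoint(cp))
);

console.log(whitespaceCodepoints.length); // 25
console.log(unmatched.length); // 0 on a spec-compliant engine
```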
What it does NOT catch: the zero-width family
The codepoints below render as zero pixels wide. They are not in the JavaScript
\s character class: Unicode gives them the general category Cf (Format) rather than Zs
(Space_Separator), and the
Unicode Line Breaking Algorithm (UAX #14)
assigns them line-break classes other than SP (space). DedupeLines and any tool that uses
\s will therefore not treat them as blank:
| Codepoint | Name | Source |
|---|---|---|
| U+200B | Zero Width Space | Soft line breaks in HTML, AI-generated text |
| U+200C | Zero Width Non-Joiner | Persian, Indic scripts |
| U+200D | Zero Width Joiner | Emoji combinations, Indic scripts |
| U+2060 | Word Joiner | Typography |
A line containing only U+200B is invisible to your eye, technically non-empty as far as
the engine is concerned, and will survive Remove Blank Lines. This is intentional — the Unicode
standard distinguishes “whitespace” (visible separation) from “format characters”
(no visual width). Your text tool follows that distinction.
If you suspect zero-width contamination, see the diagnosis section below.
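The gap is easy to demonstrate in a console. The stricter pattern below is a hypothetical variant for your own scripts, not what the tool does: it additionally treats a line made up only of the four zero-width codepoints as blank, without touching zero-width characters inside real text.

```javascript
// The four zero-width codepoints from the table above.
const zeroWidth = ['\u200B', '\u200C', '\u200D', '\u2060'];

// None of them is matched by \s.
console.log(zeroWidth.map((ch) => /\s/.test(ch))); // [ false, false, false, false ]

// Hypothetical stricter check: whitespace OR zero-width characters only.
const BLANK_OR_ZERO_WIDTH = /^[\s\u200B-\u200D\u2060]*$/;

function isVisuallyBlank(line) {
  return BLANK_OR_ZERO_WIDTH.test(line);
}

console.log(/^\s*$/.test('\u200B')); // false: survives the standard blank test
console.log(isVisuallyBlank('\u200B \u2060')); // true
console.log(isVisuallyBlank('a\u200Bb')); // false: real content is left alone
```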
Why grep ‘^$’ is not enough
The classic shell pattern for skipping blank lines is:
grep -v '^$' file.txt
But ^$ only matches an exact empty line. A row containing one space, one tab, or any of
the Unicode whitespace codepoints above survives. The whitespace-aware version is:
grep -Ev '^[[:space:]]*$' file.txt
Which is closer to what DedupeLines does — though
POSIX [[:space:]]
historically only included ASCII whitespace; modern GNU grep with locale support catches more, but the
behaviour varies. The DedupeLines tool uses JavaScript’s \s, which is consistent across
Chrome, Firefox, Safari, and Edge.
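For comparison, the same two filters can be sketched in JavaScript. This is an analogue of the two grep commands, not a reimplementation of grep, and the sample lines are made up for illustration.

```javascript
// Six lines: two with content, one truly empty, three whitespace-only.
const lines = ['alpha', '', ' ', '\t', '\u00A0', 'beta'];

// Analogue of grep -v '^$': drops only exact empty lines.
const strictKeep = lines.filter((line) => !/^$/.test(line));

// Analogue of the whitespace-aware filter, using JavaScript's \s.
const awareKeep = lines.filter((line) => !/^\s*$/.test(line));

console.log(strictKeep.length); // 5: the space, tab, and NBSP lines survive
console.log(awareKeep.length); // 2: only 'alpha' and 'beta' remain
```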
Where invisible whitespace comes from
The most common sources of invisible whitespace in pasted text:
- Word documents and PDFs. Word uses NBSP (U+00A0) for non-breaking spaces around units, names, and titles. PDFs frequently embed thin spaces, en spaces, and em spaces for typographic alignment that survives copy-paste.
- HTML rendering. Anywhere &nbsp; appears in the source HTML, the browser stores it as U+00A0. Copy text from a web page, paste into a tool, get NBSP-laden lines.
- Chat / messaging platforms. Many chat clients (Slack, Discord, WhatsApp Web) insert zero-width spaces (U+200B) at soft line breaks and around mentions.
- AI-generated text. ChatGPT, Claude, and other LLMs occasionally produce zero-width characters or unusual whitespace, especially in code blocks or formatted output.
- CSV exports with BOM. Excel’s “CSV UTF-8” export writes a UTF-8 BOM (U+FEFF) at the very start of the file. The first column header gets a phantom invisible character that breaks lookups.
- CJK keyboard input. Switching an IME between half-width and full-width modes is easy to do accidentally, leaving ideographic spaces (U+3000) where you intended ASCII spaces.
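The BOM case in the list above is the one most likely to break a script rather than just a visual check, so here is a minimal sketch of it. The stripBom helper and the sample CSV are hypothetical.

```javascript
// Hypothetical helper: drop a single leading U+FEFF if present.
function stripBom(text) {
  return text.charCodeAt(0) === 0xfeff ? text.slice(1) : text;
}

// Made-up CSV content with a BOM in front of the first header.
const csv = '\uFEFFid,name\n1,Ada';

const headerRaw = csv.split('\n')[0].split(',')[0];
const headerClean = stripBom(csv).split('\n')[0].split(',')[0];

console.log(headerRaw === 'id'); // false: the header is really '\uFEFFid'
console.log(headerRaw.length); // 3
console.log(headerClean === 'id'); // true
```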
How to diagnose: what’s actually in your line?
Open a browser DevTools console (F12, Console tab) and inspect any suspicious line:
Array.from('your line here').map(c => c.codePointAt(0).toString(16).padStart(4, '0'))
This returns the hex codepoint of every character. A line that looks empty but reports
['00a0'] contains exactly one NBSP. A line reporting ['200b', '200b']
contains two zero-width spaces.
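If raw hex is hard to read, a slightly longer variant can label each character and say whether \s matches it. This is only a sketch: describeLine and the small lookup table are our own, and the table covers just a handful of the codepoints discussed in this guide.

```javascript
// Hypothetical lookup table with a few of the usual suspects.
const KNOWN_INVISIBLES = new Map([
  [0x0009, 'CHARACTER TABULATION'],
  [0x0020, 'SPACE'],
  [0x00a0, 'NO-BREAK SPACE'],
  [0x200b, 'ZERO WIDTH SPACE'],
  [0x200c, 'ZERO WIDTH NON-JOINER'],
  [0x200d, 'ZERO WIDTH JOINER'],
  [0x2060, 'WORD JOINER'],
  [0x3000, 'IDEOGRAPHIC SPACE'],
  [0xfeff, 'BYTE ORDER MARK'],
]);

// Report codepoint, \s membership, and a name (if known) for every character.
function describeLine(line) {
  return Array.from(line).map((ch) => {
    const cp = ch.codePointAt(0);
    return {
      hex: 'U+' + cp.toString(16).toUpperCase().padStart(4, '0'),
      matchesWhitespaceClass: /\s/.test(ch),
      name: KNOWN_INVISIBLES.get(cp) || '(not in lookup table)',
    };
  });
}

const report = describeLine('\u00A0\u200B');
console.log(report);
```

For the sample input, the first entry is U+00A0 with matchesWhitespaceClass true and the second is U+200B with matchesWhitespaceClass false, which tells you directly which characters the blank test will and will not see.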
Once you know what’s in there:
- If the contaminant is in \s (NBSP, ideographic space, BOM, etc.): Remove Blank Lines handles it.
- If the contaminant is zero-width (U+200B, U+2060, etc.): use Regex Extractor to extract only lines that have at least one visible character. A bare [^\s] is not enough here, because zero-width characters are themselves non-\s and would count as a match. Exclude them explicitly (assuming the tool accepts JavaScript \u escapes): [^\s\u200B-\u200D\u2060]
- If you want every line trimmed of whitespace at both ends regardless of source, use Trim Lines first, then run dedup or remove-blank.
The 64-character optimisation
A small implementation note: in the dedupe path, the engine runs the /^\s*$/ blank-check
regex only on lines shorter than 64 characters. At or above that threshold, the line is treated as non-blank
without testing, on the assumption that a row of 64 or more pure-whitespace characters is essentially
never a real input. This shaves a few percent off processing time on inputs with many long lines.
The remove-blank mode itself does not have this short-circuit — it always runs the
full regex test, because finding blank lines is its only job.
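As a rough sketch of what such a short-circuit looks like (the constant and function names are hypothetical, and measuring length in UTF-16 code units via line.length is our assumption, not something the engine documents):

```javascript
// Threshold taken from the text above; the name is hypothetical.
const BLANK_CHECK_MAX_LENGTH = 64;

// Dedupe-path style check: skip the regex entirely for long lines.
function isBlankFast(line) {
  if (line.length >= BLANK_CHECK_MAX_LENGTH) {
    return false; // treated as non-blank without testing
  }
  return /^\s*$/.test(line);
}

console.log(isBlankFast(' '.repeat(10))); // true
console.log(isBlankFast(' '.repeat(64))); // false: the short-circuit applies
console.log(/^\s*$/.test(' '.repeat(64))); // true: the full test would still call it blank
```

The last two lines show the trade-off: a long whitespace-only row slips through the fast path, which is why the remove-blank mode keeps the full test.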
The fix that doesn’t scale: manual find-and-replace
People often deal with NBSP contamination by doing find-and-replace in their text editor. This works once. It does not scale: each new whitespace codepoint is a new find-and-replace pass, and you can’t see what you’re replacing because the characters are invisible.
Our preferred approach treats “blank” as a property of the line, not a set of characters
to hunt down. Paste into the tool, the engine runs /^\s*$/ across the whole input in one
pass, and you copy back what survived. We’ve never seen a real-world list where this approach
missed a row that any of the find-and-replace passes would have caught — assuming the contamination
is one of the codepoints in the \s table above.
Frequently asked questions
How do I prevent invisible whitespace from showing up in my data in the first place?
You usually can’t — it’s introduced by the source. Word documents, PDFs, and HTML pages embed Unicode whitespace for typographic reasons (alignment, non-breaking spaces around units). You only see it after copy-paste lands the text in your tool. The cleanest preventive fix is to paste into a plain-text editor first (Notepad on Windows, TextEdit in plain-text mode on Mac), which won’t strip the characters but at least makes them visible if your editor has “show invisibles” mode.
Are zero-width spaces ever legitimately needed in data?
Yes. Persian (U+200C, ZWNJ) and Indic scripts (U+200D, ZWJ) use them for correct rendering — they’re part of the script, not invisible junk. Emoji combinations also use ZWJ to join codepoints (👨👩👧 is three person emojis + two ZWJs). If you’re processing internationalised text, blanket-stripping zero-width characters will corrupt valid input. The Remove Blank Lines tool deliberately doesn’t treat them as whitespace for this reason — we err on the side of preserving the user’s data.
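A short console experiment shows the damage a blanket strip can do. The emoji sequence is written with escapes so the joiners are explicit; the stripping regex is a hypothetical example of the kind of cleanup to avoid on internationalised text.

```javascript
// Family emoji: man + ZWJ + woman + ZWJ + girl.
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}';

// Blanket removal of zero-width characters (the risky operation).
const stripped = family.replace(/[\u200B-\u200D\u2060]/g, '');

console.log(Array.from(family).length); // 5 codepoints: three emojis and two joiners
console.log(Array.from(stripped).length); // 3 codepoints: the joiners are gone
console.log(stripped === family); // false: the sequence no longer renders as one glyph
```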
Why doesn’t Excel’s TRIM() handle NBSP?
Excel TRIM() was specified in the pre-Unicode era and strips only ASCII space (0x20). NBSP (U+00A0) and other Unicode whitespace pass through unchanged. The workaround is =TRIM(SUBSTITUTE(A1, CHAR(160), " ")) for NBSP, which turns each NBSP into a regular space so that TRIM can deal with it (substituting an empty string instead would glue neighbouring words together). That only handles one codepoint per SUBSTITUTE. To cover all of \s, you’d need a nested formula or a regex helper, or just paste the column into the Remove Blank Lines tool and copy it back.
Can I use Python’s strip() instead?
Yes, and it’s broader than Excel’s TRIM. Python 3’s str.strip() defaults to stripping all Unicode whitespace as defined by str.isspace(), which is close to JavaScript’s \s. Both exclude U+200B. The two diverge on a few edge cases: U+0085 NEXT LINE and the separators U+001C to U+001F count as whitespace in Python but not in JS, while U+FEFF is in JavaScript’s \s but not in Python’s set. For most data-cleaning purposes the differences don’t matter.
What text editor reliably shows invisible characters?
VS Code (Settings → “Render Whitespace” → all) makes ASCII whitespace and NBSP visible. Zero-width characters are not covered by that setting, though recent versions flag invisible Unicode characters separately through the editor.unicodeHighlight.invisibleCharacters setting. Sublime Text’s View → Show Invisibles is similar. For full audit, paste into Compart’s Unicode viewer or run the DevTools console snippet from the diagnosis section above; both surface every codepoint regardless of visibility.
Related guides
- Trim before dedupe: workflow order — how the trim toggle handles edge whitespace inside the dedupe path.
- Keep the last occurrence — if your chained workflows are dropping the wrong rows, invisible whitespace is often why.
- Regex extractor recipes — patterns for filtering lines that contain visible content vs whitespace-only.
If you suspect your list has invisible whitespace, paste it into the tool and watch the “blank” counter. Anything that disappears was hiding in your text.