Trim before dedupe: why workflow order changes the result
Two strings: foo and foo (with a trailing space). To a strict
deduper they are different rows. To a human pasting from a CSV export, they are obviously the same.
The mismatch is one of the most common reasons a “why didn’t dedup work” bug shows up in a clean-looking list.
DedupeLines handles this with a trim toggle on the homepage tool. The toggle is on by default, which makes the comparison whitespace-insensitive. But the way it interacts with the comparison key (and with the output) is worth understanding, especially if you chain multiple tools together. Knowing which order to apply trim, dedupe, and other passes can change the final row count by orders of magnitude.
What the trim toggle actually does
Inside the engine, every line goes through a normalisation step before it’s used as a hash key:
function normalize(line, opts) {
var s = line;
if (opts.trim) s = s.replace(/^\s+|\s+$/g, '');
if (!opts.case) s = s.toLowerCase();
return s;
}
With trim on, leading and trailing whitespace is stripped before computing the comparison key.
With case off (the default), the key is also lowercased. The original line is kept untouched in memory
— only the key is normalised. So FOO and foo hash to
the same key but the kept output preserves the casing of whichever row appeared first.
The output, however, also gets trimmed when the trim toggle is on:
keep.push(opts.trim ? raw.replace(trimRe, '') : raw);
This is intentional — if you turn trim on, you’re saying “treat trailing whitespace as garbage,” and the output reflects that. If you want comparison-only trim (compare without whitespace, but keep the original spacing in the output), the engine doesn’t expose that as a separate toggle. It’s a deliberate simplification.
What changes with trim off
Same input, trim off, gives a different result. Consider:
[email protected]
[email protected]
[email protected]
[email protected]
- Trim ON, case OFF (defaults): 1 unique row. All four rows normalise to
[email protected]. - Trim OFF, case OFF: 3 unique rows.
[email protected]/[email protected]/[email protected]are three different keys; the lowercased ALICE row collapses with the first. - Trim ON, case ON: 2 unique rows. The trimmed comparison merges three rows; the uppercased ALICE row stays distinct.
- Trim OFF, case ON: 4 unique rows. Every row is different.
Most data-cleaning workflows want the first behaviour: trim on, case off. That’s why both toggles default to that state. But there are real cases where you want trim off — for example, when leading or trailing whitespace is meaningful (indented code, formatted output where alignment matters).
The empty-line interaction
There’s a third toggle that interacts with trim: empty (drop blank lines). The four combinations are easier to reason about as a matrix than a paragraph — we keep this open on a second monitor when tuning a tricky cleanup:
| Empty | Trim | Behaviour |
|---|---|---|
| ON | ON | A row that becomes empty after trimming (just spaces) is dropped. “Clean me up entirely” mode. |
| ON | OFF | Only truly empty lines are dropped. Whitespace-only rows are still caught (the engine’s blank check uses /^\s*$/), but the kept rows preserve their spacing. |
| OFF | ON | Blank lines pass through but get trimmed. A row of 5 spaces becomes an empty string in the output. |
| OFF | OFF | Nothing is touched. Useful when dedup-ing structured data where line breaks and spacing are meaningful. |
Where order matters: chained workflows
Because each tool runs independently, when you chain tools together the order matters. A common real-world pipeline is the “CSV cleanup” chain:
- Trim Lines — strip leading/trailing whitespace
- Remove Blank Lines — drop now-empty rows
- Dedupe — remove duplicates
This is the “clean first, then dedup” order. Two consequences:
- By the time the dedup pass runs, all rows are already trimmed — so the trim toggle on the homepage tool becomes effectively a no-op (no trailing whitespace exists to strip).
- If your input had internal whitespace differences (e.g.
foo barvsfoo bar— double space inside), the chain doesn’t fix them. The trim tool only touches the edges.
Compare with the reverse order — dedupe first, trim later. If you do that, the dedup pass uses your toggles (trim on by default), so the result is identical for “edge whitespace” differences. But you process more rows on the dedup side (the heavy step), which is slower for large inputs. Performance-wise, trimming first means fewer rows reach the dedup hash table.
The Excel comparison: UNIQUE() does not trim
Excel’s UNIQUE() function and Power Query “Remove Duplicates” both use
strict string comparison — no implicit trim. "foo" and "foo " are
different. Microsoft’s UNIQUE() documentation
confirms this: it compares values as Excel sees them, with no whitespace normalisation.
So in Excel you build a helper column with =TRIM(A1), then dedup the helper column,
then map back. Two extra steps for a job that DedupeLines does in one pass with the trim toggle.
Note: Excel’s TRIM() function strips ASCII whitespace and collapses internal
multi-spaces, but does not handle Unicode whitespace (NBSP, ideographic space, etc.)
— you need =TRIM(SUBSTITUTE(A1, CHAR(160), "")) for NBSP. JavaScript’s
String.prototype.trim()
(and our regex equivalent /^\s+|\s+$/g) handles all of these natively, as discussed in our
blank-line trap post.
SQL: same problem, different syntax
SQL SELECT DISTINCT col FROM t also doesn’t trim. The standard fix:
SELECT DISTINCT TRIM(col) FROM t;
Postgres TRIM()
handles ASCII whitespace by default, with optional explicit character lists for Unicode.
MySQL TRIM()
is ASCII-only. Both behave like Excel — you have to apply trim before the DISTINCT. Python’s
str.strip()
is the closest match to our engine’s behaviour; it strips all Unicode whitespace by default.
Recommended defaults
For most clipboard-paste cleanup jobs:
- Leave trim ON. The cases where trailing whitespace is meaningful are rare; the cases where trailing whitespace breaks comparison are common.
- Leave case OFF. Email addresses, URLs, and most identifiers are case-insensitive in practice. Turn it on for case-sensitive identifiers (Git commit hashes, base64 strings).
- Leave empty ON. Blank rows are almost always noise from CSV exports or spreadsheet copy-paste.
- Turn shuffle OFF unless you specifically want randomised output. The default keeps original order, which preserves your mental model of the input.
These defaults match what the engine ships with. Override them when you have a specific reason — not preemptively. We’ve tuned the defaults against several months of internal data-cleanup work and they hold for the workflows we’ve seen most often.
Frequently asked questions
Why does the output get trimmed when trim is on, instead of just the comparison?
Two reasons. First, when you turn trim on you’re telling the engine “trailing whitespace is garbage,” and the output should reflect that — not preserve the garbage you’ve just declared garbage. Second, a separate comparison-only toggle would double the configuration matrix without a strong use case. If you do need comparison-only trim (rare in our experience), copy the deduped output and paste it back in with trim off — or process the original column twice, once for each behaviour.
Does trim handle the Unicode whitespace family (NBSP, ideographic space, etc.)?
Yes. The trim regex is /^\s+|\s+$/g, and JavaScript’s \s covers everything in the Unicode “White_Space” property: ASCII space/tab, NBSP (U+00A0), all the typographic spaces (U+2000-U+200A), ideographic space (U+3000), and a handful of others. The blank-line trap post has the full table.
What’s the right way to dedupe a CSV column where leading/trailing spaces are part of the data?
Turn trim off. The case where leading whitespace is meaningful usually involves indented code, fixed-width data exports, or hand-formatted tables — all rare in clipboard-paste workflows but real. With trim off the comparison key is the raw line, byte-for-byte, and your spaces are preserved through the dedup pass.
If I chain trim-lines and the deduper, does the order matter for the final result?
For most data the final result is the same. The performance differs: trim-lines first means the deduper sees pre-trimmed input and runs slightly faster (smaller hash keys, less normalisation work per row). Dedupe-first with trim on means the deduper does the trim work internally, then trim-lines runs on already-trimmed output and is a near-no-op. For a 100K-line input the difference is around 50 ms — not enough to matter unless you’re processing recurringly.
Why is “case off” the default rather than “case on”?
Because email addresses, URLs, and most identifier-shaped data are case-insensitive in practice even when they’re technically case-sensitive. [email protected] and [email protected] route to the same mailbox; Example.com/page and example.com/page serve the same response. Defaulting to case-off matches what users almost always mean by “duplicate.” Turn it on for case-significant identifiers like Git commit hashes or base64 strings.
Related guides
- The blank-line trap — the Unicode whitespace family that trim and the blank check both rely on.
- Keep the last occurrence — chained-tool pattern; trim defaults shape the chain’s output too.
- Regex extractor recipes — when trim-on alone isn’t enough and you need pattern-based cleanup.
Try the toggles: paste a list with mixed casing and trailing spaces, flip trim and case, and watch the unique count change. The defaults are tuned for the 80% case.
Open the deduper