Six tools, one engine — a tour of dedupe.js
Every tool on DedupeLines is powered by a single 566-line vanilla JavaScript file: /static/js/dedupe.js. No framework, no build step, no dependencies. The same file runs in the main browser thread for small inputs and in a Web Worker for large ones. There is no minified version — the file shipped to your browser is the file in our editor.
This post is a tour through that file: how ten different modes route off a single
mode parameter, why we use Object.create(null) instead of {},
the Fisher-Yates shuffle, and the O(n) hash table that does the actual deduplication.
The shape
The whole engine is wrapped in an IIFE (Immediately Invoked Function Expression):
(function (root) {
  'use strict';
  // ... all engine code ...
  root.Dedupe = { process: process, splitLines: splitLines };
})(typeof self !== 'undefined' ? self : this);
The trailing typeof self !== 'undefined' ? self : this is what lets the same file run
in two environments. In a browser window, self is the window. In a Web Worker, self
is the worker’s global scope. Either way, Dedupe.process ends up callable.
The Worker imports the same file, so no separate Worker build is needed:
importScripts('dedupe.js?v=37');
Two functions are exposed: process(text, opts) for everything, and splitLines(text)
as a small utility because line-splitting subtleties show up everywhere.
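As a rough illustration of the export pattern, the sketch below attaches a stand-in engine to whatever root object it is handed. The installEngine helper, the two fake root objects, and the simplified process() return value are all hypothetical and exist only for this example; the real file passes self or this directly.

```javascript
// Hypothetical helper: wraps the IIFE so the root object can be chosen explicitly.
function installEngine(root) {
  (function (root) {
    'use strict';
    function splitLines(text) {
      if (!text) return [];
      return String(text).split(/\r\n|\r|\n/);
    }
    // Stand-in for the real process(): it only reports the mode and line count.
    function process(text, opts) {
      return {
        mode: (opts && opts.mode) || 'dedupe',
        lines: splitLines(text).length
      };
    }
    root.Dedupe = { process: process, splitLines: splitLines };
  })(root);
  return root;
}

var fakeWindow = installEngine({}); // plays the role of window
var fakeWorker = installEngine({}); // plays the role of a Worker global scope

var fromWindow = fakeWindow.Dedupe.process('a\nb', { mode: 'dedupe' });
var fromWorker = fakeWorker.Dedupe.process('a\nb', { mode: 'dedupe' });
```

Both roots end up with the same two-function API, which is the property the real IIFE relies on.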
splitLines: handling three line-ending conventions
function splitLines(text) {
  if (!text) return [];
  return String(text).split(/\r\n|\r|\n/);
}
The regex catches Windows (\r\n), classic Mac (\r), and Unix (\n)
line endings. Order in the alternation matters: \r\n must come first or it would be
matched as two separate splits and produce phantom empty rows.
A single early return on falsy input prevents process(undefined) from throwing.
Everything downstream assumes rawLines is an array of strings.
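To make the ordering point concrete, here is a small comparison. splitLines is copied from the post; splitLinesWrongOrder is a hypothetical variant written only to show the failure.

```javascript
function splitLines(text) {
  if (!text) return [];
  return String(text).split(/\r\n|\r|\n/);
}

// Same alternatives, wrong order: \r matches before \r\n gets a chance,
// so the \n that follows becomes a second, separate split point.
function splitLinesWrongOrder(text) {
  if (!text) return [];
  return String(text).split(/\r|\n|\r\n/);
}

var mixed = 'a\r\nb\rc\nd';
var good = splitLines(mixed);          // ['a', 'b', 'c', 'd']
var bad = splitLinesWrongOrder(mixed); // ['a', '', 'b', 'c', 'd']
var empty = splitLines(undefined);     // []
```

The extra empty string in the second result is the phantom row described above.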
The mode router
process() is the only public entry point. Inside, the first thing it does is check the
mode string and dispatch to a specialised handler:
if (opts.mode === 'word-dedup') return processWords(text, opts);
if (opts.mode === 'regex') return processRegex(text, opts);
// ... split lines ...
if (opts.mode === 'remove-empty') return processRemoveEmpty(rawLines, opts);
if (opts.mode === 'trim') return processTrim(rawLines, opts);
if (opts.mode === 'reverse') return processReverse(rawLines, opts);
if (opts.mode === 'shuffle') return processShuffle(rawLines, opts);
if (opts.mode === 'add-line-numbers') return processAddLineNumbers(rawLines, opts);
// ... main dedupe path for 'dedupe' / 'find-duplicates' / 'sort' ...
The router handles ten mode strings. Seven of them back the visible tools: the homepage uses dedupe, and the
inner tool pages use regex, reverse, remove-empty,
shuffle, trim, and add-line-numbers. The other three
(find-duplicates, sort, word-dedup) are implemented and wired up, but
don’t have dedicated tool pages yet.
The mode router pattern means adding a new tool requires only:
- A new mode handler function (~30-50 lines)
- One if in process() to route to it
- A new view template + lang entry + route in the PHP layer
No engine plumbing changes. The shared state (input parsing, blank detection, trim regex) is duplicated across handlers rather than abstracted — deliberately. With ~5 handlers, abstraction would obscure more than it saves.
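The steps above can be sketched with a cut-down router. Everything here is illustrative: the uppercase mode and processUppercase do not exist in the real engine, and the { lines: ... } return shape is an assumption made for the example.

```javascript
function splitLines(text) {
  if (!text) return [];
  return String(text).split(/\r\n|\r|\n/);
}

function processReverse(rawLines, opts) {
  return { lines: rawLines.slice().reverse() };
}

// Hypothetical new handler: not part of the real engine.
function processUppercase(rawLines, opts) {
  var out = [];
  for (var i = 0; i < rawLines.length; i++) {
    out.push(rawLines[i].toUpperCase());
  }
  return { lines: out };
}

function process(text, opts) {
  opts = opts || {};
  var rawLines = splitLines(text);
  if (opts.mode === 'reverse') return processReverse(rawLines, opts);
  if (opts.mode === 'uppercase') return processUppercase(rawLines, opts); // the one new `if`
  return { lines: rawLines }; // stand-in for the main dedupe path
}

var upper = process('a\nb', { mode: 'uppercase' });  // lines: ['A', 'B']
var reversed = process('a\nb', { mode: 'reverse' }); // lines: ['b', 'a']
```

The existing handlers are untouched; the new mode costs one function and one routing line.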
The dedupe path: O(n) hash table
The main path is for dedupe, find-duplicates, and sort modes:
var keep = [];
var freqMap = Object.create(null);
var annotated = new Array(n);
var blankCount = 0;
var trimRe = /^\s+|\s+$/g;
for (var i = 0; i < n; i++) {
  var raw = rawLines[i];
  // ... blank handling ...
  var key = normalize(raw, opts);
  var entry = freqMap[key];
  if (entry) {
    entry.count++;
    annotated[i] = { text: raw, blank: false, _key: key };
  } else {
    freqMap[key] = { line: raw, count: 1, firstIndex: i, group: null };
    annotated[i] = { text: raw, blank: false, _key: key };
    keep.push(opts.trim ? raw.replace(trimRe, '') : raw);
  }
}
Standard hash-table dedup: walk every line once, look it up in the freqMap. If present, increment the count. If absent, register it and keep it.
The output is built in keep incrementally during the loop, which preserves
first-occurrence order for free. annotated stores per-row metadata (which group, dup or
not) for the UI to render the highlighted preview.
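A runnable, stripped-down version of that loop might look like the following. It is a sketch, not the engine's code: the normalize function shown here and the ignoreCase option are assumptions (the post does not show the real normalize), and blank handling and the annotated array are left out.

```javascript
// Hypothetical normalizer: the real normalize(raw, opts) is not shown in the post.
function normalize(raw, opts) {
  var key = opts.trim ? raw.replace(/^\s+|\s+$/g, '') : raw;
  return opts.ignoreCase ? key.toLowerCase() : key;
}

function dedupeLines(rawLines, opts) {
  var keep = [];
  var freqMap = Object.create(null);
  for (var i = 0; i < rawLines.length; i++) {
    var raw = rawLines[i];
    var key = normalize(raw, opts);
    var entry = freqMap[key];
    if (entry) {
      entry.count++; // seen before: count it, do not keep it again
    } else {
      freqMap[key] = { line: raw, count: 1, firstIndex: i };
      keep.push(raw); // first occurrence wins, in input order
    }
  }
  return { keep: keep, freqMap: freqMap };
}

var result = dedupeLines(['b', 'a', 'B', 'a', 'c'], { ignoreCase: true });
// result.keep is ['b', 'a', 'c']; result.freqMap['b'].count is 2
```

Because keep is appended to only on a first sighting, the output order matches the input order without any extra sort.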
Why Object.create(null)
The choice of map type is intentional:
var freqMap = Object.create(null); // no prototype chain — faster lookups
A plain {} inherits from Object.prototype, which means freqMap['toString']
is not undefined — it’s the inherited toString function. If a user
pasted a list containing the literal string toString, the lookup would silently produce a
function instead of a fresh-entry signal, and the dedup would behave incorrectly.
Object.create(null)
creates an object with no prototype chain. There is no inherited toString,
hasOwnProperty, or anything else. Lookups return undefined
for any key not explicitly set. This is safer (no inheritance edge cases), and a miss never has to walk a prototype chain.
The same trick is used in the regex extractor, the word-dedup mode, and a few other places.
Object.create(null) is the canonical “use a plain object as a hash map in JavaScript” idiom.
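The failure mode is easy to reproduce. The sketch below runs the same simplified "seen before?" loop against both kinds of map; dedupeWith is a hypothetical helper for this comparison, not a function from the engine.

```javascript
// Hypothetical helper: a minimal first-wins dedupe over a caller-supplied map.
function dedupeWith(map, lines) {
  var keep = [];
  for (var i = 0; i < lines.length; i++) {
    var key = lines[i];
    if (map[key]) continue; // any truthy value is treated as "already seen"
    map[key] = true;
    keep.push(key);
  }
  return keep;
}

var lines = ['apple', 'toString', 'constructor', 'apple'];

// With {}, the inherited toString and constructor look like existing entries.
var withPlainObject = dedupeWith({}, lines);                // ['apple']

// With a null-prototype object, only keys we set ourselves are found.
var withNullProto = dedupeWith(Object.create(null), lines); // ['apple', 'toString', 'constructor']
```

With the plain object, two legitimate input lines vanish from the output without any error being raised.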
Complexity: O(n + u log u + s log s)
The complexity comment in the engine reads:
// Total complexity: O(n + u log u + s log s) (s = sort/dedupe output length)
// n = input lines
// u = unique lines
// s = output length
Each line is normalised once and looked up in the hash (O(n) total). After the main loop, unique entries are extracted and sorted by first-occurrence index (O(u log u)). An optional final sort or shuffle on the output then costs O(s log s) or O(s) respectively.
An earlier version of this same algorithm had one line in the loop,
Object.keys(freqMap).length, that computed the next group number on every
iteration. Object.keys is O(k) in the number of keys already in the object, so calling it inside the loop made
the whole pass O(n²). On a 156K-line input where every row was unique, that adds up to
roughly 12 billion key visits (156,000² / 2), and it locked the tab for over five minutes.
The fix was a one-line counter:
var nextGroup = 1;
for (var g = 0; g < uniqueArr.length; g++) {
  var item = uniqueArr[g];
  item.group = item.count >= 2 ? nextGroup++ : null;
}
Group numbers are assigned after the main loop, by walking the already-built uniqueArr
once. The same 156K-line input now processes in under 500 milliseconds. The complexity
comment is now there partly as documentation and partly as a hazard sign: don’t reintroduce
O(n)-inside-the-loop calls.
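For a feel of how quickly that pattern grows, here is a back-of-envelope count. It assumes Object.keys is called once per iteration, after the new entry has been added, and that every row is unique; the exact position of the call in the old code is not shown in the post, so treat the numbers as an estimate.

```javascript
// Counts how many keys Object.keys touches when it is called once per
// iteration on a map that gains one entry each time (every row unique).
function keyVisitsWithKeysInLoop(n) {
  var freqMap = Object.create(null);
  var visits = 0;
  for (var i = 0; i < n; i++) {
    freqMap['line' + i] = true;
    visits += Object.keys(freqMap).length; // O(current size) on every pass
  }
  return visits;
}

// Closed form of the same sum: 1 + 2 + ... + n.
function keyVisitsClosedForm(n) {
  return n * (n + 1) / 2;
}

var small = keyVisitsWithKeysInLoop(1000); // 500500
var large = keyVisitsClosedForm(156000);   // 12168078000
```

A plain counter replaces all of those visits with a single increment per group.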
Fisher-Yates shuffle
The shuffle implementation appears in three places (homepage post-dedupe, dedicated shuffle mode, regex post-match). All three use the same in-place loop:
for (var sh = keep.length - 1; sh > 0; sh--) {
  var sj = Math.floor(Math.random() * (sh + 1));
  var st = keep[sh];
  keep[sh] = keep[sj];
  keep[sj] = st;
}
Classic Fisher-Yates, in the in-place form often called the Knuth shuffle (Knuth presents it as Algorithm P in Volume 2 of The Art of Computer Programming). For an array of length n, walk from the end to the start, and at each position swap with a random index between 0 and the current position (inclusive). Given a uniform random source, each permutation of the input is equally likely, with probability 1/n!.
The randomness source is Math.random,
which is not cryptographically secure. For raffles, survey randomisation, ML data prep, this is fine.
For something like picking a regulated lottery winner, use
crypto.getRandomValues
instead. The implementation could swap the source for crypto by changing one line, but the trade-off
(slower, no real-world need) made it not worth doing.
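If you did want the stronger source, one possible shape is to pass the random-index function in. This is a sketch rather than the engine's code: shuffleInPlace, mathRandomInt, and cryptoRandomInt are hypothetical names, the rejection-sampling step is an addition of this example, and cryptoRandomInt assumes a global crypto object (available in browsers, Workers, and recent Node versions).

```javascript
// randomInt(k) must return an integer in [0, k).
function shuffleInPlace(arr, randomInt) {
  for (var i = arr.length - 1; i > 0; i--) {
    var j = randomInt(i + 1);
    var tmp = arr[i];
    arr[i] = arr[j];
    arr[j] = tmp;
  }
  return arr;
}

function mathRandomInt(k) {
  return Math.floor(Math.random() * k);
}

// Rejection sampling: discard values at or above the largest multiple of k
// that fits in 32 bits, which avoids the bias of a bare `value % k`.
function cryptoRandomInt(k) {
  var limit = Math.floor(4294967296 / k) * k;
  var buf = new Uint32Array(1);
  do {
    crypto.getRandomValues(buf);
  } while (buf[0] >= limit);
  return buf[0] % k;
}

var shuffled = shuffleInPlace(['a', 'b', 'c', 'd'], mathRandomInt);
```

Calling shuffleInPlace(keep, cryptoRandomInt) would use the cryptographic source with the same loop.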
The 64-character optimisation
One small detail in the blank-line check:
if (raw === '' || (raw.length < 64 && /^\s*$/.test(raw))) {
The blank-detection regex is only run on rows shorter than 64 characters. The reasoning: a row that’s
64 characters or more of pure whitespace essentially never appears in real input. The exact-match check
against '' handles the most common “completely empty row” case for free, and
short-circuiting the regex on long rows saves a few percent on inputs that have lots of long lines.
This optimisation is in the dedupe main path only. The dedicated remove-empty mode runs the
full regex always, because finding blank rows is its only job.
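The boundary is easiest to see side by side. In this sketch isBlankFast wraps the condition quoted above, while isBlankFull and the spaces helper are hypothetical names added for the comparison.

```javascript
// The check from the dedupe main path: regex only on rows under 64 characters.
function isBlankFast(raw) {
  return raw === '' || (raw.length < 64 && /^\s*$/.test(raw));
}

// The unconditional check, as the remove-empty mode is described as doing.
function isBlankFull(raw) {
  return /^\s*$/.test(raw);
}

// ES5-safe way to build a string of n spaces.
function spaces(n) {
  return new Array(n + 1).join(' ');
}

var short63 = isBlankFast(spaces(63));    // true: regex runs and matches
var long64 = isBlankFast(spaces(64));     // false: regex is skipped
var long64Full = isBlankFull(spaces(64)); // true: full check still sees it as blank
```

So a whitespace-only row of 64 or more characters is treated as content on the dedupe path, which is the accepted cost of the shortcut.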
Add-line-numbers: cat -n in a browser tab
The Add Line Numbers tool deserves a mention because it’s the
most parameterised mode: separator (tab / space / colon-space / dot-space), padding (none / zero /
space-pad), starting number, and a cat -b-style “skip blank lines but keep numbering
consecutive” option.
The padding width is computed once before the loop:
var maxNum = start + Math.max(0, maxCount - 1);
var width = String(maxNum).length;
So a 1000-line input numbered from 1 pads every row to 4 digits ("1000" is 4 characters), so no row ends up narrower than the last. With the
skip-blanks toggle on, maxCount is the count of non-blank rows, so a sparsely-populated
file gets a width based on what will actually be output.
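Put together, the numbering pass could look roughly like this. It is a sketch under assumptions: the option names (start, separator, padChar, skipBlank) and the pad helper are invented for the example, and blank rows are passed through unnumbered in the style of cat -b.

```javascript
function pad(num, width, padChar) {
  var s = String(num);
  while (s.length < width) s = padChar + s;
  return s;
}

// Hypothetical option names: start, separator, padChar ('' or omitted = no padding), skipBlank.
function addLineNumbers(lines, opts) {
  var start = opts.start == null ? 1 : opts.start;
  var sep = opts.separator == null ? '\t' : opts.separator;
  var isBlank = function (s) { return /^\s*$/.test(s); };

  // Count the rows that will actually receive a number.
  var maxCount = 0;
  for (var i = 0; i < lines.length; i++) {
    if (!opts.skipBlank || !isBlank(lines[i])) maxCount++;
  }
  var maxNum = start + Math.max(0, maxCount - 1);
  var width = opts.padChar ? String(maxNum).length : 0; // computed once, before the loop

  var out = [];
  var num = start;
  for (var j = 0; j < lines.length; j++) {
    if (opts.skipBlank && isBlank(lines[j])) {
      out.push(lines[j]); // blank rows pass through unnumbered
    } else {
      out.push(pad(num++, width, opts.padChar) + sep + lines[j]);
    }
  }
  return out;
}

var numbered = addLineNumbers(['x', '', 'y'], { padChar: '0', skipBlank: true });
// ['1\tx', '', '2\ty']
```

With skipBlank on, the width comes from the two numbered rows, not the three input rows.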
The default separator is tab. This is intentional — tab-separated output pastes into Excel, Google Sheets, and Numbers automatically as two columns (numbers in column A, original text in column B). One paste, two columns, no formula needed.
What’s deliberately not in the engine
- No DOM access. The engine is pure functions. The Worker version literally cannot touch the DOM, so the engine must work without it.
- No async, no Promises, no fetch. Every function is synchronous. Web Worker async is handled by the message-passing layer in app.js, not in the engine.
- No external dependencies. The engine is one file, vanilla JS, ES5 syntax (so older browsers parse it without transpilation). No npm, no bundler, no source map.
- No minification. The file you load is the file in the repo. Open /static/js/dedupe.js and read it. This is intentional — for a tool whose pitch is “your data never leaves your browser,” auditability matters more than a few KB.
The trade-offs
A 566-line single-file engine isn’t how a SaaS-startup engineer would normally architect this.
A “proper” version would have a Pipeline class, a strategy pattern for modes,
typed input/output schemas, dependency injection. None of that is here.
The trade-off is intentional. With six tools, abstraction overhead would dominate. With a build step, the auditability claim weakens. With async, the synchronous mental model goes. So the engine stays a flat file of functions. It’s easy to read, easy to fork, and easy to verify.
If we ship a 20th tool, this calculus changes. The day the file passes ~1500 lines or the mode router needs a switch statement to read, abstraction starts paying off. Until then, ~566 lines of explicit code beats clever indirection.
Frequently asked questions
Why no TypeScript or modern module syntax?
Two reasons. We want the file you load to be the file in our editor — no build step, no source map, no source-of-truth divergence. And we want the engine to be auditable by anyone reading /static/js/dedupe.js directly, including people without a Node toolchain installed. TypeScript would force a transpile step that pushes the engine one level further from what runs in the browser. The trade-off is no static type checking; we’ve accepted that for a 566-line file with no external API surface.
Why ES5 syntax in 2026?
Because we want the engine to parse without a transpiler everywhere it might run — including locked-down corporate browsers, older mobile WebViews, and Web Workers in environments with strict CSP rules. ES5 is the most permissive baseline. We don’t use modern syntax (let, arrow functions, destructuring) anywhere in the engine; the gain in readability isn’t worth the parser-compatibility risk for a tool that should “just run.”
Is the engine open-source / can I reuse it?
The file is served as plain JavaScript without any usage gate, and you can fork it into your own project freely. We treat it as effectively MIT-licensed even though we haven’t published a formal LICENSE alongside it yet. If you ship something built on the engine, we’d love to hear about it, but there’s no obligation. A formal LICENSE file is on the to-do list.
How does the Worker version handle errors?
The Worker side wraps the process() call in a try/catch and posts an error message back to the main thread. The UI side (app.js) listens for that message and renders an error banner. Anything that throws in the engine — bad regex, out-of-memory on a 200 MB input, malformed UTF-16 surrogate pair — surfaces as a user-visible error rather than a silent failure. The engine itself doesn’t throw on bad input; the wrapper layer is where error policy lives.
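The wrapper described here could be as small as the sketch below. The message shapes ({ ok: true, result } and { ok: false, error }), the handleRequest name, and the stand-in engine are all assumptions for illustration; the real protocol between the Worker and app.js is not shown in this post.

```javascript
// Hypothetical wrapper: runs the engine and always posts exactly one message.
function handleRequest(engine, request, post) {
  var message;
  try {
    message = { ok: true, result: engine.process(request.text, request.opts) };
  } catch (err) {
    message = { ok: false, error: String(err && err.message ? err.message : err) };
  }
  post(message);
}

// Stand-in engine whose regex mode throws on an invalid pattern.
var fakeEngine = {
  process: function (text, opts) {
    if (opts.mode === 'regex') new RegExp(opts.pattern); // SyntaxError on a bad pattern
    return { lines: String(text).split('\n') };
  }
};

var posted = [];
function collect(msg) { posted.push(msg); }

handleRequest(fakeEngine, { text: 'a\nb', opts: { mode: 'dedupe' } }, collect);
handleRequest(fakeEngine, { text: 'a', opts: { mode: 'regex', pattern: '(' } }, collect);
// posted[0].ok is true; posted[1].ok is false and carries the error text
```

Inside an actual Worker, the post argument would be a thin function around self.postMessage, and handleRequest would be called from the onmessage handler.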
Can I run dedupe.js in Node.js?
Yes, with a small shim. The file exposes Dedupe.process on whatever globally accessible object exists (the trick is the (typeof self !== 'undefined' ? self : this) at the bottom). In Node, set global.self = {} before requiring the file, then call global.self.Dedupe.process(text, opts). We use this exact pattern for the engine’s performance benchmarks — the same code path the browser uses, but driven from a script.
Related guides
- Introducing DedupeLines — the product-level overview; this engineering post is its technical companion.
- The blank-line trap — the Unicode whitespace handling that the engine relies on internally.
- Keep the last occurrence — how the engine’s first-wins invariant composes with the reverse tool.
The whole engine is one file, served as-is. Open it, read it, fork it.
Read dedupe.js