Keep the last occurrence of every duplicate — reverse + dedupe + reverse
By default, DedupeLines keeps the first occurrence of every duplicate row. That’s the right behaviour 90% of the time — users expect the original ordering preserved with duplicates collapsed at the position they first appeared.
But sometimes you want the last occurrence instead:
- Log triage: a server log has user=123 action=login repeated 8 times. You want the most recent state, not the first.
- Status snapshots: a CSV export of order status has each order_id updated multiple times. You want the latest status per order.
- Activity feeds: a list of user actions where the same user_id appears repeatedly, and you want each user’s latest action.
- Spreadsheet revisions: rows pasted in append order, where later rows correct earlier ones.
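SQL handles this with a per-group pick rather than a plain GROUP BY id ORDER BY ts DESC LIMIT 1, which returns a single row overall. In PostgreSQL, SELECT DISTINCT ON (id) * FROM t ORDER BY id, ts DESC keeps the newest row per id (see
PostgreSQL’s GROUP BY docs for the canonical reference on grouping).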
Pandas does df.groupby('id').tail(1).
But if you’re working from a clipboard paste of a flat list, we don’t think you should have
to spin up a database or a Python kernel just to keep the last copy of each row.
Three DedupeLines passes chained together (Reverse Lines, the deduper, then Reverse Lines again) do the same thing for whole-line duplicates, in 3 clicks.
The pattern: reverse → dedupe → reverse
The trick is dead simple once you see it:
- Reverse the list — what was last is now first.
- Dedupe — first occurrence wins, which is the original last occurrence.
- Reverse again — restore the original order, with the kept rows now sitting at their last-occurrence positions.
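For readers who think in code, the three steps can be sketched in a few lines of Python. This is only an illustration of the pattern, not the DedupeLines engine; the function names dedupe_keep_first and dedupe_keep_last are made up for this sketch.

```python
def dedupe_keep_first(lines):
    """Drop repeated lines, keeping the first occurrence of each."""
    seen = set()
    kept = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            kept.append(line)
    return kept


def dedupe_keep_last(lines):
    """Reverse, dedupe (first wins), reverse again."""
    reversed_lines = lines[::-1]                 # step 1: what was last is now first
    deduped = dedupe_keep_first(reversed_lines)  # step 2: first occurrence wins
    return deduped[::-1]                         # step 3: restore the original direction


sample = ["a", "b", "a", "c", "b"]
print(dedupe_keep_first(sample))  # ['a', 'b', 'c']
print(dedupe_keep_last(sample))   # ['a', 'c', 'b']
```

The two outputs contain the same rows; only the position each survivor occupies differs.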
Walked through with real input
Imagine an order-status log:
order-447 pending
order-448 pending
order-447 paid
order-449 pending
order-447 shipped
order-448 paid
You want the latest state of each order:
order-449 pending
order-447 shipped
order-448 paid
(Note order-447 kept the “shipped” line, not “pending”, and the surviving lines preserve their last appearance order.)
Step 1 — Reverse
Paste into Reverse Lines:
order-448 paid
order-447 shipped
order-449 pending
order-447 paid
order-448 pending
order-447 pending
Step 2 — Dedupe
Copy that output, paste into the homepage deduper. Important: the deduper compares full lines by default, so identical rows collapse. If you want to dedupe by a prefix (just the order ID, ignoring status), you need a slightly different workflow — see “Dedup by prefix” below.
For the order-status example, every line is unique (different statuses), so a full-line dedupe doesn’t collapse anything. The pattern works when your duplicate signal is the whole line being identical. Real example: a user-action log where the same line repeats:
2026-05-16 10:01 user=alice login
2026-05-16 10:02 user=bob login
2026-05-16 10:01 user=alice login ← duplicate
2026-05-16 10:03 user=carol login
2026-05-16 10:01 user=alice login ← duplicate
After reverse + dedupe + reverse, the three identical alice rows collapse to one, kept at the position of the last copy: the output is the bob line, the carol line, then the alice line. The first and middle alice rows drop out.
Step 3 — Reverse again
Paste the deduped output back into Reverse Lines. Final output is in the original direction with last-occurrence semantics.
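One way to sanity-check the result of the three pastes is to compare it with a direct "keep each line only where it appears for the last time" pass. The sketch below does both on the alice log from above; the helper names chain and keep_last_single_pass are hypothetical, and neither is how DedupeLines is implemented.

```python
log = [
    "2026-05-16 10:01 user=alice login",
    "2026-05-16 10:02 user=bob login",
    "2026-05-16 10:01 user=alice login",
    "2026-05-16 10:03 user=carol login",
    "2026-05-16 10:01 user=alice login",
]


def chain(lines):
    """Reverse, keep first occurrences, reverse again."""
    # dict.fromkeys preserves insertion order, so it acts as a first-wins dedupe.
    return list(dict.fromkeys(reversed(lines)))[::-1]


def keep_last_single_pass(lines):
    """Keep a line only at the index where it occurs for the last time."""
    last_index = {line: i for i, line in enumerate(lines)}
    return [line for i, line in enumerate(lines) if last_index[line] == i]


result = chain(log)
for line in result:
    print(line)
# 2026-05-16 10:02 user=bob login
# 2026-05-16 10:03 user=carol login
# 2026-05-16 10:01 user=alice login

# Both routes should agree on any input.
assert result == keep_last_single_pass(log)
```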
Dedup by prefix (the harder case)
If you want to dedupe by a key (order ID) that isn’t the whole line, the chained pattern alone won’t work — you need to make the key the whole line first. Two approaches:
- Pre-trim with the regex extractor: use Regex Extractor with a capture pattern, then dedupe, then re-join. Workable but multi-step.
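- Use a spreadsheet for keyed dedup: Excel =UNIQUE(A:A) or Sheets =UNIQUE(A:A) on the key column returns each key once. Leave the optional arguments (by_col / by_column and exactly_once) at FALSE: setting exactly_once to TRUE returns only the keys that appear a single time, which is not a dedupe. DedupeLines is line-oriented, not column-oriented, by design.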
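If a short script is acceptable, keyed keep-last is also easy to express directly. The sketch below assumes the key is the first whitespace-separated token on each line (true for the order-status example, not necessarily for your data); keep_last_by_key and first_token are illustrative names, not part of any DedupeLines tool.

```python
def keep_last_by_key(lines, key_of):
    """For each key, keep only the last line carrying it, in last-appearance order."""
    latest = {}
    for line in lines:
        key = key_of(line)
        # Remove any earlier entry so the key moves to the end of the dict.
        latest.pop(key, None)
        latest[key] = line
    return list(latest.values())


def first_token(line):
    """Assumed key: everything before the first run of whitespace."""
    return line.split()[0]


orders = [
    "order-447 pending",
    "order-448 pending",
    "order-447 paid",
    "order-449 pending",
    "order-447 shipped",
    "order-448 paid",
]

latest_orders = keep_last_by_key(orders, key_of=first_token)
print(latest_orders)
# ['order-449 pending', 'order-447 shipped', 'order-448 paid']
```

The survivors come out ordered by where each order last appears in the input, which is why order-449 lands first.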
Performance on real data
On a 2024 M3 MacBook Air, 100K-line input through the full reverse → dedupe → reverse chain takes roughly:
- Reverse: ~80 ms
- Dedupe: ~380 ms
- Reverse: ~80 ms
- Total: ~540 ms for 100K lines, 3 clipboard round-trips
For comparison: the equivalent SQL on a small SQLite database is ~50 ms (faster, but you have to import
the data first — the I/O dominates). Pandas groupby().tail(1) on the same data is
~200 ms after CSV load.
The chained-tool approach wins when: (a) your data is already in clipboard, (b) you don’t want to import / set up anything, (c) you want to keep the data local. Loses when: you’re processing millions of rows and the clipboard round-trips dominate.
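To get a feel for the relative cost of each step on your own machine, a rough timing harness like the one below can help. It is not the benchmark behind the figures above: it runs in Python rather than in the browser, skips the clipboard entirely, and uses synthetic data (100K lines drawn from 20K distinct values, an arbitrary choice), so the absolute numbers will differ.

```python
import random
import time


def timed(label, fn, *args):
    """Run fn(*args) once and print the wall-clock time in milliseconds."""
    start = time.perf_counter()
    result = fn(*args)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"{label}: {elapsed_ms:.1f} ms")
    return result


def reverse(lines):
    return lines[::-1]


def dedupe(lines):
    # First occurrence wins; dict preserves insertion order.
    return list(dict.fromkeys(lines))


# Synthetic input; seeded so repeated runs see the same data.
random.seed(0)
lines = [f"row-{random.randrange(20_000)}" for _ in range(100_000)]

step1 = timed("reverse", reverse, lines)
step2 = timed("dedupe", dedupe, step1)
step3 = timed("reverse", reverse, step2)
print(f"{len(lines)} lines in, {len(step3)} lines out")
```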
Why DedupeLines defaults to first-occurrence
First-occurrence is the more common need (preserves original intent), and it’s the
Excel / SQL DISTINCT default. We considered exposing a “keep last” toggle
on the homepage, but the chain pattern keeps the engine’s hot path clean and lets users compose
behaviours from primitive tools rather than memorising flag combinations. The whole engine
is roughly 350 lines of vanilla JS; complexity has a real cost.
If you find yourself doing this chain often, bookmark both tools (Reverse Lines and the deduper) and tab between them; after a couple of runs the whole chain takes ~10 seconds.
Pitfalls we ran into building this workflow
A few traps to know about before you wire the chain into a recurring task. We’ve hit each of these at least once while testing DedupeLines on real customer data.
- Reversed order matters, not just “dedupe with keep=last.” If you skip step 3 (the final reverse), your output is correct in terms of which rows survived but wrong in terms of order — the kept rows sit at their last position but the list is bottom-to-top. For visual review or copy-back into a spreadsheet, the second reverse is non-optional.
- Trailing whitespace will silently break the chain. DedupeLines defaults to trim-on, which masks this most of the time, but the No-Break Space (U+00A0) and other Unicode whitespace characters that some text editors insert can survive the default trim regex. If your “deduplicated” output still has near-duplicates, that’s the first place to check; see the blank-line trap for a full audit.
- All-unique input is a no-op, not an error. The chain costs you ~540 ms on 100K lines whether or not duplicates exist. For one-off cleanup that’s fine; for a recurring batch job on data we know is mostly unique, we’d skip the chain and dedupe directly.
- Composite-key dedup needs preprocessing. If your “duplicate” signal is one field of a multi-field row (order_id, user_id, etc.), the engine can’t see that natively — whole-line comparison is the only mode. Use the Regex Extractor first to extract just the key column, dedupe that, then re-join in a spreadsheet. We treat this as a deliberate simplification; an open-ended column selector would push the engine toward CSV-parsing complexity that DedupeLines deliberately avoids.
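The whitespace pitfall above can be audited outside the browser with a small script. The sketch below groups lines that look identical once invisible characters at either end are ignored, and reports the groups that contain more than one raw variant. Which characters the DedupeLines trim actually handles is not assumed here; the categories checked (Unicode space separators and format characters) and the names visible_key and near_duplicates are choices made for this example.

```python
import unicodedata


def visible_key(line):
    """Comparison key that ignores invisible characters at both ends of a line."""
    def invisible(ch):
        # Zs = space separators (includes U+00A0); Cf = format characters
        # (includes U+200B); isspace() covers tabs and other plain whitespace.
        return ch.isspace() or unicodedata.category(ch) in ("Zs", "Cf")

    start, end = 0, len(line)
    while start < end and invisible(line[start]):
        start += 1
    while end > start and invisible(line[end - 1]):
        end -= 1
    return line[start:end]


def near_duplicates(lines):
    """Return {visible text: raw variants} for lines that differ only invisibly."""
    groups = {}
    for line in lines:
        groups.setdefault(visible_key(line), set()).add(line)
    return {key: variants for key, variants in groups.items() if len(variants) > 1}


rows = ["alice", "alice\u00a0", "bob", "bob", "carol\u200b"]
for key, variants in near_duplicates(rows).items():
    # ascii() makes the hidden characters visible as escape sequences.
    print(key, sorted(ascii(v) for v in variants))
```

Exact duplicates such as the two bob rows are not reported, since a whole-line dedupe already handles them; only the alice pair, which differs by a trailing U+00A0, shows up.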
Equivalents in other tools
| Tool | Snippet |
|---|---|
| SQL | SELECT * FROM t WHERE id IN (SELECT MAX(id) FROM t GROUP BY key) |
| Pandas | df.drop_duplicates(subset='key', keep='last') |
| awk + tac | tac file \| awk '!seen[$0]++' \| tac (literally the same chain) |
| jq | group_by(.key) \| map(.[-1]) |
| DedupeLines | reverse → dedupe → reverse (this post) |
The awk one-liner is the closest mental model — tac is reverse, !seen[$0]++
is the same first-wins-after-reverse logic. DedupeLines is “awk for people who don’t live in
a terminal.”
Frequently asked questions
Why doesn’t DedupeLines just add a “keep last” toggle to the homepage?
We considered it. The reason we didn’t: the chained-tool pattern composes from primitives you already have to learn anyway (reverse, dedupe), so the surface area of the engine stays small. Adding a toggle widens the flag combination matrix: trim × case × empty × shuffle is already 16 combinations, and adding keep-direction doubles that to 32 to test, document, and translate. Three clicks costs ~10 seconds once your muscle memory is built; a wider config surface costs us indefinitely.
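The arithmetic behind the combination matrix is easy to check. Treating every toggle as a simple on/off switch (an assumption made for this sketch; the toggle names are taken from the answer above), four toggles give 16 combinations and a fifth doubles that to 32.

```python
from itertools import product

toggles = ["trim", "case", "empty", "shuffle"]


def combinations(names):
    """Every on/off assignment for the given toggle names."""
    return [
        dict(zip(names, states))
        for states in product([False, True], repeat=len(names))
    ]


print(len(combinations(toggles)))                       # 16
print(len(combinations(toggles + ["keep_direction"])))  # 32
```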
Does the chain preserve internal ordering when there are no duplicates?
Yes. If every row is already unique, reverse → dedupe → reverse is a no-op for the rows themselves — you get exactly the input back. The intermediate steps still run, so there’s a small time cost (~540 ms on 100K lines), but the data is unchanged.
What about very large files — a million rows or more?
The engine has an 80 MB hard ceiling per run. For 100K-200K-row inputs the chain stays at around a second or less in total. Past 500K rows the clipboard round-trips dominate, because you’re copying the whole array three times between tools. At that scale a one-shot Pandas script is usually faster than three browser passes, even counting the Python startup overhead.
Can I dedupe by a column instead of the whole line?
Not natively. DedupeLines is line-oriented by design. For column-aware dedup, the two workable patterns are (a) extract the key column with the Regex Extractor, dedupe, then VLOOKUP-merge back in your spreadsheet, or (b) use Pandas’ drop_duplicates(subset='key', keep='last') directly, which is the right tool for that shape.
Does the dedupe step need trim and case toggles set a particular way?
For most chained workflows we leave the defaults: trim on, case off, empty on. If your data’s “duplicate” signal depends on exact whitespace (e.g. you’re comparing indented code), turn trim off. The trim interaction is covered in detail in Trim before dedupe: why workflow order changes the result.
Related guides
- Trim before dedupe: why workflow order changes the result — how the trim toggle interacts with the comparison key.
- The blank-line trap — if your chain seems to miss duplicates, invisible Unicode whitespace is the usual culprit.
- Regex extractor recipes — preprocessing patterns for composite-key dedup.
Try it now: paste a list with duplicates, run reverse + dedupe + reverse, and watch the kept rows shift to their last-occurrence position. No upload, all browser-local.
Open the deduper