DedupeLines
By the DedupeLines engineering team · Published 2026-05-16 · Updated 2026-05-17 · 6 min read · tip

Keep the last occurrence of every duplicate — reverse + dedupe + reverse

By default, DedupeLines keeps the first occurrence of every duplicate row. That’s the right behaviour 90% of the time — users expect the original ordering preserved with duplicates collapsed at the position they first appeared.

But sometimes you want the last occurrence instead.

SQL handles this with a per-group query such as SELECT * FROM t WHERE (id, ts) IN (SELECT id, MAX(ts) FROM t GROUP BY id) (see PostgreSQL’s GROUP BY docs for the canonical reference). A plain GROUP BY id ORDER BY ts DESC LIMIT 1 would return a single row overall, not one per id. Pandas does df.groupby('id').tail(1). But if you’re working from a clipboard paste of a flat list, we don’t think you should have to spin up a database or a Python kernel just to keep the last copy of each row.

Three DedupeLines tools chained together do exactly the same thing — in 3 clicks.

The pattern: reverse → dedupe → reverse

The trick is dead simple once you see it:

  1. Reverse the list — what was last is now first.
  2. Dedupe — first occurrence wins, which is the original last occurrence.
  3. Reverse again — restore the original order, with the kept rows now sitting at their last-occurrence positions.
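
For readers who think in code, the three steps can be sketched in a few lines of Python. This is a minimal illustration of the pattern, not the DedupeLines engine itself; the function name keep_last is our own invention for this example.

```python
def keep_last(lines):
    """Keep the last occurrence of each line via reverse, dedupe, reverse."""
    seen = set()
    kept = []
    for line in reversed(lines):   # step 1: walk the list back to front
        if line not in seen:       # step 2: first occurrence wins
            seen.add(line)
            kept.append(line)
    kept.reverse()                 # step 3: restore the original direction
    return kept


print(keep_last(["a", "b", "a", "c", "b"]))  # ['a', 'c', 'b']
```

Each surviving row ends up where its last copy sat, relative to the other survivors.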

Walked through with real input

Imagine an order-status log:

order-447 pending
order-448 pending
order-447 paid
order-449 pending
order-447 shipped
order-448 paid

You want the latest state of each order:

order-449 pending
order-447 shipped
order-448 paid

(Note order-447 kept the “shipped” line, not “pending”, and the surviving lines preserve their last appearance order.)

Step 1 — Reverse

Paste into Reverse Lines:

order-448 paid
order-447 shipped
order-449 pending
order-447 paid
order-448 pending
order-447 pending

Step 2 — Dedupe

Copy that output, paste into the homepage deduper. Important: the deduper compares full lines by default, so identical rows collapse. If you want to dedupe by a prefix (just the order ID, ignoring status), you need a slightly different workflow — see “Dedup by prefix” below.

For the order-status example, every line is unique (different statuses), so a full-line dedupe doesn’t collapse anything. The pattern works when your duplicate signal is the whole line being identical. Real example: a user-action log where the same line repeats:

2026-05-16 10:01 user=alice login
2026-05-16 10:02 user=bob   login
2026-05-16 10:01 user=alice login    ← duplicate
2026-05-16 10:03 user=carol login
2026-05-16 10:01 user=alice login    ← duplicate

After reverse + dedupe + reverse, a single alice row survives, at the position of its last occurrence, so the output reads bob, carol, alice. The first and middle alice rows drop out.
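
To make the intermediate states visible, here is a small Python sketch that runs the login log through each step separately. It assumes the "← duplicate" markers are annotations and not part of the data; dedupe_first is a hypothetical stand-in for the homepage deduper's full-line, first-wins behaviour.

```python
log = [
    "2026-05-16 10:01 user=alice login",
    "2026-05-16 10:02 user=bob   login",
    "2026-05-16 10:01 user=alice login",
    "2026-05-16 10:03 user=carol login",
    "2026-05-16 10:01 user=alice login",
]


def dedupe_first(lines):
    # dict keys keep insertion order, so this keeps each first occurrence
    return list(dict.fromkeys(lines))


step1 = log[::-1]            # Reverse Lines
step2 = dedupe_first(step1)  # deduper: alice, carol, bob
step3 = step2[::-1]          # Reverse Lines again: bob, carol, alice

for line in step3:
    print(line)
```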

Step 3 — Reverse again

Paste the deduped output back into Reverse Lines. Final output is in the original direction with last-occurrence semantics.

Dedup by prefix (the harder case)

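If you want to dedupe by a key (order ID) that isn’t the whole line, the chained pattern alone won’t work; you need to make the key the whole line first. Two approaches: (a) extract the key column with the Regex Extractor, dedupe, then VLOOKUP-merge back in your spreadsheet, or (b) use Pandas’ drop_duplicates(subset='key', keep='last') directly.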
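
If a short script is an option, the keyed version of the pattern can be sketched in plain Python. This is an illustration under assumptions, not a DedupeLines feature: keep_last_by_key is a hypothetical helper, and the key is assumed to be everything before the first space.

```python
def keep_last_by_key(lines, key):
    """Keep the last line for each key, in order of last appearance."""
    seen = set()
    kept = []
    for line in reversed(lines):
        k = key(line)
        if k not in seen:   # first key seen from the end is the last overall
            seen.add(k)
            kept.append(line)
    kept.reverse()
    return kept


orders = [
    "order-447 pending",
    "order-448 pending",
    "order-447 paid",
    "order-449 pending",
    "order-447 shipped",
    "order-448 paid",
]

# Assumption: the order ID is the text before the first space.
latest = keep_last_by_key(orders, key=lambda line: line.partition(" ")[0])
for line in latest:
    print(line)
```

The comparison runs on the key while the full line is what gets kept, which is exactly what the full-line deduper cannot do on its own.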

Performance on real data

On a 2024 M3 MacBook Air, 100K-line input through the full reverse → dedupe → reverse chain takes roughly 540 ms in total.

For comparison: the equivalent SQL on a small SQLite database is ~50 ms (faster, but you have to import the data first — the I/O dominates). Pandas groupby().tail(1) on the same data is ~200 ms after CSV load.

The chained-tool approach wins when: (a) your data is already in clipboard, (b) you don’t want to import / set up anything, (c) you want to keep the data local. Loses when: you’re processing millions of rows and the clipboard round-trips dominate.
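
If you want a rough feel for the in-memory cost on your own machine, a tiny timing harness is enough. The sketch below times a Python version of the chain on synthetic data; it measures Python, not the browser engine or the clipboard round-trips, so its numbers are not comparable to the figures above. The row format and the number of distinct values are arbitrary choices for illustration.

```python
import random
import time


def keep_last(lines):
    # reverse, keep first occurrences (dict preserves order), reverse again
    return list(dict.fromkeys(reversed(lines)))[::-1]


rng = random.Random(0)  # fixed seed so runs are repeatable
lines = [f"row-{rng.randrange(20_000)}" for _ in range(100_000)]

start = time.perf_counter()
result = keep_last(lines)
elapsed_ms = (time.perf_counter() - start) * 1000

print(f"{len(lines)} lines -> {len(result)} kept in {elapsed_ms:.1f} ms")
```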

Why DedupeLines defaults to first-occurrence

First-occurrence is the more common need (preserves original intent), and it’s the Excel / SQL DISTINCT default. We considered exposing a “keep last” toggle on the homepage, but the chain pattern keeps the engine’s hot path clean and lets users compose behaviours from primitive tools rather than memorising flag combinations. The whole engine is roughly 350 lines of vanilla JS; complexity has a real cost.

If you find yourself doing this chain often, bookmark all three tools and clipboard-tab between them. Once you’ve done it twice, the whole chain takes ~10 seconds from muscle memory.

Pitfalls we ran into building this workflow

A few traps to know about before you wire the chain into a recurring task. We’ve hit each of these at least once while testing DedupeLines on real customer data.

Equivalents in other tools

Tool          Snippet
SQL           SELECT * FROM t WHERE id IN (SELECT MAX(id) FROM t GROUP BY key)
Pandas        df.drop_duplicates(subset='key', keep='last')
awk + tac     tac file | awk '!seen[$0]++' | tac (literally the same chain)
jq            group_by(.key) | map(.[-1])
DedupeLines   reverse → dedupe → reverse (this post)

The awk one-liner is the closest mental model — tac is reverse, !seen[$0]++ is the same first-wins-after-reverse logic. DedupeLines is “awk for people who don’t live in a terminal.”
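
For completeness, the same result can be reached in Python without reversing at all, by recording the last index of each line and keeping only rows that sit at that index. The sketch below is a cross-check we find handy, not how DedupeLines works internally; both function names are hypothetical.

```python
def keep_last_one_pass(lines):
    # Later indices overwrite earlier ones, so each line maps to its last index.
    last_index = {line: i for i, line in enumerate(lines)}
    return [line for i, line in enumerate(lines) if last_index[line] == i]


def keep_last_chain(lines):
    # Python mirror of: tac | awk '!seen[$0]++' | tac
    return list(dict.fromkeys(reversed(lines)))[::-1]


sample = ["a", "b", "a", "c", "b"]
print(keep_last_one_pass(sample))  # ['a', 'c', 'b']
print(keep_last_one_pass(sample) == keep_last_chain(sample))  # True
```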

Frequently asked questions

Why doesn’t DedupeLines just add a “keep last” toggle to the homepage?

We considered it. The reason we didn’t: the chained-tool pattern composes from primitives you already have to learn anyway (reverse, dedupe), so the surface area of the engine stays small. Adding a toggle creates a new flag combination matrix: trim × case × empty × shuffle × keep-direction = 32 combinations (five on/off switches, up from 16 with four) to test, document, and translate. Three clicks costs ~10 seconds once your muscle memory is built; a wider config surface costs us indefinitely.
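
The size of that matrix is easy to count directly. Assuming every toggle is a simple on/off switch (our simplification for this sketch), five toggles give 32 combinations, against 16 for the four that exist without a keep-direction flag.

```python
from itertools import product

# Assumption: each toggle has exactly two states.
current_toggles = ["trim", "case", "empty", "shuffle"]
with_keep_direction = current_toggles + ["keep_direction"]

current = list(product([False, True], repeat=len(current_toggles)))
proposed = list(product([False, True], repeat=len(with_keep_direction)))

print(len(current))   # 16
print(len(proposed))  # 32
```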

Does the chain preserve internal ordering when there are no duplicates?

Yes. If every row is already unique, reverse → dedupe → reverse is a no-op for the rows themselves — you get exactly the input back. The intermediate steps still run, so there’s a small time cost (~540 ms on 100K lines), but the data is unchanged.

What about very large files — a million rows or more?

The engine has an 80 MB hard ceiling per run. For 100K-200K-row inputs the chain stays comfortably under a second in total. Past 500K rows the clipboard round-trips dominate, because you’re copying the whole array three times between tools. At that scale a one-shot Pandas script is usually faster than three browser passes, even counting the Python startup overhead.

Can I dedupe by a column instead of the whole line?

Not natively. DedupeLines is line-oriented by design. For column-aware dedup, the two workable patterns are (a) extract the key column with the Regex Extractor, dedupe, then VLOOKUP-merge back in your spreadsheet, or (b) use Pandas’ drop_duplicates(subset='key', keep='last') directly, which is the right tool for that shape.

Does the dedupe step need trim and case toggles set a particular way?

For most chained workflows we leave the defaults: trim on, case off, empty on. If your data’s “duplicate” signal depends on exact whitespace (e.g. you’re comparing indented code), turn trim off. The trim interaction is covered in detail in Trim before dedupe: why workflow order changes the result.
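
To show why the toggles change what counts as a duplicate, here is a rough Python sketch of a first-wins dedupe with three switches. The parameter names and their exact semantics are our assumptions for illustration: we read "trim" as comparing lines with surrounding whitespace stripped, "case" as case-insensitive matching, and "empty" as dropping blank lines. The real engine may differ in the details, including whether it outputs the trimmed or the original line.

```python
def dedupe(lines, trim=True, ignore_case=False, drop_empty=True):
    """First-wins dedupe; the toggles only change the comparison key."""
    seen = set()
    kept = []
    for line in lines:
        if drop_empty and not line.strip():
            continue                      # assumption: blank lines are removed
        key = line.strip() if trim else line
        if ignore_case:
            key = key.casefold()
        if key not in seen:
            seen.add(key)
            kept.append(line)             # assumption: original line is kept
    return kept


rows = ["  a", "a ", "", "A"]
print(dedupe(rows))                    # ['  a', 'A']
print(dedupe(rows, trim=False))        # ['  a', 'a ', 'A']
print(dedupe(rows, ignore_case=True))  # ['  a']
```

With trim off, lines that differ only in indentation stay distinct, which is the behaviour you want when comparing indented code.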

Try it now: paste a list with duplicates, run reverse + dedupe + reverse, and watch the kept rows shift to their last-occurrence position. No upload, all browser-local.
