DedupeLines
By the DedupeLines engineering team · Published 2026-05-16 · Updated 2026-05-17 · 7 min read · tip

Filter 50K emails in 5 minutes — regex extractor recipes

A 50,000-row email list lands on your desk. You need to: filter only Gmail addresses, drop the role addresses (info@, support@), find every .edu domain, or pull only your own @your-company.com internal accounts. Excel filter dialogs are slow and need you to set custom criteria. grep works but requires a terminal and an SSH session if the data lives elsewhere. Python re is overkill for a one-off.

The DedupeLines Regex Extractor handles all of these from a clipboard paste. This post collects five copy-paste regex patterns that cover the common email-filtering jobs. Each was tested against a 50K-line synthetic dataset on a 2024 M3 MacBook Air.

All patterns use JavaScript-flavour RegExp (lookbehind, named groups, Unicode property escapes all supported, per ECMA-262 §22.2). Don’t prefix patterns with /; the slashes around the input field on the tool page are decorative.

Recipe 1 — By TLD (.edu, .gov, .com only)

Pull every line ending with a specific top-level domain. Useful for: identifying institutional addresses (.edu for education campaigns), government contacts (.gov), or filtering out commercial domains (!.com) when prospecting.

@[\w.-]+\.edu\b

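Swap edu for gov, org, io, ai, etc. The \b word boundary stops false matches on .education or .governance. It does not stop a match on a longer hostname such as mit.edu.example.com, because the dot after edu also counts as a word boundary; replace \b with $ if the TLD has to be the last thing on the line.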

Performance: 50K-line input filters to ~3K matching .edu addresses in ~190 ms on the M3 MBA test. Excel’s “custom filter → ends with .edu” takes ~5 seconds on the same data and needs a fresh dialog interaction every time.
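If you want to sanity-check the pattern outside the browser, here is a minimal Python sketch of the same filter. The helper name filter_by_tld and the sample addresses are made up for illustration and are not part of DedupeLines. Python's re module is not identical to JavaScript RegExp, so the sketch passes re.ASCII to keep \w limited to ASCII and re.IGNORECASE to mirror the tool's case-insensitive default.

```python
import re


def filter_by_tld(lines, tld):
    """Keep lines containing '@<host>.<tld>' followed by a word boundary."""
    # Hypothetical helper: re.ASCII keeps \w as [A-Za-z0-9_], as in JavaScript.
    pattern = re.compile(
        r"@[\w.-]+\." + re.escape(tld) + r"\b",
        re.ASCII | re.IGNORECASE,
    )
    return [line for line in lines if pattern.search(line)]


sample = [
    "ada@cs.example.edu",
    "bob@example.education",  # \b rejects this: 'edu' runs straight into 'c'
    "carol@example.com",
    "dave@EXAMPLE.EDU",       # kept because matching is case-insensitive
]
edu_only = filter_by_tld(sample, "edu")
print(edu_only)  # ['ada@cs.example.edu', 'dave@EXAMPLE.EDU']
```

Passing "gov" or "org" as the second argument gives the other variants without editing the pattern by hand.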

Recipe 2 — By specific domain (only @gmail.com)

Keep only addresses at one specific domain. Useful for: B2C campaigns targeting personal addresses (Gmail, Outlook), or the inverse — isolating your own company’s internal addresses.

@gmail\.com$

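The $ anchor matches end-of-line, which matters so you don’t accidentally match user@gmail.com.evil.com. The literal @ at the front likewise rules out lookalike domains such as user@fakegmail.com. Matching is case-insensitive by default, so @Gmail.COM still matches; use the case toggle if you need exact case.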
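To see what the @ and $ are each doing, here is a small Python sketch that splits a list into matching and non-matching lines. The function name split_by_domain and the sample data are assumptions for illustration; the tool itself only returns the matching side.

```python
import re

# Same pattern as the recipe; IGNORECASE mirrors the tool's default.
GMAIL_RE = re.compile(r"@gmail\.com$", re.IGNORECASE)


def split_by_domain(lines, pattern=GMAIL_RE):
    """Return (matching, non_matching) so the inverse filter comes for free."""
    matched, rest = [], []
    for line in lines:
        (matched if pattern.search(line) else rest).append(line)
    return matched, rest


sample = [
    "ann@gmail.com",
    "bo@Gmail.COM",
    "cy@gmail.com.example.net",  # rejected by the $ anchor
    "di@notgmail.com",           # rejected by the literal @
]
gmail, other = split_by_domain(sample)
print(gmail)  # ['ann@gmail.com', 'bo@Gmail.COM']
print(other)  # ['cy@gmail.com.example.net', 'di@notgmail.com']
```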

Variations:

Recipe 3 — Drop role addresses (info@, support@, no-reply@)

Role addresses (info@, support@, sales@, no-reply@, postmaster@, etc.) usually shouldn’t go into a marketing email blast: they’re shared inboxes, they’re often filtered, and mailing them as if they were personal addresses can create CAN-SPAM exposure.

The Regex Extractor only keeps matching lines, so to drop role addresses we use a negative lookahead at the start:

^(?!(info|support|sales|admin|contact|hello|noreply|no-reply|postmaster|webmaster|abuse|hr|jobs|press|marketing)@)\S+@\S+

This keeps every line that does not start with a role prefix. Lookahead syntax (?!...) works in modern Chrome / Firefox / Safari / Edge. \S+@\S+ is a loose email shape filter (catches anything that looks like an address).

Performance: 50K-row input filters to ~46K personal addresses in ~210 ms. Trying to do this in Excel requires either a helper column with =NOT(OR(LEFT(A1,5)="info@",...)) or a manual filter on each role — both painful.
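The negative lookahead is easier to trust once you have watched it on a few edge cases. Below is a hedged Python sketch that builds the pattern from a list of prefixes; ROLE_PREFIXES is a shortened, illustrative list and build_non_role_pattern is a hypothetical helper, not something the tool exposes.

```python
import re

# Shortened list for illustration; extend it with the prefixes from the recipe.
ROLE_PREFIXES = ["info", "support", "sales", "noreply", "no-reply", "postmaster"]


def build_non_role_pattern(prefixes):
    """Build ^(?!(?:a|b|...)@)\\S+@\\S+ from a list of role prefixes."""
    alternation = "|".join(re.escape(p) for p in prefixes)
    return re.compile(r"^(?!(?:" + alternation + r")@)\S+@\S+", re.IGNORECASE)


NON_ROLE_RE = build_non_role_pattern(ROLE_PREFIXES)

sample = [
    "info@example.com",         # dropped: role prefix directly before @
    "Support@example.com",      # dropped: match is case-insensitive
    "informatics@example.com",  # kept: 'info' is not followed by @
    "jane@example.com",         # kept
    "not an address",           # dropped: fails the loose \S+@\S+ shape
]
kept = [line for line in sample if NON_ROLE_RE.search(line)]
print(kept)  # ['informatics@example.com', 'jane@example.com']
```

The @ inside the lookahead is what keeps informatics@ and salesforce-style local parts from being thrown out along with the real role inboxes.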

Recipe 4 — Format validity (catch typos and broken addresses)

Email lists from CRMs and form submissions often contain garbage: missing @, spaces in the local part, trailing commas from CSV mishandling, or just plain typos like jane@gmail (missing TLD).

A practical “is-this-an-email-shape” regex (not RFC 5322 §3.4.1 compliant, but catches 95% of typos):

^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$

This requires: at least one valid local-part character, an @, a domain with at least one dot, and a TLD of at least 2 letters. Anchored to start-of-line and end-of-line so partial matches don’t sneak through.

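Trade-offs: rejects valid IDN (internationalised) addresses, because [A-Za-z] excludes accented and non-Latin letters. If you need IDN support, switch to: ^[\w.+%-]+@[\p{L}\d.-]+\.[\p{L}]{2,}$ (Unicode property escapes, which JavaScript only recognises when the pattern runs with the u flag).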

Leave matching case-insensitive (the default) and pair with trim-on: trailing whitespace breaks the $ anchor and silently excludes valid lines.
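The trim point is the one that bites most often, so here is a minimal Python sketch showing the same list with and without trimming. split_valid and its trim parameter are hypothetical names standing in for the tool's trim option. The sketch uses fullmatch because Python's $ also matches just before a trailing newline, which JavaScript's does not.

```python
import re

EMAIL_SHAPE_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")


def split_valid(lines, trim=True):
    """Return (valid, rejected); valid entries are trimmed when trim=True."""
    valid, rejected = [], []
    for line in lines:
        candidate = line.strip() if trim else line
        # fullmatch avoids Python's "$ matches before a final newline" quirk.
        if EMAIL_SHAPE_RE.fullmatch(candidate):
            valid.append(candidate)
        else:
            rejected.append(line)
    return valid, rejected


sample = [
    "jane@gmail.com",
    "jane@gmail",            # missing TLD
    "jane doe@example.com",  # space in the local part
    "bob@example.com,",      # trailing comma from CSV mishandling
    "amy@example.org  ",     # valid once trailing spaces are trimmed
]
valid, rejected = split_valid(sample)
print(valid)     # ['jane@gmail.com', 'amy@example.org']
print(rejected)  # ['jane@gmail', 'jane doe@example.com', 'bob@example.com,']

untrimmed_valid, _ = split_valid(sample, trim=False)
print(untrimmed_valid)  # ['jane@gmail.com']
```

Returning the rejected lines as well makes it easy to eyeball what the filter threw away before you discard it.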

Recipe 5 — By region or country signal

Country-code TLDs (ccTLDs) let you segment by region; IANA maintains the full list of valid ones. For example:

@[\w.-]+\.(de|at|ch)$

DACH region (Germany / Austria / Switzerland). Common variations:

Caveat: ccTLDs aren’t a perfect signal — many international companies use .com regardless of country. For higher-fidelity geo filtering, pair this with email-name heuristics (German-language first names) or enrich with an IP-to-country lookup at the time of capture.
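If you segment into several regions at once, a small lookup table saves re-running the pattern by hand. The following Python sketch is one way to do that under stated assumptions: REGIONS, region_pattern and segment are hypothetical names, the DACH group comes from the recipe, and the "nordics" group is only an illustrative extra.

```python
import re

REGIONS = {
    "dach": ["de", "at", "ch"],           # from the recipe above
    "nordics": ["se", "no", "dk", "fi"],  # illustrative extra group
}


def region_pattern(cctlds):
    """Build @[\\w.-]+\\.(?:a|b|c)$ for a list of ccTLDs."""
    return re.compile(
        r"@[\w.-]+\.(?:" + "|".join(cctlds) + r")$",
        re.ASCII | re.IGNORECASE,
    )


def segment(lines):
    """Put each line in the first region whose pattern matches, else 'other'."""
    patterns = {name: region_pattern(tlds) for name, tlds in REGIONS.items()}
    buckets = {name: [] for name in REGIONS}
    buckets["other"] = []
    for line in lines:
        for name, pattern in patterns.items():
            if pattern.search(line):
                buckets[name].append(line)
                break
        else:
            buckets["other"].append(line)
    return buckets


sample = [
    "kai@firma.de",
    "lea@uni.ac.at",
    "ole@example.no",
    "sam@example.com",
    "max@example.dev",  # the $ anchor keeps .dev from matching .de
]
buckets = segment(sample)
print(buckets["dach"])     # ['kai@firma.de', 'lea@uni.ac.at']
print(buckets["nordics"])  # ['ole@example.no']
print(buckets["other"])    # ['sam@example.com', 'max@example.dev']
```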

Combining recipes (workflow chains)

The recipes above are line-level filters. To combine them, run the Regex Extractor multiple times in sequence:

  1. Run Recipe 4 (format validity) — drops typos.
  2. Copy output, run Recipe 3 (drop role addresses).
  3. Copy output, run Recipe 2 (by domain) or Recipe 5 (by region).
  4. (Optional) Use the homepage deduper to remove duplicate addresses across the cleaned list.

A 50K-row dirty list through the full chain takes ~1 second total in clipboard time on the M3 MBA. The equivalent Excel workflow (4 separate filter dialogs + a Remove Duplicates pass) runs ~30-60 seconds of click time, so this is a measurable productivity win for one-off list cleaning.
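The chain above can be sketched in a few lines of Python if you ever need to script it. This is an approximation under assumptions, not how the tool is implemented: keep_matching and clean are hypothetical names, the role list is shortened, and the final step uses exact-match deduplication, which may differ from what the homepage deduper does.

```python
import re

FORMAT_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")
# Shortened role list for illustration only.
NON_ROLE_RE = re.compile(r"^(?!(info|support|sales|no-reply)@)\S+@\S+", re.IGNORECASE)
DACH_RE = re.compile(r"@[\w.-]+\.(de|at|ch)$", re.ASCII | re.IGNORECASE)


def keep_matching(lines, pattern):
    """One pass of a line-level filter: keep only lines the pattern matches."""
    return [line for line in lines if pattern.search(line)]


def clean(lines):
    """Trim, run Recipe 4, then Recipe 3, then Recipe 5, then dedupe."""
    stage = [line.strip() for line in lines]
    for pattern in (FORMAT_RE, NON_ROLE_RE, DACH_RE):
        stage = keep_matching(stage, pattern)
    # dict.fromkeys drops exact duplicates while preserving first-seen order.
    return list(dict.fromkeys(stage))


dirty = [
    "kai@firma.de",
    "info@firma.de",    # role address
    "kai@firma.de ",    # duplicate once trimmed
    "lea@uni.ac.at",
    "broken@firma",     # missing TLD
    "sam@example.com",  # not a DACH ccTLD
]
cleaned = clean(dirty)
print(cleaned)  # ['kai@firma.de', 'lea@uni.ac.at']
```

Running the format check first keeps the later, looser patterns from ever seeing malformed lines, which is the same reason the manual chain starts with Recipe 4.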

Why regex (and not Excel filters)

Three reasons regex wins for ad-hoc email cleaning:

The downside: regex has a learning curve. The five recipes above cover ~80% of email-filtering jobs; bookmark this page or copy the patterns into a personal snippets file.

Equivalents in other tools

Job: By TLD .edu
  grep / awk: grep -E '@[[:alnum:]_.-]+\.edu\b' file
  Python: [l for l in lines if re.search(r'@[\w.-]+\.edu\b', l)]

Job: Drop role addrs
  grep / awk: grep -Ev '^(info|support|sales|...)@' file
  Python: [l for l in lines if not re.match(r'^(info|...)@', l)]

Job: Format check
  grep / awk: grep -E '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$' file
  Python: EMAIL_RE.match(l)

The same patterns carry over to all three contexts with two adjustments. POSIX ERE (grep -E) treats a backslash inside a bracket expression as a literal character, so [\w.-] is written [[:alnum:]_.-] there; \b is an extension rather than POSIX, though GNU grep supports it. Recipe 3’s lookahead is not supported by standard grep -E: either invert the match with -v as shown above, or use grep -P or pcregrep.

Frequently asked questions

Can the regex extractor handle a million-row list?

Up to the 80 MB hard ceiling per run, yes, although our own testing stops at 800K rows on the M3 MBA with the Recipe 4 format-validity regex. The tool stays responsive because anything above 100K lines automatically routes to a Web Worker thread. The output downloads as a .txt file rather than rendering inline.

How is this different from Excel’s regex filter (newer Office 365)?

Office 365 added REGEXTEST, REGEXEXTRACT, and REGEXREPLACE functions in 2024. They use a slightly different regex flavour (close to PCRE) and are excellent for in-spreadsheet workflows. The trade-off: you need the latest Office build, the formula recalculates every keystroke, and your data is in Excel cells, which means a 100K-row dataset starts to lag. We use Excel regex for one-off cell-level matching and DedupeLines regex for “clean this 50K-line list and download the output” workflows.

Does the regex extractor work with internationalised (IDN) email addresses?

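Recipe 4 as written doesn’t: the [A-Za-z] character class excludes accented and non-Latin characters. The IDN-aware version uses Unicode property escapes: ^[\w.+%-]+@[\p{L}\d.-]+\.[\p{L}]{2,}$. Recipes 2 and 3 work as-is with IDN addresses because they don’t restrict character classes. Recipes 1 and 5 use [\w.-] for the domain, and JavaScript’s \w is essentially ASCII-only, so they match Punycode (xn--) domains but skip domains written with non-ASCII letters unless you widen the class to [\p{L}\d.-].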

What about quoted local-parts like "hello world"@example.com?

RFC 5322 technically allows quoted strings, dots in unusual positions, comments, and other oddities in the local-part. Our recipes don’t match those. In practice we’ve never seen a legitimate marketing-list email with a quoted local-part — this corner of the spec is mostly historical. If your data really does contain them, we’d recommend pre-cleaning with a script rather than trying to express the full RFC syntax as a single regex.

Why is the regex case-insensitive by default?

The email spec (RFC 5321 §2.4) says the domain part is case-insensitive; the local-part is technically case-sensitive but every major mail provider treats it as insensitive in practice. So matching @Gmail.COM and @gmail.com as equivalent is what users expect. Switch to case-sensitive matching only if you have a specific reason, e.g. you’re comparing exact spellings as they appeared in a sign-up form.

Paste your dirty email list, write one of the patterns above, click Run. 50K rows filtered in under a second — no upload, no install.

Open the regex extractor