Filter 50K emails in 5 minutes — regex extractor recipes
A 50,000-row email list lands on your desk. You need to: filter only Gmail addresses, drop the role
addresses (info@, support@), find every .edu domain, or pull only your own
@your-company.com internal accounts. Excel filter dialogs are slow and need you to set
custom criteria. grep works but requires a terminal and an SSH session if the data lives
elsewhere. Python re is overkill for a one-off.
The DedupeLines Regex Extractor handles all of these from a clipboard paste. This post is five copy-paste regex patterns covering the common email-filtering jobs. Each is tested against a 50K-line synthetic dataset on a 2024 M3 MacBook Air.
All patterns use JavaScript-flavour RegExp
(lookbehind, named groups, Unicode property escapes all supported, per
ECMA-262 §22.2).
Don’t prefix patterns with /; the slashes around the input field on the tool page are decorative.
Recipe 1 — By TLD (.edu, .gov, .com only)
Pull every line ending with a specific top-level domain. Useful for: identifying institutional addresses
(.edu for education campaigns), government contacts (.gov), or filtering out
commercial domains (!.com) when prospecting.
@[\w.-]+\.edu\b
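Swap edu for gov, org, io, ai, etc.
The \b word boundary stops false matches on .education or .governance. It does not anchor the match to the end of the line, so an address like name@uni.edu.au also matches; replace \b with $ if .edu must be the final label.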
Performance: 50K-line input filters to ~3K matching .edu addresses in
~190 ms on the M3 MBA test. Excel’s “custom filter → ends with .edu” takes ~5 seconds
on the same data and needs a fresh dialog interaction every time.
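To see what the \b boundary does and does not guard against, here is a minimal Python sketch of the same filter. The helper name filter_by_tld and the sample addresses are invented for illustration, and re.IGNORECASE stands in for the tool's case-insensitive default.

```python
import re

def filter_by_tld(lines, tld):
    """Keep lines containing an address whose domain has `tld` as a whole label."""
    pattern = re.compile(r"@[\w.-]+\." + re.escape(tld) + r"\b", re.IGNORECASE)
    return [line for line in lines if pattern.search(line)]

# Hypothetical sample data, not taken from the 50K-line test set.
sample = [
    "ana@cs.stanford.edu",
    "bob@online.education",  # rejected: "c" follows "edu", so \b fails
    "cy@uni.edu.au",         # kept: \b is satisfied by the following "."
    "dee@example.com",
]

edu_hits = filter_by_tld(sample, "edu")
print(edu_hits)
```

Swapping the trailing \b for $ in the pattern would drop the .edu.au line as well.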
Recipe 2 — By specific domain (only @gmail.com)
Keep only addresses at one specific domain. Useful for: B2C campaigns targeting personal addresses (Gmail, Outlook), or the inverse — isolating your own company’s internal addresses.
@gmail\.com$
The $ anchor matches end-of-line, so a lookalike such as user@gmail.com.evil.com is rejected; the literal @ in front does the same job for user@fakegmail.com. Matching is case-insensitive by default,
so @Gmail.COM still matches.
Variations:
- @(gmail|outlook|yahoo|hotmail)\.com$: major free providers in one regex.
- @your-company\.com$: only your internal addresses.
- Add the dedup-after-match toggle on the tool page if the same address appears multiple times in your input.
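As a rough illustration of how the two anchors behave, the following Python sketch runs the single-domain and multi-provider patterns over a few invented addresses. re.IGNORECASE is an assumption that mirrors the tool's default; the variable names are hypothetical.

```python
import re

GMAIL_RE = re.compile(r"@gmail\.com$", re.IGNORECASE)
FREE_RE = re.compile(r"@(gmail|outlook|yahoo|hotmail)\.com$", re.IGNORECASE)

# Hypothetical sample data.
sample = [
    "jo@Gmail.COM",            # kept: case-insensitive match
    "sam@gmail.com.evil.com",  # rejected by the $ anchor
    "kim@fakegmail.com",       # rejected by the literal @
    "lee@outlook.com",
]

gmail_only = [s for s in sample if GMAIL_RE.search(s)]
free_mail = [s for s in sample if FREE_RE.search(s)]
print(gmail_only, free_mail)
```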
Recipe 3 — Drop role addresses (info@, support@, no-reply@)
Role addresses (info@, support@, sales@, no-reply@, postmaster@, etc.) usually shouldn’t go into a marketing email blast: they’re shared inboxes, they’re often filtered, and mailing them as if they were personal addresses can create compliance risk under rules such as CAN-SPAM.
The Regex Extractor only keeps matching lines, so to drop role addresses we use a negative lookahead at the start:
^(?!(info|support|sales|admin|contact|hello|noreply|no-reply|postmaster|webmaster|abuse|hr|jobs|press|marketing)@)\S+@\S+
This keeps every line that does not start with a role prefix. Lookahead syntax
(?!...) works in modern Chrome / Firefox / Safari / Edge. \S+@\S+ is a loose
email shape filter (catches anything that looks like an address).
Performance: 50K-row input filters to ~46K personal addresses in ~210 ms. Trying to
do this in Excel requires either a helper column with =NOT(OR(LEFT(A1,5)="info@",...))
or a manual filter on each role — both painful.
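The negative lookahead can be checked outside the tool with a short Python sketch. The pattern is copied from above; the sample lines are invented, and re.IGNORECASE is assumed to match the tool's default. Note that the lookahead only blocks a role word when it is followed directly by @, so longer local parts that merely begin with a role word survive.

```python
import re

NOT_ROLE = re.compile(
    r"^(?!(info|support|sales|admin|contact|hello|noreply|no-reply|postmaster"
    r"|webmaster|abuse|hr|jobs|press|marketing)@)\S+@\S+",
    re.IGNORECASE,
)

# Hypothetical sample data.
sample = [
    "info@acme.io",       # dropped: role prefix
    "Support@acme.io",    # dropped: case-insensitive role prefix
    "informant@acme.io",  # kept: "info" is not followed by @
    "hr.team@acme.io",    # kept: "hr" is not followed by @
    "jane@acme.io",
    "not an email",       # dropped: fails the loose \S+@\S+ shape
]

personal = [s for s in sample if NOT_ROLE.match(s)]
print(personal)
```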
Recipe 4 — Format validity (catch typos and broken addresses)
Email lists from CRMs and form submissions often contain garbage: missing @, spaces in the local part,
trailing commas from CSV mishandling, or just plain typos like jane@gmail (missing TLD).
A practical “is-this-an-email-shape” regex (not RFC 5322 §3.4.1 compliant, but catches 95% of typos):
^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$
This requires: at least one valid local-part character, an @, a domain with at least one dot,
and a TLD of at least 2 letters. Anchored to start-of-line and end-of-line so partial matches don’t
sneak through.
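Trade-offs: rejects valid IDN (internationalised) addresses, i.e. those with non-ASCII letters such as ü in the local part or the domain. If you need IDN support, switch to:
^[\p{L}\p{N}._%+-]+@[\p{L}\p{N}.-]+\.\p{L}{2,}$ (Unicode property escapes, which JavaScript only honours when the regex is compiled with the u or v flag; \p{L}\p{N} replaces \w because JavaScript’s \w does not cover accented or non-Latin letters).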
Keep matching case-insensitive (the default) and pair it with
trim-on: trailing whitespace breaks the $ anchor and silently
excludes valid lines.
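The effect of trimming on the $ anchor is easy to reproduce. This Python sketch applies the ASCII shape pattern with and without stripping whitespace; valid_lines and the sample rows are hypothetical, and the trim argument only imitates what the trim-on option is described as doing.

```python
import re

EMAIL_SHAPE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")

def valid_lines(lines, trim=True):
    """Return the lines that look like an email address, optionally trimmed first."""
    kept = []
    for line in lines:
        candidate = line.strip() if trim else line
        if EMAIL_SHAPE.match(candidate):
            kept.append(candidate)
    return kept

# Hypothetical sample data covering the failure modes listed above.
sample = [
    "jane@gmail.com",
    "jane@gmail",        # missing TLD
    "bob smith@x.com",   # space in the local part
    "amy@x.org,",        # trailing comma from CSV mishandling
    "tom@x.net  ",       # trailing whitespace
]

with_trim = valid_lines(sample, trim=True)
without_trim = valid_lines(sample, trim=False)
print(with_trim, without_trim)
```

Without trimming, the otherwise valid tom@x.net line is lost, which is the silent exclusion described above.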
Recipe 5 — By region or country signal
Country-coded TLDs let you segment by region (the full list of valid ccTLDs is maintained by IANA). For the DACH region (Germany / Austria / Switzerland):
@[\w.-]+\.(de|at|ch)$
Common variations:
- Nordics: @[\w.-]+\.(se|no|dk|fi|is)$
- UK + Ireland: @[\w.-]+\.(co\.uk|uk|ie)$
- LATAM: @[\w.-]+\.(mx|ar|co|cl|pe|br)$
- APAC: @[\w.-]+\.(jp|kr|cn|tw|hk|sg|au|nz)$
Caveat: ccTLDs aren’t a perfect signal — many international companies use
.com regardless of country. For higher-fidelity geo filtering, pair this with email-name
heuristics (German-language first names) or enrich with an IP-to-country lookup at the time of capture.
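If you want all regions in one pass rather than one run per pattern, the same ccTLD groups can be sketched in Python as a lookup of compiled patterns. The segment helper, the "other" bucket, and the sample addresses are assumptions made for this example; only three of the regions above are included to keep it short.

```python
import re

# Region name -> alternation of ccTLDs, taken from the patterns above.
REGION_TLDS = {
    "DACH": r"de|at|ch",
    "Nordics": r"se|no|dk|fi|is",
    "UK + Ireland": r"co\.uk|uk|ie",
}
REGION_RES = {
    name: re.compile(r"@[\w.-]+\.(" + tlds + r")$", re.IGNORECASE)
    for name, tlds in REGION_TLDS.items()
}

def segment(lines):
    """Place each line in the first region whose pattern matches, else 'other'."""
    buckets = {name: [] for name in REGION_RES}
    buckets["other"] = []
    for line in lines:
        for name, rx in REGION_RES.items():
            if rx.search(line):
                buckets[name].append(line)
                break
        else:
            buckets["other"].append(line)
    return buckets

# Hypothetical sample data.
sample = ["uwe@firma.de", "liv@butikk.no", "ian@shop.co.uk", "max@global.com"]
buckets = segment(sample)
print(buckets)
```

The .com address lands in "other", which is the caveat above in miniature: the TLD says nothing about where that company is based.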
Combining recipes (workflow chains)
The recipes above are line-level filters. To combine them, run the Regex Extractor multiple times in sequence:
- Run Recipe 4 (format validity) — drops typos.
- Copy output, run Recipe 3 (drop role addresses).
- Copy output, run Recipe 2 (by domain) or Recipe 5 (by region).
- (Optional) Use the homepage deduper to remove duplicate addresses across the cleaned list.
A 50K-row dirty list through the full chain takes ~1 second of clipboard time in total on the M3 MBA. The equivalent Excel workflow (4 separate filter dialogs + a Remove Duplicates pass) runs ~30-60 seconds of click time, so this is a measurable productivity win for one-off list cleaning.
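The chain above amounts to applying several keep-filters in sequence and deduplicating at the end. A minimal Python sketch of that idea follows; run_chain is a hypothetical helper, the role list is shortened for readability, and the exact-match, order-preserving dedupe is an assumption about what the homepage deduper does rather than a description of it.

```python
import re

FORMAT_OK = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")
# Shortened role list for illustration; Recipe 3 has the full one.
NOT_ROLE = re.compile(r"^(?!(info|support|sales|no-reply)@)\S+@\S+", re.IGNORECASE)
GMAIL = re.compile(r"@gmail\.com$", re.IGNORECASE)

def run_chain(lines, patterns):
    """Trim, apply each keep-filter in order, then drop exact duplicates."""
    kept = [line.strip() for line in lines]
    for pattern in patterns:
        kept = [line for line in kept if pattern.search(line)]
    return list(dict.fromkeys(kept))  # preserves first-seen order

# Hypothetical sample data.
dirty = [
    "jane@gmail.com ",   # trailing space, duplicate of a later row
    "info@gmail.com",    # role address
    "jane@gmail",        # missing TLD
    "jane@gmail.com",
    "bo@yahoo.com",      # wrong domain
]

clean = run_chain(dirty, [FORMAT_OK, NOT_ROLE, GMAIL])
print(clean)
```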
Why regex (and not Excel filters)
Three reasons regex wins for ad-hoc email cleaning:
- Composable. Five filters become one regex with alternation, or chained extractor passes.
- Reproducible. Save the pattern, paste it next time. Excel filter dialogs reset between sessions.
- Local. Pasted into a browser tab, never uploaded. CRM exports often contain addresses you can’t legally upload to a SaaS — this matters.
The downside: regex has a learning curve. The five recipes above cover ~80% of email-filtering jobs; bookmark this page or copy the patterns into a personal snippets file.
Equivalents in other tools
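| Job | grep / awk | Python |
|---|---|---|
| By TLD .edu | grep -E '@[[:alnum:]_.-]+\.edu\b' file | [l for l in lines if re.search(r'@[\w.-]+\.edu\b', l)] |
| Drop role addrs | grep -Ev '^(info\|support\|sales\|...)@' file | [l for l in lines if not re.match(r'^(info\|...)@', l)] |
| Format check | grep -E '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$' file | EMAIL_RE.match(l) |

(Pipes inside table cells are escaped as \| for Markdown; type a plain | in the actual command. The grep row for .edu uses [[:alnum:]_] because \w is not recognised inside a bracket expression in grep -E.)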
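The same patterns carry over to all three contexts with small adjustments. In grep -E (POSIX ERE), \w is not available inside a bracket expression, so write [[:alnum:]_] there, and \b is a GNU extension rather than part of POSIX. Recipe 3’s lookahead isn’t supported by standard grep -E either: invert a plain match with -v, or use
grep -P or pcregrep. Both grep and Python match case-sensitively by default, so add -i or re.IGNORECASE to mirror the tool’s default.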
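Where lookahead is unavailable, inverting a plain prefix match gives the same result for lines that contain an address. The Python sketch below compares the two approaches; the three-item role list is shortened for illustration (the full list is in Recipe 3), and the sample lines are invented.

```python
import re

# Shortened role list, for illustration only.
ROLE_PREFIX = re.compile(r"^(info|support|sales)@", re.IGNORECASE)
ROLE_LOOKAHEAD = re.compile(r"^(?!(info|support|sales)@)\S+@\S+", re.IGNORECASE)

# Hypothetical sample data; every line contains an address.
lines = ["info@acme.io", "Sales@acme.io", "jane@acme.io", "infographic@acme.io"]

inverted = [l for l in lines if not ROLE_PREFIX.match(l)]  # grep -Ev style
lookahead = [l for l in lines if ROLE_LOOKAHEAD.match(l)]  # Recipe 3 style
print(inverted == lookahead, inverted)
```

The two only diverge on lines with no @ at all: the inverted match keeps them, while the lookahead pattern drops them because of its \S+@\S+ tail.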
Frequently asked questions
Can the regex extractor handle a million-row list?
Yes, as long as the input stays under the 80 MB hard ceiling per run. Our own testing goes up to 800K rows on the M3 MBA with the Recipe 4 format-validity regex; the tool stays responsive because anything above 100K lines automatically routes to a Web Worker thread. The output downloads as a .txt file rather than rendering inline.
How is this different from Excel’s regex filter (newer Office 365)?
Excel for Microsoft 365 gained REGEXTEST, REGEXEXTRACT, and REGEXREPLACE functions in 2024. They use the PCRE2 regex flavour rather than JavaScript’s and are excellent for in-spreadsheet workflows. The trade-off: you need a recent Office build, the formulas recalculate as the sheet changes, and your data lives in Excel cells, which means a 100K-row dataset starts to lag. We use Excel regex for one-off cell-level matching and DedupeLines regex for “clean this 50K-line list and download the output” workflows.
Does the regex extractor work with internationalised (IDN) email addresses?
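Recipe 4 as written doesn’t: the [A-Za-z] character class excludes accented and non-Latin characters. The IDN-aware version uses Unicode property escapes: ^[\p{L}\p{N}._%+-]+@[\p{L}\p{N}.-]+\.\p{L}{2,}$ (\p{L}\p{N} replaces \w because JavaScript’s \w does not cover accented or non-Latin letters). Recipes 2 and 3 work as-is with IDN addresses because they don’t restrict character classes. Recipes 1 and 5 use [\w.-] for the domain, so they miss domains written with non-ASCII letters unless you widen that class to [\p{L}\p{N}.-]; domains stored in Punycode (xn--...) form match as-is.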
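For readers checking IDN handling in Python rather than in the browser: the standard re module has no \p{L}, so the sketch below substitutes [^\W\d_] (any word character that is not a digit or underscore) for a Unicode letter and relies on Python's \w being Unicode-aware for str patterns. This is a stand-in for the idea, not the tool's pattern, and the sample address is invented.

```python
import re

ASCII_SHAPE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")
# Python port: \w is Unicode-aware here; [^\W\d_] approximates \p{L}.
IDN_SHAPE = re.compile(r"^[\w.+%-]+@[\w.-]+\.[^\W\d_]{2,}$")

# Hypothetical sample data.
sample = ["müller@bücher.de", "jane@example.com", "jane@example"]

ascii_hits = [s for s in sample if ASCII_SHAPE.match(s)]
idn_hits = [s for s in sample if IDN_SHAPE.match(s)]
print(ascii_hits, idn_hits)
```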
What about quoted local-parts like "hello world"@example.com?
RFC 5322 technically allows quoted strings, dots in unusual positions, comments, and other oddities in the local-part. Our recipes don’t match those. In practice we’ve never seen a legitimate marketing-list email with a quoted local-part — this corner of the spec is mostly historical. If your data really does contain them, we’d recommend pre-cleaning with a script rather than trying to express the full RFC syntax as a single regex.
Why is the regex case-insensitive by default?
The email spec (RFC 5321 §2.4) says the domain part is case-insensitive; the local-part is technically case-sensitive but every major mail provider treats it as insensitive in practice. So matching @Gmail.COM and @gmail.com as equivalent is what users expect. Switch to case-sensitive matching only if you have a specific reason, e.g. you’re comparing exact spellings as they appeared in a sign-up form.
Related guides
- The blank-line trap — if your regex output has phantom rows, Unicode whitespace inside addresses is the usual cause.
- Trim before dedupe: workflow order — how to dedupe the filtered output without trailing-space surprises.
- Keep the last occurrence — the chained-tool pattern for last-write-wins on filtered lists.
Paste your dirty email list, write one of the patterns above, click Run. 50K rows filtered in under a second — no upload, no install.