What a regex actually is
A regular expression is a tiny domain-specific language for describing the shape of a string. You write a pattern; an engine compiles it into a state machine; you feed strings to the engine and it tells you whether they match, where they match, and what sub-parts (capture groups) it pulled out.
The reason regex feels arcane the first few times is that the syntax was designed for terseness on 1970s terminals, not for readability in 2026. The good news is the surface area is small: maybe two dozen metacharacters and a handful of constructs.
The bad news is that the same compact syntax means a one-character mistake silently changes meaning. 'a.b' matches 'aab', 'axb' and 'a b' as well as 'a.b', because an unescaped dot stands for any character. 'a\.b' matches only the literal text 'a.b'. Always test against both positive and negative examples before shipping a pattern.
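One way to make "test against positive and negative examples" routine is to keep both lists next to the pattern. The sketch below is illustrative only: the helper name checkPattern and the sample strings are assumptions, and the patterns are anchored with ^ and $ so each sample is judged as a whole string.

```javascript
// Hypothetical helper: returns a list of samples that behaved unexpectedly.
function checkPattern(re, shouldMatch, shouldNotMatch) {
  const failures = [];
  for (const s of shouldMatch) {
    if (!re.test(s)) failures.push(`expected match: ${JSON.stringify(s)}`);
  }
  for (const s of shouldNotMatch) {
    if (re.test(s)) failures.push(`unexpected match: ${JSON.stringify(s)}`);
  }
  return failures;
}

// Unescaped dot: any single character is accepted between a and b.
const looseFailures = checkPattern(/^a.b$/, ["a.b", "axb", "a b"], ["ab"]);

// Escaped dot: only a literal full stop is accepted.
const strictFailures = checkPattern(/^a\.b$/, ["a.b"], ["axb", "a b", "aab"]);

console.log(looseFailures, strictFailures);
```

An empty array from each call means every sample behaved as listed.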
The core syntax in one pass
Anchors. ^ matches the start of the string (or line, in multiline mode), $ matches the end. \b is a word boundary.
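A short JavaScript sketch of how the anchors behave, using made-up sample strings:

```javascript
const text = "first line\nsecond line";

// Without the m flag, ^ and $ anchor to the whole string.
const wholeString = /^second/.test(text);

// With the m flag, ^ and $ also match at line breaks.
const perLine = /^second/m.test(text);

// \b sits between a word character and a non-word character.
const wholeWord = /\bcat\b/.test("a cat sat");
const insideWord = /\bcat\b/.test("concatenate");

console.log({ wholeString, perLine, wholeWord, insideWord });
```

Here wholeString and insideWord come out false, while perLine and wholeWord come out true.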
Character classes. [abc] matches one of a, b, c. [^abc] matches anything except those. [a-z] is a range. The shorthands \d (digit), \w (word char), \s (whitespace) and their uppercase negations cover most cases. The dot . matches any character except newline.
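The classes above, tried against a few illustrative strings in JavaScript. The s (dotAll) flag in the last line is not covered in the text; it is included here only to show how the newline exception can be switched off.

```javascript
// [abc]: one character from the set.
const vowels = "regex".match(/[aeiou]/g);

// [^abc]: one character that is not in the set.
const nonVowels = "regex".match(/[^aeiou]/g);

// [a-z]: a range (lowercase only unless the i flag is set).
const lowercase = "a1B2".match(/[a-z]/g);

// Shorthands: \d digit, \w word character, \S non-whitespace.
const numbers = "Order 66, aisle 7".match(/\d+/g);
const words = "snake_case-42".match(/\w+/g);
const visible = "a b\tc".match(/\S/g);

// The dot stops at a newline by default.
const untilNewline = "ab\ncd".match(/.+/)[0];

// With the s flag the dot also matches newlines.
const acrossNewline = "ab\ncd".match(/.+/s)[0];

console.log(vowels, nonVowels, lowercase, numbers, words, visible);
```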
Quantifiers. * means zero or more, + means one or more, ? means zero or one, {3} means exactly three, {2,5} means between two and five. By default these are greedy — they match as much as possible. Appending ? makes them lazy.
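A minimal JavaScript sketch of the quantifiers, with sample strings chosen for illustration:

```javascript
// {3}: exactly three digits.
const exactlyThree = /^\d{3}$/;
const threeOk = exactlyThree.test("123");
const fourRejected = exactlyThree.test("1234");

// {2,5}: between two and five repetitions.
const twoToFive = /^a{2,5}$/;
const oneRejected = twoToFive.test("a");
const fiveOk = twoToFive.test("aaaaa");
const sixRejected = twoToFive.test("aaaaaa");

// ?: the u is optional, so both spellings pass.
const optionalU = /^colou?r$/;
const bothSpellings = optionalU.test("color") && optionalU.test("colour");

// Greedy + takes the whole run; lazy +? takes as little as it can.
const greedy = "aaaa".match(/a+/)[0];
const lazy = "aaaa".match(/a+?/)[0];

console.log({ threeOk, fourRejected, oneRejected, fiveOk, sixRejected, bothSpellings, greedy, lazy });
```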
Groups. (abc) is a capturing group. (?:abc) is a non-capturing group, used when you only need the grouping for quantifier scope. (?<name>abc) is a named group; in JavaScript its value is retrieved as match.groups.name.
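The three group types side by side in JavaScript. The date pattern and the sample sentence are made up for the example.

```javascript
// Named groups: readable access through match.groups.
const datePattern = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/;
const dateMatch = "Released 2024-03-15.".match(datePattern);
const year = dateMatch.groups.year;
const month = dateMatch.groups.month;

// Named groups are still numbered, so index access works too.
const firstGroup = dateMatch[1];

// Non-capturing group: the + applies to "ab" as a unit, but nothing is captured.
const repeated = "ababab!".match(/(?:ab)+/);
const matchedText = repeated[0];
const captureCount = repeated.length - 1;

console.log({ year, month, firstGroup, matchedText, captureCount });
```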
Lookaround. (?=foo) is a positive lookahead — 'followed by foo', but foo isn't consumed. (?!foo) is negative lookahead. (?<=foo) and (?<!foo) are the lookbehind versions.
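A small JavaScript sketch of all four lookarounds. The strings are illustrative, and lookbehind needs an engine with ES2018 support.

```javascript
// Positive lookahead: digits followed by "px"; the "px" is not part of the match.
const pixels = "width: 100px".match(/\d+(?=px)/)[0];

// Negative lookahead: "foo" only when "bar" does not follow.
const fooThenBar = /foo(?!bar)/.test("foobar");
const fooThenBaz = /foo(?!bar)/.test("foobaz");

// Positive lookbehind: digits preceded by a dollar sign.
const price = "cost: $42".match(/(?<=\$)\d+/)[0];

// Negative lookbehind: digit runs not preceded by a dollar sign or another digit.
const nonPrices = "cost: $42 for 7 items".match(/(?<![$\d])\d+/g);

console.log({ pixels, fooThenBar, fooThenBaz, price, nonPrices });
```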
ECMAScript vs PCRE — the flavour fault line
Regex isn't one language; it's a family of dialects with subtly different syntax and features. The two you'll meet most often are ECMAScript (JavaScript's built-in RegExp) and PCRE (Perl Compatible Regular Expressions, used by PHP and by tools such as grep -P, and close in spirit to the Perl-style syntax of Python's re module, which is a separate implementation).
Differences that bite: PCRE supports atomic groups (?>...) and possessive quantifiers (a++) to prevent backtracking; ECMAScript's built-in RegExp does not. Lookbehind was an ECMAScript 2018 addition, so older engines reject it as a syntax error. PCRE's recursion ((?R)) and subroutine calls ((?1)) are not in ECMAScript.
The pragmatic rule: write your regex in the flavour of the engine that will run it, and test it there. The SnapToolz Regex Tester runs in the browser, so what you see is exactly what ECMAScript will do.
The pitfalls everyone hits
Greedy vs lazy. Given the string '<b>hi</b><b>there</b>', the pattern <b>.*</b> matches the entire string — the .* is greedy and gobbles up everything between the first <b> and the last </b>. The fix is the lazy quantifier: <b>.*?</b> stops at the first </b>.
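The same example run in JavaScript. The third pattern, a negated character class, is an alternative not mentioned above; it is shown only as one more way to stop at the nearest closing tag, and it assumes the bold text contains no < character.

```javascript
const html = "<b>hi</b><b>there</b>";

// Greedy: .* runs to the last </b> in the string.
const greedyMatch = html.match(/<b>.*<\/b>/)[0];

// Lazy: .*? stops at the first </b>, so each element matches separately.
const lazyMatches = html.match(/<b>.*?<\/b>/g);

// Alternative: forbid "<" inside the match instead of relying on laziness.
const negatedMatches = html.match(/<b>[^<]*<\/b>/g);

console.log(greedyMatch, lazyMatches, negatedMatches);
```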
Catastrophic backtracking. Patterns like (a+)+b applied to 'aaaaaaaaaaaaaaaaX' can take exponential time, because the engine tries every possible way to split the run of a's between the inner and outer + before it gives up. Cures: rewrite to avoid nested quantifiers, or use a linear-time engine like Google's RE2.
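A sketch of the "rewrite to avoid nested quantifiers" cure in JavaScript. The nested pattern is only run on short strings here. The safeTest helper and its MAX_INPUT_LENGTH limit are hypothetical additions that show a length check in front of the engine; the limit is an arbitrary placeholder.

```javascript
// Nested quantifiers: many ways to split a run of a's between the inner and outer +.
const risky = /(a+)+b/;

// Same set of matching strings with a single quantifier: one way to consume the run.
const safe = /a+b/;

// On short inputs the two patterns agree.
const samples = ["ab", "aaab", "b", "aaaX", "xaab"];
const agree = samples.every((s) => risky.test(s) === safe.test(s));

// Hypothetical guard: refuse oversized input before the engine sees it.
const MAX_INPUT_LENGTH = 1000; // assumed limit, tune per use case
function safeTest(re, input) {
  if (input.length > MAX_INPUT_LENGTH) return false;
  return re.test(input);
}

const shortInput = safeTest(safe, "aaaaaaaaaaaaaaaaX");
const oversizedInput = safeTest(safe, "a".repeat(5000) + "b");

console.log({ agree, shortInput, oversizedInput });
```

The guard rejects the oversized string even though it would match, so the limit has to sit above any legitimate input length.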
Validating email with a regex. The RFC 5322 grammar is famously not a regular language. The right answer is: a forgiving pattern that catches typos (something@something.something), then send a confirmation email. The regex is a UX hint; the round-trip is the validation.
Parsing HTML or JSON with a regex. Don't. HTML can nest arbitrarily; regex cannot count. Use a real parser.
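For JSON the real parser is already built into JavaScript. A minimal sketch, with a made-up payload and a hypothetical tryParseJson helper that returns null instead of throwing:

```javascript
const payload = '{"user": {"name": "Ada", "tags": ["admin", "dev"]}}';

// A pattern such as /"name": "(.*?)"/ works only until spacing, escaping or nesting changes.
// JSON.parse handles all of those and gives back ordinary objects and arrays.
const data = JSON.parse(payload);
const userName = data.user.name;
const tagCount = data.user.tags.length;

// Hypothetical helper: null for invalid input rather than an exception.
function tryParseJson(text) {
  try {
    return JSON.parse(text);
  } catch (err) {
    return null;
  }
}

const broken = tryParseJson('{"user": ');

console.log({ userName, tagCount, broken });
```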
12 patterns you'll actually reuse
- Email (forgiving): ^[^@\s]+@[^@\s]+\.[^@\s]+$
- URL (loose): ^(https?):\/\/([^\/\s]+)(\/[^\s]*)?$
- Slugify: /[^a-z0-9]+/gi → replace with -, then trim leading/trailing -.
- Simple CSV row split (ignores commas inside double-quoted fields; no embedded newlines): /,(?=(?:[^"]*"[^"]*")*[^"]*$)/
- ISO 8601 timestamp: ^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:?\d{2})?$
- Semantic version (prefix match; pre-release and build suffixes are not checked): ^(\d+)\.(\d+)\.(\d+)
- IPv4 address (loose): ^(\d{1,3}\.){3}\d{1,3}$
- UUID v4 (add the i flag to accept uppercase): ^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$
- Hex colour (add the i flag to accept uppercase): ^#?([a-f0-9]{6}|[a-f0-9]{3})$
- Whitespace collapse: /\s+/g → replace with single space.
- Markdown heading capture: ^(#{1,6})\s+(.*)$ (multiline flag).
- Phone number (loose international): ^\+?[\d\s\-()]{7,}$
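A few of the patterns above wired into JavaScript, as a sketch. The function name slugify and the sample inputs are illustrative, and the lower-casing step in slugify is an extra assumption that the list does not include.

```javascript
// Slugify: replace runs of non-alphanumerics with "-", then trim leading/trailing "-".
function slugify(title) {
  return title
    .replace(/[^a-z0-9]+/gi, "-")
    .replace(/^-+|-+$/g, "")
    .toLowerCase(); // assumption: slugs are wanted in lowercase
}
const slug = slugify("  Hello, World! ");

// CSV row split: commas inside double-quoted fields are left alone.
const fields = 'a,"b,c",d'.split(/,(?=(?:[^"]*"[^"]*")*[^"]*$)/);

// Semantic version: capture the three numeric parts.
const versionMatch = "1.24.3-beta".match(/^(\d+)\.(\d+)\.(\d+)/);
const [major, minor, patch] = versionMatch.slice(1).map(Number);

// Markdown headings: the m flag makes ^ and $ work per line.
const doc = "# Title\nintro\n## Section";
const headings = [...doc.matchAll(/^(#{1,6})\s+(.*)$/gm)].map((m) => ({
  level: m[1].length,
  text: m[2],
}));

// Forgiving email check: something@something.something, nothing more.
const emailPattern = /^[^@\s]+@[^@\s]+\.[^@\s]+$/;
const emailOk = emailPattern.test("ada@example.com");
const emailMissingDot = emailPattern.test("ada@example");

console.log({ slug, fields, major, minor, patch, headings, emailOk, emailMissingDot });
```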
When regex is the wrong tool
Counting structure. Balanced brackets, nested HTML, recursive grammars — regex can recognise regular languages, and these aren't. Reach for a parser.
Deep JSON or YAML traversal. Use a real parser, then JSONPath / jq for queries.
Natural language. 'Find all proper nouns' is a problem for an NLP library, not a regex.
Anything performance-critical on attacker-controlled input. Assume they will craft a catastrophic-backtracking input. Use RE2 or pre-validate length.
And finally — anything you'd be embarrassed to debug at 3 a.m. If a single regex grows past 60–80 characters, split it into named functions that each match a piece.
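As an illustration of splitting a long pattern into named pieces, the sketch below breaks the ISO 8601 timestamp pattern from the list into three small checks. The function names and the way the string is cut at T and at the offset sign are assumptions made for the example, not the only reasonable split.

```javascript
// Each piece is short enough to read and test on its own.
const isDate = (s) => /^\d{4}-\d{2}-\d{2}$/.test(s);
const isTime = (s) => /^\d{2}:\d{2}:\d{2}(\.\d+)?$/.test(s);
const isOffset = (s) => /^(Z|[+-]\d{2}:?\d{2})?$/.test(s); // empty offset allowed

function isIsoTimestamp(input) {
  const tIndex = input.indexOf("T");
  if (tIndex === -1) return false;

  const date = input.slice(0, tIndex);
  const rest = input.slice(tIndex + 1);

  // The offset, if present, starts at the first Z, + or - after the time.
  const offsetIndex = rest.search(/[Z+-]/);
  const time = offsetIndex === -1 ? rest : rest.slice(0, offsetIndex);
  const offset = offsetIndex === -1 ? "" : rest.slice(offsetIndex);

  return isDate(date) && isTime(time) && isOffset(offset);
}

const utc = isIsoTimestamp("2026-01-31T12:30:00Z");
const withOffset = isIsoTimestamp("2026-01-31T12:30:00.123+01:00");
const missingT = isIsoTimestamp("2026-01-31 12:30:00");

console.log({ utc, withOffset, missingT });
```

When one of the pieces rejects an input, the failing function name says which part of the timestamp was wrong, which a single long pattern cannot do.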
FAQ
- What's the difference between match, search, and findall?
- Terms vary by language. In Python, re.match anchors at the start of the string, re.search finds the first match anywhere, and re.findall returns every non-overlapping match. In JavaScript, str.match(re) returns the first match with its capture groups or, when the g flag is set, an array of every matched substring without capture groups. str.matchAll(re) requires the g flag and returns an iterator of all matches, each with its capture groups.
- Should I write one mega-regex or compose several smaller ones?
- Almost always several smaller ones. Compose by code, not by parentheses. A 200-character regex is unreadable; a function that calls match three times with three named patterns is debuggable.
- Is RE2 actually faster than PCRE?
- RE2 has a worst-case guarantee of linear time in input length; PCRE/ECMAScript can degrade exponentially with backtracking. On typical patterns and typical input, PCRE is often faster. On adversarial input, or any pattern with nested quantifiers, RE2 wins by not freezing.
- How do I test a regex without shipping it to production?
- Use a live tester. Paste your pattern and a representative input — both happy-path and adversarial — and watch matches highlight. The SnapToolz Regex Tester does this locally in the browser with the ECMAScript engine.
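The JavaScript half of the first answer, as a runnable sketch with an illustrative input string:

```javascript
const text = "a1 b22 c333";

// No g flag: first match only, with capture groups and index.
const first = text.match(/([a-z])(\d+)/);
const firstSummary = { whole: first[0], letter: first[1], digits: first[2], index: first.index };

// g flag with match: every matched substring, capture groups dropped.
const allStrings = text.match(/([a-z])(\d+)/g);

// matchAll (g flag required): every match, each with its capture groups.
const detailed = [...text.matchAll(/([a-z])(\d+)/g)].map((m) => ({
  letter: m[1],
  digits: m[2],
  index: m.index,
}));

console.log(firstSummary, allStrings, detailed);
```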