i am a regular expression syntax designed by:
humans
machines
beelzebub, Lord of the flies (but not the one that's currently trending at the moment, that's a good fly brent)



okay, so we know pcre supports recursion, and that means it describes context-free languages, or some other class of language that is a superset of the regular languages, so pcre syntax isn't *regular* expressions. that's not news, that's a semantic choice. I don't have a syntactic beef with that.
would you like some syntactic beef? because I have some. oh goodness me do I have some
You know how some languages have an "everything is a ..." kind of approach? everything is a list, everything is an object, everything is a segfault. that sort of thing. pcre is essentially an experiment in "can we make a language where everything is a special case?"
for example:
So like, you know backreferences exist. Those are famous. The syntax for them is \n, where n is an unsigned integer. That's not *regular*, but it's what pcre has, right? Well, let's see what the documentation says about that.
*MIGHT* be a backreference.
It's only a backreference if you have that many capturing groups! To parse this syntax, the meaning depends on the state you populate during parse.
It's like typedef when parsing C, where the lexical token depends on what's in your symbol table.
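Here's roughly what that looks like in code. This is a simplified sketch of the rule as I read it in the PCRE2 pattern documentation (outside a character class, first digit not 0), not anybody's real parser. The function name and the return shape are made up for illustration.

```python
def classify_numeric_escape(digits: str, groups_seen: int):
    """Decide what a backslash followed by `digits` means.

    `digits` is the run of decimal digits after the backslash (first digit
    is not '0'). `groups_seen` is parser state: the number of capture groups
    counted so far. Returns (kind, value, leftover_literal_digits).
    """
    number = int(digits)
    # Assumed rule: small numbers, anything starting with 8 or 9, or a number
    # covered by the capture groups counted so far is a backreference.
    if number < 10 or digits[0] in "89" or number <= groups_seen:
        return ("backreference", number, "")
    # Otherwise read up to three octal digits as a character code; whatever
    # digits remain are plain literal characters.
    octal = ""
    for ch in digits[:3]:
        if ch not in "01234567":
            break
        octal += ch
    return ("octal", int(octal, 8), digits[len(octal):])


# Same two characters, different meaning depending on parser state:
print(classify_numeric_escape("10", groups_seen=9))
print(classify_numeric_escape("10", groups_seen=10))
```

The only input that changes between the two calls is the group count, which is exactly the typedef problem: the token's meaning lives in state outside the token.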
Is your syntax too difficult to remember? add 2× more syntax! parsing this is easy if you're a computer, they're aliases in the lexer. There are a *lot* of aliases in the lexer.
e.g. in our implementation, we have special tokens for the start and end of [xyz] classes:
'[' .. ']' — simple case, only ^ and - change meaning depending on their locations
'[^' .. ']'
'[]' .. ']' — A
'[^]' .. ']' — B
To have a CFG, ] needs a different token type for A and B
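A minimal sketch of how a lexer might tell those four openers apart. The token names here are hypothetical, invented for this example, and not the ones from our implementation. The only trick is trying the longest spelling first.

```python
# Hypothetical token names, longest spelling first so "[^]" wins over "[^".
CLASS_OPENERS = (
    ("[^]", "OPEN_INVERTED_WITH_RBRACKET"),  # case B: "]" is a literal member
    ("[]", "OPEN_WITH_RBRACKET"),            # case A: "]" is a literal member
    ("[^", "OPEN_INVERTED"),
    ("[", "OPEN"),
)


def lex_class_open(pattern: str, pos: int):
    """Return (token_name, next_pos) for the class opener at `pos`."""
    for text, token in CLASS_OPENERS:
        if pattern.startswith(text, pos):
            return token, pos + len(text)
    raise ValueError(f"no character class starts at offset {pos}")


print(lex_class_open("[abc]", 0))
print(lex_class_open("[^]abc]", 0))
```

Because the leading "]" is swallowed by the opener token, the grammar never has to ask whether a given "]" closes the class or belongs to it.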
This is bullshit nuts stuff, just to be clear. If someone asks you "what's the grammar for pcre? can you write it down as BNF please?" the answer is yes, you can write it down as BNF, but only as long as you don't also want the terminals for it.
All attempts to do this are lies.
perl and pcre disagree in a couple of places.
The first sentence here speaks for itself:
> the use of \K within assertions is "not well defined"
Fantastic.
But I'm here for the start of the match being after the end of the match. Great API! \10 out of \10 developer experience.
There are features that everybody agrees should not exist, and \C is one of those. It matches *part* of a UTF-8 sequence (or whatever encoding).
So there's unity here. Unfortunately it does exist. But it's disallowed in lookbehind assertions because it's completely meaningless there.
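To see why matching part of a sequence is a bad idea, here's a stand-in sketch. Python's re module has no \C, so this fakes the effect by running a bytes pattern over UTF-8 encoded text; the helper names are made up for illustration, and the input is assumed to be non-empty.

```python
import re


def first_code_unit(text: str) -> bytes:
    """Match exactly one byte of the UTF-8 encoding of a non-empty `text`."""
    data = text.encode("utf-8")
    return re.match(rb".", data, re.DOTALL).group(0)


def is_valid_utf8(data: bytes) -> bool:
    """True if `data` decodes cleanly as UTF-8 on its own."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


unit = first_code_unit("é")  # "é" is two bytes in UTF-8
print(unit, is_valid_utf8(unit))
```

The match succeeds, but what it hands back is half a character, which is not valid text on its own.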
There's a bunch of other little things that you probably know about, from being impossible to get right when using pcre. I didn't want to talk about that stuff here. Comment about them if you like.
My perspective is just about parsing pcre. I don't write regexps very often.
I guess you can see why.