For serious projects, I typically find myself resorting to a hand-written parser (usually recursive descent), because it's the only way I can really get excellent, contextual error reporting and recovery. Parser generators are computer science marvels, but for the most part they're too hard to customize well enough to produce a sensible error for the user, with line numbers, token-stream provenance, and suggested recovery information. I can't imagine the difficulty of trying to get good messages for something as large and pervasive as, say, C.
I think that with enough work, we can adapt and evolve parser generators to mesh well with better error reporting, and give the programmer more options and control, but it’ll take some elbow grease and probably breaking backwards compatibility with the production syntax of these parser generator systems. There’s also a risk, of course, that if there’s too much customization, you sort of lose the value of having a parser generator DSL in the first place.
One of my favorite PLT-related books is Programming Language Pragmatics, which spends a non-trivial section near the beginning on just this issue. Among other things, it's very useful to continue a best-effort parse on the remainder after a parse error so you can show more syntax errors rather than merely the first. But accomplishing that is something of a nebulous art.
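The usual trick for that in a hand-written recursive descent parser is panic-mode recovery: record the diagnostic, skip ahead to a synchronization token, and keep going. A minimal sketch of the idea over a made-up toy grammar (statement -> NAME '=' NUMBER ';'); none of this is from any real compiler:

    import re

    # Toy tokenizer; a real lexer would also track line/column for diagnostics.
    TOKEN_RE = re.compile(r"\s*(?:(?P<NUMBER>\d+)|(?P<NAME>[A-Za-z_]\w*)|(?P<OP>[=;]))")

    def tokenize(src):
        tokens, pos = [], 0
        while pos < len(src):
            m = TOKEN_RE.match(src, pos)
            if not m:   # trailing whitespace or an unknown character: stop here
                break
            kind = m.lastgroup
            tokens.append((m.group("OP") if kind == "OP" else kind, m.group(kind)))
            pos = m.end()
        tokens.append(("EOF", ""))
        return tokens

    class Parser:
        def __init__(self, tokens):
            self.toks, self.i, self.errors = tokens, 0, []

        def peek(self):
            return self.toks[self.i][0]

        def expect(self, kind):
            if self.peek() == kind:
                self.i += 1
                return True
            self.errors.append(f"token {self.i}: expected {kind}, got {self.peek()}")
            self.synchronize()
            return False

        def synchronize(self):
            # Panic mode: discard tokens up to and including the next ';' so one
            # mistake doesn't cascade into dozens of follow-on errors.
            while self.peek() not in (";", "EOF"):
                self.i += 1
            if self.peek() == ";":
                self.i += 1

        def parse_statement(self):
            # statement -> NAME '=' NUMBER ';'  (stop at the first error in it)
            for kind in ("NAME", "=", "NUMBER", ";"):
                if not self.expect(kind):
                    return

        def parse_program(self):
            while self.peek() != "EOF":
                self.parse_statement()
            return self.errors

    # Reports the errors in the second and third statements, then keeps going.
    print(Parser(tokenize("x = 1; y = ; 2 = z; w = 3;")).parse_program())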
I agree that generators could address this issue; they simply haven't.
Also, I'm particularly fond of PEGs, because they match the intuitiveness of recursive descent parsers with a generator-suitable formalism (though they have their rough edges as well).
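To illustrate what I mean by that intuitiveness: a PEG's ordered choice ("/") is literally "try the first alternative; if it fails, rewind and try the next", which is exactly how you'd hand-write the rule. A toy sketch with invented rules, not from any real grammar:

    # PEG rules (hypothetical):
    #     expr    <- primary ('+' primary)*
    #     primary <- NUMBER / '(' expr ')'
    # Ordered choice '/' maps directly onto "try alternative 1, else rewind and
    # try alternative 2", which is just how you'd hand-write the function.
    def parse_expr(toks, i):
        node, i = parse_primary(toks, i)
        if node is None:
            return None, i
        while i < len(toks) and toks[i] == "+":
            rhs, j = parse_primary(toks, i + 1)
            if rhs is None:            # failed repetition: rewind and stop
                break
            node, i = ("+", node, rhs), j
        return node, i

    def parse_primary(toks, i):
        if i < len(toks) and toks[i].isdigit():   # alternative 1: NUMBER
            return int(toks[i]), i + 1
        if i < len(toks) and toks[i] == "(":      # alternative 2: '(' expr ')'
            node, j = parse_expr(toks, i + 1)
            if node is not None and j < len(toks) and toks[j] == ")":
                return node, j + 1
        return None, i                            # both alternatives failed

    print(parse_expr("( 1 + 2 ) + 3".split(), 0))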
In my programming language MethodScript (which has a hand-written parser), I fail immediately during lexing, because an error at that stage is a really bad one, but during compilation I only fail immediately on things that are likely to cause a cascade of errors, such as a missing bracket or something.
But yeah, it makes the compiler output way easier to work through when you have everything at once.
I'm always shocked to find major projects that use parser generators. A hand-written recursive descent parser is easier to write, read, understand, and modify. It's almost trivial to translate a language specification's grammar productions into a recursive descent parser.
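To show how mechanical that translation is, take a hypothetical spec fragment (not from any real language) and its hand-written equivalent: one function per nonterminal, one branch per alternative.

    # Spec-style productions (hypothetical):
    #     array    -> '[' elements? ']'
    #     elements -> value (',' value)*
    #     value    -> NUMBER | array
    def parse_array(toks, i):
        assert toks[i] == "[", f"expected '[' at {i}"
        i += 1
        items = []
        if toks[i] != "]":                       # optional elements
            items, i = parse_elements(toks, i)
        assert toks[i] == "]", f"expected ']' at {i}"
        return items, i + 1

    def parse_elements(toks, i):
        value, i = parse_value(toks, i)
        values = [value]
        while toks[i] == ",":                    # (',' value)*
            value, i = parse_value(toks, i + 1)
            values.append(value)
        return values, i

    def parse_value(toks, i):
        if toks[i].isdigit():                    # value -> NUMBER
            return int(toks[i]), i + 1
        return parse_array(toks, i)              # value -> array

    print(parse_array("[ 1 , [ 2 , 3 ] , 4 ]".split(), 0))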
So looking at the list I'm dumbfounded that CPython and Ruby use generators.
My strategy when faced with doing a DSL of some kind is to write an accepter in bison or similar and knock the bugs out of the grammar and basic readability issues. Then go recursive descent with a table-driven precedence-climbing parser for expressions when ready for the real deal.
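For what it's worth, the table-driven precedence-climbing part is small; a minimal sketch with made-up operators and whitespace-separated tokens:

    # One precedence/associativity table drives a single recursive function,
    # instead of one grammar rule (and one function) per precedence level.
    PRECEDENCE = {"+": (1, "left"), "-": (1, "left"),
                  "*": (2, "left"), "/": (2, "left"),
                  "^": (3, "right")}

    def parse_expression(toks, i, min_prec=1):
        lhs, i = parse_atom(toks, i)
        while i < len(toks) and toks[i] in PRECEDENCE and PRECEDENCE[toks[i]][0] >= min_prec:
            op = toks[i]
            prec, assoc = PRECEDENCE[op]
            next_min = prec + 1 if assoc == "left" else prec
            rhs, i = parse_expression(toks, i + 1, next_min)
            lhs = (op, lhs, rhs)
        return lhs, i

    def parse_atom(toks, i):
        if toks[i] == "(":
            node, i = parse_expression(toks, i + 1)
            assert toks[i] == ")", "expected ')'"
            return node, i + 1
        return int(toks[i]), i + 1

    # '^' binds tightest and associates to the right; '+'/'-' bind loosest.
    print(parse_expression("1 + 2 * 3 ^ 2 ^ 2 - 4".split(), 0)[0])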
This is, I think, a very good strategy for many reasons. Not only is it a cheap way to work out the design of the language and to verify the correctness of your implementation; it's also a good fallback: if your (more expensive, more expressive) implementation is not ready in time, you can fall back to integrating the generated implementation to meet an important deadline or whatnot.
It's harder to ensure that the grammar doesn't have unintended ambiguities when you're not using a parser generator. That doesn't matter as much when you're "just" developing another implementation of an existing, fully specced language, but for cases like the examples you cite no such spec exists.
That issue is the primary reason why postgres continues to use bison. The SQL standard frequently introduces potential parsing ambiguities and leaves a large portion of the things necessary for a functioning RDBMS unspecced.
In my experience (with both hand-written parsers and generators, e.g. ANTLR), that's not the case.
Hand-written recursive descent parsers are simple to write, debug, maintain, and extend, while with generators I always had to fight the tool at some point, and that always ended in frustration.
We've just had different experiences then. It's funny you mention ANTLR because it was in particular very slow last time I tried it (just a couple years ago).
As for fighting the tool and simplicity, sure, that's beside my point.
I'd really like to know why people choose one parsing algorithm over another.
I implemented an Earley parser because, from what I read on Wikipedia, it seemed to be more advanced:
"Earley parsers are appealing because they can parse all context-free languages, unlike LR parsers and LL parsers, which are more typically used in compilers but which can only handle restricted classes of languages."
However, I seldom see languages use Earley parsers. There must be a reason, but I've never seen anybody explain why they chose one algorithm over another.
Earley and similar algorithms give you a parse forest rather than a parse tree. For a programming language you don't want your grammar to be ambiguous. When Earley gives you a bunch of alternative parse trees, how do you disambiguate?
If you are parsing incorrect programs, you want resilience in the parser, including considering alternate parsings and generating parses for sections of the file.
Consider the problem of parsing a C program that has NOT been put through the preprocessor. You want to be able to do this for structure editors and source-to-source transformation systems.
Of course parsing without cpp is, in general, hopeless. But for something like a structure-aware editor you don't need to parse it perfectly, just "good enough". Ditto for a source-to-source transformation system.
It's not that the grammar is ambiguous, it's that the thing you're parsing may not even be in the language generated by the grammar. But you want to parse it as well as possible, and there may be more than one way to do that.
Wow, absolutely not. If I'm parsing a programming language I want to basically stop and bail from the current context as soon as possible. There is basically nothing to gain by continuing.
If the shit I'm parsing isn't in the language there is no reason to continue. It's invalid.
If you are writing a code editor you might want to continue parsing it. Or maybe you detected an error, but you want to report more than one error at a time for a better programmer experience. There are lots of use cases for partially parsing a computer program.
I had already explained why one might want to do this:
"Consider the problem of parsing a C program that has NOT been put through the preprocessor. You want to be able to do this for structure editors and source-to-source transformation systems."
One cannot in general parse un-cpped C/C++. But that doesn't mean it's useless to do as well as one can, even if it cannot be parsed exactly. Parsing is for more than just compilers.
I've written parsers for context-sensitive grammars, and as it turns out, this is not a desirable feature of a language. The reason you typically see simple parsers is because simple grammars are usually unambiguous and context-free, which is more convenient for humans to understand and internalize.
After all, when you're writing software, you pretend to be the compiler to some extent.
Natural languages, like English, are a good example of something which humans struggle with because they are complex, ambiguous, and often require context. Sure, it's extremely expressive, but that is the sharpest double-edged sword in programming languages.
Most languages have a simple, unambiguous syntax, so LL or LR is fine. LL or LR is almost certainly faster than Earley, since in general more restricted = faster.
As the above commenter mentioned, most language designers make hand-rolled parsers. Making a hand-rolled LL or LR parser is easier too.
In general, most people think advanced = bad; they want the easiest solution which gets the job done well.
This is probably known somewhere, but something I noticed is that a way to think about Earley parsing is that, when you create an LR(1) shift-reduce parser, rather than treating shift-reduce or reduce-reduce conflicts as errors in the grammar, you instead reduce nondeterministically (i.e., your state machine has a set of states rather than requiring there to be a single state). This also ends up solving the left-recursion problem in an efficient way.
This is certainly slower than a deterministic LR(1) parser, but maybe if you're clever about the implementation you can efficiently handle local ambiguities in a grammar. Large-scale ambiguity is bad since you don't want small changes to have action-at-a-distance, but you could in principle have a strict LR(1) grammar that contains Earley sub-grammars for expressions.
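That "set of states" view is easy to see in the algorithm itself: each chart position holds a set of dotted items, and predict/scan/complete just grow those sets. A minimal recognizer sketch over a toy expression grammar (nothing here is taken from a real implementation):

    # chart[k] is the set of dotted items (lhs, rhs, dot, origin) that are live
    # after reading k tokens -- i.e. the "set of states" at that position.
    GRAMMAR = {
        "E": [("E", "+", "T"), ("T",)],
        "T": [("T", "*", "F"), ("F",)],
        "F": [("(", "E", ")"), ("NUM",)],
    }
    START = "E"

    def recognize(tokens):
        chart = [set() for _ in range(len(tokens) + 1)]
        chart[0] = {(START, rhs, 0, 0) for rhs in GRAMMAR[START]}
        for k in range(len(tokens) + 1):
            worklist = list(chart[k])
            while worklist:
                lhs, rhs, dot, origin = worklist.pop()
                nxt = rhs[dot] if dot < len(rhs) else None
                if nxt in GRAMMAR:                                  # predict
                    for prod in GRAMMAR[nxt]:
                        new = (nxt, prod, 0, k)
                        if new not in chart[k]:
                            chart[k].add(new)
                            worklist.append(new)
                elif nxt is not None:                               # scan
                    if k < len(tokens) and tokens[k] == nxt:
                        chart[k + 1].add((lhs, rhs, dot + 1, origin))
                else:                                               # complete
                    for plhs, prhs, pdot, porig in list(chart[origin]):
                        if pdot < len(prhs) and prhs[pdot] == lhs:
                            new = (plhs, prhs, pdot + 1, porig)
                            if new not in chart[k]:
                                chart[k].add(new)
                                worklist.append(new)
        return any(lhs == START and dot == len(rhs) and origin == 0
                   for lhs, rhs, dot, origin in chart[len(tokens)])

    print(recognize(["NUM", "+", "NUM", "*", "NUM"]))   # True
    print(recognize(["NUM", "+", "*"]))                 # False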
> LL or LR is almost certainly faster than Earley, since in general more restricted = faster.
Earley's algorithm should be trivially parallelizable, which might make it a contender for languages with ambiguous grammars (like most real-world languages) where people are hand-rolling parsers. I haven't tried, since I have no real need, but looking at the steps I can't see a reason it couldn't do its work across multiple threads.
Honestly, other than JavaScript 1.1 I can't think of a popular language which has an unambiguous syntax and I really like playing with language grammars for some odd reason -- probably wrong though...
From what I infer from articles like https://jeffreykegler.github.io/personal/timeline_v3 , the original Earley paper had a bug, wasn't a good fit for 1960s hardware, and had poor performance for some types of grammars.
By 1991, "Most researchers see the Parsing Problem as "solved" -- a closed issue. Earley parsing is almost forgotten, and Leo's discovery is ignored. Two decades will pass before anyone attempts a practical implementation of Leo 1991."
It takes Aycock and Horspool's work in 2002 and Kegler's work in 2010 in Marpa to have a "practical implementation" (quoting that link).
(I quote that also because Aycock distributed SPARK, an Earley parser, which was included as part of the Python distribution, in the Parser/ subdirectory, and a couple of people here on HN report having used it.)
> I quote that also because Aycock distributed SPARK, an Earley parser, which was included as part of the Python distribution, in the Parser/ subdirectory, and a couple of people here on HN report having used it.
That one is really the only Earley parser I've found used in the wild (I don't know what Marpa is used for), and unfortunately it is mostly unhackable because they did some serious optimization voodoo on it. It was replaced by a hand-written recursive descent parser a while back because nobody in the world could figure out how it works[0] -- which is kind of strange, since ASDL is super simple to parse and the generator that used SPARK was meant to check files into source control, but whatever.
It's easy to play around with, but not a great source if you want to see how an Earley parser is put together. There are also some bugs with parser actions on duplicate rules not working properly that were pretty easy to fix, but Python pulled it out of the source tree, so there's no upstream to send patches to?
You are one of the "couple of people" I was referring to. :)
I know SPARK's docstring use influenced PLY.
PLY doesn't use Earley, but "Earley" does come up in the show notes of an interview with Beazley, PLY's author, at https://www.pythonpodcast.com/episode-95-parsing-and-parsers... . No transcript, and I'm not going to listen to it just to figure out the context.
Lark is amazing... But it's also one of the best LR parsers out there and I would guess that mode is used a lot more than the Earley mode.
Either way, I have never used a better parser generator. It has the best usability and incredible performance when you consider it is written in pure Python.
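For anyone who hasn't tried it, the LALR mode is a one-argument switch (Lark defaults to Earley); a small illustrative example, not from any particular project:

    from lark import Lark

    grammar = r"""
        start: expr
        ?expr: expr "+" term   -> add
             | term
        ?term: term "*" atom   -> mul
             | atom
        ?atom: NUMBER
             | "(" expr ")"

        %import common.NUMBER
        %import common.WS
        %ignore WS
    """

    # parser="lalr" selects the LALR(1) backend; the default is "earley".
    parser = Lark(grammar, parser="lalr")
    print(parser.parse("1 + 2 * (3 + 4)").pretty())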
jq uses bison, and it takes a startling amount of work to diagnose parsing problems so as to give helpful error messages. Last time I checked, the issues were full of new users stopped by something an appropriate error message would resolve.