For serious projects, I typically find myself resorting to a hand-written parser (usually recursive descent), because it's the only way I can really get excellent, contextual error reporting and recovery. Parser generators are computer science marvels, but for the most part they're too hard to customize well enough to produce a sensible error for the user, with line numbers, token-stream provenance, and suggested recovery information. I can't imagine the difficulty of trying to get good messages for something as large and pervasive as, say, C.
I think that with enough work, we can adapt and evolve parser generators to mesh well with better error reporting, and give the programmer more options and control, but it’ll take some elbow grease and probably breaking backwards compatibility with the production syntax of these parser generator systems. There’s also a risk, of course, that if there’s too much customization, you sort of lose the value of having a parser generator DSL in the first place.
One of my favorite PLT-related books is Programming Language Pragmatics, which spends a non-trivial section near the beginning on just this issue. Among other things, it's very useful to continue a best-effort parse on the remainder after a parse error so you can show more syntax errors rather than merely the first. But accomplishing that is something of a nebulous art.
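The usual trick for that in a hand-written recursive descent parser is panic-mode recovery: record the diagnostic, skip ahead to a synchronization token, and keep going. A minimal sketch of the idea over a made-up toy grammar (statement -> NAME '=' NUMBER ';'); none of this is from any real compiler:

    import re

    # Toy tokenizer; a real lexer would also track line/column for diagnostics.
    TOKEN_RE = re.compile(r"\s*(?:(?P<NUMBER>\d+)|(?P<NAME>[A-Za-z_]\w*)|(?P<OP>[=;]))")

    def tokenize(src):
        tokens, pos = [], 0
        while pos < len(src):
            m = TOKEN_RE.match(src, pos)
            if not m:   # trailing whitespace or an unknown character: stop here
                break
            kind = m.lastgroup
            tokens.append((m.group("OP") if kind == "OP" else kind, m.group(kind)))
            pos = m.end()
        tokens.append(("EOF", ""))
        return tokens

    class Parser:
        def __init__(self, tokens):
            self.toks, self.i, self.errors = tokens, 0, []

        def peek(self):
            return self.toks[self.i][0]

        def expect(self, kind):
            if self.peek() == kind:
                self.i += 1
                return True
            self.errors.append(f"token {self.i}: expected {kind}, got {self.peek()}")
            self.synchronize()
            return False

        def synchronize(self):
            # Panic mode: discard tokens up to and including the next ';' so one
            # mistake doesn't cascade into dozens of follow-on errors.
            while self.peek() not in (";", "EOF"):
                self.i += 1
            if self.peek() == ";":
                self.i += 1

        def parse_statement(self):
            # statement -> NAME '=' NUMBER ';'  (stop at the first error in it)
            for kind in ("NAME", "=", "NUMBER", ";"):
                if not self.expect(kind):
                    return

        def parse_program(self):
            while self.peek() != "EOF":
                self.parse_statement()
            return self.errors

    # Reports the errors in the second and third statements, then keeps going.
    print(Parser(tokenize("x = 1; y = ; 2 = z; w = 3;")).parse_program())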
I agree that generators could address this issue; they simply haven't.
Also, I'm particularly fond of PEGs, because they match the intuitiveness of recursive descent parsers with a generator-suitable formalism (though they have their rough edges as well).
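To illustrate what I mean by that intuitiveness: a PEG's ordered choice ("/") is literally "try the first alternative; if it fails, rewind and try the next", which is exactly how you'd hand-write the rule. A toy sketch with invented rules, not from any real grammar:

    # PEG rules (hypothetical):
    #     expr    <- primary ('+' primary)*
    #     primary <- NUMBER / '(' expr ')'
    # Ordered choice '/' maps directly onto "try alternative 1, else rewind and
    # try alternative 2", which is just how you'd hand-write the function.
    def parse_expr(toks, i):
        node, i = parse_primary(toks, i)
        if node is None:
            return None, i
        while i < len(toks) and toks[i] == "+":
            rhs, j = parse_primary(toks, i + 1)
            if rhs is None:            # failed repetition: rewind and stop
                break
            node, i = ("+", node, rhs), j
        return node, i

    def parse_primary(toks, i):
        if i < len(toks) and toks[i].isdigit():   # alternative 1: NUMBER
            return int(toks[i]), i + 1
        if i < len(toks) and toks[i] == "(":      # alternative 2: '(' expr ')'
            node, j = parse_expr(toks, i + 1)
            if node is not None and j < len(toks) and toks[j] == ")":
                return node, j + 1
        return None, i                            # both alternatives failed

    print(parse_expr("( 1 + 2 ) + 3".split(), 0))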
In my programming language MethodScript (which has a hand-written parser), I fail immediately during lexing, because an error at that stage is a really bad one, but during compilation I only fail immediately on things that are likely to cause a cascade of errors, such as a missing bracket or something.
But yeah, it makes the compiler output way easier to work through when you have everything at once.
I'm always shocked to find major projects that use parser generators. A hand-written recursive descent parser is easier to write, read, understand, and modify. It's almost trivial to translate a language specification's grammar productions into a recursive descent parser.
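To show how mechanical that translation is, take a hypothetical spec fragment (not from any real language) and its hand-written equivalent: one function per nonterminal, one branch per alternative.

    # Spec-style productions (hypothetical):
    #     array    -> '[' elements? ']'
    #     elements -> value (',' value)*
    #     value    -> NUMBER | array
    def parse_array(toks, i):
        assert toks[i] == "[", f"expected '[' at {i}"
        i += 1
        items = []
        if toks[i] != "]":                       # optional elements
            items, i = parse_elements(toks, i)
        assert toks[i] == "]", f"expected ']' at {i}"
        return items, i + 1

    def parse_elements(toks, i):
        value, i = parse_value(toks, i)
        values = [value]
        while toks[i] == ",":                    # (',' value)*
            value, i = parse_value(toks, i + 1)
            values.append(value)
        return values, i

    def parse_value(toks, i):
        if toks[i].isdigit():                    # value -> NUMBER
            return int(toks[i]), i + 1
        return parse_array(toks, i)              # value -> array

    print(parse_array("[ 1 , [ 2 , 3 ] , 4 ]".split(), 0))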
So looking at the list I'm dumbfounded that CPython and Ruby use generators.
My strategy when faced with doing a DSL of some kind is to write an accepter in bison or similar and knock the bugs out of the grammar and basic readability issues. Then go recursive descent with a table-driven precedence-climbing parser for expressions when ready for the real deal.
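For what it's worth, the table-driven precedence-climbing part is small; a minimal sketch with made-up operators and whitespace-separated tokens:

    # One precedence/associativity table drives a single recursive function,
    # instead of one grammar rule (and one function) per precedence level.
    PRECEDENCE = {"+": (1, "left"), "-": (1, "left"),
                  "*": (2, "left"), "/": (2, "left"),
                  "^": (3, "right")}

    def parse_expression(toks, i, min_prec=1):
        lhs, i = parse_atom(toks, i)
        while i < len(toks) and toks[i] in PRECEDENCE and PRECEDENCE[toks[i]][0] >= min_prec:
            op = toks[i]
            prec, assoc = PRECEDENCE[op]
            next_min = prec + 1 if assoc == "left" else prec
            rhs, i = parse_expression(toks, i + 1, next_min)
            lhs = (op, lhs, rhs)
        return lhs, i

    def parse_atom(toks, i):
        if toks[i] == "(":
            node, i = parse_expression(toks, i + 1)
            assert toks[i] == ")", "expected ')'"
            return node, i + 1
        return int(toks[i]), i + 1

    # '^' binds tightest and associates to the right; '+'/'-' bind loosest.
    print(parse_expression("1 + 2 * 3 ^ 2 ^ 2 - 4".split(), 0)[0])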
This is, I think, a very good strategy for many reasons. Not only is it a cheap way to work out the design of the language and to verify the correctness of your implementation; it's also a good fallback: if your (more expensive, more expressive) implementation is not ready in time, you can fall back to integrating the generated implementation to meet an important deadline or whatnot.
It's harder to ensure that the grammar doesn't have unintended ambiguities when you're not using a parser generator. That doesn't matter as much when you're "just" developing another implementation of an existing, fully specced language, but for cases like the examples you cite no such spec exists.
That issue is the primary reason why postgres continues to use bison. The SQL standard frequently introduces potential parsing ambiguities and leaves a large portion of the things necessary for a functioning RDBMS unspecced.
In my experience (with both hand-written parsers and generators, e.g. ANTLR), that's not the case.
Hand-written recursive descent parsers are simple to write, debug, maintain, and extend, while with generators I always had to fight the tool at some point, and that always ended in frustration.
We've just had different experiences then. It's funny you mention ANTLR because it was in particular very slow last time I tried it (just a couple years ago).
As for fighting the tool and simplicity, sure, that's beside my point.
I'd really like to know why people choose one parsing algorithm over another.
I implemented an Earley parser because, from what I read on Wikipedia, it seemed to be more advanced:
"Earley parsers are appealing because they can parse all context-free languages, unlike LR parsers and LL parsers, which are more typically used in compilers but which can only handle restricted classes of languages."
However, I seldom see languages use Earley parsers. There must be a reason, but I've never seen anybody explain why they chose one algorithm over another.
Earley and similar algorithms give you a parse forest rather than a parse tree. For a programming language you don't want your grammar to be ambiguous. When Earley gives you a bunch of alternative parse trees, how do you disambiguate?
If you are parsing incorrect programs, you want resilience in the parser, including considering alternate parsings and generating parses for sections of the file.
Consider the problem of parsing a C program that has NOT been put through the preprocessor. You want to be able to do this for structure editors and source-to-source transformation systems.
Of course parsing without cpp is, in general, hopeless. But for something like a structure-aware editor you don't need to parse it perfectly, just "good enough". Ditto for a source-to-source transformation system.
It's not that the grammar is ambiguous, it's that the thing you're parsing may not even be in the language generated by the grammar. But you want to parse it as well as possible, and there may be more than one way to do that.
Wow, absolutely not. If I'm parsing a programming language I want to basically stop and bail from the current context as soon as possible. There is basically nothing to gain by continuing.
If the shit I'm parsing isn't in the language there is no reason to continue. It's invalid.
If you are writing a code editor you might want to continue parsing it. Or maybe you detected an error, but you want to report more than one error at a time for a better programmer experience. There are lots of use cases for partially parsing a computer program.
I had already explained why one might want to do this:
"Consider the problem of parsing a C program that has NOT been put through the preprocessor. You want to be able to do this for structure editors and source-to-source transformation systems."
One cannot in general parse un-cpped C/C++. But that doesn't mean it's useless to do as well as one can, even if it cannot be parsed exactly. Parsing is for more than just compilers.
I've written parsers for context-sensitive grammars, and as it turns out, this is not a desirable feature of a language. The reason you typically see simple parsers is because simple grammars are usually unambiguous and context-free, which is more convenient for humans to understand and internalize.
After all, when you're writing software, you pretend to be the compiler to some extent.
Natural languages, like English, are a good example of something which humans struggle with because they are complex, ambiguous, and often require context. Sure, it's extremely expressive, but that is the sharpest double-edged sword in programming languages.
Most languages have a simple, unambiguous syntax, so LL or LR is fine. LL or LR is almost certainly faster than Earley, since in general more restricted = faster.
As the above commenter mentioned, most language designers make hand-rolled parsers. Making a hand-rolled LL or LR parser is easier too.
In general, most people think advanced = bad; they want the easiest solution which gets the job done well.
This is probably known somewhere, but something I noticed is that a way to think about Earley parsing is that, when you create an LR(1) shift-reduce parser, rather than treating shift-reduce or reduce-reduce conflicts as errors in the grammar, you instead reduce nondeterministically (i.e., your state machine has a set of states rather than requiring there to be a single state). This also ends up solving the left-recursion problem in an efficient way.
This is certainly slower than a deterministic LR(1) parser, but maybe if you're clever about the implementation you can efficiently handle local ambiguities in a grammar. Large-scale ambiguity is bad since you don't want small changes to have action-at-a-distance, but you could in principle have a strict LR(1) grammar that contains Earley sub-grammars for expressions.
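That "set of states" view is easy to see in the algorithm itself: each chart position holds a set of dotted items, and predict/scan/complete just grow those sets. A minimal recognizer sketch over a toy expression grammar (nothing here is taken from a real implementation):

    # chart[k] is the set of dotted items (lhs, rhs, dot, origin) that are live
    # after reading k tokens -- i.e. the "set of states" at that position.
    GRAMMAR = {
        "E": [("E", "+", "T"), ("T",)],
        "T": [("T", "*", "F"), ("F",)],
        "F": [("(", "E", ")"), ("NUM",)],
    }
    START = "E"

    def recognize(tokens):
        chart = [set() for _ in range(len(tokens) + 1)]
        chart[0] = {(START, rhs, 0, 0) for rhs in GRAMMAR[START]}
        for k in range(len(tokens) + 1):
            worklist = list(chart[k])
            while worklist:
                lhs, rhs, dot, origin = worklist.pop()
                nxt = rhs[dot] if dot < len(rhs) else None
                if nxt in GRAMMAR:                                  # predict
                    for prod in GRAMMAR[nxt]:
                        new = (nxt, prod, 0, k)
                        if new not in chart[k]:
                            chart[k].add(new)
                            worklist.append(new)
                elif nxt is not None:                               # scan
                    if k < len(tokens) and tokens[k] == nxt:
                        chart[k + 1].add((lhs, rhs, dot + 1, origin))
                else:                                               # complete
                    for plhs, prhs, pdot, porig in list(chart[origin]):
                        if pdot < len(prhs) and prhs[pdot] == lhs:
                            new = (plhs, prhs, pdot + 1, porig)
                            if new not in chart[k]:
                                chart[k].add(new)
                                worklist.append(new)
        return any(lhs == START and dot == len(rhs) and origin == 0
                   for lhs, rhs, dot, origin in chart[len(tokens)])

    print(recognize(["NUM", "+", "NUM", "*", "NUM"]))   # True
    print(recognize(["NUM", "+", "*"]))                 # False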
> LL or LR is almost certainly faster than Earley, since in general more restricted = faster.
Earley's algorithm should be trivially parallelizable, which might make it a contender for languages with ambiguous grammars (like most real-world languages) where people are hand-rolling parsers. I haven't tried, since I have no real need, but looking at the steps I can't see a reason it couldn't do its work across multiple threads.
Honestly, other than JavaScript 1.1 I can't think of a popular language which has an unambiguous syntax and I really like playing with language grammars for some odd reason -- probably wrong though...
From what I infer from articles like https://jeffreykegler.github.io/personal/timeline_v3 , the original Earley paper had a bug, wasn't a good fit for 1960s hardware, and had poor performance for some types of grammars.
By 1991, "Most researchers see the Parsing Problem as "solved" -- a closed issue. Earley parsing is almost forgotten, and Leo's discovery is ignored. Two decades will pass before anyone attempts a practical implementation of Leo 1991."
It takes Aycock and Horspool's work in 2002 and Kegler's work in 2010 in Marpa to have a "practical implementation" (quoting that link).
(I quote that also because Aycock distributed SPARK, an Earley parser, which was included as part of the Python distribution, in the Parser/ subdirectory, and a couple of people here on HN report having used it.)
> I quote that also because Aycock distributed SPARK, an Earley parser, which was included as part of the Python distribution, in the Parser/ subdirectory, and a couple of people here on HN report having used it.
That one is really the only Earley parser I've found used in the wild (I don't know what Marpa is used for), and unfortunately it is mostly unhackable because they did some serious optimization voodoo on it. It was replaced by a hand-written recursive descent parser a while back because nobody in the world could figure out how it works[0] -- which is kind of strange, since ASDL is super simple to parse and the generator that used SPARK was meant to check files into source control, but whatever.
It's easy to play around with, but not a great source if you want to see how an Earley parser is put together. There are also some bugs with parser actions on duplicate rules not working properly that were pretty easy to fix, but Python pulled it out of the source tree, so there's no upstream to send patches to?
You are one of the "couple of people" I was referring to. :)
I know SPARK's docstring use influenced PLY.
PLY doesn't use Earley, but "Earley" does come up in the show notes of an interview with Beazley, PLY's author, at https://www.pythonpodcast.com/episode-95-parsing-and-parsers... . No transcript, and I'm not going to listen to it just to figure out the context.
Lark is amazing... But it's also one of the best LR parsers out there and I would guess that mode is used a lot more than the Earley mode.
Either way, I have never used a better parser generator. It has the best usability and incredible performance when you consider it is written in pure Python.
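For anyone who hasn't tried it, the LALR mode is a one-argument switch (Lark defaults to Earley); a small illustrative example, not from any particular project:

    from lark import Lark

    grammar = r"""
        start: expr
        ?expr: expr "+" term   -> add
             | term
        ?term: term "*" atom   -> mul
             | atom
        ?atom: NUMBER
             | "(" expr ")"

        %import common.NUMBER
        %import common.WS
        %ignore WS
    """

    # parser="lalr" selects the LALR(1) backend; the default is "earley".
    parser = Lark(grammar, parser="lalr")
    print(parser.parse("1 + 2 * (3 + 4)").pretty())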
jq uses bison, and it takes a startling amount of work to diagnose parsing problems so as to give helpful error messages. Last time I checked, the issues were full of new users stopped by something an appropriate error message would resolve.