
I'd really like to know why people choose one parsing algorithm over another.

I implemented an Earley parser because, from what I read on Wikipedia, it seemed to be more advanced.

"Earley parsers are appealing because they can parse all context-free languages, unlike LR parsers and LL parsers, which are more typically used in compilers but which can only handle restricted classes of languages."

However, I seldom see languages use an Earley parser. There must be a reason, but I've never seen anybody explain why you'd choose one algorithm over another.



Earley and similar algorithms give you a parse forest rather than a parse tree. For a programming language you don't want your grammar to be ambiguous. When Earley gives you a bunch of alternative parse trees, how do you disambiguate?
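
To make the question concrete, here is a tiny sketch using Lark (which comes up downthread), whose Earley backend can be asked to return the forest explicitly; the ambiguous expression grammar is made up for illustration:

    from lark import Lark

    # A deliberately ambiguous grammar: "1+2+3" can group as (1+2)+3 or 1+(2+3).
    grammar = r'''
        expr: expr "+" expr
            | NUMBER
        %import common.NUMBER
    '''
    parser = Lark(grammar, start="expr", parser="earley", ambiguity="explicit")

    # The printed tree contains an _ambig node holding both groupings;
    # with the default ambiguity setting Lark silently picks one for you.
    print(parser.parse("1+2+3").pretty())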


If you are parsing incorrect programs, you want resilience in the parser, including considering alternate parsings and generating parses of sections of the file.

Consider the problem of parsing a C program that has NOT been put through the preprocessor. You want to be able to do this for structure editors and source-to-source transformation systems.


It is quite possible to redefine much of the syntax with cpp. Parsing without cpp is hopeless.

My favourite is Bourne's longing for Algol: https://minnie.tuhs.org/cgi-bin/utree.pl?file=V7/usr/src/cmd...


Of course parsing without cpp is, in general, hopeless. But for something like a structure-aware editor you don't need to parse it perfectly, just "good enough". Ditto for a source-to-source transformation system.


If your grammar is ambiguous, then an ambiguous program isn't incorrect.


It's not that the grammar is ambiguous, it's that the thing you're parsing may not even be in the language generated by the grammar. But you want to parse it as well as possible, and there may be more than one way to do that.


wow, absolutely not. if i'm parsing a programming language I want to stop and bail from the current context as soon as possible. there is basically nothing to gain by continuing.

if the shit i'm parsing isn't in the language there is no reason to continue. it's invalid.


If you are writing a code editor you might want to continue parsing it. Or maybe you detected an error, but you want to report more than one error at a time for a better programmer experience. There are lots of use cases for partially parsing a computer program.


> there is basically nothing to gain by continuing.

You gain error messages for subsequent errors, so you can fix them all before recompiling.
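
To make that concrete, here is a toy sketch of panic-mode recovery (the statement syntax and all names are made up): on an error, record it, skip ahead to a synchronizing ';', and keep going, so one run reports every bad statement.

    # Toy grammar: each statement is NAME "=" NUMBER ";" over a pre-split token list.
    def parse_program(tokens):
        stmts, errors, i = [], [], 0
        while i < len(tokens):
            try:
                name, eq, num, semi = tokens[i:i + 4]
                if not (name.isidentifier() and eq == "=" and num.isdigit() and semi == ";"):
                    raise SyntaxError
                stmts.append((name, int(num)))
                i += 4
            except (ValueError, SyntaxError):
                errors.append(f"syntax error near token {i}")
                while i < len(tokens) and tokens[i] != ";":
                    i += 1      # panic: skip to the synchronizing ';'
                i += 1          # consume the ';'
        return stmts, errors

    # Reports the bad middle statement but still parses the one after it.
    print(parse_program(["x", "=", "1", ";", "y", "+", "2", ";", "z", "=", "3", ";"]))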


I had already explained why one might want to do this:

"Consider the problem of parsing a C program that has NOT been put through the preprocessor. You want to be able to do this for structure editors and source-to-source transformation systems."

One cannot in general parse un-cpped C/C++. But that doesn't mean it's useless to do as well as one can, even if it cannot be parsed exactly. Parsing is for more than just compilers.


In my implementation, I give each rule a priority and always choose the one with the highest priority to expand. I basically followed this: https://loup-vaillant.fr/tutorials/earley-parsing/parser
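
If it helps, that scheme might look roughly like the sketch below (all names here are invented; the tutorial's actual code differs): among completed Earley items covering the same span for the same symbol, keep only the one whose rule has the highest priority, so the chart yields a single tree instead of a forest.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Rule:
        lhs: str
        rhs: tuple
        priority: int = 0

    @dataclass(frozen=True)
    class Completion:   # a finished Earley item: `rule` recognized over [start, end)
        rule: Rule
        start: int
        end: int

    def pick_by_priority(completions):
        best = {}
        for c in completions:
            key = (c.rule.lhs, c.start, c.end)
            if key not in best or c.rule.priority > best[key].rule.priority:
                best[key] = c
        return list(best.values())

    # Two rules completing the same span: the priority-1 rule wins.
    add = Rule("expr", ("expr", "+", "expr"), priority=0)
    num = Rule("expr", ("NUMBER",), priority=1)
    print(pick_by_priority([Completion(add, 0, 3), Completion(num, 0, 3)]))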


I've written parsers for context-sensitive grammars, and as it turns out, context sensitivity is not a desirable feature of a language. The reason you typically see simple parsers is that simple grammars are usually unambiguous and context-free, which makes them easier for humans to understand and internalize.

After all, when you're writing software, you pretend to be the compiler to some extent.

Natural languages, like English, are a good example of something which humans struggle with because they are complex, ambiguous, and often require context. Sure, it's extremely expressive, but that is the sharpest double-edged sword in programming languages.


Neat, though this just seems like PEG with extra steps and more pain. Still this answers my question. Thanks!


Most languages have a simple, unambiguous syntax, so LL or LR is fine. LL or LR is almost certainly faster than Earley, since in general more restricted = faster.

As the above commenter mentioned, most language designers write hand-rolled parsers, and hand-rolling an LL or LR parser is easier, too.

In general, most people treat advanced = bad; they want the easiest solution that gets the job done well.
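
Lark (which comes up downthread) makes the speed gap easy to poke at, since the same grammar can run on either its Earley or its LALR(1) backend; a rough timing sketch (exact numbers will vary by machine):

    from time import perf_counter
    from lark import Lark

    grammar = r'''
        start: NUMBER ("+" NUMBER)*
        %import common.NUMBER
        %import common.WS
        %ignore WS
    '''
    text = " + ".join(["1"] * 2000)   # long but trivial input

    for backend in ("earley", "lalr"):
        parser = Lark(grammar, parser=backend)
        t0 = perf_counter()
        parser.parse(text)
        print(backend, perf_counter() - t0)   # lalr should win comfortably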


This is probably known somewhere, but something I noticed is that a way to think about Earley parsing is that, when you create an LR(1) shift-reduce parser, rather than having it be an error in the grammar for there to be shift-reduce or reduce-reduce conflicts, you instead reduce nondeterministically (i.e., your state machine has a set of states rather than requiring a single state). This also ends up solving the left recursion problem in an efficient way.

This is certainly slower than a deterministic LR(1) parser, but maybe if you're clever about the implementation you can efficiently handle local ambiguities in a grammar. Large-scale ambiguity is bad since you don't want small changes to have action-at-a-distance, but you could in principle have a strict LR(1) grammar that contains Earley sub-grammars for expressions.
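
Here is a sketch of one step of that idea, with hypothetical actions/goto tables keyed by (state, lookahead); this is essentially GLR with plain sets of stacks rather than a graph-structured stack, so a pathological grammar could make the stack set blow up:

    # One token of a nondeterministic LR parse. Each stack is a tuple of
    # state numbers; actions[(state, token)] is a list of ("shift", state)
    # or ("reduce", lhs, rhs_len) entries -- conflicts just mean longer lists.
    def parse_step(stacks, token, actions, goto):
        # Close under reductions first: pop |rhs| states, push the goto
        # state; a conflict simply adds more candidate stacks.
        worklist, closed = list(stacks), set()
        while worklist:
            stack = worklist.pop()
            if stack in closed:
                continue
            closed.add(stack)
            for act in actions.get((stack[-1], token), []):
                if act[0] == "reduce":
                    _, lhs, rhs_len = act
                    base = stack[:len(stack) - rhs_len]
                    worklist.append(base + (goto[(base[-1], lhs)],))
        # Then shift the token on every stack whose state allows it; stacks
        # with no action die off, which is how failed alternatives are pruned.
        return {s + (act[1],)
                for s in closed
                for act in actions.get((s[-1], token), [])
                if act[0] == "shift"}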


> LL or LR is almost certainly faster than Earley, since in general more restricted = faster.

Earley's algorithm should be trivially parallelizable, which might make it a contender for languages with ambiguous grammars (like most real-world languages) where people are hand-rolling parsers. I haven't tried, since I have no real need, but looking at the steps I can't see a reason it couldn't do its work across multiple threads.

Honestly, other than JavaScript 1.1 I can't think of a popular language with an unambiguous syntax, and I really like playing with language grammars for some odd reason -- probably wrong though...


From what I infer from articles like https://jeffreykegler.github.io/personal/timeline_v3 , the original Earley paper had a bug, wasn't a good fit for 1960s hardware, and performed poorly on some types of grammars.

By 1991, "Most researchers see the Parsing Problem as "solved" -- a closed issue. Earley parsing is almost forgotten, and Leo's discovery is ignored. Two decades will pass before anyone attempts a practical implementation of Leo 1991."

It took Aycock and Horspool's work in 2002, and Kegler's work on Marpa in 2010, to get a "practical implementation" (quoting that link).

(I quote that also because Aycock distributed SPARK, an Earley parser, which was included as part of the Python distribution, in the Parser/ subdirectory, and a couple of people here on HN report having used it.)


> I quote that also because Aycock distributed SPARK, an Earley parser, which was included as part of the Python distribution, in the Parser/ subdirectory, and a couple of people here on HN report having used it.

That one is really the only Earley parser I've found used in the wild (I don't know what Marpa is used for), and unfortunately it is mostly unhackable because they did some serious optimization voodoo on it. It was replaced by a hand-written recursive descent parser a while back because nobody in the world could figure out how it works[0] -- which is kind of strange, since ASDL is super simple to parse and the generator which used SPARK was meant to check files into source control, but whatever.

It's easy to play around with, but not a great source if you want to see how an Earley parser is put together. There are also some bugs with parser actions on duplicate rules not working properly; they were pretty easy to fix, but Python pulled it out of the source tree, so there's no upstream to send patches to?

[0] might be making that part up, dunno?


You are one of the "couple of people" I was referring to. :)

I know SPARK's docstring use influenced PLY.

PLY doesn't use Earley, but "Earley" does come up in the show notes of an interview with Beazley, PLY's author, at https://www.pythonpodcast.com/episode-95-parsing-and-parsers... . No transcript, and I'm not going to listen to it just to figure out the context.

https://github.com/lark-parser/lark "implements both Earley(SPPF) and LALR(1)".

Kegler, the author of that timeline I linked to, is the author of Marpa. Home page is http://savage.net.au/Marpa.html . The most recent HN comments about it are from a year ago, at https://hackernews.hn/item?id=24321395 .


Lark [1] is a battle-tested pure-Python Earley parser. My company uses it in production. It is by far the easiest parser generator I've ever used.

1. https://github.com/lark-parser/lark


Lark is amazing... But it's also one of the best LR parsers out there and I would guess that mode is used a lot more than the Earley mode.

Either way, I have never used a better parser generator. It has the best usability and incredible performance when you consider it is written in pure Python.


nearley.js is an Earley parser that sees at least some use. https://nearley.js.org/


Without certain optimizations it will be too slow. For example, you really need to optimize right recursion (Leo's 1991 improvement, mentioned above) so it isn't quadratic.



