Hacker News
Parsing Advances
smj-edison
sureglymop
It is lex/yacc-style lexer and parser generation: it generates an LR(1) parser, but uses the CPCT+ algorithm for error recovery. IIRC, the way it works is that when an error occurs, the nearest likely valid token is inserted, the error is recorded, and parsing continues.
I would use this for anything that is simple enough, and recursive descent for anything more complicated, where even more context is needed for good errors.
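The insert-and-continue recovery described above can be sketched like this (a minimal illustration, not grmtools' actual API; the class and method names are hypothetical):

```typescript
// Sketch of insert-and-continue error recovery: on a mismatch,
// record the error, pretend the expected token was present, and
// keep parsing instead of aborting.
type Token = string;

class RecoveringParser {
  errors: string[] = [];
  pos = 0;
  constructor(private tokens: Token[]) {}

  peek(): Token | undefined {
    return this.tokens[this.pos];
  }

  // If the expected token is next, consume it; otherwise record an
  // error and behave as if the token had been inserted.
  expect(tok: Token): void {
    if (this.peek() === tok) {
      this.pos++;
    } else {
      this.errors.push(`expected '${tok}' at position ${this.pos}`);
      // no pos++ here: the token is treated as virtually inserted
    }
  }
}

// Parse "(a)" but feed it "(a": the missing ')' is repaired.
const p = new RecoveringParser(["(", "a"]);
p.expect("(");
p.expect("a");
p.expect(")"); // mismatch: one error recorded, parsing continues
console.log(p.errors.length); // 1
```

Real CPCT+ is considerably smarter than this (it searches for minimal-cost repair sequences, which may also delete or shift tokens), but the "repair and keep going" shape is the same.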
ratmice
What drew me to grmtools (I eventually contributed to it) was that you can evaluate grammars basically like an interpreter, without going through that compilation process, leading to fairly quick turnaround times during language development.
I hope this year I can work on porting my grmtools based LSP to browser/wasm.
dcrazy
```
const result: ast.Expression[] = [];
p.expect("(");
while (!p.eof() && !p.at(")")) {
  const subexpr = expression(p);
  assert(subexpr !== undefined); // << here
  result.push(subexpr);
  if (!p.at(")")) p.expect(",");
}
p.expect(")");
return result;
```
matklad
This is resilient parsing --- we are parsing source code with syntax errors, but still want to produce a best-effort syntax tree. Although expression is required by the grammar, the `expression` function might still return nothing if the user typed some garbage there instead of a valid expression.
However, even if we return nothing due to garbage, there are two possible behaviors:
* We can consume no tokens, making a guess that what looks like "garbage" from the perspective of expression parser is actually a start of next larger syntax construct:
```
function f() {
    let x = foo(1,
    let not_garbage = 92;
}
```
In this example, it would be smart to _not_ consume `let` when parsing `foo(`'s arglist.
* Alternatively, we can consume some tokens, guessing that the user _meant_ to write an expression there
```
function f() {
    let x = foo(1, /);
}
```
In the above example, it would be smart to skip over `/`.
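The choice between the two behaviors can be sketched as a lookahead decision (names and token sets here are hypothetical, just to make the idea concrete):

```typescript
// Sketch of the recovery decision: when an expression was expected
// but the next token is unexpected, either consume nothing (the
// token likely starts an enclosing construct, e.g. `let`) or skip
// it (plain garbage such as `/`).
const EXPR_START = new Set(["ident", "number", "("]);
const STMT_START = new Set(["let", "fn", "return"]);

// Returns how a resilient expression parser should react to `tok`:
// "parse" (it starts an expression), "dont-consume" (leave it for
// the enclosing parser), or "skip" (consume and discard it).
function recoverAt(tok: string): "parse" | "dont-consume" | "skip" {
  if (EXPR_START.has(tok)) return "parse";
  if (STMT_START.has(tok)) return "dont-consume";
  return "skip";
}

console.log(recoverAt("let")); // "dont-consume"
console.log(recoverAt("/"));   // "skip"
```

In the `foo(1, let ...` example the arglist parser sees `let`, leaves it unconsumed, and closes the call; in the `foo(1, /)` example it skips the `/` and carries on.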
kccqzy
This also has another benefit: work sharing. A production like `A B | C B` will ensure that, in case parsing A or C consumes the same number of characters, the work to parse B is shared, despite the production not literally being factored into `(A | C) B`.
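One way this sharing can arise is memoizing on (rule, position), packrat-style; this is a sketch of that mechanism, not necessarily the exact one the comment has in mind:

```typescript
// If A and C both end at the same position, a memoized B is only
// parsed once there. Parsers return the end position or null.
let bCalls = 0;
const memo = new Map<string, number | null>();

// parseB: consumes one "b" token at `pos`, memoized on position.
function parseB(input: string[], pos: number): number | null {
  const key = `B@${pos}`;
  if (memo.has(key)) return memo.get(key)!;
  bCalls++; // count only real (non-memoized) work
  const result = input[pos] === "b" ? pos + 1 : null;
  memo.set(key, result);
  return result;
}

function parseA(input: string[], pos: number): number | null {
  return input[pos] === "x" ? pos + 1 : null;
}
function parseC(input: string[], pos: number): number | null {
  return input[pos] === "x" ? pos + 1 : null; // same length as A
}

// A B | C B on ["x", "b"]: both alternatives reach B at pos 1,
// but B's work happens once.
const input = ["x", "b"];
const viaAB = parseA(input, 0) !== null ? parseB(input, 1) : null;
const viaCB = parseC(input, 0) !== null ? parseB(input, 1) : null;
console.log(viaAB, viaCB, bCalls); // 2 2 1
```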
luizfelberti
It's a bit harder to adapt the technique to parsers because the Thompson NFA always increments the sequence pointer by the same amount, while a parser's production usually has a variable size, making it harder to run several parsing heads in lockstep.
Porygon
I recently tried that approach while simultaneously building an abstract syntax tree, but I dropped it in favor of a right-recursive grammar for now, since restoring the AST when backtracking got a bit complex.
eru
tubs
The majority of production compilers use hand-rolled parsers, ostensibly for better error reporting and panic-mode synchronization.
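Panic-mode synchronization is straightforward to sketch (token shapes and the sync set are hypothetical): after an error, skip tokens until a known "safe" token, then resume parsing from there.

```typescript
// After a parse error, skip forward to a synchronizing token
// (statement keywords, ';', '}') so the parser can resume at the
// next plausible statement boundary instead of cascading errors.
const SYNC = new Set([";", "let", "fn", "}"]);

// Returns the position at which to resume parsing.
function synchronize(tokens: string[], pos: number): number {
  while (pos < tokens.length && !SYNC.has(tokens[pos])) pos++;
  // Consume a ';' itself so the next statement starts fresh.
  if (tokens[pos] === ";") pos++;
  return pos;
}

// An error at the `@` is recovered by resuming at the next `let`.
const toks = ["let", "x", "=", "@", "#", ";", "let", "y"];
console.log(synchronize(toks, 3)); // 6
```

Because the sync set is chosen by hand per language, this is one of the places where hand-rolled parsers can tune recovery in ways a generated parser typically cannot.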
cipherself
Admittedly, INI is quite a simple format, hence I mention this as an anecdote.
thechao
I've got 10 full-time senior engineers on a project heading into its 15th year. We rewrite even extremely low-level code like std::vector or malloc to make sure it matches our requirements.
UNIX was written by a couple of dudes.
kccqzy
cipherself
This was a project at $EMPLOYER, in an existing codebase, with colleagues who had never seen Haskell code; using Haskell would've been a major error in judgement.