Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

An out-of-the-box generator for pest grammar. #29

Open
TheVeryDarkness opened this issue Nov 23, 2023 · 20 comments
Open

An out-of-the-box generator for pest grammar. #29

TheVeryDarkness opened this issue Nov 23, 2023 · 20 comments

Comments

@TheVeryDarkness
Copy link

I've written some codes for generating Concrete Syntax Tree from Pest grammar.

Maybe that repository can be merged into some repository under pest-parser in the future if appreciated.

Examples can be found on docs.rs or examples directory.

An issue in pest suggests me to ask for feedback here :)

@tomtau
Copy link
Contributor

tomtau commented Nov 23, 2023

Looks pretty good. Yes, we can possibly put it under pest-parser or see how else it can be incorporated.

@tomtau
Copy link
Contributor

tomtau commented Jan 4, 2024

Hi @TheVeryDarkness , I had a more detailed look and I like it, but feel it can't be easily incorporated in a backward-compatible way in pest 2.X because of many breaking changes. Overall though, I think it's very promising and it fits very well with some original ideas for pest3: pest-parser/pest#416 (comment)

(Plus for better errors, see: pest-parser/pest#885 (reply in thread) )

I started locally trying to extract out pest-typed into pest3, but I don't have that much time these days and my progress is slow in that. Would you like try to add that to pest3?

The repo is here: https://github.com/pest-parser/pest3 so you can just open a PR with your changes there. Besides a basic parser of the surface grammar, it's incomplete, so you'd need to adapt pest-typed generator etc. to the new pest3 syntax, but other than that, you have a complete freedom there.

⚠️ Note: (the current WIP) pest3 has a different (simplified) surface grammar: (see this sample https://github.com/pest-parser/pest3/blob/master/meta/tests/pest3sample.pest )

  1. trivia (comments and whitespaces) are handled via special operators instead of rule types (non-atomic and atomic): pest language evolution pest#333 (comment)

Apart from -, one can use ~ and ^ as infix sequence operators. E.g. ~ could be used for optional trivia, while ^ could be used for mandatory trivia. With this addition, repetition can then follow the trend rule~*, rule^*, rule~{3}, rule^{5}, and so on. I

  1. there are modules and aliasing: you can perhaps ignore that in your first PR for the moment when extracting pest-typed; I think the only place you may need to worry about it are builtins (as they are now under pest::* module name), so you can perhaps have some hacky / hardcoded check for that before we implement the module system

  2. meta-rules: it's possible to define rules with arguments ([RFC] "Template"/Macro/Generic rules pest#261), I think you can ignore that in your first PR for now

There are also a few more ideas for pest3, so there may be other changes, but we can perhaps start with that surface syntax parser in the repository, add the typed AST generation and iterate from there. What do you think?

@tomtau
Copy link
Contributor

tomtau commented Jan 4, 2024

@TheVeryDarkness FYI regarding trivia, currently the intended semantics / usage is this:

  1. a user can define / override these special rules:
~ = (whitespace | comment)*
^ = (whitespace | comment)*
  1. there are then 3 types of sequencing operators:

A) -: no implicit whitespace/trivia (just like atomic rules in pest 2, but for a single sequence)
B) ~: optional implicit whitespace
C) ^: mandatory implicit whitespace

  1. based on those implicit whitespace sequence operators, there are also corresponding infix, e.g.:

A) rule*: rule repeated without implicit whitespace/trivia
B) rule~*: rule repeated with optional implicit whitespace/trivia
C) rule^*: rule repeated with mandatory implicit whitespace/trivia

@TheVeryDarkness
Copy link
Author

@tomtau I'll have some time one or two weeks later. I still have some exams and assignments now. And I guess it won't take too long to implement them.
I've also read those discussions before. 3 points you listed above are definitely part of what I like to see. And I think we're making pest a modern parser.

@marcfir
Copy link

marcfir commented Jan 5, 2024

Can I jump in here and ask some general questions about ASTs in Rust and pest?

From my point of view, there are two important questions about generating an AST from the grammar.

  1. How do we want to do the abstraction?
  2. What kind of data structure do we want to use?

In general, the question is who should generate the AST. Because an AST is always an abstracted view of the parsing result. So, depending on what you want to do, your AST may look different and may be implemented differently.
But we can look at a generic use of parser generators. These are language workbenches such as xtext, langium, or monticore.

AST abstraction

If you look at the language workbenches mentioned above, they typically do the following when translated to pest:

  • A sequence like entry = { SOI ~ expr ~ number ~ EOI } is mapped to single struct fields struct Entry {expr: Expr, Number: Number}. More precisely, the fields would be wrapped in a Box to support expressions.
  • An ordered choice like operation = { add | subtract | multiply | divide | power } is mapped to multiple optional fields struct operation {add: Option<Add>, Subtract: Option<Subtract>, Multiply: Option<Multiply>, divide: Option<Divide>, power: Option<Power> }.
  • A Repetition like expr = { ( term)* } is mapped to a vector struct Expr{term:Vec<Term>}.

This would be somewhat simpler than the current procedure in TheVeryDarkness/pest-typed

Data Structure

One problem in Rust is references to AST nodes. In other dynamic languages it is easy to have references inside a tree structure and to three nodes. But in Rust it is more complicated, especially if you have mutable references.
So the first question might be, why do we need references? There are e.g. use cases:

  • Validation We typically want to validate the semantics of our language. This is usually done with a visitor pattern. A nice API will give us access to the parent node.
  • Symboling We typically have symbols/identifiers in our language that are handled in a symbol table. We want to reference the AST node from a particular symbol.

So we could construct our AST based on Rc<RefCell<...>> or we could use an arena allocator based approach with a flat vector where the references are indicated by the index in the array. The arena AST is very simple in referencing and lifetime. Also the mutability for AST nodes is easier.

We experimented with all the approaches and ended up with an arena AST.

@TheVeryDarkness What were your thoughts for your approache?

@TheVeryDarkness
Copy link
Author

TheVeryDarkness commented Jan 5, 2024

Can I jump in here and ask some general questions about ASTs in Rust and pest?

From my point of view, there are two important questions about generating an AST from the grammar.

  1. How do we want to do the abstraction?
  2. What kind of data structure do we want to use?

In general, the question is who should generate the AST. Because an AST is always an abstracted view of the parsing result. So, depending on what you want to do, your AST may look different and may be implemented differently. But we can look at a generic use of parser generators. These are language workbenches such as xtext, langium, or monticore.

AST abstraction

If you look at the language workbenches mentioned above, they typically do the following when translated to pest:

  • A sequence like entry = { SOI ~ expr ~ number ~ EOI } is mapped to single struct fields struct Entry {expr: Expr, Number: Number}. More precisely, the fields would be wrapped in a Box to support expressions.
  • An ordered choice like operation = { add | subtract | multiply | divide | power } is mapped to multiple optional fields struct operation {add: Option<Add>, Subtract: Option<Subtract>, Multiply: Option<Multiply>, divide: Option<Divide>, power: Option<Power> }.
  • A Repetition like expr = { ( term)* } is mapped to a vector struct Expr{term:Vec<Term>}.

This would be somewhat simpler than the current procedure in TheVeryDarkness/pest-typed

Data Structure

One problem in Rust is references to AST nodes. In other dynamic languages it is easy to have references inside a tree structure and to three nodes. But in Rust it is more complicated, especially if you have mutable references. So the first question might be, why do we need references? There are e.g. use cases:

  • Validation We typically want to validate the semantics of our language. This is usually done with a visitor pattern. A nice API will give us access to the parent node.
  • Symboling We typically have symbols/identifiers in our language that are handled in a symbol table. We want to reference the AST node from a particular symbol.

So we could construct our AST based on Rc<RefCell<...>> or we could use an arena allocator based approach with a flat vector where the references are indicated by the index in the array. The arena AST is very simple in referencing and lifetime. Also the mutability for AST nodes is easier.

We experimented with all the approaches and ended up with an arena AST.

@TheVeryDarkness What were your thoughts for your approache?

@marcfir Thanks for sharing your opinions.

It should be a great idea to use an arena or an allocator here considering performance. And the lifetime can be omitted if we create Strings for nodes. And I may provide features for rule references with inner mutability, allocator and owned strings in the future if someone does like it.

And I'm also considering supporting something like hooks or visitors, which should be able to satisfy the usage of validation and symbolling.
I'm just not sure about how to do this elegantly (pest-typed are using macros, but macros can't communicate with each other easily), and I may be able to implement one when I have enough time. However, some changes to the pest grammar may need to be made to support hooks or visitors. @tomtau Do you have some decisions or preferences about that?

Well, I'd like to say that I adopted my current approach mainly because the way you listed above may not be trivially extended to arbitrary rules.

For example, if we have a complex sequence, in which there are repetitions to a compound parsing expression, just like row = { (Item ~ Comma)* ~ Item } (and I think it will be tedious for users to write another rule for Item ~ Comma). The repetition can't be named with some identifier we already have.
If one does want to access rule references with names, pest-typed provides some functions. For rule Entry = { SOI ~ expr ~ number ~ EOI }, users can call entry.expr() on an instance entry of Entry to access the expr. It doesn't cost too much.

And generating Options from ordered choices may also result in higher memory usage, as we must preserve spaces for all of those fields though only one of them can exist meanwhile. pest-typed just uses some enums for ordered choices.

And pest-typed also provide some APIs to handle nodes with values but not references, though I must admit that they are not well tested.
pest-typed can even simulate pairs APIs.

@TheVeryDarkness
Copy link
Author

There's a PR pest-parser/pest#815 about hooks.

@tomtau
Copy link
Contributor

tomtau commented Jan 6, 2024

There were a few interesting comments in the pest3 discussion regarding typed AST:

  1. trouble with Vec<Value>: Restoration of the pest3 work effort 🙌 pest#885 (reply in thread)
  2. potential improvement of error reporting with typed AST: Restoration of the pest3 work effort 🙌 pest#885 (reply in thread)

And I'm also considering supporting something like hooks or visitors, which should be able to satisfy the usage of validation and symbolling.
I'm just not sure about how to do this elegantly (pest-typed are using macros, but macros can't communicate with each other easily), and I may be able to implement one when I have enough time. However, some changes to the pest grammar may need to be made to support hooks or visitors. @tomtau Do you have some decisions or preferences about that?

There are no concrete decisions yet. Regarding "hooks" (as in arbitrary Rust code executed within a grammar), I spoke to @dragostis about it and our consensus was that it'd bring extra complexity and drawbacks, e.g. grammar rewriting optimizer will need to give up when it encounters hooks or it won't be possible or easy to have the embeddable pest_vm (that's used for the online scratchpad editor on the main website). With the pest's focus on accessibility, correctness, and performance, instead of adding hooks, we'd preferably look for adding self-contained grammar features that can address main use cases people would want to use "hooks" for.

As for my own preferences:

  1. short-term: simply adding the existing pest-typed as close as it is now into WIP pest3 (i.e. just to get something basic working and no need to worry too much about supporting other features...)
  2. long-term: refactoring pest3 codebase to be more amenable to changes:
  • make it more compiler-pipeline-like: split the internal AST/IR into "high-level" (which will capture various rich features from the surface grammar) and "low-level" (which will be much simpler, the optimizer(s) will operate on it, the generator will use it, we could write down its semantics formally/semi-formally etc.)
  • for the generator output, I wrote my thoughts here: Restoration of the pest3 work effort 🙌 pest#885 (reply in thread) i.e. have the parser output "event-based", and then have different event processor adapters that users can pick (such as Pairs or pest-typed style) and choose or write/generate their own (something like pest-ast)

@marcfir
Copy link

marcfir commented Jan 6, 2024

@TheVeryDarkness The problem with a rule like this row = { (item ~ comma)* ~ item } can be solved with the tagging feature invented with pest v2.6.0. So you can write this row = { (item ~ comma)* ~ #last_item = item } which will be translated to something like this struct Row { item: Vec<Item>, comma: Vec<Comma>, last_item: Item}.

I also hit the problem, that macros cannot share state. There is a related issue rust-lang/rust#44034.

I am currently working on a language workbench in Rust based on pest. It's not feature complete and very unstable yet. So we haven't released it yet. Our first start was an automatic AST generation. So I think it makes sense to at least show the AST generation. I released the code at pestast. If I have more time I can also commit the arena tree based AST

@tomtau I didn't have the discussions in mind. So maybe we can continue there at least for pest3.
And I agree with your event-based approach. It makes it very modular and there could be multiple libraries providing different types of ASTs.

@TheVeryDarkness
Copy link
Author

TheVeryDarkness commented Jan 6, 2024

@marcfir Yes, it's practicable if rule components are well tagged (though tags are available only under feature grammar-extras). I think we can provide typed AST in both styles.

By the way, pest-typed also supports accessing using tag names. row.last_item() is also available, and there's an attribute that controls whether one can access rules or tags inside a tag from outside directly.

@tomtau
Copy link
Contributor

tomtau commented Jan 8, 2024

@tomtau I didn't have the discussions in mind. So maybe we can continue there at least for pest3.

Yes, for a greater visibility, it's better to use that pest3 brainstorming discussions instead. Once @TheVeryDarkness adds the pest-typed-based generator to that initial pest3 prototype, we can perhaps open issues and have more implementation-focused discussions on that repo..

@TheVeryDarkness
Copy link
Author

TheVeryDarkness commented Jan 10, 2024

@TheVeryDarkness FYI regarding trivia, currently the intended semantics / usage is this:

  1. a user can define / override these special rules:
~ = (whitespace | comment)*
^ = (whitespace | comment)*
  1. there are then 3 types of sequencing operators:

A) -: no implicit whitespace/trivia (just like atomic rules in pest 2, but for a single sequence) B) ~: optional implicit whitespace C) ^: mandatory implicit whitespace

  1. based on those implicit whitespace sequence operators, there are also corresponding infix, e.g.:

A) rule*: rule repeated without implicit whitespace/trivia B) rule~*: rule repeated with optional implicit whitespace/trivia C) rule^*: rule repeated with mandatory implicit whitespace/trivia

@tomtau Sorry, I have some questions about the definition of trivia:

~ = (whitespace | comment)*
^ = (whitespace | comment)*

Are there any differences between them? And should (or could) both of them be defined in the same grammar?

I though there is only one form of trivia, and there should be only one way to define the trivia.

@TheVeryDarkness
Copy link
Author

TheVeryDarkness commented Jan 10, 2024

And one issue is whether we should keep skipped trivia inside our generated data structure? It may be useful in some cases, such as implementing a formatter, but can also be solved by defining the rule explicitly. So, I may drop skipped trivia in my PR, unlike what I've done in pest-typed.

By the way, another question is whether silent rule should be either inline or ignored in the data structure.

@tomtau
Copy link
Contributor

tomtau commented Jan 10, 2024

Are there any differences between them? And should (or could) both of them be defined in the same grammar?

Yes, it's mentioned below; ~ means that there's an optional trivia (0 or more) between items, ^ means there's a mandatory trivia (1 or more).
It's currently unspecified whether we'd always want both to be defined. I guess we could mimic pest2 where WHITESPACE and COMMENT don't need to be defined?

@tomtau
Copy link
Contributor

tomtau commented Jan 10, 2024

And one issue is whether we should keep skipped trivia inside our generated data structure? It may be useful in some cases, such as implementing a formatter, but can also be solved by defining the rule explicitly. So, I may drop skipped trivia in my PR, unlike what I've done in pest-typed.

I'm not sure. I'd need to revisit some of the older issues and discussions or ask people for more feedback; maybe it'll make sense to keep it in the mandatory one, but not the optional one? Anyway, for the initial prototype, you can just pick whatever is easiest to go with for the start.

By the way, another question is whether silent rule should be either inline or ignored in the data structure.

Maybe silent rules could be ignored, while the meta-rules/template functions could be inline?

@TheVeryDarkness
Copy link
Author

Are there any differences between them? And should (or could) both of them be defined in the same grammar?

Yes, it's mentioned below; ~ means that there's an optional trivia (0 or more) between items, ^ means there's a mandatory trivia (1 or more). It's currently unspecified whether we'd always want both to be defined. I guess we could mimic pest2 where WHITESPACE and COMMENT don't need to be defined?

Does it mean we should write codes like below?

~ = { (whitespace | comment)* }
^ = { (whitespace | comment)+ }

@tomtau
Copy link
Contributor

tomtau commented Jan 12, 2024

I guess * and + may be redundant in that definition, as it's implied by the operator, so perhaps just: comment | whitespace.
It's up to each grammar

@tomtau
Copy link
Contributor

tomtau commented Feb 2, 2024

@TheVeryDarkness how's going with that pest-typed port to pest3? Do you need any help?

@TheVeryDarkness
Copy link
Author

I have a fork here and I'm working on it.
I'm sorry for my slow progress. I didn't pay enough time on it.

@tomtau
Copy link
Contributor

tomtau commented Feb 2, 2024

No problem, I'll check out the current fork!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants