EBNF Grammar for ANSI C (+ Guide on reading EBNF) (gist.github.com)
from ChubakPDP11@programming.dev to programming@programming.dev on 09 Mar 2024 16:02
https://programming.dev/post/11237994

This is an EBNF grammar for ANSI C (C99), and it contains almost every rule. It may be missing some things; please tell me if you notice anything missing.

I am writing a C compiler, with my own backend and hopefully my own frontend, in OCaml. That is why I wrote this grammar. I have also written an AWK grammar, but it’s not uploaded anywhere. Tell me if you want it.

Thanks.

#programming


solrize@lemmy.world on 09 Mar 2024 16:31

Is there a parser generator you’re going to use with that grammar? Why not C23?

ChubakPDP11@programming.dev on 09 Mar 2024 17:28

Not with this grammar. There’s an intermediate parser generator called BNFC that uses its own flavor of BNF (Labelled BNF) to generate Yacc/Lex (or ANTLR when it can), an abstract syntax tree, etc., but I don’t like it. There are no EBNF parser generators AFAIK. One could, possibly, feed this to ChatGPT and ask for a Yacc/Lex pair in return, or even a manual parser! I may do that, but I first have to clean this up and add the stuff that isn’t there.

ChatGPT has changed langdev a lot for me. I automate a good portion of the process with it. But one needs solid specs to feed to it.

As I said, I wish to implement the frontend myself, basically the lexer/parser. But I kinda get bored with LP because it’s too time-consuming. Plus, LR(1) parsers can only be generated; only LL(1) parsers can be hand-written. I have not decided yet. I wish to focus more on the backend, because that is where you can do innovative shit and, perhaps, write a paper on it.
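
For anyone wondering what the hand-written LL(1) route looks like, here is a minimal recursive-descent sketch in Python. It parses a toy arithmetic grammar, not the C grammar from the gist, and the names `tokenize`, `Parser` and `parse_expr` are made up for this example. Each EBNF rule becomes one method, and one token of lookahead (`peek`) is enough to pick the next step.

```python
import re

# Toy grammar in EBNF (illustrative only, not taken from the gist):
#   expr   = term , { ( "+" | "-" ) , term } ;
#   term   = factor , { ( "*" | "/" ) , factor } ;
#   factor = NUMBER | "(" , expr , ")" ;

TOKEN_RE = re.compile(r"\s*(\d+|[-+*/()])")


def tokenize(text):
    """Split the input into number and operator tokens."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        # One token of lookahead; None marks the end of input.
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        token = self.peek()
        self.pos += 1
        return token

    def expr(self):
        # expr = term , { ( "+" | "-" ) , term } ;
        value = self.term()
        while self.peek() in ("+", "-"):
            op = self.advance()
            rhs = self.term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term(self):
        # term = factor , { ( "*" | "/" ) , factor } ;
        value = self.factor()
        while self.peek() in ("*", "/"):
            op = self.advance()
            rhs = self.factor()
            # "/" is treated as integer division to keep the sketch simple.
            value = value * rhs if op == "*" else value // rhs
        return value

    def factor(self):
        # factor = NUMBER | "(" , expr , ")" ;
        token = self.advance()
        if token is None:
            raise SyntaxError("unexpected end of input")
        if token.isdigit():
            return int(token)
        if token == "(":
            value = self.expr()
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return value
        raise SyntaxError(f"unexpected token {token!r}")


def parse_expr(text):
    """Parse and evaluate one expression, rejecting trailing tokens."""
    parser = Parser(tokenize(text))
    value = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"trailing token {parser.peek()!r}")
    return value
```

The `{ ... }` repetition in EBNF maps directly onto a `while` loop, which is a large part of why EBNF reads so naturally next to a recursive-descent parser.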

Also, I’m going to leave C23 to people who have years of experience. ANSI C is the lowest common denominator of C. I am using the C99 standard, which should be able to compile a good portion of code bases. C99 is the last required POSIX standard for C. That’s when C went under ISO.

Thanks.

OmnipotentEntity@beehaw.org on 09 Mar 2024 18:49

Are digraphs and trigraphs deprecated?

Did you reference the standard?

ChubakPDP11@programming.dev on 10 Mar 2024 00:40

I think digraphs and trigraphs are part of the preprocessor? I did not add any preprocessor stuff to this grammar. I am adding them to the new version I am working on.

I have read the C17 standard fully, and I recalled it from memory from time to time, but it seems I had forgotten a lot of stuff. I am redefining it, and I am redesigning my AWK grammar too.

I am hoping I could perhaps make a GitHub Pages website called Internet Grammar Database and have all sorts of grammars in it. Thoughts?

navigatron@beehaw.org on 10 Mar 2024 02:25

I love grammars. It’s like an API or a data schema, but for a language. This would be very cool and I would love to see it!

ChubakPDP11@programming.dev on 10 Mar 2024 04:26

Cool! I will make it.

OmnipotentEntity@beehaw.org on 11 Mar 2024 16:22

Trigraphs are handled by the preprocessor, so if you’re not handling that, then that’s fine. Digraphs are handled by the tokenizer, however.
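
A rough Python sketch of that split, for illustration only: trigraphs are replaced in a separate early pass over the raw source text (translation phase 1 in the standard), while digraphs are just alternative spellings the tokenizer recognizes. The function names and the tiny tokenizer are assumptions for this example. It also normalizes digraphs to their usual punctuators, whereas a real C implementation keeps the original spelling around for stringification.

```python
import re

# Trigraphs: replaced in an early pass, before any tokenization happens.
TRIGRAPHS = {
    "??=": "#", "??(": "[", "??/": "\\", "??)": "]", "??'": "^",
    "??<": "{", "??!": "|", "??>": "}", "??-": "~",
}

# Digraphs: alternative token spellings, recognized by the tokenizer.
DIGRAPHS = {
    "%:%:": "##",
    "<:": "[",
    ":>": "]",
    "<%": "{",
    "%>": "}",
    "%:": "#",
}

WORD_RE = re.compile(r"[A-Za-z_]\w*|\d+")
STRING_RE = re.compile(r'"(?:\\.|[^"\\])*"')


def replace_trigraphs(source):
    """Scan left to right and replace each trigraph sequence."""
    out, i = [], 0
    while i < len(source):
        tri = source[i:i + 3]
        if tri in TRIGRAPHS:
            out.append(TRIGRAPHS[tri])
            i += 3
        else:
            out.append(source[i])
            i += 1
    return "".join(out)


def tokenize(source):
    """Very small tokenizer: strings, words, digraphs, single characters."""
    tokens, i = [], 0
    while i < len(source):
        if source[i].isspace():
            i += 1
            continue
        match = STRING_RE.match(source, i) or WORD_RE.match(source, i)
        if match:
            # String literals are kept whole, so "<:" inside one is untouched.
            tokens.append(match.group(0))
            i = match.end()
            continue
        # Try the longest digraph spelling first so "%:%:" beats "%:".
        for spelling in sorted(DIGRAPHS, key=len, reverse=True):
            if source.startswith(spelling, i):
                tokens.append(DIGRAPHS[spelling])
                i += len(spelling)
                break
        else:
            tokens.append(source[i])
            i += 1
    return tokens
```

With this, `tokenize('int a<:3:> = <% 1 %>;')` yields the same tokens as `int a[3] = { 1 };`, while a string literal such as `"<:"` comes through unchanged.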

ChubakPDP11@programming.dev on 11 Mar 2024 16:26

Cool, I am making a second version where things are cleaner. I will add digraphs, trigraphs and preprocessor directives to the grammar as well. Thanks.

sim642@lemm.ee on 10 Mar 2024 06:52

I am currently writing a C compiler, with my own backend (and hopefully, frontend) in OCaml.

But why write your own C frontend? It’s much more of a pain than people imagine. I maintain a C frontend implemented in OCaml (the project itself goes back 25 years) and it’s still not on par with GCC or Clang.

For any other language, sure, but C has so many “wonderful” features, starting with the lexer hack. Your grammar conveniently overlooks this issue but it’s something you’ll have to deal with to actually implement it. So it simply won’t be as nice as theory suggests.
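
To make the lexer hack concrete: `T * x;` declares a pointer if `T` is a typedef name and is a multiplication otherwise, so the token stream cannot be classified from the grammar alone. The usual workaround is for the lexer to consult a table of typedef names that the parser keeps up to date. Below is a heavily simplified Python sketch of that feedback loop. The function name and token kinds are invented for the example, the "parser" side is reduced to a single rule (the last identifier before the `;` of a `typedef` is the new type name), and scopes and declarators with parameter names are ignored.

```python
import re

TOKEN_RE = re.compile(r"\s*([A-Za-z_]\w*|\d+|\S)")

# A tiny keyword subset, just enough for the example.
KEYWORDS = {"typedef", "int", "char", "void"}


def lex_with_typedef_feedback(source):
    """Tag identifiers as TYPE_NAME or IDENTIFIER using known typedef names."""
    typedef_names = set()   # the table the lexer consults
    tokens = []
    in_typedef = False      # stand-in for real parser state
    last_ident = None

    for spelling in TOKEN_RE.findall(source):
        if spelling in KEYWORDS:
            kind = "KEYWORD"
            if spelling == "typedef":
                in_typedef = True
        elif spelling[0].isalpha() or spelling[0] == "_":
            # The same spelling gets a different token kind depending on
            # what has been declared so far.
            kind = "TYPE_NAME" if spelling in typedef_names else "IDENTIFIER"
            if in_typedef:
                last_ident = spelling
        else:
            kind = "NUMBER" if spelling.isdigit() else "PUNCT"
            if spelling == ";" and in_typedef:
                # Simplified "parser" feedback: register the declared name.
                if last_ident is not None:
                    typedef_names.add(last_ident)
                in_typedef, last_ident = False, None
        tokens.append((kind, spelling))
    return tokens
```

For `typedef int T; T * x; y * x;` the second `T` comes out as `TYPE_NAME` while `y` stays an `IDENTIFIER`, which is exactly the distinction a context-free grammar has no way to express.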

ChubakPDP11@programming.dev on 10 Mar 2024 09:02

You’re right, yeah. Hand-implementing lexers and parsers is kind of ‘inane’. I’m not saying it’s stupid. For a small grammar it makes sense. But for a big grammar, just use a PEG generator, or Yacc/Lex. Rust has LALRPOP and Java has ANTLR. There’s truly no need to implement a parser from scratch. But people on the internet really seem to think using lexer and parser generators ‘limits’ them. There are some hacks involved in most Lex/Yacc or PEG specs, but in the end people should keep in mind that LR parsers MUST be generated!

Maybe implement the scanner? Even that is kinda stupid. Unless you do what Rob Pike says: www.youtube.com/watch?v=HxaD_trXwRE