Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

"but why wouldn't it be _stored_ as an AST?"

It profoundly is. You can't store "an AST". You can only store a serialization of it. The official language grammar is a serialization of the AST custom crafted for that language. It is as much an "AST" as any other serialization would be; all such alternative representations would all produce isomorphic memory representations if parsed from a proper library.

At a high level it may sound useful to try to then provide a cross-language AST representation, but it's one of those things that sounds great at a high level but as soon as you actually tried to implement it for, say, Python and C++, you'd rapidly discover that in practice there's not as much opportunity for "generic AST operations" as you may think.

The problem isn't that it isn't "stored as an AST" but that $YOUR_LANGUAGE apparently doesn't have good libraries or mechanisms for getting at it. Go, for instance, ships with the relevant bits of the compiler exposed, and as a result there are tons of tools that operate on Go code as ASTs and not textually, because it's readily available and supported by the core language team. I use this only as an example I know personally, there are other languages that have similar sorts of support as well.



> It profoundly is. You can't store "an AST". You can only store a serialization of it. The official language grammar is a serialization of the AST custom crafted for that language. It is as much an "AST" as any other serialization would be; all such alternative representations would all produce isomorphic memory representations if parsed from a proper library.

But a code file isn't coming out of a serializer when you hit save. It has all kinds of idiosyncrasies beyond the AST it turns into, and the user expects them all to be preserved. That makes it pretty profoundly not an AST.

And often it doesn't even parse.


What do you call “an AST plus the additional information (such as whitespace info) needed to recover the original text the AST was parsed from” (if that has a name)? I think I saw it mentioned somewhere as having a name.


I’ve seen “concrete syntax tree,” delightfully.


Thanks, that term came up with good search results. I appreciate it.


I feel like you're picking a strawman here. The AST serialization everyone is implying is one where you don't need to token/lex but can just load it directly & manipulate it (i.e. implying the on-disk version is a valid AST or one who's validity can be trivially validated without needing to have the entire language syntax & grammar). First, that makes the compiler much faster because tokenization/lexing is moved to the "save" phase which happens infrequently at human scale vs the compile/processing phase which happens in an automated fashion where the overhead can be notable. Additionally, if you mmap the AST from disk into memory, you can use finer-grained caching to memoize expensive analysis that happens for faster compiles of code that's only changed slightly (e.g. changing whitespace/comments wouldn't recompile anything).

More importantly for advocates, it avoids needing to ship the deserialization library and makes tooling simpler. That's really why the idea of a simple AST format is so attractive. Typically compiler frontends are typically very tightly coupled to the underlying middle & back end. There's some work in some languages to decouple this (e.g. LSPs & Idea's failable parsing approach), but the efforts are still very immature & it's still not clear to me that it's worth it (see the last paragraph).

The main underlying challenge with making sure the on-disk contents is well-formed according to the syntax rules is that frequently you want to pause work at an intermediate stage. This means you either have to make sure that whatever state the user saves is a valid AST via editor tricks (although I think this also typically means you have to design the language around it), you reject saves, every tooling library has to be capable of parsing malformed ASTs, or you save a dirty transformation to apply to the last known saved version so you can have the user resume editing but otherwise tooling uses the "last known good" version. That's the real challenge with having a serialized version that's amenable to 3p tooling for interop.

Finally, all the "serialize the AST" solutions ignore the problem of wanting to grep the codebase. This means you need to change out several decades of line-oriented manipulation tools in favor of new ones that are AST-based & likely more complicated to write/maintain as compared with one-line regular expressions. At least I've yet to see any AST manipulation libraries that aren't drastically different from existing text manipulation tools if clang-tidy and Rust macros are any indication about what good solutions to the problem look like today.

I think eventually we'll get AST serialization, but I think it will be packaged into an entirely new language (like Rust did with ownership) that also considers the tooling aspect end-to-end rather than as a retrofit into existing languages. Once that's successful, then I think we'll see retrofits because the space will have been better explored & other languages will benefit from the R&D into what a successful path would look like.


> that makes the compiler much faster because tokenization/lexing is moved to the "save" phase which happens infrequently at human scale

For dynamic languages like Ruby or Python, storing a pre-parsed representation makes it a little faster. But for compiled languages, lexing and parsing tend to be swamped by the codegen step


If you think about it more broadly where you can memoize expensive results of AST -> code gen transformation or AST -> AST simplification, then this will help significantly for codegen, especially for incremental builds but also clean builds if you have your CI cluster sharing build cache information with your local devs.

Also, for a language like Rust, I'm not sure that there isn't a significant amount of time spent validating ownership & doing type inference. These are the kinds of analysis you could save into the AST & thus save a significant amount of build time when talking about large projects. I agree for smaller projects a lot of these optimizations are probably unimportant.


For rust in particular, there is just such an intermediate representation called MIR that has desugared AST, control flow graph, type inference and lifetime information baked into it so it can be reused to speed up incremental builds.

But that's pretty far from what the programmer wants to interact with. In the context of the thread where we're discussing whether the source code is just a serialized version of the AST.

(Also, with Rust, type checking and borrow checking are expensive, but codegen still dominates)


Yup. I’m not claiming anyone has successfully applied the general concept of memoizing expensive compiler operations at the AST level. There’s all sorts of practical complexities that come up with such a design but conceptually I think it should be doable. Using your “the text is the serialized AST” argument, we see primitive tools around this concept that operate just on the text contents and codegen end-result with ccache/sccache/bazel/etc.

However, using a richer AST representation and knowledge about what expensive operations the compiler performs should let you retain cache acceleration even when the high-level text->compiled code fails. Heck, one could imagine that some AST processing could end up being agnostic to a codebase so that the compiler ends up memoizing certain results across all code bases since code is very similar and have similar constructs that should generate very case similarly.


> This means you need to change out several decades of line-oriented manipulation tools in favor of new ones that are AST-based

I wonder if a generic binary->text tool/library could solve this. Grep could check the file mime type then call the tool to convert from the binary format to the text format if available. I could see this being useful for a lot of binary formats.


I've always thought a nifty way to deal with this would be to use a FUSE filesystem. So you'd have a directory structure like:

    /workspace
      /ast
        /module-1.ast
      /fuse-src
        /module-1.txt
where opening and modifying "module-1.txt" would cause "module-1.ast" to be updated, and vice-versa. I actually think I've seen this approach used in some experimental language or another. I'm sure there would be a LOT of tricky edge cases and/or synchronization bugs to address, but it seems like it's not very far outside the realm of feasibility.


I think everyone may be interested in: https://github.com/afnanenayet/diffsitter

Github having an option to have their PR GUI use an AST diff like this could be a fun and useful option.


And this is why I love Python. It forces people to use the same coding standard in some regards, and it forces people to indent properly.

I really don't care anymore what that standard might be (well, ok, I do prefer tabs) but I do care that it be consistent. And I do DEMAND that proper nested indentation be respected. Source code is meant to be human readable.


Python is not unique nor innovative in that respect. Even FORTRAN and COBOL programs from the early days had very strict rules about indentation and blocks.

The thing is, there's no reason we have to store the code like text. Even the punch card got this right: the program wasn't stored as text, it was stored as physical holes in paper. A very experience programmer could often look at a card with just the holes and have a rough idea what it encoded, provided they knew if it was EBCDIC or ASCII or whatever, but the computer didn't care. The printed representation across the top line of the card was just that: a representation.


I didn't claim that Python is unique nor innovative. However, FORTRAN and COBOL are not modern languages in the sense that one can reasonably expect a large selection of first- and third- party libraries for most common situations, and their availability on e.g. servers is far more limited, thus learning them just for scripting is not as good a choice as is Python.


> Even the punch card got this right: the program wasn't stored as text, it was stored as physical holes in paper.

We still do that, storing executable programs as executable binary rather than text. What else could you do?


We store them as serialized text. We could store them as nodes in an AST; we could store them as OLE/CFBF structures like older versions of Microsoft Word, or do what Ted Nelson suggested decades ago: T. H. Nelson, “Complex information processing: a file structure for the complex, the changing and the indeterminate,” in Proceedings of the 1965 20th national conference, New York, NY, USA, Aug. 1965, pp. 84–100. doi: 10.1145/800197.806036.


We do not store executables as serialized text, unless they are meant to be executed by an interpreter.


In the context of this thread, where we are discussing Git as a medium for storing the form of the programs in a format that is meant to be read and maintained by humans, we do store the programs in text.


No, in the context of this thread we are storing programs as punch cards, which are not meant to be read or maintained by humans. They are the executable form of the software, that's all.


No, punch cards are not the binary. Punch cards contain source which is compiled.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: