MASAK CARL WILHELM
Graduated from Uppsala University; metaprogramming domain expert.
Cangjie macro expansion
Macros are convenient function-like procedures that help generate code during the compilation of a program. Cangjie is one of relatively few modern programming languages that support macros: Rust and Swift have them too, but Go, Python, C#, JavaScript, and Kotlin do not.
In this article, we'll take a look at what happens under the hood when the Cangjie compiler finds a macro in the source code of a program it's compiling.
Tokens in, Tokens out
You can compare macros to functions: they take tokens as input and give tokens back as output. Tokens are the smallest units of the input program: individual keywords, string literals, operators, etc.
This happens while the program is being compiled, but the macro itself is running code (a kind of "runtime" within the compilation). That is what gives macros their strength and flexibility: anything you can do in ordinary Cangjie, you can also do in a macro.
@LogEnterExit
func foo(x: Int64) {
    println("Got a value ${x}")
}

foo(42) // prints: Entered 'foo'
        //         Got a value 42
        //         Exited 'foo'
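To make "tokens in, tokens out" concrete, here is a toy model written in Python rather than Cangjie. It is only a sketch: the helper names `tokenize` and `log_enter_exit` are made up for illustration, the tokenizer is far cruder than a real lexer, and the generated code is a guess at what a macro like @LogEnterExit might produce, based only on the printed output shown above.

```python
import re

def tokenize(source):
    """Split source text into toy tokens: string literals, words, or single symbols."""
    return re.findall(r'"[^"]*"|\w+|[^\s\w]', source)

def log_enter_exit(tokens):
    """Toy stand-in for @LogEnterExit: takes a token list, returns a token list."""
    name = tokens[tokens.index("func") + 1]                # identifier after `func`
    body_start = tokens.index("{") + 1                     # first token of the body
    body_end = len(tokens) - 1 - tokens[::-1].index("}")   # index of the final `}`
    enter = tokenize(f'println("Entered \'{name}\'")')
    leave = tokenize(f'println("Exited \'{name}\'")')
    # Keep the header, then: enter message, original body, exit message, closing brace.
    return (tokens[:body_start] + enter
            + tokens[body_start:body_end] + leave
            + tokens[body_end:])

source = 'func foo(x: Int64) { println("Got a value ${x}") }'
expanded = log_enter_exit(tokenize(source))
print(" ".join(expanded))
```

The point of the model is that the "macro" never sees a function object or a syntax tree, only a flat sequence of tokens, and it hands back another flat sequence.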
The act of going from the macro's input to the macro's output is called "macro expansion". The word "expansion" here is a little unfortunate, because it implies that the output code is always larger than the input code. Typically it is, but not always.
Macro expansion happens after parsing but before type checking; the goal of the macro expansion stage in the compiler is to find and expand all macro invocations until none of them are left.
Annotation macros and expression macros
There are two syntaxes for making macro invocations:
one is on top of declarations (annotation macros, or "macros without parentheses")
one is as part of expressions (expression macros, or "macros with parentheses")
class C {
}
println("The answer is " + )
One subtle detail is that this difference is in how you invoke the macro; the macro declaration itself only cares about receiving tokens, and doesn't make a syntactic distinction between annotation macros and expression macros.
Another difference is that annotation macros receive input which the parser has already checked to be a valid declaration, whereas expression macros receive arbitrary input: any sequence of tokens, without any attached meaning. In particular, an expression macro can receive input which isn't a valid Cangjie expression at all, but perhaps a piece of some domain-specific language (DSL).
A question about order
In which order should we expand our macro invocations in the program?
The correct answer is "it shouldn't matter", if the macros are pure and without side effects. In practice, macros aren't always pure; more about this later.
A "left-to-right" order is chosen, following the principle of least surprise; this is similar to the order in which the program is evaluated at runtime.
Nested macros
But there's an exception to the "left-to-right" order: macro invocations can contain other macro invocations inside of their argument.
class C {
func m() {
// ...
}
}
println( )
println("some more Cangjie code")
)
SELECT * FROM MyTable WHERE @GenerateWhereClause()
)
Here we have a choice, basically a language design choice: do we handle inner/nested macro invocations first, or outer/wrapping macro invocations first? There is no right answer here: Lisp languages famously do outermost first; Cangjie and a number of other languages do innermost first.
One nice rationalization is that innermost-first macro expansion mirrors the evaluation order of nested arithmetic expressions; one always evaluates the leaves first and works up the tree.
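The ordering rules so far (left to right, but innermost first for nested invocations) can be sketched with a small Python model. This is an illustration only, not the Cangjie compiler's algorithm: `Call`, `expand`, and the two toy macros are hypothetical names, and a program is modeled as a list whose items are either plain tokens or nested macro invocations.

```python
from dataclasses import dataclass

@dataclass
class Call:
    """A macro invocation: a macro name plus argument items (tokens or nested Calls)."""
    name: str
    args: list

def expand(items, macros, log):
    """Expand every Call in `items`, left to right, innermost first."""
    out = []
    for item in items:                                # left to right
        if isinstance(item, Call):
            inner = expand(item.args, macros, log)    # arguments first: innermost first
            log.append(item.name)                     # record the order macros actually run in
            out.extend(macros[item.name](inner))
        else:
            out.append(item)
    return out

macros = {
    "Outer": lambda tokens: ["["] + tokens + ["]"],
    "Inner": lambda tokens: [t.upper() for t in tokens],
}
program = ["a", Call("Outer", ["b", Call("Inner", ["c"])]), Call("Inner", ["d"])]
log = []
result = expand(program, macros, log)
print(result)   # ['a', '[', 'b', 'C', ']', 'D']
print(log)      # ['Inner', 'Outer', 'Inner']
```

The log shows that the nested Inner runs before the Outer that wraps it, even though Outer appears first in the source, while the second, unrelated Inner simply runs in left-to-right position.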
Macros that generate macro invocations
Macros can generate any code; this code can contain further macro invocations.
This might seem to reopen the question of expansion order, but it doesn't: we follow the same rules as before, innermost first and otherwise left to right. There is also an implicit ordering, since a generated macro invocation cannot be expanded before the macro that generates it has run.
This is a powerful technique; in fact, one could use it to create an infinite loop during compilation of a program. More usefully, it can be used to create a kind of "delegation" mechanism between macros, similar to delegation between functions.
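Both the delegation pattern and the infinite-loop risk can be shown in a toy Python model in which the expander re-expands whatever a macro returns. All names here are hypothetical, and the `max_depth` guard is an assumption of this sketch; the article does not say how (or whether) the Cangjie compiler bounds runaway expansion.

```python
from dataclasses import dataclass

@dataclass
class Call:
    """A macro invocation: a macro name plus argument items (tokens or nested Calls)."""
    name: str
    args: list

class ExpansionLimitError(Exception):
    """Raised by this toy expander when generated invocations never run out."""

def expand(items, macros, depth=0, max_depth=50):
    if depth > max_depth:
        raise ExpansionLimitError("macro expansion did not terminate")
    out = []
    for item in items:
        if isinstance(item, Call):
            args = expand(item.args, macros, depth, max_depth)   # innermost first
            produced = macros[item.name](args)
            # The output may itself contain invocations, so expand it again.
            out.extend(expand(produced, macros, depth + 1, max_depth))
        else:
            out.append(item)
    return out

macros = {
    "Once": lambda tokens: ["println", "("] + tokens + [")"],
    "Twice": lambda tokens: [Call("Once", tokens), Call("Once", tokens)],  # delegates to Once
    "Forever": lambda tokens: [Call("Forever", tokens)],                   # never finishes
}

print(expand([Call("Twice", ['"hi"'])], macros))
try:
    expand([Call("Forever", [])], macros)
except ExpansionLimitError as err:
    print("stopped:", err)
```

Here Twice does no printing work of its own; it hands the job to Once by returning two invocations of it, which the expander then picks up.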
Quoting: code talking about code
Inside of a macro, we tend to use code quotes to construct sequences of tokens. These will ultimately be returned by the macro as the expanded code.
public macro CodeQuote(input: Tokens) {
    return quote(
        println("Hi")
    )
}
We could theoretically do without code quotes, and instead manually create all the tokens we wanted using token constructors. But this would be horrible in practice. Using code quotes allows us to "show" which code we want to represent instead of having to painstakingly "tell" how to construct it.
Code quotes support interpolation, which allows us to mix "fixed" parts of the constructed code with "varying" parts, supplied as either tokens or AST fragments.
public macro Interpolation(input: Tokens) {
    return quote(
        println("before")
        $(input)
        println("after")
    )
}
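As a rough analogy only, interpolation can be modeled in Python as splicing one token list into another. The `quote` helper below is a hypothetical stand-in, not Cangjie's `quote`: a string stands for one fixed token, and a list stands for an interpolated `$(...)` part.

```python
def quote(*parts):
    """Build a flat token list; a list part is spliced in, like $(...) in a code quote."""
    out = []
    for part in parts:
        if isinstance(part, str):
            out.append(part)      # fixed token, written out literally
        else:
            out.extend(part)      # varying part, supplied by the caller
    return out

def interpolation(tokens):
    """Toy counterpart of the Interpolation macro above."""
    return quote(
        "println", "(", '"before"', ")",
        tokens,
        "println", "(", '"after"', ")",
    )

print(interpolation(["foo", "(", ")"]))
```

Even in this crude form, the fixed parts read like the code they stand for, while the varying part is just a value passed in.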
A horrible mistake that was averted
The "delegation" mechanism mentioned above was originally specified in a different way: macro invocations nested in code quotes would immediately fire, invoke the macro, and be replaced with their result.
This part of the spec was clearly motivated by the realization of how useful so-called "recursive macros" could be, but it rested on a confusion about when macro expansion should take place.
Today's mechanism is as follows: code quotes don't do any expansion; they are just inert descriptions of code. If a macro returns tokens to the macro expander and those tokens contain a macro invocation, the macro expander expands that invocation as usual. The mechanism is similar to asynchronously queueing up tasks for an implicit scheduler to handle.
This leads to nice and sensible slogans: Code quotes only quote. Only the macro expander expands macros.
Parallel macros
Users who used a lot of macros found that the macro expansion phase could take too long sometimes. This led to a re-examination of the idea of expansion order of macros. Might it be possible to process macros in parallel sometimes (delegating the work to multiple threads, say) instead of left-to-right?
// these used to expand left-to-right
// now they can expand in parallel
Yes, it's possible, and we changed the spec to work that way. Note that if macros are nested, we still handle them innermost first, and if a macro generates a macro invocation, we still necessarily need to handle those sequentially. But all other situations are "unordered" and can expand in parallel.
On paper, there's an expectation that macros be "pure" and without any observable side effects. In practice, macro authors sometimes resort to impure techniques, such as storing information in a global variable. In that case, those macro authors need to think like parallel programmers, and use methods and data structures which protect against data races.
Macros with context
We noticed some macro authors used nested macros (which worked well), but needed to collect information in the inner macros which the outer macros could then use. The only way to do this was to use something like package-global variables in the macro package, which felt like a weak pattern.
Therefore, we provided "macros with context", which have two major features. The first lets an inner macro make sure that it is properly nested inside an expected outer macro. The second is a messaging system, where an inner macro can "send" information which the outer macro later "receives".
public macro Inner(input: Tokens): Tokens {
    AssertParentContext("Outer")
    SetItem("key1", "value1")
    SetItem("key2", "value2")
    // ...
}

public macro Outer(input: Tokens): Tokens {
    let messages = GetChildMessages("Inner")
    for (m in messages) {
        let value1 = m.getString("key1")
        let value2 = m.getString("key2")
        // ...
    }
}
In practice, this made us more confident that the innermost-first expansion order was the right one for Cangjie. Macro authors who wanted the outermost-first expansion order could now instead use the messaging system in "macros with context".
Even better, by nudging macro authors towards messaging, we could remove the biggest single reason they were resorting to using global variables. Of course we made the messaging internally use thread-safe data structures, which means that it's a safe default, and data races are avoided.
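To see why messaging fits innermost-first expansion so well, here is a toy Python model of the idea. It is a sketch, not Cangjie's implementation: the `Context` class and its snake_case methods merely mirror the names used above, messages are delivered only to the immediately enclosing macro (what happens with deeper nesting is not modeled), and thread safety is ignored.

```python
from dataclasses import dataclass

@dataclass
class Call:
    """A macro invocation: a macro name plus argument items (tokens or nested Calls)."""
    name: str
    args: list

class MacroContextError(Exception):
    pass

class Context:
    def __init__(self, ancestors, inbox, outbox):
        self._ancestors = ancestors   # names of the enclosing macros
        self._inbox = inbox           # child macro name -> list of message dicts
        self._outbox = outbox         # this macro's own message to its parent

    def assert_parent_context(self, name):
        if name not in self._ancestors:
            raise MacroContextError(f"must be nested inside @{name}")

    def set_item(self, key, value):
        self._outbox[key] = value

    def get_child_messages(self, child_name):
        return self._inbox.get(child_name, [])

def expand(items, macros, ancestors=(), parent_inbox=None):
    out = []
    for item in items:
        if not isinstance(item, Call):
            out.append(item)
            continue
        inbox = {}   # filled in while the nested macros run (innermost first)
        args = expand(item.args, macros, ancestors + (item.name,), inbox)
        outbox = {}
        out.extend(macros[item.name](args, Context(ancestors, inbox, outbox)))
        if parent_inbox is not None and outbox:
            parent_inbox.setdefault(item.name, []).append(outbox)
    return out

def inner(tokens, ctx):
    ctx.assert_parent_context("Outer")
    ctx.set_item("field", tokens[0])
    return tokens

def outer(tokens, ctx):
    fields = [m["field"] for m in ctx.get_child_messages("Inner")]
    return tokens + ["//", "fields:"] + fields

macros = {"Inner": inner, "Outer": outer}
program = [Call("Outer", [Call("Inner", ["a"]), Call("Inner", ["b"])])]
print(expand(program, macros))
```

Because the inner macros have already run by the time the outer macro starts, their messages are all waiting for it; no global variable is needed to carry the information upward.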
Macro hygiene, or the lack thereof
Advanced macro systems tend to have "macro hygiene". This takes some explaining, but it comes down to being able to reason about the variables in your code: each variable should mean the same thing in the fully expanded program as it did when you wrote it. (The "meaning" of a variable here is simply the declaration it refers to.)
// this works...
func main() {
    var x = "first"
    var y = "second"
    @swap(x, y)
    println(x) // second
    println(y) // first
    0
}

public macro swap(tokens: Tokens) {
    // pretend this works
    let (expr1, expr2) = ParseExprCommaExpr(tokens)
    return quote(
        var temp = $expr1
        $expr1 = $expr2
        $expr2 = temp
    )
}

// ...but this fails, due to missing macro hygiene
func main() {
    var x = "first"
    var y = "second"
    @swap(x, y)
    var temp = "temporary" // boom! – compile error
    0
}
Cangjie's macro system does not guarantee macro hygiene in all cases. The reason for this is that we pass around identifiers (for variables) without any of the necessary context. Changing Cangjie's macro system to guarantee more hygiene, while also keeping the simplicity of the token-based handling of input and output, seems challenging.
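In macro systems without hygiene, a common manual workaround is for the macro to invent a fresh, unlikely name for each temporary it introduces (often called "gensym"). The Python sketch below models that general technique on flat token lists; it is not a Cangjie feature described in this article, and `gensym`, `swap_naive`, `swap_fresh`, and `has_redeclaration` are hypothetical helpers.

```python
import itertools

_counter = itertools.count()

def gensym(prefix="tmp"):
    """Return a name that is unlikely to clash with user-written identifiers."""
    return f"__{prefix}_{next(_counter)}"

def swap_naive(a, b):
    """Mirrors the swap macro above: always introduces a variable called `temp`."""
    return ["var", "temp", "=", a, ";", a, "=", b, ";", b, "=", "temp"]

def swap_fresh(a, b):
    """Same expansion, but with a freshly generated name for the temporary."""
    t = gensym("swap")
    return ["var", t, "=", a, ";", a, "=", b, ";", b, "=", t]

def has_redeclaration(tokens):
    """True if the same name follows `var` more than once in a flat token list."""
    names = [tokens[i + 1] for i, tok in enumerate(tokens)
             if tok == "var" and i + 1 < len(tokens)]
    return len(names) != len(set(names))

user_code = ["var", "temp", "=", '"temporary"']
print(has_redeclaration(swap_naive("x", "y") + [";"] + user_code))   # True
print(has_redeclaration(swap_fresh("x", "y") + [";"] + user_code))   # False
```

This only lowers the odds of a clash rather than ruling it out, which is the difference between a naming convention and real hygiene.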
Summary
Macros are a very powerful programming language feature. They are so powerful that sometimes the excessive power leads to problems. In Cangjie, a lot of effort has gone into providing the best of what macros can offer, while also making them safe and usable by default.