The version of simple described in this document is version 2.1.1. This document is quite alpha right now, but I have good hopes that it will get beyond that stage some day.
I once read a fortune joke about ``the lesser-known programming languages'' which described the language SIMPLE as being composed of only two instructions, ``BEGIN'' and ``STOP'', neither of which did anything: in that way the same things can be achieved as with other programming languages but without any need for frustration and tedious debugging. As version 1.0.0 of simple had exactly two instructions, @id@ and @void@, I thought the name SIMPLE was quite adequate.
More seriously, when I first started writing simple, I intended to write something very simple, as I had very modest means. It turns out that I produced a language far more complicated and powerful than I had first expected. In fact, in version 1.x, the syntax was so unbelievably strange (albeit completely logical) that SIMPLE programs were the most complicated thing in the world to understand. So the name now remains as a piece of irony, even though in version 2.x the syntax was completely modified (to look a little like HTML, or ``SiMpLe'').
simple is of the ``macro processing'' kind, as are for example cpp and m4. In other words, its essential action consists of evaluating and expanding macros which are defined by the user (generally). In fact, SIMPLE is similar to m4 in its functioning (I was strongly inspired by the sources of m4 when I wrote simple); but it is quite different in its syntax (not so much in version 2.x as in version 1.x but still).
It then appeared that I needed a macro processor to convert whatever meant "fraction with numerator 2 and denominator 3" into whatever output I needed (like {2\over 3}). I first thought of using m4 (cpp is out of the question of course). Unfortunately, m4 is made to handle mainly programs and not text files, so I encountered all sorts of difficulties. First of all, I would have had to use the ``m4_ prefix on all builtins'' option because m4 interprets macros wherever they are found (there is no special macro invocation character) and that can be a pain. But most annoying was the problem with the backtick (`) character: apparently the only way in m4 to write a macro which will produce a backtick (without permanently changing the quote characters, because otherwise the same problem would occur for whatever happens to be the open quote character) is to write (with m4_ prefixes):
m4_changequote(`[',`]')m4_define(__lq,`)m4_changequote([`],['])m4_dnl
m4_define(_lq,`m4_changequote(`[',`]')__lq[]m4_changequote([`],['])')m4_dnl
which makes the macro _lq produce a single left quote (exercise for those who know m4 a bit: why did I need the __lq macro and why can't I just use that?). Another thing I do not like about m4 is that it does not gobble comments (one might wonder why they're called comments, then), so either one has to use dnl to produce comments or one has to change the comment character to that of TeX, which involves making an assumption as to what it is, precisely the sort of thing I was trying to avoid. Anyhow, m4 did not suit my needs, so I just had to put my hands in the dirt and write my own macro preprocessor, which is what I did. Et dixi ``fiat SIMPLE''. Et SIMPLE fit.
As to my ``universal TeX'' project, ``STeX'' (for SIMPLE-preprocessed TeX), it has only just begun. But my current idea is to have the files processed by SIMPLE, and even before that by a tiny program that changes all ISO8859-1 characters (which I use a lot because I occasionally write in French) to SIMPLE macros (because SIMPLE does not permit invocation of macros by a single character - and on the other hand it's a pain to have to write a SIMPLE macro invocation for every accented character). SIMPLE, of course, might change these macros back to the ISO8859-1 character in question, if ISO8859-1 input is recognized by whatever form of TeX (or other) is sought.
There was a time when I attempted to classify programming languages - it proved fruitless: each programming language seems to occupy its very own class. This applies to macro processing languages. They seem to be closer to functional languages (such as caml or Miranda) than to imperative languages (such as C or Pascal), but the issue is not altogether clear.
One thing that can help distinguish programming languages is the kind of calling mechanism which they use. The kind of calling mechanism which macro processors use is ``call-by-value'', which means that the arguments to a function (macro) are evaluated first, before the macro is itself expanded, and even if the macro does not need these arguments. So, essentially, if you write ignorearg(screwupall()), everything does get screwed up, contrary to what would happen if call-by-name were used (this obviously illustrates the infinite superiority of macro processing languages :-). Still, macro processing languages provide ways to inhibit evaluation: that is called ``quoting'', and we will have much more to say on the subject.
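The eager evaluation of arguments described above can be illustrated outside of any macro processor. The following is only a sketch in Python (which itself evaluates arguments before a call); the names screw_up_all and ignore_arg are made up for the illustration, and call-by-name is merely simulated by passing an unevaluated function.

```python
# Illustration only: the names below are hypothetical, not part of simple or m4.
log = []

def screw_up_all():
    """Side-effecting 'macro': records the fact that it was evaluated."""
    log.append("screwed up")
    return "garbage"

def ignore_arg(_arg):
    """Ignores its argument entirely and expands to nothing."""
    return ""

# Eager evaluation: the argument is evaluated before ignore_arg is entered,
# so the side effect happens even though the argument is never used.
ignore_arg(screw_up_all())
eager_log = list(log)

# Simulated call-by-name: pass an unevaluated thunk; ignore_arg never forces it.
log.clear()
ignore_arg(lambda: screw_up_all())
lazy_log = list(log)

print(eager_log)  # ['screwed up']
print(lazy_log)   # []
```

Passing the lambda plays roughly the role that quoting plays in a macro processor: the argument is handed over unevaluated.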
In an ideal functional language, functions cannot have global effects, so that calling the same function twice with the same arguments should produce the same result. That restriction does not apply to macro processing languages: a macro may modify a variable (that is, redefine a macro), so that applying it twice may yield completely different results.
Macro processors resemble functional languages in that there is, really, no such thing as an ``instruction'', at least no difference between ``expressions'' and ``instructions''. A functional language (say, pure lambda-calculus) may be completely untyped, everything being of the ``function'' type. As far as macro processors go, everything is of the ``list'' type, where ``list'' means ``list of tokens'' or ``character string'' as the case may be.
The central idea behind a macro processor is that of ``re-evaluation'': when a macro has been evaluated (expanded), the expansion obtained is fed back to the input so that it will be evaluated again. Only non-macro tokens and quoted elements are not (re)evaluated. As a very simple example, suppose that the macro infiniteloop evaluates to infiniteloop; then that expansion will be re-evaluated, causing an infinite loop. Wonderful invention, the wheel.
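The re-evaluation loop can be sketched as follows. This is a toy model in Python, not the actual implementation: it assumes that tokens are whitespace-separated words, that a macro is invoked by its bare name (m4-like), and it adds a hypothetical max_steps guard so that the infiniteloop case stops with an error instead of spinning forever.

```python
from collections import deque

def expand(words, macros, max_steps=1000):
    """Toy rescanning loop: a macro's expansion is fed back to the input."""
    pending = deque(words)   # tokens still to be read
    output = []              # tokens shipped out
    steps = 0
    while pending:
        word = pending.popleft()
        if word in macros:
            steps += 1
            if steps > max_steps:
                raise RuntimeError("expansion did not terminate")
            # Push the expansion back on the FRONT of the input, in order,
            # so that it is read (and possibly expanded) again.
            pending.extendleft(reversed(macros[word].split()))
        else:
            # Non-macro tokens are not re-evaluated: they go to the output.
            output.append(word)
    return " ".join(output)

macros = {"greeting": "hello name", "name": "world"}
print(expand(["greeting", "!"], macros))  # hello world !
```

With macros = {"infiniteloop": "infiniteloop"}, the same function raises RuntimeError once the guard is exceeded.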
As a slightly more sophisticated example of perpetual motion, let us suppose we have a macro double which takes a parameter and evaluates to that parameter applied to itself. Then we might apply the macro double on itself, which will result in double being applied to itself, and so on, perpetually re-evaluating the same thing. Now there is one important thing to note: we should not write double(double) (if the syntax is m4-like, say) to mean ``double applied to itself'', because if we write that, then the ``inner'' double gets evaluated first (as are any arguments), resulting in either nothing at all or in an error, as it was not given any arguments. Rather, we should quote the inner double to prevent its evaluation and pass the double object itself (rather than its evaluation) to the ``outer'' double. So in m4 we would write double(`double'). In fact, the complete program in m4 is:
define(`double',`$1(`$1')')double(`double')
Try it and watch your computer start spinning like mad (note that there are three pairs of quotes in the definition of double, the really interesting one being the inner one which sees to it that double(`double') does indeed evaluate to double(`double') and not simply to double(double)). The corresponding program in SIMPLE is:
<def|[double]|[<@1@|[@1@]>]><double|[double]>
<def|greet|Hello, @1@!>%
<greet|world>->
Hello, world!
The part before the arrow (->) is the input which is presented to simple and the second part is the output produced by it.
We encourage readers to try all the examples.
DON'T PANIC->
DON'T PANIC
In other words, simple just copies to the output whatever it is fed in; that is true so long as the input does not contain any of the ten special characters, which are (by default) `, ", %, <, |, >, #, [, ] and @.
Santa Claus `<santa.claus`@toys.np`>->
Santa Claus <santa.claus@toys.np>
Note that it is not an error to escape an ordinary (i.e. not special) character: it just leaves the ordinary character in question unaltered.
Wonderful`!->
Wonderful!
It is occasionally useful to escape several special characters in a row without having to put a backtick in front of each of them. There is a way to do that: put a double quote (") character around the whole string which should be escaped (one at the beginning and one at the end). Note that this method works provided that the characters in question do not include the backtick or the double quote characters themselves. Let us repeat that: within double quotes, every special character loses its special meaning except the backtick (which is still used to escape the next character, generally another backtick or double quote) and the double quote (which is used to terminate the escaped region).
A few more examples should make all this clear:
Santa Claus "<santa.claus@toys.np>" "Percentage #2 is 25%, and that is < [or >] than @ home." There are no quotes around "this". However, there are quotes around `"this`". "Even inside quotes, the backtick (``) must be escaped, as in: ````." "The same applies to the quotes (`") themselves, obviously."->
Santa Claus <santa.claus@toys.np> Percentage #2 is 25%, and that is < [or >] than @ home. There are no quotes around this. However, there are quotes around "this". Even inside quotes, the backtick (`) must be escaped, as in: ``. The same applies to the quotes (") themselves, obviously.
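The two escaping mechanisms shown above (backtick and double quotes) can be modelled by a small function. This is only a sketch in Python of the rules as stated in this section, not the real lexer; the names lex_escapes and literal_text are invented for the illustration, and the default set of special characters is assumed.

```python
SPECIAL = set('`"%<|>#[]@')  # the ten default special characters

def lex_escapes(text):
    """Return (char, is_literal) pairs after resolving ` and " escapes."""
    pairs = []
    in_quotes = False
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "`":
            # Backtick: the next character is taken literally, even in quotes.
            # A backtick at the very end of the input is simply dropped.
            if i + 1 < len(text):
                pairs.append((text[i + 1], True))
            i += 2
        elif ch == '"':
            # Double quote: opens or closes an escaped region; not output.
            in_quotes = not in_quotes
            i += 1
        else:
            # Inside quotes every character is literal; outside quotes only
            # ordinary (non-special) characters are.
            pairs.append((ch, in_quotes or ch not in SPECIAL))
            i += 1
    return pairs

def literal_text(text):
    """Characters that survive escaping, ignoring their special status."""
    return "".join(ch for ch, _ in lex_escapes(text))

print(literal_text('Santa Claus `<santa.claus`@toys.np`>'))
# Santa Claus <santa.claus@toys.np>
print(literal_text('Santa Claus "<santa.claus@toys.np>"'))
# Santa Claus <santa.claus@toys.np>
```

Characters flagged as not literal (an unescaped < for instance) would still have to be interpreted by later stages, which this sketch leaves out.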
This is ordinary text %and this is a comment.
and this is the continuation of it. Note how the new line was swallowed by the comment.
10`% of 90 is 9.
`%This is not a comment %but this is.
so it should appear on the output.->
This is ordinary text and this is the continuation of it. Note how the new line was swallowed by the comment.
10% of 90 is 9.
%This is not a comment so it should appear on the output.
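The comment rule illustrated above (a % swallows everything up to and including the end of the line, unless the % is escaped) can be sketched like this. It is a Python model under the assumptions of this section, with an invented function name; it leaves escape sequences untouched so that they can be resolved by a later pass, and it treats a comment that runs into the end of the input as an error.

```python
def strip_comments(text):
    """Remove %-comments, including the newline that terminates them."""
    out = []
    in_quotes = False
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "`":
            # Escaped pair (for instance `%): copied through unchanged.
            out.append(text[i:i + 2])
            i += 2
        elif ch == '"':
            # Inside double quotes the % character is not special.
            in_quotes = not in_quotes
            out.append(ch)
            i += 1
        elif ch == "%" and not in_quotes:
            end = text.find("\n", i)
            if end == -1:
                raise ValueError("end of file inside a comment")
            i = end + 1  # skip the comment and its terminating newline
        else:
            out.append(ch)
            i += 1
    return "".join(out)

print(strip_comments("This is ordinary text %a comment\nand more text\n"))
# This is ordinary text and more text
```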
The `<id`> builtin just evaluates to its first argument:
<id|First argument|Second argument|Third argument>
Of course, if there is only one argument, it evaluates to that:
<id|(of course)>
As for the `<void`> builtin, it is even less useful: it evaluates to nothing:
<void|SIMPLE is really stupid!>->
The <id> builtin just evaluates to its first argument:
First argument
Of course, if there is only one argument, it evaluates to that:
(of course)
As for the <void> builtin, it is even less useful: it evaluates to nothing:

Perhaps you don't see it, but there's an empty line at the end of the output in the previous example. That is because the linefeed character after the last printable character in the input was copied to the output.
Note that it is quite possible to call a function with no argument (this was not the case in simple version 1.x). For example, <mymacro> calls the macro mymacro with no arguments. As a convention, when one speaks of a macro, one puts angles around it, even if it doesn't make any sense to call it without arguments. Note that the <id> and <void> macros can be called without arguments, and they evaluate to nothing at all (as is always the case for the <void> macro).
Note that an argument to a macro may perfectly well contain itself a macro call. That constitutes a nested macro call, and it works just like you'd think (the major problem is not to get lost):
<id|This <id|is><void| stupid, <id|really>> a <id|<id|nested>> macro call.|No>->
This is a nested macro call.
<def|macro|This is a simple macro.>
<macro>->

This is a simple macro.
Note the empty line in the output (before the ``This is a simple macro.'' line). That is because the line feed character on the first line of input was not gobbled by anything. To avoid this, one generally uses a comment character. So one would have:
<def|macro|This is a simple macro.>%
<macro>->
This is a simple macro.
Now how about parameters? We have seen that builtin macros can take parameters. How about user-defined macros? Well, they can take parameters also. To use parameters in a user-defined macro, the definition may contain the special strings @1@, @2@ (and so on) which get replaced by the first, second (and so on) argument when the macro is called. Here are a few examples:
<def|greet|Hello, @1@!>%
<greet|world>
<def|introduce|Dear @1@, let me introduce you to @2@.>%
<introduce|Peter|Paul>
<def|exch|@2@,@1@>%
<exch|First|Second>
<exch|First|Second|Third>
<exch|First>->
Hello, world!
Dear Peter, let me introduce you to Paul.
Second,First
Second,First
,First
Note from the last line that when a user macro is called with fewer arguments than it was intended for, the missing arguments get replaced by empty strings. Conversely, the second-to-last line shows that when a user macro is called with more arguments than intended, the extra arguments are simply discarded. This, however, does not apply to builtins: a builtin makes precise assumptions about its number of arguments, and when these assumptions are not met, an error will occur. For example, <def> expects exactly two arguments, and if this is not the case, simple will complain.
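The argument replacement rules just described (missing arguments become empty strings, extra arguments are discarded) can be sketched in a few lines. This is a Python model with an invented function name, not the actual code; it only handles numeric markers from @1@ upward and assumes the definition string contains no other use of the @ character.

```python
import re

def substitute(definition, args):
    """Replace each @n@ in the definition by the n-th call argument."""
    def repl(match):
        n = int(match.group(1))
        # Missing arguments expand to the empty string; arguments beyond
        # those mentioned in the definition are simply never looked at.
        return args[n - 1] if 1 <= n <= len(args) else ""
    return re.sub(r"@(\d+)@", repl, definition)

print(substitute("Hello, @1@!", ["world"]))                 # Hello, world!
print(substitute("@2@,@1@", ["First", "Second"]))           # Second,First
print(substitute("@2@,@1@", ["First", "Second", "Third"]))  # Second,First
print(substitute("@2@,@1@", ["First"]))                     # ,First
```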
<def|i|GARBAGE>%
<def|i|42>i=<i>
<def|i|1729>i=<i>
<def|i|````infinity''>i=<i>->
i=42
i=1729
i=``infinity''
(The first line is useless of course, but just leave it alone.) Now suppose we want to write a macro <printi> which should print i= followed by the value of <i> as we have done above. The obvious thing would be to write
<def|printi|i=<i>>
However, this does not work. And indeed, here is what we get:
<def|i|GARBAGE>%
<def|printi|i=<i>>%
<def|i|42><printi>
<def|i|1729><printi>
<def|i|````infinity''><printi>->
i=GARBAGE
i=GARBAGE
i=GARBAGE
It is clear what has gone wrong: <i> was evaluated when <printi> was defined, and not when it was called, so that the definition of <printi> is i=GARBAGE and not i=<i> as we wanted.
So how do we get around this? That's where ``quoting'' comes in. We want to define <printi> to be (begin quote) i=<i> (end quote), and not i=whatever happens to be the value of <i> when the definition is entered. In other words, we want to prevent <i> from being evaluated when the arguments to <def> are read. This is called quoting and it is done by means of the two quote characters, namely [ and ]: whatever is between them will be kept unevaluated for the time being, and will be passed as such as an argument. So we now write:
<def|i|GARBAGE>%
<def|printi|[i=<i>]>%
<def|i|42><printi>
<def|i|1729><printi>
<def|i|````infinity''><printi>->
i=42
i=1729
i=``infinity''
as was wanted.
We give another, even more striking example:
Compare this:
<def|test|foo>%
<def|foo|<test>>%
<def|bar|[<test>]>%
<def|test|bar>%
<foo> (this should be ````foo'')
<bar> (this should be ````bar'')->
Compare this:
foo (this should be ``foo'')
bar (this should be ``bar'')
Here, <test> is first defined to be foo and then bar. In the meantime, <foo> and <bar> are defined: <foo> is defined to be whatever <test> is at the time of definition, viz. foo. As for <bar>, it is defined to be <test> (literally), so that when <bar> is evaluated, it gets replaced by <test> and that gets replaced by whatever the value of <test> is, in our case, bar.
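The difference between <foo> and <bar> above is the difference between evaluating at definition time and evaluating at call time. A loose analogy in Python (an illustration only, with an invented call helper; it does not model SIMPLE's syntax) stores the unquoted body as an already-computed value and the quoted body as a function to be run later.

```python
macros = {}
macros["test"] = "foo"

# Like <def|foo|<test>>: the body is evaluated when the definition is read.
macros["foo"] = macros["test"]

# Like <def|bar|[<test>]>: the quoted body is stored unevaluated and only
# looked up when <bar> is called.
macros["bar"] = lambda: macros["test"]

# <test> is then redefined, as in the example.
macros["test"] = "bar"

def call(name):
    """Evaluate a stored body: run it if it was deferred, else return it."""
    value = macros[name]
    return value() if callable(value) else value

print(call("foo"))  # foo
print(call("bar"))  # bar
```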
One more word about the EOF token: it is produced when the last file ends, and will have the effect of terminating the SIMPLE session in the normal way. It is not an error for the EOF to be quoted either by a backtick (which is then simply ignored) or by double-quotes (in other words, double quotes which never close). However, it is an error for the end of file to occur within a comment. Now if you want an explanation for the logic of this (it's rather poor, I'm afraid): this is what you'd expect if EOF is considered as an ordinary character, which must always be present exactly once, at the end of a file (commenting it would delete it and that's inadmissible).
It is not an error for a comment that begins in one file to end in another (the file that input it, most likely). However, I very strongly recommend not using that kind of ridiculous construction, and there is no guarantee that the current behaviour will be maintained in the future. The same goes for quoting, at the end of one file, the beginning of another one.
Let us start with ordinary expansion. It reads one token: if that token is a regular token it gets shipped out. If it is a QUOTE_NEXT token, the next token gets shipped out (but not the QUOTE_NEXT token itself). If it is OPEN_QUOTE, the following tokens, up to the next balanced CLOSE_QUOTE, all get shipped out and are not interpreted, except QUOTE_NEXT which still has the same effect (compare with the role of the double-quotes and backtick on the lexer level). Neither the OPEN_QUOTE nor the corresponding CLOSE_QUOTE token get shipped out, however. It is invalid for an END_COMMAND, NEXT_PARAM or CLOSE_QUOTE to be found during ordinary expansion (``Misplaced special token''). Along the same logic, I guess it should also be invalid for the AT_SIGN token to be found, but it actually turns out to be much more practical if it is allowed (a lot of macro definitions don't have to be quoted), so it is. When a BEGIN_COMMAND token is read, parameter expansion begins (but the token itself is lost):
Parameter expansion proceeds in exactly the same way as ordinary expansion (described above - indeed it is performed by the same procedure) with two notable differences: one, the tokens resulting from the expansion do not go to the output stream, but rather to the argument token lists, starting with argument list 0. Two, the tokens NEXT_PARAM and END_COMMAND are no longer invalid. When a NEXT_PARAM token is read, the current argument list is concluded and another one is started (it starts empty) to which the following tokens will be expanded (the NEXT_PARAM token itself does not go anywhere). When an END_COMMAND token is read, macro expansion is performed:
Macro expansion decodes the argument 0 of the argument list, the command name. It is illegal for that argument to contain any special tokens, or the character 0 (because of the way character strings are coded in C). Any other character is valid, but only alphanumerical characters and the underscore are guaranteed to be always valid in future versions. When the command name is read, it is matched against the known command names: builtins first and user-defined later. If it does not match anything, an error is produced. Otherwise, the actual expansion is performed, based on whatever parameters (arguments) were read. Builtins and user expansion are somewhat different: builtins may have ``magical effects'' such as changing global variables, whereas user macros cannot (at least not directly). Both builtins and user macros, however, have an expansion value. The important thing to note, however, is that this expansion value will not be shipped to the output but will be added back to the input, before the tokens produced by the lexer, and before any other ``extra input'' of that kind which may have been previously produced. Thus, after a macro is expanded, the results of this expansion are reread, and for example if they contain BEGIN_COMMAND tokens they may lead to further macro expansions.
There isn't much to be said about expansion of builtins: it must be described individually for each builtin. Note well that each builtin has an expansion value and an effect, and either one (or both) may be nil. The <id> builtin for example has an expansion but no effect whereas the <out> builtin has an effect but no expansion.
User-defined macros on the other hand never have an effect. They only have an expansion, which is normally simply equal to their definition: the expansion of a user-defined macro is normally simply obtained by copying the definition. However, there is an exception to this (and an important one indeed), namely argument replacement. An argument replacement is triggered by the appearance of the AT_SIGN token in the definition string of the macro. When that token is found, tokens are read until the next AT_SIGN token and must constitute an ordinary string (i.e. contain no special token). Note that the appearance of a QUOTE_NEXT token in the definition string of a macro has no influence whatsoever on the process. It will not make the AT_SIGN any less active or less special. (However, of course, the @ regular character token is in no way special, but that is another matter.) What actually goes to the expansion string depends on what was found between the AT_SIGN tokens:
We now say a word about the EOF token: it is invalid for this token to be found in any other operation than ordinary expansion. It is impossible (except for an internal error, of course :-) for this token to appear in the definition of a macro or a parameter of a well-completed call, so that the question is irrelevant at that point. But it can appear while expanding arguments, and that is an error (``Unterminated command'').
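To make the interplay of ordinary expansion, parameter expansion and macro expansion more concrete, here is a deliberately small model in Python. It is a sketch under strong assumptions and not the real implementation: it works on characters rather than lexer tokens, it ignores the backtick, double quote, comment and # characters altogether, it knows only the <def>, <id> and <void> builtins, and the function name run and the exception types are invented for the illustration.

```python
import re

BUILTINS = ("def", "id", "void")

def run(source):
    """Expand a tiny SIMPLE-like subset: < | > [ ] and @n@ replacement."""
    macros = {}                        # user macros: name -> definition
    pending = list(reversed(source))   # input stack; pop() gives next char
    output = []

    def call(args):
        name, params = args[0], args[1:]
        if name == "def":
            if len(params) != 2:
                raise ValueError("def expects exactly two arguments")
            if params[0] in BUILTINS:
                raise ValueError("cannot redefine a builtin")
            macros[params[0]] = params[1]   # effect, no expansion
            return ""
        if name == "id":
            return params[0] if params else ""
        if name == "void":
            return ""
        if name not in macros:
            raise NameError("unknown macro: " + name)
        def repl(match):                    # argument replacement, at call time
            n = int(match.group(1))
            return params[n - 1] if 1 <= n <= len(params) else ""
        return re.sub(r"@(\d+)@", repl, macros[name])

    def read_quoted(sink):
        depth = 1                           # copy up to the balanced ]
        while pending:
            ch = pending.pop()
            if ch == "[":
                depth += 1
            elif ch == "]":
                depth -= 1
                if depth == 0:
                    return
            sink.append(ch)
        raise SyntaxError("Unterminated quote")

    def expand(sinks, in_call):
        while pending:
            ch = pending.pop()
            if ch == "[":
                read_quoted(sinks[-1])
            elif ch == "<":
                inner = [[]]                # argument list 0 is the name
                expand(inner, True)         # parameter expansion
                result = call(["".join(a) for a in inner])
                pending.extend(reversed(result))  # expansion is re-read
            elif ch == "|" and in_call:
                sinks.append([])            # start the next argument
            elif ch == ">" and in_call:
                return                      # end of command
            elif ch in "|>]":
                raise SyntaxError("Misplaced special token")
            else:
                sinks[-1].append(ch)
        if in_call:
            raise SyntaxError("Unterminated command")

    expand([output], False)
    return "".join(output)

print(run("<def|i|GARBAGE><def|printi|[i=<i>]><def|i|42><printi>"))  # i=42
```

The single expand procedure serves both ordinary expansion (one sink, the output) and parameter expansion (a growing list of argument sinks), and the result of a call is pushed back onto the input rather than written out, which is what allows the expansion of <printi> to trigger the expansion of <i>.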
Note in particular that the first argument to <id> will be evaluated twice; indeed, arguments to macros are always evaluated, and the result of the macro call (here, the first argument again) is re-evaluated. So, in a way, <id> is the ``opposite'' of the quotes [ and ] (or #).
Note that even though all arguments are discarded, they are still evaluated, as arguments always are (except of course that quotes may prevent this). In particular, we have this:
<def|i|0>%
<void|<def|i|1>>%
<i>->
1
(the <def|i|1> argument gets evaluated and so the definition of <i> changes).
The <void> builtin can be useful, given a macro which does something and evaluates to something, to perform the macro's action while throwing the evaluation away.
Trying to redefine a builtin is an error.
Note (for what it's worth) that token strings such as @1@ or similar, are not interpreted upon macro definition but upon macro call.
Thus, the <out> builtin bypasses all normal evaluation and puts a string directly on the output stream as soon as the builtin is evaluated.
The following example illustrates the difference between <id> and <out>:
<out|This text gets printed.> <id|So does this one.> <void|<out|This text also gets printed.>> <void|<id|This one doesn't, however.>> <def|double|@1@@1@>% <double|<out|This text gets printed once.>> <double|<id|This one, twice.>>->
This text gets printed. So does this one. This text also gets printed. This text gets printed once. This one, twice.This one, twice.
In other words, <if> provides the analog of an ``if... then... else if... else if... ... else...'' construction. Here is an example:
<def|duck|[<if|@1@|1|one|@1@|2|two|@1@|3|three|infinity>]>% <duck|1> <duck|2> <duck|3> <duck|4>->
one two three infinity
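The chained comparison performed in the <duck> example can be sketched as follows. This Python model is inferred from the example alone: it assumes that the arguments come in (left, right, result) triples followed by an optional final ``else'' value, that the comparison is plain string equality, that a missing ``else'' yields the empty string, and that a dangling pair of arguments is an error. None of these details is confirmed here, and the function names are invented.

```python
def if_builtin(*args):
    """Model of <if|a1|b1|r1|a2|b2|r2|...|otherwise> (layout assumed)."""
    i = 0
    while len(args) - i >= 3:
        left, right, result = args[i:i + 3]
        if left == right:          # assumption: string equality
            return result
        i += 3
    rest = args[i:]
    if len(rest) == 2:
        # Assumption: a comparison without a result is rejected.
        raise ValueError("if: wrong number of arguments")
    # Assumption: with no final 'else' value, the result is empty.
    return rest[0] if rest else ""

def duck(n):
    """Counterpart of the <duck> macro from the example."""
    return if_builtin(n, "1", "one", n, "2", "two", n, "3", "three", "infinity")

print(duck("1"), duck("2"), duck("3"), duck("4"))  # one two three infinity
```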
Anyhow, simple will work on Linux, with either libc5 or glibc2 (aka libc6) because that's its original platform. I've had it run on Solaris and I'm currently working on the SunOS port (still untested). I gather that it will work nicely on any recent flavor of Unix. Note that it requires an ANSI C compiler (just get gcc if you don't have one).
Portability checks include: the presence of getopt_long() (if you don't have that, you can't use long options), the presence of getopt() (if you don't have that, you can't use any options), the presence of memmove() (if you don't have that, some things will work real slow), the presence of snprintf() (if you don't have that, you might have a security problem, but in fact that should never happen because I allow 30 characters just for printing an integer and I guess that should always suffice) and the fact that realloc(NULL...) and free(NULL) do what they should (if they don't it ought to be no problem because I just use a small wrapper around them).
Beyond that, it gets complicated. An MS-DOS port seems hopeless because I make a very intensive use of realloc(), sometimes with rather large memory blocks, so the huge model would be a necessity, and probably the poor thing would choke itself out of memory real fast even if it can be compiled.