What is flat assembler g?
|
|
|
|
It is an assembly engine designed as a successor of the one used in
|
|
flat assembler 1, one of the recognized assemblers for x86 processors.
|
|
This is a bare engine that by itself has no ability to recognize and
|
|
encode instructions of any processor, however it has the ability to
|
|
become an assembler for any CPU architecture. It has a macroinstruction
|
|
language that is substantially improved compared to the one provided by
flat assembler 1, and it makes it easy to implement instruction encoders
in the form of customizable macroinstructions.
|
|
The source code of this tool can be compiled with flat assembler 1,
|
|
but it is also possible to use flat assembler g itself to compile it.
|
|
The source contains clauses that include different header files depending
|
|
on the assembler used. When flat assembler g compiles itself, it uses
|
|
the provided set of headers that implement x86 instructions and formats
|
|
with a syntax mostly compatible with flat assembler 1.
|
|
The example programs for x86 architecture that come in this package are
|
|
the selected samples that originally came with flat assembler 1 and they
|
|
use sets of headers that implement instruction encoders and output formatters
|
|
required to assemble them just like the original flat assembler did.
|
|
To demonstrate how the instruction sets of different architectures
|
|
may be implemented, there are some example programs for the microcontrollers,
|
|
8051 and AVR. They have been kept simple and therefore they do not provide
|
|
a complete framework for programming such CPUs, though they may provide
|
|
a solid base for the creation of such environments.
|
|
There is also an example of assembling the JVM bytecode, which is
|
|
a conversion of the sample originally created for flat assembler 1. For this
|
|
reason it is somewhat crude and does not fully utilize the capabilities
|
|
offered by the new engine. However it is good at visualising the structure
|
|
of a class file.
|
|
|
|
|
|
|
|
How does this work?
|
|
|
|
The essential function of flat assembler g is to generate output defined
|
|
by the instructions in the source code. Given the one line of text as
|
|
shown below, the assembler would generate a single byte with the stated
|
|
value:
|
|
|
|
db 90h
|
|
|
|
The macroinstructions can be defined to generate some specific
|
|
sequences of data depending on the provided parameters. They may correspond
|
|
to the instructions of chosen machine language, as in the following example,
|
|
but they could as well be defined to generate other kinds of data, for
|
|
various purposes.
|
|
|
|
        macro int number
                if number = 3
                        db 0CCh
                else
                        db 0CDh, number
                end if
        end macro

        int 20h ; generates two bytes
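As a quick check of the definition above, the following sketch repeats the macroinstruction and calls it with both kinds of argument. The byte values in the comments follow directly from the two branches of the "if" block; the argument 21h is only an illustrative choice.

```asm
macro int number
        if number = 3
                db 0CCh                 ; one-byte form used for "int 3"
        else
                db 0CDh, number         ; general two-byte form
        end if
end macro

        int 3           ; takes the first branch: 0CCh
        int 21h         ; takes the "else" branch: 0CDh, 21h
```

Three bytes are generated in total, because the condition is evaluated separately for each call.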
|
|
|
|
The assembly as seen this way may be considered a kind of interpreted
|
|
language, and the assembler certainly has many characteristics of the
|
|
interpreter. However it also shares certain aspects with a compiler.
|
|
It is possible for an instruction to use the value which is defined
|
|
later in the source and may depend on the instructions that come before
|
|
that definition, as demonstrated by the following sample.
|
|
|
|
        macro jmpi target
                if target-($+2) < 80h & target-($+2) >= -80h
                        db 0EBh
                        db target-($+1)
                else
                        db 0E9h
                        dw target-($+2)
                end if
        end macro

                jmpi start
                db 'some data'
        start:
|
|
|
|
The "jmpi" defined above produces the code of jump instruction as
|
|
in 8086 architecture. Such code contains the relative offset of the
|
|
target of a jump, stored in either single byte or 16-bit word.
|
|
The relative offset is computed as a difference between the address
|
|
of the target and the address of the next instruction. The special
|
|
symbol "$" provides the address of current instruction and it is
|
|
used to calculate the relative offset and determine whether it may
|
|
fit in a single byte.
|
|
Therefore the code generated by "jmpi start" in the above sample
|
|
depends on the value of an address labeled as "start", and this
|
|
in turn depends on the length of the output of all the instructions
|
|
that precede it, including the said jump. This creates a loop of
|
|
dependencies and the assembler needs to find a solution that
|
|
fulfills all the constraints created by the source text. This would
|
|
not be possible if the assembler were just an imperative interpreter.
|
|
Its language is thus in some aspects declarative.
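To make the loop of dependencies concrete, the sketch below repeats the sample and records in comments the one consistent solution. The addresses assume that assembly starts at address 0 (no "org" is used), and the "assert" line is an addition that is not part of the original sample.

```asm
macro jmpi target
        if target-($+2) < 80h & target-($+2) >= -80h
                db 0EBh
                db target-($+1)         ; "$" has already advanced by one byte
        else
                db 0E9h
                dw target-($+2)
        end if
end macro

        jmpi start              ; addresses 0..1: 0EBh, 09h
        db 'some data'          ; 9 bytes at addresses 2..10
start:                          ; address 11, so the offset is 11 - 2 = 9

        assert start = 11       ; consistent only with the short form of the jump
```

Had the long form been chosen, "start" would be 12 and the offset 9 would still fit in a byte, so that choice would contradict the condition inside the macroinstruction.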
|
|
Finding a solution for such circular dependencies may resemble
|
|
solving an equation, and it is even possible to construct an example
|
|
where flat assembler g is indeed capable of solving one:
|
|
|
|
x = (x-1)*(x+2)/2-2*(x+1)
|
|
db x
|
|
|
|
The circular reference has been reduced here to a single definition
|
|
that references itself to construct the value. The flat assembler g
|
|
is able to find a solution in this case, though in many others it may
|
|
fail. The method used by this assembler is to perform multiple passes
|
|
over the source text and then try to predict all the values with the
|
|
knowledge gathered this way. This approach is in most cases good enough
|
|
for the assembly of machine codes, but rarely suffices to solve the
|
|
complex equations and the above sample is one of the exceptions.
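For reference, the equation in this sample can be solved by hand. The derivation below is ordinary algebra and says nothing about how the assembler arrives at its answer:

```latex
\begin{align*}
x &= \frac{(x-1)(x+2)}{2} - 2(x+1) \\
2x &= (x^2 + x - 2) - 4x - 4 \\
0 &= x^2 - 5x - 6 = (x-6)(x+1)
\end{align*}
```

Both roots, 6 and -1, also satisfy the original definition when the division is done on integers. As an illustration only (assuming an initial guess of 0, which is not stated in this text), repeated substitution gives -3, then 6, then 6 again, which is a value that no longer changes between passes.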
|
|
|
|
|
|
|
|
What are the means of parsing the arguments of an instruction?
|
|
|
|
Not all instructions have a simple syntax like the ones in the
|
|
previous examples. To aid in the processing of arguments that may
|
|
contain special constructions, flat assembler g provides a few
|
|
capable tools, demonstrated below on the examples that implement
|
|
selected few instructions of the Z80 processor. The rules governing
|
|
the use of presented features are found in the manual.
|
|
When an instruction has a very small set of allowed arguments,
|
|
each one of them can be treated separately with the "match"
|
|
construction:
|
|
|
|
macro EX? first,second
|
|
match (=SP?), first
|
|
match =HL?, second
|
|
db 0E3h
|
|
else match =IX?, second
|
|
db 0DDh,0E3h
|
|
else match =IY?, second
|
|
db 0FDh,0E3h
|
|
else
|
|
err "incorrect second argument"
|
|
end match
|
|
else match =AF?, first
|
|
match =AF'?, second
|
|
db 08h
|
|
else
|
|
err "incorrect second argument"
|
|
end match
|
|
else match =DE?, first
|
|
match =HL?, second
|
|
db 0EBh
|
|
else
|
|
err "incorrect second argument"
|
|
end match
|
|
else
|
|
err "incorrect first argument"
|
|
end match
|
|
end macro
|
|
|
|
EX (SP),HL
|
|
EX (SP),IX
|
|
EX AF,AF'
|
|
EX DE,HL
|
|
|
|
The "?" character appears in many places to mark the names as
|
|
case-insensitive and all these occurrences could be removed to
|
|
further simplify the example.
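The effect of the "?" modifier can be seen in isolation with a much smaller definition. This is a minimal sketch; the "NOP" macroinstruction is introduced here only for illustration and is not part of the samples in this text.

```asm
macro NOP?                      ; "?" makes the name case-insensitive
        db 00h
end macro

        NOP
        nop
        Nop                     ; all three lines call the same macroinstruction
```

Without the "?" only the exact spelling "NOP" would be recognized as an instruction.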
|
|
When the set of possible values of an argument is larger but
|
|
has some regularities, the textual substitutions can be defined
|
|
to replace some of the symbols with carefully chosen constructions
|
|
that can then be recognized and parsed:
|
|
|
|
A? equ [:111b:]
|
|
B? equ [:000b:]
|
|
C? equ [:001b:]
|
|
D? equ [:010b:]
|
|
E? equ [:011b:]
|
|
H? equ [:100b:]
|
|
L? equ [:101b:]
|
|
|
|
macro INC? argument
|
|
match [:r:], argument
|
|
db 100b + r shl 3
|
|
else match (=HL?), argument
|
|
db 34h
|
|
else match (=IX?+d), argument
|
|
db 0DDh,34h,d
|
|
else match (=IY?+d), argument
|
|
db 0FDh,34h,d
|
|
else
|
|
err "incorrect argument"
|
|
end match
|
|
end macro
|
|
|
|
INC A
|
|
INC B
|
|
INC (HL)
|
|
INC (IX+2)
|
|
|
|
This approach has a trait that may not always be desirable:
it allows an expression like "[:0:]" to be used directly in an argument.
It is possible to prevent the syntax from being exploited in this way
by using a prefix in the "match" construction:
|
|
|
|
REG.A? equ [:111b:]
|
|
REG.B? equ [:000b:]
|
|
REG.C? equ [:001b:]
|
|
REG.D? equ [:010b:]
|
|
REG.E? equ [:011b:]
|
|
REG.H? equ [:100b:]
|
|
REG.L? equ [:101b:]
|
|
|
|
macro INC? argument
|
|
match [:r:], REG.argument
|
|
db 100b + r shl 3
|
|
else match (=HL?), argument
|
|
db 34h
|
|
else match (=IX?+d), argument
|
|
db 0DDh,34h,d
|
|
else match (=IY?+d), argument
|
|
db 0FDh,34h,d
|
|
else
|
|
err "incorrect argument"
|
|
end match
|
|
end macro
|
|
|
|
In case of an argument structured like "(IX+d)" it could sometimes
|
|
be desired to allow other algebraically equivalent forms of the
|
|
expression, like "(d+IX)" or "(c+IX+d)". Instead of parsing every
|
|
possible variant individually, it is possible to let the assembler
|
|
evaluate the expression while treating the selected symbol in a distinct
|
|
way. When a symbol is declared as an "element", it has no value and
|
|
when it is used in an expression, it is treated algebraically like
|
|
a variable term in a polynomial.
|
|
|
|
element HL?
|
|
element IX?
|
|
element IY?
|
|
|
|
macro INC? argument
|
|
match [:r:], REG.argument
|
|
db 100b + r shl 3
|
|
else match (a), argument
|
|
if a eq HL
|
|
db 34h
|
|
else if a relativeto IX
|
|
db 0DDh,34h,a-IX
|
|
else if a relativeto IY
|
|
db 0FDh,34h,a-IY
|
|
else
|
|
err "incorrect argument"
|
|
end if
|
|
else
|
|
err "incorrect argument"
|
|
end match
|
|
end macro
|
|
|
|
INC (3*8+IX+1)
|
|
|
|
virtual at IX
|
|
x db ?
|
|
y db ?
|
|
end virtual
|
|
|
|
INC (y)
|
|
|
|
There is a small problem with the above macroinstruction. A parameter
|
|
may contain any text and when such value is placed into an expression,
|
|
it may induce erratic behavior. For example if "INC (1|0)" was processed,
|
|
it would turn the "a eq HL" expression into "1|0 eq HL" and this logical
|
|
expression is correct and true even though the argument was malformed.
|
|
Such an unfortunate side effect is a consequence of macroinstructions
|
|
operating on a simple principle of text substitution (and the best way
|
|
to avoid such problems is to use CALM instead). Here, to prevent it
|
|
from happening, a local variable may be used as a proxy holding the value
|
|
of an argument:
|
|
|
|
macro INC? argument
|
|
match [:r:], REG.argument
|
|
db 100b + r shl 3
|
|
else match (a), argument
|
|
local value
|
|
value = a
|
|
if value eq HL
|
|
db 34h
|
|
else if value relativeto IX
|
|
db 0DDh,34h,a-IX
|
|
else if value relativeto IY
|
|
db 0FDh,34h,a-IY
|
|
else
|
|
err "incorrect argument"
|
|
end if
|
|
else
|
|
err "incorrect argument"
|
|
end match
|
|
end macro
|
|
|
|
There is an additional advantage of such a proxy variable, thanks to
the fact that its value is computed before the macroinstruction begins
to generate any output. When an expression contains a symbol like "$",
it may give different values depending on where it is calculated, and
the use of a proxy variable ensures that the value taken is the one
obtained by evaluating the argument before generating the code of
an instruction.
|
|
When the set of symbols allowed in expressions is larger, it is
|
|
better to have a single construction to process an entire family
|
|
of them. An "element" declaration may associate an additional value
|
|
with a symbol and this information can then be retrieved with
|
|
the "metadata" operator applied to a linear polynomial that contains
|
|
given symbol as a variable. The following example is another
|
|
variant of the previous macroinstruction that demonstrates the use
|
|
of this feature:
|
|
|
|
element register
|
|
element A? : register + 111b
|
|
element B? : register + 000b
|
|
element C? : register + 001b
|
|
element D? : register + 010b
|
|
element E? : register + 011b
|
|
element H? : register + 100b
|
|
element L? : register + 101b
|
|
|
|
element HL?
|
|
element IX?
|
|
element IY?
|
|
|
|
macro INC? argument
|
|
local value
|
|
match (a), argument
|
|
value = a
|
|
if value eq HL
|
|
db 34h
|
|
else if value relativeto IX
|
|
db 0DDh,34h,a-IX
|
|
else if value relativeto IY
|
|
db 0FDh,34h,a-IY
|
|
else
|
|
err "incorrect argument"
|
|
end if
|
|
else match any more, argument
|
|
err "incorrect argument"
|
|
else
|
|
value = argument
|
|
if value eq value element 1 & value metadata 1 relativeto register
|
|
db 100b + (value metadata 1 - register) shl 3
|
|
else
|
|
err "incorrect argument"
|
|
end if
|
|
end match
|
|
end macro
|
|
|
|
The "any more" pattern is there to catch any argument that
contains a complex expression consisting of more than one token.
This prevents the use of syntax like "INC A+0" or "INC A+B-A".
But in the case of some instruction sets, the inclusion of such
a constraint may depend on personal preference.
|
|
The "value eq value element 1" condition ensures that the value does not
|
|
contain any terms other than the name of a register. Even when an argument
|
|
is forced to contain no more than a single token, it is still possible
|
|
that it has a complex value, for instance if there were definitions like
|
|
"X = A + B" or "Y = 2 * A". Both "INC X" and "INC Y" would then cause
|
|
the operator "element 1" to return the value "A", which differs from the
|
|
value checked in either case.
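The pieces of a linear polynomial that these checks rely on can be inspected one at a time. The sketch below is a hedged illustration: the "scale" operator and the "assert" directive are not introduced in this text (they are described in the manual), and the names "Y" and "Z" are made up for the example.

```asm
element register
element A? : register + 111b

Y = 2 * A + 5
Z = A

assert Y element 1 eq A                 ; the variable of the first term
assert Y scale 1 = 2                    ; the coefficient of that term
assert Y scale 0 = 5                    ; the constant term
assert Y metadata 1 - register = 111b   ; the value attached to "A"
assert Z eq Z element 1                 ; a bare register name passes the check
```

For "Y" the comparison "Y eq Y element 1" would be false, because the polynomial carries a coefficient other than 1 and a constant term in addition to the register.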
|
|
If an instruction takes a variable number of arguments, a simple
|
|
way to recognize its various forms is to declare an argument with "&"
|
|
modifier to pass the complete contents of the arguments to "match":
|
|
|
|
element CC
|
|
|
|
NZ? := CC + 000b
|
|
Z? := CC + 001b
|
|
NC? := CC + 010b
|
|
C? := CC + 011b
|
|
PO := CC + 100b
|
|
PE := CC + 101b
|
|
P := CC + 110b
|
|
M := CC + 111b
|
|
|
|
macro CALL? arguments&
|
|
local cc,nn
|
|
match condition =, target, arguments
|
|
cc = condition - CC
|
|
nn = target
|
|
db 0C4h + cc shl 3
|
|
else
|
|
nn = arguments
|
|
db 0CDh
|
|
end match
|
|
dw nn
|
|
end macro
|
|
|
|
CALL 0
|
|
CALL NC,2135h
|
|
|
|
This approach also makes it possible to handle other, more difficult cases, like when
|
|
the arguments may contain commas or are delimited in different ways.
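The same technique covers an instruction whose only argument is optional. The following sketch applies it to the Z80 "RET" instruction; this macroinstruction is an illustration written for this purpose and only two of the condition codes are defined to keep it short.

```asm
element CC

NZ? := CC + 000b
Z?  := CC + 001b

macro RET? arguments&
        match , arguments               ; nothing follows the instruction name
                db 0C9h
        else
                local cc
                cc = arguments - CC
                db 0C0h + cc shl 3
        end match
end macro

        RET             ; 0C9h
        RET NZ          ; 0C0h
        RET Z           ; 0C8h
```

The empty pattern in the first "match" is what distinguishes the form without arguments from the conditional one.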
|
|
|
|
|
|
|
|
How are the labels processed?
|
|
|
|
A standard way of defining a label is by following its name with ":" (this
|
|
also acts like a line break and any other command, including another label,
|
|
may follow in the same line). Such label simply defines a symbol with
|
|
the value equal to the current address, which initially is zero and increases
|
|
when any bytes are added into the output.
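A short sketch of these rules, assuming that nothing has been output before the first line (the label names are arbitrary and the "assert" directive is described in the manual, not in this text):

```asm
first: second: db 1, 2          ; two labels and a data directive in one line
third:

        assert first = 0 & second = 0   ; no bytes had been output yet
        assert third = 2                ; "db" added two bytes
```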
|
|
In some variants of assembly language it may be desirable to allow a label
to precede an instruction without an additional ":" in between. It is then
|
|
necessary to create a labeled macroinstruction that after defining a label
|
|
passes processing to the original macroinstruction with the same name:
|
|
|
|
struc INC? argument
|
|
.:
|
|
INC argument
|
|
end struc
|
|
|
|
start INC A
|
|
INC B
|
|
|
|
This has to be done for every instruction that needs to allow this kind
|
|
of syntax. A simple loop like the following one would suffice:
|
|
|
|
iterate instruction, EX,INC,CALL
|
|
struc instruction? argument
|
|
.: instruction argument
|
|
end struc
|
|
end iterate
|
|
|
|
Every built-in instruction that defines data already has the labeled variant.
|
|
By defining a labeled instruction that has "?" in place of name it is
|
|
possible to intercept every line that starts with an identifier that is not
|
|
a known instruction and is therefore assumed to be a label. The following one
|
|
would allow a label without ":" to begin any line in the source text (it also
|
|
handles the special cases so that labels followed with ":" or with "=" and
|
|
a value would still work):
|
|
|
|
struc ? tail&
|
|
match :, tail
|
|
.:
|
|
else match : instruction, tail
|
|
.: instruction
|
|
else match == value, tail
|
|
. = value
|
|
else
|
|
.: tail
|
|
end match
|
|
end struc
|
|
|
|
Obviously, it is no longer needed to define any specific labeled
|
|
macroinstructions when a global effect of this kind is applied. A variant
|
|
should be chosen depending on the type of syntax that needs to be allowed.
|
|
Intercepting even the labels defined with ":" may become useful when the
|
|
value of current address requires some additional processing before being
|
|
assigned to a label - for example when a processor uses addresses with a
|
|
unit larger than a byte. The intercepting macroinstruction might then look
|
|
like this:
|
|
|
|
struc ? tail&
|
|
match :, tail
|
|
label . at $ shr 1
|
|
else match : instruction, tail
|
|
label . at $ shr 1
|
|
instruction
|
|
else
|
|
. tail
|
|
end match
|
|
end struc
|
|
|
|
The value of current address that is used to define labels may be altered
|
|
with "org". If the labels need to be differentiated from absolute values,
|
|
a symbol defined with "element" may be used to form an address:
|
|
|
|
element CODEBASE
|
|
org CODEBASE + 0
|
|
|
|
macro CALL? argument
|
|
local value
|
|
value = argument
|
|
if value relativeto CODEBASE
|
|
db 0CDh
|
|
dw value - CODEBASE
|
|
else
|
|
err "incorrect argument"
|
|
end if
|
|
end macro
|
|
|
|
To define labels in an address space that is not going to be reflected in
|
|
the output, a "virtual" block should be declared. The following sample
|
|
prepares macroinstructions "DATA" and "CODE" to switch between generating
|
|
program instructions and data labels. Only the instruction codes would go to
|
|
the output:
|
|
|
|
element DATA
|
|
DATA_OFFSET = 2000h
|
|
element CODE
|
|
CODE_OFFSET = 1000h
|
|
|
|
macro DATA?
|
|
_END
|
|
virtual at DATA + DATA_OFFSET
|
|
end macro
|
|
|
|
macro CODE?
|
|
_END
|
|
org CODE + CODE_OFFSET
|
|
end macro
|
|
|
|
macro _END?
|
|
if $ relativeto DATA
|
|
DATA_OFFSET = $ - DATA
|
|
end virtual
|
|
else if $ relativeto CODE
|
|
CODE_OFFSET = $ - CODE
|
|
end if
|
|
end macro
|
|
|
|
postpone
|
|
_END
|
|
end postpone
|
|
|
|
CODE
|
|
|
|
The "postpone" block is used here to ensure that the "virtual" block
|
|
always gets closed correctly, even if source text ends with data
|
|
definitions.
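The behavior of "postpone" is easier to see in a stripped-down form. In this minimal sketch the byte values are arbitrary:

```asm
postpone
        db 0FFh         ; assembled only after the last line of the source
end postpone

        db 1
        db 2
```

The output is 01h, 02h, 0FFh, even though the "postpone" block comes first in the source text.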
|
|
Within the environment prepared by the above sample any instruction
|
|
would be able to distinguish data labels from the ones defined within
|
|
program. For example a branching instruction could be made to accept
|
|
an argument being either a label within a program or an absolute value,
|
|
but to disallow any label of data:
|
|
|
|
macro CALL? argument
|
|
local value
|
|
value = argument
|
|
if value relativeto CODE
|
|
db 0CDh
|
|
dw value - CODE
|
|
else if value relativeto 0
|
|
db 0CDh
|
|
dw value
|
|
else
|
|
err "incorrect argument"
|
|
end if
|
|
end macro
|
|
|
|
DATA
|
|
|
|
variable db ?
|
|
|
|
CODE
|
|
|
|
routine:
|
|
|
|
In this context either "CALL routine" or "CALL 1000h" would be allowed,
|
|
while "CALL variable" would not be.
|
|
When the labels have values that are not absolute numbers, it is
|
|
possible to generate relocations for instructions that use them.
|
|
A special "virtual" block may be used to store the offsets of values
|
|
inside the program that need to be relocated when its base changes:
|
|
|
|
virtual at 0
|
|
Relocations::
|
|
rw RELOCATION_COUNT
|
|
end virtual
|
|
|
|
RELOCATION_INDEX = 0
|
|
|
|
postpone
|
|
RELOCATION_COUNT := RELOCATION_INDEX
|
|
end postpone
|
|
|
|
macro WORD? value
|
|
if value relativeto CODE
|
|
store $ - CODE : 2 at Relocations : RELOCATION_INDEX shl 1
|
|
RELOCATION_INDEX = RELOCATION_INDEX + 1
|
|
dw value - CODE
|
|
else
|
|
dw value
|
|
end if
|
|
end macro
|
|
|
|
macro CALL? argument
|
|
local value
|
|
value = argument
|
|
if value relativeto CODE | value relativeto 0
|
|
db 0CDh
|
|
word value
|
|
else
|
|
err "incorrect argument"
|
|
end if
|
|
end macro
|
|
|
|
The table of relocations that is created this way can then be accessed
|
|
with "load". The following two lines could be used to put the table
|
|
in its entirety somewhere in the output:
|
|
|
|
load RELOCATIONS : RELOCATION_COUNT shl 1 from Relocations : 0
|
|
dw RELOCATIONS
|
|
|
|
The "load" reads the whole table into a single string, then "dw" writes it
|
|
into output (padded to multiple of a word, but in this case the string never
|
|
requires such padding).
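The same pair of commands can be tried on a fixed table, which makes the result easy to verify. This is a sketch with made-up names ("Table", "whole", "second") and values:

```asm
virtual at 0
        Table::
        dw 1234h, 5678h
end virtual

load whole : 4 from Table : 0   ; a 4-byte string holding both entries
load second : 2 from Table : 2  ; only the second entry

        dw whole                ; 34h, 12h, 78h, 56h
        dw second               ; 78h, 56h
```

Both loaded values are strings of bytes, so "dw" copies them to the output in the order in which they were stored.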
|
|
For more complex types of relocations additional modifiers may need to be
|
|
employed. For example, if upper and lower portions of an address needed to be
|
|
stored in separate places (likely across two instructions) and relocated
|
|
separately, necessary modifiers could be implemented as follows:
|
|
|
|
element MOD.HIGH
|
|
element MOD.LOW
|
|
|
|
HIGH? equ MOD.HIGH +
|
|
LOW? equ MOD.LOW +
|
|
|
|
macro BYTE? value
|
|
if value relativeto MOD.HIGH + CODE
|
|
; register HIGH relocation
|
|
db (value - MOD.HIGH - CODE) shr 8
|
|
else if value relativeto MOD.LOW + CODE
|
|
; register LOW relocation
|
|
db (value - MOD.LOW - CODE) and 0FFh
|
|
else if value relativeto MOD.HIGH
|
|
db (value - MOD.HIGH) shr 8
|
|
else if value relativeto MOD.LOW
|
|
db (value - MOD.LOW) and 0FFh
|
|
else
|
|
db value
|
|
end if
|
|
end macro
|
|
|
|
The commands that would register a relocation have been omitted for clarity;
in this case not only the offset within code but also some additional information
would need to be registered in appropriate structures. With such preparation,
relocatable units in code might be generated like:
|
|
|
|
BYTE HIGH address
|
|
BYTE LOW address
|
|
|
|
Such an approach makes it easy to enable syntax with modifiers in any instruction
that internally uses the "byte" macroinstruction when generating code.
|
|
|
|
|
|
|
|
How can multiple sections of file be generated in parallel?
|
|
|
|
This assembly engine has a single main output that has to be generated
|
|
sequentially. This may seem problematic when the file needs to contain
|
|
distinct sections for code and data, collected from interleaved pieces that
|
|
may be spread across multiple source files. There are, however, a couple of
|
|
methods to handle it, all based in one way or another on forward-referencing
|
|
capabilities of the assembler.
|
|
A natural approach is to define contents of auxiliary section in "virtual"
|
|
block and copy it to appropriate position in the output with a single
|
|
operation. When a "virtual" block is labeled, it can be re-opened multiple
|
|
times to append more data to it.
|
|
|
|
include '8086.inc'
|
|
org 100h
|
|
jmp CodeSection
|
|
|
|
DataSection:
|
|
|
|
virtual
|
|
Data::
|
|
end virtual
|
|
|
|
postpone
|
|
virtual Data
|
|
load Data.OctetString : $ - $$ from $$
|
|
end virtual
|
|
end postpone
|
|
|
|
db Data.OctetString
|
|
|
|
CodeSection:
|
|
|
|
virtual Data
|
|
Hello db "Hello!",24h
|
|
end virtual
|
|
|
|
mov ah,9
|
|
mov dx,Hello
|
|
int 21h
|
|
|
|
virtual Data
|
|
ExitCode db 37h
|
|
end virtual
|
|
|
|
mov ah,4Ch
|
|
mov al,[ExitCode]
|
|
int 21h
|
|
|
|
This leads to a relatively simple syntax even without help of additional
|
|
macros.
|
|
Another method could be to put the pieces of the section into macros and
|
|
execute them all at the required position in source. A disadvantage of such
|
|
approach is that tracing errors in definitions might become a bit cumbersome.
|
|
The techniques that make it easy to append to a section generated in
|
|
parallel can also be very useful to generate data structures like relocation
|
|
tables. Instead of "store" commands used earlier when demonstrating
|
|
the concept, regular data directives could be used inside a re-opened
|
|
"virtual" block to create relocation records.
|
|
|
|
|
|
|
|
What options are there to parse other kinds of syntax?
|
|
|
|
In some cases a command that assembler needs to parse may begin with
|
|
something different than a name of instruction or a label. It may be
|
|
that a name is preceded by a special character, like "." or "!",
|
|
or that it is an entirely different kind of construction. It is then
|
|
necessary to use "macro ?" to intercept whole lines of source text
|
|
and process any special syntax of such kind.
|
|
For example, if it was required to allow a command written as ".CODE",
|
|
it would not be possible to implement it directly as a macroinstruction,
|
|
because initial dot causes the symbol to be interpreted as a local one
|
|
and globally defined instruction could never be executed this way.
|
|
The intercepting macroinstruction provides a solution:
|
|
|
|
macro ? line&
|
|
match .=CODE?, line
|
|
CODE
|
|
else match .=DATA?, line
|
|
DATA
|
|
else
|
|
line
|
|
end match
|
|
end macro
|
|
|
|
The lines that contain either ".CODE" or ".DATA" text are processed here
|
|
in such a way, that they invoke the global macroinstruction with
|
|
corresponding name, while all other intercepted lines are executed without
|
|
changes. This method makes it possible to filter out any special syntax and let
|
|
the assembler process the regular instructions as usual.
|
|
Sometimes unconventional syntax is expected only in a specific area
|
|
of source text, like inside a block with defined boundaries. The
|
|
parsing macroinstruction should then be applied only in this place,
|
|
and removed with "purge" when the block ends:
|
|
|
|
macro concise
|
|
macro ? line&
|
|
match =end =concise, line
|
|
purge ?
|
|
else match dest+==src, line
|
|
ADD dest,src
|
|
else match dest-==src, line
|
|
SUB dest,src
|
|
else match dest==src, line
|
|
LD dest,src
|
|
else match dest++, line
|
|
INC dest
|
|
else match dest--, line
|
|
DEC dest
|
|
else match any, line
|
|
err "syntax error"
|
|
end match
|
|
end macro
|
|
end macro
|
|
|
|
concise
|
|
C=0
|
|
B++
|
|
A+=2
|
|
end concise
|
|
|
|
A macroinstruction defined this way does not intercept lines that contain
|
|
directives controlling the flow of the assembly, like "if" or "repeat", and
|
|
they can still be used freely inside such a block. This would change if
|
|
the declaration was in the form "macro ?! line&". Such a variant would
|
|
intercept every line with no exception.
|
|
Another option to catch special commands might be to use "struc ?"
|
|
to intercept only lines that do not start with a known instruction
|
|
(the initial symbol is then treated as label). Since this one only tests
|
|
unknown commands, it should cause less overhead on the assembly:
|
|
|
|
struc (head) ? tail&
|
|
match .=CODE?, head
|
|
CODE tail
|
|
else
|
|
head tail
|
|
end match
|
|
end struc
|
|
|
|
All these approaches hide a subtle trap. A label defined with ":" may be
|
|
followed by another instruction in the same line. If that next instruction
|
|
(which here becomes hidden in the "tail" parameter) is a control directive
|
|
like "if", putting it inside the "else" clause is going to cause broken nesting
|
|
of control blocks. A possible solution is to somehow invoke "tail" contents
|
|
outside of "match" block. One way could be to call a special macro:
|
|
|
|
struc (head) ? tail&
|
|
local invoker
|
|
match .=CODE?, head
|
|
macro invoker
|
|
CODE tail
|
|
end macro
|
|
else
|
|
macro invoker
|
|
head tail
|
|
end macro
|
|
end match
|
|
invoker
|
|
end struc
|
|
|
|
A simpler option is to call the original line directly and when override
|
|
is needed, cause it to be ignored with help of another line interceptor
|
|
(disposing of itself immediately after):
|
|
|
|
struc (head) ? tail&
|
|
match .=CODE?, head
|
|
CODE tail
|
|
macro ? line&
|
|
purge ?
|
|
end macro
|
|
end match
|
|
head tail
|
|
end struc
|
|
|
|
However, a much better way of avoiding these kinds of pitfalls is to use
|
|
CALM instructions instead of standard macros. There it is possible to
|
|
process arguments and assemble the original or modified line without
|
|
use of any control directives. CALM instructions also offer a much better
|
|
performance, which might be especially important in case of interceptors
|
|
that get called for nearly every line in source text.
|
|
|
|
|
|
|
|
How to define an instruction sharing a name with one of the core directives?
|
|
|
|
It may happen that a language can be in general easily implemented with
|
|
macros, but it needs to include a command with the same name as one of
|
|
the directives of assembler. While it is possible to override any
|
|
instruction with a macro, macros themselves may require access to
|
|
the original directive. To allow the same name call a different instruction
|
|
depending on the context, the implemented language may be interpreted
|
|
within a namespace that contains overriding macro, while all the macros
|
|
requiring access to original directive would have to temporarily switch
|
|
to another namespace where it has not been overridden. This would
|
|
require every such macro to pack its contents in a "namespace" block.
|
|
But there is another trick, related to how texts of macro parameters
|
|
or symbolic variables preserve the context under which the symbols within
|
|
them should be interpreted (this includes the base namespace and
|
|
the parent label for symbols starting with dot).
|
|
Unlike the two mentioned occurrences, the text of a macro normally does
|
|
not carry such extra information, but if a macro is constructed in such way
|
|
that it contains text that was once carried within a parameter to another
|
|
macro or within a symbolic variable, then this text retains the information
|
|
about context even when it becomes a part of a newly defined macro.
|
|
For example:
|
|
|
|
macro definitions end?
|
|
namespace embedded
|
|
struc LABEL? size
|
|
match , size
|
|
.:
|
|
else
|
|
label . : size
|
|
end match
|
|
end struc
|
|
macro E#ND? name
|
|
end namespace
|
|
match any, name
|
|
ENTRYPOINT := name
|
|
end match
|
|
macro ?! line&
|
|
end macro
|
|
end macro
|
|
end macro
|
|
|
|
definitions end
|
|
|
|
start LABEL
|
|
END start
|
|
|
|
The parameter given to "definitions" macro may appear to do nothing, as it
|
|
replaces every instance of "end" with exactly the same word - but the text
|
|
that comes from the parameter is equipped with additional information about
|
|
context, and this attribute is then preserved when the text becomes a part
|
|
of a new macro. Thanks to that, macro "LABEL" can be used in a namespace
|
|
where "end" instruction has taken a different meaning, but the instances
|
|
of "end" within its body still refer to the symbol in the outer namespace.
|
|
In this example the parameter has been made case-insensitive, and thus
|
|
it would replace even the "END" in "macro" statement that is supposed to
|
|
define a symbol in "embedded" namespace. For this reason the identifier
|
|
has been split with a concatenation operator to prevent it from being
|
|
recognized as parameter. This would not be necessary if the parameter
|
|
was case-sensitive (as more usual).
|
|
The same effect can be achieved through use of symbolic variables instead
|
|
of macro parameters, with help of "match" to extract the text of a symbolic
|
|
variable:
|
|
|
|
define link end
|
|
match end, link
|
|
namespace embedded
|
|
struc LABEL? size
|
|
match , size
|
|
.:
|
|
else
|
|
label . : size
|
|
end match
|
|
end struc
|
|
macro END? name
|
|
end namespace
|
|
match any, name
|
|
ENTRYPOINT := name
|
|
end match
|
|
macro ?! line&
|
|
end macro
|
|
end macro
|
|
end match
|
|
|
|
start LABEL
|
|
END start
|
|
|
|
This would not work without passing the text through symbolic variable,
|
|
because parameters defined by control directives like "match" do not
|
|
add context information to the text unless it was already there.
|
|
CALM instructions allow for another approach to this kind of problem.
If a customized instruction set is defined entirely in the form of CALM
instructions, it may not even need access to the original control directives.
|
|
However, if CALM instruction needs to assemble a directive that might not
|
|
be accessible, the symbolic variable passed to "assemble" should be
|
|
defined with appropriate context for the instruction symbol.
|
|
|
|
|
|
|
|
How to convert a macroinstruction to CALM?
|
|
|
|
A classic macroinstruction consists of lines of text that are preprocessed
(by replacing names of parameters with their corresponding values) every time
the instruction is called and these preprocessed lines are passed to assembly.
For example this macroinstruction generates just a single line to be assembled,
and it does it by replacing "value" with the text given by the only argument
to the instruction:

    macro octet value*
        db value
    end macro

A CALM instruction can be viewed as a customized preprocessor, which needs to
be written in a special language. It is able to use various commands to
process the arguments and generate lines to be assembled. On the basic
level, it is also able to simulate what the standard preprocessor does - with
the help of the "arrange" command. After preprocessing the line, it also needs
to explicitly pass it to the assembly with an "assemble" command:

    calminstruction octet value*
        arrange value, =db value
        assemble value
    end calminstruction

This gives the same result as the original macroinstruction, as it performs
the same kind of preprocessing. However, unlike the text of a macroinstruction,
a pattern given to "arrange" needs to explicitly state which name tokens are
to be replaced with their values and which ones (prepended with "=") should
be left untouched. The tokens that are copied from the pattern are stripped of
any context information, just like the text of a macroinstruction normally does
not carry any (while the values that came from arguments retain the recognition
context in which the instruction was started).
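As a further illustration of these rules, the following sketch builds a line
from two arguments. The instruction name "pair" and its arguments "first" and
"second" are made up for this example. The "db" token is made literal with "=",
the comma is a non-name token and is expected to be copied as it is, while
"first" and "second" are replaced with the texts of the arguments:

    calminstruction pair first*, second*
        local command
        arrange command, =db first, second
        assemble command
    end calminstruction

    pair 1, 2       ; assembles the line "db 1, 2"

If "db" was not preceded with "=", the "arrange" would treat it as the name
of a variable instead of a token to be copied.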
This is the most straightforward method of conversion and a simple sequence
of "arrange" and "assemble" commands could be made to generate the same lines as
the original macroinstruction. But there is one exception - when a "local"
command is executed by a macroinstruction, it creates a preprocessed parameter
with a special value that points to a symbol in the namespace unique to the
given instance of the instruction.

    macro pointer
        local next
        dd next
      next:
    end macro

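For instance, assuming the "pointer" macroinstruction defined above, each call
in the sketch below gets its own "next" symbol, so the two labels do not
collide (the "org" address is an arbitrary choice made for this illustration):

    org 100h
    pointer         ; stores 104h, the address right after this "dd"
    pointer         ; stores 108h

Without the "local" command the second call would attempt to define the label
"next" again and this would cause an error.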
In case of CALM there is no such namespace available: the local namespace of
a CALM instruction is shared among all its instances. Therefore, if a new unique
symbol is needed every time the instruction is called, it has to be constructed
manually. An obvious method might be to append a unique number to the name:

    global_uid = 0

    calminstruction pointer
        compute global_uid, global_uid + 1
        local command
        arrange command, =dd =next#global_uid
        assemble command
        arrange command, =next#global_uid:
        assemble command
    end calminstruction

Here "arrange" is given a variable that has a numeric value and it has to
replace it with a text. This works only when the value is a plain non-negative
number; in such a case "arrange" converts it to a text token that contains
the decimal representation of that number. The lines passed to assembly are
therefore going to contain identifiers like "next#1".
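With the definitions above, and assuming that nothing else modifies
"global_uid", the generated symbols could be checked as in this sketch (the
"org" address is again an arbitrary choice for the illustration):

    org 100h
    pointer                 ; assembles "dd next#1" and "next#1:"
    pointer                 ; assembles "dd next#2" and "next#2:"
    assert next#1 = 104h
    assert next#2 = 108h

Both assertions are expected to hold, because each call advances the counter
before the names are constructed.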
While incrementation of the global counter could be done by preparing
a standard assembly command like "global_uid = global_uid + 1" with "arrange"
and passing it to assembly, the "compute" command allows doing it directly in
the CALM processor. Moreover, it is then not affected by anything that alters
the context of assembly. If the instruction was defined as unconditional and
used inside a skipped IF block, the "compute" would still perform its task,
because execution of CALM commands is - just like standard preprocessing - done
independently from the main flow of the assembly. Also, references to
the "global_uid" always point to the same symbol - the one that was in scope
when the CALM instruction was defined and compiled. Therefore incrementing
the value with "compute" is more reliable and predictable.
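A minimal sketch of this behavior follows. The names "calls" and "tick" are
hypothetical, and it is assumed here that the "!" modifier marks a CALM
instruction as unconditional in the same way as it does for "macro":

    calls = 0

    calminstruction tick!
        compute calls, calls + 1
    end calminstruction

    if 0
        tick            ; the block is skipped, but "tick" is still executed
    end if

    assert calls = 1

Had the counter been incremented by a line passed to "assemble" instead, that
line would be expected to be skipped together with the rest of the block.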
In a similar manner, the assembly of the line defining the label can be
replaced with a "publish" command. Here the value of the label (which should
be equal to the address after the line containing "dd" is assembled) needs to
be computed first, because "publish" only performs the assignment of a value
to the symbol:

    global_uid = 0

    calminstruction pointer
        compute global_uid, global_uid + 1
        local symbol, command
        arrange symbol, =next#global_uid
        arrange command, =dd symbol
        assemble command
        local address
        compute address, $
        publish symbol:, address
    end calminstruction

Because the CALM instruction itself is conditional, the "publish" inside is
effectively conditional, too. Therefore it works correctly as a replacement
for the assembly of line with a label.
While a global counter has several advantages, it can be interfered with,
so sometimes use of a local counter might be preferable. However, the local
namespace of a CALM instruction is normally not accessible from outside, so
it is a bit harder to give an initial value to such a counter. One way could be
to check whether the counter has already been initialized with some value using
the "take" command:

    calminstruction pointer
        local id
        take id, id
        jyes increment
        compute id, 0
      increment:
        compute id, id + 1
        local symbol, command
        arrange symbol, =next#id
        arrange command, =dd symbol
        assemble command
        local address
        compute address, $
        publish symbol:, address
    end calminstruction

But this adds commands that are executed every time the instruction is called.
A better solution makes use of the ability to define custom instructions
processed during the definition of a CALM instruction:

    calminstruction calminstruction?.init? var*, val:0
        compute val, val
        publish var, val
    end calminstruction

    calminstruction pointer
        local id
        init id, 0
        compute id, id + 1
        local symbol, command
        arrange symbol, =next#id
        arrange command, =dd symbol
        assemble command
        local address
        compute address, $
        publish symbol:, address
    end calminstruction

The custom statement "init" is called at the time when the CALM instruction is
defined (it does not generate any commands to be executed by the defined
instruction - it would itself have to use "assemble" commands to generate
statements to be compiled). It is given the name of a variable from the local
scope of the CALM instruction, and it uses "publish" to assign an initial
numeric value to that variable.
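As another sketch of how "init" could be used, the instruction below hands out
consecutive numbers, starting from a value chosen at the time of definition.
The names "serial" and "number" are hypothetical, and the definition of "init"
shown above is assumed to be in effect:

    calminstruction serial name*
        local number
        init number, 100
        compute number, number + 1
        publish name:, number
    end calminstruction

    serial first            ; first := 101
    serial second           ; second := 102

The "init" statement is processed once, when "serial" is defined, so the
counter is not reset by later calls.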
To initialize a local variable with a symbolic value, an even simpler custom
instruction would suffice:

    calminstruction calminstruction?.initsym? var*, val&
        publish var, val
    end calminstruction

The text of the "val" argument carries the recognition context of the
definition of the CALM instruction that contains the "initsym" statement,
therefore it makes it possible to prepare a text for "assemble" containing
references to local symbols:

    calminstruction be32? value
        local command
        initsym command, dd value
        compute value, value bswap 4
        assemble command
    end calminstruction

Again, after this instruction is compiled, it contains just two actual commands,
"compute" and "assemble", and the value of the local symbol "command" is a text
that is interpreted in the same local context and refers to the same symbol
"value" as the "compute" does.
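Under the same assumptions (with "initsym" defined as above), a 16-bit
counterpart could look as follows; "be16" is a name introduced only for this
illustration:

    calminstruction be16? value
        local command
        initsym command, dw value
        compute value, value bswap 2
        assemble command
    end calminstruction

    be16 1234h              ; stores the bytes 12h, 34h
    be32 12345678h          ; stores the bytes 12h, 34h, 56h, 78h

The data directives store values with the least significant byte first, which
is why reversing the bytes with "bswap" beforehand gives the big-endian order.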
This example also demonstrates another advantage of CALM over standard
macroinstructions: its strict semantics prevent various kinds of unwanted
behavior that is allowed by a simple substitution of text. The text of "value"
is going to be evaluated by "compute" as a numeric sub-expression, signalling
an error on any unexpected syntax. Therefore it should be preferable to process
arguments entirely through CALM commands and only use "assemble" for final
simple statements.
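To illustrate the difference, consider a hypothetical variant "be32m" that
relies on plain substitution of text, used side by side with the "be32"
instruction defined earlier:

    macro be32m value
        dd value bswap 4
    end macro

    be32m 1+2               ; assembles the line "dd 1+2 bswap 4"
    be32 1+2                ; computes (1+2) bswap 4, then assembles "dd value"

In the first case "bswap" is applied according to the priority of operators
within the substituted text, so it may end up swapping only the "2", while
the CALM version evaluates the whole argument first and only then reverses
the bytes of the result.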