% !TEX TS-program = pdflatex
% !TeX spellcheck = en_US
% !TEX root = lama-spec.tex
\section{Lexical Structure}
\label{sec:lexical_structure}
The character set for the language is \textsc{ASCII}, case-sensitive. The lexical definitions below are given as
POSIX extended regular expressions.
\subsection{Whitespaces and Comments}
Whitespaces and comments are \textsc{ASCII} sequences which serve as delimiters for other tokens but otherwise are
ignored.
The following characters are treated as whitespaces:
\begin{itemize}
\item blank character "\texttt{ }";
\item newline character "\texttt{\textbackslash n}";
\item carriage return character "\texttt{\textbackslash r}";
\item tabulation character "\texttt{\textbackslash t}".
\end{itemize}
Additionally, two kinds of comments are recognized:
\begin{itemize}
\item the end-of-line comment "\texttt{--}" escapes the rest of the line, including itself;
\item the block comment "\texttt{(*} ... \texttt{*)}" escapes all the text between
"\texttt{(*}" and "\texttt{*)}".
\end{itemize}
There are a number of specific cases which have to be considered explicitly.
First, block comments can be properly nested. Second, occurrences of comment symbols inside string literals (see below) are not
considered as comments.
An end-of-line comment encountered \emph{outside} of a block comment escapes block comment symbols:
\begin{lstlisting}
-- the following symbols are not considered as a block comment: (*
-- same here: *)
\end{lstlisting}
Similarly, an end-of-line comment encountered inside a block comment is escaped:
\begin{lstlisting}
(* Block comment starts here ...
-- and ends here: *)
\end{lstlisting}
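The remaining two cases, nesting and comment symbols inside string literals, can be illustrated in the same way
(the string literal below is an arbitrary example and not a complete program):
\begin{lstlisting}
(* outer comment (* nested comment *) still inside the outer comment *)
"(* this is string content *) -- and so is this"
\end{lstlisting}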
\subsection{Identifiers and Constants}
The language distinguishes identifiers, signed decimal literals, and string and character literals (see Fig.~\ref{idents_and_consts}). There are
two kinds of identifiers: those beginning with an uppercase letter (\token{UIDENT}) and those beginning with a lowercase letter (\token{LIDENT}).
String literals cannot span multiple lines; a double quote character (") inside a string literal has to be doubled to prevent it from
being considered as the literal's delimiter.
Character literals as a rule consist of a single \textsc{ASCII} character; if this character is a quote (') it has to be doubled. Additionally,
the two-character abbreviations "\textbackslash t" and "\textbackslash n" are recognized and converted into their single-character representations.
\begin{figure}[t]
\[
\begin{array}{rcl}
\token{UIDENT} & = &\mbox{\texttt{[A-Z][a-zA-Z\_0-9]*}}\\
\token{LIDENT} & = &\mbox{\texttt{[a-z][a-zA-Z\_0-9]*}}\\
\token{DECIMAL}& = &\mbox{\texttt{-?[0-9]+}}\\
\token{STRING} & = &\mbox{\texttt{"([\^{}\textbackslash"]|"")*"}}\\
\token{CHAR} & = &\mbox{\texttt{'([\^{}']|''|\textbackslash n|\textbackslash t)'}}
\end{array}
\]
\caption{Identifiers and constants}
\label{idents_and_consts}
\end{figure}
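For illustration, here are a few lexemes matching each definition in Fig.~\ref{idents_and_consts}; the concrete
names and values are chosen arbitrarily and carry no special meaning:
\begin{lstlisting}
Cons  Nil  A_1            -- UIDENT
x  fooBar  n_0            -- LIDENT
0  42  -7                 -- DECIMAL
""  "abc"  "say ""hi"""   -- STRING; the last one denotes: say "hi"
'a'  ''''  '\n'  '\t'     -- CHAR; '''' denotes a single quote
\end{lstlisting}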
\subsection{Keywords}
The following identifiers are reserved for keywords:
\begin{lstlisting}
after array at before box case do elif else
esac eta false fi for fun if import infix
infixl infixr lazy od of public sexp skip str
syntax then true val var while let in
\end{lstlisting}
\subsection{Infix Operators}
Infix operators are defined as follows:
\[
\token{INFIX}=\mbox{\texttt{[+*/\%\$\#@!|\&\^{}~?<>:=\textbackslash-]+}}
\]
There is a predefined set of built-in infix operators (see Fig.~\ref{builtin_infixes}); additionally,
an end-user can define custom infix operators (see Section~\ref{sec:custom_infix}). Note that
additional whitespace is sometimes required to disambiguate infix operator applications. For example, if a
custom infix operator "\lstinline|+-|" is defined, then the expression "\lstinline|a +- b|" can no longer be
recognized as "\lstinline|a +(-b)|". Note also that a custom operator containing "\lstinline|--|" cannot be
defined, since "\lstinline|--|" starts an end-of-line comment.
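As an illustration of the whitespace issue, consider the following fragments. The operator "\lstinline|+-|" is
the hypothetical custom operator from the example above, not a built-in one, and the trailing remarks are ordinary
end-of-line comments:
\begin{lstlisting}
a + b      -- the infix operator "+" applied to a and b
a +- b     -- read as the single operator "+-" once such an operator is defined
a + (-b)   -- whitespace and parentheses keep "+" and "-" apart
a --> b    -- "--" starts an end-of-line comment, so only "a" remains
\end{lstlisting}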
\subsection{Delimiters}
The following symbols are treated as delimiters:
\begin{lstlisting}
. , ( ) { }
; # -> |
\end{lstlisting}
Note that custom infix operators can coincide with the delimiters "\lstinline|#|", "\lstinline!|!", and "\lstinline|->|", which can
sometimes be misleading.