% !TEX TS-program = pdflatex
% !TeX spellcheck = en_US
% !TEX root = lama-spec.tex
\section{Lexical Structure}
\label{sec:lexical_structure}
The character set for the language is \textsc{ASCII}; the language is case-sensitive. The lexical definitions below use
the GNU Regexp syntax~\cite{GNULib}.
\subsection{Whitespaces and Comments}
Whitespaces and comments are \textsc{ASCII} sequences which serve as delimiters for other tokens but otherwise are
ignored.
The following characters are treated as whitespaces:
\begin{itemize}
\item blank character "\texttt{ }";
\item newline character "\texttt{\textbackslash n}";
\item carriage return character "\texttt{\textbackslash r}";
\item tabulation character "\texttt{\textbackslash t}".
\end{itemize}
Additionally, two kinds of comments are recognized:
\begin{itemize}
\item the end-of-line comment "\texttt{--}" escapes the rest of the line, including itself;
\item the block comment "\texttt{(*} ... \texttt{*)}" escapes all the text between
"\texttt{(*}" and "\texttt{*)}".
\end{itemize}
There are a number of specific cases which have to be considered explicitly.
First, block comments can be properly nested. Second, occurrences of comment symbols inside string literals (see below) are not
treated as comments.
An end-of-line comment encountered \emph{outside} of a block comment escapes block comment symbols:
\begin{lstlisting}
-- the following symbols are not considered as a block comment: (*
-- same here: *)
\end{lstlisting}
Similarly, an end-of-line comment encountered inside a block comment is escaped:
\begin{lstlisting}
(* Block comment starts here ...
-- and ends here: *)
\end{lstlisting}
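The cases of nested block comments and of comment symbols inside string literals can be illustrated in the same manner. The
following snippet is an illustrative sketch made up for this section: the first line is a single block comment which ends only at
the outermost "\texttt{*)}", and the second line is a single string literal whose contents are not treated as comments.
\begin{lstlisting}
(* outer comment (* nested comment *) still inside the outer comment *)
"-- this is string contents (* and so is this *)"
\end{lstlisting}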
\subsection{Identifiers and Constants}
The language distinguishes identifiers, signed decimal literals, string literals, and character literals (see Fig.~\ref{idents_and_consts}). There are
two kinds of identifiers: those beginning with an uppercase character (\token{UIDENT}) and those beginning with a lowercase character (\token{LIDENT}).
String literals cannot span multiple lines; a double quote character (") inside a string literal has to be doubled to prevent it from
being considered as this literal's delimiter.
Character literals, as a rule, consist of a single \textsc{ASCII} character; if this character is a quote (') it has to be doubled. Additionally,
the two-character abbreviations "\textbackslash t" and "\textbackslash n" are recognized and converted into a single-character representation.
\begin{figure}[t]
\[
\begin{array}{rcl}
\token{UIDENT} & = &\mbox{\texttt{[A-Z][a-zA-Z\_0-9]*}}\\
\token{LIDENT} & = &\mbox{\texttt{[a-z][a-zA-Z\_0-9]*}}\\
\token{DECIMAL}& = &\mbox{\texttt{-?[0-9]+}}\\
\token{STRING} & = &\mbox{\texttt{"([\^{}\textbackslash"]|"")*"}}\\
\token{CHAR} & = &\mbox{\texttt{'([\^{}']|''|\textbackslash n|\textbackslash t)'}}
\end{array}
\]
\caption{Identifiers and constants}
\label{idents_and_consts}
\end{figure}
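As an informal illustration of the definitions in Fig.~\ref{idents_and_consts}, each sample lexeme below is matched in full by the
regular expression named in the trailing end-of-line comment. The samples are made up for this section and are not taken from an
existing program.
\begin{lstlisting}
Cons    A_1                  -- UIDENT
x       fooBar_2             -- LIDENT
0       42       -7          -- DECIMAL
"abc"   "say ""hi"""         -- STRING
'a'     ''''     '\n'        -- CHAR
\end{lstlisting}
Here \lstinline|"say ""hi"""| denotes the string \texttt{say "hi"}, and \lstinline|''''| denotes the single quote character.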
\subsection{Keywords}
The following identifiers are reserved for keywords:
\begin{lstlisting}
after array at before box case do elif else
esac eta false fi for fun if import infix
infixl infixr lazy od of public sexp skip str
syntax then true val var while
\end{lstlisting}
\subsection{Infix Operators}
Infix operators are defined as follows:
\[
\token{INFIX}=\mbox{\texttt{[+*/\%\$\#@!|\&\^{}~?<>:=\textbackslash-]+}}
\]
There is a predefined set of built-in infix operators (see Fig.~\ref{builtin_infixes}); additionally
an end-user can define custom infix operators (see Section~\ref{sec:custom_infix}). Note that
additional whitespace is sometimes required to disambiguate infix operator applications. For example, if a
custom infix operator "\lstinline|+-|" is defined, then the expression "\lstinline|a +- b|" can no longer be
recognized as "\lstinline|a +(-b)|". Note also that a custom operator containing "\lstinline|--|" cannot be
defined due to lexical conventions, since "\lstinline|--|" starts an end-of-line comment.
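As a sketch of this disambiguation, assume that a custom infix operator "\lstinline|+-|" has been defined (the operator and the
variables \lstinline|a| and \lstinline|b| are hypothetical). The two readings then have to be written differently:
\begin{lstlisting}
a +- b     -- application of the custom operator +-
a + (-b)   -- built-in +, applied to a and the negation of b
\end{lstlisting}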
\subsection{Delimiters}
The following symbols are treated as delimiters:
\begin{lstlisting}
. , ( ) { }
; # -> |
\end{lstlisting}
Note that custom infix operators can coincide with the delimiters "\lstinline|#|", "\lstinline!|!", and "\lstinline|->|", which can
sometimes be misleading.