% !TEX TS-program = pdflatex
% !TeX spellcheck = en_US
% !TEX root = lama-spec.tex

\section{Lexical Structure}
\label{sec:lexical_structure}

The character set for the language is \textsc{ASCII}; the language is case-sensitive. The lexical definitions below are given as
POSIX extended regular expressions.

\subsection{Whitespace and Comments}

Whitespace and comments are \textsc{ASCII} sequences that delimit other tokens but are otherwise
ignored.

The following characters are treated as whitespace:

\begin{itemize}
\item blank character "\texttt{ }";
\item newline character "\texttt{\textbackslash n}";
\item carriage return character "\texttt{\textbackslash r}";
\item tabulation character "\texttt{\textbackslash t}".
\end{itemize}

Additionally, two kinds of comments are recognized:

\begin{itemize}
\item the end-of-line comment "\texttt{--}" escapes the rest of the line, including itself;
\item the block comment "\texttt{(*} ... \texttt{*)}" escapes all the text between
  "\texttt{(*}" and "\texttt{*)}".
\end{itemize}

There are a number of special cases that have to be considered explicitly.

First, block comments can be properly nested. Second, occurrences of comment symbols inside string literals (see below) are not
considered comments.
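
For instance, under these two rules the following lines are expected to be treated as indicated (a minimal sketch; the contents of the comment and of the string are purely illustrative):

\begin{lstlisting}
    (* outer (* inner *) this text is still inside the outer comment *)
    "neither -- nor (* ... *) starts a comment inside a string literal"
\end{lstlisting}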

An end-of-line comment encountered \emph{outside} of a block comment escapes block comment symbols:

\begin{lstlisting}
    -- the following symbols are not considered as a block comment: (*
    -- same here: *)
\end{lstlisting}

Similarly, an end-of-line comment symbol encountered inside a block comment is itself escaped, so it does not hide the closing "\texttt{*)}":

\begin{lstlisting}
    (* Block comment starts here ...
       -- and ends here: *)
\end{lstlisting}

\subsection{Identifiers and Constants}

The language distinguishes identifiers, signed decimal literals, string literals, and character literals (see Fig.~\ref{idents_and_consts}). There are
two kinds of identifiers: those beginning with an uppercase letter (\token{UIDENT}) and those beginning with a lowercase letter (\token{LIDENT}).

String literals cannot span multiple lines; a double quote character (") inside a string literal has to be doubled to prevent it from
being treated as the literal's delimiter.

Character literals normally consist of a single \textsc{ASCII} character; if this character is a quote (') it has to be doubled. Additionally,
the two-character abbreviations "\textbackslash t" and "\textbackslash n" are recognized and converted into the corresponding single characters.

\begin{figure}[t]
  \[
  \begin{array}{rcl}
    \token{UIDENT} & = &\mbox{\texttt{[A-Z][a-zA-Z\_0-9]*}}\\
    \token{LIDENT} & = &\mbox{\texttt{[a-z][a-zA-Z\_0-9]*}}\\
    \token{DECIMAL}& = &\mbox{\texttt{-?[0-9]+}}\\
    \token{STRING} & = &\mbox{\texttt{"([\^{}\textbackslash"]|"")*"}}\\
    \token{CHAR}   & = &\mbox{\texttt{'([\^{}']|''|\textbackslash n|\textbackslash t)'}}
  \end{array}
  \]
  \caption{Identifiers and constants}
  \label{idents_and_consts}
\end{figure}
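
As an illustration, the following lexemes match the definitions in Fig.~\ref{idents_and_consts}. The concrete names and values are hypothetical, chosen only to exercise each rule, and the trailing end-of-line comments name the token class:

\begin{lstlisting}
    Cons   A_1            -- UIDENT
    x      fold_left2     -- LIDENT
    42     -7             -- DECIMAL
    "say ""hi"""          -- STRING denoting the text: say "hi"
    'a'    ''''    '\n'   -- CHAR: the letter a, a single quote, a newline
\end{lstlisting}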


\subsection{Keywords}

The following identifiers are reserved for keywords:

\begin{lstlisting}
    after    array    at      before   box   case     do     elif     else
    esac     eta      false   fi       for   fun      if     import   infix
    infixl   infixr   lazy    od       of    public   sexp   skip     str
    syntax   then     true    val      var   while    let    in
\end{lstlisting}

\subsection{Infix Operators}

Infix operators are defined as follows:

\[
\token{INFIX}=\mbox{\texttt{[+*/\%\$\#@!|\&\^{}~?<>:=\textbackslash-]+}}
\]

There is a predefined set of built-in infix operators (see Fig.~\ref{builtin_infixes}); additionally,
an end user can define custom infix operators (see Section~\ref{sec:custom_infix}). Note that additional
whitespace is sometimes required to disambiguate infix operator applications. For example, if a
custom infix operator "\lstinline|+-|" is defined, then the expression "\lstinline|a +- b|" can no longer be
recognized as "\lstinline|a +(-b)|". Note also that a custom operator containing "\lstinline|--|" cannot be
defined, since "\lstinline|--|" starts an end-of-line comment.
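
As a sketch, assume that a custom operator "\lstinline|+-|" has been defined (its definition is omitted here). The three spellings below are then expected to be read as indicated, so whitespace or parentheses select the intended meaning:

\begin{lstlisting}
    a +- b      -- the custom operator +- applied to a and b
    a + -b      -- built-in + applied to a and the negation of b
    a +(-b)     -- the same, with explicit parentheses
\end{lstlisting}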

\subsection{Delimiters}

The following symbols are treated as delimiters:

\begin{lstlisting}
    .       ,         (        )        {        }
    ;       #         ->       |
\end{lstlisting}

Note that custom infix operators can coincide with the delimiters "\lstinline|#|", "\lstinline!|!", and "\lstinline|->|", which can
sometimes be misleading.