Table of Contents
A fixed-point parser can seem like a complicated idea. It's got lots of types, and lots of functions being passed around. Plus, it's got some monads involved, which can be quite intimidating!
However, it isn't as complex as it may seems once you understand some of the underlying concepts. So, let's dive into what makes up a fixed-point parser!
The Parser
In the fixed-point parser, a parser is represented as a function that takes the current index in the parse string and a continuation to call once it is done. 'done' could mean that it succeeded, it could mean that it failed, or it could mean something else. In addition, the continuation doesn't ever have to be called.
type parser = index -> parser_continuation -> state_transformer
Parser Continuation
Notice that the parser takes a parser_continuation in addition to an index.
This parser continuation could be considered to be a 'done' function. We call
the parser continuation once the parser has completed.
A continuation is called from within the parser. It may be called once, such as
in the case of the unit parser unit x, where the parser doesn't parse anything
and immediately calls the continuation with the value x and the current index.
It also may be called multiple times, such as in the case of the alternation
parser A|B, where it may complete with both A and B. It may also never be
called, such as in the case of the fail parser, where it immediately fails,
and thus never completes.
Parser "Output"
When you run a parser, it usually returns some value representing what was
parsed. It also advances the parent parser (if there is one) along the string.
We represent this with an output_pair. An output_pair is simply a
value/index pair. The value is the result of the parser, and the index is the
index on which the parser ended. For example, given the string "12ab" and a
parser that parses numbers, the parser will return the number 12 and the index
2 (assuming zero-based indexing).
type output_pair = index * value
Value and Index
As a side note, we need to quickly discuss what index and value are. They can
change depending how you decide to represent the two concepts, so we define these
only so that our code later makes sense.
index provides information about where you are in the string. For our
purposes, we will use an string * int pair. The string is the string that we
are attempting to parse. This doesn't change throughout the whole parsing. The
int is the index of the current character in the string, with the index of the first
character being 0.
(* (parse string * current character index) *)
type index = string * int
value is the value that you want your parser to produce. In our case, we will
simply return the parsed string. Thus, we will use strings for our values.
type value = string
Parser Continuation Result
Whatever happens after a parser may possibly update the state of the program.
This doesn't make a ton of sense yet, since we don't know what is inside the
state. We'll get to that next. The continuation will thus return a
state_transformer.
type parser_continuation = output_pair -> state_transformer
State Transformer
Once it has the index and the continuation, the parser then transforms the state that it is given. Thus, this state transformation could be represented as a function that takes a state and returns a state.
type state_transformer = state -> state
Note that in imperative languages, the state could be stored as a sort of 'global state' that is updated by the parser. In a language that supports it, the state could also be managed with algebraic effects.
State
What is the "state" of the program that we keep referring to? It's a map from a given "parser location" to certain information associated with that location (we will get into what this information is in a second).
(* imagine some easy-to-use hashmap instead of the
relatively complicated OCaml one *)
type state = (parser_location, location_info) map
Parser Location
The type of the parser_location is a pair, a rule_tag and an index. The
easiest way to explain this is by considering a parser as a sort of 'function'.
It takes as input the index to start the parse, and returns information about
what it parsed.
From this standpoint, we can view the state as a memoization of a parser call.
Given a parser tagged with P, we can treat mapping of P 1 -> foo to mean
that "when P is called at the index of 1, it returns foo.
let my_state =
(* ... run (P 1) and update the state with the result [foo] ... *)
in
Map.get_value_of ("P", 1) ~from:my_state (* => returns [foo] *)
Note that this requires a parser to be able to be tagged and memoized, which we will get to later.
type parser_location = rule_tag * index
Location Info
We mentioned that the parser location info is mapped to some information related to that location. There are two pieces of information that we need to remember.
First, we need to remember all of the values that the parser call has produced
at this position. We want to do this so that if we arrive at this point with a
new way to continue from this point (or in other words, a new
parser_continuation), we will want to run that with each value that has been
produced by this parser at this location. We will use a set of output_pairs
for this.
Next, we also need to remember all of the different possible ways that we can
continue from this point. This is because if we discover that this parser has
produced a new value, then we want to rerun all of the different possible ways
to continue using that value so that we can see if any off those produce a valid
parse. Thus, we will keep a list of parser_continuations.
(* imagine some easy-to-use set, instead of the
relatively complicated OCaml one *)
type location_info = (output_pair set) * (parser_continuation list)
Building a simple parser
Now that we have the basic concepts down, let's build a parser of our own! We will
build a parser that reads a single character from the string and advances to the
next character, called any_char.
First, let's remember what a parser is. A parser takes an index, the current
location in the parse string, and a parser_continuation, the 'done' function
for the parser. We will call these idx and complete, respectively.
let any_char : parser =
fun (idx : index) (complete : parser_continuation) ->
As mentioned previously, the index is a string * int pair, so we will unwrap that
so we can get the current character.
let parse_str, curr_char_idx = idx in
We want to make sure that we handle the case if the index is past the end of the parse string, so we will use an if statement to handle this.
if curr_char_idx < (String.length parse_str) then
If the index is inside the parse string, we will get the character at the index.
We will then convert the character to a string (since we are returning the
string that was parsed). Finally, we will advance the index by 1. Once we have
done that, we can 'finish' our parser by calling complete with the new index
and our character string as an output_pair. This will generate a
state_transformer that will be returned by our parser.
let c = String.get parse_str curr_char_idx in
let c_str = Char.to_string c in
let new_index = curr_char_idx + 1 in
complete (new_index, c_str) (* generates a state_transformer *)
Now, if the index is outside of the parse string, we want the parser to fail. How
do we do this? We just don't succeed. In other words, if calling the complete
continuation represents a parser succeeding, then we never call complete. However,
we still need to return a state_transformer. So we will simply create a state_transformer
that leaves the state unchanged.
else
(fun state -> state) (* leave the state unchanged *)
And there you have it! A parser that parses any character and advances the index. The full code for this parser is below.
let any_char : parser =
fun (idx : index) (complete : parser_continuation) ->
let parse_str, curr_char_idx = idx in
if curr_char_idx < (String.length parse_str) then
let c = String.get parse_str curr_char_idx in
let c_str = Char.to_string c in
let new_index = curr_char_idx + 1 in
complete (new_index, c_str) (* generates a state_transformer *)
else
(fun state -> state) (* leave the state unchanged *)