A parser of type-level domain representations.
As LLMs get better at writing in DSLs, from JSON to HTML to SQL to GraphQL, they've become a fast and relatively error-free way to develop software.
Many libraries, like ORMs, act as a typing layer between a DSL and an application.
This library aims to bridge that gap, providing the copy-and-paste-ability of LLM-generated DSL code with the type-safety of an ORM.
Let's examine how purescript-tldr
works by creating a parser for a subset of JSON, called MySON. It's like JSON, but has no floats and strings can only be alphanumeric.
We'll first create data
types that our data will parse to. For example, 1
and 42
will parse to MyInt (Proxy "1")
and MyInt (Proxy "42")
.
data MyInt a
instance ShowParser (SP1 "MyInt" a) doc => ShowParser (MyInt a) doc
In the MyInt
, we saw that ShowParser
is used to document our parser. This must be done for every data type to which you parse, otherwise parsing will fail. This library is intended to provide the most helpful parsing errors possible, and that can't be done without rich error messages for any data
it parses.
A convenient way to do this is to use SP1
, SP2
... based on the arity of the type, which will generate a sensible automatic representation.
After defining our data types, we match them against whatever we want to parse. In the example below, we use WS
to ignore optional whitespace around an int, and then match against at least one value between 0 and 9.
type ParseInt = WS (MyInt (M.Some M.Match09))
We'll go over all of the matchers and combinators in the API section.
module Test.Readme where
import Prelude
import Prim.TypeError (Text)
import TLDR.Combinators (ParseAndThenIgnore)
import TLDR.Combinators as C
import TLDR.Combinators.Class (class FailOnFail, class Parse, class ShowParser, SP1, SP2)
import TLDR.Matchers as M
import TLDR.Sugar (Bracket, DQ, L4, L5, WS, WSM)
import Type.Function (type ($))
import Type.Proxy (Proxy(..))
data MySONFix
instance ShowParser MySONFix (Text "MySONFix")
data MyInt a
instance ShowParser (SP1 "MyInt" a) doc => ShowParser (MyInt a) doc
data MyString a
instance ShowParser (SP1 "MyString" a) doc => ShowParser (MyString a) doc
data MyBoolean a
instance ShowParser (SP1 "MyBoolean" a) doc => ShowParser (MyBoolean a) doc
data MyArray a
instance ShowParser (SP1 "MyArray" a) doc => ShowParser (MyArray a) doc
data MyObjectEntry k v
instance ShowParser (SP2 "MyObjectEntry" a b) doc => ShowParser (MyObjectEntry a b) doc
data MyObject a
instance ShowParser (SP1 "MyObject" a) doc => ShowParser (MyObject a) doc
type ParseInt = WS (MyInt (M.Some M.Match09))
type ParseString = WS (DQ (MyString (M.Some M.MatchAlphanumeric)))
type ParseBoolean = WS
$ MyBoolean
$ M.Or (L4 "t" "r" "u" "e") (L5 "f" "a" "l" "s" "e")
type ParseArray a = WS
$ Bracket (M.Literal "[")
(MyArray (C.SepBy a (WSM (M.Literal ","))))
(M.Literal "]")
type ObjectPair (a :: Type) = WS
$ MyObjectEntry
( ParseAndThenIgnore
(WS (DQ (M.Some M.MatchAlphanumeric)))
(M.Literal ":")
)
a
type ParseObject a = WS
$ Bracket (M.Literal "{")
(MyObject (C.SepBy (ObjectPair a) (WSM (M.Literal ","))))
(M.Literal "}")
type MySON = WS
$ C.Fix
( mySONFix ::
C.Or ParseInt
$ C.Or ParseString
$ C.Or ParseBoolean
$ C.Or (ParseArray (Proxy "mySONFix")) (ParseObject (Proxy "mySONFix"))
)
myson0' :: forall a. Parse "1" MySON Unit a Unit => Unit -> Proxy a
myson0' _ = Proxy
myson0 = myson0' unit
myson1'
:: forall a
. Parse
"""[1,"true",false]"""
MySON
Unit
a
Unit
=> Unit
-> Proxy a
myson1' _ = Proxy
myson1 = myson1' unit
myson2'
:: forall a
. Parse
"""
{"username":"mikesol", "pw":1234, "tags":["bored", "tired"] }
"""
MySON
Unit
a
Unit
=> Unit
-> Proxy a
myson2' _ = Proxy
myson2 = myson2' unit
Now, when we hover over the ide for the definitions of myson0
, myson1
, and myson2
, we get the following:
Proxy (Success (MyInt (Proxy "1")) "")
Proxy (Success (MyArray (Cons (MyInt (Proxy "1")) (Cons (MyString (Proxy "true")) (Cons (MyBoolean (Proxy "false")) Nil)))) "")
Proxy (Success (MyObject (Cons (MyObjectEntry (Proxy "username") (MyString (Proxy "mikesol"))) (Cons (MyObjectEntry (Proxy "pw") (MyInt (Proxy "1234"))) (Cons (MyObjectEntry (Proxy "tags") (MyArray (Cons (MyString (Proxy "bored")) (Cons (MyString (Proxy "tired")) Nil)))) Nil)))) "")
From there, you can consume your typelevel ADT in downstream constraints.
When parsing fails, we can print a reasonably nice error messages using the FailOnFail
class. Otherwise, the type variable a
can carry the failure forward for downstream processing.
Take for example:
mysonFail'
:: forall a
. Parse
"""{1,"true",false}"""
MySON
Unit
a
Unit
=> FailOnFail a
=> Unit
-> Proxy a
mysonFail' _ = Proxy
mysonFail = mysonFail' unit
You get:
Custom error:
Could not match } against 1
The generated documentation for type-level libraries on Pursuit is notoriously difficult to read. So here's the API. It's split into Parse
, matchers, combinators, and sugar.
class Parse
is how you parse a typelevel string.TLDR.Matchers
constains roughly the same stuff asParsing.String
.TLDR.Combinators
contains roughyl the same stuff asParsing.Combinators
.TLDR.Sugar
contains some stuff I was too lazy to write out by hand over and over.
As seen in the MySON example, the signature of Parse
is:
Parse sym combinator inState result outState
We'll see how to use state in a bit. In the MySON example, it's just Unit
and it isn't touched.
Matcher | What it does |
---|---|
Match09 |
Matches a char in the range 0-9 |
Matchaz |
Matches a char in the range a-z |
MatchAZ |
Matches a char in the range A-Z |
MatchAlpha |
Matches a char in the range a-zA-Z |
MatchAlphanumeric |
Matches a char in the range a-zA-Z0-0 |
Matchaz09 |
Matches a char in the range a-z0-9 |
MatchAZ09 |
Matches a char in the range A-Z0-9 |
MatchHex |
Matches a char in the range a-f0-9 |
MatchWhitespace |
Matches a whitespace char |
Literal char |
Matches a literal. Will fail on anything longer than 1 char! |
Match2 match |
Matches two consective chars matching match |
Match3 match |
Matches three consective chars matching match |
Match4 match |
Matches four consective chars matching match |
Match5 match |
Matches five consective chars matching match |
Match6 match |
Matches six consective chars matching match |
Match7 match |
Matches seven consective chars matching match |
Match8 match |
Matches eight consective chars matching match |
Match9 match |
Matches nine consective chars matching match |
Many match |
Matches 0 or more match |
Some match |
Matches 1 or more match |
Except no yes |
Matches yes except anything matching to no |
Or left right |
Matches either left or right |
And left right |
Consumes left , then right , and concatenates them |
Noop |
Doesn't match anything, no input is consumed |
EOF |
Matches "" |
Any |
Matches any single char |
Note that PureScript typelevel strings don't have a proper Char
representation, so when I use "char" above, I mean any head
that is produced by Symbol.Cons head tail string
. For example, Symbol.Cons head tail "hello"
produces "h"
for head
.
Combinators take matchers and use them to populate our data types.
At the most basic level, any data type that accepts Type
arguments and with a ShowParser
instance can be a combinator. For example, MyBoolean
is a combinator below.
type ParseBoolean = MyBoolean $ M.Or (L4 "t" "r" "u" "e") (L5 "f" "a" "l" "s" "e")
In a term-level parsing library, you'd achieve a similar result with mapping, ie:
parseBoolean = MyBoolean <$> or (l4 "t" "r" "u" "e") (l5 "f" "a" "l" "s" "e")
Here are the combinators. Fix
is a mind-melt, so there'll be a whole section on that in a bit.
Combinator | What it does |
---|---|
IgnoreAndThenParse ignore parse |
Ignore matcher ignore and then parse combinator parse . |
ParseAndThenIgnore parse ignore |
Parse combinator parse and then ignore matcher ignore . |
Many a |
Accumulates 0 or more of combinator a into a typelevel list. |
Some a |
Accumulates 1 or more of combinator a into a typelevel list. |
Const a |
Consume no input and output a . |
Or left right |
Parses to either left or right |
SepBy a sep |
Parse phrases delimited by a separator. |
SepBy1 a sep |
Parse phrases delimited by a separator, requiring at least one match. |
SepEndBy a sep |
Parse phrases delimited and optionally terminated by a separator. |
SepEndBy1 a sep |
Parse phrases delimited and optionally terminated by a separator, requiring at least one match. |
EndBy a sep |
Parse phrases delimited and terminated by a separator. |
EndBy1 a sep |
Parse phrases delimited and terminated by a separator, requiring at least one match. |
Fix selfs |
A fixed point. See below. |
Fixed points are essential in parsing because they allow you to parse arbitrarily nested structures. The MySON example above uses one, and most parsing tasks eventually require a fixed point.
The way Fix selfs
works is that you define a row of types for selfs
. In the MySON example, it's a row with entry mySONFix
. Then, in a
, you refer to Proxy "mySONFix"
whenever you want to parse a
. This is how objects and arrays are expressed in the MySON
example.
To achieve mutual recursion, you can add more keys to the row. In this case, the combinator with the key that comes first in the alphabet is used for parsing.
There are three state-management combinators for keeping and using an internal state.
Combinator | What it does |
---|---|
ModifyStateAfterSuccessOnConstant constant combinator |
Modify the state upon a successful parsing outcome with a constant value. |
ModifyStateAfterSuccessWithResult functor combinator |
Modify the state upon a successful parsing outcome with the result of the parsing operation as the argument to functor . |
BranchOnState branches otherwise |
Branch to different parsers depending on the internal state. |
There are examples of all four of these in the tests.
In general, state management should be considered an advanced feature. Most often, it's not needed, and you can do all stateful tasks it in a second step. For example, in XML parsing, you can use state management to determine if open tags are closed. But you can also do this in a second step after successfully parsing to a data type. While the latter is slower, it's easier to read and implement.
- Some combinators have the same names as data types. For example,
Or
in matcher-land matches either of two symbols, whereasOr
in combinator-land matches either of two data types. - Some combinators take other combinators, some take matchers, and some like
SepBy
take both - a combinator for the thing to match and a matcher for the separator. Getting this wrong can lead to funky error messages.
What fun would life be without syntactic sugar? Probably healthier and cavity-free. But still, we can indulge ourselves from time to time with these useful, delectable tidbits.
Sugar | What it does |
---|---|
WS combinator |
Allows for optional whitespace before and after a combinator. |
WSM match |
Allows for optional whitespace before and after a match. |
DQ combinator |
Puts a combinator in double quotes. |
DQM match |
Puts a match in double quotes. |
Bracket left combinator right |
Brackets a combinator between two ignorable matches. |
L1 ... L10 |
Matches a literal with 1 to 10 arguments, ie L4 "t" "r" "u" "e" . |