parsing-replace is for finding text patterns, and also replacing or splitting on the found patterns. This activity is traditionally done with regular expressions, but parsing-replace uses Text.Parsing.Parser.String parsers instead for the pattern matching.

parsing-replace can be used in the same sort of “pattern capture” or “find all” situations in which one would use Data.String.Regex.match.

parsing-replace can be used in the same sort of “stream editing” or “search-and-replace” situations in which one would use Data.String.Regex.replace'.

parsing-replace can be used in the same sort of “string splitting” situations in which one would use Data.String.Regex.split.

See replace-megaparsec for the Haskell megaparsec version.

See replace-attoparsec for the Haskell attoparsec version.

We can expect the performance of parser-based find-and-replace to be worse than regex-based find-and-replace in a JavaScript runtime environment. This module is intended for use when the input size is modest, the pattern is complicated, and readability and maintainability are more important than speed.

Why would we want to do pattern matching and substitution with parsers instead of regular expressions?

  • Monadic parsers have a nicer syntax than regular expressions, which are notoriously difficult to read.

  • Regular expressions can do “group capture” on sections of the matched pattern, but they can only return stringy lists of the capture groups. Parsers can construct typed data structures based on the capture groups, guaranteeing no disagreement between the pattern rules and the rules that we're using to build data structures based on the pattern matches.

    For example, consider scanning a string for numbers. A lot of different things can look like a number, and can have leading plus or minus signs, or be in scientific notation, or have commas, or whatever. If we try to parse all of the numbers out of a string using regular expressions, then we have to make sure that the regular expression and the string-to-number conversion function agree about exactly what is and what isn't a numeric string. We can get into an awkward situation in which the regular expression says it has found a numeric string but the string-to-number conversion function fails. A typed parser will perform both the pattern match and the conversion, so it will never be in that situation. Parse, don't validate.

  • Regular expressions are only able to pattern-match regular grammars. Monadic parsers are able to pattern-match context-free (by recursion) or context-sensitive (by monad transformer) grammars.

  • The replacement expression for a traditional regular expression-based substitution command is usually just a string template in which the Nth “capture group” can be inserted with the syntax \N. With this library, instead of a template, we get an editor function which can perform any computation, including Effects.
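To sketch the number-scanning point above in code (an illustrative example, not part of this library, assuming the Text.Parsing.Parser modules named in the introduction): a typed parser for signed integers performs the pattern match and the string-to-number conversion in one place, so a successful match can never fail to convert.

```purescript
import Prelude
import Data.Array (some)
import Data.Int (fromString)
import Data.Maybe (maybe)
import Data.String.CodeUnits (fromCharArray)
import Text.Parsing.Parser (Parser, fail)
import Text.Parsing.Parser.Combinators (option)
import Text.Parsing.Parser.String (string)
import Text.Parsing.Parser.Token (digit)

-- A signed-integer parser: the match and the conversion agree by
-- construction. If fromString rejects the matched text, the parser
-- fails instead of claiming it found a number.
signedInt :: Parser String Int
signedInt = do
  sign   <- option "" (string "-")
  digits <- fromCharArray <$> some digit
  maybe (fail "not an Int") pure $ fromString (sign <> digits)
```

Then breakCap (match signedInt) "abc -42 def" should capture the matched text "-42" paired with the Int -42, with no chance of disagreement between the two.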

Perl Problems

Usage Examples

These usage examples are implemented in test/Main.purs.

Break strings with breakCap

Find the first pattern match and break the input string on the pattern.

breakCap (string "needle") "hay needle hay"
Just $ "hay " /\ "needle" /\ " hay"

Split strings with splitCap

Split the input string on all pattern matches.

splitCap (string "needle") "hay needle straw needle hay"
[Left "hay ", Right "needle", Left " straw ", Right "needle", Left " hay"]

Edit strings with streamEdit

Edit all found patterns with an editor function.

streamEdit (string "needle") toUpper "hay needle hay"
"hay NEEDLE hay"

Break strings with breakCap and match

Find the first pattern match, capture the matched text and the parsed result.

parseInt :: Parser String Int
parseInt = some digit >>= fromCharArray >>> fromString >>> maybe (fail "fromString") pure

breakCap (match parseInt) "abc 123 def"
Just $ "abc " /\ ("123" /\ 123) /\ " def"

Find all positions of a pattern with splitCap

Find the beginning positions of all pattern matches in the input.

catMaybes $ hush <$> splitCap (position <* string "𝝺") ".𝝺...\n...𝝺."
[Position { line: 1, column: 2 }, Position { line: 2, column: 4 }]

Find balanced parentheses with splitCap

Find groups of balanced nested parentheses. This pattern is an example of a “context-free” grammar, a pattern that can't be expressed by a regular expression. We can express the pattern with a recursive parser.

balancedParens :: Parser String Unit
balancedParens = do
  void $ char '('
  void $ manyTill (balancedParens <|> void anyChar) (char ')')
  pure unit

rmap fst <$> splitCap (match balancedParens) "((🌼)) (()())"
[Right "((🌼))", Left " ", Right "(()())"]

Edit strings in Effect with streamEditT

Find an environment variable in curly braces and replace it with its value from the environment. We can read from the environment because streamEditT is running the editor function in Effect.

pattern = string "{" *> anyTill (string "}")
editor  = fst >>> lookupEnv >=> fromMaybe "" >>> pure
streamEditT pattern editor "◀ {HOME} ▶"
"◀ /home/jbrock ▶"

Count the pattern matches with splitCapT

Parse in a State monad to remember state in the parser. This stateful letterCount parser counts the number of pattern matches which occur in the input, and also tags each match with its index.

letterCount :: ParserT String (State Int) (Tuple Char Int)
letterCount = do
  l <- letter
  i <- lift $ modify (_+1)
  pure $ l /\ i

flip runState 0 $ splitCapT letterCount "A B"
[Right ('A' /\ 1), Left " ", Right ('B' /\ 2)] /\ 2



How do we repeat a “non-greedy” pattern?

…as we would in regex by writing p*? to repeat a pattern p non-greedily?

manyTill_ p q where q is the entire rest of the parser.
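For example (a sketch, assuming a manyTill_ combinator that returns the accumulated results paired with the terminator's result, as in recent purescript-parsing releases): the regex .*?stop, which consumes the shortest prefix of the input ending with "stop", can be written as a parser.

```purescript
import Prelude
import Data.Array (fromFoldable)
import Data.String.CodeUnits (fromCharArray)
import Data.Tuple (Tuple(..))
import Text.Parsing.Parser (Parser)
import Text.Parsing.Parser.Combinators (manyTill_)
import Text.Parsing.Parser.String (anyChar, string)

-- Non-greedy repetition: like the regex .*?stop, this matches the
-- shortest run of any characters up to the first "stop".
upToStop :: Parser String String
upToStop = do
  Tuple chars _ <- manyTill_ anyChar (string "stop")
  pure $ fromCharArray $ fromFoldable chars
```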

Is this a good idea?

You may have heard it suggested that monadic parsers are better for pattern-matching when the input stream is mostly signal, and regular expressions are better when the input stream is mostly noise.

The premise of this library is that monadic parsers are great for finding small signal patterns in a stream of otherwise noisy text.

Our reluctance to forgo the speedup opportunities afforded by restricting ourselves to regular grammars is an old superstition about opportunities which remain mostly unexploited anyway. The performance compromise of allowing stack memory allocation (a.k.a. pushdown automata, a.k.a. context-free grammar, a.k.a. recursion) was once considered controversial for general-purpose programming languages. I think we can now resolve that controversy the same way for pattern-matching languages.