rten-brel: An indexing program

2021-08-21

This article talks about the program rten-brel. It is a program to index documents. Since it is not finished yet, this article will be updated once in a while.

Disclaimer

This is still a work in progress, so forgive all those errors here and there. ;P

Source

The source codes can be found in the git repository.

Rationale

About a year ago, I found that the built-in Emacs package org-agenda is unsatisfatory for my needs. I know there are various packages out there that do all sorts of things, but what I want is not limited to Org-mode, which those package assume to be the case. Specifically, I would like to use various formats of documents as I see fit: per chance a mixture of LaTeX-mode and Org-mode, or a simple format for recording my notes, without all the bells and whistles that Org-mode provides.

So I wrote a simple package in Emacs Lisp. But it turned out that the performance is terrible: even with only a couple of small to medium files, it took some seconds to do the job. This is quite unacceptable, since in my vision this might be applied to hundreds of files, and should still complete within a second or two. Hence I started trying to write this in C.

Features

With all that said, what is this program expected to do?

It is expected to do only one thing: indexing. Given a set of regular expressions, find the positions in a document that these regular expressions match.

Unicode support

The default encoding is assumed to be UTF-8. Other two encodings (UTF-16 and UTF-32) are also (supposed to be) supported.

Regular expression

This program has its own regular expression parser and matcher. And in my plan the syntax should be compatible with multiple formats: either the Emacs syntax, the Emacs rx syntax, or the grep syntax, should be supported.

However, this won't support back-references. That is, such regular expressions as "\\(cat\\|dog\\)\\1" are not going to be supported. This is mainly to avoid the potential exponential time problem such regular expressions bring. Also, I don't use this feature myself. :D

Still that is a potentially useful feature, so I plan to add some sort of Parsing Expression Grammar support in the future, which should be more than enough. But right now this remains a dream.

Query syntax

I tried with various options, and found that I like the "query algebra" approach (citation needed).

Rewrite

Actually, I finished the first version of this program about a month ago. But then I encountered some weird bugs that I could not debug. So I thought I should rewrite from scratch to work on the correct ground. Unfortunately I got occupied by other things in the meantime, so this is still unfinished.

More features

While rewriting, I think that I shall also add more features, or improve the performance somehow. So I plan to find a way to compactly run regular expression matchers.

NFA & DFA

Advantages

The original implementation was written using the virtual machine approach, as discussed in this series of articles (the third article discusses the virtual machine approach).

Now I think the deterministic finite automata approach should be the most efficient way to do it. This is partially inspired by finding an efficient DFA-based UTF-8 parser. See Bjoern Hoehrmann's article for details.

My belief is that a deterministic finite automaton can be simply written down as an array of numbers, and then the state transitions are done by simple table lookups. I think this is the ideal approach for my program.

Moreover, with this approach, I can merge the parsing of UTF-8 with the matching of regular expressions. And we can represent Unicode character sets / character classes as DFAs as well, whic further increases the flexibility of the design in my opinion.

Challenges

But I have to find a way to convert a regular expression into a DFA efficiently first. This must include the functionality of tracking submatches, since the essential features of the program rely on that.

Grammars and Parse Combinators

After some time, I now think that what I need is not a regular expression, but a formal grammar. To be precise, it would be better to be able to define the structures of documents by some grammar at least as powerful as context-free grammars.

Since I have some familiarity with the Parsec library in Haskell, the parser-combinators approach came to my mind first. But I think perhaps I need a weaker variant of it since I also want better performance than the exponential performance of parser-combinators.

After some search, I find that the project tree-sitter does a similar thing already, and seems to be quite developped and mature. Thus I face the question of whether to build on top of the tree-sitter project. However, the main purpose of this project is educational, so I think I will experiment with approaches that come to my mind, without worries about re-inventing wheels then.

More parser algorithms to consider

Now I learnt more algorithms for parsing. Some promising-looking ones are listed below.

Earley parser
Parser combinators
Pakrat algorithm thesis
Pratt algorithm link to the paper on github
Shift-Reduce
LR algorithms
CYK parser