rten-brel: An indexing program
2021-08-21
This article talks about the program rten-brel. It is a program to index documents. Since it is not finished yet, this article will be updated once in a while.
Disclaimer
This is still a work in progress, so forgive all those errors here and there. ;P
Source
The source codes can be found in the git repository.
Rationale
About a year ago, I found that the built-in Emacs package org-agenda
is unsatisfatory for my needs. I know there are various packages out
there that do all sorts of things, but what I want is not limited to
Org-mode, which those package assume to be the case. Specifically, I
would like to use various formats of documents as I see fit: per
chance a mixture of LaTeX-mode
and Org-mode
, or a simple format
for recording my notes, without all the bells and whistles that
Org-mode
provides.
So I wrote a simple package in Emacs Lisp. But it turned out that the
performance is terrible: even with only a couple of small to medium
files, it took some seconds to do the job. This is quite
unacceptable, since in my vision this might be applied to hundreds of
files, and should still complete within a second or two. Hence I
started trying to write this in C
.
Features
With all that said, what is this program expected to do?
It is expected to do only one thing: indexing. Given a set of regular expressions, find the positions in a document that these regular expressions match.
Unicode support
The default encoding is assumed to be UTF-8. Other two encodings (UTF-16 and UTF-32) are also (supposed to be) supported.
Regular expression
This program has its own regular expression parser and matcher. And
in my plan the syntax should be compatible with multiple formats:
either the Emacs syntax, the Emacs rx
syntax, or the grep syntax,
should be supported.
However, this won't support back-references. That is, such regular expressions as "\\(cat\\|dog\\)\\1" are not going to be supported. This is mainly to avoid the potential exponential time problem such regular expressions bring. Also, I don't use this feature myself. :D
Still that is a potentially useful feature, so I plan to add some sort of Parsing Expression Grammar support in the future, which should be more than enough. But right now this remains a dream.
Query syntax
I tried with various options, and found that I like the "query algebra" approach (citation needed).
Rewrite
Actually, I finished the first version of this program about a month ago. But then I encountered some weird bugs that I could not debug. So I thought I should rewrite from scratch to work on the correct ground. Unfortunately I got occupied by other things in the meantime, so this is still unfinished.
More features
While rewriting, I think that I shall also add more features, or improve the performance somehow. So I plan to find a way to compactly run regular expression matchers.
NFA & DFA
Advantages
The original implementation was written using the virtual machine approach, as discussed in this series of articles (the third article discusses the virtual machine approach).
Now I think the deterministic finite automata approach should be the most efficient way to do it. This is partially inspired by finding an efficient DFA-based UTF-8 parser. See Bjoern Hoehrmann's article for details.
My belief is that a deterministic finite automaton can be simply written down as an array of numbers, and then the state transitions are done by simple table lookups. I think this is the ideal approach for my program.
Moreover, with this approach, I can merge the parsing of UTF-8 with the matching of regular expressions. And we can represent Unicode character sets / character classes as DFAs as well, whic further increases the flexibility of the design in my opinion.
Challenges
But I have to find a way to convert a regular expression into a DFA efficiently first. This must include the functionality of tracking submatches, since the essential features of the program rely on that.
Grammars and Parse Combinators
After some time, I now think that what I need is not a regular expression, but a formal grammar. To be precise, it would be better to be able to define the structures of documents by some grammar at least as powerful as context-free grammars.
Since I have some familiarity with the Parsec library in Haskell, the parser-combinators approach came to my mind first. But I think perhaps I need a weaker variant of it since I also want better performance than the exponential performance of parser-combinators.
After some search, I find that the project tree-sitter does a similar thing already, and seems to be quite developped and mature. Thus I face the question of whether to build on top of the tree-sitter project. However, the main purpose of this project is educational, so I think I will experiment with approaches that come to my mind, without worries about re-inventing wheels then.
More parser algorithms to consider
Now I learnt more algorithms for parsing. Some promising-looking ones are listed below.
- Earley parser
- Parser combinators
- Pakrat algorithm thesis
- Pratt algorithm link to the paper on github
- Shift-Reduce
- LR algorithms
- CYK parser