Rope Science – Computer science concepts behind the Xi editor
Interesting! I find it amazing that they went all the way into implementing a CDRT to support efficient plugins.
At the moment I am also writing a text-editor with my partner, to show-case the C++ RRB-Trees implementations in Immer [1]. We are just started, but my plan is to stop at 1000k lines. Interestingly, with such persistent data structure, you go already a long way in implementing undo, parallel processing, and many editing algorithms are super simple thanks to slicing/concat. Memory consumption is still sub-optimal (also I am using wchar_t to simplify my life), but I am very satisfied with the results so far (it is the fastest editor I have on my machine when editing 1GB file, also can edit a ~100MB file on a Raspberry --- larger files fail only due to excessive memory use :/).
I love to geek out on this sort of stuff. Data Structures for Text Sequences by Charles Crowley [1] is a great read. I would also recommend checking out the Vis editor [2]. It's an interesting Vi like editor that uses the piece chain as its data structure and supports Sam's structural regular expressions.
For those interested in reading more about this idea, the term is "monoid cached trees". The Fenwick tree is a particular well-known special case.
(I looked into monoid cached trees for float placement in CSS at one point in an effort to come up with an O(n log n) algorithm. I succeeded, but the constant factor was so high that it wasn't worth the price. I ended up switching to an algorithm based on splay trees that, while O(n^2) in the worst case, ended up being O(n) with a small constant factor on real-world pages.)
Reading the fourth entry on parenthesis matching made me wonder whether one could store, in the monoid, partials views into the table that is generated by CYK parsing: https://en.wikipedia.org/wiki/CYK_algorithm
I love the idea of using monoids like they're described in the blog series, but the examples suggest that there's a certain amount of non-generalizable cleverness that goes into defining each monoid. Could you do CYK subtables inside the monoid, so that people can define arbitrary CF grammars, as long as they're in Chomsky normal form?
I first discovered ropes - aka cords - from Boehm's library, alongside the conservative collector.
IIRC, Internet Explorer used a binary tree to represent its strings, at least at version 5 in the early 2000s, because of the inefficiency of doing lots of copying for string operations - looped concatenation was one of the primary drivers. That doesn't mean it went all the way to ropes, of course.
Please read this with a grain of salt as it does not seem practical or necessary. It seems like the kind of thing written by a young person who is excited but doesn't really have much experience. Most of the ideas would not be real-world-useful as stated.
Excitement is nice to feel, but it takes some experience to know when excitement is really aimed in a productive direction. Otherwise we end up with the kind of motivation that so often produces over-complex and mis-aimed software: having a "cool idea" for "exciting technology" and then looking for places to apply it, and the applications don't really fit or don't really work, but we don't want to notice that, so we don't.
To pull examples: an entire one of these essays is on "paren matching" and how it would be really great if you monoidized (ugh) and parallelized that ... the basic idea of which is instantly shot down by the fact that language grammars are just more complicated than counting individual characters. Hey bro, what if there is a big comment in the middle of your file that has some parens in it? The author didn't even think of this, and relegates this to a comment at the end of that particular essay: "Jonathan Tomer pointed out that real parsing is much more interesting than just paren matching." Which is a short way of saying "this entire essay is not going to work so you probably shouldn't read it, but I won't tell you that until the bottom of the page, and even then I will only slyly allude to that fact." Which in itself is contemptuous of the reader -- it is the kind of thing that happens when you are excited enough about your ideas that the question of whether they are correct is eclipsed. This leads to bad work.
There's the essay about the scrollbar -- if you have a 100k-line text file, do you really want a really long line somewhere in the middle to cause the scrollbar to be narrow and tweakyin the shorter, well-behaved majority of the file? No, you probably don't! But this shoots down the idea that you might want to do a big parallel thing to figure out line length, so he declines to think about it. In reality what you probably want is the scrollbar to be sized based on a smooth sliding window that is slightly bigger than what appears on the screen (but not too much).
Besides which, computers are SO FAST that if you just program them in a straightforward way, and don't do any of the modern software engineering stuff that makes programs slow, then your editor is going to react instantly for all reasonable editing tasks.
I don't want to be too overly critical and negative -- these sorts of thoughts are fine if they are your private notes and are thinking about technical problems and asking friends for feedback. It becomes different when you post them to Hacker News and/or the rest of the internet, because this contains an implicit claim that these are worth many readers' time. But in order to be worth many readers' time, much more thought would have had to go in ... and as a result, the ideas would have changed substantially from what they are now.
I didn't read past essay 4, so if it gets more applicable to reality after that I don't know!
Is there any comprehensive documentation on the rope datastructure?
This write up is really good! Are there similar write-ups for things other than text editors?
I wonder if only wanted a printable character ASCII editor would simplify things a lot or only a little. And I guess no tabs.
> Part 2 Line breaking
I don't really understand the problem here. Can't we count the line breaks like anything else? Is it because that's not the values we want in the end?
> Part 4: Again, making this into a monoid is pretty easy. You store two copies of the (t, m) pair - one for the simple case, and one for the case where the beginning of the string is in a comment. You also keep two bits to keep track of whether the string ends or begins a comment. In principle, you have to do the computation twice for both cases, whether the first line is a comment or not, but in practice it doesn’t make the computation any more expensive: you compute (t, m) for the first line and for the rest of the string, and just store both the first value and the monoid sum.
What if a node of the rope contains an "end comment" and (later) a "("? What should the two pairs of (t, m) be? Now that substring might be entirely inside, outside or partially outside and inside a comment.
Although I do understand the general idea of computing the result for all possible initial/input state to achieve paralellism.
For those interested in following the project or participating in discussions, there's a subreddit over at https://www.reddit.com/r/xi_editor/
The text editor that I use for everything is one that I wrote myself a decade ago in 5K lines of C, based on gap buffers. Save the "advanced computer science" for the problems that need it.
Would it be possible to add in-memory LZ4 compression in xi-editor? For those huge log, XML, csv, etc. files?
Maybe it'd still be possible to maintain good response time while enjoying 4-10x memory savings.
I opened two of the rope documents and I don't even get the problems they try to solve. How can I decide whether these problems are mine as well?
Sure my text editors aren't perfect, but they mostly get the job done, so any editor coming along needs to show that it tries to solve a problem that the user has. I'm not yet convinced this one does, so I probably will never find out what makes it brilliant.
undefined
In case you're wondering what's going on, this is the list of posts: https://github.com/google/xi-editor/tree/master/doc/rope_sci...