I just added a Closure Table to Publications (posts and comments).
What is a Closure Table?
Closure Tables are one way of managing Trees in SQL. SQL doesn't represent trees well. The ideal 'theory' way is using recursive queries with an adjacency list. We're using PostgreSQL, and it supports recursive queries, but in our testing they were unusably slow. As in 10–30 seconds to query all the comments on a post.
There are many alternative, denormalised ways to represent trees. Closure Tables are one of them. Essentially, a Closure Table has a row for every ancestor and descendant of any entry. For example, if we have a comment chain 1 → 2 → 3, the Closure table has ancestor–descendant entries (1 2), (2 3), (1 3). You can see how this is duplicate data: why do we need (1 3) if we already have the data that 2 is 3's parent, and 1 is 2's parent?
Why Closure Tables?
Because they're fast. Querying the Closure Table for all the comments on a 100-comment post takes about 2.5 milliseconds.
But you keep ranting about duplicate data in the old Arc code being bane of your existence?
There are exceptions to every rule. It turns out, duplicate data, nightmarish as it is, is only a problem if your data is mutable. It will never make sense to reassign a comment to a different parent.
Right now, the Closure Table has been populated with all existing publications, and is being updated with each new comment and post, but it isn't being used yet. But it will be. This is a big step toward making it fast to directly query the SQL, and getting data out of memory. Which is one of our biggest goals, to both reduce hosting costs and make the site more stable.
PS If you were following the livecoding Saturday, this is what I was working on.
I was gutting a cottage in the UP of MI this weekend. This was a nice surprise to find upon my return. I wish I could have seen the livecoding. Watching mk live code would probably freak everyone out. I suppose I can scrap the "change the parent of your comment" feature that I was planning.
I've always cheated and designed tables that hold tree like data to include a level number and the parent in the table itself rather than a separate table. You can cheat and take a sneaky look at the data without joins! Take that Codd! It adds some space overhead but makes querying so much easier. And gives easy indexing options to increase performance.
That's another way to do trees, yeah. If you mean the direct parent, that's an Adjacency List, and it requires a bunch of queries (or a slow recursive query). Hubski currently does that. If you mean the root post of all comments, yeah, that's another way, but it's less powerful. E.g. we have a way to display only one comment and it's children. Storing only the root would require querying all comments on a post, even when we only need a subtree.