Tuesday, October 27, 2009

Storing a tree structure in a database

NOTE: I fixed a bug in the node update trigger on 16 August 2011.

I think that one of the more tough problems in database design is how to store tree data (arbitrary depth parent-child relationships, where a child has at most one parent).

The two most common approaches are the Adjacency List model, and the Nested Set model. Both are explained and compared here. This forum post also has some good links to information on the two models.

In my opinion the major advantage and disadvantage of each are:

Adjacency Lists:
  • Major advantage:  Simple and easy to understand.
  • Major disadvantage:  It takes multiple queries to find all ancestors or all descendants of a node.
Nested Sets:
  • Major advantage:  A single query can find all ancestors or all descendants of a node.
  • Major disadvantage:  Modifying the tree structure affects half the nodes in the tree, on average.
What if you have a large tree (hundreds of millions of nodes), with fairly frequent changes to the tree structure, and you need to be able to run queries that access all ancestors or descendants of a node?  Since the nested set model would make it very difficult to make frequent modifications to the hierarchy, the other option is to do some denormalization of the adjacency list model so that we can query for ancestors and descendants of a node.

Let's say we have a simple node table:

CREATE TABLE node (
 node_id INTEGER PRIMARY KEY AUTO_INCREMENT,
 parent_id INTEGER,

 CONSTRAINT fk_node__parent FOREIGN KEY (parent_id) REFERENCES node (node_id) ON DELETE CASCADE
) ENGINE=InnoDB;


Each node has at most one parent.  Root nodes have a parent of null.  Now we create an ancestor list, which is our denormalization table:

CREATE TABLE node_ancestry_link (
 node_id INTEGER UNSIGNED NOT NULL,
 ancestor_id INTEGER UNSIGNED NOT NULL,

 PRIMARY KEY(node_id, ancestor_id),
 INDEX ix_node_anc__anc_node (ancestor_id, node_id),
 CONSTRAINT fk_node_anc__node FOREIGN KEY (node_id) REFERENCES node(node_id) ON DELETE CASCADE,
 CONSTRAINT fk_node_anc__anc FOREIGN KEY (ancestor_id) REFERENCES node(node_id) ON DELETE CASCADE
) ENGINE=InnoDB;


Each node has an entry in the ancestry table for each of its ancestors.  This table will grow much more quickly than the node table, especially for deep trees.  This denormalization allows us to write queries to get all ancestors of a node:

SELECT l.ancestor_id FROM node n
JOIN node_ancestry_link l on n.node_id = l.node_id
WHERE n.node_id = :nodeid;


and all descendants of a node:

SELECT l.ancestor_id FROM node n
JOIN node_ancestry_link l on n.node_id = l.ancestor_id
WHERE n.node_id = :nodeid;


Now, to help us keep the ancestry table in sync as changes are made in the node table, we define some triggers:

DELIMITER |

CREATE TRIGGER tr_node_ins AFTER INSERT ON node
FOR EACH ROW
BEGIN
  INSERT INTO node_ancestry_link (node_id, ancestor_id) VALUES (NEW.node_id, NEW.node_id);
  IF NEW.parent_id IS NOT NULL THEN
    INSERT INTO node_ancestry_link (node_id, ancestor_id) SELECT NEW.node_id, l.ancestor_id FROM node_ancestry_link l WHERE l.node_id = NEW.parent_id;
  END IF;
END
|

CREATE TRIGGER tr_node_upd AFTER UPDATE ON node
FOR EACH ROW
BEGIN
  IF NEW.parent_id <> OLD.parent_id OR ((NEW.parent_id IS NULL) <> (OLD.parent_id IS NULL)) THEN
    IF OLD.parent_id IS NOT NULL THEN
      DELETE FROM links USING node_ancestry_link links
        JOIN node_ancestry_link anclinks ON links.ancestor_id = anclinks.ancestor_id
        JOIN node_ancestry_link deslinks ON links.node_id = deslinks.node_id
        WHERE anclinks.node_id = OLD.parent_id
        AND deslinks.ancestor_id = NEW.node_id;
    END IF;
    IF NEW.parent_id IS NOT NULL THEN
      INSERT INTO node_ancestry_link (node_id, ancestor_id)
        SELECT desnodes.node_id, ancnodes.ancestor_id
        FROM node_ancestry_link ancnodes
        CROSS JOIN node_ancestry_link desnodes
        WHERE ancnodes.node_id = NEW.parent_id
        AND desnodes.ancestor_id = NEW.node_id;
    END IF;
  END IF;
END
|

DELIMITER ;


With the triggers in place, we can edit the node hierarchy without having to worry about the ancestry table.  We only need to worry about inserts and updates because of the CASCADE DELETEs on the ancestry table foreign keys.

If you need to know how many descendants a particular node has, you may want to track the descendant count in the node table, since the count queries will be expensive for large trees. 

You could augment the triggers to update the descendant counts when nodes are inserted, updated, or deleted.  This would require adding a delete trigger.  The one gotcha with doing this in MySQL is that you can't depend on cascade deletes when you delete nodes.  MySQL has a bug/feature that cascade deletes don't fire delete triggers.  For keeping descendant counts up-to-date this isn't a problem, as long as you take it into account, and when a node is deleted, subtract its descendant count from its ancestors.