Wednesday, November 20, 2013

Integrating SQLAlchemy into Django

In my previous post I described some of the limitations that I have hit while using Django's ORM for our RESTful API here at HireVue.  In this post I will describe how I integrated SQLAlchemy into our Django app for read-only (GET) requests, to handle the queries that Django doesn't allow.

Let's say we are running with a single database instance, so our django settings look something like

DATABASES = {
    'default' : {
        'ENGINE' : 'django.db.backends.postgresql_psycopg2',
        'NAME' : 'mydatabase',
        'USER' : 'mydatabaseuser',
        'PASSWORD' : 'mypassword',
        'HOST' : '127.0.0.1',
        'PORT' : '5432',
    }
}

We will continue to let Django handle all of the connection management, so when SQLAlchemy needs a database connection, we want it to simply reuse the current one.  The function that returns this connection looks like

# custom connection factory, so we can share with django
def get_conn():
    from django.db import connections
    conn = connections['default']
    return conn.connection
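
One gotcha with this function: if nothing on the current request has touched the ORM yet, Django may not have opened a connection, and conn.connection will be None.  A hedged variant (assuming Django's standard connection wrapper, where asking for a cursor forces the connection open) looks like

# variant that makes sure Django has actually opened the connection
def get_conn():
    from django.db import connections
    conn = connections['default']
    if conn.connection is None:
        conn.cursor()  # forces Django to open the underlying psycopg2 connection
    return conn.connection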

Now we want SQLAlchemy to call get_conn whenever it needs a new connection.  In addition, we need to keep SQLAlchemy from trying to pool the connection, clean up after itself, etc.  We essentially need it to do absolutely no connection handling.  To accomplish this, we create the SQLAlchemy engine with a custom connection pool that looks like

from sqlalchemy.pool import NullPool

# custom connection pool that doesn't close connections, and uses our
# custom connection factory
class SharingPool(NullPool):
    def __init__(self, *args, **kwargs):
        NullPool.__init__(self, get_conn, reset_on_return=False,
                          *args, **kwargs)

    def status(self):
        return 'Sharing Pool'

    def _do_return_conn(self, conn):
        pass

    def _do_get(self):
        return self._create_connection()

    def _close_connection(self, connection):
        pass

    def recreate(self):
        # __init__ already supplies the connection creator and
        # reset_on_return, so only the remaining settings are passed through
        return self.__class__(recycle=self._recycle,
                              echo=self.echo,
                              logging_name=self._orig_logging_name,
                              use_threadlocal=self._use_threadlocal,
                              _dispatch=self.dispatch,
                              _dialect=self._dialect)

    def dispose(self):
        pass

The magic is where we pass get_conn to the NullPool constructor, and then we override most of the other methods to do nothing.  This allows SQLAlchemy to borrow the connection that Django is managing, without interfering.

Now we can create an engine with this pool like so:

from sqlalchemy import create_engine

# create engine using our custom pool and connection creation logic
engine = create_engine(db_url, poolclass=SharingPool)

Since we want to stay DRY, we build the db_url passed to create_engine from Django's settings:

db_url = 'postgresql+psycopg2://{0}:{1}@{2}:{3}/{4}'.format(
            settings.DATABASES['default']['USER'],
            settings.DATABASES['default']['PASSWORD'],
            settings.DATABASES['default']['HOST'],
            settings.DATABASES['default']['PORT'],
            settings.DATABASES['default']['NAME'])
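
As an aside, if the password ever contains characters that need escaping, the format string gets fragile; SQLAlchemy can build the URL object for you instead.  A sketch, assuming the pre-1.4 URL constructor (newer releases use URL.create):

from sqlalchemy.engine.url import URL

db = settings.DATABASES['default']
db_url = URL('postgresql+psycopg2',
             username=db['USER'],
             password=db['PASSWORD'],
             host=db['HOST'],
             port=db['PORT'],
             database=db['NAME'])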

We also don't want to maintain table definitions separate from our Django models, so we use SQLAlchemy reflection to build the metadata:

from sqlalchemy import MetaData

# inspect the db metadata to build tables
meta = MetaData(bind=engine)
meta.reflect()

This is basically it!  Now you can get the tables you would like to query from the meta object, and create and run selects.  They will run on the connection provided by Django.  Don't try to do any transaction handling or it may mess up Django.  

Also, you'll want to make sure you reflect() on startup, so that tables are ready to go when you need them.
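
To make that concrete, here is a minimal sketch of a read-only query against a reflected table.  The table and column names are made up for the example, and select([...]) is the list-style API from the SQLAlchemy releases of this era:

from sqlalchemy.sql import select

# the reflected table name depends on your app and model;
# 'interviews_interview' is just a placeholder
interview = meta.tables['interviews_interview']

stmt = select([interview.c.id, interview.c.title]).where(interview.c.id == 42)

conn = engine.connect()               # hands back the connection Django manages
rows = conn.execute(stmt).fetchall()
conn.close()                          # harmless: SharingPool never really closes it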

Monday, October 28, 2013

Limitations of Django's ORM

I'm currently working at HireVue, where we have a public RESTful API that we consume for our website. I have done quite a bit of work on the API, and in the process, have run into some of the limitations of the Django ORM.

Difficulty controlling joins

Let's say I have an Interview table for storing information about candidate interviews, an Evaluation table for storing interviewer evaluations, and a User table for information on the interviewers/evaluators.  The Evaluation table is a many-to-many connecting Users and Interviews, with evaluation information stored in the table as well.


Now I want to write a query that returns interviews that a specific user has evaluated highly.  I try something like this:

Interview.objects.filter(evaluation__rating__gt=5, evaluation__user__username='specificuser')

I cross my fingers and hope that Django only joins to the evaluation table once, and uses that same join for both filter conditions.  It seems to work in the cases I have tried, but there is no way for me to tell Django for sure that is what I want.
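
One thing that does help, although it only lets me verify after the fact rather than control the join, is printing the SQL Django generated for the queryset:

qs = Interview.objects.filter(evaluation__rating__gt=5,
                              evaluation__user__username='specificuser')
print(qs.query)   # str() of .query is the SQL Django will run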

On the other hand, let's say I want to write a query that returns interviews that a specific user has evaluated low, but others have evaluated high.  How do I tell Django I need two different joins?

Interview.objects.filter(evaluation__rating__lt=3, evaluation__user__username='specificuser') \
        .filter(evaluation__rating__gt=5).distinct()

Will this work?  I have had mixed luck, and find myself banging my head against the Django ORM black box, wishing I could tell it more explicitly what I want.  The Django docs are very reassuring on this matter:
"Conditions in subsequent filter() or exclude() calls that refer to the same relation may end up filtering on different linked objects."

I'm also limited if I am trying to do filtering in different parts of my code, on the same relationship.  Let's say I have some 'permissions' functions that apply a filter so that the request can only see interviews that the current user has evaluated, and then other functions to filter based on search criteria, like whether the evaluation was high or low:

query = Interview.objects.all() 
query = apply_permissions(query) # filter(evaluation__user__username='specificuser')
query = apply_search(query) # filter(evaluation__rating__gt=5)

I don't know of any way to tell Django that the filters applied by these different methods should use the same joined table, which makes it difficult to break code out into logical components like this.
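
One partial workaround I've sketched below (leaning on Django's documented behavior that conditions inside a single filter() call apply to the same related object) is to have the helper functions return Q objects and combine them in one filter() call:

from django.db.models import Q

def permission_q(username):
    # interviews the given user has evaluated
    return Q(evaluation__user__username=username)

def search_q(min_rating):
    # interviews that user rated above the threshold
    return Q(evaluation__rating__gt=min_rating)

# both conditions live in one filter() call, so they land on the same join
query = Interview.objects.filter(permission_q('specificuser') & search_q(5))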

Inability to force outer joins

This is the bigger problem, in my opinion:  I am unaware of any way to tell Django to use an outer join.  Others seem to agree that this is not possible.

What if I'd like to return all interviews and, for the ones the current user has evaluated, include the rating?  This is easily done with a LEFT OUTER JOIN in SQL.  Django, however, does not allow a query like this, leaving us with extra queries and joins in application code.  That may not seem like too big of a pain in most cases, but if you are querying against a large data set and want to sort by the ratings in that left-outer-joined table, things get even more difficult.
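
About the only escape hatch within Django itself is dropping to raw SQL for queries like this, which is exactly the kind of workaround I'd rather not scatter through the code.  A rough sketch, with made-up table and column names:

def interviews_with_my_rating(current_user_id):
    # hypothetical table/column names chosen to match the examples above;
    # rating comes back as NULL for interviews this user has not evaluated
    sql = """
        SELECT i.*, e.rating
        FROM app_interview i
        LEFT OUTER JOIN app_evaluation e
          ON e.interview_id = i.id AND e.user_id = %s
        ORDER BY e.rating DESC
    """
    return Interview.objects.raw(sql, [current_user_id])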

In my experience, these are my biggest complaints with Django's ORM.  It's frustrating to be held back by limitations in the ORM when I know the database can handle the problem easily.  I find myself jumping through hoops or coding around these limitations way too often, and thought I'd share for others considering Django.

I'll follow up this post with another describing how I recently bolted SQLAlchemy onto our existing Django app, to handle the 'tough' queries.

Wednesday, March 9, 2011

Spring Nested Transactions and problems with "Session is closed" Exceptions

I was really pulling my hair out over some code that did programmatic transaction handling using Spring's PlatformTransactionManager on top of Hibernate. It is fairly complicated, with up to 4 different transactions running concurrently. At one point it needs to process 3 different result sets, like this:

TransactionStatus status1 = transactionManager.getTransaction(new DefaultTransactionDefinition(TransactionDefinition.PROPAGATION_REQUIRES_NEW));
PreparedStatement statement1 = sessionFactory.getCurrentSession().connection().prepareStatement(...);

TransactionStatus status2 = transactionManager.getTransaction(new DefaultTransactionDefinition(TransactionDefinition.PROPAGATION_REQUIRES_NEW));
PreparedStatement statement2 = sessionFactory.getCurrentSession().connection().prepareStatement(...);


TransactionStatus status3 = transactionManager.getTransaction(new DefaultTransactionDefinition(TransactionDefinition.PROPAGATION_REQUIRES_NEW));
PreparedStatement statement3 = sessionFactory.getCurrentSession().connection().prepareStatement(...);

... // set parameters on the statements

ResultSet rs1 = statement1.executeQuery();
ResultSet rs2 = statement2.executeQuery();
ResultSet rs3 = statement3.executeQuery();

... // process the result sets

... // close the result sets and statements

transactionManager.commit(status1);
transactionManager.commit(status2);
transactionManager.commit(status3);


This part of the code ran fine the first time through, but the next time it would throw an exception saying "Session is closed" when trying to get the first transaction. Do you see the problem? I didn't see it for way too long. I spent time removing sections of code, staring at logs, and pulling out lots of hair before I finally noticed in the Spring logging that, after finishing that section of code, it was falling back to the wrong session.

Finally it dawned on me that Spring must be using a stack to keep track of the current session. When a transaction is started with PROPAGATION_REQUIRES_NEW, the current session is pushed on the stack, and a new session is created and becomes the current session. When that transaction finishes, the current session is closed, and the previous session is popped off of the stack to resume as the now current session.

The documentation on transaction propagation does talk about "inner" and "outer" transactions, but I don't think it is explicit enough about the nesting relationship. My problem was that I wasn't really considering them to be nested, thinking of them more as simply independent transactions (in my defense, the docs do say that they are "completely independent" transactions). That is why it took me so long to realize my mistake.

My problem was that I was creating transaction 1, then transaction 2, then 3. But I was trying to commit transaction 1 first, then 2, then 3. Doing this messed up the stack, and in the end Spring was left with a session that had been closed as the current session. So the next attempt to use the session would cause an exception to be thrown telling me that the "Session is closed".

Rearranging the code to commit transaction 3 first, then 2, then 1, fixed the problem. I now have a better sense of how Spring works, working code, and less hair. I thought I'd write this up in case it helps someone else avoid this simple mistake.

Wednesday, September 15, 2010

CruiseControl to Hudson

At work we recently switched from using CruiseControl to Hudson for continuous builds.

Originally I was just trying to upgrade CruiseControl so that we could add a plugin to support Mercurial (we also recently switched from Subversion to Mercurial for version control). We were on a fairly old version of CruiseControl. After upgrading, we had lots of headaches with the web interface freezing up, I had to write a Mercurial label incrementer plugin, and I didn't really like the new web interface anyway. I finally got frustrated enough to try something new.

I threw Hudson on the machine, and had builds up and running so much more quickly than in CruiseControl that I was sold almost immediately. Some of the differences that I really liked were:

* No editing XML. All configuration in Hudson can be done through the web interface. It's also easy to create build configs for new clones/branches. When you create a new Job, you can create it based on an existing Job, and then just change a few names and paths.
* Plugins are listed and installed through the web interface. This was probably the single best part of Hudson compared to CruiseControl. With CruiseControl, I was left searching for plugins to see what existed, and going through the hassle of researching and installing plugins to try to get things working. Hudson lets you see all the available plugins in one place. That was really handy.
* Hudson automatically detects if test cases fail, and will mark the build as "unstable". It's also really good at showing you the test case failures within the Hudson web interface.
* More features. I could probably get all of these things in CruiseControl, but it's so hard to find plugins that I didn't try very hard. In Hudson it was easy to get it to host/expose my javadocs. It's easy to link to the most recent artifacts with static URLs. It's also possible to have "slave" build machines so that long builds don't hold everything up.
* The email notifications are better. I only get emailed the first time a build breaks, and not for every subsequent failure. It also emails me when the build is fixed.

Now I'm guessing that I could probably get CruiseControl to do most or maybe all of those things, but when Hudson makes it so easy, why try to figure CruiseControl out?

The only trick I had to work out was getting Hudson to show the build revision. We use build revisions in our bug tracking system, so that developers can let testers know what revision a bug fix was made in. The solution I found was to use the "Hudson Description Setter Plugin". I installed the plugin in Hudson, then modified my build script to output the Mercurial revision. I created an Ant target (note that the <if>/<then>/<else> tasks below come from ant-contrib, not core Ant) that I run as a dependency of my build target:


<target name="revision" description="Stores the latest revision number in revision property">
   <echo message="os: ${os.name}"/>
   <condition property="unix">
      <os family="unix"/>
   </condition>
   <echo message="is unix: ${unix}"/>
   <if>
      <equals arg1="${unix}" arg2="true"/>
      <then>
         <exec executable="sh" outputproperty="revision" errorproperty="revision-error" dir="${basedir}">
            <arg value="-c"/>
            <arg line="hg identify -n | tr -d '+'"/>
         </exec>
      </then>
      <else>
         <property name="revision" value="unknown"/>

      </else>
   </if>
   <echo message="rev: ${revision}"/>
</target>



Then I activated the plugin in my build configuration in Hudson, under "Post-build Actions" (the checkbox named "Set build description"), and set the regular expression to "rev: (.*)" and the Description to "[version] \1". All subsequent builds will have a description with the Mercurial revision in it (e.g. "[version] 12345").

Tuesday, October 27, 2009

Storing a tree structure in a database

NOTE: I fixed a bug in the node update trigger on 16 August 2011.

I think that one of the tougher problems in database design is how to store tree data (arbitrary-depth parent-child relationships, where a child has at most one parent).

The two most common approaches are the Adjacency List model, and the Nested Set model. Both are explained and compared here. This forum post also has some good links to information on the two models.

In my opinion the major advantage and disadvantage of each are:

Adjacency Lists:
  • Major advantage:  Simple and easy to understand.
  • Major disadvantage:  It takes multiple queries to find all ancestors or all descendants of a node.
Nested Sets:
  • Major advantage:  A single query can find all ancestors or all descendants of a node.
  • Major disadvantage:  Modifying the tree structure affects half the nodes in the tree, on average.

What if you have a large tree (hundreds of millions of nodes), with fairly frequent changes to the tree structure, and you need to be able to run queries that access all ancestors or descendants of a node?  Since the nested set model would make frequent modifications to the hierarchy very difficult, the other option is to denormalize the adjacency list model so that we can query for ancestors and descendants of a node.

Let's say we have a simple node table:

CREATE TABLE node (
 node_id INTEGER PRIMARY KEY AUTO_INCREMENT,
 parent_id INTEGER,

 CONSTRAINT fk_node__parent FOREIGN KEY (parent_id) REFERENCES node (node_id) ON DELETE CASCADE
) ENGINE=InnoDB;


Each node has at most one parent.  Root nodes have a parent of null.  Now we create an ancestor list, which is our denormalization table:

CREATE TABLE node_ancestry_link (
 node_id INTEGER NOT NULL,
 ancestor_id INTEGER NOT NULL,

 PRIMARY KEY(node_id, ancestor_id),
 INDEX ix_node_anc__anc_node (ancestor_id, node_id),
 CONSTRAINT fk_node_anc__node FOREIGN KEY (node_id) REFERENCES node(node_id) ON DELETE CASCADE,
 CONSTRAINT fk_node_anc__anc FOREIGN KEY (ancestor_id) REFERENCES node(node_id) ON DELETE CASCADE
) ENGINE=InnoDB;


Each node has an entry in the ancestry table for each of its ancestors.  This table will grow much more quickly than the node table, especially for deep trees.  This denormalization allows us to write queries to get all ancestors of a node:

SELECT l.ancestor_id FROM node n
JOIN node_ancestry_link l on n.node_id = l.node_id
WHERE n.node_id = :nodeid;


and all descendants of a node:

SELECT l.node_id FROM node n
JOIN node_ancestry_link l on n.node_id = l.ancestor_id
WHERE n.node_id = :nodeid;
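
From application code, the same descendants query is a one-liner against the link table.  A sketch, assuming a MySQL DB-API connection (MySQLdb-style %s placeholders):

def fetch_descendant_ids(conn, node_id):
    """Return the ids of all descendants of node_id (excluding node_id itself)."""
    cur = conn.cursor()
    cur.execute(
        "SELECT node_id FROM node_ancestry_link "
        "WHERE ancestor_id = %s AND node_id <> %s",
        (node_id, node_id))
    return [row[0] for row in cur.fetchall()]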


Now, to help us keep the ancestry table in sync as changes are made in the node table, we define some triggers:

DELIMITER |

CREATE TRIGGER tr_node_ins AFTER INSERT ON node
FOR EACH ROW
BEGIN
  INSERT INTO node_ancestry_link (node_id, ancestor_id) VALUES (NEW.node_id, NEW.node_id);
  IF NEW.parent_id IS NOT NULL THEN
    INSERT INTO node_ancestry_link (node_id, ancestor_id) SELECT NEW.node_id, l.ancestor_id FROM node_ancestry_link l WHERE l.node_id = NEW.parent_id;
  END IF;
END
|

CREATE TRIGGER tr_node_upd AFTER UPDATE ON node
FOR EACH ROW
BEGIN
  IF NEW.parent_id <> OLD.parent_id OR ((NEW.parent_id IS NULL) <> (OLD.parent_id IS NULL)) THEN
    IF OLD.parent_id IS NOT NULL THEN
      DELETE FROM links USING node_ancestry_link links
        JOIN node_ancestry_link anclinks ON links.ancestor_id = anclinks.ancestor_id
        JOIN node_ancestry_link deslinks ON links.node_id = deslinks.node_id
        WHERE anclinks.node_id = OLD.parent_id
        AND deslinks.ancestor_id = NEW.node_id;
    END IF;
    IF NEW.parent_id IS NOT NULL THEN
      INSERT INTO node_ancestry_link (node_id, ancestor_id)
        SELECT desnodes.node_id, ancnodes.ancestor_id
        FROM node_ancestry_link ancnodes
        CROSS JOIN node_ancestry_link desnodes
        WHERE ancnodes.node_id = NEW.parent_id
        AND desnodes.ancestor_id = NEW.node_id;
    END IF;
  END IF;
END
|

DELIMITER ;


With the triggers in place, we can edit the node hierarchy without having to worry about the ancestry table.  We only need to worry about inserts and updates because of the CASCADE DELETEs on the ancestry table foreign keys.

If you need to know how many descendants a particular node has, you may want to track the descendant count in the node table, since the count queries will be expensive for large trees. 

You could augment the triggers to update the descendant counts when nodes are inserted, updated, or deleted.  This would require adding a delete trigger.  The one gotcha with doing this in MySQL is that you can't depend on cascade deletes when you delete nodes: MySQL has a bug/feature where cascade deletes don't fire delete triggers.  For keeping descendant counts up to date this isn't a problem, as long as you take it into account: when a node is deleted, subtract its descendant count plus one (for the node itself) from each of its ancestors.

Tuesday, September 1, 2009

re-dispatching events in Flash

I recently ran into a head-scratcher, and couldn't find any help online, so I thought I'd post my solution.

I had a custom flex component, ZoomControl:

<mx:VBox>
  <mx:Metadata>
    [Event(name="zoomChanged", type="ZoomEvent")]
  </mx:Metadata>
...
</mx:VBox>

ZoomControl can dispatch a "zoomChanged" event, of a custom class, ZoomEvent. I then created a Toolbar custom component that contained the ZoomControl:

<mx:VBox>
  <mx:Metadata>
    [Event(name="zoomChanged", type="ZoomEvent")]
  </mx:Metadata>
...
  <ZoomControl
    zoomChanged="dispatchEvent(event)"
  />
</mx:VBox>

When the ZoomControl dispatches a ZoomEvent, I want my Toolbar to re-dispatch the event. Seems simple, right? But when the Toolbar calls dispatchEvent, it throws an exception:

TypeError: Error #1034: Type Coercion failed: cannot convert flash.events::Event@19ea2df1 to footnote.imageviewer.events.ZooomEvent.
at flash.events::EventDispatcher/dispatchEventFunction()
at flash.events::EventDispatcher/dispatchEvent()
at mx.core::UIComponent/dispatchEvent()[E:\dev\3.0.x\frameworks\projects\framework\src\mx\core\UIComponent.as:9051]
at Toolbar/__zoomControl_zoomChanged()...

After some wasted time trying to make sure that it really was a ZoomEvent being passed to dispatchEvent, I finally took time to read the documentation on UIComponent.dispatchEvent:

"If the event is being redispatched, a clone of the event is created automatically. After an event is dispatched, its target property cannot be changed, so you must create a new copy of the event for redispatching to work."

My problem was that I needed to override the clone method in my ZoomEvent. The dispatchEvent method was calling the base Event.clone(), which was returning an Event, of course. Overriding the clone method to return a ZoomEvent solved the problem.

Friday, June 12, 2009

SuperDuper "Smart Update" doesn't stay smart

I use SuperDuper for backups on my work machine (MacBook Pro), and I have it set up to backup daily to an external drive using "Smart Update", which is supposed to be fast and only copy things that have changed. I really have no idea how it works, nor do I care, as long as it's working.

The problem is that it had started to take longer and longer to run. It had reached the point where it was taking 2 hours or more to finish (my drive is 120G and I don't back all of it up). I found very little help searching Google, so I asked our sysadmin, who also uses SuperDuper, if he had seen the same problem. He said "I don't know, I only do full backups once a week".

That made me think that maybe SuperDuper just doesn't handle backing up repeatedly using Smart Update only. So I revised my schedule to do a full backup once a week, and Smart Updates daily. Success! My daily backups are back down to 15 minutes or less! I thought I'd post this in case it helps someone else.