Friday, February 12, 2010

The Truth About Joins

Whenever I get the opportunity to talk to my peers about the coolness of App Engine it always pains me to have to say "...but you can't do joins."  I know how useful joins are because I spent many many years using them before I started working on App Engine.  Lately I've been thinking that there has to be a way to make it work.  The App Engine datastore offers a guarantee that query performance scales with the size of the result set, not the size of the data set, so a query that returns 100 results should take the same amount of time whether you have a thousand rows or a million.  That means we can't compute a full cross-product at time-of-query because that would require looking at all the data in the n tables we're joining, and if we look at all the data our performance is now tied to the size of the data set.  Still, it feels like we ought to be able to do something here.

What if we set our sights a little bit lower?  Sure, n-way joins with arbitrary filters and sort orders would be ideal, but isn't something better than nothing in this area?  It turns out that with some additional restrictions in place it is possible to execute joins where the performance scales with the size of the result set, and as of the recently released SDK 1.3.1 this feature, which I'm calling "simple joins," is available to beta testers.

Now, before you start licking your chops let me give you all the bad news (there's plenty of it).  First, simple joins are only available to JPA and JDO users.  Why?  Because JPA and JDO already have an established API for joins and API design is really really hard.  I try to avoid it whenever possible.  Second, simple joins only support equality filters.  Ouch, right?  Third, simple joins can only be sorted by the property on which you're joining.  Double ouch.  Fourth, if you're using owned relationships, in order to use simple joins you're going to need to upgrade your storage version and migrate existing data.   This post is already long enough so I'm just going to focus on simple joins across unowned relationships. I'll cover storage versions, data migration, and joining across owned relationships in my next post, I promise.

I know that was a lot of bad news, but my theory is that even with all that bad news, simple joins are sufficiently useful.  I'm making them available to try and prove that theory. Ready to try it out?

Let's model a college.  We have Students, Courses, and Dorms.  Every Student is associated with zero or one Dorm and zero or more Courses.

JPA:
@Entity
public class Student {
  @Id
  @GeneratedValue(strategy = GenerationType.IDENTITY)
  private Long id;

  private String year;

  @Basic
  private List<Key> courses = new ArrayList<Key>();

  @OneToMany
  @Column(name = "courses", insertable = false, updatable = false)
  private List<Course> coursesAlias;

  @Basic
  private Key dorm;

  @OneToOne(fetch = FetchType.LAZY)
  @Column(name = "dorm", insertable = false, updatable = false)
  private Dorm dormAlias;
}

@Entity
public class Course {
  @Id
  @GeneratedValue(strategy = GenerationType.IDENTITY)
  private Key key;

  private String department;
}

@Entity
public class Dorm {
  @Id
  @GeneratedValue(strategy = GenerationType.IDENTITY)
  private Key key;

  private String campus;
}

Yes, there's something funny (and by funny I mean hacky) going on with the courseAlias and dormAlias fields.  The JPQL compiler won't let you join on a Key or a List<Key> field because these aren't OneToOne or OneToMany fields.  To work around this I've created a read-only alias of the appropriate type.  Your application code should always read from and write to the 'courses' and 'dorm' fields while your join queries should read from and write to the 'coursesAlias' and 'dormAlias' fields.  Once we have proper support for unowned relationships all this nonsense will go away.

WARNING: There's a bug that causes aliased fields to be written to the datastore with incorrect names. Make sure the names of your aliased fields don't end in "Key" or else your join won't return any results.  In the above example you could trigger this bug by renaming Student.dorm to Student.dormKey.

JDO:
@PersistenceCapable(detachable = "true")
public class Student {
  @PrimaryKey
  @Persistent(valueStrategy = IdGeneratorStrategy.IDENTITY)
  private Long id;

  private String year;

  @Element(dependent = "true")
  private List<Key> courses = new ArrayList<Key>();

  @Persistent
  private Dorm dorm;
}

@PersistenceCapable(detachable = "true")
public class Course {
  @PrimaryKey
  @Persistent(valueStrategy = IdGeneratorStrategy.IDENTITY)
  private Key key;

  private String department;
}

@PersistenceCapable(detachable = "true")
public class Dorm {
  @PrimaryKey
  @Persistent(valueStrategy = IdGeneratorStrategy.IDENTITY)
  private Key key;

  private String campus;
}
---------------------------------------
Here's a query that returns all Juniors enrolled in a Biology course:
JPA:
Query q = em.createQuery("select from " + Student.class.getName() 
  + " s JOIN s.coursesAlias c where "
  + "c.department = 'Biology' and "
  + "s.year = 'Junior'");

JDO:
Query q = pm.newQuery("select from " + Student.class.getName() 
  + " where "
  + "courses.contains(c) && c.department == 'Biology' && "
  + "year == 'Junior'");
q.declareVariables(Course.class.getName() + " c");

It's definitely strange to be referring to non-existent properties of a Key field and a List<Key> field, but since proper unowned relationships are not yet supported I eased up some of the JDOQL compiler's type checking to let this go through (sorry Andy, I know you hate this).  Non-standard!
---------------------------------------

And here's a query that returns all Juniors who live in a dorm on the North campus of the school:
JPA:
Query q = em.createQuery("select from " + Student.class.getName() 
  + " s JOIN s.dormAlias d where "
  + "d.campus = 'North' and "
  + "s.year = 'Junior'");

JDO:
Query q = pm.newQuery("select from " + Student.class.getName() + " where "
  + "dorm == d && d.campus == 'North' && "
  + "year == 'Junior'");
q.declareVariables(Dorm.class.getName() + " d");
Again, I'm taking some liberties with standard JDOQL here.
---------------------------------------
These examples should be enough to get you started.  Currently, only 2-way joins are supported but there's no reason it can't be extended to n-way.  As for outer joins, arbitrary sorts, and inequality filters, my colleagues and I are still thinking hard about it.

Stay tuned for more posts in which I'll explain how to make this work for owned relationships and the underlying algorithm involved.  Give simple joins a try and let me know how it goes!