Sunday, March 21, 2010

Executing Simple Joins Across Owned Relationships

As promised, today I'm going to explain how to use simple joins with owned relationships. This requires some additional setup because the physical layout of an owned relationship is not the same as the physical layout of an unowned relationship. Specifically, with an unowned relationship, the keys of the child entities are stored in a property on the parent, whereas with an owned relationship the keys of the child entities are not stored in a property on the parent - ownership is instead modeled using the hierarchical structure of the child key.  Simple joins depend on the physical layout we get with unowned relationships, so if we're going to use simple joins with owned relationships we're going to need to change the way we store them.

Fortunately, as of SDK 1.3.1, there is a configuration option that changes the physical layout of owned relationships to match the physical layout of unowned relationships while preserving the hierarchical structure of the child key.  Nice!  You can enable this option by adding the following to your config:

<property name="datanucleus.appengine.storageVersion" value="WRITE_OWNED_CHILD_KEYS_TO_PARENTS"/>

If you're using JPA, add this to persistence.xml. If you're using JDO, add this to jdoconfig.xml. If you don't have any existing data in production you're ready to move on to the examples. If you do have existing data there's more work to do: data migration. If you're like me then you get sad, then angry, then hungry, then sad again whenever you see those words so I apologize for just throwing them in your face like that, but if you have existing data there's really no way around it.  See, setting this config option will cause all newly created and updated parents to be written with a property containing child keys, but all your existing data is still sitting there in the old layout. If you want the results of your simple joins to be accurate (and you do want this, right?) you're going to need to migrate all your existing data. I'll explain how to do this in a later post (note that I said 'a later post' not 'the next post' - I may not get to this for awhile).

Let's go back to our college example.  We have Students, Credits, and Honors Thesis.  Every Student has zero or more Credits and zero or one Honors Thesis.

JPA:
@Entity
public class Student {
  @Id
  @GeneratedValue(strategy = GenerationType.IDENTITY)
  private Long id;

  private String year;

  @OneToMany(cascade = CascadeType.ALL)
  private List<Credit> credits;

  @OneToOne(fetch = FetchType.LAZY, cascade = CascadeType.ALL)
  private HonorsThesis thesis;
}

@Entity
public class Credit {
  @Id
  @GeneratedValue(strategy = GenerationType.IDENTITY)
  private Key key;

  private String department;
}

@Entity
public class HonorsThesis {
  @Id
  @GeneratedValue(strategy = GenerationType.IDENTITY)
  private Key key;

  private boolean complete;
}


JDO:
@PersistenceCapable(detachable = "true")
public class Student {
  @PrimaryKey
  @Persistent(valueStrategy = IdGeneratorStrategy.IDENTITY)
  private Long id;

  private String year;

  @Element(dependent = "true")
  private List<Credit> credits;

  @Persistent
  private HonorsThesis thesis;
}

@PersistenceCapable(detachable = "true")
public class Credit {
  @PrimaryKey
  @Persistent(valueStrategy = IdGeneratorStrategy.IDENTITY)
  private Key key;

  private String department;
}

@PersistenceCapable(detachable = "true")
public class HonorsThesis {
  @PrimaryKey
  @Persistent(valueStrategy = IdGeneratorStrategy.IDENTITY)
  private Key key;

  private boolean complete;
}
---------------------------------------
Here's a query that returns all Seniors with at least one credit from the Philosophy department:
JPA:
Query q = em.createQuery("select from " + Student.class.getName() 
  + " s JOIN s.credits c where "
  + "c.department = 'Philosophy' and "
  + "s.year = 'Senior'");

JDO:
Query q = pm.newQuery("select from " + Student.class.getName() 
  + " where "
  + "credits.contains(c) && c.department == 'Philosophy' && "
  + "year == 'Senior'");
q.declareVariables(Credit.class.getName() + " c");

---------------------------------------

And here's a query that returns all Seniors who have completed an Honors Thesis:
JPA:
Query q = em.createQuery("select from " + Student.class.getName() 
  + " s JOIN s.thesis t where "
  + "t.complete = true and "
  + "s.year = 'Senior'");

JDO:
Query q = pm.newQuery("select from " + Student.class.getName() + " where "
  + "thesis == t && t.complete == true && "
  + "year == 'Senior'");
q.declareVariables(HonorsThesis.class.getName() + " t");


---------------------------------------
Unlike the unowned relationship examples we covered in the last post, we're now firmly back in the land of standard JPA and JDO.  Good!  That's where we want to be.  Give it a try and let me know what you think!  In the next post, hopefully less than 6 weeks away this time, I'll explain how simple joins work under the hood.

Friday, February 12, 2010

The Truth About Joins

Whenever I get the opportunity to talk to my peers about the coolness of App Engine it always pains me to have to say "...but you can't do joins."  I know how useful joins are because I spent many many years using them before I started working on App Engine.  Lately I've been thinking that there has to be a way to make it work.  The App Engine datastore offers a guarantee that query performance scales with the size of the result set, not the size of the data set, so a query that returns 100 results should take the same amount of time whether you have a thousand rows or a million.  That means we can't compute a full cross-product at time-of-query because that would require looking at all the data in the n tables we're joining, and if we look at all the data our performance is now tied to the size of the data set.  Still, it feels like we ought to be able to do something here.

What if we set our sights a little bit lower?  Sure, n-way joins with arbitrary filters and sort orders would be ideal, but isn't something better than nothing in this area?  It turns out that with some additional restrictions in place it is possible to execute joins where the performance scales with the size of the result set, and as of the recently released SDK 1.3.1 this feature, which I'm calling "simple joins," is available to beta testers.

Now, before you start licking your chops let me give you all the bad news (there's plenty of it).  First, simple joins are only available to JPA and JDO users.  Why?  Because JPA and JDO already have an established API for joins and API design is really really hard.  I try to avoid it whenever possible.  Second, simple joins only support equality filters.  Ouch, right?  Third, simple joins can only be sorted by the property on which you're joining.  Double ouch.  Fourth, if you're using owned relationships, in order to use simple joins you're going to need to upgrade your storage version and migrate existing data.   This post is already long enough so I'm just going to focus on simple joins across unowned relationships. I'll cover storage versions, data migration, and joining across owned relationships in my next post, I promise.

I know that was a lot of bad news, but my theory is that even with all that bad news, simple joins are sufficiently useful.  I'm making them available to try and prove that theory. Ready to try it out?

Let's model a college.  We have Students, Courses, and Dorms.  Every Student is associated with zero or one Dorm and zero or more Courses.

JPA:
@Entity
public class Student {
  @Id
  @GeneratedValue(strategy = GenerationType.IDENTITY)
  private Long id;

  private String year;

  @Basic
  private List<Key> courses = new ArrayList<Key>();

  @OneToMany
  @Column(name = "courses", insertable = false, updatable = false)
  private List<Course> coursesAlias;

  @Basic
  private Key dorm;

  @OneToOne(fetch = FetchType.LAZY)
  @Column(name = "dorm", insertable = false, updatable = false)
  private Dorm dormAlias;
}

@Entity
public class Course {
  @Id
  @GeneratedValue(strategy = GenerationType.IDENTITY)
  private Key key;

  private String department;
}

@Entity
public class Dorm {
  @Id
  @GeneratedValue(strategy = GenerationType.IDENTITY)
  private Key key;

  private String campus;
}

Yes, there's something funny (and by funny I mean hacky) going on with the courseAlias and dormAlias fields.  The JPQL compiler won't let you join on a Key or a List<Key> field because these aren't OneToOne or OneToMany fields.  To work around this I've created a read-only alias of the appropriate type.  Your application code should always read from and write to the 'courses' and 'dorm' fields while your join queries should read from and write to the 'coursesAlias' and 'dormAlias' fields.  Once we have proper support for unowned relationships all this nonsense will go away.

WARNING: There's a bug that causes aliased fields to be written to the datastore with incorrect names. Make sure the names of your aliased fields don't end in "Key" or else your join won't return any results.  In the above example you could trigger this bug by renaming Student.dorm to Student.dormKey.

JDO:
@PersistenceCapable(detachable = "true")
public class Student {
  @PrimaryKey
  @Persistent(valueStrategy = IdGeneratorStrategy.IDENTITY)
  private Long id;

  private String year;

  @Element(dependent = "true")
  private List<Key> courses = new ArrayList<Key>();

  @Persistent
  private Dorm dorm;
}

@PersistenceCapable(detachable = "true")
public class Course {
  @PrimaryKey
  @Persistent(valueStrategy = IdGeneratorStrategy.IDENTITY)
  private Key key;

  private String department;
}

@PersistenceCapable(detachable = "true")
public class Dorm {
  @PrimaryKey
  @Persistent(valueStrategy = IdGeneratorStrategy.IDENTITY)
  private Key key;

  private String campus;
}
---------------------------------------
Here's a query that returns all Juniors enrolled in a Biology course:
JPA:
Query q = em.createQuery("select from " + Student.class.getName() 
  + " s JOIN s.coursesAlias c where "
  + "c.department = 'Biology' and "
  + "s.year = 'Junior'");

JDO:
Query q = pm.newQuery("select from " + Student.class.getName() 
  + " where "
  + "courses.contains(c) && c.department == 'Biology' && "
  + "year == 'Junior'");
q.declareVariables(Course.class.getName() + " c");

It's definitely strange to be referring to non-existent properties of a Key field and a List<Key> field, but since proper unowned relationships are not yet supported I eased up some of the JDOQL compiler's type checking to let this go through (sorry Andy, I know you hate this).  Non-standard!
---------------------------------------

And here's a query that returns all Juniors who live in a dorm on the North campus of the school:
JPA:
Query q = em.createQuery("select from " + Student.class.getName() 
  + " s JOIN s.dormAlias d where "
  + "d.campus = 'North' and "
  + "s.year = 'Junior'");

JDO:
Query q = pm.newQuery("select from " + Student.class.getName() + " where "
  + "dorm == d && d.campus == 'North' && "
  + "year == 'Junior'");
q.declareVariables(Dorm.class.getName() + " d");
Again, I'm taking some liberties with standard JDOQL here.
---------------------------------------
These examples should be enough to get you started.  Currently, only 2-way joins are supported but there's no reason it can't be extended to n-way.  As for outer joins, arbitrary sorts, and inequality filters, my colleagues and I are still thinking hard about it.

Stay tuned for more posts in which I'll explain how to make this work for owned relationships and the underlying algorithm involved.  Give simple joins a try and let me know how it goes!

Friday, January 15, 2010

Querying with Key parameters

Today we're going to cover a small 'gotcha' that I've seen on the forums a number of times: How do you execute a query where one or more of the parameters are of type com.google.appengine.datastore.api.Key?

Did you know that a Key has two different string representations?  You can see the first by calling Key.toString():
Key key = KeyFactory.createKey("Foo", 25);
System.out.println(key); // prints Foo(25)

You can see the second by calling KeyFactory.keyToString():
Key key = KeyFactory.createKey("Foo", 25);
System.out.println(KeyFactory.keyToString(key)); // prints agR0ZXN0cgkLEgNGb28YGQw

The first string is human readable, just like the javadoc for Object.toString() recommends.  The second string is websafe (no escaping required) and easily parsed by a machine (it's just the Key object translated to a protocol buffer and then base-64 encoded).  Seems reasonable, right?  Key.toString() outputs a string that is useful in logging and debugging, KeyFactory.keyToString() outputs a string that you can safely pass around in forms and urls.  Plus for those of you who, like me, care about symmetry in your apis, KeyFactory.keyToString() and KeyFactory.stringToKey() are symmetrical.  If Key.toString() returned the websafe version, what would the opposite transform look like?  Probably a static method on Key called fromString(), right?  So now you have mirror methods but one requires an instance and the other is static.  I don't like it but I digress.  You're probably reading this because you'd like to learn something useful so let's move along.

The problem with having two string representations of the Key class stems from the fact that for a number of data types it's easy to "inline" the parameter value.  Take this query for example:
JPA:

public List<Flight> flightsByAircraftType(EntityManager em, String aircraftType) {
    String queryStr = String.format("select from " + Flight.class.getName() + 
        " where aircraftType = '%s'", aircraftType);
    return em.createQuery(queryStr).getResultList();
}

JDO:
public List<Flight> flightsByAircraftType(PersistenceManager pm, String aircraftType) {
    String queryStr = String.format("select from " + Flight.class.getName() + 
        " where aircraftType == '%s'", aircraftType);
    return (List<Flight>) pm.newQuery(queryStr).execute();
}


Concise, readable, and correct.  But now let's do the same thing with a Key parameter:
JPA:

public List<Flight> flightPaginationByDest(EntityManager em, Key dest1, Key dest2) {
    String queryStr = String.format("select from " + Flight.class.getName() + 
        " where dest > '%s' and dest <= '%s'", dest1, dest2);
    return em.createQuery(queryStr).getResultList();
}

JDO:
public List<Flight> flightPaginationByDest(PersistenceManager pm, Key dest1, Key dest2) {
    String queryStr = String.format("select from " + Flight.class.getName() + 

        " where dest > '%s' && dest <= '%s'", dest1, dest2);
    return (List<Flight>) pm.newQuery(queryStr).execute();
}

Concise, readable, but incorrect.  We just asked the datastore for all Flights departing from airports between "Airport(15)" and "Airport(30)", and that's not going to match anything, ever, because we passed a String parameter to a query on entities where every single "dest" property is of type Key.  Remember, the datastore is schema-less.  If we were using the low-level api you could create some flight entities where the "dest" property is of type String and some flight entities where the "dest" property is of type Key.  The datastore has no problem with that.
So what's the solution?  Ultimately the query compiler should detect this, and in future releases you can expect to get an exception if you attempt to match an unencoded String value against a Key.  Until this is implemented I'd recommend using parameters:
JPA:
public List<Flight> flightPaginationByDest(EntityManager em, Key dest1, Key dest2) {
    String queryStr = "select from " + Flight.class.getName() + 

        " where dest > :p1 and dest <= :p2";
    Query q = em.createQuery(queryStr);
    q.setParameter("p1", dest1);
    q.setParameter("p2", dest2);
    return q.getResultList();

}

JDO:
public List<Flight> flightPaginationByDest(PersistenceManager pm, Key dest1, Key dest2) {
      String queryStr = "select from " + Flight.class.getName() + 

          " where dest > :p1 && dest <= :p2";
      return (List<Flight>) pm.newQuery(queryStr).execute(dest1, dest2);
}