Friday, February 12, 2010

The Truth About Joins

Whenever I get the opportunity to talk to my peers about the coolness of App Engine it always pains me to have to say "...but you can't do joins."  I know how useful joins are because I spent many many years using them before I started working on App Engine.  Lately I've been thinking that there has to be a way to make it work.  The App Engine datastore offers a guarantee that query performance scales with the size of the result set, not the size of the data set, so a query that returns 100 results should take the same amount of time whether you have a thousand rows or a million.  That means we can't compute a full cross-product at time-of-query because that would require looking at all the data in the n tables we're joining, and if we look at all the data our performance is now tied to the size of the data set.  Still, it feels like we ought to be able to do something here.

What if we set our sights a little bit lower?  Sure, n-way joins with arbitrary filters and sort orders would be ideal, but isn't something better than nothing in this area?  It turns out that with some additional restrictions in place it is possible to execute joins where the performance scales with the size of the result set, and as of the recently released SDK 1.3.1 this feature, which I'm calling "simple joins," is available to beta testers.

Now, before you start licking your chops let me give you all the bad news (there's plenty of it).  First, simple joins are only available to JPA and JDO users.  Why?  Because JPA and JDO already have an established API for joins and API design is really really hard.  I try to avoid it whenever possible.  Second, simple joins only support equality filters.  Ouch, right?  Third, simple joins can only be sorted by the property on which you're joining.  Double ouch.  Fourth, if you're using owned relationships, in order to use simple joins you're going to need to upgrade your storage version and migrate existing data.   This post is already long enough so I'm just going to focus on simple joins across unowned relationships. I'll cover storage versions, data migration, and joining across owned relationships in my next post, I promise.

I know that was a lot of bad news, but my theory is that even with all that bad news, simple joins are sufficiently useful.  I'm making them available to try and prove that theory. Ready to try it out?

Let's model a college.  We have Students, Courses, and Dorms.  Every Student is associated with zero or one Dorm and zero or more Courses.

JPA:
@Entity
public class Student {
  @Id
  @GeneratedValue(strategy = GenerationType.IDENTITY)
  private Long id;

  private String year;

  @Basic
  private List<Key> courses = new ArrayList<Key>();

  @OneToMany
  @Column(name = "courses", insertable = false, updatable = false)
  private List<Course> coursesAlias;

  @Basic
  private Key dorm;

  @OneToOne(fetch = FetchType.LAZY)
  @Column(name = "dorm", insertable = false, updatable = false)
  private Dorm dormAlias;
}

@Entity
public class Course {
  @Id
  @GeneratedValue(strategy = GenerationType.IDENTITY)
  private Key key;

  private String department;
}

@Entity
public class Dorm {
  @Id
  @GeneratedValue(strategy = GenerationType.IDENTITY)
  private Key key;

  private String campus;
}

Yes, there's something funny (and by funny I mean hacky) going on with the courseAlias and dormAlias fields.  The JPQL compiler won't let you join on a Key or a List<Key> field because these aren't OneToOne or OneToMany fields.  To work around this I've created a read-only alias of the appropriate type.  Your application code should always read from and write to the 'courses' and 'dorm' fields while your join queries should read from and write to the 'coursesAlias' and 'dormAlias' fields.  Once we have proper support for unowned relationships all this nonsense will go away.

WARNING: There's a bug that causes aliased fields to be written to the datastore with incorrect names. Make sure the names of your aliased fields don't end in "Key" or else your join won't return any results.  In the above example you could trigger this bug by renaming Student.dorm to Student.dormKey.

JDO:
@PersistenceCapable(detachable = "true")
public class Student {
  @PrimaryKey
  @Persistent(valueStrategy = IdGeneratorStrategy.IDENTITY)
  private Long id;

  private String year;

  @Element(dependent = "true")
  private List<Key> courses = new ArrayList<Key>();

  @Persistent
  private Dorm dorm;
}

@PersistenceCapable(detachable = "true")
public class Course {
  @PrimaryKey
  @Persistent(valueStrategy = IdGeneratorStrategy.IDENTITY)
  private Key key;

  private String department;
}

@PersistenceCapable(detachable = "true")
public class Dorm {
  @PrimaryKey
  @Persistent(valueStrategy = IdGeneratorStrategy.IDENTITY)
  private Key key;

  private String campus;
}
---------------------------------------
Here's a query that returns all Juniors enrolled in a Biology course:
JPA:
Query q = em.createQuery("select from " + Student.class.getName() 
  + " s JOIN s.coursesAlias c where "
  + "c.department = 'Biology' and "
  + "s.year = 'Junior'");

JDO:
Query q = pm.newQuery("select from " + Student.class.getName() 
  + " where "
  + "courses.contains(c) && c.department == 'Biology' && "
  + "year == 'Junior'");
q.declareVariables(Course.class.getName() + " c");

It's definitely strange to be referring to non-existent properties of a Key field and a List<Key> field, but since proper unowned relationships are not yet supported I eased up some of the JDOQL compiler's type checking to let this go through (sorry Andy, I know you hate this).  Non-standard!
---------------------------------------

And here's a query that returns all Juniors who live in a dorm on the North campus of the school:
JPA:
Query q = em.createQuery("select from " + Student.class.getName() 
  + " s JOIN s.dormAlias d where "
  + "d.campus = 'North' and "
  + "s.year = 'Junior'");

JDO:
Query q = pm.newQuery("select from " + Student.class.getName() + " where "
  + "dorm == d && d.campus == 'North' && "
  + "year == 'Junior'");
q.declareVariables(Dorm.class.getName() + " d");
Again, I'm taking some liberties with standard JDOQL here.
---------------------------------------
These examples should be enough to get you started.  Currently, only 2-way joins are supported but there's no reason it can't be extended to n-way.  As for outer joins, arbitrary sorts, and inequality filters, my colleagues and I are still thinking hard about it.

Stay tuned for more posts in which I'll explain how to make this work for owned relationships and the underlying algorithm involved.  Give simple joins a try and let me know how it goes!

27 comments:

  1. Looks interesting Max.

    > First, simple joins are only available to JPA and JDO users.

    Is that the current short-term plan, or do you mean this feature is unlikely to eventually appear in the low-level Java or Python API?

    ReplyDelete
  2. This is just the current short-term plan. If simple joins turn out to be useful we will make them available in the low-level Java and Python APIs, but changes at this level require a lot more work so we want to make sure it's really worth it.

    ReplyDelete
  3. Execellent job, Max.

    Thanks for brining (simple) joins to GAE users

    ReplyDelete
  4. This comment has been removed by a blog administrator.

    ReplyDelete
  5. I currently precompute all joins and store them in a serialized text string for decoding on the client side. For example if I have one student who takes many courses, I would have a field called courses in the student entity, which would contain a text string with containing a serialized version of all course entities which that student is taking instead of just the keys. Can I now start using this new join method instead, or should I stick to the old way for performance?

    ReplyDelete
  6. @Mats : your problème will not be performance but ability to maintain your code and make it evolve... And if you change a join, you have to recompute the serialized string...

    @max : Good job ! I have another question : what pattern would you use to go around the "an entity cannot have multiple relationship fields of the same type" ?
    I test a pseudo mail system, with a Mailbox containing List and a List of read

    The correct answer may be "your model is incorrect" and read/not read should be a property of Mail but for performances issues, i'm not sure...

    ReplyDelete
  7. Great post! I have created a database that can store Event objects which reference Venue objects. I am currently trying to query the db for all events that are at venues in a particular city. Because the Venue entity is not 'embedded' within the Event entity I am unable to query based on attributes of the Venue an Event belongs to. The example you provided of the dorms seems to be a solution to my problem except I just get an exception when trying similar queries. Would you mind explaining what JDOQL type-checking settings you had to tweak to make such queries work? Thanks again!

    ReplyDelete
  8. Adam,

    I disabled most of the type checking. If you post your model objects and your query code I'm happy to take a look.

    Max

    ReplyDelete
  9. Hi Max,

    I failed to get it to work. Below are my model and query:


    @Entity
    public class CheckIn extends Post implements Serializable {

    @Basic
    private Key venueKey;

    @OneToOne(fetch = FetchType.LAZY)
    @Column(name = "venue", insertable = false, updatable = false)
    private Venue venueAlias;


    @Entity
    @MappedSuperclass
    public abstract class Post implements Comparable{

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Key key;

    @NotNull
    @Temporal(TemporalType.TIMESTAMP)
    private Date date;


    public Key getKey() {
    return key;
    }

    public void setKey(Key key) {
    this.key = key;
    }

    public Long getId() {
    return key.getId();
    }

    public void setId(Long id) {
    this.key = KeyFactory.createKey(this.getClass().getSimpleName(), id);
    }


    @Entity
    public class Venue implements Serializable {
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Key key;


    private String name;




    return super.entityManager.createQuery("select c from " + CheckIn.class.getSimpleName()
    + " c JOIN c.venueAlias v where "
    + "v.name = 'Damansara Perdana'")
    .getResultList();

    Your help is very much appreciated

    ReplyDelete
  10. I'm happy to investigate but could you please post what went wrong? Did you get an exception? If so, what was it? Did you get incorrect results? If so, what were they?

    Thanks,
    Max

    ReplyDelete
  11. No exception. Just no result returned.

    Thanks,
    --Wong Liong Hung

    ReplyDelete
  12. Got error trying to do joins with JDO.
    Here what I have done:
    1.create an empty appplication (using GAE 1.3.1).
    2.copy entities from this post.
    3.write a small test:
    public void test() throws Exception {
    PersistenceManager pm = PMF.getFactory().getPersistenceManager();

    Query q = pm.newQuery("select from " + Student.class.getName() + " where " + "courses.contains(c) && c.department == 'Biology' && "
    + "year == 'Junior'");
    }
    4.launch test..
    5.got error:
    java.lang.NullPointerException
    at org.datanucleus.store.appengine.EntityUtils.determineKind(EntityUtils.java:95)
    at org.datanucleus.store.appengine.EntityUtils.determineKind(EntityUtils.java:91)
    at org.datanucleus.store.appengine.query.DatastoreQuery.addLeftPrimaryExpression(DatastoreQuery.java:1071)
    at org.datanucleus.store.appengine.query.DatastoreQuery.addExpression(DatastoreQuery.java:846)
    at org.datanucleus.store.appengine.query.DatastoreQuery.addExpression(DatastoreQuery.java:830)
    at org.datanucleus.store.appengine.query.DatastoreQuery.addExpression(DatastoreQuery.java:829)
    at org.datanucleus.store.appengine.query.DatastoreQuery.addFilters(DatastoreQuery.java:807)
    at org.datanucleus.store.appengine.query.DatastoreQuery.performExecute(DatastoreQuery.java:226)
    at org.datanucleus.store.appengine.query.JDOQLQuery.performExecute(JDOQLQuery.java:85)
    at org.datanucleus.store.query.Query.executeQuery(Query.java:1489)
    at org.datanucleus.store.query.Query.executeWithArray(Query.java:1371)
    at org.datanucleus.store.query.Query.execute(Query.java:1344)
    at org.datanucleus.jdo.JDOQuery.execute(JDOQuery.java:221)
    at LocalDatastoreTestCase.test(LocalDatastoreTestCase.java:35)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:616)
    at junit.framework.TestCase.runTest(TestCase.java:168)
    at junit.framework.TestCase.runBare(TestCase.java:134)
    at junit.framework.TestResult$1.protect(TestResult.java:110)
    at junit.framework.TestResult.runProtected(TestResult.java:128)
    at junit.framework.TestResult.run(TestResult.java:113)
    at junit.framework.TestCase.run(TestCase.java:124)
    at junit.framework.TestSuite.runTest(TestSuite.java:232)
    at junit.framework.TestSuite.run(TestSuite.java:227)
    at org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:79)
    at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:46)
    at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)

    Error was in this method:

    public static String determineKind(AbstractClassMetaData acmd, IdentifierFactory idFactory) {
    if (acmd.getTable() != null) {

    acmd was null.

    6.got stuck...
    7.write this comment
    8.wait for your answer...
    Don't know how to fix it... Any suggestions?

    ReplyDelete
  13. "Your application code should always read from and write to the 'courses' and 'dorm' fields while your join queries should read from and write to the 'coursesAlias' and 'dormAlias' fields."


    Although still haven't able to get the join to work, I hope my application code could read from coursesAlias and dormAlias when I use join.

    Meaning from the Student result returned, I could do something like this:
    student.getDormAlias().getCampus() without another query using the dorm's Key to get the campus value.

    Thanks,
    --Wong Liong Hung

    ReplyDelete
  14. @Wong Liung Hung:
    The 'name' attribute of the Column annotation needs to match the name of the Key field. Your 'name' attribute is "venue" but your Key field is "venueKey." Change the name of the Key field to "venue" and you should be ok. Don't change the name of the 'name' attribute to "venueKey" because for some reason DataNucleus is doing something strange with field names that end in Key. I need to investigate that.

    ReplyDelete
  15. @Alex:
    How embarrassing, my JDO examples weren't declaring the variables that were used in the join. I've added this line now (search for "declareVariables" up above). If you add it to your code it should work.

    Sorry for the trouble,
    Max

    ReplyDelete
  16. It does work now.. Thanks..

    ReplyDelete
  17. Great feature! I think whatever can be done to alleviate lack of joins for Data Store will reduce the pain at least for some scenarios. I have 2 questions:
    1) Can we have unowned relationships defined as collection of KeyStrings instead of keys i.e. :

    @Basic
    private List<String> courses = new ArrayList<String>();

    @OneToMany
    @Column(name = "courses", insertable = false, updatable = false)
    private List coursesAlias;


    courses field would be filled with string keys coming from a declaration like this:

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    @Extension(vendorName = "datanucleus", key = "gae.encoded-pk", value = "true")
    private String coursesKeyStr;

    Does this would work?

    2) Restriction to equality filter is a temporary is a temporary implementation issue or will stay due to the workings of app store?

    ReplyDelete
  18. 1) A List of Strings won't work because we have no way of knowing that the values are actually the String representations of Keys. Why not convert the Strings to Keys before you store them in the List?

    2) I'd of course like it to be temporary but we still haven't come up with a way to make this work with inequality filters. I'll never say never because for awhile we didn't think we could have any type of scalable joins, and then we came up with this. Don't hold your breath though.

    ReplyDelete
  19. Can you please explain why aliases does not work for me.

    @Entity
    public class Foo {
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    public Long id;

    @Basic
    public Key bar;

    @OneToOne(fetch = FetchType.LAZY)
    @Column(name = "bar", insertable = false, updatable = false)
    public Bar barAlias;
    }

    @Entity
    public class Bar {
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    public Key key;
    }

    public class UnOwnedRelationshipTest extends BaseDataStoreTest {

    public void testRelation() throws EntityNotFoundException {

    Bar bar = new Bar();
    beginTx();
    getEntityManager().persist(bar);
    commitTx();

    beginTx();
    Foo pojo = new Foo();
    pojo.bar = bar.key;
    getEntityManager().persist(pojo);
    commitTx();

    pojo = getEntityManager().find(Foo.class, pojo.id);
    com.google.appengine.api.datastore.Entity e = DatastoreServiceFactory
    .getDatastoreService().get(KeyFactory.createKey("Foo", pojo.id));
    assertTrue(e.hasProperty("bar"));

    //this fails
    assertNotNull(pojo.barAlias);

    }
    }

    The test fails, alias is not initialized..
    It does not work even when I use getters/setters to access alias, Neither it works for @oneToMany as you explained in example.

    I use GAE SDK 1.3.2

    See this thread as well http://groups.google.com/group/google-appengine-java/browse_thread/thread/aefd4e8b918af4f2

    Please advice.

    ReplyDelete
  20. See this thread as well http://groups.google.com/group/google-appengine-java/browse_thread/thread/cbf2dfe62209480e?hl=en#

    ReplyDelete
  21. Max, can you please answer my above question..

    ReplyDelete
  22. From student result returned, can I do
    student.getDormAlias().getCampus() !!

    ReplyDelete
  23. How can I find all courses that a user is enrolled in !!

    ReplyDelete
  24. Got error on model:

    Class Store {

    List<Key> admins;

    @OneToMany() @Column (name = "admins", insertable = false, updatable = false)
    private List<People> adminsAlias;

    List<Key> agents;
    @OneToMany() @Column (name = "agents", insertable = false, updatable = false)
    private java.util.List<People> agentsAlias;

    }

    javax.persistence.PersistenceException: Error in meta-data for Store.agentsAlias: Class Store has multiple relationship fields of type People: agentsAlias and adminsAlias. This is not yet supported.
    at org.datanucleus.jpa.NucleusJPAHelper.getJPAExceptionForNucleusException(NucleusJPAHelper.java:264)

    ReplyDelete
  25. I know this is over a year late but I am just starting to use JDO on GAE with App Engine 1.5.0. Do the samples work with 1.5.0 ? Can a Dorm exist with no student ? Can more than one student be in the same Dorm ? If the PK of Dorm is a Key, surely it will get encoded with it's parent key and thus prevent other parents. Is it really unowned ? Perhaps the terminology has changed since 2010.

    ReplyDelete
  26. Im also interested if its still possible to use this simple joins. I tested it, but got null result. So Im curious if Im doing something wrong or if it is no longer supported how its written in official documentation.

    ReplyDelete