Thursday, October 15, 2009

Keys Only Queries

(This was originally posted to the Google App Java Google Group on October 6, 2009)

If you use the low-level datastore api you may have noticed that the com.google.appengine.api.datastore.Query class has a setKeysOnly() method.  If you call this method before you execute the query the datastore will return com.google.appengine.api.datastore.Entity instances that have their keys filled in but none of their properties.  This can reduce consumption of your Datastore Data Received from API quota, especially if you've got some large entities, but, more importantly, it can also reduce consumption of your Datastore CPU Time quota.  How?  Well, if the fulfillment of your query requires an index or a merge join your query actually executes in two stages:  First it scans the index to find the keys of the entities that match and then it issues additional scans to retrieve the entities uniquely identified by the matching keys.  If your query is keys-only we can skip that second step entirely.  That means faster queries!

Now, JPA and JDO don't know anything about keys-only queries, but they do give you the flexibility to either return your entire object or some subset of the fields on your object.  If you construct this subset to only contain the primary key of your object, the App Engine implementation of JPA and JDO will use a keys-only query.  Let's look at some examples:

JPA:
@Entity
public class Book {
    @Id
    @GeneratedValue(strategy=GenerationType.IDENTITY)
    private Key id;

    private Date dateOfPublication;

    // getters and setters

}


Now let's implement a method that returns the Keys of all Books published betweeen 2 years (we'll assume someone else is creating and closing an EntityManager named 'em' for us):

public List keysOfBooksPublishedBetween(EntityManager em, Date from, Date to) {
    em.getTransaction().begin();
    try {
        Query q = em.createQuery("select id from " + Book.class.getName()
            + " where dateOfPublication >= :from AND dateOfPublication <= :to");
        q.setParameter("from", from);
        q.setParameter("to", to);
        return (List) q.getResultList();
    } finally {
        em.getTransaction().rollback();
    }
}


JDO:

@PersistenceCapable(identityType = IdentityType.APPLICATION, detachable = "true")
public class Book {
 
    @PrimaryKey

    @Persistent(valueStrategy = IdGeneratorStrategy.IDENTITY)
    private Key id;

    private Date dateOfPublication;


    // getters and setters
}

Now let's implement a method that returns the Keys of all Books published betweeen 2 years (we'll assume someone else is creating and closing a PersistenceManager named 'pm' for us):


public List keysOfBooksPublishedBetween(PersistenceManager pm, Date from, Date to) {
    pm.currentTransaction().begin();
    try {
        Query q = pm.newQuery("select id from " + Book.class.getName()
            + " where dateOfPublication >= :from && dateOfPublication <= :to");
        return (List) q.execute(from, to);
    } finally {
        pm.currentTransaction().rollback();
    }
}
--------------------------------
Notice how we are only selecting the 'id' field from our Book class.  This is crucial.  If you select any other fields your query will end up fetching entire entities and the optimization will be lost.

12 comments:

  1. Why is the transaction rolled-back at the end of the each keysOfBooksPublishedBetween() function call?

    ReplyDelete
  2. The transaction needs to be either committed or rolled back. I'm not performing any writes inside the transaction so I'm rolling it back.

    ReplyDelete
  3. Thanks! That makes sense.

    Is it necessary to wrap this kind of call in a transaction?

    I have a JUnit test on my devserver that works the same (i.e., gets the keys it should be getting) whether I start a transaction or not (haven't tried in appengine yet).

    ReplyDelete
  4. I generally wrap everything in a transaction unless I can't due to multiple entity groups. If you're issuing a query that maps to a batch get and the entities you're fetching belong to multiple entity groups you'll get an exception.

    The behavior should be identical whether you're running locally or in prod. If it's not please let me know.

    ReplyDelete
  5. I do understand the sentence:
    "The transaction needs to be either committed or rolled back".
    I don't know much about JDO/JPA/DataNucleus, so sorry if those questions seem basic. I can see the transaction.begin, the transaction.rollback, but not the transaction.commit. Is the transaction.commit implicit on "return" of the operation ? o did you omit it for simplicity of the example ?

    I do not understand the logic of this sentence:
    "I'm not performing any writes inside the transaction so I'm rolling it back".

    The transaction contains just a "select", so the most we did here, under the cover, is potentially to read-lock some data. Even if we did read-lock, a (implicit or explicit) transaction.commit should release the read-locks (and the world should be back like it was before the transaction started), no ?

    What does transaction.rollback do for you here ? No write-lock or data updates, so what the rollback does ?

    I would expect the need for and explicit "transaction.rollback" if soemthing goes wrong in a transaction where we write-lock some data (i.e. update/create/delete data), but not in a transaction that just read data.

    Thanks,

    ReplyDelete
    Replies
    1. If you use the low-level datastore api you may have noticed that the com.google.appengine.api.datastore.Query class has a setKeysOnly() method. If you call this method before you execute the query the datastore will return com.google.appengine.api.datastore. cotton bed mattress near me , double blanket , double bed razai , sofa cover fabric , velvet blanket Entity instances that have their keys filled in but none of their properties. This can reduce consumption of your Datastore Data Received from API quota, especially if you've got some large entities, but, more importantly, it can also reduce consumption of your Datastore CPU Time quota. How? Well, if the fulfillment of your query requires an index or a merge join your query actually executes in two stages: First it scans the index to find the keys of the entities that match and then it issues additional scans to retrieve the entities uniquely identified by the matching keys. If your query is keys-only we can skip that second step entirely. That means faster queries!

      Delete
  6. If you start a transaction you need to either commit it or roll it back. If you just leave it open it will eventually time out on the server side, and the cleanup it performs is equivalent to a rollback, so I'm just making it explicit. You could just as easily commit the transaction the behavior of your application would be no different.

    ReplyDelete
  7. Okay thanks. I guess for me the "commit" would have been logical rather than the "rollback", even though in this case they behave the same.

    Thanks.

    ReplyDelete
  8. See
    http://code.google.com/p/googleappengine/issues/detail?id=3423

    for issues when making keys only queries.

    ReplyDelete
  9. This comment has been removed by the author.

    ReplyDelete
  10. If you use the low-level datastore api you may have noticed that the com.google.appengine.api.datastore.Query class has a setKeysOnly() method. If you call this method before you execute the query the datastore will return com.google.appengine.api.datastore.Entity instances that have their keys filled in but none of their properties. wholesale bed sheets , cheap comforter sets , gul ahmed sale 2018 bed sheets , bed sheet wedding , duvet covers king , pure cotton mattress online , blankets near me , bed razai , 6 seater sofa cover , crushed velvet bedding This can reduce consumption of your Datastore Data Received from API quota, especially if you've got some large entities, but, more importantly, it can also reduce consumption of your Datastore CPU Time quota. How? Well, if the fulfillment of your query requires an index or a merge join your query actually executes in two stages: First it scans the index to find the keys of the entities that match and then it issues additional scans to retrieve the entities uniquely identified by the matching keys. If your query is keys-only we can skip that second step entirely. That means faster queries!

    ReplyDelete