Friday, November 20, 2009

Case Insensitive Queries

Hi everyone.  I've been posting a bit less frequently the past few weeks because the team has really been working hard on our next release.  It's almost ready.....we're excited to get it out to you!  On to today's topic....

I see questions about case insensitive queries on our message board pretty regularly.  We usually give people the same answer, but rather than continuing to describe the solution, why not just provide it?  Here we go.

If you're using a relational database it's pretty easy to execute case-insensitive queries because it's a native feature of the database.  It usually looks something like this:

 select * from Person where UPPER(lastName) = 'ROSS'

This is a challenge for the App Engine Datastore because we rely on executing scans over a limited subset of your data and returning results as we encounter them.  Consider the following strings in lexicographic order:
...
ROSE
...
ROSS
...
ROSTA
...
Rose
...
Ross
...
Rosta
...
rose
...
ross
...
rosta
...

As you can see there could be an unlimited number of rows in between 'ROSS' and 'Ross', and also between 'Ross' and 'ross' (okay not unlimited, we do have a limit on indexed string length, but definitely Large), so if we start scanning at 'ROSS' we might have to skip a huge number of results before we hit 'Ross', and that doesn't scale.  This is why the datastore doesn't support case-insensitive queries.

Fortunately it's not hard to implement support for case-insensitive queries in your application.  Here's the approach: for each field that you want to query in a case-insensitive way, create a duplicate field that stores the value of that field in either all upper or lowercase letters.  Then, have your model object register a pre-persist callback with your persistence framework (JPA or JDO), and then populate the duplicate fields inside that callback.  Here's what it looks like.

JPA:
import javax.persistence.PrePersist;
import javax.persistence.PreUpdate;
 

@Entity
public class Person {
  @Id
  @GeneratedValue(strategy = GenerationType.IDENTITY)
  private Long id;

  private String lastName; 
  private String uppercaseLastName;

  public Long getId() {
    return id;
  }


  public String getLastName() {
    return lastName;
  }

  public void setLastName(String lastName) {
    this.lastName = lastName;
  }

  @PrePersist
  @PreUpdate
  public void prePersist() {
    if (lastName != null) {
      uppercaseLastName = lastName.toUpperCase();
    } else {
      uppercaseLastName = null;
    }
  } 

}
 
public List<Person> getPeopleByLastName(String lastName) {
  Query q = em.createQuery("select from Person where uppercaseLastName = :p");
  q.setParameter("p", lastName.toUpperCase());
  return (List<Person>) q.getResultList(); 

}


JDO:
import javax.jdo.listener.StoreCallback;

@PersistenceCapable(identityType = IdentityType.APPLICATION)
public class Person implements StoreCallback {
  @PrimaryKey
  @Persistent(valueStrategy = IdGeneratorStrategy.IDENTITY)
  private Long id;
  private String lastName;
  private String uppercaseLastName;

  public Long getId() {
    return id;
  }

  public String getLastName() {
    return lastName;
  }

  public void setLastName(String lastName) {
    this.lastName = lastName;
  }

  public void jdoPreStore() {
    if (lastName != null) {
      uppercaseLastName = lastName.toUpperCase();
    } else {
      uppercaseLastName = null;
    }
  }
}

public List<Person> getPeopleByLastName(String lastName) { 
  Query q = pm.newQuery(Person.class, "uppercaseLastName == :p"); 
  return (List<Person>) q.execute(lastName.toUpperCase()); 
}

Note that in both examples I have not created a getter or a setter for the uppercaseLastName field.  That's because this field is part of the interface we're exposing to the persistence framework but not part of the public interface of the model object.  Your query code can filter by the uppercaseLastName field but the client code that interacts with the Person object has no reason to know that there are duplicate fields that are used to support case-insensitive queries.

Wednesday, November 4, 2009

Unindexed Properties

Did you know that, by default, the App Engine Datastore writes two index records for every entity property that isn't a com.google.appengine.api.datastore.Blob or a com.google.appengine.api.datastore.Text?  It's true!  These index records allow you to execute a variety of queries involving the property without creating a composite index.

Now, these index records don't come for free.  They take time to write and they take up space on disk.  If you have a property that you're absolutely positively sure you'll never want to filter or sort by, you can opt-out of the default indexing that is going on.  Let's look at an example.  We'll use a Book class with a com.google.appengine.api.datastore.Link property that is the URL of an image of the Book's cover.  Our assumption here is that we don't need to select Books based on the value of this property and we don't need to sort them based on the value of this property.


JPA:
@Entity
public class Book {
  @Id
  @GeneratedValue(strategy=GenerationType.IDENTITY)
  private Long id;

  @Extension(vendorName="datanucleus", key="gae.unindexed", value="true")
  private Link coverImageUrl;

  private String title;

  // getters and setters
}


JDO:
@PersistenceCapable(identityType=IdentityType.APPLICATION)
public class Book {
  @PrimaryKey
  @Persistent(valueStrategy=IdGeneratorStrategy.IDENTITY)
  private Long id;

  @Persistent
  @Extension(vendorName="datanucleus", key="gae.unindexed", value="true")
  private Link coverImageUrl;

  private String title;

  // getters and setters
}


In both examples, the "gae.unindexed" extension tells App Engine that you want to opt out of the default indexing.  Now remember, requirements change over time.  Just because you're not filtering or sorting by a property today doesn't mean you won't filter or sort by that property tomorrow, so think hard before you choose to mark something as 'unindexed.'

Of course, no matter how well you've planned you'll find yourself in a situation where you have existing data and you need to change a property from indexed to unindexed or from unindexed to indexed.  What do you do?
Let's start with the easy one - going from indexed to unindexed.

If you change a property from indexed to unindexed, the index records for all existing entities with that property will continue to exist until you update (or delete) the entity with which those index records are associated.  Any new entities will be created without the index records for the newly unindexed property.  As long as you've taken care to purge your code of all queries that filter or sort by the property that is now unindexed you'll be ready to go as soon as you upload your new application version.

Changing a property from unindexed to indexed is more difficult.  Presumably you're doing this because you need to filter or sort by that property in your queries, and since entities without index records for the filter and sort properties are automatically excluded from the result set, the only way that your queries involving the newly indexed property are going to return the results you expect is if you create index records for all the entities that existed before you made the property indexed.  Yikes.  I told you to think hard, didn't I?

So what do you do?  Well, the missing index records will be created whenever you rewrite an entity, so you'll need to map over all entities of the appropriate kind and "touch" each one (fetch it then put it).  You can use the Task Queue to break this work up, and when Datastore Cursors are released (coming soon!) this will be even easier.  Still, you've got some work to do.

My final warning: Unindexed properties are an optimization, and just like any optimization it is possible to invoke it prematurely.  Using unindexed properties will speed up writes and reduce disk usage, but will it speed up writes and reduce disk usage enough to make the optimization worthwhile?  I can't answer that question for you.  The impact will depend on how much data you have and how often you're writing it.

Wednesday, October 21, 2009

Optimistic Locking With @Version

The datastore does optimistic locking, so if you update an entity inside a transaction and another request commits an update to either that same entity or some other entity in the same entity group before you commit your transaction, you will get an exception.  This keeps your data from getting stomped on when there are concurrent requests.  However, in the web-app world where clients can and typically do maintain state across requests, there is a far more likely scenario in which your data can get stomped on.  Consider the following scenario involving a bug tracking system:

Time 1: User A brings up the details page for Bug 838.
Time 2: User B brings up the details page for Bug 838.
Time 3: User A assigns Bug 838 to Jim.
Time 4: User B assigns Bug 838 to Sally.

Even though User A and User B updated Bug 838 at different times, User A's update was effectively stomped on by User B.  Sure, the fact that Bug 838 was assigned to Jim at some point may be in the history for the Bug, we can't change the fact that User B made a decision to assign Bug 838 based on out-of-date information.  In a bug tracking system this might not be such a big deal, but if you're doing something like compensation planning you'd much rather have your users receive an exception when they make an update based on out-of-date information.  Fortunately this is easy to implement using JPA/JDO on top of App Engine.  Let's use a Person object with a 'salary' property as an example.

JPA:
@Entity
public class Person {
    @Id
    @GeneratedValue(strategy=GenerationType.IDENTITY)
    private Long id;

    private int salary;

    @Version
    private long version;

    // ...getters and setters
}

public void updateSalary(EntityManager em, Person p, int newSalary) {
    em.getTransaction().begin();
    try {
        p.setSalary(newSalary);
        p = em.merge(p);
        em.getTransaction().commit();
    } catch (RollbackException e) {
        if (e.getCause() instanceof OptimisticLockException) {
            handleVersionConflict(e.getCause(), p);
        } else {
            throw e;
        }
    } finally {
        if (em.getTransaction().isActive()) {
            em.getTransaction().rollback();
        }
    }
}

JDO:

@PersistenceCapable(identityType=IdentityType.APPLICATION)
@Version(strategy=VersionStrategy.VERSION_NUMBER)
public class Person {
    @PrimaryKey
    @Persistent(valueStrategy=IdGeneratorStrategy.IDENTITY)
    private Long id;

    private int salary;


    // ...getters and setters
}

public void updateSalary(PersistenceManager pm, Person p, int newSalary) {
    pm.currentTransaction().begin();
    try {
        p.setSalary(newSalary);
        pm.makePersistent(p);
        pm.currentTransaction().commit();
    } catch (JDOOptimisticVerificationException e) {
        handleVersionConflict(e, p);
    } finally {        
        if (pm.currentTransaction().isActive()) {
            pm.currentTransaction().rollback();
        }
    }
}

If you declare a Version field on your model object, JDO/JPA will compare the value of that field on the instance you are updating with the value of that field in the datastore.  If the version numbers are equal the version number will be incremented and your model object will be persisted.   If the version numbers are not equal the appropriate exception will be thrown.  In the above examples I've caught this exception in the update code itself, but this is really just to illustrate what's going on.  In practice I would most likely let this exception propagate out of the update method and handle it at a higher level, perhaps even as part of a generic exception handler in my presentation code.

Note that there is a performance cost to using this feature.  The datastore does not support updates with predicates the way relational databases do ('update person where version = 3') so in order to perform the version comparison it needs to do an additional get() for the entity that corresponds to your model object.  You'll have to decide for yourself whether or not the cost of the additional fetch is worthwhile.

Now, if you've looked at both the JPA and JDO examples you may have noticed that JPA requires the declaration of an explicit field annotated with @Version while JDO accepts an @Version annotation on the class without an explicit field.  If you are using JDO and would like the version stored in an explicit field you can use a DataNucleus-specific extension to tell JDO which field to use.  You can see an example of this here.

Using JPA and JDO's built-in versioning checks is an easy way to help users of your app make decisions based on up-to-date information.  Try it out!

Thursday, October 15, 2009

Serialized Fields

(This was originally posted to the Google App Java Google Group on October 14, 2009)

Suppose one of your model objects has a field that isn't a native datastore type like String or Date and isn't another model object, like the child of an owned OneToOne relationship.  Oftentimes you can use an Embedded class to store that field's sub-fields in the same record as the containing record, but what if an Embedded class isn't sufficient?  This is where Serialized Fields come in.  With JDO and JPA it's possible to store any class that implements java.io.Serializable in a single property.

For our example lets model Person, ContactProfiles, ContactProfile, and PhoneNumber.  A Person has exactly one ContactProfiles instance, a ContactProfiles has some number of ContactProfile instances, and a ContactProfile has some number of PhoneNumber instances.  Our application is such that when we retrieve a Person from the datastore it always makes sense to load their ContactProfiles as well.  We'll define ContactProfiles, ContactProfile, and PhoneNumber first since they're the same whether we're using JPA or JDO:

public static class ContactProfiles implements Serializable {
    private final
List<ContactProfile> profiles;

    public ContactProfiles(List<ContactProfile> profiles) {
      this.profiles = profiles;
    }

    public
List<ContactProfile> get() {
      return profiles;
    }
}

public class ContactProfile implements Serializable {
    private String profileName;
    private
List<PhoneNumber> phoneNumbers;

    // getters and setters
}

public class PhoneNumber implements Serializable {
    private String type;
    private String number;

    // getters and setters
}


The JPA Person definition:
@Entity
public class Person {
    @Id
    @GeneratedValue(strategy=GenerationType.IDENTITY)
    private Key id;

    @Lob
    private ContactProfiles contactProfiles;

    // getters and setters
}


The JDO Person definition:

@PersistenceCapable(identityType = IdentityType.APPLICATION, detachable = "true")
public class Person { 
    @PrimaryKey

    @Persistent(valueStrategy = IdGeneratorStrategy.IDENTITY)
    private Key id;

    @Persistent(serialized = "true")
    private ContactProfiles contactProfiles;

    // getters and setters
}


In this example the App Engine JDO/JPA implementation converts the Person.contactProfiles field into a com.google.appengine.api.datastore.Blob using standard Java serialization.  This Blob is then stored as a single property on the Person Entity.  When you fetch a Person the corresponding Entity is retrieved and the Blob is deserialized in just a single datastore rpc, leaving you with a ContactProfiles instance and all its associated sub-objects.

There are a few drawbacks to this approach that you should be aware of.  First, even though your application understands the structure of the bytes that make up the ContactProfiles, the datastore does not.  As a result you can't filter or sort on any of the properties of ContactProfile or PhoneNumber.  Second, all the known gotchas associated with Java serialization apply here.  If you evolve your Serializable classes make sure you do it in a backwards compatible way!  Third, you might be looking at the definition of the ContactProfiles class and thinking "Why are we bothering with a ContactProfiles class when Person could just store List<ContactProfile>?"  The reason is that updating a serialized field is not as simple as updating other types of fields, and this extra layer of indirection makes things a bit easier.  DataNucleus has no way of "seeing" the changes you make to the internal values of a serialized field, and as a result it doesn't know when it needs to flush changes to these fields to the datastore.  However, DataNucleus can see when the top-level serialized field reference changes.  Giving Person a reference to a ContactProfiles object makes it easy for us to change ContactProfiles in a way that DataNucleus is guaranteed to recognize:

public void addContactProfile(Person p, ContactProfile cp) {
    // update our serialized field
    p.getContactProfiles().get().add(cp);
    // give the person a new ContactProfiles reference so the update is detected
    p.setContactProfiles(new ContactProfiles(p.getContactProfiles().get()));
}

The last line where we set a new ContactProfiles instace back on the Person is the key and it's easy to miss, so be careful!  If you forget this step your updates will be ignored and you will be sad.

Despite the limitations I've listed, Serialized fields are a good way to store structured data inside the record of a containing Entity.  As long as you can get by without filtering or sorting on this data and you remember that it has special update requirements, serialized fields will almost certainly come in handy at some point.

Keys Only Queries

(This was originally posted to the Google App Java Google Group on October 6, 2009)

If you use the low-level datastore api you may have noticed that the com.google.appengine.api.datastore.Query class has a setKeysOnly() method.  If you call this method before you execute the query the datastore will return com.google.appengine.api.datastore.Entity instances that have their keys filled in but none of their properties.  This can reduce consumption of your Datastore Data Received from API quota, especially if you've got some large entities, but, more importantly, it can also reduce consumption of your Datastore CPU Time quota.  How?  Well, if the fulfillment of your query requires an index or a merge join your query actually executes in two stages:  First it scans the index to find the keys of the entities that match and then it issues additional scans to retrieve the entities uniquely identified by the matching keys.  If your query is keys-only we can skip that second step entirely.  That means faster queries!

Now, JPA and JDO don't know anything about keys-only queries, but they do give you the flexibility to either return your entire object or some subset of the fields on your object.  If you construct this subset to only contain the primary key of your object, the App Engine implementation of JPA and JDO will use a keys-only query.  Let's look at some examples:

JPA:
@Entity
public class Book {
    @Id
    @GeneratedValue(strategy=GenerationType.IDENTITY)
    private Key id;

    private Date dateOfPublication;

    // getters and setters

}


Now let's implement a method that returns the Keys of all Books published betweeen 2 years (we'll assume someone else is creating and closing an EntityManager named 'em' for us):

public List keysOfBooksPublishedBetween(EntityManager em, Date from, Date to) {
    em.getTransaction().begin();
    try {
        Query q = em.createQuery("select id from " + Book.class.getName()
            + " where dateOfPublication >= :from AND dateOfPublication <= :to");
        q.setParameter("from", from);
        q.setParameter("to", to);
        return (List) q.getResultList();
    } finally {
        em.getTransaction().rollback();
    }
}


JDO:

@PersistenceCapable(identityType = IdentityType.APPLICATION, detachable = "true")
public class Book {
 
    @PrimaryKey

    @Persistent(valueStrategy = IdGeneratorStrategy.IDENTITY)
    private Key id;

    private Date dateOfPublication;


    // getters and setters
}

Now let's implement a method that returns the Keys of all Books published betweeen 2 years (we'll assume someone else is creating and closing a PersistenceManager named 'pm' for us):


public List keysOfBooksPublishedBetween(PersistenceManager pm, Date from, Date to) {
    pm.currentTransaction().begin();
    try {
        Query q = pm.newQuery("select id from " + Book.class.getName()
            + " where dateOfPublication >= :from && dateOfPublication <= :to");
        return (List) q.execute(from, to);
    } finally {
        pm.currentTransaction().rollback();
    }
}
--------------------------------
Notice how we are only selecting the 'id' field from our Book class.  This is crucial.  If you select any other fields your query will end up fetching entire entities and the optimization will be lost.

Updating A Bidrectional Owned One-To-Many With A New Child

(This was originally posted to the Google App Java Google Group on September 28, 2009)  

All the way back in our first post one we demonstrated how to create both a parent and a child of a bidirectional, owned, one-to-many relationship at the same time.  This week we're going to see how to add a child to an existing parent.  We'll use the same model objects we used before:

JPA:
@Entity
public class Book {
    @Id
    @GeneratedValue(strategy=GenerationType.IDENTITY)
    private Key id;

    private String title;


    @OneToMany(mappedBy = "book", cascade = CascadeType.ALL)

    private List<Chapter> chapters = new ArrayList<Chapter>();

    // getters and setters
}

@Entity
public class Chapter {
    @Id
    @GeneratedValue(strategy=GenerationType.IDENTITY)
    private Key id;

    private String title;
    private int numPages;

    @ManyToOne(fetch = FetchType.LAZY)
    private Book book;

    // getters and setters
}

Now let's assume we've already created a book with a few chapters in the datastore and we want to add a brand new chapter to a Book with a given id (we'll assume someone else is creating and closing an EntityManager named 'em' for us):

public void addChapterToBook(EntityManager em, Key bookKey, Chapter chapter) {
    em.getTransaction().begin();
    try {
        Book b = em.find(Book.class, bookKey);
        if (b == null) {
            throw new RuntimeException("Book " + bookKey + " not found!");
        }
        b.getChapters().add(chapter);

        em.getTransaction().commit();
    } finally {
        if (em.getTransaction().isActive()) {
            em.getTransaction().rollback();
        }
    }
}


JDO:

@PersistenceCapable(identityType = IdentityType.APPLICATION, detachable = "true")
public class Book {
 
    @PrimaryKey

    @Persistent(valueStrategy = IdGeneratorStrategy.IDENTITY)
    private Key id;

    private String title;


    @Persistent(mappedBy = "book")
    @Element(dependent = "true")
    @Order(extensions = @Extension(vendorName="datanucleus", key="list-ordering", value="id asc"))
    private List<Chapter> chapters = new ArrayList<Chapter>();


    // getters and setters
}

@PersistenceCapable(identityType = IdentityType.APPLICATION, detachable = "true")

public class Chapter {
    @PrimaryKey

    @Persistent(valueStrategy = IdGeneratorStrategy.IDENTITY)
    private Key id;

    private String title;
    private int numPages;

    @Persistent
    private Book book;

    // getters and setters

}

Now let's assume we've already created a book with a few chapters in the datastore and we want to add a brand new chapter to a Book with a given id (we'll assume someone else is creating and closing a PersistenceManager named 'pm' for us):


public void addChapterToBook(PersistenceManager pm, Key bookKey, Chapter chapter) {
    pm.currentTransaction().begin();
    try {
        // throws a runtime exception if book is not found       
        Book b = pm.getObjectById(Book.class, bookKey);
        b.getChapters().add(chapter);

        pm.currentTransaction().commit();
    } finally {
        if (pm.currentTransaction().isActive()) {
            pm.currentTransaction().rollback();
        }
    }
}
--------------------------------

The interesting thing about both of these examples is that we're not making any explicit calls to save the new Chapter.  We look up the Book identified by the Key that was passed into the function and then we manipulate the persistent state of the object by manipulating the POJO that was returned by em.fetch/pm.getObjectById.  JPA and JDO both have mechanisms that allow them to monitor the objects that you've looked up for changes.  Ever wonder what exactly the enhancer is doing to your classes?  It's adding hooks so that the persistence framework gets notified when things change (among other things).  This allows JPA and JDO to automatically flush your changes to the datastore when you commit your transaction.  If you wanted to modify the title of the Book or the number of pages in an existing Chapter the approach would be exactly the same: Start a transaction, look up the Book, make your changes, commit your transaction.  Whether you're using JPA or JDO your changes will be persisted for you without any explicit calls to change the persistent state.  This is a prime example of how JPA and JDO facilitate "transparent persistence."

Executing Batch Gets

(This was originally posted to the Google App Java Google Group on September 21, 2009) 

Did you know that the App Engine datastore supports batch gets?  Batch gets are a super-efficient way to load multiple entities when you already have the keys of the entities you want loaded.  Here's an example using the low-level Datastore API:

  public Map getById(List keys) {
      DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
      return ds.get(keys);        
  }

Now lets see how we can accomplish the same thing in JPA and JDO:

JPA:
@Entity
public class Book {
    @Id
    @GeneratedValue(strategy=GenerationType.IDENTITY)
    private Key key;
    private String title;


    // additional members, getters and setters
}



public List getById(List keys) {
    Query q = em.createQuery("select from " + Book.class.getName() + " where key = :keys");
    q.setParameter("keys", keys);
    return (List) q.getResultList();
}


JDO:
@PersistenceCapable(identityType = IdentityType.APPLICATION)
public class Book {
    @PrimaryKey
    @Persistent(valueStrategy = IdGeneratorStrategy.IDENTITY)
    @GeneratedValue(strategy=GenerationType.IDENTITY)
    private Key key;
    private String title;


    // additional members, getters and setters
}


public List getById(List keys) {
    Query q = pm.newQuery("select from " + Book.class.getName() + " where key == :keys");
    return (List) q.execute(keys);
}

Notice how in both examples we're constructing a query to pull back the entities by key.  The App Engine JDO/JPA implementation detects queries that are filtering only by key and fulfills them using a low level batch get rather than a datstore query.  This works no matter the type of your primary key field, so whether you're using a Long, an unencoded String, an encoded String, or a Key, the same technique will work.  However, even though this looks like a query, all the fetch-related transaction restrictions apply: if you're executing your batch get query inside a txn, all of the entities you're attempting to fetch must belong to the same entity group or you'll get an exception. Be careful with this.

The next time you need to pull back multiple entities and you already have their keys, issue a query that filters only by your object's key field and reap the benefits of the datastore's optimized batch get mechanism.