Friday, November 20, 2009

Case Insensitive Queries

Hi everyone.  I've been posting a bit less frequently the past few weeks because the team has really been working hard on our next release.  It's almost ready.....we're excited to get it out to you!  On to today's topic....

I see questions about case insensitive queries on our message board pretty regularly.  We usually give people the same answer, but rather than continuing to describe the solution, why not just provide it?  Here we go.

If you're using a relational database it's pretty easy to execute case-insensitive queries because it's a native feature of the database.  It usually looks something like this:

 select * from Person where UPPER(lastName) = 'ROSS'

This is a challenge for the App Engine Datastore because we rely on executing scans over a limited subset of your data and returning results as we encounter them.  Consider the following strings in lexicographic order:
...
ROSE
...
ROSS
...
ROSTA
...
Rose
...
Ross
...
Rosta
...
rose
...
ross
...
rosta
...

As you can see there could be an unlimited number of rows in between 'ROSS' and 'Ross', and also between 'Ross' and 'ross' (okay not unlimited, we do have a limit on indexed string length, but definitely Large), so if we start scanning at 'ROSS' we might have to skip a huge number of results before we hit 'Ross', and that doesn't scale.  This is why the datastore doesn't support case-insensitive queries.

Fortunately it's not hard to implement support for case-insensitive queries in your application.  Here's the approach: for each field that you want to query in a case-insensitive way, create a duplicate field that stores the value of that field in either all upper or lowercase letters.  Then, have your model object register a pre-persist callback with your persistence framework (JPA or JDO), and then populate the duplicate fields inside that callback.  Here's what it looks like.

JPA:
import javax.persistence.PrePersist;
import javax.persistence.PreUpdate;
 

@Entity
public class Person {
  @Id
  @GeneratedValue(strategy = GenerationType.IDENTITY)
  private Long id;

  private String lastName; 
  private String uppercaseLastName;

  public Long getId() {
    return id;
  }


  public String getLastName() {
    return lastName;
  }

  public void setLastName(String lastName) {
    this.lastName = lastName;
  }

  @PrePersist
  @PreUpdate
  public void prePersist() {
    if (lastName != null) {
      uppercaseLastName = lastName.toUpperCase();
    } else {
      uppercaseLastName = null;
    }
  } 

}
 
public List<Person> getPeopleByLastName(String lastName) {
  Query q = em.createQuery("select from Person where uppercaseLastName = :p");
  q.setParameter("p", lastName.toUpperCase());
  return (List<Person>) q.getResultList(); 

}


JDO:
import javax.jdo.listener.StoreCallback;

@PersistenceCapable(identityType = IdentityType.APPLICATION)
public class Person implements StoreCallback {
  @PrimaryKey
  @Persistent(valueStrategy = IdGeneratorStrategy.IDENTITY)
  private Long id;
  private String lastName;
  private String uppercaseLastName;

  public Long getId() {
    return id;
  }

  public String getLastName() {
    return lastName;
  }

  public void setLastName(String lastName) {
    this.lastName = lastName;
  }

  public void jdoPreStore() {
    if (lastName != null) {
      uppercaseLastName = lastName.toUpperCase();
    } else {
      uppercaseLastName = null;
    }
  }
}

public List<Person> getPeopleByLastName(String lastName) { 
  Query q = pm.newQuery(Person.class, "uppercaseLastName == :p"); 
  return (List<Person>) q.execute(lastName.toUpperCase()); 
}

Note that in both examples I have not created a getter or a setter for the uppercaseLastName field.  That's because this field is part of the interface we're exposing to the persistence framework but not part of the public interface of the model object.  Your query code can filter by the uppercaseLastName field but the client code that interacts with the Person object has no reason to know that there are duplicate fields that are used to support case-insensitive queries.

15 comments:

  1. thanks for that post, max.

    a simple question: wouldn't the property name then be a candidate for being defined as "unindexed" in regard to optimization?

    ReplyDelete
  2. If you're sure you'll never want to filter or sort by that propery in a case sensitive way then sure, "unindexed" sounds reasonable. Still, make sure the optimization is really necessary. It's much easier to change a property from "indexed" to "unindexed" than the other way around (see previous post).

    ReplyDelete
  3. Do you have an opinion on whether or not toUpperCase should be used vs. toLowerCase? I can't think of any reason for one over the other, but am wondering if there's something to think about if Locale's are introduced.

    For example, my entity stores a Locale in it along with a lower-case field, and based on the locale, I perform the lowercase function in my prePersist(). Not sure if it really matters for English, but perhaps for other language sets?

    ReplyDelete
  4. Interesting question, I don't know the answer. Perhaps someone else can chime in?

    ReplyDelete
  5. JavaDoc from String#toUpperCase()
    -----------------------------
    Converts all of the characters in this String to upper case using the rules of the default locale. This method is equivalent to toUpperCase(Locale.getDefault()).

    Note: This method is locale sensitive, and may produce unexpected results if used for strings that are intended to be interpreted locale independently. Examples are programming language identifiers, protocol keys, and HTML tags. For instance, "title".toUpperCase() in a Turkish locale returns "T\u0130TLE", where '\u0130' is the LATIN CAPITAL LETTER I WITH DOT ABOVE character. To obtain correct results for locale insensitive strings, use toUpperCase(Locale.ENGLISH).
    -----------------------------

    So, your solution (just as SQL one) will not work for Turkish language.

    E.g we have name "Likas". It will be stored as "LIKAS" (Locale.ENGLISH is the default one in AppEngine, right?). But in Turkish uppercase variant is "LİKAS". So, if some Turkish person will look for "LİKAS", he will not fined "Likas". Hopefully you get my point :)

    ReplyDelete
  6. So a good solution might be to capture the Locale of the user entering in a certain type of data, and then perform a lower/upper case translation with that user's locale (String.toLowerCase(Locale.XYZ))

    Not as easy as Max's solution since it requires one more piece of data (the Locale), but should work if you can capture the locale.

    ReplyDelete
  7. What is the benefit of the logic being in prePersist() instead of setLastName()

    ReplyDelete
  8. Good question! In this example I can't think of a benefit, other than my desire to demonstrate that these callbacks are easy to use. However, if you needed to compute some derived state based on the value of more than one field you'd definitely want the logic in prePersist() since you don't need to think about the order in which your setters are invoked.

    ReplyDelete
  9. Ok, now I'm stuck with 20k records that do not have this. What can I do?

    cron? It can run for just 30 seconds right. How do I go about the entire process?

    I'm subscribing to ur blog now, have so much to learn!

    ReplyDelete
  10. I ran into the problem of mass updating 15k+ object lists. My first idea was to do it on localhost and use the bulk uploader to upload the whole DB, but the bulk uploader is only available in Python...

    So what I ended up doing was creating a Task which only processes n objects in a run and then reschedule a new task to continue with the remaining objects. I chose n to be a number that is safe to be finished in 30 seconds. I also ran into the problem of seeking into the middle of the resultset, as the Query.setRange() only accepts offsets below 1000... to solve this problem, I used used a Cursor which was being passed between the tasks, so the next task could continue exactly where the previous once finished, without having to rerun the appstore query.

    See AppStore documentation for Tasks and Data storage / Cursor.

    ReplyDelete
  11. When I bought my computer and I didn´t know how to use java graphics, so I decided looking for information in a webside and I found an useful information that helped me a lot.. Now I am interested in to do the best investment and I found a webside very useful and interesting called costa rica investment opportunities , I think it´s a very wonderful site.

    ReplyDelete
  12. Thanks for the post. Really important;

    ReplyDelete
  13. Thank you!

    Too bad I saw this after a few wasted hours...

    ReplyDelete