Friday, December 4, 2009

Queries with != and IN filters

Since we launched Google App Engine for Java back in April there has been a sizable disparity between the persistence features of Python and Java: Python had support for the != and IN operators in query filters and Java did not. With the release of SDK 1.2.8 this disparity is now gone (hooray!), but before you go rushing out to take advantage of these new operators I want to take some time to explain how the performance of queries that use these operators is a little bit different from what you're used to.

The key difference is that the datastore does not natively support != and IN. Instead, these operators are implemented in "userland." The code lives in appengine-api.jar, and it knows how to transform queries that use != and IN into queries that the datastore natively supports. Let's look at an example. Suppose you want to issue a query that returns all people not named Max, ordered by name and age.

Under the hood, this query gets broken up into two queries. The first query returns all people whose name is lexicographically smaller than Max, ordered by name and age. The second query returns all people whose name is lexicographically larger than Max, ordered by name and age.

JPA:
select from Person where name <> "Max" order by name, age

is translated into

select from Person where name < "Max" order by name, age
select from Person where name > "Max" order by name, age

JDO:
select from Person where name != "Max" order by name, age

is translated into



select from Person where name < "Max" order by name, age 
select from Person where name > "Max" order by name, age

The results of these two queries are then merged, with the sort applied in-memory as we go. Both result sets are ordered by the same properties, so we can determine the next result to return by comparing the two results at the front of the two result sets. This makes the in-memory sort efficient. We also need to de-dupe as we go because a multi-value property may place an Entity in both result sets. The performance of this query is equivalent to the performance of the two underlying queries with some additional memory consumption because we need to maintain references to all the results we've already returned in order to de-dupe.

IN filters are implemented in similar fashion, but instead of requiring a fixed number of queries to fulfill they require N queries, where N is the number of values in the IN clause.  Suppose you want all people whose favorite foods are cheeseburgers, pizza, and fried chicken, ordered by favoriteFood and age.

Under the hood this query gets broken up into three queries. The first query returns all people whose favorite food is cheeseburgers, ordered by favoriteFood and age. The second query returns all people whose favorite food is pizza, ordered by favoriteFood and age. The third query includes all people whose favorite food is fried chicken, ordered by favoriteFood and age.

JPA:
select from Person where
    favoriteFood IN ('cheeseburger', 'pizza', 'fried chicken')
    order by favoriteFood, age

is translated into

select from Person where favoriteFood = "cheeseburger" order by favoriteFood, age
select from Person where favoriteFood = "pizza" order by favoriteFood, age
select from Person where favoriteFood = "fried chicken" order by favoriteFood, age

JDO:
Query q = pm.newQuery(
    "select from Person where :p1.contains(favoriteFood) order by favoriteFood, age");
q.execute(Arrays.asList("cheeseburger", "pizza", "fried chicken"));

is translated into

select from Person where favoriteFood == "cheeseburger" order by favoriteFood, age
select from Person where favoriteFood == "pizza" order by favoriteFood, age
select from Person where favoriteFood == "fried chicken" order by favoriteFood, age

Once again we merge the results of these three queries, sorting and de-duping as we go. Once again the performance of this query is equivalent to the performance of the underlying queries with some additional memory consumption.

Can you issue a query that combines != and IN? Absolutely. Can you have multiple IN filters? Absolutely again! But, be careful, if your query requires more than 30 underlying queries to fulfill you'll get an exception. Here's a query that requires too many underlying queries to fulfill:

JPA:
select from Person where 
    name <> "Max" AND 
    favoriteFood IN ("cheeseburger", "pizza", "fried chicken") AND
    age IN (10, 11, 12, 13, 14, 15, 16, 17, 18 19) order by name

is translated into

select from Person where name < "Max" AND favoriteFood = "cheeseburger" AND age = 10 order by name
select from Person where name > "Max" AND favoriteFood = "cheeseburger" AND age = 10 order by name
select from Person where name < "Max" AND favoriteFood = "cheeseburger" AND age = 11 order by name
select from Person where name > "Max" AND favoriteFood = "cheeseburger" AND age = 11 order by name
select from Person where name < "Max" AND favoriteFood = "cheeseburger" AND age = 12 order by name
select from Person where name > "Max" AND favoriteFood = "cheeseburger" AND age = 12 order by name
select from Person where name < "Max" AND favoriteFood = "cheeseburger" AND age = 13 order by name
select from Person where name > "Max" AND favoriteFood = "cheeseburger" AND age = 13 order by name
select from Person where name < "Max" AND favoriteFood = "cheeseburger" AND age = 14 order by name
select from Person where name > "Max" AND favoriteFood = "cheeseburger" AND age = 14 order by name
select from Person where name < "Max" AND favoriteFood = "cheeseburger" AND age = 15 order by name
select from Person where name > "Max" AND favoriteFood = "cheeseburger" AND age = 15 order by name
select from Person where name < "Max" AND favoriteFood = "cheeseburger" AND age = 16 order by name
select from Person where name > "Max" AND favoriteFood = "cheeseburger" AND age = 16 order by name
select from Person where name < "Max" AND favoriteFood = "cheeseburger" AND age = 17 order by name
select from Person where name > "Max" AND favoriteFood = "cheeseburger" AND age = 17 order by name
select from Person where name < "Max" AND favoriteFood = "cheeseburger" AND age = 18 order by name
select from Person where name > "Max" AND favoriteFood = "cheeseburger" AND age = 18 order by name
select from Person where name < "Max" AND favoriteFood = "cheeseburger" AND age = 19 order by name
select from Person where name > "Max" AND favoriteFood = "cheeseburger" AND age = 19 order by name
select from Person where name < "Max" AND favoriteFood = "pizza" AND age = 10 order by name
select from Person where name > "Max" AND favoriteFood = "pizza" AND age = 10 order by name
select from Person where name < "Max" AND favoriteFood = "pizza" AND age = 11 order by name
select from Person where name > "Max" AND favoriteFood = "pizza" AND age = 11 order by name

.... and so on.

JDO:
Query q = pm.newQuery("select from Person where " + 
    "name != 'Max' AND " +
    ":p1.contains(favoriteFood) && :p2.contains(age) order by name");
q.execute(Arrays.asList("cheeseburger", "pizza", "fried chicken"),
    Arrays.asList(10, 11, 12, 13, 14, 15, 16, 17, 18, 19));

is translated into

select from Person where name < "Max" && favoriteFood == "cheeseburger" AND age == 10 order by name
select from Person where name > "Max" && favoriteFood == "cheeseburger" AND age == 10 order by name
select from Person where name < "Max" && favoriteFood == "cheeseburger" AND age == 11 order by name
select from Person where name > "Max" && favoriteFood == "cheeseburger" AND age == 11 order by name
select from Person where name < "Max" && favoriteFood == "cheeseburger" AND age == 12 order by name
select from Person where name > "Max" && favoriteFood == "cheeseburger" AND age == 12 order by name
select from Person where name < "Max" && favoriteFood == "cheeseburger" AND age == 13 order by name
select from Person where name > "Max" && favoriteFood == "cheeseburger" AND age == 13 order by name
select from Person where name < "Max" && favoriteFood == "cheeseburger" AND age == 14 order by name
select from Person where name > "Max" && favoriteFood == "cheeseburger" AND age == 14 order by name
select from Person where name < "Max" && favoriteFood == "cheeseburger" AND age == 15 order by name
select from Person where name > "Max" && favoriteFood == "cheeseburger" AND age == 15 order by name
select from Person where name < "Max" && favoriteFood == "cheeseburger" AND age == 16 order by name
select from Person where name > "Max" && favoriteFood == "cheeseburger" AND age == 16 order by name
select from Person where name < "Max" && favoriteFood == "cheeseburger" AND age == 17 order by name
select from Person where name > "Max" && favoriteFood == "cheeseburger" AND age == 17 order by name
select from Person where name < "Max" && favoriteFood == "cheeseburger" AND age == 18 order by name
select from Person where name > "Max" && favoriteFood == "cheeseburger" AND age == 18 order by name
select from Person where name < "Max" && favoriteFood == "cheeseburger" AND age == 19 order by name
select from Person where name > "Max" && favoriteFood == "cheeseburger" AND age == 19 order by name
select from Person where name < "Max" && favoriteFood == "pizza" AND age == 10 order by name
select from Person where name > "Max" && favoriteFood == "pizza" AND age == 10 order by name
select from Person where name < "Max" && favoriteFood == "pizza" AND age == 11 order by name
select from Person where name > "Max" && favoriteFood == "pizza" AND age == 11 order by name

.... and so on.

One final note that isn't entirely related but that I'll tack on anyway: You can now use the 'OR' operator in your queries as long as the query can be rewritten using IN. That means this now works:

JPA:
select from Person where 
    favoriteFood = "cheeseburger" OR 
    favoriteFood = "pizza" OR 
    favoriteFood = "fried chicken"

JDO:
select from Person where 
    favoriteFood == "cheeseburger" || 
    favoriteFood == "pizza" || 
    favoriteFood == "fried chicken"

But this still doesn't:

JPA:
select from Person where 
    favoriteFood = "cheeseburger" OR age = 10

JDO:
select from Person where 
    favoriteFood == "cheeseburger" || age == 10

Enjoy the new operators!

20 comments:

  1. I would still love to see also examples using the low-level datastore API, that would really be handy!

    ReplyDelete
  2. The low-level api has far less surface area than JDO and JPA so I would expect the javadoc to be sufficient for most things, but if you have questions about how to accomplish specific tasks, please ask and I'll answer as best I can.

    We have plans to document the low-level api alongside JDO and JPA in the official docs but this will take some time.

    ReplyDelete
  3. Do you think it is a good idea to simulate a capabilities, which are not supported by the underlying engine? It might cause creation of inefficient applications, because developers will be lured into the sense that the capabilities like IN do present and should be used whenever needed.

    ReplyDelete
  4. In general, no, we don't simulate capabilities that are not supported by the underlying engine. That's why we don't support aggregate functions, joins, etc. The rule of thumb is that we only support operations that scale with the size of the result set (as opposed to the size of the data set). The implementation of != and IN that I've described in this post meets that criteria. These queries are more expensive to fulfill by a factor related to the query itself but still tied to the size of the result set. So, in this case, I think the exception is warranted.

    ReplyDelete
  5. A very simple alternative to the latter question is to make available access to the in-memory query evaluator. This would then allow people to get the results they want for the query they supply, and they have specified a query extension to enable it ... hence know that things may be done non-optimally. Their query would then be portable to other JDO/JPA implementations too .... this is actually important to some people. This alternative was raised back in April (2009) as an issue, and is NOT a lot of work to enable.

    ReplyDelete
  6. may i know is there hard limit when using "IN" when doing jpa/jdo query?

    ReplyDelete
  7. i mean, how long the "select in" statement.

    ReplyDelete
  8. JPA and JDO don't impose any limitations, but if the query requires more than 30 underlying queries you'll get an exception. This is explained in more detail in the post.

    ReplyDelete
  9. thanks @Max ross . can you comment on socialwork presentation below ?

    DatastoreService dataSvc =...;

    Query query = new Query("MessageRecipients")
    .addFilter("recipients"), FilterOperator.EQUAL,userid)
    .addSort("date", SortDirection.DESCENDING)
    .setKeysOnly(); // <-- Only fetch keys!


    //*** from the presentation , mentioned the above can only be done
    using low level api, arent jdo/jpa can get the key only value as well
    //***http://gae-java-persistence.blogspot.com/2009/10/keys-only-
    queries.html


    List msgRecpts = dataSvc.prepare(query).asList();
    List parents = new ArrayList();
    for(Entity recep : msgREcpts) {
    parents.add(recep.getParent());



    }


    //Bulk fetch parents using key list
    Map msg = dataSvc.get(parents);

    //***for second part above, what is the different from doing directly
    "select in" statement in jpa/jdo ?
    //***http://gae-java-persistence.blogspot.com/2009/12/queries-with-and-
    in-filters.html

    ReplyDelete
  10. Do these queries run in parallel?

    ReplyDelete
  11. Good question. In the traditional sense, no. != and IN filters are executed in the datastore api implementation, which executes inside your app. Since we don't currently allow threads in your app the sub-queries are initiated in sequence. However, we only need one result from each sub-query to determine the next result to return for the top-level query, and query results aren't streamed back from the datastore until they are needed, so in a sense the queries are being evaluated in parallel, just not in parallel within the jvm where your application code is running.

    ReplyDelete
  12. When building a query using the low-level datastore API, am I limited to 30 underlying queries too?
    I need to count plenty of entites fast in geocells for my map based app.

    ReplyDelete
  13. Doesn't it support not in or not contains??

    ReplyDelete
  14. It is possible to not just test if the query string *contains* the Entity property, but also find out whether the query string contains the Entity property at the *beginning* of the query string?

    E.g.

    Entity Property: "this is just a test"
    Query String: "this is" - SHOULD MATCH
    Query String: "just a test" - SHOULD FAIL

    As it is, if I just use the IN operator, both query strings would match

    ReplyDelete
  15. I'm just replying to my previous. I think I totally misunderstood the function of contains()

    I thought it performed the same function as String.contains()

    But it's actually used to search if a property is contained in a collection of possible results.

    Oops.

    Is there any sort of String.contains() functionality in JDOQL queries?

    ReplyDelete
  16. This comment has been removed by the author.

    ReplyDelete
  17. The low level API permits batch gets when retrieving by entity key. They don't mention a limit of 30 anywhere for that.

    Is it possible to fetch more than 30 entities with JPA if you query with an IN clause on the key property?

    For example:
    select from MyEntity where id IN (:keyList)

    and then keyList is a collection with over 30 keys.

    ReplyDelete
  18. Hi Max,
    Nice Blog!!

    I have a couple of questions:
    1. I didn't understood this ':p1.contains(favoriteFood) && :p2.contains(age)'.
    Everything okay but what is ':p1' and ':p2'? Is it a syntax to specify 'parameter 1', 'parameter 2' ?

    2. Can we match substrings? If yes, how? Suppose datastore has values "looping", "loop", "loops" and the search parameter value is "loop". So the result should give all the 3 values. :p.contains returns only the search value(here "loop") is returned.

    Thanks

    ReplyDelete
  19. It's been nearly 4 years you started supporting != and IN *non-natively* through a kludge. Are there any plans to support != and IN natively any time soon?

    The problem is that in their current avatars, they (especially IN) result in rapid increase of costs and additionally, in limitations. Now I need to put checks in the system so that the total number of items in IN are not more than 30. Too problematic. Need a graceful solution here.

    ReplyDelete