I’m not a Postgres expert, but I’ve been using it for about 6 months as an Event Store database. At Bench we’ve built our own eventing system on top of ActiveMQ, Camel and Akka and we use Postgres to persist every single Domain Event.
Our event schema is very flexible and currently represented as JSON. We chose Postgres for persistence because of its great JSON support. As you probably know, Postgres 9.4 introduced the JSONB type, an advanced JSON type that supports indexing. Obviously, you should index all your key fields, but that can be tricky. Let me share what we’ve discovered.
Problem
So, we have a query like this:
select "id", "created_at", "version", "name", "context", "assets"
from "event"
where ("assets" @> '[{"resourceId": "569ee61ee4b0e7dd960dcee3"}]')
   or ("assets" @> '[{"resourceId": "874874"}]')
   or ("assets" @> '[{"resourceId": "875187"}]')
   or ("assets" @> '[{"resourceId": "858164"}]')
   or ("assets" @> '[{"resourceId": "858567"}]')
order by "created_at" desc limit 1000 offset 0
The assets field is a JSON array that contains different entities (objects), some of which have a resourceId field. In this query we want to find all events that contain assets with the specified resourceIds. We only pass 5 resourceIds here, but in practice we can have many more.
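To make the matching semantics concrete, here is a small Python sketch of what the `@>` containment check does for this shape of data — this is an illustration only, not Postgres code, and the sample event shape (the `"type"` key) is hypothetical:

```python
import json

def assets_contains_resource(assets_json, resource_id):
    """Pure-Python illustration of: assets @> '[{"resourceId": "<id>"}]'
    True if any object in the assets array has the given resourceId."""
    assets = json.loads(assets_json)
    return any(
        isinstance(item, dict) and item.get("resourceId") == resource_id
        for item in assets
    )

# Hypothetical assets value for one event row:
assets = '[{"type": "invoice", "resourceId": "874874"}, {"type": "note"}]'
```

In other words, containment matches if at least one element of the array contains the given key/value pair; the query above ORs five such checks together.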
Of course we have an index on this field:
CREATE INDEX event_idx_assets ON event USING gin (assets jsonb_path_ops)
But… when you run the query, it can be REALLY slow. Let’s run explain analyze:
explain analyze
select "id", "created_at", "version", "name", "context", "assets"
from "event"
where ("assets" @> '[{"resourceId": "569ee61ee4b0e7dd960dcee3"}]')
   or ("assets" @> '[{"resourceId": "874874"}]')
   or ("assets" @> '[{"resourceId": "875187"}]')
   or ("assets" @> '[{"resourceId": "858164"}]')
   or ("assets" @> '[{"resourceId": "858567"}]')
order by "created_at" desc limit 1000 offset 0
Results:
(EXPLAIN ANALYZE output omitted — the plan does not use the event_idx_assets index)
Wait… It doesn’t use our index! And it’s very slow because of that. But why? If you read the documentation and StackOverflow discussions about JSONB indexing, you’ll see that it should work. There is no reason why it shouldn’t…
We spent a significant amount of time trying to understand why this index wasn’t being used. After a while we tried an experiment — what if we run the query with only one resourceId?
explain analyze
select "id", "created_at", "version", "name", "context", "assets"
from "event"
where ("assets" @> '[{"resourceId": "569ee61ee4b0e7dd960dcee3"}]')
order by "created_at" desc limit 1000 offset 0
(EXPLAIN ANALYZE output omitted — the plan uses the event_idx_assets index)
It works! How about 3?
explain analyze
select "id", "created_at", "version", "name", "context", "assets"
from "event"
where ("assets" @> '[{"resourceId": "569ee61ee4b0e7dd960dcee3"}]')
   or ("assets" @> '[{"resourceId": "874874"}]')
   or ("assets" @> '[{"resourceId": "875187"}]')
order by "created_at" desc limit 1000 offset 0
(EXPLAIN ANALYZE output omitted — the plan still uses the event_idx_assets index)
Success! It seems that somewhere internally, Postgres decides to plan this query differently depending on the number of conditions you pass on the GIN-indexed field. In our case the index was only used with 1, 2, or 3 conditions.
Solution
Well, you can’t really fix the indexing, but at least you know how to work with it :) We ended up chunking one big query with lots of conditions into multiple queries containing at most 3 conditions each and merging the results together. Unexpectedly, it’s much faster!
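The chunking approach can be sketched in Python roughly like this. This is a hedged sketch, not our production code: the `run_query` helper is hypothetical (stand-in for whatever DB client you use), the placeholders are psycopg-style `%s`, and the chunk size of 3 reflects what worked for our data — your planner may behave differently.

```python
import json
from itertools import islice

CHUNK_SIZE = 3  # largest OR-count for which the GIN index was used in our case

def chunks(items, size=CHUNK_SIZE):
    """Split items into lists of at most `size` elements."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def build_query(resource_ids):
    """Build the event query for one chunk, with %s placeholders (psycopg-style)."""
    conditions = " or ".join('("assets" @> %s)' for _ in resource_ids)
    sql = (
        'select "id", "created_at", "version", "name", "context", "assets" '
        'from "event" where ' + conditions +
        ' order by "created_at" desc limit 1000'
    )
    params = [json.dumps([{"resourceId": rid}]) for rid in resource_ids]
    return sql, params

def fetch_events(run_query, resource_ids, limit=1000):
    """run_query(sql, params) -> list of row dicts; a hypothetical DB helper.
    Runs one small query per chunk, then dedupes and re-sorts the union."""
    seen, merged = set(), []
    for group in chunks(resource_ids):
        for row in run_query(*build_query(group)):
            if row["id"] not in seen:   # one event can match ids from
                seen.add(row["id"])     # several different chunks
                merged.append(row)
    merged.sort(key=lambda r: r["created_at"], reverse=True)
    return merged[:limit]
```

Note that because each chunked query applies its own `limit`, the merged result is only guaranteed complete when each chunk returns fewer rows than that limit — the same caveat the original single query has.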