[imc-uk-tech] Batch producers, Postgres slowness, archives and viewallposts

Zak zak at riseup.net
Mon Mar 6 15:28:53 PST 2006


Sorry for the long email...

----
Summary: <Batch> producers are very slow when the total number of
batches is large, executing queries that are particularly slow in
Postgres (surprisingly so if you're used to MySQL). This particularly
affects the generation of the archive and "viewallposts" pages on IMC
UK. The latter is of particular concern as it is regenerated quite
frequently as material is posted to the newswire.
----


I've just been looking at the Postgres logs from traven, picking out the
slowest queries with a bit of Perl.

Apart from the COPY, VACUUM and ANALYZE statements from routing backup
and maintenance, these are almost all of the form:

897524.041 select count(*) from content where is_published=true and
to_article_type in (6, 1, 22)

or

802216.337 SELECT id, title, subtitle, edittitle, date, creator,
creator_main_url, creator_email, creator_address, creator_phone,
description, comment, source, is_published, is_produced, to_publisher,
to_language, to_rights, webdb_create, webdb_lastchange, content_data,
is_html, to_article_type, to_content, keywords, to_locking_user FROM
content WHERE is_published=true and to_article_type in (6, 1, 22) ORDER
BY webdb_create asc LIMIT 14 OFFSET 55920

where the initial figures are times in milliseconds. That means these
queries are taking well over 10 *MINUTES* each. Correlating the times
with the producer logs shows that these result from the viewallposts.new
producer. This is a "Batch" producer limited to 3 batches, which sounds
like it should be reasonably quick.

However, the algorithm employed (in
source/mir/producer/EntityBatchingProducerNode.java) involves two
operations that are particularly slow in Postgres.

(1) It begins by counting *all* the entities, not just those in the
batches being generated, in order to perform a remainder calculation to
determine the size of the most recent batch.

If you're used to MySQL, you'll expect "SELECT COUNT(*)" to be fast so
long as there is a suitable index. However, this is not the case in
Postgres -- if you're counting a large proportion of the rows in the
table, it will do a *FULL TABLE SCAN* with no help from any indices. See
the warning at
http://www.postgresql.org/docs/7.4/interactive/functions-aggregate.html


EXPLAIN SELECT COUNT(*) FROM content
                        WHERE is_published=true
                          AND to_article_type IN (6, 1, 22);

                             QUERY PLAN
------------------------------------------------------------------------
 Aggregate  (cost=461992.69..461992.69 rows=1 width=0)
   ->  Seq Scan on content  (cost=0.00..461098.44 rows=357700 width=0)
         Filter: ((is_published = true) AND
                  ((to_article_type = 6) OR
                   (to_article_type = 1) OR
                   (to_article_type = 22)))


(2) It then selects the individual batches using an ORDER BY which puts
the oldest entities first and a large OFFSET value to get the most
recent ones for the batches actually being produced.

Again, this works quite efficiently in MySQL but is terribly slow in
Postgres -- as far as I can tell it's like running "tail" in that the
skipped data still has to be scanned in order to find the records at the
end which you actually want.

It is *much* more efficient to restrict by the ordering columns in the
WHERE clause, if the boundary values for each batch can be calculated,
than to use OFFSET in this way.

Simply reversing the ORDER clause would improve the performance when
generating recent batches, but make generating old pages very slow (or
mid-range pages if we're clever and switch the ORDER for the two ends).


I don't have immediate solutions for either of these problems though.
This is also coupled with the issue that the archive pages are typically
very hard to navigate ("page 1411 of 2087" for example).



Zak.



More information about the imc-uk-tech mailing list