When is a SQL Subquery 260x Faster than a Left Join?

2017-08-17, by Andy Ellicott

Our friends at Gestalten.de, a design and software agency in Germany, pinged us recently on Github for some CrateDB query performance help.

In this blog post we'll go over the issue and take a look at how it was resolved using the new sub-query capability in CrateDB as an alternative to using a left join.

Introduction

Gestalten.de is currently developing an ecommerce webshop solution integrated as an extension to the TYPO3 CMS.

They use CrateDB as a database for storing and searching product data. They chose CrateDB because it allows them to scale the webshop easily, according to Gestalten.de CEO Frank Rakow.

They also use CrateDB to store cross-selling information. This information is used to inform users about "people who bought this product also bought X."

For each order, a collection of product pairs are stored in the cross-selling table. So, for example, if there are five products in the cart at checkout, then each product will individually pair with four products, resulting in 20 pairs of products stored in the cross-selling table.

Their CrateDB database contains two tables:

  • product
    • 3,300 rows
    • Multiple columns
    • Primary key is SKU (i.e. product ID)
  • xsell
    • 4.6 million rows
    • Three columns: SKU, cross_SKU, tstamp

Query Using Left Join

The cross-selling (Xsell) table is cleaned from time to time by removing pairs that include discontinued products.

To find those pairs, they tried using a LEFT JOIN query to get a list of all the products in the cross-sell table, which are no longer found in the product table:

SELECT x.sku FROM xsell x LEFT JOIN product p ON p.sku = x.sku 
WHERE p.sku IS NULL 
GROUP BY x.sku;

The query took 35 minutes to execute.

Analysis of the query by Crate.io found nothing wrong with the execution plan of the query.

The problem arises from having to process over 15 billion rows (cartesian product), filtering out the non-matching rows (IS NULL), and then applying the GROUP BY, which is basically a DISTINCT.

Solution: Use a Subquery

To speed this up, we suggested they rewrite the query to use a new feature of CrateDB: subquery expressions. In this case, the IN (SUBQUERY) expression.

This query optimization approach works well when one of the two joined tables has a relatively small number of rows.

Here's the new query:

SELECT x.sku FROM xsell x 
WHERE NOT x.sku IN (SELECT sku FROM product) 
GROUP BY x.sku;

This executed in 8 seconds. That's over 260x faster than before.

Note: at the time of writing this feature is in development, but will be available and fully supported in the next minor release of CrateDB.

Wrap Up

We’d love to hear from our community members. If you need help or have tips you want to share with other CrateDB users, please get in touch. Come talk to us on Github, StackOverflow, or Slack any time.

  • Follow us