Vectored Execution in Column/Row Stores

By Orri Erling <oerling@openlinksw.com>
2015-07-13
    This article discusses the relationship between vectored execution and column- and row-wise data representations. Column stores are traditionally considered to be good for big scans but poor at indexed access. This is not necessarily so, though. We take TPC-H Q9 as a starting point, working with different row- and column-wise data representations and index choices. The goal of the article is to provide a primer on the performance implications of different physical designs.

    All the experiments are against the TPC-H 100G dataset hosted in Virtuoso on the test system used before in the TPC-H series: dual Xeon E5-2630, 2x6 cores x 2 threads, 2.3GHz, 192 GB RAM. The Virtuoso version corresponds to the feature/analytics branch in the v7fasttrack github project. All run times are from memory, and queries generally run at full platform, 24 concurrent threads.

    We note that RDF stores and graph databases usually do not have secondary indices with multiple key parts. However, these systems do predominantly index-based access as opposed to big scans and hash joins. To explore the impact of this, we have decomposed the tables into projections with a single dependent column, which approximates a triple store or a vertically-decomposed graph database like Sparksee.

    So, in these experiments, we store the relevant data four times over, as follows:

    • 100G TPC-H dataset in the column-wise schema as discussed in the TPC-H series, now complemented with indices on l_partkey and on l_partkey, l_suppkey

    • The same in row-wise data representation

    • Column-wise tables with a single dependent column for l_partkey, l_suppkey, l_extendedprice, l_quantity, l_discount, ps_supplycost, s_nationkey, p_name. These all have the original table's primary key, e.g., l_orderkey, l_linenumber for the l_-prefixed tables

    • The same with row-wise tables

    The column-wise structures are in the DB qualifier, and the row-wise are in the R qualifier. There is a summary of space consumption at the end of the article. This is relevant for scalability, since even if row-wise structures can be faster for scattered random access, they will fit less data in RAM, typically 2 to 3x less. Thus, if "faster" rows cause the working set not to fit, "slower" columns will still win.
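    For concreteness, a small illustration (not taken from the original scripts): the same logical table exists under both qualifiers, so a test query can be pointed at either copy simply by qualifying the table name.

    -- Illustrative only: the same logical table, column-wise and row-wise copies.
    SELECT COUNT (*) FROM DB.DBA.LINEITEM;   -- column-wise copy
    SELECT COUNT (*) FROM R.DBA.LINEITEM;    -- row-wise copy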

    As a starting point, we know that the best Q9 is the one in the Virtuoso TPC-H implementation, described in Part 10 of the TPC-H blog series. This is a scan of lineitem with a selective hash join, followed by ordered index access of orders, then hash joins against the smaller tables. There are special tricks to keep the hash tables small by propagating restrictions from the probe side to the build side.
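    As a rough sketch of that trick (not the actual plan or query text), the build side of the hash table over partsupp can itself include the selective part restriction, so only rows for qualifying parts are hashed:

    -- Hedged sketch: pushing the probe-side restriction (p_name LIKE '%green%')
    -- into the hash build keeps the partsupp hash table small.
    SELECT ps_partkey, ps_suppkey, ps_supplycost
      FROM partsupp, part
     WHERE ps_partkey = p_partkey
       AND p_name LIKE '%green%';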

    The query texts are available here, along with the table declarations and scripts for populating the single-column projections. rs.sql makes the tables and indices, rsload.sql copies the data from the TPC-H tables.
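    As a hedged sketch of the shape of this physical design (generic SQL with illustrative names, not the literal contents of rs.sql or rsload.sql), the lineitem secondary indices and one single-column projection might look as follows:

    -- Secondary indices on the multicolumn lineitem table (used by the index-based variants).
    -- Per the space listing later in the article, the multipart index covers
    -- (l_partkey, l_suppkey, l_orderkey, l_linenumber), i.e., key parts plus the primary key.
    CREATE INDEX l_pk   ON lineitem (l_partkey);
    CREATE INDEX l_pksk ON lineitem (l_partkey, l_suppkey);

    -- One single-column projection: the original primary key plus one dependent column.
    CREATE TABLE l_partkey (
      l_orderkey    INTEGER NOT NULL,
      l_linenumber  INTEGER NOT NULL,
      l_partkey     INTEGER NOT NULL,
      PRIMARY KEY (l_orderkey, l_linenumber)
    );

    -- rsload.sql-style population from the base table.
    INSERT INTO l_partkey (l_orderkey, l_linenumber, l_partkey)
    SELECT l_orderkey, l_linenumber, l_partkey FROM lineitem;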

    The business question is to calculate the profit from sale of selected parts grouped by year and country of the supplier. This touches most of the tables, aggregates over 1/17 of all sales, and touches at least every page of the tables concerned, if not every row.

    SELECT
                                                                             n_name  AS  nation, 
                                                     EXTRACT(year FROM o_orderdate)  AS  o_year,
              SUM (l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity)  AS  sum_profit
        FROM  lineitem, part, partsupp, orders, supplier, nation
       WHERE    s_suppkey = l_suppkey
         AND   ps_suppkey = l_suppkey
         AND   ps_partkey = l_partkey
         AND    p_partkey = l_partkey
         AND   o_orderkey = l_orderkey
         AND  s_nationkey = n_nationkey
         AND  p_name LIKE '%green%'
    GROUP BY  nation, o_year
    ORDER BY  nation, o_year DESC
    

    Query Variants

    The query variants discussed here are:

    1. Hash based, the best plan -- 9h.sql

    2. Index based with multicolumn rows, with lineitem index on l_partkey -- 9i.sql, 9ir.sql

    3. Index based with multicolumn rows, lineitem index on l_partkey, l_suppkey -- 9ip.sql, 9ipr.sql

    4. Index based with one table per dependent column, index on l_partkey -- 9p.sql

    5. Index based with one table per dependent column, with materialized l_partkey, l_suppkey -> l_orderkey, l_linenumber -- 9pp.sql, 9ppr.sql (a sketch of this materialization follows below)
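    For variant 5, a hedged sketch of that materialization (the name l_pk_sk_map and the exact declaration are illustrative; the real one is created by the scripts referenced above):

    -- Hypothetical sketch: materialize (l_partkey, l_suppkey) -> (l_orderkey, l_linenumber)
    -- so the two-part key can drive the lookups into the single-column projections.
    CREATE TABLE l_pk_sk_map (
      l_partkey     INTEGER NOT NULL,
      l_suppkey     INTEGER NOT NULL,
      l_orderkey    INTEGER NOT NULL,
      l_linenumber  INTEGER NOT NULL,
      PRIMARY KEY (l_partkey, l_suppkey, l_orderkey, l_linenumber)
    );

    INSERT INTO l_pk_sk_map (l_partkey, l_suppkey, l_orderkey, l_linenumber)
    SELECT l_partkey, l_suppkey, l_orderkey, l_linenumber FROM lineitem;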

    These are done against row- and column-wise data representations with 3 different vectorization settings. The dynamic vector size starts at 10,000 values in a vector, and adaptively upgrades this to 1,000,000 if it finds that index access is too sparse. Accessing rows close to each other is more efficient than widely scattered rows in vectored index access, so using a larger vector will likely cause a denser, hence more efficient, access pattern.

    The 10K vector size corresponds to running with a fixed vector size. Vector size 1 effectively runs a tuple at a time, which corresponds to a non-vectorized engine.

    We note that lineitem and its single-column projections contain 600M rows. So, a vector of 10K values will hit, on average, every 60,000th row, and a vector of 1,000,000 will hit every 600th. This applies when doing random lookups in no specific order, e.g., getting lineitems by a secondary index on l_partkey.

    1 — Hash-based plan

    Vector        Dynamic    10k        1
    Column-wise   4.1 s      4.1 s      145 s
    Row-wise      25.6 s     25.9 s     45.4 s

    Dynamic vector size has no effect here, as there is no indexed access that would gain from more locality. The column store is much faster because it touches less memory: it scans just the l_partkey column, filters it with a Bloom filter, and then does the hash table lookup to pick only the items with the desired parts. The other columns are accessed only for the matching rows. The hash lookup is vectored, since hundreds of compressed l_partkey values are available at a time. The row store does the hash lookup row by row, hence losing cache locality and instruction-level parallelism.

    Without vectorization, we have a situation where the lineitem scan emits one row at a time. Restarting the scan with the column store takes much longer, since 5 buffers have to be located and pinned instead of one for the row store. The row store is thus slowed down less, but it too suffers almost a factor of 2 from interpretation overhead.

    2 — Index-based, lineitem indexed on l_partkey

    Vector        Dynamic    10k        1
    Column-wise   30.4 s     62.3 s     321 s
    Row-wise      31.8 s     27.7 s     122 s

    Here the plan scans part, then partsupp, which shares ordering with part; both are ordered on partkey. Then lineitem is fetched by a secondary index on l_partkey. This produces l_orderkey, l_linenumber, which are used to get the l_suppkey. We then check whether the l_suppkey matches the ps_suppkey from partsupp, which drops 3/4 of the rows. The next join is on orders, which shares ordering with lineitem; both are ordered on orderkey.

    There is a narrow win for columns with dynamic vector size. When access becomes scattered, rows win by 2.5x, because there is only one page to access instead of 1 + 3 for columns. This is compensated for if the next item is found on the same page, which happens if the access pattern is denser.

    3 — Index-based, lineitem indexed on l_partkey, l_suppkey

    Vector        Dynamic    10k        1
    Column-wise   16.9 s     47.2 s     151 s
    Row-wise      22.4 s     20.7 s     89 s

    This is similar to the previous, except that now only lineitems that match ps_partkey, ps_suppkey are accessed, as the secondary index has two columns. Access is more local. Columns thus win more with dynamic vector size.

    4 — Decomposed, index on l_partkey

    Vector        Dynamic    10k        1
    Column-wise   35.7 s     170 s      601 s
    Row-wise      44.5 s     56.2 s     130 s

    Now, each of l_extendedprice, l_discount, l_quantity, and l_suppkey is a separate index lookup. The times are slightly higher, but the dynamic is the same.
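    A hedged approximation of what this means for the query shape (the actual texts are 9p.sql and 9pp.sql, not reproduced here): every dependent column of lineitem becomes one more self-join on the (l_orderkey, l_linenumber) key, for example:

    -- Illustrative fragment only: assembling the profit expression from
    -- single-column projections takes one join per dependent column.
    SELECT SUM (ep.l_extendedprice * (1 - d.l_discount)
                - ps.ps_supplycost * q.l_quantity)
      FROM l_partkey pk
      JOIN l_suppkey sk        ON  sk.l_orderkey   = pk.l_orderkey
                               AND sk.l_linenumber = pk.l_linenumber
      JOIN l_extendedprice ep  ON  ep.l_orderkey   = pk.l_orderkey
                               AND ep.l_linenumber = pk.l_linenumber
      JOIN l_discount d        ON  d.l_orderkey    = pk.l_orderkey
                               AND d.l_linenumber  = pk.l_linenumber
      JOIN l_quantity q        ON  q.l_orderkey    = pk.l_orderkey
                               AND q.l_linenumber  = pk.l_linenumber
      JOIN partsupp ps         ON  ps.ps_partkey   = pk.l_partkey
                               AND ps.ps_suppkey   = sk.l_suppkey;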

    The non-vectored columns case is hit the hardest.

    5 — Decomposed, index on l_partkey, l_suppkey

    Vector        Dynamic    10k        1
    Column-wise   19.6 s     111 s      257 s
    Row-wise      32.0 s     37 s       74.9 s

    Again, we see the same dynamic as with a multicolumn table. Columns win slightly more at large vector sizes because of overall better index performance in the presence of locality.

    Space Utilization

    The following tables list the space consumption in megabytes of allocated pages. Unallocated space in database files is not counted.

    The row-wise table also contains entries for column-wise structures (DB.*) since these have a row-wise sparse index. The size of this is however negligible, under 1% of the column-wise structures.

    Row-Wise

    MB structure
    73515 R.DBA.LINEITEM
    14768 R.DBA.ORDERS
    11728 R.DBA.PARTSUPP
    10161 r_lpk_pk
    10003 r_l_pksk
    9908 R.DBA.l_partkey
    8761 R.DBA.l_extendedprice
    8745 R.DBA.l_discount
    8738 r_l_pk
    8713 R.DBA.l_suppkey
    6267 R.DBA.l_quantity
    2223 R.DBA.CUSTOMER
    2180 R.DBA.o_orderdate
    2041 r_O_CK
    1911 R.DBA.PART
    1281 R.DBA.ps_supplycost
    811 R.DBA.p_name
    127 R.DBA.SUPPLIER
    88 DB.DBA.LINEITEM
    24 DB.DBA.ORDERS
    11 DB.DBA.PARTSUPP
    9 R.DBA.s_nationkey
    5 l_pksk
    4 DB.DBA.l_partkey
    4 lpk_pk
    4 DB.DBA.l_extendedprice
    3 l_pk
    3 DB.DBA.l_suppkey
    2 DB.DBA.CUSTOMER
    2 DB.DBA.l_quantity
    1 DB.DBA.PART
    1 O_CK
    1 DB.DBA.l_discount

    Column-Wise

    MB structure
    36482 DB.DBA.LINEITEM
    13087 DB.DBA.ORDERS
    11587 DB.DBA.PARTSUPP
    5181 DB.DBA.l_extendedprice
    4431 l_pksk
    3072 DB.DBA.l_partkey
    2958 lpk_pk
    2918 l_pk
    2835 DB.DBA.l_suppkey
    2067 DB.DBA.CUSTOMER
    1618 DB.DBA.PART
    1156 DB.DBA.l_quantity
    961 DB.DBA.ps_supplycost
    814 O_CK
    798 DB.DBA.l_discount
    724 DB.DBA.p_name
    436 DB.DBA.o_orderdate
    126 DB.DBA.SUPPLIER
    1 DB.DBA.s_nationkey

    In both cases, the large tables are on top, but the column-wise case takes only half the space due to compression.

    We note that the single-column projections are smaller column-wise. l_extendedprice is not very compressible, hence its column-wise projection takes much more space than that of l_quantity; row-wise, the difference between the two is smaller. Since the leading key parts l_orderkey, l_linenumber are ordered and very compressible, the column-wise structures are in all cases noticeably more compact.

    The same applies to the multipart index l_pksk and r_l_pksk (l_partkey, l_suppkey, l_orderkey, l_linenumber) in column- and row-wise representations.

    Note that STRING columns (e.g., l_comment) are not compressed. If they were, the overall space ratio would be even more to the advantage of the column store.

    Conclusions

    Column stores and vectorization inextricably belong together. Column-wise compression yields great gains for indices as well, since sorted data is easy to compress. Even for non-sorted data, adaptive use of dictionaries, run lengths, etc., produces great space savings. Columns also win with indexed access if there is locality.

    Row stores have less dependence on locality, but they too win by a factor of 3 from vectorization, dropping interpretation overhead and exploiting join locality.

    For point lookups, columns lose by a factor of 2 or more, but considering their better space efficiency, they will still win if the space savings keep the working set out of secondary storage. For bulk random access, as in graph analytics, columns will win because they can operate on a large vector of keys to fetch.

    For many workloads, from TPC-H to the LDBC social network benchmark, multi-part keys are a necessary component of physical design for performance if indexed access predominates. Triple stores and most graph databases do not have such keys and are therefore at a disadvantage. Self-joining, as in RDF or other vertically decomposed structures, can cost up to a factor of 10-20 over a column-wise multicolumn table. This depends, however, on the density of access.

    For analytical workloads, where the dominant join pattern is the scan with a selective hash join, column stores are unbeatable, as per common wisdom. There are good physical reasons for this, and the row store, even with well-implemented vectorization, loses by a factor of 5.

    For decomposed structures, like RDF quads or single column projections of tables, column stores are relatively more advantageous because the key columns are extensively repeated, and these compress better with columns than with rows. In all the RDF workloads we have tried, columns never lose, but there is often a draw between rows and columns for lookup workloads. The longer the query, the more columns win.
