Big Data, Part 1: Virtuoso Meets Hive

Orri Erling <oerling@openlinksw.com>
2015-07-13

In this series, we will look at Virtuoso and some of the big data technologies out there. SQL on Hadoop is of interest, as well as NoSQL technologies.

We begin at the beginning, with Hive, the grand-daddy of SQL on Hadoop.

The test platform is two Amazon R3.8 AMI instances. We compared Hive with the Virtuoso 100G TPC-H experiment on the same platform, published earlier on this blog. In both cases, the runs follow a bulk load, with all data served from memory. The platform has 2 x 244 GB RAM, with only about 40 GB of working set.

The Virtuoso version and settings are as in the Virtuoso Cluster test AMI.

The Hive version is 0.14 from the Hortonworks HDP 2.2 distribution. The Hive schema and query formulations are the ones from hive-testbench on GitHub. The Hive configuration parameters are as set by Ambari 2.0.1. These differ from the ones in hive-testbench, but the Ambari choices offer higher performance on the platform. We ran statistics with Hive and did not specify any settings beyond those in hive-testbench, so we assume the query plans were as good as Hive will make them. Platform utilization was even across both machines and varied between 30% and 100% of the 2 x 32 hardware threads.
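
For readers who want to reproduce the statistics step, the Hive commands are of this general form (a sketch, not the literal commands of this run; lineitem and orders are TPC-H tables, and the remaining tables would be analyzed the same way):

    -- Table-level statistics: row counts and sizes.
    ANALYZE TABLE lineitem COMPUTE STATISTICS;
    ANALYZE TABLE orders   COMPUTE STATISTICS;

    -- Column-level statistics: distinct values, min/max, null counts,
    -- used by the optimizer to order joins and size hash tables.
    ANALYZE TABLE lineitem COMPUTE STATISTICS FOR COLUMNS;
    ANALYZE TABLE orders   COMPUTE STATISTICS FOR COLUMNS;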

Load time with Hive was 742 seconds against 232 seconds with Virtuoso. In both cases, this was a copy from 32 CSV files into the native database format; for Hive, this is ORC (Optimized Row Columnar). In Virtuoso, there is one index, on o_custkey; in Hive, there are no indices.
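
On the Hive side, such a load is typically staged through an external text table and then rewritten into ORC, roughly as follows (a sketch under assumed names; the HDFS path is hypothetical and the actual hive-testbench DDL differs in details):

    -- Stage: an external table over the CSV (text) files on HDFS.
    CREATE EXTERNAL TABLE lineitem_text (
      l_orderkey      BIGINT,
      l_partkey       BIGINT,
      l_suppkey       BIGINT,
      l_linenumber    INT,
      l_quantity      DOUBLE,
      l_extendedprice DOUBLE,
      l_discount      DOUBLE,
      l_tax           DOUBLE,
      l_returnflag    STRING,
      l_linestatus    STRING,
      l_shipdate      STRING,
      l_commitdate    STRING,
      l_receiptdate   STRING,
      l_shipinstruct  STRING,
      l_shipmode      STRING,
      l_comment       STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
    LOCATION '/tpch/100/lineitem';

    -- Load: rewrite the text into the native columnar format (ORC).
    CREATE TABLE lineitem STORED AS ORC
    AS SELECT * FROM lineitem_text;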

Query   Virtuoso    Hive         Notes
Load     332 s       742 s       Data Load
Q1         1.098 s   296.636 s
Q2         0.187 s  >3600 s      Hive Timeout
Q3         0.761 s    98.652 s
Q4         0.205 s   147.867 s
Q5         0.808 s   114.782 s
Q6         2.403 s    71.789 s
Q7         0.590 s   394.201 s
Q8         0.775 s  >3600 s      Hive Timeout
Q9         1.836 s  >3600 s      Hive Timeout
Q10        3.165 s   179.646 s
Q11        1.370 s    43.094 s
Q12        0.356 s   101.193 s
Q13        2.233 s   208.476 s
Q14        0.488 s    89.047 s
Q15        0.720 s   136.431 s
Q16        0.814 s   105.652 s
Q17        0.681 s   255.848 s
Q18        1.324 s   337.921 s
Q19        0.417 s  >3600 s      Hive Timeout
Q20        0.792 s   193.965 s
Q21        0.720 s   670.718 s
Q22        0.155 s    68.462 s
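
For reference, Q1, the best-behaved of the Hive queries above, is a single scan of lineitem with a small GROUP BY; in its standard TPC-H form it looks approximately like this (a sketch; the literal hive-testbench text may differ in details such as the date constant):

    -- TPC-H Q1: pricing summary report. One pass over lineitem with
    -- only a handful of grouping cells, so nearly all of the time is
    -- scan plus aggregation.
    SELECT l_returnflag, l_linestatus,
           SUM(l_quantity)                                       AS sum_qty,
           SUM(l_extendedprice)                                  AS sum_base_price,
           SUM(l_extendedprice * (1 - l_discount))               AS sum_disc_price,
           SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
           AVG(l_quantity)                                       AS avg_qty,
           AVG(l_extendedprice)                                  AS avg_price,
           AVG(l_discount)                                       AS avg_disc,
           COUNT(*)                                              AS count_order
    FROM   lineitem
    WHERE  l_shipdate <= '1998-09-02'
    GROUP BY l_returnflag, l_linestatus
    ORDER BY l_returnflag, l_linestatus;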

Relative to Virtuoso, Hive does best on bulk load. This is understandable, since bulk load is a sequential read of many files in parallel, with only compression to do.

Hive's query times are clearly affected by the lack of a persistent memory image of the data: data is always streamed from the storage files into other files as MapReduce intermediate results. This appears to be operator-at-a-time execution, as opposed to Virtuoso's vectorized streaming.

The queries that would do partitioned hash joins (e.g., Q9) did not finish under an hour in Hive, so we do not have a good metric for a cross-partition hash join. Q9 illustrates why such a join is needed, as the sketch below shows.
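
Q9 joins lineitem to part, supplier, partsupp, orders, and nation, so the build sides of the hash joins are large and, on a two-machine cluster, must be partitioned across both machines. In outline (the standard TPC-H formulation; the literal hive-testbench text may differ):

    -- TPC-H Q9: profit by nation and year over a 6-table join.
    -- The partsupp and part join inputs are large, which is what
    -- forces a partitioned (cross-machine) hash join.
    SELECT nation, o_year, SUM(amount) AS sum_profit
    FROM (
      SELECT n_name                              AS nation,
             YEAR(o_orderdate)                   AS o_year,
             l_extendedprice * (1 - l_discount)
               - ps_supplycost * l_quantity      AS amount
      FROM part
      JOIN lineitem ON p_partkey  = l_partkey
      JOIN supplier ON s_suppkey  = l_suppkey
      JOIN partsupp ON ps_suppkey = l_suppkey AND ps_partkey = l_partkey
      JOIN orders   ON o_orderkey = l_orderkey
      JOIN nation   ON s_nationkey = n_nationkey
      WHERE p_name LIKE '%green%'
    ) profit
    GROUP BY nation, o_year
    ORDER BY nation, o_year DESC;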

One could argue that one should benchmark Hive only in disk-bound circumstances. We may yet get to this.

Our next stop will probably be Impala, which ought to do much better than Hive, as it does not have the MapReduce overheads.

If you are a Hive expert and believe that Hive should have done much better, please let us know how to improve the Hive scores, and we will retry.
