Big Data, Part 1: Virtuoso Meets Hive

Orri Erling <oerling@openlinksw.com>
2015-07-13

In this series, we will look at Virtuoso and some of the big data technologies out there. SQL on Hadoop is of interest, as well as NoSQL technologies.

We begin at the beginning, with Hive, the grand-daddy of SQL on Hadoop.

The test platform is two Amazon R3.8 AMI instances. We compared Hive with the Virtuoso 100G TPC-H experiment on the same platform, published earlier on this blog. In both cases, the runs follow a bulk load, with all data served from memory. The platform has 2 x 244 GB RAM, with only about 40 GB of working set.

The Virtuoso version and settings are as in the Virtuoso Cluster test AMI.

The Hive version is 0.14 from the Hortonworks HDP 2.2 distribution. The Hive schema and query formulations are the ones from hive-testbench on GitHub. The Hive configuration parameters are as set by Ambari 2.0.1. These differ from the ones in hive-testbench, but the Ambari choices offer higher performance on the platform. We ran statistics with Hive and did not specify any settings beyond those in hive-testbench, so we assume the query plans were as good as Hive will make them. Platform utilization was even across both machines and varied between 30% and 100% of the 2 x 32 hardware threads.
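
For readers who want to reproduce the statistics step, the Hive commands are of this general form (a sketch, not the literal commands of this run; lineitem and orders are TPC-H tables, and the remaining tables would be analyzed the same way):

    -- Table-level statistics: row counts and sizes.
    ANALYZE TABLE lineitem COMPUTE STATISTICS;
    ANALYZE TABLE orders   COMPUTE STATISTICS;

    -- Column-level statistics: distinct values, min/max, null counts,
    -- used by the optimizer to order joins and size hash tables.
    ANALYZE TABLE lineitem COMPUTE STATISTICS FOR COLUMNS;
    ANALYZE TABLE orders   COMPUTE STATISTICS FOR COLUMNS;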

Load time with Hive was 742 seconds against 232 seconds with Virtuoso. In both cases, this was a copy from 32 CSV files into the native database format; for Hive, this is ORC (Optimized Row Columnar). In Virtuoso, there is one index, on o_custkey; in Hive, there are no indices.
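
On the Hive side, such a load is typically staged through an external text table and then rewritten into ORC, roughly as follows (a sketch under assumed names; the HDFS path is hypothetical and the actual hive-testbench DDL differs in details):

    -- Stage: an external table over the CSV (text) files on HDFS.
    CREATE EXTERNAL TABLE lineitem_text (
      l_orderkey      BIGINT,
      l_partkey       BIGINT,
      l_suppkey       BIGINT,
      l_linenumber    INT,
      l_quantity      DOUBLE,
      l_extendedprice DOUBLE,
      l_discount      DOUBLE,
      l_tax           DOUBLE,
      l_returnflag    STRING,
      l_linestatus    STRING,
      l_shipdate      STRING,
      l_commitdate    STRING,
      l_receiptdate   STRING,
      l_shipinstruct  STRING,
      l_shipmode      STRING,
      l_comment       STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
    LOCATION '/tpch/100/lineitem';

    -- Load: rewrite the text into the native columnar format (ORC).
    CREATE TABLE lineitem STORED AS ORC
    AS SELECT * FROM lineitem_text;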

Query   Virtuoso    Hive         Notes
Load     332 s       742 s       Data Load
Q1         1.098 s   296.636 s
Q2         0.187 s  >3600 s      Hive Timeout
Q3         0.761 s    98.652 s
Q4         0.205 s   147.867 s
Q5         0.808 s   114.782 s
Q6         2.403 s    71.789 s
Q7         0.590 s   394.201 s
Q8         0.775 s  >3600 s      Hive Timeout
Q9         1.836 s  >3600 s      Hive Timeout
Q10        3.165 s   179.646 s
Q11        1.370 s    43.094 s
Q12        0.356 s   101.193 s
Q13        2.233 s   208.476 s
Q14        0.488 s    89.047 s
Q15        0.720 s   136.431 s
Q16        0.814 s   105.652 s
Q17        0.681 s   255.848 s
Q18        1.324 s   337.921 s
Q19        0.417 s  >3600 s      Hive Timeout
Q20        0.792 s   193.965 s
Q21        0.720 s   670.718 s
Q22        0.155 s    68.462 s
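
For reference, Q1, the best-behaved of the Hive queries above, is a single scan of lineitem with a small GROUP BY; in its standard TPC-H form it looks approximately like this (a sketch; the literal hive-testbench text may differ in details such as the date constant):

    -- TPC-H Q1: pricing summary report. One pass over lineitem with
    -- only a handful of grouping cells, so nearly all of the time is
    -- scan plus aggregation.
    SELECT l_returnflag, l_linestatus,
           SUM(l_quantity)                                       AS sum_qty,
           SUM(l_extendedprice)                                  AS sum_base_price,
           SUM(l_extendedprice * (1 - l_discount))               AS sum_disc_price,
           SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
           AVG(l_quantity)                                       AS avg_qty,
           AVG(l_extendedprice)                                  AS avg_price,
           AVG(l_discount)                                       AS avg_disc,
           COUNT(*)                                              AS count_order
    FROM   lineitem
    WHERE  l_shipdate <= '1998-09-02'
    GROUP BY l_returnflag, l_linestatus
    ORDER BY l_returnflag, l_linestatus;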

Relative to Virtuoso, Hive does best on bulk load. This is understandable, since bulk load is a sequential read of many files in parallel, with only compression to do.

Hive's query times are clearly affected by the lack of a persistent memory image of the data: data is always streamed from the storage files into other files as MapReduce intermediate results. This appears to be operator-at-a-time execution, as opposed to Virtuoso's vectorized streaming.

The queries that would do partitioned hash joins (e.g., Q9) did not finish under an hour in Hive, so we do not have a good metric for a cross-partition hash join. Q9 illustrates why such a join is needed, as the sketch below shows.
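
Q9 joins lineitem to part, supplier, partsupp, orders, and nation, so the build sides of the hash joins are large and, on a two-machine cluster, must be partitioned across both machines. In outline (the standard TPC-H formulation; the literal hive-testbench text may differ):

    -- TPC-H Q9: profit by nation and year over a 6-table join.
    -- The partsupp and part join inputs are large, which is what
    -- forces a partitioned (cross-machine) hash join.
    SELECT nation, o_year, SUM(amount) AS sum_profit
    FROM (
      SELECT n_name                              AS nation,
             YEAR(o_orderdate)                   AS o_year,
             l_extendedprice * (1 - l_discount)
               - ps_supplycost * l_quantity      AS amount
      FROM part
      JOIN lineitem ON p_partkey  = l_partkey
      JOIN supplier ON s_suppkey  = l_suppkey
      JOIN partsupp ON ps_suppkey = l_suppkey AND ps_partkey = l_partkey
      JOIN orders   ON o_orderkey = l_orderkey
      JOIN nation   ON s_nationkey = n_nationkey
      WHERE p_name LIKE '%green%'
    ) profit
    GROUP BY nation, o_year
    ORDER BY nation, o_year DESC;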

One could argue that one should benchmark Hive only in disk-bound circumstances. We may yet get to this.

Our next stop will probably be Impala, which ought to do much better than Hive, as it does not have the MapReduce overheads.

If you are a Hive expert and believe that Hive should have done much better, please let us know how to improve the Hive scores, and we will retry.
