强者更强：HAWQ性能的思考

Apr 20, 2016 1 分钟阅读时间

—— 作者：李明(email: mli@apache.org)

SQL on Hadoop产品非常繁多，从性能角度来看，比较好的有HAWQ、Impala。虽然目前HAWQ还是大幅领先于Impala（参见：pivotal公布的hawq性能测试)，但是从最近的Impala的新版本发布公告来看，它最近一直集中精力实现各种新的优化手段。为了使HAWQ继续保持竞争力，我们有必要学习其他竞争对手的优点，才能做到“知彼知己者，百战不殆”。

本文的观点仅代表本人观点，用于学习探讨交流。若因此出现经济、法律等方面的，与本人无关。

HAWQ TO DO LIST

Data skipping for I/O bound query:
- parquet/orc index( min/max/bloom filter) usage
- Runtime filter
- Dynamic partition pruning
Intra-Operator parallel processing vs vSeg number:
- IO intensive operator: base table scan: thread number = disk rotationar number
- CPU intensive operator: joins/aggregations/sort/top-N, thread number = cpu core number
LLVM Codegen for CPU bound query:
- Function call unrolling and branch pruning: Expression/loop/no switching
- CPU intensive operator: joins/aggregations/sort/top-N
- more hotspots: counting the elements of a complex column, Checking for overflow in DECIMAL multiplication, …
Use HDFS caching feature to “pin” entire tables or individual partitions in memory, to speed up queries on frequently accessed data and reduce the CPU overhead of memory-to-memory copying.
Duplicate small table to all segments to speed up join with other distributed tables.
streaming pre-aggregation: decides at run time whether it is more efficient to do an initial aggregation phase and pass along a smaller set of intermediate data, or to pass raw intermediate data back to next phase of query processing to be aggregated there.
Optimization for small queries let Impala process queries that process very few rows without the unnecessary overhead of parallelizing and generating native code.
more parallel processing: SIMD, vectorization
Distributed by hash is one kind of partition?
better info for different level: explain/SUMMARY/profile

hawq impala

李明

数据库内核研发工程师

主要从事数据库内核研发、数据库测试、数据库管理工具研发等相关工作。特别对PostgreSQL及其相关的项目有相当的兴趣。目前主要在做HAWQ和GPDB等并行数据库的内核研发工作。