—— 作者:李明(email: mli@apache.org)
SQL on Hadoop产品非常繁多,从性能角度来看,比较好的有HAWQ、Impala。虽然目前HAWQ还是大幅领先于Impala(参见:pivotal公布的hawq性能测试),但是从最近的Impala的新版本发布公告来看,它最近一直集中精力实现各种新的优化手段。为了使HAWQ继续保持竞争力,我们有必要学习其他竞争对手的优点,才能做到“知彼知己者,百战不殆”。
HAWQ TO DO LIST
Data skipping for I/O bound query:
- parquet/orc index( min/max/bloom filter) usage
- Runtime filter
- Dynamic partition pruning
- parquet/orc index( min/max/bloom filter) usage
Intra-Operator parallel processing vs vSeg number:
- IO intensive operator: base table scan: thread number = disk rotationar number
- CPU intensive operator: joins/aggregations/sort/top-N, thread number = cpu core number
LLVM Codegen for CPU bound query:
- Function call unrolling and branch pruning: Expression/loop/no switching
- CPU intensive operator: joins/aggregations/sort/top-N
- more hotspots: counting the elements of a complex column, Checking for overflow in DECIMAL multiplication, …
Use HDFS caching feature to “pin” entire tables or individual partitions in memory, to speed up queries on frequently accessed data and reduce the CPU overhead of memory-to-memory copying.
Duplicate small table to all segments to speed up join with other distributed tables.
streaming pre-aggregation: decides at run time whether it is more efficient to do an initial aggregation phase and pass along a smaller set of intermediate data, or to pass raw intermediate data back to next phase of query processing to be aggregated there.
Optimization for small queries let Impala process queries that process very few rows without the unnecessary overhead of parallelizing and generating native code.
more parallel processing: SIMD, vectorization
Distributed by hash is one kind of partition?
better info for different level: explain/SUMMARY/profile