Carbon

For each column stored in Carbon, the minimum and maximum values are recorded in a metadata structure called DataChunk.

It is recommended to choose a suitable column as the first sort column in Carbon.
The columns that follow it are not globally sorted; they are sorted only among rows where the preceding column has the same value.

A query on the first sort column benefits greatly from the min-max index, because only a small part of the data needs to be scanned.
By contrast, the min-max index of the other columns may perform poorly. If the values are sparse (scattered across the whole range), the min-max index is useless.
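The difference can be illustrated with a small sketch. It is not Carbon's implementation: `Block` and `blocks_to_scan` are hypothetical names, and each block simply keeps the min and max of the values it holds. A block is scanned only if the filter value falls inside its [min, max] range.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """Hypothetical stand-in for a chunk of column data plus its min-max metadata."""
    values: List[int]

    @property
    def min_value(self) -> int:
        return min(self.values)

    @property
    def max_value(self) -> int:
        return max(self.values)


def blocks_to_scan(blocks: List[Block], target: int) -> List[int]:
    """Return the indexes of blocks whose [min, max] range may contain target."""
    return [i for i, b in enumerate(blocks)
            if b.min_value <= target <= b.max_value]


# First sort column: block ranges do not overlap, so pruning is effective.
sorted_blocks = [Block([1, 2, 3]), Block([4, 5, 6]), Block([7, 8, 9])]
# Another column: values are scattered, so every block spans a wide range.
scattered_blocks = [Block([1, 5, 9]), Block([2, 6, 8]), Block([3, 4, 7])]

print(blocks_to_scan(sorted_blocks, 5))     # [1]
print(blocks_to_scan(scattered_blocks, 5))  # [0, 1, 2]
```

In the scattered case only the first block actually contains the value 5, yet min-max cannot exclude the other two.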

For this reason, the Bloom filter technique was introduced into Carbon.

Before describing how the Bloom filter is used in Carbon, we first need to introduce user-defined indexes.

This kind of index is stored outside the Carbon data files. It is an additional index and can work together with the min-max index.

Currently, CarbonData supports three types of index: BloomFilter Index, Lucene Index, and Secondary Index.
The index is built while new data is loaded, and it can also be rebuilt for existing data.
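As a rough illustration of why a Bloom filter helps on columns where min-max does not, the sketch below builds one small filter per block and uses it to pick candidate blocks for an equality query. This is a toy under stated assumptions, not Carbon's BloomFilter Index: the class, the bit-array size, the number of hashes, and the SHA-256 based hashing are all chosen here for illustration only.

```python
import hashlib
from typing import Iterable, List


class BloomFilter:
    """Toy Bloom filter: no false negatives, a small chance of false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, value: str) -> List[int]:
        # Derive num_hashes bit positions by salting the value with a seed.
        positions = []
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            positions.append(int.from_bytes(digest[:8], "big") % self.num_bits)
        return positions

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits[pos] = 1

    def might_contain(self, value: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(value))


def build_filter(values: Iterable[str]) -> BloomFilter:
    bloom = BloomFilter()
    for value in values:
        bloom.add(value)
    return bloom


# Values are scattered across blocks, so block ranges overlap heavily.
blocks = [["alice", "zoe"], ["bob", "yan"], ["carol", "xu"]]
filters = [build_filter(block) for block in blocks]

# Only blocks whose filter reports a possible match need to be scanned.
candidates = [i for i, f in enumerate(filters) if f.might_contain("bob")]
print(candidates)
```

The block that really holds the value is always among the candidates. Other blocks appear only on a false positive, whose probability depends on the filter size and the number of hashes.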

PR

3804

Why is this PR needed?
Currently Carbon uses a priority queue to sort the holders of sorted rows.
It first polls a holder from the heap, and adds it back if the holder is not empty.
This causes heap maintenance twice for each row. We can cut that work in half,
and tests show this saves about one third of the time needed to get rows.

What changes were proposed in this PR?
Currently, polling an item from the priority queue does the following:

Remove the first item.
Move the last item to the position of the first item, then siftDown the new first item.
In this patch, we peek at the first item (without removing it from the heap) and get a row,
then siftDown the holder to its proper position if the holder is not empty.

The holder is not removed from the heap; it is sifted down directly, which lowers the cost.
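The idea can be sketched as a k-way merge over a hand-written binary min-heap. This is a minimal illustration and not the code from the PR: the holder is modeled as a two-element list `[current_value, iterator]`, and `sift_down` and `merge_sorted` are hypothetical names. After a row is taken from the top holder, the holder stays at the root with its next value and is sifted down once, instead of being polled and added back.

```python
from typing import Iterator, List

_DONE = object()  # sentinel marking an exhausted holder


def sift_down(heap: List[list], pos: int) -> None:
    """Move heap[pos] down until neither child has a smaller current value."""
    size = len(heap)
    item = heap[pos]
    while True:
        child = 2 * pos + 1
        if child >= size:
            break
        right = child + 1
        if right < size and heap[right][0] < heap[child][0]:
            child = right
        if heap[child][0] >= item[0]:
            break
        heap[pos] = heap[child]
        pos = child
    heap[pos] = item


def merge_sorted(runs: List[List[int]]) -> Iterator[int]:
    """Merge sorted runs, touching the heap once per emitted row."""
    heap = []
    for run in runs:
        it = iter(run)
        first = next(it, _DONE)
        if first is not _DONE:
            heap.append([first, it])
    for pos in range(len(heap) // 2 - 1, -1, -1):  # build the heap
        sift_down(heap, pos)

    while heap:
        holder = heap[0]                 # peek, do not remove
        yield holder[0]
        nxt = next(holder[1], _DONE)
        if nxt is _DONE:
            last = heap.pop()            # holder is empty: drop it
            if heap:
                heap[0] = last
                sift_down(heap, 0)
        else:
            holder[0] = nxt              # same holder, new current row
            sift_down(heap, 0)


print(list(merge_sorted([[1, 4, 7], [2, 5, 8], [3, 6, 9]])))
# [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Python's standard library offers the same shortcut as `heapq.heapreplace`, which replaces the root and restores the heap in one pass rather than a pop followed by a push.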

3603

Why is this PR needed?
In local sort, multiple threads are used for each partition, but they all add rows to the same object under a lock. The sort and write operations run only after that.
For better performance, we want the sort and write (sort temp file) operations to run in parallel.

What changes were proposed in this PR?
Remove the object lock when adding rows to (Unsafe)SortDataRows.
Keep the object lock in (Unsafe)IntermediateMerger to collect the results of all threads.

For unsafe local sort, the time is reduced by about 40% with multiple cores in one test case.

Performance-related configuration includes carbon.number.of.cores.while.loading and yarn.nodemanager.local-dirs.

Performance improves by about one third, depending on the configured number of cores.
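The lock placement described for this PR can be sketched as follows. This is only a structural illustration with hypothetical names (`sort_partition`, `local_sort`, and a toy `IntermediateMerger` that shares nothing with Carbon's class except the role): each thread fills and sorts its own private buffer without any lock, and only the hand-off of a finished sorted run to the shared merger is guarded. In CPython the threads do not give real CPU parallelism for the sort itself, so the sketch shows where the lock sits, not the speedup.

```python
import heapq
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import List


class IntermediateMerger:
    """Toy collector: the only shared state, so the only place that locks."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._sorted_runs: List[List[int]] = []

    def add_run(self, run: List[int]) -> None:
        with self._lock:
            self._sorted_runs.append(run)

    def merge(self) -> List[int]:
        return list(heapq.merge(*self._sorted_runs))


def sort_partition(rows: List[int], merger: IntermediateMerger,
                   buffer_size: int = 4) -> None:
    """Buffer rows privately, sort each full buffer, hand it to the merger."""
    buffer: List[int] = []          # private to this thread: no lock needed
    for row in rows:
        buffer.append(row)
        if len(buffer) == buffer_size:
            merger.add_run(sorted(buffer))
            buffer = []
    if buffer:
        merger.add_run(sorted(buffer))


def local_sort(partitions: List[List[int]], cores: int = 2) -> List[int]:
    merger = IntermediateMerger()
    with ThreadPoolExecutor(max_workers=cores) as pool:
        futures = [pool.submit(sort_partition, p, merger) for p in partitions]
        for future in futures:
            future.result()         # re-raise any worker exception
    return merger.merge()


print(local_sort([[5, 3, 9, 1, 7], [8, 2, 6], [4, 0]]))
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```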

3649

Remove unnecessary data conversion.
Refactor the code.

Manhua


Carbondata PR
