Features
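bloom filter - a kind of datamap
- Trades space for time
- First introduce minmax:
- When the data is scattered, minmax has little effect
- When the data is clustered,
- When the column is not a sort column
- Parameter configuration is a fairly hard problem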
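To make the minmax versus bloom filter point concrete, here is a minimal Java sketch. It is not CarbonData's datamap code: the class `BlockPruningSketch`, the toy `SimpleBloom` filter, and its hash mixing are all hypothetical and chosen only for illustration. It shows why a min/max range cannot prune a block whose values are scattered, while a bloom filter answers membership directly at the cost of extra bits.

```java
import java.util.BitSet;

// Hypothetical illustration only; not CarbonData's datamap implementation.
class BlockPruningSketch {

    // Min/max pruning: the block can be skipped only when value is outside [min, max].
    static boolean minMaxMightContain(int[] block, int value) {
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int v : block) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        return value >= min && value <= max;
    }

    // Toy bloom filter: extra space (the bit set) buys a per-value membership test.
    static final class SimpleBloom {
        private final BitSet bits;
        private final int size;
        private final int hashes;

        SimpleBloom(int size, int hashes) {
            this.bits = new BitSet(size);
            this.size = size;
            this.hashes = hashes;
        }

        // Assumed hash mixing for the sketch; any reasonable hash family would do.
        private int index(int value, int i) {
            int h = value * 0x9E3779B1 + i * 0x85EBCA6B;
            h ^= (h >>> 16);
            return Math.floorMod(h, size);
        }

        void add(int value) {
            for (int i = 0; i < hashes; i++) {
                bits.set(index(value, i));
            }
        }

        // May return a false positive, never a false negative.
        boolean mightContain(int value) {
            for (int i = 0; i < hashes; i++) {
                if (!bits.get(index(value, i))) {
                    return false;
                }
            }
            return true;
        }
    }
}
```

For a block such as `{1, 1000000, 500}`, the min/max check says the block might contain 42, so it cannot be skipped. A bloom filter built over the same values will usually report 42 as absent, depending on the chosen bit-set size and number of hashes, which is the parameter tuning problem noted above.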
Comparison
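Very long strings (over 32k)
- A Carbon design issue: the string length was originally stored as a short. The separate handling exists only for compatibility; in Spark the type is represented as string in every case.
- Use an integer instead of a short to store the length of the bytes content
- Similar problem: during snappy compression, the byte[] length is an int, so it is limited as well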
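The short versus int length limit can be sketched as follows. This is an illustrative Java example, not CarbonData's actual on-disk layout; the class and method names are hypothetical. It shows that a 2-byte length prefix cannot describe content longer than `Short.MAX_VALUE` (32767) bytes, while a 4-byte prefix can.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration of a length-prefixed value; not CarbonData's format.
class LengthPrefixSketch {

    // 2-byte length prefix: content longer than 32767 bytes cannot be represented.
    static byte[] encodeWithShortLength(byte[] content) {
        if (content.length > Short.MAX_VALUE) {
            throw new IllegalArgumentException(
                "content length " + content.length + " does not fit in a short");
        }
        return ByteBuffer.allocate(2 + content.length)
            .putShort((short) content.length)
            .put(content)
            .array();
    }

    // 4-byte length prefix: limited only by the maximum Java array length.
    static byte[] encodeWithIntLength(byte[] content) {
        return ByteBuffer.allocate(4 + content.length)
            .putInt(content.length)
            .put(content)
            .array();
    }

    // Reads back the 4-byte length prefix.
    static int decodeIntLength(byte[] encoded) {
        return ByteBuffer.wrap(encoded).getInt();
    }
}
```

Without the explicit check, the cast `(short) 40000` would silently wrap to a negative value, which is why the prefix width has to change rather than the cast. The int prefix still has a ceiling, since a Java `byte[]` is itself indexed by an int.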
Compression configuration for zstd and other codecs
SearchMode for basic filter query: suited to simple queries (select ... from ... where ... limit ...) when resources are tight
Currently, if a query generates many tasks, they can be jammed waiting on the available cores. For a basic filter query (with a limited number of result rows), we can add a new strategy that assigns each node only one task holding multiple splits and scans all the splits in parallel (returning the result immediately instead of using an iterator).
The key change is setting the parallelism of CarbonScanRDD to 0 (so that noOfTasksPerNode ends up as 1).
[CARBONDATA-2297] Support SEARCH_MODE for basic filter query by kevinjmh · Pull Request #2123 · apache/carbondata (github.com)
[CARBONDATA-2297] Support SEARCH_MODE for basic filter query - ASF JIRA (apache.org)
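The "one task, many splits scanned in parallel" strategy can be sketched roughly as below. This is a simplified Java illustration, not the code from the pull request: a split is modelled as a list of integers, the filter is a hypothetical `value > threshold`, and the class `SearchModeSketch` is invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration; a "split" here is just a list of integers.
class SearchModeSketch {

    // One task receives all splits of a node and scans them concurrently,
    // returning a materialized result instead of a lazy iterator.
    static List<Integer> scanSplits(List<List<Integer>> splits, int threshold, int limit) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, splits.size()));
        try {
            List<Future<List<Integer>>> futures = new ArrayList<>();
            for (List<Integer> split : splits) {
                futures.add(pool.submit(() -> {
                    List<Integer> hits = new ArrayList<>();
                    for (int v : split) {
                        if (v > threshold) {
                            hits.add(v);
                        }
                        if (hits.size() >= limit) {
                            break; // a single split never needs more than limit rows
                        }
                    }
                    return hits;
                }));
            }
            // Collect results in split order and stop once the limit is reached.
            List<Integer> result = new ArrayList<>();
            for (Future<List<Integer>> future : futures) {
                for (int v : future.get()) {
                    if (result.size() < limit) {
                        result.add(v);
                    }
                }
            }
            return result;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("split scan failed", e);
        } finally {
            pool.shutdownNow();
        }
    }
}
```

The thread pool inside the single task stands in for the per-node parallel scan; how the real implementation sizes and schedules its scanners is not shown in these notes.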
Heap sort optimization - invoke PriorityQueue's siftDown via reflection
Currently, polling an item from the priority queue does the following:
- remove the first item.
- move the last item to the position of the first item, then siftDown the new first item.
In this patch, we peek the first item (without removing it from the heap) and get a row from it,
then siftDown the holder to its proper position if the holder is not empty.
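The peek-then-siftDown idea can be sketched with a small hand-written heap. Note the swap: the patch described above calls `siftDown` on `java.util.PriorityQueue` through reflection, whereas this Java sketch uses its own array-based min-heap so that it needs no reflective access (which newer JDKs may restrict). The `Holder` class and the merge over sorted int arrays are hypothetical stand-ins for the real row holders.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration using a hand-written heap instead of reflection.
class ReplaceTopMergeSketch {

    // A holder is a cursor over one sorted run of rows.
    static final class Holder {
        final int[] rows;
        int pos;

        Holder(int[] rows) {
            this.rows = rows;
        }

        int current() {
            return rows[pos];
        }

        boolean hasCurrent() {
            return pos < rows.length;
        }
    }

    // Restore the min-heap property below index i.
    static void siftDown(Holder[] heap, int size, int i) {
        while (true) {
            int left = 2 * i + 1;
            int right = left + 1;
            int smallest = i;
            if (left < size && heap[left].current() < heap[smallest].current()) {
                smallest = left;
            }
            if (right < size && heap[right].current() < heap[smallest].current()) {
                smallest = right;
            }
            if (smallest == i) {
                return;
            }
            Holder tmp = heap[i];
            heap[i] = heap[smallest];
            heap[smallest] = tmp;
            i = smallest;
        }
    }

    static List<Integer> merge(int[][] sortedRuns) {
        Holder[] heap = new Holder[sortedRuns.length];
        int size = 0;
        for (int[] run : sortedRuns) {
            if (run.length > 0) {
                heap[size++] = new Holder(run);
            }
        }
        for (int i = size / 2 - 1; i >= 0; i--) {
            siftDown(heap, size, i);
        }
        List<Integer> out = new ArrayList<>();
        while (size > 0) {
            Holder top = heap[0];          // peek: the holder stays in the heap
            out.add(top.current());
            top.pos++;
            if (!top.hasCurrent()) {       // holder is empty: only now remove it
                heap[0] = heap[--size];
                heap[size] = null;
            }
            siftDown(heap, size, 0);       // one siftDown per row
        }
        return out;
    }
}
```

Compared with poll followed by re-insert, each row costs a single siftDown from the root, and the last-to-first move happens only when a holder runs out of rows.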