Contents

Bloom filter - a kind of datamap

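  • Trades space for time
  • Introduce min/max first:
    • When values are scattered, min/max pruning has little effect
    • When values are clustered,
    • When the column is not a sort column
    • Parameter configuration is a fairly hard problem
      Comparison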
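
To illustrate the min/max versus Bloom filter comparison above, here is a minimal sketch of a per-block index that keeps both. It is not CarbonData's datamap implementation: the class BlockIndexSketch, its hash function, and the numBits/numHashes parameters are all assumptions made up for this example.

```java
import java.util.BitSet;

/** Hypothetical per-block index: a min/max range plus a small Bloom filter. */
class BlockIndexSketch {
    private final long min;
    private final long max;
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    BlockIndexSketch(long[] values, int numBits, int numHashes) {
        this.numBits = numBits;
        this.numHashes = numHashes;
        this.bits = new BitSet(numBits);
        long lo = Long.MAX_VALUE;
        long hi = Long.MIN_VALUE;
        for (long v : values) {
            lo = Math.min(lo, v);
            hi = Math.max(hi, v);
            // Set numHashes bits per value; this is the extra space being spent.
            for (int i = 0; i < numHashes; i++) {
                bits.set(position(v, i));
            }
        }
        this.min = lo;
        this.max = hi;
    }

    /** Double hashing: derive the i-th bit position from two base hashes. */
    private int position(long value, int i) {
        long h = value * 0x9E3779B97F4A7C15L;
        int h1 = (int) (h >>> 32);
        int h2 = ((int) h) | 1;
        return Math.floorMod(h1 + i * h2, numBits);
    }

    /** Min/max check: true whenever the value falls inside [min, max]. */
    boolean minMaxMayContain(long value) {
        return value >= min && value <= max;
    }

    /** Bloom check: false means definitely absent, true means possibly present. */
    boolean bloomMayContain(long value) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(value, i))) {
                return false;
            }
        }
        return true;
    }
}
```

For a block that holds only 1 and 1,000,000, minMaxMayContain(500) returns true, so min/max cannot skip the block. The Bloom filter can usually rule 500 out, at the cost of the extra bits and an occasional false positive; how many bits and hash functions to use is exactly the parameter-tuning problem noted above.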

Strings longer than 32k

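  • A CarbonData design issue: the string length was originally stored as a short. This is only for compatibility; in Spark the representation is always string
  • Use an int instead of a short to store the byte length of the content
  • A similar problem: in Snappy compression the byte[] length is an int, so it is also limited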
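
The length limit above can be shown with a small length-prefixed encoding. This is a sketch under assumed names (LengthPrefixSketch and its methods are hypothetical), not CarbonData's actual storage format; it only shows why a 2-byte length caps content at Short.MAX_VALUE (32767) bytes and how a 4-byte length lifts that cap.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical length-prefixed string encoding with a short or an int length. */
class LengthPrefixSketch {

    /** Layout: [2-byte length][bytes]. Rejects content longer than 32767 bytes. */
    static byte[] encodeWithShortLength(String s) {
        byte[] content = s.getBytes(StandardCharsets.UTF_8);
        if (content.length > Short.MAX_VALUE) {
            // Without this check, (short) 40000 would silently wrap to -25536.
            throw new IllegalArgumentException(
                "content of " + content.length + " bytes does not fit a short length");
        }
        return ByteBuffer.allocate(Short.BYTES + content.length)
            .putShort((short) content.length)
            .put(content)
            .array();
    }

    /** Layout: [4-byte length][bytes]. Limited only by the maximum array size. */
    static byte[] encodeWithIntLength(String s) {
        byte[] content = s.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(Integer.BYTES + content.length)
            .putInt(content.length)
            .put(content)
            .array();
    }

    /** Reads back a value written by encodeWithIntLength. */
    static String decodeWithIntLength(byte[] encoded) {
        ByteBuffer buf = ByteBuffer.wrap(encoded);
        byte[] content = new byte[buf.getInt()];
        buf.get(content);
        return new String(content, StandardCharsets.UTF_8);
    }
}
```

The int length costs two more bytes per value, and the same kind of ceiling reappears one level up: any API that takes a byte[] is still bounded by an int length.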

Compression configuration for zstd and other codecs

SearchMode for basic filter query: suited to simple queries (select, from, where, limit) when resources are tight
Currently, if a query generates many tasks, they can get jammed because they are limited by the number of available cores. For a basic filter query (with a limited number of result rows), we can add a new strategy that assigns each node only one task containing multiple splits and scans all of those splits in parallel (returning the result immediately instead of going through an iterator).
The key change is in CarbonScanRDD: parallelism is set to 0 (so that noOfTasksPerNode ends up as 1)
[CARBONDATA-2297] Support SEARCH_MODE for basic filter query by kevinjmh · Pull Request #2123 · apache/carbondata (github.com)
[CARBONDATA-2297] Support SEARCH_MODE for basic filter query - ASF JIRA (apache.org)
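
The strategy described above (one task per node, all of its splits scanned in parallel, results returned as a materialized list rather than through an iterator) can be sketched roughly as follows. This is not the code from the pull request: SearchModeSketch, scanNode, and the use of integer lists as stand-ins for splits and rows are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

/** Hypothetical single task for one node that scans all of its splits in parallel. */
class SearchModeSketch {

    /** Returns at most `limit` rows matching `filter`, collected eagerly into a list. */
    static List<Integer> scanNode(List<List<Integer>> splits,
                                  Predicate<Integer> filter,
                                  int limit) {
        // One thread per split inside the single task, instead of one task per split.
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, splits.size()));
        try {
            List<Future<List<Integer>>> pending = new ArrayList<>();
            for (List<Integer> split : splits) {
                pending.add(pool.submit(() -> {
                    List<Integer> hits = new ArrayList<>();
                    for (Integer row : split) {
                        if (hits.size() >= limit) {
                            break; // no split ever needs more than `limit` rows
                        }
                        if (filter.test(row)) {
                            hits.add(row);
                        }
                    }
                    return hits;
                }));
            }
            // Gather results in split order and stop adding once the limit is reached.
            List<Integer> result = new ArrayList<>();
            for (Future<List<Integer>> future : pending) {
                try {
                    for (Integer row : future.get()) {
                        if (result.size() < limit) {
                            result.add(row);
                        }
                    }
                } catch (InterruptedException | ExecutionException e) {
                    throw new IllegalStateException("split scan failed", e);
                }
            }
            return result;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Because the query carries a limit, each split can stop early and the combined result stays small, which is what makes returning a plain list instead of an iterator reasonable here.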

Heap sort optimization - calling PriorityQueue's siftDown via reflection
Currently, polling an item from the priority queue does the following:

  1. Remove the first item.
  2. Move the last item to the position of the first item, then siftDown the new first item.

In this patch, we peek at the first item (without removing it from the heap) and take a row from it,
then siftDown the holder to its proper position if the holder is not empty.
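
The peek-then-siftDown idea can be sketched as a k-way merge over sorted runs. Note the substitution: the patch reaches java.util.PriorityQueue's private siftDown through reflection, whereas this sketch uses a small hand-written binary heap so that it runs without reflective access to JDK internals, which newer Java versions may restrict. MergeHeapSketch and Holder are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical k-way merge that re-sifts the top holder instead of poll + add. */
class MergeHeapSketch {

    /** A cursor over one sorted run of rows. */
    static final class Holder {
        final int[] rows;
        int pos;
        Holder(int[] rows) { this.rows = rows; }
        int current() { return rows[pos]; }
        boolean advance() { return ++pos < rows.length; }
    }

    private final Holder[] heap;
    private int size;

    MergeHeapSketch(List<int[]> sortedRuns) {
        heap = new Holder[sortedRuns.size()];
        for (int[] run : sortedRuns) {
            if (run.length > 0) {
                heap[size++] = new Holder(run);
            }
        }
        // Heapify bottom-up.
        for (int i = size / 2 - 1; i >= 0; i--) {
            siftDown(i);
        }
    }

    /** Standard min-heap siftDown keyed on each holder's current row. */
    private void siftDown(int i) {
        Holder moving = heap[i];
        while (2 * i + 1 < size) {
            int child = 2 * i + 1;
            if (child + 1 < size && heap[child + 1].current() < heap[child].current()) {
                child++;
            }
            if (moving.current() <= heap[child].current()) {
                break;
            }
            heap[i] = heap[child];
            i = child;
        }
        heap[i] = moving;
    }

    boolean hasNext() { return size > 0; }

    int next() {
        Holder top = heap[0];           // peek: the holder stays in the heap
        int row = top.current();
        if (!top.advance()) {           // holder is empty: do a real removal
            heap[0] = heap[--size];
            heap[size] = null;
        }
        if (size > 0) {
            siftDown(0);                // a single siftDown restores heap order
        }
        return row;
    }

    static List<Integer> mergeAll(List<int[]> sortedRuns) {
        MergeHeapSketch merger = new MergeHeapSketch(sortedRuns);
        List<Integer> out = new ArrayList<>();
        while (merger.hasNext()) {
            out.add(merger.next());
        }
        return out;
    }
}
```

Compared with poll followed by add, which costs a siftDown and then a siftUp for every row, this path does one siftDown per row and only removes a holder once it runs out of rows.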