Optimization Techniques
Contents
HDFS
- Short-Circuit Local Reads
dfs.client.read.shortcircuit
- a co-located client reads the data file directly, bypassing the DataNode process
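
A minimal sketch of how the short-circuit setting might be written into hdfs-site.xml. Short-circuit reads also need a UNIX domain socket configured through dfs.domain.socket.path; the socket path below is a placeholder, and the helper name render_hadoop_conf is hypothetical, not part of any Hadoop tooling.

```python
import xml.etree.ElementTree as ET

# Assumed values for illustration only; adjust to the actual cluster.
SHORT_CIRCUIT_PROPS = {
    "dfs.client.read.shortcircuit": "true",
    # Placeholder path: the DataNode and the client must use the same socket.
    "dfs.domain.socket.path": "/var/lib/hadoop-hdfs/dn_socket",
}


def render_hadoop_conf(props):
    """Render a dict of properties as Hadoop-style <configuration> XML."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    if hasattr(ET, "indent"):  # pretty-print where available (Python 3.9+)
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(render_hadoop_conf(SHORT_CIRCUIT_PROPS))
```

The generated properties go into hdfs-site.xml on both the DataNode and the client side, since both ends have to agree on the socket path.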
YARN
- Multiple Local Dirs
yarn.nodemanager.local-dirs
- comma-separated list of directories, one per local disk (JBOD)
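
A small sketch of building the yarn.nodemanager.local-dirs value for a JBOD node, one directory per disk. The mount points and the yarn/local subdirectory are assumptions for illustration, and local_dirs_value is a hypothetical helper.

```python
def local_dirs_value(mount_points, subdir="yarn/local"):
    """Join one directory per disk into a comma-separated property value."""
    dirs = [f"{m.rstrip('/')}/{subdir}" for m in mount_points]
    if len(set(dirs)) != len(dirs):
        # Listing the same disk twice would defeat spreading I/O across disks.
        raise ValueError("duplicate local dir in mount_points")
    return ",".join(dirs)


if __name__ == "__main__":
    # Hypothetical mount points, one per physical disk.
    mounts = ["/data1", "/data2/", "/data3"]
    print("yarn.nodemanager.local-dirs =", local_dirs_value(mounts))
```

The resulting string is what would be placed in the value of the property in yarn-site.xml.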
Spark
- Shuffle Service
spark.shuffle.manager
- Code Gen
spark.sql.codegen.maxFields
spark.sql.codegen.hugeMethodLimit
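- whole-stage codegen is disabled when a schema has more fields than spark.sql.codegen.maxFields
- a generated method whose bytecode exceeds spark.sql.codegen.hugeMethodLimit is not used; very large methods also miss JIT compilation (JVM-dependent; HotSpot does not JIT-compile methods over 8000 bytes of bytecode)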
- Delta Lake (storage level)
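
The two code generation settings listed under Spark above could be passed at submit time roughly as follows. This is a sketch under stated assumptions: the values are illustrative rather than recommendations (8000 matches the HotSpot JIT limit on method bytecode size), app.py is a placeholder application, and to_submit_args is a hypothetical helper.

```python
# Illustrative values only; tune them against the actual workload.
CODEGEN_CONF = {
    "spark.sql.codegen.maxFields": 200,
    # 8000 bytes is the largest method HotSpot will JIT-compile by default.
    "spark.sql.codegen.hugeMethodLimit": 8000,
}


def to_submit_args(conf):
    """Turn a config dict into a flat list of spark-submit --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # app.py is a placeholder for the real application.
    print(" ".join(["spark-submit"] + to_submit_args(CODEGEN_CONF) + ["app.py"]))
```

The same key/value pairs can instead be placed in spark-defaults.conf if they should apply to every job.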
Zero-Copy
- RDMA (hardware required)
- DPDK (multiple NICs and NUMA required)
File Format
- carbondata
- parquet
- orc
Compression Codec
- snappy
- lz4
- zstd
Data Lake
- delta
- hudi (aimed at streaming)
- iceberg
DB query
- hive
- presto/trino
- impala
- greenplum
- clickhouse
- doris
Acceleration
- alluxio
- ignite