Contents
  1. HDFS
  2. YARN
  3. Spark
  4. Zero-Copy
  5. File Format
  6. Compress Codec
  7. Data Lake
  8. DB query
  9. Accelerate

HDFS

  • Short-Circuit Local Reads dfs.client.read.shortcircuit
    • a client co-located with the DataNode reads block files directly from local disk, bypassing the DataNode process
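
A minimal sketch of how the short-circuit setting might be written out as an hdfs-site.xml fragment. The helper hadoop_config_xml is hypothetical and the socket path is an assumed example value. Short-circuit reads also rely on a Unix domain socket (dfs.domain.socket.path) and the native libhadoop library, so check the Hadoop documentation for your version before applying.

```python
import xml.etree.ElementTree as ET


def hadoop_config_xml(props):
    """Render a dict of properties as a Hadoop-style <configuration> document."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    if hasattr(ET, "indent"):  # pretty-printing is available on Python 3.9+
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


short_circuit_props = {
    # Enables short-circuit local reads on the client.
    "dfs.client.read.shortcircuit": "true",
    # Assumed example path; the same socket path must be configured
    # on both the DataNode and the client.
    "dfs.domain.socket.path": "/var/lib/hadoop-hdfs/dn_socket",
}

hdfs_site_xml = hadoop_config_xml(short_circuit_props)
print(hdfs_site_xml)
```

The printed fragment would go into hdfs-site.xml on both the DataNode hosts and the co-located clients.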

YARN

  • Multiple Local Dirs yarn.nodemanager.local-dirs
    • comma-separated list of directories; put one on each local disk (JBOD) to spread container I/O across disks
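
A small sketch of building the comma-separated value for yarn.nodemanager.local-dirs from a list of disk mount points. The mount points, the subdirectory name, and the helper local_dirs_value are all hypothetical and only illustrate the one-directory-per-disk layout.

```python
def local_dirs_value(mount_points, subdir="yarn/local"):
    """Join one directory per disk into a comma-separated list."""
    return ",".join(f"{m.rstrip('/')}/{subdir}" for m in mount_points)


# Hypothetical JBOD mount points, one per physical disk.
disks = ["/data1", "/data2", "/data3"]

yarn_props = {"yarn.nodemanager.local-dirs": local_dirs_value(disks)}
print(yarn_props["yarn.nodemanager.local-dirs"])
# prints: /data1/yarn/local,/data2/yarn/local,/data3/yarn/local
```

The resulting string is the value for the yarn.nodemanager.local-dirs property in yarn-site.xml.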

Spark

  • External Shuffle Service spark.shuffle.service.enabled
  • Code Gen spark.sql.codegen.maxFields spark.sql.codegen.hugeMethodLimit
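    • whole-stage codegen is disabled when a plan has more fields than maxFields (default 100)
    • generated code whose compiled bytecode exceeds hugeMethodLimit (default 65535) is not used; below that limit, very large methods may still miss JIT compilation depending on the JVM (HotSpot skips methods over 8000 bytes of bytecode)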
  • Delta Lake (storage level)
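
A sketch of passing the shuffle-service and codegen settings to spark-submit as --conf flags. The helper spark_submit_conf_args is hypothetical and the values are illustrative assumptions, not recommendations: spark.shuffle.service.enabled turns on the external shuffle service, 200 raises maxFields above its default of 100, and 8000 matches HotSpot's huge-method threshold.

```python
import shlex


def spark_submit_conf_args(conf):
    """Turn a dict of Spark settings into a flat list of --conf arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


# Illustrative values only; tune them for your own workload and JVM.
spark_conf = {
    "spark.shuffle.service.enabled": "true",
    "spark.sql.codegen.maxFields": "200",
    "spark.sql.codegen.hugeMethodLimit": "8000",
}

conf_args = spark_submit_conf_args(spark_conf)
# Quote each argument so the line can be pasted into a shell safely.
print(" ".join(shlex.quote(a) for a in conf_args))
```

The same keys can also be placed in spark-defaults.conf, one key and value per line, instead of being passed on the command line.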

Zero-Copy

File Format

  • carbondata
  • parquet
  • orc

Compress Codec

  • snappy
  • lz4
  • zstd

Data Lake

  • delta
  • hudi (aimed at streaming ingestion)
  • iceberg

DB query

  • hive
  • presto/trino
  • impala
  • greenplum
  • clickhouse
  • doris

Accelerate

  • alluxio
  • ignite