Contents
  1. What is carbon
    1.1. Reference Site
  2. Run Carbondata on your computer
    2.1. 1. Get Carbondata Release
    2.2. 2. Prepare environment
    2.3. 3. Run Carbon
    2.4. Use Carbon to manage your data
  3. Build Carbondata yourself
    3.1. Prepare development environment
      3.1.1. Maven
      3.1.2. Git
    3.2. Build
    3.3. Test
  4. Contribute to community
    4.1. 1. Discuss on the mailing list
    4.2. 2. Contribute codes

What is carbon

Carbondata is an indexed columnar data format for fast analytics on big data platforms, similar to Parquet and ORC.

Reference Site

Run Carbondata on your computer

1. Get Carbondata Release

You can get the released Carbondata jar from GitHub or the Apache archive.

Each module is also available individually in the Maven repository.

2. Prepare environment

Recommended:

  • Linux OS
  • JDK 7
  • Hadoop 2.7.2
  • Spark 2.3

Please ensure that HDFS, MapReduce and Spark are working normally. You can check this by running example jobs, such as Word Count for MapReduce and Pi calculation for Spark.
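
Before running the example jobs, it can help to confirm that the required commands are reachable at all. The following is a minimal sketch, not part of Carbondata: `check_tool` is a hypothetical helper, and it assumes the Hadoop and Spark `bin` directories are on your PATH. The commented commands at the end assume a stock Hadoop 2.7.2 binary distribution; adjust the jar path if your layout differs.

```shell
# Report whether a required command is on PATH (hypothetical helper).
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1"
    return 1
  fi
}

# Tools this guide relies on; a missing one is reported but does not stop the loop.
for tool in java hadoop hdfs spark-submit; do
  check_tool "$tool" || true
done

# Once everything is found, run the bundled example jobs by hand, for instance:
#   "$SPARK_HOME"/bin/run-example SparkPi 10
#   hadoop jar "$HADOOP_HOME"/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar \
#       wordcount <hdfs-input-dir> <hdfs-output-dir>
```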

You also need to prepare a carbon.properties file, based on the template at https://github.com/apache/carbondata/tree/master/conf/carbon.properties.template

You can set carbon.properties.filepath in Spark's spark-defaults.conf file, or pass it through extraJavaOptions when you submit the application.
Please ensure the carbon.properties file is available to every executor; otherwise the parameters will not take effect.
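
As an illustration of the spark-defaults.conf route, the sketch below appends the two extraJavaOptions entries to a configuration file. The path `/opt/carbondata/conf/carbon.properties` is a hypothetical location, and the script writes to `./spark-defaults.conf` by default so that you can review the result before copying it into `$SPARK_HOME/conf`.

```shell
# Assumption: carbon.properties exists at the same absolute path on every node.
CARBON_PROPS="${CARBON_PROPS:-/opt/carbondata/conf/carbon.properties}"   # hypothetical location
SPARK_DEFAULTS="${SPARK_DEFAULTS:-./spark-defaults.conf}"                # normally $SPARK_HOME/conf/spark-defaults.conf

# Pass the file location to both the driver JVM and the executor JVMs.
cat >> "$SPARK_DEFAULTS" <<EOF
spark.driver.extraJavaOptions    -Dcarbon.properties.filepath=$CARBON_PROPS
spark.executor.extraJavaOptions  -Dcarbon.properties.filepath=$CARBON_PROPS
EOF
```

If the file already sets these keys, merge the option into the existing value instead of appending a second entry. When the file is not present on every node, one option is to ship it with the `--files` argument of spark-submit.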

3. Run Carbon

We can submit the assembly jar file as a spark application to create CarbonThriftServer, and then use beeline to connect.

Template

./bin/spark-submit \
--class org.apache.carbondata.spark.thriftserver.CarbonThriftServer \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<CARBON_ASSEMBLY_JAR_PATH> \
<CARBON_STORE_PATH>

Example

$SPARK_HOME/bin/spark-submit \
--class org.apache.carbondata.spark.thriftserver.CarbonThriftServer \
--master yarn \
--deploy-mode client \
--driver-memory 10g \
--num-executors 3 \
--executor-memory 10g \
--executor-cores 10 \
--conf "spark.driver.extraJavaOptions=-Dcarbon.properties.filepath=carbon.properties" \
apache-carbondata-1.5.1-bin-spark2.2.1-hadoop2.7.2.jar \
hdfs://localhost:9000/user/hive/warehouse/carbon.store

When you see a log message saying that the Thrift Server is listening on 127.0.0.1:10000, you can use the following command to connect:

$SPARK_HOME/bin/beeline -u jdbc:hive2://localhost:10000 -n root

Use Carbon to manage your data

Simple usage example

-- create table
CREATE TABLE IF NOT EXISTS carbon_test_table (col_1 string, col_2 int)
STORED AS carbondata;

-- show table
SHOW TABLES;

-- insert data
INSERT INTO TABLE carbon_test_table VALUES('a',1),('b',2);

-- show segments
SHOW SEGMENTS FOR TABLE carbon_test_table;

-- query
SELECT * FROM carbon_test_table WHERE col_2=1;

You can then check the data files under <CARBON_STORE_PATH>, which is set when you start CarbonThriftServer.
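
The same statements can also be run non-interactively. The sketch below writes them to a file and passes it to beeline with `-f`, then lists the store directory. It assumes the Thrift server from the previous step is listening on localhost:10000 and reuses the store path from the example; `carbon_smoke_test.sql` is a hypothetical file name.

```shell
# Write the example statements to a file.
SQL_FILE="${SQL_FILE:-./carbon_smoke_test.sql}"   # hypothetical file name
cat > "$SQL_FILE" <<'EOF'
CREATE TABLE IF NOT EXISTS carbon_test_table (col_1 string, col_2 int)
STORED AS carbondata;
INSERT INTO TABLE carbon_test_table VALUES('a',1),('b',2);
SHOW SEGMENTS FOR TABLE carbon_test_table;
SELECT * FROM carbon_test_table WHERE col_2=1;
EOF

# Assumption: CarbonThriftServer is reachable at this JDBC URL.
JDBC_URL="${JDBC_URL:-jdbc:hive2://localhost:10000}"
if [ -x "${SPARK_HOME:-}/bin/beeline" ]; then
  "$SPARK_HOME/bin/beeline" -u "$JDBC_URL" -n root -f "$SQL_FILE" \
    || echo "beeline run failed; is the Thrift server up?"
else
  echo "beeline not found under SPARK_HOME; run $SQL_FILE manually"
fi

# List what was written under the store path passed to CarbonThriftServer.
STORE_PATH="${STORE_PATH:-hdfs://localhost:9000/user/hive/warehouse/carbon.store}"
if command -v hdfs >/dev/null 2>&1; then
  hdfs dfs -ls -R "$STORE_PATH" || echo "could not list $STORE_PATH"
fi
```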

Build Carbondata yourself

After you modify the source code, you need to build the jar for testing.

Prepare development environment

Recommended:

  • Linux OS
  • JDK 7/8
  • Maven
  • Git
  • Thrift 0.9.3
  • IntelliJ IDEA

Maven

Configure a mirror for fetching dependency jars.

For missing jars, you can use the mvn install command to install them into your local repository.
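
A minimal sketch of both steps follows. The mirror URL `https://repo.example.com/maven2` is a placeholder, not a real repository, and the script writes a separate `settings-mirror.xml` rather than touching `~/.m2/settings.xml`. The `install:install-file` coordinates in the trailing comment are placeholders as well.

```shell
# Write a stand-alone Maven settings file that redirects "central" to a mirror.
MIRROR_URL="${MIRROR_URL:-https://repo.example.com/maven2}"   # hypothetical mirror URL
SETTINGS_FILE="${SETTINGS_FILE:-./settings-mirror.xml}"

cat > "$SETTINGS_FILE" <<EOF
<settings>
  <mirrors>
    <mirror>
      <id>local-mirror</id>
      <mirrorOf>central</mirrorOf>
      <url>$MIRROR_URL</url>
    </mirror>
  </mirrors>
</settings>
EOF

echo "Use it with: mvn -s $SETTINGS_FILE -DskipTests -Pspark-2.2 -Dspark.version=2.2.1 clean package"

# For a jar that no repository provides, install it into the local repository by hand
# (all coordinates below are placeholders):
#   mvn install:install-file -Dfile=<path-to-jar> -DgroupId=<group-id> \
#       -DartifactId=<artifact-id> -Dversion=<version> -Dpackaging=jar
```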

Git

Clone the Carbondata project from the Apache GitHub repository. It is recommended to fork the project to your own GitHub account and clone the fork.
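
One possible way to set this up is sketched below. It assumes you have already forked apache/carbondata on GitHub; `your-account` is a placeholder, and the remote name `upstream` is only a common convention.

```shell
# Assumption: GITHUB_USER is the account that holds your fork of apache/carbondata.
GITHUB_USER="${GITHUB_USER:-your-account}"   # placeholder, replace with your account name
FORK_URL="https://github.com/$GITHUB_USER/carbondata.git"
UPSTREAM_URL="https://github.com/apache/carbondata.git"

if [ "$GITHUB_USER" = "your-account" ]; then
  echo "Set GITHUB_USER to your GitHub account name before cloning."
else
  # Clone the fork, then track the Apache repository to pull in later changes.
  git clone "$FORK_URL" carbondata
  git -C carbondata remote add upstream "$UPSTREAM_URL"
  git -C carbondata fetch upstream
fi
```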

Build

mvn -DskipTests -Pspark-2.2 -Dspark.version=2.2.1 clean package

git archive --format=tar.gz -o current.tgz HEAD
[wsl] tar -xf /mnt/c/Users/Manhua/IdeaProjects/carbondata/current.tgz -C carbondata
mvn -DskipTests -Pbuild-with-format,spark-3.1 install

Test

Submit the jar file you built and run.
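
To locate the jar before submitting it, something like the following may help. It is a sketch under two assumptions: that the build leaves the assembly jar somewhere below `assembly/target` in the source tree, and that its name starts with `apache-carbondata-` as in the release jar used earlier. `find_assembly_jar` is a hypothetical helper.

```shell
# Print the matching jar that sorts last under the given directory (hypothetical helper).
find_assembly_jar() {
  find "$1" -name 'apache-carbondata-*.jar' -type f 2>/dev/null | sort | tail -n 1
}

# CARBON_SRC is the root of your Carbondata checkout; defaults to the current directory.
JAR="$(find_assembly_jar "${CARBON_SRC:-.}/assembly/target")"
if [ -n "$JAR" ]; then
  echo "Built jar: $JAR"
  echo "Submit it with: spark-submit --class org.apache.carbondata.spark.thriftserver.CarbonThriftServer --master <master-url> $JAR <CARBON_STORE_PATH>"
else
  echo "No assembly jar found; run the Maven build first."
fi
```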

Contribute to community

We are glad when you share your changes with other users to make Carbondata better.

1. Discuss on the mailing list

Raise your problem or idea on the mailing list to make your work more efficient.

2. Contribute codes

If you have fixed a bug or optimized Carbondata's code, you can create a JIRA ticket for tracking and open a PR (pull request) on GitHub.