Day 1 with CarbonData
What is CarbonData
CarbonData is an indexed columnar data format for fast analytics on big data platforms, comparable to Parquet and ORC.
Reference Site
Run CarbonData on your computer
1. Get a CarbonData release
You can get the released CarbonData jar from GitHub or the Apache archive.
Individual modules are available in the Maven repository.
2. Prepare environment
Recommended:
- Linux OS
- JDK 8 (Spark 2.3 requires Java 8 or later)
- Hadoop 2.7.2
- Spark 2.3
Please ensure that HDFS, MapReduce, and Spark are working normally. You can check this by running an example job, such as Word Count for MapReduce and Pi calculation for Spark.
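Before running the example jobs, it can help to confirm that the command-line tools are on your PATH at all. The following is a minimal sketch; the helper name check_tools and the tool list are assumptions for illustration, not part of CarbonData.

```shell
#!/bin/sh
# Report whether each named command-line tool can be found on PATH.
# Succeeds only if every tool was found.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "missing: $tool"
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ]
}

# The tool list is an assumption based on the recommended environment.
check_tools java hadoop hdfs spark-submit || echo "Install the missing tools before continuing."
```

Once everything is found, the Spark side can usually be smoke-tested with $SPARK_HOME/bin/run-example SparkPi 10, and the MapReduce side with the wordcount job from the hadoop-mapreduce-examples jar shipped with Hadoop.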
You also need to prepare a carbon.properties file based on https://github.com/apache/carbondata/tree/master/conf/carbon.properties.template
You can configure carbon.properties.filepath in Spark's spark-defaults.conf file, or set extraJavaOptions when submitting the application.
Please ensure that the carbon.properties file is distributed to each executor; otherwise the parameters will not take effect.
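As a rough illustration of the spark-defaults.conf approach, the sketch below appends the two extraJavaOptions entries to a configuration file. The helper name add_carbon_properties_conf and the path /opt/spark/conf/carbon.properties are assumptions; the demo writes to a temporary file so that no real Spark installation is touched.

```shell
#!/bin/sh
# Append JVM options that point the driver and executors at carbon.properties.
# $1: path to spark-defaults.conf, $2: path to carbon.properties
add_carbon_properties_conf() {
  conf_file=$1
  props_path=$2
  {
    echo "spark.driver.extraJavaOptions -Dcarbon.properties.filepath=${props_path}"
    echo "spark.executor.extraJavaOptions -Dcarbon.properties.filepath=${props_path}"
  } >> "$conf_file"
}

# Demo against a throwaway file; the properties path is hypothetical.
demo_conf=$(mktemp)
add_carbon_properties_conf "$demo_conf" "/opt/spark/conf/carbon.properties"
cat "$demo_conf"
```

If your spark-defaults.conf already defines extraJavaOptions, merge the -D flag into the existing line instead of appending a second entry. The path must exist on every node; shipping the file with spark-submit's --files option is one possible way to get it there.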
3. Run Carbon
We can submit the assembly jar as a Spark application to start CarbonThriftServer, and then use beeline to connect.
Template
./bin/spark-submit \
Example
$SPARK_HOME/bin/spark-submit \
When the log shows that the Thrift Server is listening on 127.0.0.1:10000, you can connect with the following command:
$SPARK_HOME/bin/beeline -u jdbc:hive2://localhost:10000 -n root
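For scripted checks it is sometimes handy to run a single statement through beeline instead of opening an interactive session. The sketch below only prints such a command rather than running it. The helper names, the BEELINE_USER variable, and the /opt/spark fallback are assumptions for illustration; -e is beeline's option for executing one query.

```shell
#!/bin/sh
# Build the JDBC URL for a HiveServer2-compatible Thrift server.
hive2_url() {
  host=${1:-localhost}
  port=${2:-10000}
  printf 'jdbc:hive2://%s:%s\n' "$host" "$port"
}

# Print (not run) a beeline command that executes one statement and exits.
# Falls back to a hypothetical /opt/spark when SPARK_HOME is not set.
beeline_command() {
  sql=$1
  printf '%s/bin/beeline -u %s -n %s -e "%s"\n' \
    "${SPARK_HOME:-/opt/spark}" "$(hive2_url)" "${BEELINE_USER:-root}" "$sql"
}

beeline_command "SHOW TABLES;"
```

Copy the printed line into a terminal once the Thrift server is up; an empty table list is a reasonable sign that the connection works.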
Use Carbon to manage your data
Simple usage example
// create table
And you can check data file in
Build CarbonData yourself
Once you modify the source code, you need to build the jar for testing.
Prepare the development environment
Recommended:
- Linux OS
- JDK 7/8
- Maven
- Git
- Thrift 0.9.3
- IntelliJ IDEA
Maven
Configure a mirror for fetching dependency jars.
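A mirror is declared in Maven's settings.xml. The sketch below writes a minimal settings file with one mirror for the central repository. The helper name write_mirror_settings, the mirror id, and the URL https://repo.example.com/maven2 are placeholders, not a real mirror; substitute one you trust.

```shell
#!/bin/sh
# Write a minimal Maven settings file that routes "central" through a mirror.
# $1: output file, $2: mirror id, $3: mirror URL
write_mirror_settings() {
  settings_file=$1
  mirror_id=$2
  mirror_url=$3
  cat > "$settings_file" <<EOF
<settings>
  <mirrors>
    <mirror>
      <id>${mirror_id}</id>
      <url>${mirror_url}</url>
      <mirrorOf>central</mirrorOf>
    </mirror>
  </mirrors>
</settings>
EOF
}

# Demo against a temporary file; id and URL are hypothetical.
demo_settings=$(mktemp)
write_mirror_settings "$demo_settings" "example-mirror" "https://repo.example.com/maven2"
cat "$demo_settings"
```

Maven reads user settings from ~/.m2/settings.xml by default; a file in another location can be passed with mvn -s.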
For missing jars, you can use the mvn install command to install them into your local repository.
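For a jar that no remote repository provides, Maven's install:install-file goal can place it in the local repository. The sketch below only assembles and prints the command. The helper name install_jar_command and the coordinates (com.example, example-lib, 1.0) are hypothetical; use the ones the failing build reports as missing.

```shell
#!/bin/sh
# Print the Maven command that installs a standalone jar into the local repository.
# $1: jar path, $2: groupId, $3: artifactId, $4: version
install_jar_command() {
  printf 'mvn install:install-file -Dfile=%s -DgroupId=%s -DartifactId=%s -Dversion=%s -Dpackaging=jar\n' \
    "$1" "$2" "$3" "$4"
}

# Hypothetical coordinates for illustration only.
install_jar_command ./libs/example-lib-1.0.jar com.example example-lib 1.0
```

Printing the command first makes it easy to double-check the coordinates against the error message before anything is written to ~/.m2.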
Git
Clone the CarbonData project from the Apache GitHub repository. It is recommended to fork the project to your own GitHub account and clone the fork.
Build
mvn -DskipTests -Pspark-2.2 -Dspark.version=2.2.1 clean package
git archive --format=tar.gz -o current.tgz HEAD
[wsl] tar -xf /mnt/c/Users/Manhua/IdeaProjects/carbondata/current.tgz -C carbondata
mvn -DskipTests -Pbuild-with-format,spark-3.1 install
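After a build finishes, the jars end up in the target directories of the individual modules. Since the exact jar names depend on the version and profile, the sketch below simply lists every jar under any target directory. The helper name list_built_jars is an assumption for illustration.

```shell
#!/bin/sh
# List jar files located under any "target" directory below the given root.
# $1: project root (defaults to the current directory)
list_built_jars() {
  find "${1:-.}" -type f -path '*/target/*' -name '*.jar' | sort
}

# Run from the project root after the Maven build completes.
list_built_jars .
```

Jars outside target directories are skipped on purpose, so checked-in or downloaded libraries do not clutter the list.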
Test
Submit the jar you built and run it.
Contribute to community
We are glad when you share your changes with other users to make CarbonData better.
1. Discuss on the mailing list
Raise your problem or idea on the mailing list to make your work more efficient.
2. Contribute code
If you have fixed a bug or optimized CarbonData's code, you can create a JIRA ticket for tracking and open a PR (pull request) on GitHub.