Supported input
- File
  - json
  - csv
  - avro
  - parquet
- JDBC
Supported output
- File
  - json
  - csv
  - avro
  - parquet
- JDBC
Quickstart
This guide is a quick tour of the main features of Data Tweak. It provides config snippets that show how to read data, define the steps and queries of an ETL job, and write the results.
Config
Data Tweak configuration is based on PureConfig, which can read a config from:
- a file in a file system
- resources in your classpath
- a URL
- a string
Data ingest
Read a CSV file with a header row, apply an explicit schema, and save the result in Avro format.
main class: ro.esolutions.datatweak.apps.IngestApp
input: {
  format = "csv"
  path = "file:///datasets/users.csv"
  options = {
    "header": "true"
  }
  schema = """{
    "type": "struct",
    "fields": [{
      "name": "id",
      "type": "integer",
      "nullable": false
    }, {
      "name": "name",
      "type": "string",
      "nullable": false
    }, {
      "name": "age",
      "type": "integer",
      "nullable": true
    }]
  }"""
}
output: {
  format = "avro"
  path = "file:///tmp/bootcamp/avro/"
}
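A malformed schema string is an easy mistake to make in a config like the one above, so it can help to check it before submitting a job. The following is a minimal sketch using only Python's standard library. It assumes the schema follows the struct/fields layout shown in the example; `check_schema` is a hypothetical helper, not part of Data Tweak, and it only checks the keys used in this guide.

```python
import json

# Schema string copied from the `schema` field of the ingest config above.
SCHEMA = """{
  "type": "struct",
  "fields": [
    {"name": "id", "type": "integer", "nullable": false},
    {"name": "name", "type": "string", "nullable": false},
    {"name": "age", "type": "integer", "nullable": true}
  ]
}"""


def check_schema(text):
    """Hypothetical helper: parse the JSON and return (name, type, nullable) per field."""
    schema = json.loads(text)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(schema, dict) or schema.get("type") != "struct":
        raise ValueError("top-level value must be an object with type 'struct'")
    fields = []
    for field in schema.get("fields", []):
        # Only the three keys used in this guide are checked (an assumption).
        missing = {"name", "type", "nullable"} - field.keys()
        if missing:
            raise ValueError(f"field {field!r} is missing {sorted(missing)}")
        fields.append((field["name"], field["type"], field["nullable"]))
    return fields


fields = check_schema(SCHEMA)
for name, type_, nullable in fields:
    print(f"{name}: {type_} (nullable={nullable})")
```

The check is deliberately shallow: it confirms that the string is valid JSON with the expected keys, not that the type names are ones the job will accept.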
Data wrangling
Read two Avro datasets, join them, and save the result in Parquet format.
main class: ro.esolutions.datatweak.apps.QueryApp
source: [
  {
    "name": "orders"
    input: {
      format = "avro"
      path = "file:///home/lucian/workspace/bigdata/datasets/retail/warehouse/orders/"
    }
  },
  {
    "name": "order_items"
    input: {
      format = "avro"
      path = "file:///home/lucian/workspace/bigdata/datasets/retail/warehouse/order_items/"
    }
  }
]
query: "SELECT * FROM orders o JOIN order_items i ON (o.order_id == i.order_item_order_id)"
output: {
  format = "parquet"
  path = "file:///tmp/bootcamp/parquet/"
}
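To get a feel for what the join above returns, the same query text can be tried on a few rows with Python's built-in sqlite3 module. This is only a stand-in for the Spark job, not how Data Tweak runs the query. The sample rows and the `order_status` and `order_item_id` columns are hypothetical; only `order_id` and `order_item_order_id` come from the query itself.

```python
import sqlite3

# In-memory stand-in for the two named sources; the real job reads Avro files.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, order_status TEXT)")
conn.execute(
    "CREATE TABLE order_items (order_item_id INTEGER, order_item_order_id INTEGER)"
)

# Hypothetical sample data: order 3 has no items.
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "CLOSED"), (2, "PENDING"), (3, "COMPLETE")],
)
conn.executemany(
    "INSERT INTO order_items VALUES (?, ?)",
    [(10, 1), (11, 1), (12, 2)],
)

# Query text taken verbatim from the wrangling config.
QUERY = (
    "SELECT * FROM orders o JOIN order_items i "
    "ON (o.order_id == i.order_item_order_id)"
)

rows = sorted(conn.execute(QUERY).fetchall())
for row in rows:
    print(row)
```

Each output row carries the columns of both sources, one row per matching order item. Order 3 has no matching item, so the inner join drops it.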
Run
Usage: spark-submit... <application-jar> [options]
Available options:
option | description |
---|---|
-j, --job | job is a required application name property |
-n, --namespace | optional configuration namespace property |
-u, --url | optional config url property |
-l, --literal | optional literal config property |
Example:
./bin/spark-submit \
  --class ro.esolutions.datatweak.apps.QueryApp \
  --master spark://localhost:7077 \
  --conf spark.eventLog.enabled=false \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  hdfs://spark/jars/apps-0.3.1.jar \
  -j queryJobs \
  -u http://localhost:8080/config/wrangling
Note: in this example, the config service at localhost:8080 returns the Data wrangling config shown above.
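When jobs are launched from a script, the invocation above can be assembled as an argument list instead of a hand-edited shell line. The sketch below is one possible way to do that in Python; `build_submit_command` and its parameter names are hypothetical, and it only uses the short options listed in the table above, placed after the application jar as in the usage line.

```python
import shlex


def build_submit_command(main_class, jar, job, url=None, namespace=None,
                         literal=None, master=None,
                         spark_submit="./bin/spark-submit"):
    """Hypothetical helper: assemble a spark-submit argument list."""
    cmd = [spark_submit, "--class", main_class]
    if master is not None:
        cmd += ["--master", master]
    cmd.append(jar)
    # Application options follow the jar; -j is the only required one.
    cmd += ["-j", job]
    if namespace is not None:
        cmd += ["-n", namespace]
    if url is not None:
        cmd += ["-u", url]
    if literal is not None:
        cmd += ["-l", literal]
    return cmd


cmd = build_submit_command(
    main_class="ro.esolutions.datatweak.apps.QueryApp",
    jar="hdfs://spark/jars/apps-0.3.1.jar",
    job="queryJobs",
    url="http://localhost:8080/config/wrangling",
    master="spark://localhost:7077",
)

# Print a copy-pasteable shell line (shlex.join needs Python 3.8+).
print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run(cmd)`, which avoids shell quoting issues if a config URL or literal contains special characters.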