DataTweak

Supported input

- File
  - json
  - csv
  - avro
  - parquet
- JDBC

Supported output

- File
  - json
  - csv
  - avro
  - parquet
- JDBC

Quickstart

This guide helps you quickly explore the main features of DataTweak. It provides config snippets that show how to read data, define the ETL steps and queries, and write the results.

Config

DataTweak configuration is based on PureConfig, which can read a config from:

- a file in a file system
- resources in your classpath
- a URL
- a string
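
To illustrate the four sources listed above, here is a minimal Scala sketch using PureConfig's `ConfigSource` API. It assumes PureConfig 0.12 or later on Scala 2 with the `pureconfig-generic` module available. `OutputConf`, the file path, and the resource name are hypothetical placeholders, not DataTweak's actual configuration classes or locations.

```scala
import java.net.URL
import java.nio.file.Paths

import pureconfig._
import pureconfig.generic.auto._

// Hypothetical model for illustration only; DataTweak's real config classes may differ.
final case class OutputConf(format: String, path: String)

object ConfigSourcesSketch {
  // A config supplied as a literal string.
  val literal: String =
    """output: {
      |  format = "parquet"
      |  path = "file:///tmp/bootcamp/parquet/"
      |}""".stripMargin

  // One constructor per kind of source (paths and names below are assumptions).
  def fromFile     = ConfigSource.file(Paths.get("/etc/datatweak/app.conf"))
  def fromResource = ConfigSource.resources("app.conf")
  def fromUrl      = ConfigSource.url(new URL("http://localhost:8080/config/wrangling"))
  def fromString   = ConfigSource.string(literal)

  def main(args: Array[String]): Unit = {
    // Select the "output" namespace and decode it into the case class.
    // load returns Either[ConfigReaderFailures, OutputConf].
    val result = fromString.at("output").load[OutputConf]
    println(result)
  }
}
```

Whichever source is used, decoding works the same way: pick a namespace with `at` and call `load` with the target type.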

Data ingest

Read a CSV file with a header row using an explicit schema, and save it in Avro format.

main class: ro.esolutions.datatweak.apps.IngestApp

    input: {
        format = "csv"
        path = "file:///datasets/users.csv"
        options = {
            "header": "true"
        }
        schema = """{
            "type": "struct",
            "fields": [{
              "name": "id",
              "type": "integer",
              "nullable": false
            }, {
              "name": "name",
              "type": "string",
              "nullable": false
            }, {
              "name": "age",
              "type": "integer",
              "nullable": true
            }]
          }"""
    }
    output: {
        format = "avro"
        path = "file:///bootcamp/avro/"
    }
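
For readers who want to see what such a config amounts to, the following is a sketch of roughly equivalent plain Spark code in Scala. It is not DataTweak's implementation; `IngestSketch`, `parseSchema`, and `ingest` are hypothetical names, and the input and output paths are taken from the command line rather than from a config. Writing Avro assumes the external `spark-avro` module is on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{DataType, StructType}

object IngestSketch {
  // The same schema as in the config, in Spark's JSON schema representation.
  val schemaJson: String =
    """{
      |  "type": "struct",
      |  "fields": [
      |    {"name": "id",   "type": "integer", "nullable": false},
      |    {"name": "name", "type": "string",  "nullable": false},
      |    {"name": "age",  "type": "integer", "nullable": true}
      |  ]
      |}""".stripMargin

  // Parse the JSON schema string into a StructType.
  def parseSchema(json: String): StructType =
    DataType.fromJson(json).asInstanceOf[StructType]

  // Read a CSV with a header row using the explicit schema, then write it as Avro.
  def ingest(spark: SparkSession, inputPath: String, outputPath: String): Unit = {
    val users: DataFrame = spark.read
      .format("csv")
      .option("header", "true")
      .schema(parseSchema(schemaJson))
      .load(inputPath)

    // The "avro" format requires the spark-avro module (an assumption about the deployment).
    users.write.format("avro").save(outputPath)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ingest-sketch").getOrCreate()
    try ingest(spark, args(0), args(1))
    finally spark.stop()
  }
}
```

The `format`, `options`, and `schema` keys of the config map onto the `format`, `option`, and `schema` calls of Spark's reader, which is presumably why the schema is given in Spark's JSON form.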

Data wrangling

Read two Avro datasets, join them, and save the result in Parquet format.

main class: ro.esolutions.datatweak.apps.QueryApp

    source: [
        {
            "name": "orders"
            input: {
                format = "avro"
                path = "file:///home/lucian/workspace/bigdata/datasets/retail/warehouse/orders/"
            }
        },
        {
            "name": "order_items"
            input: {
                format = "avro"
                path = "file:///home/lucian/workspace/bigdata/datasets/retail/warehouse/order_items/"
            }
        }
    ]
    query: "SELECT * FROM orders o JOIN order_items i ON (o.order_id == i.order_item_order_id)"
    output: {
        format = "parquet"
        path = "file:///tmp/bootcamp/parquet/"
    }
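
As a rough guide to what this config describes, here is a sketch of comparable plain Spark code in Scala. It is an assumption about the behaviour, not DataTweak's implementation: `QuerySketch` and `run` are hypothetical names, each named source is registered as a temporary view so the query can refer to it, and the paths come from the command line. Reading Avro assumes the external `spark-avro` module is on the classpath.

```scala
import org.apache.spark.sql.SparkSession

object QuerySketch {
  // The query from the config; it refers to the sources by their "name".
  val query: String =
    "SELECT * FROM orders o JOIN order_items i ON (o.order_id == i.order_item_order_id)"

  // sources maps a view name to the path of an Avro dataset.
  def run(spark: SparkSession,
          sources: Map[String, String],
          sql: String,
          outputPath: String): Unit = {
    // Register every source under its name so the SQL can reference it.
    sources.foreach { case (name, path) =>
      spark.read.format("avro").load(path).createOrReplaceTempView(name)
    }
    // Run the query and write the joined result as Parquet.
    spark.sql(sql).write.format("parquet").save(outputPath)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("query-sketch").getOrCreate()
    // args: <orders path> <order_items path> <output path>
    val sources = Map("orders" -> args(0), "order_items" -> args(1))
    try run(spark, sources, query, args(2))
    finally spark.stop()
  }
}
```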

Run

Usage: spark-submit... <application-jar> [options]

Available options:

option	description
-j, --job	application (job) name; required
-n, --namespace	configuration namespace; optional
-u, --url	config URL; optional
-l, --literal	literal config; optional

Example:

./bin/spark-submit \
  --class ro.esolutions.datatweak.apps.QueryApp \
  --master spark://localhost:7077 \
  --conf spark.eventLog.enabled=false \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  hdfs://spark/jars/apps-0.3.1.jar \
  -j queryJobs \
  -u http://localhost:8080/config/wrangling

Note: in this example the config service at localhost:8080 returns the Data wrangling configuration shown above.