datatweak

Data Tweak is a simplified, lightweight ETL framework based on Apache Spark.


DataTweak

Supported input

Supported output

Quickstart

This guide walks through the main features of Data Tweak. It provides config snippets that show how to read data, define the steps and queries of an ETL job, and write the results.

Config


DataTweak configuration is based on PureConfig, which reads a config from:

Data ingest

Read a CSV file with a header row, apply an explicit schema, and save the result in Avro format.

main class: ro.esolutions.datatweak.apps.IngestApp

    input: {
        format = "csv"
        path = "file:///datasets/users.csv"
        options = {
            "header": "true"
        }
        schema = """{
            "type": "struct",
            "fields": [{
              "name": "id",
              "type": "integer",
              "nullable": false
            }, {
              "name": "name",
              "type": "string",
              "nullable": false
            }, {
              "name": "age",
              "type": "integer",
              "nullable": true
            }]
          }"""
    }
    output: {
        format = "avro"
        path = "file:///tmp/bootcamp/avro/"
    }
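To try the ingest config locally, the input needs a CSV file whose header matches the schema fields. The following is a minimal sketch, assuming a scratch directory instead of `/datasets`; the `DATA_DIR` variable and the three rows are made up for illustration. Point `path` in the input config at the resulting file.

```shell
#!/bin/sh
# Hypothetical sample data for the ingest config; the rows are illustrative only.
DATA_DIR="${DATA_DIR:-$(mktemp -d)}"
CSV_FILE="$DATA_DIR/users.csv"

# The header names match the schema fields: id, name, age.
# age is nullable in the schema, so the last row leaves it empty.
cat > "$CSV_FILE" <<'EOF'
id,name,age
1,Ana,34
2,Mihai,41
3,Ioana,
EOF

echo "wrote $CSV_FILE"
```

Because `id` and `name` are declared with `"nullable": false`, every sample row fills both columns.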

Data wrangling

Read two Avro datasets, join them, and save the result in Parquet format.

main class: ro.esolutions.datatweak.apps.QueryApp

    source: [
        {
            "name": "orders"
            input: {
                format = "avro"
                path = "file:///home/lucian/workspace/bigdata/datasets/retail/warehouse/orders/"
            }
        },
        {
            "name": "order_items"
            input: {
                format = "avro"
                path = "file:///home/lucian/workspace/bigdata/datasets/retail/warehouse/order_items/"
            }
        }
    ]
    query: "SELECT * FROM orders o JOIN order_items i ON (o.order_id == i.order_item_order_id)"
    output: {
        format = "parquet"
        path = "file:///tmp/bootcamp/parquet/"
    }

Run


Usage: spark-submit... <application-jar> [options]

Available options:

option            description
-j, --job         required application name property
-n, --namespace   optional configuration namespace property
-u, --url         optional config url property
-l, --literal     optional literal config property

Example:

    ./bin/spark-submit \
      --class ro.esolutions.datatweak.apps.QueryApp \
      --master spark://localhost:7077 \
      --conf spark.eventLog.enabled=false \
      --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
      hdfs://spark/jars/apps-0.3.1.jar \
      -j queryJobs \
      -u http://localhost:8080/config/wrangling

Note: the config service (localhost:8080) returns the Data wrangling config.
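When several jobs are submitted the same way, the command can be assembled by a small wrapper. The sketch below is a dry run that only prints the command. The `build_submit_cmd` function and its argument order are hypothetical; the master URL and jar path are copied from the example, and only the `-j` and `-u` options are used.

```shell
#!/bin/sh
# Hypothetical helper: prints a spark-submit command for a DataTweak job.
# It does not run anything; pipe the output to sh once it looks right.
build_submit_cmd() {
    main_class="$1"   # e.g. ro.esolutions.datatweak.apps.QueryApp
    job="$2"          # value for -j (required)
    config_url="$3"   # value for -u (optional)

    # Refuse to build a command without the main class and the job name.
    if [ -z "$main_class" ] || [ -z "$job" ]; then
        echo "usage: build_submit_cmd <main-class> <job> [config-url]" >&2
        return 1
    fi

    cmd="./bin/spark-submit --class $main_class --master spark://localhost:7077 hdfs://spark/jars/apps-0.3.1.jar -j $job"
    # Append -u only when a config URL was given.
    if [ -n "$config_url" ]; then
        cmd="$cmd -u $config_url"
    fi
    echo "$cmd"
}

build_submit_cmd ro.esolutions.datatweak.apps.QueryApp queryJobs http://localhost:8080/config/wrangling
```

Printing instead of executing keeps the sketch safe to run on a machine without Spark installed.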