
Setup Pipeline

Templates#

RDBMS to BigQuery#

Load data from an RDBMS table (MySQL, Postgres, ...) into BigQuery.

Supported:

  • MySQL
  • Postgres

MongoDB to BigQuery#

Load data from a MongoDB collection into BigQuery.

Generic Transformation#

Create dependencies for transformations.

If you have multiple transformations that you want to run sequentially, you can set up a pipeline and define the order in which each transformation runs.

info

Each template has slightly different configurations depending on the datasource.

Configuration#

General Information#

  • Name

  • Description

  • Schedule Time: schedules are in the Ho Chi Minh timezone (UTC+7)

    • On demand: run manually.

    • Basic options: Hourly, Daily, Weekly, Monthly, Yearly (these run at the beginning of each time unit). For example:

      • Hourly: starts at minute 0 of every hour.
      • Daily: starts at 00:00 every day.
    • Custom option: this gives you the flexibility to define your own schedule expression (see the example below).

      Please find the details here

      note

      If you choose Custom, you have to validate your expression before saving the pipeline.

      Minimum schedule interval: 10 minutes.
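
As an illustration only (and assuming the Custom option accepts a standard cron expression; see the linked reference for the exact syntax supported), a schedule that runs every 30 minutes could look like:

    */30 * * * *

The five fields are minute, hour, day of month, month, and day of week. Remember to validate the expression before saving the pipeline, and keep the interval at 10 minutes or more.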

Source Information#

IMPORTANT

Source Datasource, Table Name, and Load Strategy cannot be changed once the pipeline is created.

  • Source Datasource: the datasource (RDBMS or MongoDB) that was created in Datasources.

  • Table Name: the name of the source table you want to extract data from.

  • Extract Mode: how data is extracted from the source table

    • Full: extract all data from the source table each time the pipeline runs.

    • Incremental: extract only part of the data from the source table each time the pipeline runs. This is the preferred way to extract data from large tables (see the sketch after this section).

      If this mode is selected, you will need to provide Max fields (columns) that are used to keep track of which data in the source table has already been extracted.

      note

      You can define multiple max fields (separated by a comma ,) and change this option later.

  • Load Strategy

    • Merge (similar to Upsert): when loading data into the destination table, insert records that do not exist yet as new rows, and update existing records based on Key fields (see the sketch after this section).

      If this strategy is selected, you will need to provide Key fields (usually the primary key for RDBMS tables, or _id for MongoDB).

    • Overwrite: when loading data into the destination table, replace all existing data in the table with the new data.

    • Append: when loading data into the destination table, only insert new rows; nothing is updated or deleted.

      caution

      Since BigQuery is a columnar database, its tables fundamentally have no primary keys and therefore no uniqueness constraints. Be careful when you choose this strategy, because the destination table might contain duplicate data.

There are some additional configurations for MongoDB; you can check them out here
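
To illustrate Incremental mode for an RDBMS source: each run conceptually extracts only the rows whose max field values are greater than the maximum recorded by the previous run. A rough sketch (the actual extraction query is generated by the pipeline; updated_at and the literal value are hypothetical):

    SELECT *
    FROM source_table
    WHERE updated_at > '2021-09-01 00:00:00'  -- max value recorded by the previous run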
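
The Merge strategy behaves much like a BigQuery MERGE statement keyed on the Key fields. A minimal sketch, assuming a single key field id and hypothetical project, dataset, table, and column names (the pipeline performs this step for you):

    MERGE `project.dataset.destination_table` AS dest
    USING `project.dataset.staging_table` AS stg
    ON dest.id = stg.id                      -- Key field
    WHEN MATCHED THEN
      UPDATE SET name = stg.name, updated_at = stg.updated_at
    WHEN NOT MATCHED THEN
      INSERT (id, name, updated_at) VALUES (stg.id, stg.name, stg.updated_at)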

Google Cloud Storage (GCS) Information#

  • Staging GCS Bucket: the GCS bucket where extracted data will be staged.

  • Staging GCS Prefix: a prefix folder inside the bucket. If provided, data will be stored as follows:

    prefix_folder/table_name/staging_partition

    Otherwise, the default layout applies:

    table_name/staging_partition

BigQuery Staging Information#

  • Staging Dataset: the name of the dataset where the staging table will be located.

    info

    If the dataset does not exist, the pipeline will create a dataset with the provided name when it runs. This also applies to the Destination Dataset.

  • Staging Table: the name of the table that will be created as an external table linked to the GCS folder.

  • Custom Staging Partition: (Hive partitioning) if you are familiar with Hive partitioning, you can define your own partition; otherwise, just skip this option.

    If this option is checked, you will need to provide:

    • Staging Table Partition Columns: the columns used for partitioning (separated by a comma ,).

      Default: dt, hour

    • Staging Table Partition Format: the format of the above columns (separated by a slash /).

      Supported formats: date, year, month, day, hour.

      Default: date/hour

      caution

      Please provide the formats in the same order as the columns.

    Example

    • Staging Table Partition Columns: year, month, day, hour
    • Staging Table Partition Format: year/month/day/hour
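
    With the example above, the staged data in GCS is laid out in Hive-style column=value folders, for instance (illustrative values):

    table_name/year=2021/month=09/day=01/hour=00/

    or, if a Staging GCS Prefix is set:

    prefix_folder/table_name/year=2021/month=09/day=01/hour=00/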

BigQuery Destination Information#

  • Dwh Datasource: the BigQuery connection that was created in Datasources.

  • Destination Dataset: the name of the dataset where the destination table will be located.

  • Destination Table: the name of the table that will be created as a native table.

  • Auto Update Schema: if this option is enabled, whenever the source table has new fields, they will be added to the destination table. If an Alert Channel has been set, a message will be sent to that channel to notify you that the table schema has been updated with new fields.

    The date each new field was added is recorded in the field's description.


    If you do not enable this option, you will still receive a message about new fields and will have to update the destination table schema manually. If you have not updated the destination table schema within 3 pipeline runs, we assume you have decided to ignore the new fields, and we will not send any further messages about new fields for that table unless you update the destination table schema to match the source table schema or enable this option.

note

Currently, BigQuery does not support updating a column's type, so this option only adds new fields.

  • Is Partitioned Table: create the destination table as a partitioned table (see the DDL sketch at the end of this section). If this option is checked, you will need to provide:

    • Destination Table Partition Column Field: the column that will be used to partition the table (TIMESTAMP or DATE).

    • Destination Table Partition Column Type: the time granularity of the partitions.

      Supported: Day, Hour, Month, Year

      note

      BigQuery only supports 4000 partitions per table. Day is the recommended smallest partition type; you should not use Hour unless the table is very large (events, streaming).

    • Destination Table Require Partition Filter: by default, you can execute this query against a partitioned table:

      SELECT * FROM partitioned_table

      However, with this option enabled, you cannot run the query above. You have to add a filter on the partition column:

      SELECT * FROM partitioned_table WHERE partitioned_column > '2021-09-01'

      This comes in handy when you have a big table and users are not familiar with partitioning.
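
For reference, the destination table settings above correspond roughly to the following BigQuery DDL. This is only a sketch with hypothetical project, dataset, table, and column names; the pipeline creates and manages the table for you:

    CREATE TABLE `project.dataset.destination_table` (
      id INT64,
      created_at TIMESTAMP
    )
    PARTITION BY TIMESTAMP_TRUNC(created_at, DAY)   -- Partition Column Field + Partition Column Type
    OPTIONS (require_partition_filter = TRUE);      -- Destination Table Require Partition Filter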