Builds

A build refers to the process of running a transformation and producing an output dataset.

Regular Build

A build can be triggered by selecting one or more transformations on the lineage page and clicking the Build button.

Note: When multiple transformations are selected, the build system resolves their dependencies and chooses the most efficient build sequence.

Upstream Build

When a transformation has many upstream dependencies, it can be beneficial to use the Upstream Build functionality: given a selected transformation, it recursively crawls all upstream dependencies and builds them before building the selected transformation itself.

Draft Build

To enable rapid iteration during development, we should not have to commit every time we want to test our changes and rebuild our dataset with new code.

The Draft Build serves as a short-lived build to test out our code and see the resulting dataset quickly without having to commit our changes.

When running a Draft Build on a transformation that has upstream dependencies, the scheduling engine first attempts to pull a Draft Build dataset from each upstream, if one is available, before falling back to the regularly built dataset. This behavior enables testing new code on dependent datasets without having to commit the changes.
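
As a mental model, this draft-first fallback can be sketched as follows. The resolver function and the draft_data.parquet name are purely illustrative assumptions, not actual engine code; only the last_data.parquet path is taken from the incremental build example further below.

import os

def resolve_upstream_dataset(transform_id: str) -> str:
    """Illustrative sketch of the draft-first fallback, not actual engine code."""
    base = f"/data/{transform_id}/datasets"
    draft_path = f"{base}/draft_data.parquet"   # hypothetical draft output name
    regular_path = f"{base}/last_data.parquet"  # regular build output
    # Prefer the draft dataset when it exists; otherwise fall back to the
    # regularly built dataset.
    return draft_path if os.path.isfile(draft_path) else regular_path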

Incremental Build

A regular build always deletes the current dataset and replaces it with the freshly built one. In some cases, however, it is desirable to retain old data, for example when historisation (keeping a history of records across builds) is required. This use case is supported and can be achieved with the following setup:

DataSpace internally stores transforms as folders, their code as files, and the produced datasets as Parquet files. To perform an incremental build, the transform's existing Parquet file can be read manually and concatenated with the newly calculated data.
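
As a rough sketch of that layout (the transform.py file name is an assumption; only the datasets/last_data.parquet path is taken from the example below):

/data/
    <TRANSFORM_ID>/            # one folder per transform
        transform.py           # the transform's code (file name assumed)
        datasets/
            last_data.parquet  # dataset produced by the most recent build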

import polars as pl
from datetime import datetime
import os

def transform():
    # Newly calculated data for this build (toy example: a single timestamp row).
    lf = pl.LazyFrame({"date": [datetime.now()]})

    # Dataset written by the previous build of this transform.
    old_path = f"/data/{os.environ['TRANSFORM_ID']}/datasets/last_data.parquet"
    if os.path.isfile(old_path):
        # Lazily read the previous dataset and append it to the new data.
        lf_old = pl.scan_parquet(old_path)
        lf = pl.concat([lf, lf_old], how="diagonal_relaxed")
    else:
        print("No prior data in database.")

    return lf
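
The how="diagonal_relaxed" strategy makes the concatenation robust to schema drift between builds: columns missing on either side are filled with nulls, and mismatched column dtypes are coerced to a common supertype. A plain vertical concat would instead fail as soon as the old and new schemas diverge.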

It often makes sense to run such an incremental build periodically. Please refer to Build Schedules to learn more.
