Transform Function

Parameters

When you define a transform function, DataSpace automatically populates each of its parameters with the corresponding dataframe, passed as a polars.LazyFrame.

import polars as pl

# DataSpace passes each parameter in as a polars.LazyFrame
def transform(input: pl.LazyFrame):
    ...

Return Value

A transform function must return a value. The return type can be one of the following:

  • polars.LazyFrame

  • polars.DataFrame

  • pandas.DataFrame

The recommended return type is polars.LazyFrame. A LazyFrame can be optimized by the query planner and can use the streaming engine to perform out-of-core computations. This typically results in significantly faster execution than the eager evaluation of a polars.DataFrame, and can be orders of magnitude faster than a pandas.DataFrame.

Additionally, DataSpace uses the query plan of the LazyFrame to deduce column lineage, something that’s only possible when returning a LazyFrame.

Metadata

Input parameters also expose a special attribute called ds_meta. This attribute contains metadata about the DataSnapshot being used. The following fields are available through this attribute:

Attribute          Type   Description

transform_id       str    The Transform ID of the dataframe
artifact_dir       str    The directory path where the artifacts are stored
data_snapshot_id   str    The DataSnapshot ID of the dataframe
build_id           str    The Build ID of the dataframe
row_count          int    Number of rows
column_count       int    Number of columns
file_size          int    The size of the Parquet file
creation_date      str    The date when the dataset was created
columns            list   The columns of the dataframe
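To illustrate reading these fields, here is a small sketch that summarizes an input's metadata. It assumes the fields are reachable as plain attributes on ds_meta (for example input.ds_meta.row_count), which may differ from the actual access style. The helper name describe_input and all sample values are hypothetical, and a SimpleNamespace stands in for a real DataSpace input so the snippet runs on its own.

```python
from types import SimpleNamespace


def describe_input(frame) -> str:
    # Assumption: ds_meta exposes its fields as attributes.
    meta = frame.ds_meta
    return (
        f"snapshot {meta.data_snapshot_id}: "
        f"{meta.row_count} rows x {meta.column_count} columns"
    )


# Stand-in for an input parameter; every value below is made up.
fake_input = SimpleNamespace(
    ds_meta=SimpleNamespace(
        data_snapshot_id="snap-123",
        row_count=1000,
        column_count=3,
        columns=["id", "amount", "date"],
    )
)
summary = describe_input(fake_input)
```

Inside a real transform, the same helper could be called on any input parameter, for instance to log the snapshot being processed before building the query.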
