Transform Function
Parameters
When you specify a transform function, all of its parameters are automatically populated with the appropriate dataframes of type polars.LazyFrame.
# input is of type polars.LazyFrame
def transform(input):
    ...
Return Value
A transform function must return a value. The return type can be one of the following:
polars.LazyFrame
polars.DataFrame
pandas.DataFrame
The recommended return type is polars.LazyFrame.
A LazyFrame can be optimized by the query planner and can use the streaming engine to perform out-of-core computations. This results in significantly faster execution than the eager mode of a polars.DataFrame, and can be orders of magnitude faster than a pandas.DataFrame.
Additionally, DataSpace uses the query plan of the LazyFrame to deduce column lineage, something that’s only possible when returning a LazyFrame.
Metadata
Input parameters also expose a special attribute called ds_meta. This attribute contains metadata about the DataSnapshot being used. The following fields are available through this attribute:
Field             Type  Description
transform_id      str   The Transform ID of the dataframe
artifact_dir      str   The directory path where the artifacts are stored
data_snapshot_id  str   The DataSnapshot ID of the dataframe
build_id          str   The Build ID of the dataframe
row_count         int   Number of rows
column_count      int   Number of columns
file_size         int   The size of the Parquet file
creation_date     str   The date when the dataset was created
columns           list  The columns of the dataframe
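The sketch below shows one way these fields might be read inside a transform. It assumes only that ds_meta exposes the fields listed above as plain attributes. The helper describe_snapshot, the stand-in objects, and all ID values are hypothetical and exist only so the example runs outside DataSpace.

```python
from types import SimpleNamespace


def describe_snapshot(meta) -> str:
    """Build a one-line summary from ds_meta fields (hypothetical helper)."""
    return (
        f"snapshot={meta.data_snapshot_id} build={meta.build_id} "
        f"rows={meta.row_count} cols={meta.column_count}"
    )


def transform(input):
    # In DataSpace, `input` is a polars.LazyFrame that carries ds_meta.
    print(describe_snapshot(input.ds_meta))
    return input


# Stand-ins for local experimentation only; every value here is made up.
fake_meta = SimpleNamespace(
    data_snapshot_id="snap-001",
    build_id="build-001",
    row_count=3,
    column_count=2,
)
fake_input = SimpleNamespace(ds_meta=fake_meta)
summary = describe_snapshot(fake_meta)
```

Reading ds_meta does not touch the data itself, so a check such as comparing row_count against a threshold should not force the LazyFrame to be collected.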