Scalability

DataSpace is primarily designed to run on on-premises hardware rather than rely on distributed cloud services. This is a deliberate decision, motivated by the industry's over-reliance on cloud infrastructure and by the security and privacy concerns that arise when it is unclear where data is sent for compute and storage.

Furthermore, the term big data is sometimes conflated with, and incorrectly applied to, rather small datasets (e.g., ~10M rows). In such cases, spinning up a distributed Spark cluster for a trivial computation can hurt performance and incur unnecessary cost.

Our strategy is to focus on the range of data sizes very common in SMBs: too large to handle well in Excel, yet still small enough not to qualify as big data. In this range, modern open-source engines such as Polars have massively optimized single-node computation through various low-level techniques, so running a transformation on local hardware is often faster and cheaper than shipping it to a cloud computing cluster, as sketched below.
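
As an illustration, a typical mid-sized aggregation runs lazily on a single node. This is a minimal sketch; the file and column names (sales.csv, amount, region) are hypothetical:

import polars as pl

# Lazily scan the file; nothing is read until .collect() is called
lf = pl.scan_csv("sales.csv")

# The query optimizer applies predicate and projection pushdown, so only
# the rows and columns the query needs are ever materialized
result = (
    lf.filter(pl.col("amount") > 0)
    .group_by("region")
    .agg(pl.col("amount").sum().alias("total_amount"))
    .collect()
)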

Consequently, the provisioned infrastructure determines the feasibility and scalability of the transformations that can be performed. Some techniques, however, can stretch what is possible even when the hardware is limited.

Out-of-Memory Streaming

When a transformation must run on a dataset that exceeds the hardware's memory limits, we can fall back to a streaming approach in which data is processed in chunks, allowing datasets larger than the available memory to be handled.

Streaming mode is enabled by default when the transformation function returns a LazyFrame.
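
A minimal sketch of such a transformation function, assuming a hypothetical entry point named transform (the exact signature depends on your runner configuration) and a hypothetical events.parquet dataset:

import polars as pl

def transform() -> pl.LazyFrame:
    # Returning the LazyFrame instead of calling .collect() lets the runner
    # execute the query with the streaming engine, processing data in chunks
    # that never have to fit into memory all at once
    return (
        pl.scan_parquet("events.parquet")
        .filter(pl.col("status") == "ok")
        .group_by("user_id")
        .agg(pl.len().alias("event_count"))
    )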

Learn more about streaming in Polars in the official documentation.
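
For experimenting outside of a transformation function, streaming can also be requested explicitly when collecting. A sketch; note that the exact flag depends on your Polars version (older releases use collect(streaming=True)):

import polars as pl

# Ask the streaming engine to execute the query in memory-bounded chunks
df = pl.scan_parquet("events.parquet").collect(engine="streaming")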

Remote Distributed Compute

Going beyond streaming and local hardware, if cloud computing is unavoidable, Polars Cloud, Polars' remote distributed compute and query service, can be used.

The setup is straightforward: instead of the DataSpace runners performing the computation, the LazyFrame query is sent to the provisioned Polars Cloud workspace, where the work is distributed and executed.

Little code change is needed. A ComputeContext is defined with the desired cluster sizing, and the remote function is called on the query before returning:

import polars_cloud as pc

# Ship the LazyFrame query to the workspace cluster for distributed execution
ctx = pc.ComputeContext(workspace="DataSpace-Workspace", cpus=16, memory=64)
return query.remote(context=ctx)
