Google Drive

Single File Extraction

To ingest a CSV file from Google Drive, you first have to enable sharing on the file by following the instructions on the Google Drive Help Page.

The generated share link will look something like this: https://drive.google.com/file/d/1Se7_LKZykBWweXpBths1oCmgGTGK4yyD/view?usp=sharing

This link opens the Google Drive web interface. Since we want the file itself rather than the viewer page, the link must be modified: extract the file ID from the original URL and combine it with the direct file access endpoint:

https://drive.google.com/uc?id=1Se7_LKZykBWweXpBths1oCmgGTGK4yyD
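If you do this often, the rewrite can be scripted. A minimal sketch (the helper name `share_link_to_direct_url` is ours, not part of any library):

```python
import re

def share_link_to_direct_url(share_link: str) -> str:
    """Extract the file ID from a Google Drive share link and
    build the corresponding direct-access URL."""
    match = re.search(r"/file/d/([A-Za-z0-9_-]+)", share_link)
    if match is None:
        raise ValueError(f"Could not find a file ID in: {share_link}")
    return f"https://drive.google.com/uc?id={match.group(1)}"

link = "https://drive.google.com/file/d/1Se7_LKZykBWweXpBths1oCmgGTGK4yyD/view?usp=sharing"
print(share_link_to_direct_url(link))
# https://drive.google.com/uc?id=1Se7_LKZykBWweXpBths1oCmgGTGK4yyD
```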

The full code is as follows:

import polars as pl
import os

# Build the direct-access URL; the file ID is injected from the secrets store
url = f"https://drive.google.com/uc?id={os.environ['CSV_FILE_ID']}"

def transform():
    df = pl.read_csv(url)
    return df

Typically, a file URL contains an access token or file ID. Avoid hard-coding this key into the string; instead, use the secrets store to inject it during the build.


Multiple File Extraction

If you have a folder with many files you would like to extract, sharing every single file manually is not feasible. In this case, we can leverage Google's Drive API to programmatically access the shared folder, index its files, and download them all.

1. Create a Google Cloud Project

  1. Go to https://console.cloud.google.com/.

  2. Click the project selector (top left) and choose “New Project.”

  3. Give it a descriptive name, for example DataSpace Drive Access.

  4. Click Create.

2. Enable the Google Drive API

  1. In the left sidebar, go to APIs & Services → Library.

  2. Search for Google Drive API.

  3. Click Enable.

3. Create a Service Account

  1. Go to APIs & Services → Credentials.

  2. Click Create Credentials → Service Account.

  3. Enter a name like dataspace-drive-access.

  4. Leave Permissions and Principals with access empty — no roles or users are needed.

  5. Click Done.

4. Create a Key File

  1. In the service account list, click your new account.

  2. Open the Keys tab.

  3. Click Add Key → Create New Key → JSON.

  4. Save the downloaded .json file (for example, mcp.json).

You’ll need to upload this file to your DataSpace workspace later.
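The downloaded key file roughly follows Google's standard service-account key format (values abbreviated here for illustration):

```json
{
  "type": "service_account",
  "project_id": "your-project-id",
  "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
  "client_email": "dataspace-drive-access@your-project-id.iam.gserviceaccount.com",
  "token_uri": "https://oauth2.googleapis.com/token"
}
```

The client_email value is the address you will share your Drive folder with.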

5. Share Your Google Drive Folder with the Service Account

  1. Go to Google Drive.

  2. Right-click your folder and choose Share.

  3. Copy the client email from the service account JSON file (it looks like dataspace-drive-access@your-project-id.iam.gserviceaccount.com).

  4. Add that email as a Viewer.

  5. Copy the folder ID from the URL — it’s the long string between /folders/ and the next /. Example:

    https://drive.google.com/drive/folders/1r1cDZzgE7wclTYMv_Xzdv98znO9kHN8_
    → Folder ID: 1r1cDZzgE7wclTYMv_Xzdv98znO9kHN8_
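Extracting the folder ID can also be done programmatically. A minimal sketch (the helper name `extract_folder_id` is ours):

```python
import re

def extract_folder_id(folder_url: str) -> str:
    """Pull the folder ID out of a Google Drive folder URL."""
    match = re.search(r"/folders/([A-Za-z0-9_-]+)", folder_url)
    if match is None:
        raise ValueError(f"No folder ID found in: {folder_url}")
    return match.group(1)

url = "https://drive.google.com/drive/folders/1r1cDZzgE7wclTYMv_Xzdv98znO9kHN8_"
print(extract_folder_id(url))
# 1r1cDZzgE7wclTYMv_Xzdv98znO9kHN8_
```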

6. Prepare Your DataSpace Workspace

In DataSpace, declare the dependencies in _config.json:

{
  "packages": [
    "google-api-python-client",
    "google-auth"
  ]
}

Make sure your service account key (mcp.json) is uploaded to the workspace root.

7. Write the Transformation

Now we can download the files and save them in the artifacts folder for further processing downstream.

import polars as pl
from google.oauth2 import service_account
from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseDownload
import io, os

# Google Drive folder ID
FOLDER_ID = "<REPLACE_WITH_FOLDER_ID>"

# Path to your service account credentials
SERVICE_ACCOUNT_FILE = "./mcp.json"

# Drive API scope
SCOPES = ["https://www.googleapis.com/auth/drive.readonly"]

def transform():
    # Authenticate using the service account
    creds = service_account.Credentials.from_service_account_file(
        SERVICE_ACCOUNT_FILE, scopes=SCOPES
    )
    service = build("drive", "v3", credentials=creds)

    # Query all Excel files in the folder; "name contains '.xls'" matches
    # both .xls and .xlsx, and "trashed = false" skips deleted files
    query = f"'{FOLDER_ID}' in parents and name contains '.xls' and trashed = false"
    results = service.files().list(q=query, fields="files(id, name)").execute()
    files = results.get("files", [])

    # Download files into the artifacts folder
    for f in files:
        print(f"Downloading {f['name']}...")
        request = service.files().get_media(fileId=f["id"])
        with io.FileIO(os.path.join(os.environ["ARTIFACT_FOLDER"], f["name"]), "wb") as fh:
            downloader = MediaIoBaseDownload(fh, request)
            done = False
            while not done:
                status, done = downloader.next_chunk()
                if status:
                    print(f"  {int(status.progress() * 100)}%")

    print("✅ All files downloaded successfully")

    # Return an empty DataFrame (optional)
    df = pl.DataFrame()
    return df

All downloaded files are automatically stored in the artifacts folder, so they persist across runs and are available for further processing.
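Note that files().list returns a single page of results (up to 1,000 files with pageSize; fewer by default). For folders with more files than that, follow nextPageToken until it is absent. A minimal sketch of a paging helper that could wrap the list call above (list_all_files is our own name, not part of the Drive client):

```python
def list_all_files(service, query: str) -> list[dict]:
    """Collect every file matching `query`, following Drive API pagination.

    `service` is the same Drive client built with googleapiclient's build().
    """
    files: list[dict] = []
    page_token = None
    while True:
        response = service.files().list(
            q=query,
            fields="nextPageToken, files(id, name)",
            pageToken=page_token,
        ).execute()
        files.extend(response.get("files", []))
        page_token = response.get("nextPageToken")
        if page_token is None:
            return files
```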

Summary

You’ve now successfully configured your DataSpace workspace to:

  • Authenticate securely via a Google service account

  • Access a shared Google Drive folder

  • Automatically download Excel files into the artifacts folder
