Databricks

Databricks is a unified data analytics platform built on Apache Spark, designed for data engineering, machine learning, and analytics at scale. Combined with Tensorlake’s document parsing and serverless agentic application runtime, you can build AI workflows and agents that automate the processing of documents and other unstructured data and land the results in Databricks. In Databricks’s Medallion Architecture, Tensorlake extracts semi-structured (JSON) or structured data from unstructured sources and writes it to Bronze-stage tables, enabling enterprises to increase data coverage in Databricks for downstream analytics use cases.

Integration Architecture

There are two main ways of integrating Tensorlake with Databricks:
  1. Document Ingestion API: Use Tensorlake’s Document Ingestion API from Databricks Jobs or Notebooks to extract structured data or markdown from documents, then load them into Databricks tables.
  2. Full Ingestion Pipeline on Tensorlake: Build the entire pipeline of ingestion, transformation, and writing to Databricks on Tensorlake’s platform. These pipelines are exposed as HTTP APIs and run whenever data is ingested, eliminating infrastructure management and scaling concerns. Tensorlake lets you write distributed Python applications, making it straightforward to build and deploy scalable pipelines.

Installation

pip install tensorlake databricks-sql-connector pandas pyarrow
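
The examples in this guide read Databricks credentials from environment variables, and the Tensorlake SDK needs an API key as well. A quick pre-flight check in Python (TENSORLAKE_API_KEY is an assumption about how the Tensorlake SDK is configured; verify the exact variable name for your account):
import os

# Credentials used throughout this guide. The Databricks variables are read by
# the examples below; TENSORLAKE_API_KEY is an assumption about how the
# Tensorlake SDK picks up credentials, so confirm the exact name for your account.
required = [
    "DATABRICKS_SERVER_HOSTNAME",
    "DATABRICKS_HTTP_PATH",
    "DATABRICKS_ACCESS_TOKEN",
    "TENSORLAKE_API_KEY",
]
missing = [name for name in required if not os.getenv(name)]
if missing:
    raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")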

Quick Start: Simple Document-to-Database Integration

This example demonstrates the core integration pattern between Tensorlake’s DocumentAI and Databricks.

Step 1: Extract Structured Data from a Document

Define a schema and extract structured data using Tensorlake:
from tensorlake.documentai import DocumentAI, StructuredExtractionOptions
from pydantic import BaseModel, Field
from typing import List

# Define your extraction schema
class CompanyInfo(BaseModel):
    """Basic company information from a document"""
    company_name: str = Field(description="Name of the company")
    revenue: str = Field(description="Annual revenue")
    industry: str = Field(description="Primary industry")

# Initialize DocumentAI
doc_ai = DocumentAI()

# Extract structured data
result = doc_ai.parse_and_wait(
    file="https://example.com/company-report.pdf",
    structured_extraction_options=[
        StructuredExtractionOptions(
            schema_name="CompanyInfo",
            json_schema=CompanyInfo
        )
    ]
)

extracted_data = result.structured_data[0].data
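
The structured payload behaves like a plain dictionary keyed by the schema’s field names, which is how the next step reads it. The values below are purely illustrative:
# Inspect the extracted fields (sample values are hypothetical)
print(extracted_data)
# {'company_name': 'Acme Corp', 'revenue': '1,200,000 USD', 'industry': 'Manufacturing'}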

Step 2: Load Data into Databricks

Connect to Databricks SQL Warehouse and insert the extracted data:
from databricks import sql
import pandas as pd
import os

# Connect to Databricks
connection = sql.connect(
    server_hostname=os.getenv("DATABRICKS_SERVER_HOSTNAME"),
    http_path=os.getenv("DATABRICKS_HTTP_PATH"),
    access_token=os.getenv("DATABRICKS_ACCESS_TOKEN")
)

cursor = connection.cursor()

# Create table if it doesn't exist
cursor.execute("""
    CREATE TABLE IF NOT EXISTS companies (
        company_name STRING,
        revenue STRING,
        industry STRING
    )
""")

# Insert the extracted data
cursor.execute("""
    INSERT INTO companies (company_name, revenue, industry)
    VALUES (?, ?, ?)
""", (
    extracted_data.get('company_name'),
    extracted_data.get('revenue'),
    extracted_data.get('industry')
))

connection.commit()
connection.close()
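
For clarity, this loads a single document as one row. When processing many documents, collect the extracted records and insert them in a batch; a sketch of that pattern appears under How the Integration Works below.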

Step 3: Query Your Data

Run SQL analytics on the document data:
# Reconnect for queries
connection = sql.connect(
    server_hostname=os.getenv("DATABRICKS_SERVER_HOSTNAME"),
    http_path=os.getenv("DATABRICKS_HTTP_PATH"),
    access_token=os.getenv("DATABRICKS_ACCESS_TOKEN")
)

cursor = connection.cursor()

# Example: Query companies by industry
cursor.execute("""
    SELECT 
        industry,
        COUNT(*) as company_count,
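        -- note: the cast below assumes revenue was extracted as a plain numeric string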
        AVG(CAST(revenue AS DECIMAL)) as avg_revenue
    FROM companies
    GROUP BY industry
    ORDER BY company_count DESC
""")

results = cursor.fetchall()
for row in results:
    print(row)

connection.close()
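
Because pandas and pyarrow are already installed, you can also pull query results directly into a DataFrame. The following is a minimal sketch that assumes your databricks-sql-connector version provides the Arrow fetch helper:
from databricks import sql
import os

connection = sql.connect(
    server_hostname=os.getenv("DATABRICKS_SERVER_HOSTNAME"),
    http_path=os.getenv("DATABRICKS_HTTP_PATH"),
    access_token=os.getenv("DATABRICKS_ACCESS_TOKEN")
)

cursor = connection.cursor()
cursor.execute("SELECT company_name, revenue, industry FROM companies")

# fetchall_arrow() returns a pyarrow Table in recent connector releases;
# fall back to pd.DataFrame(cursor.fetchall()) if your version does not have it.
df = cursor.fetchall_arrow().to_pandas()
print(df.head())

connection.close()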

How the Integration Works

The integration follows a straightforward pipeline:
  1. Document Processing: Tensorlake’s DocumentAI parses documents and extracts structured data based on your Pydantic schemas
  2. Data Transformation: Extracted data is converted into a format compatible with Databricks, typically DataFrames or dictionaries (a batch-loading sketch follows this list)
  3. Database Loading: Data is loaded into Databricks tables using the databricks-sql-connector library
  4. SQL Analytics: Run complex queries, joins, and aggregations on your document data using standard SQL
  5. Orchestration: You can orchestrate this process from Databricks Jobs, Notebooks, or any other orchestrator.
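
The single-row flow above extends naturally to batches. Below is a minimal sketch of steps 1–3 for several documents: each extracted dict is collected into a pandas DataFrame and the rows are inserted with executemany. The document URLs are placeholders, and the executemany call with positional ? markers follows the parameter style of Step 2; confirm both against your databricks-sql-connector version:
from tensorlake.documentai import DocumentAI, StructuredExtractionOptions
from databricks import sql
from pydantic import BaseModel, Field
import pandas as pd
import os

# Same schema as in Step 1
class CompanyInfo(BaseModel):
    """Basic company information from a document"""
    company_name: str = Field(description="Name of the company")
    revenue: str = Field(description="Annual revenue")
    industry: str = Field(description="Primary industry")

doc_ai = DocumentAI()

# Placeholder list of documents to process
document_urls = [
    "https://example.com/report-a.pdf",
    "https://example.com/report-b.pdf",
]

rows = []
for url in document_urls:
    result = doc_ai.parse_and_wait(
        file=url,
        structured_extraction_options=[
            StructuredExtractionOptions(schema_name="CompanyInfo", json_schema=CompanyInfo)
        ]
    )
    rows.append(result.structured_data[0].data)

# Collect every extracted record into a DataFrame before loading
df = pd.DataFrame(rows, columns=["company_name", "revenue", "industry"])

connection = sql.connect(
    server_hostname=os.getenv("DATABRICKS_SERVER_HOSTNAME"),
    http_path=os.getenv("DATABRICKS_HTTP_PATH"),
    access_token=os.getenv("DATABRICKS_ACCESS_TOKEN")
)
cursor = connection.cursor()

# Insert one row per document
cursor.executemany(
    "INSERT INTO companies (company_name, revenue, industry) VALUES (?, ?, ?)",
    list(df.itertuples(index=False, name=None))
)

connection.close()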

Full Ingestion Pipeline on Tensorlake

With this approach, the orchestration of your ingestion pipeline happens on Tensorlake. You write a distributed, durable ingestion pipeline in pure Python; Tensorlake automatically queues requests as they arrive and scales the cluster to process them. The platform is serverless, so you only pay for the compute used to process your data.

[Architecture diagram: documents flow through the Tensorlake Platform’s serverless Python functions into Databricks tables.]

For a comprehensive example including page classification, multi-document processing, and advanced analytics, see our tutorial: Query SEC Filings Stored in Databricks

What’s Next?

Learn more about Tensorlake and Databricks: