
Analyzing AI Risk Disclosures in SEC Filings with Tensorlake & Databricks

Track how AI risk disclosures evolved across major tech companies from 2021 to 2025 by parsing SEC filings, extracting structured risk data, and running SQL analytics in the Databricks Data Intelligence Platform.

Track AI Risk Evolution Across Tech Companies

Let’s set the context for this example: you’ll build a document analytics pipeline that processes SEC filings from major tech companies to track how AI risk disclosures have evolved from 2021 to 2025. You’ll learn how to:
  • Use Tensorlake’s Page Classification to identify risk factor pages with VLMs
  • Extract structured data from only relevant pages using Pydantic schemas
  • Deploy serverless applications on Tensorlake’s platform to run your entire pipeline
  • Load parsed document data into Databricks for SQL analytics
  • Query trends, compare companies, and discover emerging risk patterns

The Challenge

Major tech companies file lengthy SEC reports (100-200+ pages) every quarter and year. AI-related risk disclosures are scattered throughout these documents, making manual analysis time-consuming and critical information easy to miss.

Our Solution

We’ll analyze 3 SEC filings from Microsoft, Google, and Meta spanning 2024-2025 to:
  1. Use VLMs to identify pages containing AI risk factors (reducing processing from ~200 pages to ~20 per document)
  2. Extract structured risk data from only relevant pages
  3. Deploy the entire pipeline as serverless applications on Tensorlake
  4. Store and analyze trends in Databricks SQL Warehouse
  5. Uncover emerging AI risk patterns and regulatory concerns

Prerequisites

To follow along, you’ll need:
  • A Tensorlake account and API key (available from your dashboard at cloud.tensorlake.ai)
  • A Databricks workspace with access to a SQL Warehouse
  • Python 3.10 or later (the query application uses match statements)

Getting Started

Databricks Setup

You need access to a Databricks SQL Warehouse. Find your connection details in the Databricks workspace under SQL Warehouses → Connection Details.

Local Testing

1. Install Dependencies
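The Clean Up section at the end assumes a Python virtual environment; if you want one, create and activate it before installing (the .venv directory name here is just an example):
python -m venv .venv
source .venv/bin/activate
Then install the packages: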

pip install --upgrade tensorlake databricks-sql-connector pandas pyarrow

2. Set Environment Variables

export TENSORLAKE_API_KEY=YOUR_TENSORLAKE_API_KEY
export DATABRICKS_SERVER_HOSTNAME=YOUR_DATABRICKS_SERVER_HOSTNAME
export DATABRICKS_HTTP_PATH=YOUR_DATABRICKS_HTTP_PATH
export DATABRICKS_ACCESS_TOKEN=YOUR_DATABRICKS_ACCESS_TOKEN
Or create a .env file with these values.
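Optionally, sanity-check your Databricks credentials before building the pipeline. This is a minimal sketch (it assumes the environment variables above are set) that opens a connection to your SQL Warehouse and runs SELECT 1:
import os
from databricks import sql

# Quick connectivity check against the SQL Warehouse
connection = sql.connect(
    server_hostname=os.getenv("DATABRICKS_SERVER_HOSTNAME"),
    http_path=os.getenv("DATABRICKS_HTTP_PATH"),
    access_token=os.getenv("DATABRICKS_ACCESS_TOKEN"),
)
with connection.cursor() as cursor:
    cursor.execute("SELECT 1")
    print(cursor.fetchone())  # expect (1,)
connection.close()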

Build Your Document Processing Application

We’ll create a Tensorlake application that extracts AI risk data from SEC filings and stores it in Databricks. This application demonstrates a complete document processing pipeline using Tensorlake Applications with parallel processing via .map().

Pipeline Architecture

The application follows this flow:
document_ingestion (entry point)
└──> classify_pages - Classifies pages in SEC filings
└──> extract_structured_data.map() - Extracts data from classified pages IN PARALLEL
└──> initialize_databricks_table - Sets up database schema
└──> write_to_databricks.map() - Writes results to Databricks IN PARALLEL
Key Tensorlake concepts used (a minimal toy sketch follows this list):
  • @application(): Marks the entry point of your application
  • @function(): Makes functions distributed and executable in the cloud or locally
  • .map(): Enables parallel execution across multiple items
  • Image: Defines the Docker container environment with dependencies
  • secrets: Securely injects environment variables at runtime
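Here is that sketch; the function names are hypothetical, and it only illustrates the decorator and .map() pattern used in the real pipeline below:
from typing import List

from tensorlake.applications import Image, application, function

# Hypothetical toy pipeline: same decorators and fan-out pattern as the SEC pipeline
image = Image(base_image="python:3.11-slim", name="toy-example")

@function(image=image)
def square(x: int) -> int:
    return x * x

@application()
@function(image=image)
def toy_pipeline(numbers: List[int]) -> List[int]:
    # .map() calls square() once per item; the calls run in parallel when deployed
    return square.map(numbers)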

Define Your Extraction Schemas

First, define the Pydantic models that describe the data structure you want to extract:
from pydantic import BaseModel, Field
from typing import List, Optional

class AIRiskMention(BaseModel):
    """Individual AI-related risk mention"""
    risk_category: str = Field(
        description="Category: Operational, Regulatory, Competitive, Ethical, Security, Liability"
    )
    risk_description: str = Field(description="Description of the AI risk")
    severity_indicator: Optional[str] = Field(None, description="Severity level if mentioned")
    citation: str = Field(description="Page reference")

class AIRiskExtraction(BaseModel):
    """Complete AI risk data from a filing"""
    company_name: str
    ticker: str
    filing_type: str
    filing_date: str
    fiscal_year: str
    fiscal_quarter: Optional[str] = None
    ai_risk_mentioned: bool
    ai_risk_mentions: List[AIRiskMention] = []
    num_ai_risk_mentions: int = 0
    ai_strategy_mentioned: bool = False
    ai_investment_mentioned: bool = False
    ai_competition_mentioned: bool = False
    regulatory_ai_risk: bool = False
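
To see the shape of data this schema describes, here is a purely hypothetical record (illustrative values, not real extraction output):
# Hypothetical example, assuming Pydantic v2 (use .json() on v1)
sample = AIRiskExtraction(
    company_name="Example Corp",
    ticker="EXMP",
    filing_type="10-K",
    filing_date="2025-06-30",
    fiscal_year="2025",
    ai_risk_mentioned=True,
    ai_risk_mentions=[
        AIRiskMention(
            risk_category="Regulatory",
            risk_description="Evolving AI regulation may increase compliance costs.",
            citation="page 42",
        )
    ],
    num_ai_risk_mentions=1,
    regulatory_ai_risk=True,
)
print(sample.model_dump_json(indent=2))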

Create the Document Processing Application

Create a file called process-sec.py:
import os
import json
from typing import Dict, List, Optional, Tuple, Any

from pydantic import BaseModel, Field
from databricks import sql

from tensorlake.applications import Image, application, function
from tensorlake.documentai import (
    DocumentAI, PageClassConfig, StructuredExtractionOptions, ParseResult
)

# TENSORLAKE APPLICATIONS: Define a custom runtime environment
# Image defines the Docker container environment where your functions will run.
# You can specify dependencies, system packages, and environment configuration.
# All @function decorators can reference this image to ensure consistent execution.
image = (
    Image(base_image="python:3.11-slim", name="databricks-sec")
    .run("pip install databricks-sql-connector pandas pyarrow")
)

# [Include the Pydantic models from above]

# TENSORLAKE APPLICATIONS: Application Entry Point
# @application() marks this function as the main entry point for your Tensorlake application.
# @function() makes this a distributed function that can run in the cloud or locally.
#
# Key concepts:
# - secrets: List of environment variable names that will be securely injected at runtime
# - image: The runtime environment (Docker container) where this function executes
# - Functions decorated with @function can call other @function decorated functions
# - You can use .map() on @function decorated functions for parallel execution
@application()
@function(
    secrets=[
        "TENSORLAKE_API_KEY"
    ],
    image=image
)
def document_ingestion(document_urls: List[str]) -> None:
    """Main entry point for document processing pipeline"""
    print(f"Starting document ingestion for {len(document_urls)} documents.")

    # Step 1: Classify pages in all documents
    parse_ids = classify_pages(document_urls)
    print(f"Classification complete. Parse IDs: {parse_ids}")

    # Step 2: Extract structured data with parallel execution
    # .map() calls extract_structured_data once for each item in parse_ids.items()
    # Each call runs in parallel, making this very efficient for processing multiple documents
    # Returns a list of results (tuples in this case) from all parallel executions
    results = extract_structured_data.map(parse_ids.items())
    print(f"Extraction complete. Results: {results}")

    # Step 3: Initialize database schema
    initialize_databricks_table()
    print("Databricks table initialized.")

    # Step 4: Write data to Databricks in parallel
    # .map() again enables parallel processing - each result tuple is written to Databricks
    # in parallel, significantly speeding up the data ingestion process
    print("Writing results to Databricks.")
    write_to_databricks.map(results)
    print("Document ingestion process completed.")


@function(
    secrets=[
        "TENSORLAKE_API_KEY"
    ], 
    image=image
)
def classify_pages(document_urls: List[str]) -> Dict[str, str]:
    """Classify pages in SEC filings to identify AI risk factors"""
    doc_ai = DocumentAI(api_key=os.getenv("TENSORLAKE_API_KEY"))
    
    page_classifications = [
        PageClassConfig(
            name="risk_factors",
            description="Pages that contain risk factors related to AI."
        ),
    ]
    parse_ids = {}

    for file_url in document_urls:
        try:
            parse_id = doc_ai.classify(
                file_url=file_url,
                page_classifications=page_classifications
            )
            parse_ids[file_url] = parse_id
            print(f"Successfully classified {file_url}: {parse_id}")
        except Exception as e:
            print(f"Failed to classify document {file_url}: {e}")

    return parse_ids


# TENSORLAKE APPLICATIONS: Distributed Function for Parallel Processing
# This function is designed to be called via .map() for parallel execution.
# When called with .map(), this function runs once for each item in the input list,
# with all executions happening in parallel across multiple workers.
#
# Error Handling: Always wrap .map() functions in try-except to return None on failure.
# This allows the pipeline to continue processing other items even if one fails.
@function(
    image=image,
    secrets=[
        "TENSORLAKE_API_KEY"
    ]
)
def extract_structured_data(url_parse_id_pair: Tuple[str, str]) -> Optional[Tuple[str, str]]:
    """Extract structured data from classified pages

    Args:
        url_parse_id_pair: Tuple of (file_url, parse_id) from the classification step

    Returns:
        Tuple of (extract_result_id, file_url) or None if processing fails
    """
    print(f"Processing: {url_parse_id_pair}")

    try:
        doc_ai = DocumentAI(api_key=os.getenv("TENSORLAKE_API_KEY"))
        result = doc_ai.wait_for_completion(parse_id=url_parse_id_pair[1])

        page_numbers = []
        for page_class in result.page_classes:
            if page_class.page_class == "risk_factors":
                page_numbers.extend(page_class.page_numbers)

        if not page_numbers:
            print(f"No risk factor pages found for {url_parse_id_pair[0]}")
            return None

        page_number_str_list = ",".join(str(i) for i in page_numbers)
        print(f"Extracting from pages: {page_number_str_list}")

        extract_result = doc_ai.extract(
            file_url=url_parse_id_pair[0],
            page_range=page_number_str_list,
            structured_extraction_options=[
                StructuredExtractionOptions(
                    schema_name="AIRiskExtraction",
                    json_schema=AIRiskExtraction
                )
            ]
        )
        print(f"Extraction result: {extract_result}")

        return (extract_result, url_parse_id_pair[0])
    except Exception as e:
        print(f"Error processing {url_parse_id_pair[0]}: {e}")
        return None


@function(
    image=image, 
    secrets=[
        "DATABRICKS_SERVER_HOSTNAME",
        "DATABRICKS_HTTP_PATH",
        "DATABRICKS_ACCESS_TOKEN"
    ]
)
def initialize_databricks_table() -> None:
    """Initialize the Databricks table with the required schema"""
    connection = sql.connect(
        server_hostname=os.getenv("DATABRICKS_SERVER_HOSTNAME"),
        http_path=os.getenv("DATABRICKS_HTTP_PATH"),
        access_token=os.getenv("DATABRICKS_ACCESS_TOKEN"),
        _tls_no_verify=True,
    )
    cursor = connection.cursor()
    
    create_ai_risk_factors_sql = """
    CREATE TABLE IF NOT EXISTS ai_risk_filings (
        company_name STRING,
        ticker STRING,
        filing_type STRING,
        filing_date STRING,
        fiscal_year STRING,
        fiscal_quarter STRING,
        ai_risk_mentioned BOOLEAN,
        ai_risk_mentions STRING,
        num_ai_risk_mentions INT,
        ai_strategy_mentioned BOOLEAN,
        ai_investment_mentioned BOOLEAN,
        ai_competition_mentioned BOOLEAN,
        regulatory_ai_risk BOOLEAN
    )
    """
    cursor.execute(create_ai_risk_factors_sql)
    
    create_ai_risk_mentions_sql = """
        CREATE TABLE IF NOT EXISTS ai_risks (
            company_name STRING,
            ticker STRING,
            fiscal_year STRING,
            fiscal_quarter STRING,
            source_file STRING,
            risk_category STRING,
            risk_description STRING,
            severity_indicator STRING,
            citation STRING
        )
    """
    cursor.execute(create_ai_risk_mentions_sql)
    connection.commit()
    connection.close()


# TENSORLAKE APPLICATIONS: Parallel Database Write Function
# This function is called via .map() to write results to Databricks in parallel.
# Each execution processes one result tuple from the extraction step.
#
# Data Flow: extract_structured_data returns tuples -> .map() collects them into a list
#            -> write_to_databricks.map() processes each tuple in parallel
#
# Secrets: Multiple secrets can be specified. Each will be available as an environment
# variable inside the function. Secrets are never logged or exposed in code.
@function(
    image=image,
    secrets=[
        "TENSORLAKE_API_KEY",
        "DATABRICKS_SERVER_HOSTNAME",
        "DATABRICKS_HTTP_PATH",
        "DATABRICKS_ACCESS_TOKEN"
    ]
)
def write_to_databricks(result_tuple: Tuple[Any, str]) -> None:
    """Write structured data to Databricks tables

    Args:
        result_tuple: Tuple of (extract_result_id, file_url) from extract_structured_data
    """
    # Handle None values - functions called via .map() should gracefully skip failed items
    if result_tuple is None:
        return

    extract_result, file_url = result_tuple
    if extract_result is None:
        return

    doc_ai = DocumentAI(api_key=os.getenv("TENSORLAKE_API_KEY"))
    result: ParseResult = doc_ai.wait_for_completion(extract_result)
    if not result.structured_data:
        return
    raw = result.structured_data[0].data
    record = raw if isinstance(raw, dict) else (raw[0] if isinstance(raw, list) and raw else {})
    data = dict(record)
    mentions = data.pop("ai_risk_mentions", []) or []
    
    # Add source file reference
    source_file = os.path.basename(file_url)
    connection = sql.connect(
        server_hostname=os.getenv("DATABRICKS_SERVER_HOSTNAME"),
        http_path=os.getenv("DATABRICKS_HTTP_PATH"),
        access_token=os.getenv("DATABRICKS_ACCESS_TOKEN"),
        _tls_no_verify=True,
    )
    cursor = connection.cursor()

    # Serialize mentions for STRING column storage
    ai_risk_mentions_json = json.dumps(mentions) if mentions else None
    
    # Insert the single record into ai_risk_filings
    insert_sql = """
    INSERT INTO ai_risk_filings (
        company_name,
        ticker,
        filing_type,
        filing_date,
        fiscal_year,
        fiscal_quarter,
        ai_risk_mentioned,
        ai_risk_mentions,
        num_ai_risk_mentions,
        ai_strategy_mentioned,
        ai_investment_mentioned,
        ai_competition_mentioned,
        regulatory_ai_risk
    ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
    """

    # Execute the insert with positional parameters
    cursor.execute(insert_sql, (
        data.get('company_name'),
        data.get('ticker'),
        data.get('filing_type'),
        data.get('filing_date'),
        data.get('fiscal_year'),
        data.get('fiscal_quarter'),
        data.get('ai_risk_mentioned', False),
        ai_risk_mentions_json,
        data.get('num_ai_risk_mentions', 0),
        data.get('ai_strategy_mentioned', False),
        data.get('ai_investment_mentioned', False),
        data.get('ai_competition_mentioned', False),
        data.get('regulatory_ai_risk', False)
    ))
    
    # Insert into ai_risks table
    if mentions:
        insert_mentions_sql = """
        INSERT INTO ai_risks (
            company_name,
            ticker,
            fiscal_year,
            fiscal_quarter,
            source_file,
            risk_category,
            risk_description,
            severity_indicator,
            citation
        ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
        """
        
        for mention in mentions:
            cursor.execute(insert_mentions_sql, (
                data.get('company_name'),
                data.get('ticker'),
                data.get('fiscal_year'),
                data.get('fiscal_quarter'),
                source_file,
                mention.get('risk_category'),
                mention.get('risk_description'),
                mention.get('severity_indicator'),
                mention.get('citation')
            ))
    
    connection.commit()
    connection.close()


if __name__ == "__main__":
    from tensorlake.applications import run_local_application

    # TENSORLAKE APPLICATIONS: Local Development
    # run_local_application() executes your application locally for testing and development.
    # Pass the entry point function (decorated with @application()) and its arguments.
    #
    # For production deployment:
    # 1. Use Tensorlake CLI to deploy: `tensorlake deploy`
    # 2. Your application will run in the cloud with automatic scaling
    # 3. All @function decorated functions will execute in their specified container environments
    #
    # Secrets: When running locally, secrets are read from environment variables.
    #          In production, secrets are managed securely through the Tensorlake platform.

    # Example usage with a single document
    test_urls = [
        "https://investors.confluent.io/static-files/95299e90-a988-42c5-b9b5-7da387691f6a"
    ]

    response = run_local_application(
        document_ingestion,
        test_urls
    )

    print(response.output())

Test Locally

Run the processing script to extract data from a test SEC filing:
python process-sec.py
This will:
  1. Classify pages to find AI risk factors using VLMs
  2. Extract structured data from those pages in parallel via .map()
  3. Initialize the Databricks table schema
  4. Load the extracted data into your Databricks tables in parallel via .map()
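To confirm the rows landed, you can run a quick ad-hoc check against the ai_risk_filings table. This is a minimal sketch reusing the same connection pattern as the pipeline:
import os
from databricks import sql

connection = sql.connect(
    server_hostname=os.getenv("DATABRICKS_SERVER_HOSTNAME"),
    http_path=os.getenv("DATABRICKS_HTTP_PATH"),
    access_token=os.getenv("DATABRICKS_ACCESS_TOKEN"),
)
with connection.cursor() as cursor:
    cursor.execute(
        "SELECT company_name, fiscal_year, num_ai_risk_mentions FROM ai_risk_filings LIMIT 5"
    )
    for row in cursor.fetchall():
        print(row)
connection.close()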

Build Your Query Application

Now create a separate application for querying the extracted data. Create a file called query-sec.py:
import os
import json
from databricks import sql

from tensorlake.applications import Image, application, function

image = (
    Image(base_image="python:3.11-slim", name="databricks-sec")
    .run("pip install databricks-sql-connector pandas pyarrow")
)

@application()
@function(image=image)
def query_sec(query_choice: str) -> str:
    """Query AI risk data from Databricks"""
    
    # Default query: Risk category distribution
    query = """
        SELECT 
            risk_category,
            COUNT(*) as total_mentions,
            COUNT(DISTINCT company_name) as companies_mentioning
        FROM ai_risks
        WHERE risk_category IS NOT NULL
        GROUP BY risk_category
        ORDER BY total_mentions DESC
    """
    
    # Select query based on user choice
    match query_choice:
        case "operational-risks":
            query = """
                WITH ranked_risks AS (
                    SELECT 
                        company_name,
                        ticker,
                        risk_description,
                        citation,
                        LENGTH(risk_description) as description_length,
                        ROW_NUMBER() OVER (
                            PARTITION BY company_name 
                            ORDER BY LENGTH(risk_description) DESC
                        ) as rn
                    FROM ai_risks
                    WHERE risk_category = 'Operational'
                )
                SELECT 
                    company_name,
                    ticker,
                    risk_description,
                    citation,
                    description_length
                FROM ranked_risks
                WHERE rn = 1
                ORDER BY company_name
            """
        
        case "risk-evolution":
            query = """
                SELECT 
                    company_name,
                    ticker,
                    fiscal_year,
                    fiscal_quarter,
                    risk_category,
                    risk_description,
                    citation
                FROM ai_risks
                WHERE fiscal_year = '2025'
                ORDER BY company_name, fiscal_quarter
            """
        
        case "risk-timeline":
            query = """
                SELECT 
                    fiscal_year,
                    fiscal_quarter,
                    COUNT(DISTINCT company_name) as num_companies,
                    SUM(num_ai_risk_mentions) as total_risk_mentions,
                    AVG(num_ai_risk_mentions) as avg_risk_mentions_per_filing,
                    SUM(CASE WHEN regulatory_ai_risk THEN 1 ELSE 0 END) 
                        as filings_with_regulatory_risk
                FROM ai_risk_filings
                GROUP BY fiscal_year, fiscal_quarter
                ORDER BY fiscal_year, fiscal_quarter
            """
        
        case "risk-profiles":
            query = """
                SELECT 
                    company_name,
                    ticker,
                    risk_category,
                    COUNT(*) as frequency
                FROM ai_risks
                WHERE risk_category IS NOT NULL
                GROUP BY company_name, ticker, risk_category
                ORDER BY company_name, frequency DESC
            """
        
        case "company-summary":
            query = """
                SELECT 
                    company_name,
                    ticker,
                    COUNT(*) as total_filings,
                    AVG(num_ai_risk_mentions) as avg_risk_mentions,
                    SUM(CASE WHEN regulatory_ai_risk THEN 1 ELSE 0 END) 
                        as filings_with_regulatory_risk,
                    SUM(CASE WHEN ai_competition_mentioned THEN 1 ELSE 0 END) 
                        as filings_mentioning_competition,
                    SUM(CASE WHEN ai_investment_mentioned THEN 1 ELSE 0 END) 
                        as filings_mentioning_investment
                FROM ai_risk_filings
                GROUP BY company_name, ticker
                ORDER BY avg_risk_mentions DESC
            """
    
    return make_query(query)

@function(
    image=image, 
    secrets=[
        "DATABRICKS_SERVER_HOSTNAME",
        "DATABRICKS_HTTP_PATH",
        "DATABRICKS_ACCESS_TOKEN"
    ]
)
def make_query(query: str) -> str:
    """Execute query against Databricks and return JSON results"""
    import pandas as pd
    from databricks import sql
    
    try:
        connection = sql.connect(
            server_hostname=os.getenv("DATABRICKS_SERVER_HOSTNAME"),
            http_path=os.getenv("DATABRICKS_HTTP_PATH"),
            access_token=os.getenv("DATABRICKS_ACCESS_TOKEN"),
            _tls_no_verify=True,
        )
        cursor = connection.cursor()
        cursor.execute(query)
        
        # Fetch results as pandas DataFrame
        results = cursor.fetchall()
        columns = [desc[0] for desc in cursor.description]
        df = pd.DataFrame(results, columns=columns)
        
        cursor.close()
        connection.close()
        
        return df.to_json(orient='records')
    except Exception:
        raise

if __name__ == "__main__":
    from tensorlake.applications import run_local_application
    import sys
    
    queries = [
        "risk-distribution", 
        "operational-risks", 
        "risk-evolution", 
        "risk-timeline", 
        "risk-profiles", 
        "company-summary"
    ]
    query = queries[0]
    
    if len(sys.argv) > 1:
        query = queries[int(sys.argv[1])]
    
    response = run_local_application(query_sec, query)
    pretty_json = json.loads(response.output())
    print(json.dumps(pretty_json, indent=4))

Test Queries Locally

Query the extracted data (replace 5 with any query number 0-5):
python query-sec.py 5
Available queries:
  • 0 - Risk category distribution
  • 1 - Operational AI risks (most detailed per company)
  • 2 - Emerging risks in 2025
  • 3 - Risk timeline analysis
  • 4 - Company risk profiles
  • 5 - Company summary statistics
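Each number maps to a case in query_sec’s match statement. To add your own analysis, add a new case with its SQL and include the name in the queries list under __main__. For example, a hypothetical fragment to drop into the match statement that counts filings per company:
        case "filings-per-company":
            query = """
                SELECT
                    company_name,
                    COUNT(*) as num_filings
                FROM ai_risk_filings
                GROUP BY company_name
                ORDER BY num_filings DESC
            """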

Deploy to Tensorlake Cloud

Now that you’ve tested locally, deploy your applications to run as serverless functions in the cloud.

1. Verify Tensorlake Connection

tensorlake whoami

2. Set Secrets

Store your credentials securely in Tensorlake:
tensorlake secrets set DATABRICKS_SERVER_HOSTNAME='YOUR_DATABRICKS_SERVER_HOSTNAME'
tensorlake secrets set DATABRICKS_HTTP_PATH='YOUR_DATABRICKS_HTTP_PATH'
tensorlake secrets set DATABRICKS_ACCESS_TOKEN='YOUR_DATABRICKS_ACCESS_TOKEN'
tensorlake secrets set TENSORLAKE_API_KEY='YOUR_TENSORLAKE_API_KEY'

3. Verify Secrets

tensorlake secrets list

4. Deploy Applications

Deploy the processing application:
tensorlake deploy process-sec.py
Deploy the query application:
tensorlake deploy query-sec.py
Once deployed, you’ll see both applications in your dashboard at cloud.tensorlake.ai.

5. Run the Full Pipeline

Create a script called process-sec-remote.py to process all SEC filings using your deployed application:
from tensorlake.applications import run_remote_application, Request

# SEC Filings to process
sec_filings = [
    'https://pub-226479de18b2493f96b64c6674705dd8.r2.dev/goog-10k-december-24.pdf',
    'https://pub-226479de18b2493f96b64c6674705dd8.r2.dev/msft-10k-june-25.pdf',
    'https://pub-226479de18b2493f96b64c6674705dd8.r2.dev/meta-10k-december-24.pdf'
]

request: Request = run_remote_application('document_ingestion', sec_filings)
Run it:
python process-sec-remote.py

6. Query from the Deployed Application

Create a script called query-sec-remote.py:
from tensorlake.applications import run_remote_application, Request
import json
import sys

# Available queries
queries = [
    "risk-distribution", 
    "operational-risks", 
    "risk-evolution", 
    "risk-timeline", 
    "risk-profiles", 
    "company-summary"
]

# Choose a query (default: risk-distribution)
query = queries[0]

if len(sys.argv) > 1:
    query = queries[int(sys.argv[1])]

request: Request = run_remote_application('query_sec', query)
output = request.output()

pretty_json = json.loads(output)
print(json.dumps(pretty_json, indent=4))
Run a specific query:
python query-sec-remote.py 2

Analyze Your Results

Let’s examine what insights we can extract from the data.

Query 1: Risk Category Distribution

See which types of AI risks are most common:
python query-sec-remote.py 0
Expected Output:
[
    {
        "risk_category": "Operational",
        "total_mentions": 15,
        "companies_mentioning": 3
    },
    {
        "risk_category": "Regulatory",
        "total_mentions": 12,
        "companies_mentioning": 3
    },
    {
        "risk_category": "Ethical",
        "total_mentions": 10,
        "companies_mentioning": 3
    }
]

Query 2: Most Detailed Operational Risks

Find the most comprehensive operational risk description from each company:
python query-sec-remote.py 1
This returns the longest (most detailed) operational risk disclosure per company, helping you understand each company’s primary operational concerns.

Query 3: Timeline Analysis

Track how risk mentions evolved over time:
python query-sec-remote.py 3
This shows trends in risk disclosure volume and helps identify when companies started taking AI risks more seriously.

Query 4: Company Risk Profiles

Compare risk category frequencies across companies:
python query-sec-remote.py 4
Understand which companies focus on which types of risks.

Key Insights

Through this analysis pipeline, you can uncover:
  1. Risk Category Trends: Operational and regulatory risks dominate across all companies
  2. Disclosure Evolution: Risk mention frequency increases in more recent filings
  3. Company Differences: Each company emphasizes different risk categories based on their AI strategy
  4. Emerging Patterns: New risk categories appear over time (liability, IP concerns, energy dependencies)

Architecture Benefits

This Tensorlake + Databricks integration provides:
  • Serverless Execution: No infrastructure to manage, applications scale automatically
  • Parallel Processing: Multiple documents processed simultaneously via .map() at both extraction and database write stages
  • Separation of Concerns: Document processing and querying are independent applications
  • Reusable Components: Each function can be called independently or composed into larger pipelines
  • Secret Management: Credentials stored securely and injected at runtime
  • Fault Tolerance: Functions wrapped in try-except ensure pipeline continues even if individual items fail

Adapt This Pipeline

This pipeline can be adapted for any document analysis use case (an example adaptation follows the list):
  • ESG Disclosures: Track sustainability commitments across annual reports
  • Financial Metrics Tracking: Extract KPIs from earnings reports over time
  • Competitive Intelligence: Monitor competitor product launches and strategies
  • Regulatory Compliance: Alert on new compliance requirements in legal documents
  • Contract Analysis: Extract key terms and obligations from agreements
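For example, adapting the pipeline to ESG disclosures mostly means swapping the page classification and the extraction schema; the classification step, parallel extraction via .map(), and Databricks writes stay the same. A hypothetical sketch (field names and descriptions are illustrative, not a tested schema):
from typing import List, Optional

from pydantic import BaseModel, Field
from tensorlake.documentai import PageClassConfig

# Hypothetical classification target: pages discussing sustainability commitments
esg_page_classifications = [
    PageClassConfig(
        name="esg_disclosures",
        description="Pages that discuss sustainability, emissions, or ESG commitments."
    ),
]

class ESGCommitment(BaseModel):
    """Hypothetical schema for a single sustainability commitment"""
    topic: str = Field(description="Topic: Emissions, Energy, Diversity, Governance, Supply Chain")
    commitment_description: str = Field(description="Description of the commitment")
    target_year: Optional[str] = Field(None, description="Target year if stated")
    citation: str = Field(description="Page reference")

class ESGExtraction(BaseModel):
    """Hypothetical top-level schema replacing AIRiskExtraction"""
    company_name: str
    ticker: str
    fiscal_year: str
    commitments: List[ESGCommitment] = []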

Clean Up

When you’re done with this example:
# Deactivate virtual environment
deactivate

# Optional: Delete deployed applications
tensorlake applications delete document_ingestion
tensorlake applications delete query_sec

Next Steps

Now that you have the basics down, try building your own document intelligence pipeline with Tensorlake and Databricks!