Skip to content

Database Schema

Carbon Connect uses PostgreSQL 16 with the pgvector extension for combined relational storage and 768-dimensional vector similarity search. The database is managed through SQLAlchemy 2.0 (async) with Alembic for migrations.


Entity Relationship Diagram

erDiagram
    tenants ||--o{ users : "has many"
    tenants ||--o{ companies : "has many"
    companies ||--o{ matches : "has many"
    companies ||--o{ applications : "has many"
    grants ||--o{ matches : "has many"
    grants ||--o{ applications : "has many"
    users ||--o| notification_preferences : "has one"
    users ||--o{ notification_history : "has many"
    applications ||--o{ documents : "has many"
    matches }o--|| applications : "optional link"

    tenants {
        uuid id PK
        string name
        string slug UK
        jsonb settings
        string subscription_tier
        boolean is_active
        timestamp created_at
        timestamp updated_at
    }

    users {
        uuid id PK
        uuid tenant_id FK
        string email
        string hashed_password
        string full_name
        string role
        boolean is_active
        boolean email_verified
        timestamp last_login_at
        timestamp created_at
        timestamp updated_at
    }

    companies {
        uuid id PK
        uuid tenant_id FK
        string name
        text description
        array nace_codes
        string country
        vector description_embedding
        float scope1_emissions
        float scope2_emissions
        float scope3_emissions
        array certifications
        vector carbon_profile_embedding
        timestamp created_at
        timestamp updated_at
    }

    grants {
        uuid id PK
        string external_id
        string source
        string title
        text description
        float funding_amount_min
        float funding_amount_max
        timestamp deadline
        array countries
        array nace_codes
        boolean is_carbon_focused
        array carbon_categories
        vector title_embedding
        vector description_embedding
        string content_hash
        timestamp created_at
        timestamp updated_at
    }

    matches {
        uuid id PK
        uuid tenant_id
        uuid company_id FK
        uuid grant_id FK
        float total_score
        float semantic_score
        float rule_score
        float carbon_score
        float collaborative_score
        float recency_score
        boolean is_viewed
        boolean is_saved
        boolean is_dismissed
        timestamp calculated_at
        timestamp created_at
        timestamp updated_at
    }

    applications {
        uuid id PK
        uuid tenant_id
        uuid company_id FK
        uuid grant_id FK
        uuid match_id FK
        string status
        float requested_amount
        float approved_amount
        jsonb generated_content
        timestamp submitted_at
        timestamp created_at
        timestamp updated_at
    }

Base Model

All models inherit from BaseModel which provides a UUID primary key and automatic timestamps:

Column Type Description
id UUID Auto-generated UUID v4 primary key
created_at TIMESTAMP WITH TIME ZONE Auto-set on creation via func.now()
updated_at TIMESTAMP WITH TIME ZONE Auto-updated on modification via func.now()

Source: backend/app/db/base.py


Table Definitions

tenants

Multi-tenant organizations with subscription tiers. All tenant-scoped data is isolated via PostgreSQL Row-Level Security policies.

Column Type Nullable Default Indexed Description
id UUID No uuid4() PK Primary key
name VARCHAR(255) No -- -- Organization name
slug VARCHAR(100) No -- UNIQUE URL-safe identifier
settings JSONB No {} -- Tenant-specific settings
subscription_tier VARCHAR(50) No explorer -- Tier: explorer, professional, enterprise
is_active BOOLEAN No true Yes Active status
created_at TIMESTAMPTZ No now() -- Creation timestamp
updated_at TIMESTAMPTZ No now() -- Last update timestamp

Relationships: Has many users, has many companies (both cascade delete).


users

User accounts with tenant association and role-based access control.

Column Type Nullable Default Indexed Description
id UUID No uuid4() PK Primary key
tenant_id UUID No -- Yes, FK References tenants.id (CASCADE)
email VARCHAR(255) No -- Yes User email address
hashed_password VARCHAR(255) No -- -- bcrypt-hashed password
full_name VARCHAR(255) Yes -- -- Display name
role VARCHAR(50) No user -- Role: user, admin, owner
is_active BOOLEAN No true -- Active status
email_verified BOOLEAN No false -- Email verification status
last_login_at TIMESTAMPTZ Yes -- -- Last login timestamp
created_at TIMESTAMPTZ No now() -- Creation timestamp
updated_at TIMESTAMPTZ No now() -- Last update timestamp

Relationships: Belongs to tenant, has one notification_preferences, has many notification_history.


companies

SME profiles with industry classification, carbon profile, and 768-dimensional description embeddings for semantic matching.

Column Type Nullable Default Indexed Description
id UUID No uuid4() PK Primary key
tenant_id UUID No -- Yes, FK References tenants.id (CASCADE)
name VARCHAR(500) No -- -- Company name
description TEXT Yes -- -- Business description
nace_codes VARCHAR[] No {} -- European industry classification codes
sic_codes VARCHAR[] No {} -- UK Standard Industrial Classification
legal_form VARCHAR(100) Yes -- -- Legal form (Ltd, GmbH, SARL, etc.)
employee_count_min INTEGER Yes -- -- Minimum employee range
employee_count_max INTEGER Yes -- -- Maximum employee range
annual_revenue_min NUMERIC(15,2) Yes -- -- Minimum annual revenue (EUR)
annual_revenue_max NUMERIC(15,2) Yes -- -- Maximum annual revenue (EUR)
founding_year INTEGER Yes -- -- Year founded
country VARCHAR(3) No -- Yes ISO 3166-1 alpha-3 country code
region VARCHAR(100) Yes -- -- Region/state
city VARCHAR(100) Yes -- -- City
postal_code VARCHAR(20) Yes -- -- Postal code
website VARCHAR(500) Yes -- -- Company website
contact_email VARCHAR(255) Yes -- -- Contact email
description_embedding VECTOR(768) Yes -- HNSW Embedding for semantic matching
extra_data JSONB No {} -- Additional metadata
is_active BOOLEAN No true -- Active status
created_at TIMESTAMPTZ No now() -- Creation timestamp
updated_at TIMESTAMPTZ No now() -- Last update timestamp

Carbon Profile Fields

Column Type Nullable Default Description
carbon_profile_completed BOOLEAN No false Whether carbon profile is fully completed
has_carbon_baseline BOOLEAN No false Whether baseline emissions are calculated
scope1_emissions NUMERIC(15,4) Yes -- Scope 1 direct emissions in tCO2e/year
scope1_sources JSONB No {} Breakdown: {natural_gas_kwh, diesel_liters, ...}
scope2_emissions NUMERIC(15,4) Yes -- Scope 2 indirect emissions in tCO2e/year
scope2_sources JSONB No {} Breakdown: {electricity_kwh, district_heating_kwh, ...}
scope3_emissions NUMERIC(15,4) Yes -- Scope 3 value chain emissions in tCO2e/year
scope3_categories JSONB No {} GHG Protocol category breakdown
reduction_target_percent NUMERIC(5,2) Yes -- Target emission reduction percentage
reduction_target_year INTEGER Yes -- Year to achieve reduction target
net_zero_target_year INTEGER Yes -- Net-zero target year
certifications VARCHAR[] No {} Certifications: ISO_14001, SBTi_COMMITTED, CDP_DISCLOSURE, B_CORP
carbon_intensity_revenue NUMERIC(15,6) Yes -- tCO2e per EUR revenue
carbon_intensity_employee NUMERIC(10,4) Yes -- tCO2e per employee
carbon_profile_embedding VECTOR(768) Yes -- Embedding for carbon profile matching

Computed Properties (Python, not in database):

  • company_size -- Returns micro (< 10), small (< 50), medium (< 250), or large based on employee_count_max
  • total_emissions -- Sum of Scope 1 + Scope 2 + Scope 3 emissions
  • carbon_readiness_score -- 0-100 score based on profile completeness

Relationships: Belongs to tenant, has many matches, has many applications.


grants

Grant opportunities with eligibility criteria, carbon classification, and 768-dimensional embeddings. Grants are NOT tenant-scoped -- they are shared across all tenants.

Column Type Nullable Default Indexed Description
id UUID No uuid4() PK Primary key
external_id VARCHAR(255) Yes -- -- ID from source system
source VARCHAR(50) No -- Yes Source: cordis, eu_portal, innovate_uk, etc.
source_url TEXT Yes -- -- Original URL at source
title VARCHAR(500) No -- -- Grant title
description TEXT Yes -- -- Full description
summary TEXT Yes -- -- Short description for cards
funding_amount_min NUMERIC(15,2) Yes -- -- Minimum funding amount
funding_amount_max NUMERIC(15,2) Yes -- -- Maximum funding amount
funding_rate_min NUMERIC(5,2) Yes -- -- Minimum funding rate (0-100%)
funding_rate_max NUMERIC(5,2) Yes -- -- Maximum funding rate (0-100%)
currency VARCHAR(3) No EUR -- Currency code
deadline TIMESTAMPTZ Yes -- Yes Application deadline
publication_date TIMESTAMPTZ Yes -- -- Publication date
start_date TIMESTAMPTZ Yes -- -- Grant program start date
end_date TIMESTAMPTZ Yes -- -- Grant program end date
countries VARCHAR[] No {} -- Eligible countries
regions VARCHAR[] No {} -- Eligible regions
nace_codes VARCHAR[] No {} -- Eligible NACE industry codes
company_sizes VARCHAR[] No {} -- Eligible sizes: micro, small, medium, large
legal_forms VARCHAR[] No {} -- Eligible legal forms
min_employees INTEGER Yes -- -- Minimum employee count
max_employees INTEGER Yes -- -- Maximum employee count
eligibility_criteria JSONB No {} -- Additional eligibility data
grant_type VARCHAR(100) Yes -- -- Grant type classification
thematic_areas VARCHAR[] No {} -- Thematic area tags
title_embedding VECTOR(768) Yes -- HNSW Title text embedding
description_embedding VECTOR(768) Yes -- HNSW Description text embedding
content_hash VARCHAR(64) Yes -- Yes SHA-256 hash for deduplication
status VARCHAR(50) No active Yes Status: active, closed, upcoming
is_active BOOLEAN No true -- Active flag
extra_data JSONB No {} -- Additional metadata
raw_data JSONB Yes -- -- Original data from source (for debugging)
last_synced_at TIMESTAMPTZ Yes -- -- Last pipeline sync timestamp
created_at TIMESTAMPTZ No now() -- Creation timestamp
updated_at TIMESTAMPTZ No now() -- Last update timestamp

Carbon Classification Fields

Column Type Nullable Default Indexed Description
is_carbon_focused BOOLEAN No false Yes Primary carbon indicator
carbon_categories VARCHAR[] No {} -- Carbon categories (see list below)
min_emission_reduction_percent NUMERIC(5,2) Yes -- -- Required minimum emission reduction %
eligible_scopes VARCHAR[] No {} -- Eligible emission scopes: scope1, scope2, scope3
eu_taxonomy_aligned BOOLEAN No false -- EU Taxonomy compliance flag
taxonomy_objectives VARCHAR[] No {} -- EU Taxonomy objectives (see list below)
green_deal_aligned BOOLEAN No false -- EU Green Deal alignment
fit_for_55_relevant BOOLEAN No false -- Fit for 55 package relevance
supports_csrd_compliance BOOLEAN No false -- CSRD reporting support
supports_sbti_alignment BOOLEAN No false -- Science Based Targets support
carbon_extra_data JSONB No {} -- Additional carbon metadata

Carbon Categories:

energy_efficiency, renewable_energy, clean_technology, circular_economy,
sustainable_transport, green_buildings, carbon_capture, hydrogen,
industrial_decarbonization, sustainable_agriculture, biodiversity,
water_management, waste_reduction, climate_adaptation

EU Taxonomy Objectives:

climate_mitigation, climate_adaptation, water_marine_resources,
circular_economy, pollution_prevention, biodiversity_ecosystems

Relationships: Has many matches, has many applications.


matches

Company-grant matching scores with full component breakdown. Matches are tenant-scoped through the tenant_id column and company relationship.

Column Type Nullable Default Indexed Description
id UUID No uuid4() PK Primary key
tenant_id UUID No -- Yes Tenant isolation column
company_id UUID No -- Yes, FK References companies.id (CASCADE)
grant_id UUID No -- Yes, FK References grants.id (CASCADE)
total_score NUMERIC(5,4) No -- -- Final weighted score (0.0000 - 1.0000)
semantic_score NUMERIC(5,4) Yes -- -- Embedding cosine similarity (25% weight)
rule_score NUMERIC(5,4) Yes -- -- Rule-based criteria (30% weight)
carbon_score NUMERIC(5,4) Yes -- -- Carbon alignment (25% weight)
collaborative_score NUMERIC(5,4) Yes -- -- Peer signals (10% weight)
recency_score NUMERIC(5,4) Yes -- -- Deadline urgency (10% weight)
score_breakdown JSONB No {} -- Human-readable score explanation
is_viewed BOOLEAN No false -- User has viewed this match
is_saved BOOLEAN No false -- User has saved this match
is_dismissed BOOLEAN No false -- User has dismissed this match
user_rating INTEGER Yes -- -- User rating (1-5)
calculated_at TIMESTAMPTZ No utcnow() -- Score calculation timestamp
expires_at TIMESTAMPTZ Yes -- -- Score expiration timestamp
created_at TIMESTAMPTZ No now() -- Creation timestamp
updated_at TIMESTAMPTZ No now() -- Last update timestamp

Unique Constraint: (company_id, grant_id) -- prevents duplicate matches.

Relationships: Belongs to company, belongs to grant.


applications

Grant application tracking with a Kanban-style status workflow.

Column Type Nullable Default Indexed Description
id UUID No uuid4() PK Primary key
tenant_id UUID No -- Yes Tenant isolation column
company_id UUID No -- Yes, FK References companies.id (CASCADE)
grant_id UUID No -- Yes, FK References grants.id
match_id UUID Yes -- FK References matches.id (optional link)
status VARCHAR(50) No draft Yes Workflow status (see transitions below)
application_reference VARCHAR(255) Yes -- -- External reference number
submitted_at TIMESTAMPTZ Yes -- -- Submission timestamp
decision_at TIMESTAMPTZ Yes -- -- Decision timestamp
requested_amount NUMERIC(15,2) Yes -- -- Amount requested
approved_amount NUMERIC(15,2) Yes -- -- Amount approved
documents JSONB No [] -- Document references (array)
notes TEXT Yes -- -- User-visible notes
internal_notes TEXT Yes -- -- Internal team notes
generated_content JSONB No {} -- AI-generated application content
extra_data JSONB No {} -- Additional metadata
created_at TIMESTAMPTZ No now() -- Creation timestamp
updated_at TIMESTAMPTZ No now() -- Last update timestamp

Status Workflow:

stateDiagram-v2
    [*] --> draft
    draft --> in_progress
    draft --> submitted
    in_progress --> draft
    in_progress --> submitted
    submitted --> under_review
    under_review --> approved
    under_review --> rejected
    approved --> [*]
    rejected --> [*]

Relationships: Belongs to company, belongs to grant, has many documents.


Additional Tables

The database includes several supporting tables:

Table Description
documents File metadata for S3-stored documents linked to applications
notification_preferences Per-user notification frequency, channels, and thresholds
notification_history Sent notification log with delivery status
partners Partner/referral program accounts
referrals Partner referral tracking with conversion status
commissions Partner commission calculations and payment status
sync_runs Grant pipeline sync execution history
sync_status Per-source sync status tracking
vcri_projects VCRI (Verified Carbon Reduction Initiative) project records
vcri_measurements VCRI emission measurement data points
vcri_certificates VCRI certificates with optional blockchain anchoring
alembic_version Alembic migration version tracking

Indexes

Standard Indexes

Composite and single-column indexes for common query patterns:

Table Index Columns Purpose
tenants B-tree slug (UNIQUE) Tenant lookup by slug
tenants B-tree is_active Active tenant filtering
users B-tree tenant_id Tenant-scoped user queries
users B-tree email Login lookup
companies B-tree tenant_id Tenant-scoped company queries
companies B-tree country Country filtering
grants B-tree source Source filtering
grants B-tree deadline Deadline sorting and filtering
grants B-tree status Status filtering
grants B-tree is_carbon_focused Carbon grant filtering
grants B-tree content_hash Deduplication lookups
matches B-tree tenant_id Tenant-scoped match queries
matches B-tree company_id Company match lookups
matches B-tree grant_id Grant match lookups
matches UNIQUE (company_id, grant_id) Prevent duplicate matches
applications B-tree tenant_id Tenant-scoped queries
applications B-tree company_id Company application lookups
applications B-tree grant_id Grant application lookups
applications B-tree status Status filtering

Vector Indexes (pgvector)

HNSW (Hierarchical Navigable Small World) indexes for approximate nearest-neighbor search on 768-dimensional embeddings:

Table Column Dimensions Algorithm Purpose
companies description_embedding 768 HNSW Semantic company matching
companies carbon_profile_embedding 768 HNSW Carbon profile similarity
grants title_embedding 768 HNSW Grant title similarity
grants description_embedding 768 HNSW Grant description similarity

HNSW vs IVFFlat

The project uses HNSW indexes over IVFFlat because HNSW provides better recall at query time without requiring periodic retraining. HNSW index build time is higher but query performance is superior for the expected dataset size (100k-500k grants).


Alembic Migrations

Configuration

Alembic is configured in alembic/env.py to use the synchronous psycopg2 driver for migrations (asyncpg has compatibility issues on Windows). The database URL is pulled from the application Settings class.

All SQLAlchemy models are imported in alembic/env.py so that autogenerate can detect schema changes:

from backend.app.models.database import (
    Application, Company, Document, Grant, Match,
    NotificationHistory, NotificationPreferences,
    SyncRun, SyncStatus, Tenant, User,
)

Common Commands

# Apply all pending migrations
poetry run alembic upgrade head

# Roll back one migration
poetry run alembic downgrade -1

# Roll back to a specific revision
poetry run alembic downgrade <revision_id>

# Generate a new migration from model changes
poetry run alembic revision --autogenerate -m "Add new column to grants"

# View current migration state
poetry run alembic current

# View migration history
poetry run alembic history

# View migration history with verbose output
poetry run alembic history -v

Migration Best Practices

Always Review Autogenerated Migrations

Alembic's --autogenerate flag detects most schema changes but may miss:

  • Index name changes
  • Constraint modifications
  • Custom SQL (RLS policies, functions)
  • pgvector-specific operations

Always review the generated migration file before applying it.

pgvector Extension

The pgvector extension is created by scripts/init-extensions.sql during Docker container initialization, not by Alembic. If you see pgvector not defined errors during migration, ensure the init script has run.


Connection Strings

# Async (application runtime)
postgresql+asyncpg://grant_user:grant_password_dev@localhost:5433/grant_engine?ssl=disable

# Sync (Alembic migrations)
postgresql://grant_user:grant_password_dev@localhost:5433/grant_engine
# Async (application runtime)
postgresql+asyncpg://{user}:{password}@{rds_endpoint}:5432/{database}

# Sync (Alembic migrations)
postgresql://{user}:{password}@{rds_endpoint}:5432/{database}