Database Subsetting and Synthetic Data

Subset databases populated with synthetic data give LLMs a compact, safe context for cross-database analytics.

Database subsetting is complex due to several challenges:

  1. When you select a subset of rows from a table, you must also select the matching rows in related tables.
  2. A relational constraint may span multiple columns (a composite key). In this case, the subset must pull related rows by matching on all of the constrained columns together.
  3. Some tables have no foreign key columns of their own but are referenced by foreign keys in other tables.
  4. Cyclical dependencies: table A has a foreign key to table B, and table B has a foreign key back to table A.
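
The four challenges above can be illustrated with a minimal Python sketch. It is not LangGrant's implementation; the toy tables, the FOREIGN_KEYS list, and the subset function are all hypothetical names chosen for illustration. The sketch starts from a seed set of rows and follows foreign keys in both directions until no new rows are found, comparing composite keys as tuples and using a visited set so that cyclical dependencies terminate.

```python
from collections import deque

# Hypothetical toy data (an assumption for illustration): table -> list of rows.
TABLES = {
    "customers": [{"id": 1}, {"id": 2}],
    "orders": [{"id": 10, "customer_id": 1}, {"id": 11, "customer_id": 2}],
    "order_items": [
        {"order_id": 10, "line_no": 1},
        {"order_id": 10, "line_no": 2},
        {"order_id": 11, "line_no": 1},
    ],
    "shipments": [
        {"id": 100, "order_id": 10, "line_no": 2},
        {"id": 101, "order_id": 11, "line_no": 1},
    ],
    "employees": [
        {"id": 1, "dept_id": 1},
        {"id": 2, "dept_id": 1},
        {"id": 3, "dept_id": 2},
    ],
    "departments": [{"id": 1, "manager_id": 1}, {"id": 2, "manager_id": 3}],
}

# (child table, child columns, parent table, parent columns)
FOREIGN_KEYS = [
    ("orders", ("customer_id",), "customers", ("id",)),
    ("order_items", ("order_id",), "orders", ("id",)),
    # Composite key: both columns must match together.
    ("shipments", ("order_id", "line_no"), "order_items", ("order_id", "line_no")),
    # Cycle: employees -> departments -> employees.
    ("employees", ("dept_id",), "departments", ("id",)),
    ("departments", ("manager_id",), "employees", ("id",)),
]


def key(row, cols):
    """Return the values of `cols` as a tuple so composite keys compare as one unit."""
    return tuple(row[c] for c in cols)


def subset(seed_table, predicate):
    """Select seed rows, then pull in every row reachable through a foreign key."""
    selected = {name: set() for name in TABLES}  # table -> selected row indexes
    queue = deque()

    def select(table, idx):
        # The visited check is what makes cyclical dependencies terminate.
        if idx not in selected[table]:
            selected[table].add(idx)
            queue.append((table, idx))

    for idx, row in enumerate(TABLES[seed_table]):
        if predicate(row):
            select(seed_table, idx)

    while queue:
        table, idx = queue.popleft()
        row = TABLES[table][idx]
        for child, ccols, parent, pcols in FOREIGN_KEYS:
            if table == child:  # pull the parent row this row refers to
                for j, other in enumerate(TABLES[parent]):
                    if key(other, pcols) == key(row, ccols):
                        select(parent, j)
            if table == parent:  # pull child rows that refer to this row
                for j, other in enumerate(TABLES[child]):
                    if key(other, ccols) == key(row, pcols):
                        select(child, j)

    return {t: [TABLES[t][i] for i in sorted(ix)] for t, ix in selected.items()}
```

Calling subset("customers", lambda row: row["id"] == 1) returns customer 1 together with its order, both order items, and the one shipment that matches on both key columns. Following child rows as well as parent rows is a policy choice in this sketch; on a densely connected schema it can pull in far more data than a size target allows, so a real tool would need to bound it.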

LangGrant solves these challenges with a simple visual interface. Start by specifying a percent of the source database size, with or without

Challenges in cross database analytics

Joining data between databases is challenging because column-level data rarely matches exactly. Fortunately, subsetting can reduce a multi-terabyte database to megabytes while retaining full relational integrity. The downsized databases are then populated with synthetic data, giving LLMs the safe context they need to generate cross-database analytics.
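
As a rough illustration of populating a subset with synthetic data, the Python sketch below replaces every non-key value with a random value of the same type while leaving key columns untouched, so relationships between tables still line up. The synthesize function, the sample rows, and the generation rules are assumptions made for this example; the source does not describe how LangGrant generates synthetic values.

```python
import random
import string

# Hypothetical sample rows from a subset table (an assumption for illustration).
ORDERS = [
    {"id": 10, "customer_id": 1, "email": "ann@example.com", "total": 250},
    {"id": 11, "customer_id": 2, "email": "bob@example.com", "total": 75},
]


def synthesize(rows, key_cols, seed=0):
    """Return copies of `rows` with non-key values replaced by synthetic ones."""
    rng = random.Random(seed)  # seeded so the output is repeatable
    out = []
    for row in rows:
        fake = {}
        for col, val in row.items():
            if col in key_cols:
                fake[col] = val  # keep keys so joins between tables still work
            elif isinstance(val, int):
                # Random non-negative integer with at most as many digits as the original.
                fake[col] = rng.randint(0, 10 ** len(str(abs(val))) - 1)
            elif isinstance(val, str):
                # Random lowercase string with the same length as the original.
                fake[col] = "".join(rng.choice(string.ascii_lowercase) for _ in val)
            else:
                fake[col] = None  # other types are simply blanked in this sketch
        out.append(fake)
    return out
```

For example, synthesize(ORDERS, {"id", "customer_id"}) keeps the id and customer_id values but replaces each email and total. Random letters are the simplest possible choice here; realistic synthetic data would normally preserve formats and value distributions as well.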

Cross database joins

An LLM provided with a safe, synthetically populated database context is better able to specify a join strategy. LangGrant supports a range of joins, including fuzzy, distance, and exact match.
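
To make the three join styles concrete, here is a small Python sketch using only the standard library. It is not LangGrant's API: the function names are hypothetical, "fuzzy" is interpreted as a difflib similarity ratio, and "distance" is interpreted as Levenshtein edit distance, which is an assumption since the source does not define the term.

```python
from difflib import SequenceMatcher


def normalize(value):
    """Lowercase and collapse whitespace so trivial differences do not block a match."""
    return " ".join(value.lower().split())


def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def join(left, right, matches):
    """Nested-loop join: keep every (left, right) pair the matcher accepts."""
    return [(l, r) for l in left for r in right
            if matches(normalize(l), normalize(r))]


def exact_join(left, right):
    return join(left, right, lambda a, b: a == b)


def fuzzy_join(left, right, threshold=0.85):
    return join(left, right,
                lambda a, b: SequenceMatcher(None, a, b).ratio() >= threshold)


def distance_join(left, right, max_distance=1):
    return join(left, right, lambda a, b: edit_distance(a, b) <= max_distance)


# Hypothetical values from the same logical column in two databases.
LEFT = ["Jon Smith", "Jane Doe", "ACME  corp"]
RIGHT = ["John Smith", "acme corp", "Mary Major"]
```

With this sample data, exact_join(LEFT, RIGHT) pairs only the two spellings of the ACME name, while fuzzy_join and distance_join also pair "Jon Smith" with "John Smith". The nested loop keeps the sketch short but compares every pair of values, which is practical only on small inputs such as a subset database.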

Automated database context 

LangGrant automatically delivers complete database context so that LLMs can comprehend multiple databases simultaneously and at scale. Once an LLM understands the databases, it can contribute to solution design, much as a skilled engineer would.

Micro data lakes on demand 

LangGrant binds LLMs to create accurate analytic plans for user queries, resulting in an inference-ready “micro data lake.” Plans are saved, easily validated and modified, and then run to deliver the analytic data within minutes of the user query.

Governance

PII safeguards, authorization controls, data residency rules, firewall restrictions, and token-governance policies are built in by design. No sensitive data leaves governed systems.

Plan management

LLM-generated plans are saved, easily reviewed and validated, modified, and executed, making LLM use transparent, explainable, and repeatable.

Database cloning and containers

On-demand database clones in containers give agent developers copies of production databases (with optional masking) for agentic AI dev/test.

Database subsetting and synthetic data 

Database subsetting with synthetic data provides added context for working with complex multi-database environments.
