Synthetic data generation uses a Select AI profile to populate database tables with generated rows. It is useful for demos, development, testing, and prototyping when representative data is needed but production data should not be copied.

Synthetic data is inserted into the target tables in the connected database schema or in the schema identified by owner_name or object_list. Before running generation, make sure the connected user has privileges on the target tables and that the Select AI profile is configured with a provider and credential.

Use synthetic data generation with care in shared schemas. The API writes rows to the target tables; use dedicated test tables or schemas when experimenting.

1. Generation modes

Use object_name for a single target table:

attributes = select_ai.SyntheticDataAttributes(
    object_name="MOVIE",
    record_count=100,
    user_prompt="the release date for the movies should be in 2019",
)

Use object_list for multiple target tables in one request:

attributes = select_ai.SyntheticDataAttributes(
    object_list=[
        {
            "owner": "SH",
            "name": "MOVIE",
            "record_count": 100,
            "user_prompt": (
                "the release date for the movies should be in 2019"
            ),
        },
        {"owner": "SH", "name": "ACTOR", "record_count": 10},
        {"owner": "SH", "name": "DIRECTOR", "record_count": 5},
    ]
)

Exactly one of object_name or object_list must be set.

2. Generation parameters

Use SyntheticDataParams to control how generation is performed:

params = select_ai.SyntheticDataParams(
    sample_rows=100,
    table_statistics=True,
    priority="HIGH",
    comments=True,
)

attributes = select_ai.SyntheticDataAttributes(
    object_name="MOVIE",
    record_count=100,
    user_prompt="Generate movie data for releases in 2019.",
    params=params,
)

sample_rows controls how many existing rows are used as examples for the model. table_statistics and comments include additional table metadata. priority controls resource priority for generation work; supported values are HIGH, MEDIUM, and LOW.

3. Sync and async APIs

Use Profile.generate_synthetic_data(...) for synchronous applications and await AsyncProfile.generate_synthetic_data(...) for asynchronous applications:

profile = select_ai.Profile(profile_name="oci_ai_profile")
profile.generate_synthetic_data(
    synthetic_data_attributes=attributes,
)
async_profile = await select_ai.AsyncProfile(
    profile_name="async_oci_ai_profile",
)
await async_profile.generate_synthetic_data(
    synthetic_data_attributes=attributes,
)

For additional database-side attribute details, see the generate_synthetic_data PL/SQL API.

3.1. SyntheticDataAttributes

class select_ai.SyntheticDataAttributes(object_name: str | None = None, object_list: List[Mapping] | None = None, owner_name: str | None = None, params: SyntheticDataParams | None = None, record_count: int | None = None, user_prompt: str | None = None)

Attributes to control generation of synthetic data

Parameters:
  • object_name (str) – Table name to populate synthetic data

  • object_list (List[Mapping]) – Use this to generate synthetic data on multiple tables

  • owner_name (str) – Database user who owns the referenced object. Default value is connected user’s schema

  • record_count (int) – Number of records to generate

  • user_prompt (str) – User prompt to guide generation of synthetic data For e.g. “the release date for the movies should be in 2019”

3.2. SyntheticDataParams

class select_ai.SyntheticDataParams(sample_rows: int | None = None, table_statistics: bool | None = False, priority: str | None = 'HIGH', comments: bool | None = False)

Optional parameters to control generation of synthetic data

Parameters:
  • sample_rows (int) – number of rows from the table to use as a sample to guide the LLM in data generation

  • table_statistics (bool) – Enable or disable the use of table statistics information. Default value is False

  • priority (str) – Assign a priority value that defines the number of parallel requests sent to the LLM for generating synthetic data. Tasks with a higher priority will consume more database resources and complete faster. Possible values are: HIGH, MEDIUM, LOW

  • comments (bool) – Enable or disable sending comments to the LLM to guide data generation. Default value is False

3.3. Single table synthetic data

The below example shows single table synthetic data generation

3.3.1. Single Table Sync API

import os

import select_ai

user = os.getenv("SELECT_AI_USER")
password = os.getenv("SELECT_AI_PASSWORD")
dsn = os.getenv("SELECT_AI_DB_CONNECT_STRING")

select_ai.connect(user=user, password=password, dsn=dsn)
profile = select_ai.Profile(profile_name="oci_ai_profile")
synthetic_data_params = select_ai.SyntheticDataParams(
    sample_rows=100, table_statistics=True, priority="HIGH"
)
synthetic_data_attributes = select_ai.SyntheticDataAttributes(
    object_name="MOVIE",
    user_prompt="the release date for the movies should be in 2019",
    params=synthetic_data_params,
    record_count=100,
)
profile.generate_synthetic_data(
    synthetic_data_attributes=synthetic_data_attributes
)

output:

SQL> select count(*) from movie;

  COUNT(*)
----------
       100

3.3.2. Single Table Async API

import asyncio
import os

import select_ai

user = os.getenv("SELECT_AI_USER")
password = os.getenv("SELECT_AI_PASSWORD")
dsn = os.getenv("SELECT_AI_DB_CONNECT_STRING")


async def main():
    await select_ai.async_connect(user=user, password=password, dsn=dsn)
    async_profile = await select_ai.AsyncProfile(
        profile_name="async_oci_ai_profile",
    )
    synthetic_data_params = select_ai.SyntheticDataParams(
        sample_rows=100, table_statistics=True, priority="HIGH"
    )
    synthetic_data_attributes = select_ai.SyntheticDataAttributes(
        object_name="MOVIE",
        user_prompt="the release date for the movies should be in 2019",
        params=synthetic_data_params,
        record_count=100,
    )
    await async_profile.generate_synthetic_data(
        synthetic_data_attributes=synthetic_data_attributes
    )


asyncio.run(main())

output:

SQL> select count(*) from movie;

  COUNT(*)
----------
       100

3.4. Multi table synthetic data

The below example shows multi-table synthetic data generation

3.4.1. Multi table Sync API

import os

import select_ai

user = os.getenv("SELECT_AI_USER")
password = os.getenv("SELECT_AI_PASSWORD")
dsn = os.getenv("SELECT_AI_DB_CONNECT_STRING")

select_ai.connect(user=user, password=password, dsn=dsn)
profile = select_ai.Profile(profile_name="oci_ai_profile")
synthetic_data_params = select_ai.SyntheticDataParams(
    sample_rows=100, table_statistics=True, priority="HIGH"
)
object_list = [
    {
        "owner": user,
        "name": "MOVIE",
        "record_count": 100,
        "user_prompt": "the release date for the movies should be in 2019",
    },
    {"owner": user, "name": "ACTOR", "record_count": 10},
    {"owner": user, "name": "DIRECTOR", "record_count": 5},
]
synthetic_data_attributes = select_ai.SyntheticDataAttributes(
    object_list=object_list, params=synthetic_data_params
)
profile.generate_synthetic_data(
    synthetic_data_attributes=synthetic_data_attributes
)

output:

SQL> select count(*) from actor;

  COUNT(*)
----------
    40

SQL> select count(*) from director;

  COUNT(*)
----------
    13

SQL> select count(*) from movie;

  COUNT(*)
----------
       300

3.4.2. Multi table Async API

import asyncio
import os

import select_ai

user = os.getenv("SELECT_AI_USER")
password = os.getenv("SELECT_AI_PASSWORD")
dsn = os.getenv("SELECT_AI_DB_CONNECT_STRING")


async def main():
    await select_ai.async_connect(user=user, password=password, dsn=dsn)
    async_profile = await select_ai.AsyncProfile(
        profile_name="async_oci_ai_profile",
    )
    synthetic_data_params = select_ai.SyntheticDataParams(
        sample_rows=100, table_statistics=True, priority="HIGH"
    )
    object_list = [
        {
            "owner": user,
            "name": "MOVIE",
            "record_count": 100,
            "user_prompt": "the release date for the movies should be in 2019",
        },
        {"owner": user, "name": "ACTOR", "record_count": 10},
        {"owner": user, "name": "DIRECTOR", "record_count": 5},
    ]
    synthetic_data_attributes = select_ai.SyntheticDataAttributes(
        object_list=object_list, params=synthetic_data_params
    )
    await async_profile.generate_synthetic_data(
        synthetic_data_attributes=synthetic_data_attributes
    )


asyncio.run(main())

output:

SQL> select count(*) from actor;

  COUNT(*)
----------
    40

SQL> select count(*) from director;

  COUNT(*)
----------
    13

SQL> select count(*) from movie;

  COUNT(*)
----------
       300