Synthetic data generation uses a Select AI profile to populate database tables with generated rows. It is useful for demos, development, testing, and prototyping when representative data is needed but production data should not be copied.
Synthetic data is inserted into the target tables in the connected database
schema or in the schema identified by owner_name or object_list. Before
running generation, make sure the connected user has privileges on the target
tables and that the Select AI profile is configured with a provider and
credential.
Use synthetic data generation with care in shared schemas. The API writes rows to the target tables; use dedicated test tables or schemas when experimenting.
1. Generation modes¶
Use object_name for a single target table:
attributes = select_ai.SyntheticDataAttributes(
object_name="MOVIE",
record_count=100,
user_prompt="the release date for the movies should be in 2019",
)
Use object_list for multiple target tables in one request:
attributes = select_ai.SyntheticDataAttributes(
object_list=[
{
"owner": "SH",
"name": "MOVIE",
"record_count": 100,
"user_prompt": (
"the release date for the movies should be in 2019"
),
},
{"owner": "SH", "name": "ACTOR", "record_count": 10},
{"owner": "SH", "name": "DIRECTOR", "record_count": 5},
]
)
Exactly one of object_name or object_list must be set.
2. Generation parameters¶
Use SyntheticDataParams to control how generation is performed:
params = select_ai.SyntheticDataParams(
sample_rows=100,
table_statistics=True,
priority="HIGH",
comments=True,
)
attributes = select_ai.SyntheticDataAttributes(
object_name="MOVIE",
record_count=100,
user_prompt="Generate movie data for releases in 2019.",
params=params,
)
sample_rows controls how many existing rows are used as examples for the
model. table_statistics and comments include additional table metadata.
priority controls resource priority for generation work; supported values
are HIGH, MEDIUM, and LOW.
3. Sync and async APIs¶
Use Profile.generate_synthetic_data(...) for synchronous applications and
await AsyncProfile.generate_synthetic_data(...) for asynchronous
applications:
profile = select_ai.Profile(profile_name="oci_ai_profile")
profile.generate_synthetic_data(
synthetic_data_attributes=attributes,
)
async_profile = await select_ai.AsyncProfile(
profile_name="async_oci_ai_profile",
)
await async_profile.generate_synthetic_data(
synthetic_data_attributes=attributes,
)
For additional database-side attribute details, see the generate_synthetic_data PL/SQL API.
3.1. SyntheticDataAttributes¶
- class select_ai.SyntheticDataAttributes(object_name: str | None = None, object_list: List[Mapping] | None = None, owner_name: str | None = None, params: SyntheticDataParams | None = None, record_count: int | None = None, user_prompt: str | None = None)¶
Attributes to control generation of synthetic data
- Parameters:
object_name (str) – Table name to populate synthetic data
object_list (List[Mapping]) – Use this to generate synthetic data on multiple tables
owner_name (str) – Database user who owns the referenced object. Default value is connected user’s schema
record_count (int) – Number of records to generate
user_prompt (str) – User prompt to guide generation of synthetic data For e.g. “the release date for the movies should be in 2019”
3.2. SyntheticDataParams¶
- class select_ai.SyntheticDataParams(sample_rows: int | None = None, table_statistics: bool | None = False, priority: str | None = 'HIGH', comments: bool | None = False)¶
Optional parameters to control generation of synthetic data
- Parameters:
sample_rows (int) – number of rows from the table to use as a sample to guide the LLM in data generation
table_statistics (bool) – Enable or disable the use of table statistics information. Default value is False
priority (str) – Assign a priority value that defines the number of parallel requests sent to the LLM for generating synthetic data. Tasks with a higher priority will consume more database resources and complete faster. Possible values are: HIGH, MEDIUM, LOW
comments (bool) – Enable or disable sending comments to the LLM to guide data generation. Default value is False
3.3. Single table synthetic data¶
The below example shows single table synthetic data generation
3.3.1. Single Table Sync API¶
import os
import select_ai
user = os.getenv("SELECT_AI_USER")
password = os.getenv("SELECT_AI_PASSWORD")
dsn = os.getenv("SELECT_AI_DB_CONNECT_STRING")
select_ai.connect(user=user, password=password, dsn=dsn)
profile = select_ai.Profile(profile_name="oci_ai_profile")
synthetic_data_params = select_ai.SyntheticDataParams(
sample_rows=100, table_statistics=True, priority="HIGH"
)
synthetic_data_attributes = select_ai.SyntheticDataAttributes(
object_name="MOVIE",
user_prompt="the release date for the movies should be in 2019",
params=synthetic_data_params,
record_count=100,
)
profile.generate_synthetic_data(
synthetic_data_attributes=synthetic_data_attributes
)
output:
SQL> select count(*) from movie;
COUNT(*)
----------
100
3.3.2. Single Table Async API¶
import asyncio
import os
import select_ai
user = os.getenv("SELECT_AI_USER")
password = os.getenv("SELECT_AI_PASSWORD")
dsn = os.getenv("SELECT_AI_DB_CONNECT_STRING")
async def main():
await select_ai.async_connect(user=user, password=password, dsn=dsn)
async_profile = await select_ai.AsyncProfile(
profile_name="async_oci_ai_profile",
)
synthetic_data_params = select_ai.SyntheticDataParams(
sample_rows=100, table_statistics=True, priority="HIGH"
)
synthetic_data_attributes = select_ai.SyntheticDataAttributes(
object_name="MOVIE",
user_prompt="the release date for the movies should be in 2019",
params=synthetic_data_params,
record_count=100,
)
await async_profile.generate_synthetic_data(
synthetic_data_attributes=synthetic_data_attributes
)
asyncio.run(main())
output:
SQL> select count(*) from movie;
COUNT(*)
----------
100
3.4. Multi table synthetic data¶
The below example shows multi-table synthetic data generation
3.4.1. Multi table Sync API¶
import os
import select_ai
user = os.getenv("SELECT_AI_USER")
password = os.getenv("SELECT_AI_PASSWORD")
dsn = os.getenv("SELECT_AI_DB_CONNECT_STRING")
select_ai.connect(user=user, password=password, dsn=dsn)
profile = select_ai.Profile(profile_name="oci_ai_profile")
synthetic_data_params = select_ai.SyntheticDataParams(
sample_rows=100, table_statistics=True, priority="HIGH"
)
object_list = [
{
"owner": user,
"name": "MOVIE",
"record_count": 100,
"user_prompt": "the release date for the movies should be in 2019",
},
{"owner": user, "name": "ACTOR", "record_count": 10},
{"owner": user, "name": "DIRECTOR", "record_count": 5},
]
synthetic_data_attributes = select_ai.SyntheticDataAttributes(
object_list=object_list, params=synthetic_data_params
)
profile.generate_synthetic_data(
synthetic_data_attributes=synthetic_data_attributes
)
output:
SQL> select count(*) from actor;
COUNT(*)
----------
40
SQL> select count(*) from director;
COUNT(*)
----------
13
SQL> select count(*) from movie;
COUNT(*)
----------
300
3.4.2. Multi table Async API¶
import asyncio
import os
import select_ai
user = os.getenv("SELECT_AI_USER")
password = os.getenv("SELECT_AI_PASSWORD")
dsn = os.getenv("SELECT_AI_DB_CONNECT_STRING")
async def main():
await select_ai.async_connect(user=user, password=password, dsn=dsn)
async_profile = await select_ai.AsyncProfile(
profile_name="async_oci_ai_profile",
)
synthetic_data_params = select_ai.SyntheticDataParams(
sample_rows=100, table_statistics=True, priority="HIGH"
)
object_list = [
{
"owner": user,
"name": "MOVIE",
"record_count": 100,
"user_prompt": "the release date for the movies should be in 2019",
},
{"owner": user, "name": "ACTOR", "record_count": 10},
{"owner": user, "name": "DIRECTOR", "record_count": 5},
]
synthetic_data_attributes = select_ai.SyntheticDataAttributes(
object_list=object_list, params=synthetic_data_params
)
await async_profile.generate_synthetic_data(
synthetic_data_attributes=synthetic_data_attributes
)
asyncio.run(main())
output:
SQL> select count(*) from actor;
COUNT(*)
----------
40
SQL> select count(*) from director;
COUNT(*)
----------
13
SQL> select count(*) from movie;
COUNT(*)
----------
300