How to Perform Data Synthesis in WayFlow#
This guide provides practical and adaptable methods for generating synthetic datasets tailored to a range of real-world use cases. Use it when you need to create evaluation datasets for model testing, produce sample data for demonstrations, or address data privacy requirements. You will find reproducible workflows and techniques that you can adapt to your own data schema and constraints.
Each synthesis approach is presented in relation to specific feature requirements. Most examples start from a seed dataset to capture authentic data characteristics. However, you can also apply these methods without real data by specifying a target schema and using tools such as Faker. With this guide, you can generate synthetic data that balances statistical realism with project flexibility.
Prerequisites and setup#
Desired output schema: Define the fields and features your synthetic dataset should contain. Specify each field’s type and any necessary constraints.
Seed dataset (optional, recommended):
Use a seed dataset if you want to preserve real-world distributional properties, such as means, variances, or category frequencies.
A seed dataset is also essential if you need to interpolate or generate synthetic values conditioned on, or resembling, existing feature values (for example, generating synthetic ages based on a real distribution).
No seed dataset?
If you do not have real data to start with, you can use the Faker package. Faker provides many generators for structured data such as names, addresses, dates, countries, and phone numbers. Use it to quickly create dummy datasets with any schema; note that the values will not reflect any empirical source.
Note
Choosing whether to use a seed dataset or a synthetic generator depends on your use case.
For data realism and statistical fidelity, start from a seed dataset.
For flexible or arbitrary data structures without statistical constraints, use Faker or similar tools.
Target schema and feature properties#
The following table summarizes the type, synthesis method, and key properties or constraints for each feature:
| Feature | Type | Interp./Extrap. | Method or constraint | Distribution target |
|---|---|---|---|---|
| customer_type | cat | Interpolation | Sampled with empirical (seed) univariate distribution | Preserve univariate |
| customer_name | string | Extrapolation | Use Faker | / |
| region | cat | Interpolation | Choice among [EMEA, JAPAC, LAD, NA] | / |
| country | cat | Interpolation | Based on mapping from region; randomly select among possible countries in region | Coherency: region |
| customer_net_worth | num | Interpolation | Sampled from joint empirical distribution with region | Joint: region |
| loan_value | num | Interpolation | Sampled from joint empirical distribution with region | Joint: region |
| loan_proposed_interest | num | Interpolation | Sampled from joint empirical distribution with region | Joint: region |
| loan_reason | cat | Interpolation | Sampled with empirical (seed) univariate distribution | Preserve univariate |
| justification | string | Extrapolation | Long text generation | Coherency with other fields |
Interp./Extrap.: Indicates if the feature is synthesized from existing (seed) values (interpolation) or from new, plausible values (extrapolation).
Method or constraint: Specifies any specialized synthesis function or logic, including use of Faker.
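One way to make the table above machine-checkable is to encode it as a small specification and validate rows against it. The following is a minimal sketch, not part of WayFlow: `FeatureSpec`, `TARGET_SCHEMA`, and `check_row` are hypothetical names introduced here for illustration, and the checks cover only types and the closed value set for `region`.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class FeatureSpec:
    """Hypothetical container for one row of the property table."""

    name: str
    kind: str  # "cat", "num", or "string"
    mode: str  # "interpolation" or "extrapolation"
    allowed: Optional[tuple[str, ...]] = None  # closed value set, if any


TARGET_SCHEMA = [
    FeatureSpec("customer_type", "cat", "interpolation"),
    FeatureSpec("customer_name", "string", "extrapolation"),
    FeatureSpec("region", "cat", "interpolation", allowed=("EMEA", "JAPAC", "LAD", "NA")),
    FeatureSpec("country", "cat", "interpolation"),
    FeatureSpec("customer_net_worth", "num", "interpolation"),
    FeatureSpec("loan_value", "num", "interpolation"),
    FeatureSpec("loan_proposed_interest", "num", "interpolation"),
    FeatureSpec("loan_reason", "cat", "interpolation"),
    FeatureSpec("justification", "string", "extrapolation"),
]


def check_row(row: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the row conforms."""
    problems = []
    for spec in TARGET_SCHEMA:
        if spec.name not in row:
            problems.append(f"{spec.name}: missing")
            continue
        value = row[spec.name]
        if spec.kind == "num":
            # bool is a subclass of int, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                problems.append(f"{spec.name}: expected a number")
        elif not isinstance(value, str):
            problems.append(f"{spec.name}: expected a string")
        if spec.allowed is not None and value not in spec.allowed:
            problems.append(f"{spec.name}: {value!r} is not an allowed value")
    return problems


example_row = {
    "customer_type": "individual",
    "customer_name": "Samantha Rivera",
    "region": "NA",
    "country": "United States",
    "customer_net_worth": 1580000,
    "loan_value": 80000,
    "loan_proposed_interest": 3.5,
    "loan_reason": "Home renovation",
    "justification": "Example text.",
}
print(check_row(example_row))  # prints []
```

A check like this can be run over every synthesized record before export to catch schema drift early.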
Load and inspect the seed data#
Begin by loading your seed dataset. This data provides distributions, feature values, and coherency patterns for synthetic data generation. For demonstration, the following example defines a small seed dataset matching the target schema above, but you can use your own data.
import pandas as pd
seed_data = [
{
"customer_type": "organization",
"customer_name": "Atlas Engineering Ltd.",
"region": "EMEA",
"country": "Germany",
"customer_net_worth": 75200000,
"loan_value": 12500000,
"loan_proposed_interest": 4.9,
"loan_reason": "Business expansion",
"justification": "Atlas Engineering Ltd. seeks a $12,500,000 loan at a proposed 4.9% interest to expand its operations across the German automotive sector. With a robust net worth of $75,200,000 and a history of successful project delivery, the company is poised for growth. The expansion will create new jobs and increase market share, while our solid financials and collateral provide strong assurance. This aligns with our strategic business roadmap and justifies a favorable interest rate approval.",
},
{
"customer_type": "individual",
"customer_name": "Samantha Rivera",
"region": "NA",
"country": "United States",
"customer_net_worth": 1580000,
"loan_value": 80000,
"loan_proposed_interest": 3.5,
"loan_reason": "Home renovation",
"justification": "Samantha Rivera requests a $80,000 loan at 3.5% interest to renovate her primary residence. With a net worth of $1,580,000 and stable income, Samantha’s credit profile is strong. The renovation will enhance her property value, reducing default risk. Given her responsible financial management and the secured nature of the loan, she is a low-risk borrower deserving of these favorable terms.",
},
{
"customer_type": "organization",
"customer_name": "MedicoPharma Solutions",
"region": "JAPAC",
"country": "Singapore",
"customer_net_worth": 18500000,
"loan_value": 4500000,
"loan_proposed_interest": 5.7,
"loan_reason": "Investment capital",
"justification": "MedicoPharma Solutions applies for a $4,500,000 loan at a 5.7% rate to fund critical biotech investments. With a net worth of $18,500,000 and consistent year-over-year growth, the company demonstrates fiscal responsibility. The requested capital is earmarked for new laboratory equipment and R&D, supporting innovation and future revenue. The loan structure and interest rate are well-justified by the company's planned implementations and financial record.",
},
{
"customer_type": "individual",
"customer_name": "Rajeev Malhotra",
"region": "JAPAC",
"country": "India",
"customer_net_worth": 280000,
"loan_value": 42000,
"loan_proposed_interest": 4.2,
"loan_reason": "Education costs",
"justification": "Rajeev Malhotra is seeking a $42,000 loan at 4.2% interest to pursue a postgraduate program in Bangalore. With a net worth of $280,000 and a stable employment history, Rajeev is committed to investing in his education for career advancement. The loan will cover tuition and related expenses, and his repayment plan is backed by projected post-graduation earnings, warranting an approval at this rate.",
},
{
"customer_type": "organization",
"customer_name": "Green Horizons Ltd.",
"region": "LAD",
"country": "Brazil",
"customer_net_worth": 6270000,
"loan_value": 1300000,
"loan_proposed_interest": 5.2,
"loan_reason": "Major appliance",
"justification": "Green Horizons Ltd. is requesting a $1,300,000 loan with a 5.2% proposed interest rate to acquire industrial-scale solar panel systems for its new facility. With a net worth of $6,270,000, the organization’s history in sustainable development makes this investment logical. The equipment will reduce operational costs and support environmental commitments, providing a compelling rationale for loan approval.",
},
{
"customer_type": "individual",
"customer_name": "Lucia González",
"region": "LAD",
"country": "Argentina",
"customer_net_worth": 98000,
"loan_value": 19000,
"loan_proposed_interest": 6.0,
"loan_reason": "Debt consolidation",
"justification": "Lucia González seeks a $19,000 loan at 6.0% interest to consolidate existing debts into a single manageable payment. With a net worth of $98,000 and steady monthly income, Lucia will benefit from simplified repayment terms and reduced overall interest expenses. Her disciplined repayment record supports the requested terms and merits consideration.",
},
{
"customer_type": "organization",
"customer_name": "BlueStar Trading GmbH",
"region": "EMEA",
"country": "Switzerland",
"customer_net_worth": 26000000,
"loan_value": 6200000,
"loan_proposed_interest": 3.3,
"loan_reason": "Vehicle purchase",
"justification": "BlueStar Trading GmbH is applying for a $6,200,000 loan at 3.3% interest to upgrade its commercial vehicle fleet in Switzerland. Backed by a $26,000,000 net worth and healthy balance sheet, the acquisition will improve logistics and lower operational costs. Their creditworthiness and the use of vehicles as collateral minimize risk, justifying loan approval at this competitive rate.",
},
{
"customer_type": "individual",
"customer_name": "Fatima Al-Hassan",
"region": "EMEA",
"country": "United Arab Emirates",
"customer_net_worth": 1200000,
"loan_value": 38000,
"loan_proposed_interest": 3.9,
"loan_reason": "Vacation funding",
"justification": "Fatima Al-Hassan requests a $38,000 loan at a 3.9% interest rate to cover a family vacation abroad. With a net worth of $1,200,000 and a solid credit score, she is well-positioned to repay the loan comfortably. Her responsible financial history further supports approval at the proposed rate for this short-term, purpose-specific loan.",
},
{
"customer_type": "organization",
"customer_name": "Aussie Urban Properties",
"region": "JAPAC",
"country": "Australia",
"customer_net_worth": 51200000,
"loan_value": 9700000,
"loan_proposed_interest": 4.5,
"loan_reason": "Business expansion",
"justification": "Aussie Urban Properties seeks $9,700,000 in funding at a 4.5% interest rate to finance new residential developments in Sydney. With a net worth of $51,200,000 and proven experience in property management, the loan will be efficiently applied. The company’s past successes and robust financial health point to a strong ability to utilize and repay the loan as proposed.",
},
{
"customer_type": "individual",
"customer_name": "Jacob Peterson",
"region": "NA",
"country": "Canada",
"customer_net_worth": 405000,
"loan_value": 31000,
"loan_proposed_interest": 2.7,
"loan_reason": "Medical expenses",
"justification": "Jacob Peterson is requesting a $31,000 loan at 2.7% interest to cover medical expenses resulting from an unexpected surgery. With a net worth of $405,000 and reliable employment, Jacob demonstrates strong repayment capacity. His consistent repayment track record and insurance partial coverage will mitigate lender risk, supporting approval at this attractive rate.",
},
]
df_seed = pd.DataFrame(seed_data)
print(df_seed)
Data synthesis#
Synthesis functions#
Define Python functions to generate feature values according to empirical distributions and project-specific constraints.
fake_categorical: Sample a categorical column by seed frequency. It uses NumPy to reproduce the marginal distribution observed in the seed data.
from typing import Any

import numpy as np


def fake_categorical(
    rng: np.random.Generator,
    pd_column: pd.Series,
    dropna: bool = False,
    **choice_kwargs: Any,
) -> Any:
    # Normalized value counts give the empirical probability of each category.
    value_counts_dict = pd_column.value_counts(normalize=True, dropna=dropna).to_dict()
    categories = np.array(list(value_counts_dict.keys()))
    probabilities = np.array(list(value_counts_dict.values()))
    # Extra keyword arguments (for example, size) are forwarded to rng.choice.
    return rng.choice(categories, p=probabilities, **choice_kwargs)
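To see why frequency-weighted sampling preserves the marginal distribution, the same idea can be sketched with only the standard library. This is an illustrative stand-in, not the guide's function: `sample_by_frequency` is a hypothetical helper that uses `random.choices` instead of NumPy, and the region list mirrors the seed data above.

```python
import random
from collections import Counter


def sample_by_frequency(values: list[str], n: int, rng: random.Random) -> list[str]:
    """Draw n values, weighting each category by its count in `values`."""
    counts = Counter(values)
    categories = list(counts)
    weights = [counts[c] for c in categories]
    return rng.choices(categories, weights=weights, k=n)


# Regions of the ten seed records, in order.
seed_regions = ["EMEA", "NA", "JAPAC", "JAPAC", "LAD", "LAD", "EMEA", "EMEA", "JAPAC", "NA"]

rng = random.Random(0)
sampled = sample_by_frequency(seed_regions, n=10_000, rng=rng)

seed_freq = {k: v / len(seed_regions) for k, v in Counter(seed_regions).items()}
synth_freq = {k: v / len(sampled) for k, v in Counter(sampled).items()}

# Largest absolute gap between seed and sampled frequencies across categories.
max_gap = max(abs(seed_freq[k] - synth_freq.get(k, 0.0)) for k in seed_freq)
print(max_gap < 0.03)
```

With a large sample the gap shrinks toward zero; with only a handful of rows, noticeable deviations from the seed frequencies are expected.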
fake_joint_numerical: Sample a numerical feature by joint distribution. It samples values for a numerical feature conditional on one or more categorical features (for example, customer_net_worth conditioned on region).
def fake_joint_numerical(
    rng: np.random.Generator,
    df: pd.DataFrame,
    value_col: str,
    cat_cols: str | list[str],
    synthesized_rows: pd.DataFrame,
) -> list[Any]:
    result = []
    cat_array = (
        synthesized_rows[cat_cols].values
        if isinstance(cat_cols, list)
        else synthesized_rows[[cat_cols]].values
    )
    for cats in cat_array:
        # Select the seed rows whose categorical values match the synthesized row.
        mask = np.ones(len(df), dtype=bool)
        if isinstance(cat_cols, list):
            for col, val in zip(cat_cols, cats):
                mask &= df[col].values == val
        else:
            mask &= df[cat_cols].values == cats[0]
        matching_vals = df.loc[mask, value_col]
        if not matching_vals.empty:
            result.append(rng.choice(matching_vals))
        else:
            # Fallback: overall empirical
            result.append(rng.choice(df[value_col]))
    return result
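The conditional-sampling idea can also be expressed as a lookup from category to a pool of seed values. The sketch below is a standard-library illustration under that assumption; `build_pools` and `sample_conditional` are hypothetical names, and the net worth values are copied from the seed data above.

```python
import random
from collections import defaultdict

seed_rows = [
    {"region": "EMEA", "customer_net_worth": 75_200_000},
    {"region": "NA", "customer_net_worth": 1_580_000},
    {"region": "JAPAC", "customer_net_worth": 18_500_000},
    {"region": "JAPAC", "customer_net_worth": 280_000},
    {"region": "LAD", "customer_net_worth": 6_270_000},
    {"region": "LAD", "customer_net_worth": 98_000},
    {"region": "EMEA", "customer_net_worth": 26_000_000},
    {"region": "EMEA", "customer_net_worth": 1_200_000},
    {"region": "JAPAC", "customer_net_worth": 51_200_000},
    {"region": "NA", "customer_net_worth": 405_000},
]


def build_pools(rows: list[dict], cat_col: str, value_col: str) -> dict[str, list]:
    """Group the seed values of value_col by the category in cat_col."""
    pools = defaultdict(list)
    for row in rows:
        pools[row[cat_col]].append(row[value_col])
    return dict(pools)


def sample_conditional(rng: random.Random, pools: dict, category: str, fallback: list):
    """Draw from the pool of `category`; use the overall pool if it is unknown."""
    return rng.choice(pools.get(category) or fallback)


rng = random.Random(0)
pools = build_pools(seed_rows, "region", "customer_net_worth")
all_values = [row["customer_net_worth"] for row in seed_rows]

# The last region does not exist in the seed data and triggers the fallback.
synth_regions = ["NA", "EMEA", "LAD", "JAPAC", "ANTARCTICA"]
synth_values = [sample_conditional(rng, pools, r, all_values) for r in synth_regions]
```

Note that drawing each numerical column separately, conditioned only on region, preserves each column's relationship with region but not the relationships among the numerical columns themselves (for example, between net worth and loan value within a region).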
Structured data generation#
Each subsection below demonstrates how to synthesize a feature while upholding one specific property. Once every property has been demonstrated, the remaining features are synthesized according to the table above.
Reproducibility#
Set seeds for all relevant random number generators to make synthesis repeatable.
import random

from faker import Faker

SEED = 0
rng = np.random.default_rng(SEED)
random.seed(SEED)
fake = Faker()
Faker.seed(SEED)
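A quick way to confirm that seeding works is to run the same draw twice from freshly seeded generators and compare the results. The sketch below assumes only NumPy; `draw_regions` is a hypothetical helper used for illustration.

```python
import numpy as np

REGIONS = ["EMEA", "JAPAC", "LAD", "NA"]


def draw_regions(seed: int, n: int = 5) -> list[str]:
    """Draw n regions uniformly from a generator seeded with `seed`."""
    rng = np.random.default_rng(seed)
    return rng.choice(REGIONS, size=n).tolist()


run_a = draw_regions(seed=0)
run_b = draw_regions(seed=0)
print(run_a == run_b)  # prints True
```

Because a generator is stateful, inserting or reordering draws on a shared generator changes every subsequent value, so keep the order of synthesis steps fixed when exact repeatability matters.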
Generation configuration#
Configure the synthetic data generation process.
n_synthesized = 20 # adjust as needed
Univariate distribution preservation example (region)#
Sample the region column according to the frequency distribution in the seed dataset.
synth_region = fake_categorical(rng, df_seed["region"], size=n_synthesized)
print(pd.Series(synth_region).value_counts())
# EMEA 7
# LAD 6
# NA 4
# JAPAC 3
# Name: count, dtype: int64
Coherency constraint example (region vs. country)#
country and region must be coherent: each country should map to its correct business region.
region_country_map = df_seed.groupby("region")["country"].apply(list).to_dict()
synth_country = []
for region in synth_region:
    possible_countries = region_country_map.get(region, [])
    synth_country.append(rng.choice(possible_countries))
print(list(zip(synth_region[:10], synth_country[:10])))
# [('NA', 'United States'),
# ('EMEA', 'Germany'),
# ('EMEA', 'Germany'),
# ('EMEA', 'Germany'),
# ('LAD', 'Brazil'),
# ('LAD', 'Argentina'),
# ('NA', 'Canada'),
# ('NA', 'Canada'),
# ('JAPAC', 'Singapore'),
# ('LAD', 'Argentina')]
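Coherency is easy to verify after the fact by checking every synthesized pair against the mapping. The following is a small standard-library sketch; `find_incoherent_pairs` is a hypothetical helper, and the mapping is written out by hand from the seed records above.

```python
# Region-to-country mapping as it appears in the seed data.
region_country_map = {
    "EMEA": ["Germany", "Switzerland", "United Arab Emirates"],
    "JAPAC": ["Singapore", "India", "Australia"],
    "LAD": ["Brazil", "Argentina"],
    "NA": ["United States", "Canada"],
}


def find_incoherent_pairs(
    pairs: list[tuple[str, str]], mapping: dict[str, list[str]]
) -> list[tuple[str, str]]:
    """Return every (region, country) pair that the mapping does not allow."""
    return [(region, country) for region, country in pairs if country not in mapping.get(region, [])]


pairs = [("NA", "United States"), ("EMEA", "Germany"), ("LAD", "Canada")]
violations = find_incoherent_pairs(pairs, region_country_map)
print(violations)  # prints [('LAD', 'Canada')]
```

An empty result means every pair is coherent. A region that is missing from the mapping has no candidate countries at all, so it is worth checking for such regions before sampling.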
Joint distribution example (customer_net_worth | region)#
Synthesize customer_net_worth values respecting the empirical joint distribution over region.
demo_rows = pd.DataFrame({"region": synth_region})
synth_customer_net_worth = fake_joint_numerical(
rng, df_seed, "customer_net_worth", ["region"], demo_rows
)
print(pd.Series(synth_customer_net_worth).describe())
# count 2.000000e+01
# mean 1.431880e+07
# std 2.174939e+07
# min 9.800000e+04
# 25% 4.050000e+05
# 50% 3.735000e+06
# 75% 2.600000e+07
# max 7.520000e+07
# dtype: float64
Extrapolation (new values) example (customer_name)#
Use Faker to synthesize new individual names or company names, even if not present in the seed.
synth_customer_type = fake_categorical(rng, df_seed["customer_type"], size=n_synthesized)
synth_customer_name = [
fake.name() if ct == "individual" else fake.company() for ct in synth_customer_type
]
print(synth_customer_name[:10])
# ['Norma Fisher',
# 'Sheppard-Tucker',
# 'Sandra Faulkner',
# 'Silva-Odonnell',
# 'Taylor, Taylor and Davis',
# 'Victoria Patel',
# 'Patrick, Barrera and Collins',
# 'Stephanie Sutton',
# 'Castro-Gomez',
# 'Martin Harris']
Synthesizing remaining structured features according to property table#
With each property demonstrated, now synthesize the remaining structured features as specified above.
from wayflowcore.property import AnyProperty, IntegerProperty
from wayflowcore.steps import ToolExecutionStep
from wayflowcore.tools import ServerTool
def synthesize_structured_features(
    df_seed: pd.DataFrame, n_synthesized: int = 10, seed: int = 0
) -> pd.DataFrame:
    rng = np.random.default_rng(seed=seed)
    fake = Faker()
    fake.seed_instance(seed)

    # Univariate features sampled by seed frequency
    synth_loan_reason = fake_categorical(rng, df_seed["loan_reason"], size=n_synthesized)
    synth_region_full = fake_categorical(rng, df_seed["region"], size=n_synthesized)
    synth_customer_type_full = fake_categorical(rng, df_seed["customer_type"], size=n_synthesized)

    # Country is drawn from the countries seen for the sampled region
    region_country_map = df_seed.groupby("region")["country"].apply(list).to_dict()
    synth_country_full = []
    for region in synth_region_full:
        possible_countries = region_country_map.get(region, [])
        synth_country_full.append(rng.choice(possible_countries))

    # Numerical features sampled conditionally on region
    synth_rows_for_joint = pd.DataFrame({"region": synth_region_full})
    synth_customer_net_worth_full = fake_joint_numerical(
        rng, df_seed, "customer_net_worth", ["region"], synth_rows_for_joint
    )
    synth_loan_value_full = fake_joint_numerical(
        rng, df_seed, "loan_value", ["region"], synth_rows_for_joint
    )
    synth_loan_proposed_interest_full = fake_joint_numerical(
        rng, df_seed, "loan_proposed_interest", ["region"], synth_rows_for_joint
    )

    # New names generated with Faker, depending on the customer type
    synth_customer_name_full = [
        fake.name() if ct == "individual" else fake.company() for ct in synth_customer_type_full
    ]

    df_synthesized = pd.DataFrame(
        {
            "customer_type": synth_customer_type_full,
            "customer_name": synth_customer_name_full,
            "region": synth_region_full,
            "country": synth_country_full,
            "loan_reason": synth_loan_reason,
            "customer_net_worth": synth_customer_net_worth_full,
            "loan_value": synth_loan_value_full,
            "loan_proposed_interest": synth_loan_proposed_interest_full,
        }
    )
    return df_synthesized


def get_synthesize_structured_features_step() -> ToolExecutionStep:
    return ToolExecutionStep(
        tool=ServerTool(
            name="synthesize_structured_features",
            description="Synthesize all structured features",
            func=synthesize_structured_features,
            input_descriptors=[
                AnyProperty(name="df_seed"),
                IntegerProperty(name="n_synthesized", default_value=10),
                IntegerProperty(name="seed", default_value=0),
            ],
            output_descriptors=[AnyProperty(name="df_synthesized")],
        ),
        raise_exceptions=True,
    )
from wayflowcore.flowhelpers import create_single_step_flow
flow = create_single_step_flow(step=get_synthesize_structured_features_step())
conv = flow.start_conversation(inputs={"df_seed": df_seed, "n_synthesized": 150, "seed": 0})
status = conv.execute()
df_synthesized = status.output_values["df_synthesized"]
print(df_synthesized)
Verification: sanity checks and distribution comparison#
Inspect the value counts and distributions of selected features in the synthesized data and compare them with the original seed dataset.
print("Synthesized region distribution:")
print(df_synthesized["region"].value_counts())
# Synthesized region distribution:
# JAPAC 46
# LAD 38
# EMEA 36
# NA 30
# Name: count, dtype: int64
print("\nSeed region distribution:")
print(df_seed["region"].value_counts())
# Seed region distribution:
# EMEA 3
# JAPAC 3
# NA 2
# LAD 2
# Name: count, dtype: int64
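Eyeballing two count tables becomes harder as the number of categories grows, so a single summary number can help. One option, shown here as a sketch rather than a prescribed method, is the total variation distance between the two normalized distributions: 0 means identical and 1 means disjoint. `total_variation_distance` is a hypothetical helper, and the counts are the ones printed above.

```python
def total_variation_distance(counts_a: dict[str, int], counts_b: dict[str, int]) -> float:
    """Half the sum of absolute differences between two normalized distributions."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    categories = set(counts_a) | set(counts_b)
    return 0.5 * sum(
        abs(counts_a.get(c, 0) / total_a - counts_b.get(c, 0) / total_b) for c in categories
    )


seed_counts = {"EMEA": 3, "JAPAC": 3, "NA": 2, "LAD": 2}
synth_counts = {"JAPAC": 46, "LAD": 38, "EMEA": 36, "NA": 30}

tvd = total_variation_distance(seed_counts, synth_counts)
print(round(tvd, 3))  # prints 0.06
```

What counts as an acceptable distance depends on the use case and on the sample size; small synthesized datasets naturally deviate more from the seed frequencies.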
Long text generation#
To generate realistic business justifications for each record, you will use a language model (LLM) as part of your synthesis workflow. This section explains the configuration of the LLM, describes the prompts and parsing logic, outlines the flow definition, and demonstrates parallelized generation and validation.
LLM selection and initialization#
For long text generation, you need to use an LLM. WayFlow supports several LLM providers. Select and configure your LLM below:
# Option 1: OCI GenAI
from wayflowcore.models import OCIGenAIModel

if __name__ == "__main__":
    llm = OCIGenAIModel(
        model_id="provider.model-id",
        service_endpoint="https://url-to-service-endpoint.com",
        compartment_id="compartment-id",
        auth_type="API_KEY",
    )

# Option 2: vLLM
from wayflowcore.models import VllmModel

llm = VllmModel(
    model_id="model-id",
    host_port="VLLM_HOST_PORT",
)

# Option 3: Ollama
from wayflowcore.models import OllamaModel

llm = OllamaModel(
    model_id="model-id",
)
Prompt and I/O constants#
Begin by defining the necessary output constants and prompts used to instruct the LLM and structure the flow:
# Define I/O constants and step names
JUSTIFICATION_GEN_IO = "JUSTIFICATION_GEN_IO"
JUSTIFICATION_PARSING_IO = "JUSTIFICATION_PARSING_IO"
VALIDATION_IO = "VALIDATION_IO"
VALIDATION_PARSING_IO = "VALIDATION_PARSING_IO"
JUSTIFICATION_GEN_STEP = "JUSTIFICATION_GEN_STEP"
JUSTIFICATION_PARSING_STEP = "JUSTIFICATION_PARSING_STEP"
VALIDATION_STEP = "VALIDATION_STEP"
VALIDATION_PARSING_STEP = "VALIDATION_PARSING_STEP"
# Define prompts
JUSTIFICATION_PROMPT = """
## Context
You are an assistant that helps the business with loan applications.
## Task
Write a justification for a loan application, given certain customer data points.
The justification should be concise and detailed, and include all information that could be relevant for the loan application.
As an intermediate step before writing the justification, provide a reasoning section consisting of facts, arguments or any other relevant data that would help you with writing the justification.
## Example
Below is an example of the input data points and how they are structured, and an output justification. Note that in this example the reasoning section has been omitted.
Customer: {{ seed_example['customer_name'] }} (type = {{ seed_example['customer_type'] }})
Location: {{ seed_example['region'] }}, {{ seed_example['country'] }}
Customer Net Worth: {{ seed_example['customer_net_worth'] }}
Loan: {{ seed_example['loan_value'] }} at {{ seed_example['loan_proposed_interest'] }} interest
Purpose: {{ seed_example['loan_reason'] }}
Justification: {{ seed_example['justification'] }}
## Data
Below are the input data points based on which you have to write the justification.
Customer: {{ generated_row['customer_name'] }} (type = {{ generated_row['customer_type'] }})
Location: {{ generated_row['region'] }}, {{ generated_row['country'] }}
Customer Net Worth: {{ generated_row['customer_net_worth'] }}
Loan: {{ generated_row['loan_value'] }} at {{ generated_row['loan_proposed_interest'] }} interest
Purpose: {{ generated_row['loan_reason'] }}
## Instructions
You must follow these guidelines:
- The tone, length and level of detail in your justification must be aligned with the provided example
- The justification must be solely based on the provided data points
- The justification must be written around the `Purpose` data point
- The language should be professional and business-appropriate
- Include any specific financial details if relevant
The output must strictly follow the exact format below:
Reasoning: <fill this section with any facts or arguments relevant to the justification>
Justification: <fill this section with the justification>
## Output
""".strip()
VALIDATION_PROMPT = """
## Context
You are an assistant that helps the bank evaluate loan applications.
## Task
Evaluate the quality and plausibility of a loan justification against provided application data points of the customer.
The evaluation has to consider the following criteria:
1. Factual accuracy: All amounts, names, and details in the justification match the provided customer data points
2. Logical consistency: The reasoning aligns with customer profile and loan parameters
3. Realism: The justification is plausible for this type of loan application
4. Professional tone: The justification is written in business-appropriate language, and follows a coherent and logical structure
5. Completeness: The justification adequately addresses the loan reason and key factors
## Example
Below is an example of what the input data looks like. It starts with the customer data points and ends with the loan justification.
Customer: {{ seed_example['customer_name'] }} (type = {{ seed_example['customer_type'] }})
Location: {{ seed_example['region'] }}, {{ seed_example['country'] }}
Customer Net Worth: {{ seed_example['customer_net_worth'] }}
Loan: {{ seed_example['loan_value'] }} at {{ seed_example['loan_proposed_interest'] }} interest
Purpose: {{ seed_example['loan_reason'] }}
Justification: {{ seed_example['justification'] }}
## Data
Below is the input data for the current loan application you need to analyze.
Customer: {{ generated_row['customer_name'] }} (type = {{ generated_row['customer_type'] }})
Location: {{ generated_row['region'] }}, {{ generated_row['country'] }}
Customer Net Worth: {{ generated_row['customer_net_worth'] }}
Loan: {{ generated_row['loan_value'] }} at {{ generated_row['loan_proposed_interest'] }} interest
Purpose: {{ generated_row['loan_reason'] }}
Justification: {{ justification }}
## Instructions
You must follow these guidelines:
- All evaluation criteria must be assessed
- Each evaluation criterion must be assessed solely based on the currently provided data points
- If a criterion cannot be confidently assessed based on the provided data points, say so
- Keep the evaluation reasoning short and concise
After all criteria have been assessed, you must end your answer with a verdict. The verdict must be in uppercase letters, and the list of possible verdicts and their formats is provided below:
- VALID (if the justification meets all evaluation criteria)
- INVALID: <reasoning> (if the justification fails any criteria, list them and provide short reasoning)
""".strip()
Parsing functions#
Parsing functions are responsible for extracting and normalizing the LLM outputs produced during justification and validation steps:
def parse_cot_justification_output(llm_output: str) -> str:
    # Keep only the text that follows the "Justification:" marker.
    parts = llm_output.split("Justification:")
    if len(parts) < 2:
        return "Justification is not provided"
    return parts[1].strip()


def parse_validation_output(llm_output: str) -> bool:
    # Check "INVALID" first because it contains the substring "VALID".
    if "INVALID" in llm_output:
        return False
    elif "VALID" in llm_output:
        return True
    return False
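Substring checks accept the verdict keyword anywhere in the judge's answer, including inside the reasoning. If that turns out to be too permissive, one possible variant reads the verdict only from the last non-empty line. This is a hedged sketch, not the guide's parser: `parse_verdict_from_last_line` and `VERDICT_PATTERN` are hypothetical names, and the sample outputs are made up for illustration.

```python
import re

# Verdict keyword at the start of a line, optionally preceded by list or markup characters.
VERDICT_PATTERN = re.compile(r"^\W*(INVALID|VALID)\b")


def parse_verdict_from_last_line(llm_output: str) -> bool:
    """Return True only if the last non-empty line starts with the VALID verdict."""
    lines = [line.strip() for line in llm_output.splitlines() if line.strip()]
    if not lines:
        return False
    match = VERDICT_PATTERN.match(lines[-1])
    return bool(match) and match.group(1) == "VALID"


print(parse_verdict_from_last_line("1. Factual accuracy: all amounts match.\nVALID"))  # prints True
print(parse_verdict_from_last_line("2. Realism: implausible.\nINVALID: amounts do not match"))  # prints False
```

The trade-off is strictness: if the model spreads an `INVALID: <reasoning>` verdict over several lines or appends text after the verdict, this variant returns `False`, which errs on the side of triggering a retry.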
Justification generation flow definition#
This flow defines a multi-step pipeline for generation, parsing, validation, and conditional retry of business justifications:
from wayflowcore.controlconnection import ControlFlowEdge
from wayflowcore.flow import Flow
from wayflowcore.models.llmgenerationconfig import LlmGenerationConfig
from wayflowcore.models.llmmodel import LlmModel
from wayflowcore.steps import CompleteStep, PromptExecutionStep, RetryStep
def get_justification_generation_step(llm: LlmModel) -> PromptExecutionStep:
    return PromptExecutionStep(
        prompt_template=JUSTIFICATION_PROMPT,
        llm=llm,
        generation_config=LlmGenerationConfig(max_tokens=2048),
        input_mapping={
            "seed_example": "seed_example",
            "generated_row": "generated_row",
        },
        output_mapping={PromptExecutionStep.OUTPUT: JUSTIFICATION_GEN_IO},
    )


def get_parse_justification_step() -> ToolExecutionStep:
    tool = ServerTool(
        name="parse_cot_justification_output",
        description="Parsing the specific LLM output of the justification generation step.",
        parameters={
            "llm_output": {
                "type": "string",
                "description": "Output from justification generation step.",
            }
        },
        func=parse_cot_justification_output,
        output={"type": "string"},
    )
    return ToolExecutionStep(
        tool=tool,
        input_mapping={"llm_output": JUSTIFICATION_GEN_IO},
        output_mapping={ToolExecutionStep.TOOL_OUTPUT: JUSTIFICATION_PARSING_IO},
    )


def get_validation_step(llm: LlmModel) -> PromptExecutionStep:
    return PromptExecutionStep(
        prompt_template=VALIDATION_PROMPT,
        llm=llm,
        generation_config=LlmGenerationConfig(max_tokens=2048),
        input_mapping={
            "seed_example": "seed_example",
            "generated_row": "generated_row",
            "justification": JUSTIFICATION_PARSING_IO,
        },
        output_mapping={PromptExecutionStep.OUTPUT: VALIDATION_IO},
    )


def get_parse_validation_step() -> ToolExecutionStep:
    tool = ServerTool(
        name="parse_validation_output",
        description="Parsing the specific LLM output of the validation step.",
        parameters={
            "llm_output": {
                "type": "string",
                "description": "Output from validation step.",
            }
        },
        func=parse_validation_output,
        output={"type": "bool"},
    )
    return ToolExecutionStep(
        tool=tool,
        input_mapping={"llm_output": VALIDATION_IO},
        output_mapping={ToolExecutionStep.TOOL_OUTPUT: VALIDATION_PARSING_IO},
    )


def get_retry_step(flow: Flow) -> RetryStep:
    return RetryStep(
        flow=flow,
        success_condition=VALIDATION_PARSING_IO,
        max_num_trials=3,
    )


def get_main_flow(llm_justification: LlmModel, llm_validation: LlmModel) -> Flow:
    justification_gen_step = get_justification_generation_step(llm_justification)
    justification_parse_step = get_parse_justification_step()
    validation_step = get_validation_step(llm_validation)
    validation_parse_step = get_parse_validation_step()
    steps = {
        JUSTIFICATION_GEN_STEP: justification_gen_step,
        JUSTIFICATION_PARSING_STEP: justification_parse_step,
        VALIDATION_STEP: validation_step,
        VALIDATION_PARSING_STEP: validation_parse_step,
    }
    transitions = {
        JUSTIFICATION_GEN_STEP: [JUSTIFICATION_PARSING_STEP],
        JUSTIFICATION_PARSING_STEP: [VALIDATION_STEP],
        VALIDATION_STEP: [VALIDATION_PARSING_STEP],
        VALIDATION_PARSING_STEP: [None],
    }
    return Flow(steps=steps, begin_step=justification_gen_step, transitions=transitions)


def get_flow(llm_justification: LlmModel, llm_validation: LlmModel) -> Flow:
    main_flow = get_main_flow(llm_justification, llm_validation)
    retry_step = get_retry_step(main_flow)
    success_step = CompleteStep()
    failure_step = CompleteStep()
    return Flow(
        begin_step=retry_step,
        steps={
            "start": retry_step,
            "success": success_step,
            "failure": failure_step,
        },
        control_flow_edges=[
            ControlFlowEdge(
                source_step=retry_step,
                source_branch=retry_step.BRANCH_NEXT,
                destination_step=success_step,
            ),
            ControlFlowEdge(
                source_step=retry_step,
                source_branch=retry_step.BRANCH_FAILURE,
                destination_step=failure_step,
            ),
        ],
    )
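The control logic of this flow (generate, validate, and retry up to three times before giving up) can be illustrated in plain Python. The sketch below reflects one reading of the flow above and is not WayFlow's implementation: `run_with_retries`, `stub_generate`, and `stub_validate` are hypothetical stand-ins, with the stubs replacing the two LLM calls.

```python
from typing import Callable, Iterator, Optional


def run_with_retries(
    generate: Callable[[], str],
    validate: Callable[[str], bool],
    max_num_trials: int = 3,
) -> tuple[Optional[str], bool, int]:
    """Call generate() until validate() accepts the result or trials run out."""
    candidate: Optional[str] = None
    trials_used = 0
    for trials_used in range(1, max_num_trials + 1):
        candidate = generate()
        if validate(candidate):
            return candidate, True, trials_used  # corresponds to the success branch
    return candidate, False, trials_used  # corresponds to the failure branch


# Stub "LLM": the first two drafts omit the loan amount, the third includes it.
drafts: Iterator[str] = iter(
    [
        "The customer requests a loan for education costs.",
        "The customer requests a sizeable loan for education costs.",
        "The customer requests a $42,000 loan for education costs.",
    ]
)


def stub_generate() -> str:
    return next(drafts)


def stub_validate(text: str) -> bool:
    # Stand-in for the LLM judge: only checks that the amount is quoted.
    return "$42,000" in text


text, is_valid, trials = run_with_retries(stub_generate, stub_validate)
print(is_valid, trials)  # prints: True 3
```

Note that the last candidate is returned even when validation never succeeds, so a caller that needs only validated text has to inspect the success flag.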
Batch justification generation and validation#
Finally, apply the justification generation flow in parallel over your dataset using a MapStep (for more information see How to Do Map and Reduce Operations in Flows). This enables efficient batch generation and validation:
from wayflowcore.property import ListProperty
from wayflowcore.steps import MapStep
flow = create_single_step_flow(
step=MapStep(
flow=get_flow(llm, llm),
parallel_execution=True,
unpack_input={
"seed_example": ".seed_example",
"generated_row": ".generated_row",
},
output_descriptors=[ListProperty(JUSTIFICATION_PARSING_IO)],
)
)
input_sequence = [
{
"seed_example": df_seed.sample(n=1).iloc[0].to_dict(),
"generated_row": row.to_dict(),
}
for _, row in df_synthesized.iterrows()
]
conversation = flow.start_conversation(inputs={MapStep.ITERATED_INPUT: input_sequence})
status = conversation.execute()
df_synthesized["justification"] = status.output_values[JUSTIFICATION_PARSING_IO]
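The generators seeded in the Reproducibility section do not control `df_seed.sample(n=1)`, which draws from NumPy's global random state when no `random_state` is given, so the few-shot example paired with each row can differ between runs. One way to pin it down is to pick the examples from a locally seeded generator. The following standard-library sketch assumes the seed records are available as a list of dictionaries; `pick_seed_examples` is a hypothetical helper.

```python
import random
from typing import Any


def pick_seed_examples(
    seed_records: list[dict[str, Any]], n_rows: int, seed: int = 0
) -> list[dict[str, Any]]:
    """Pick one seed record per synthesized row, reproducibly."""
    rng = random.Random(seed)  # local generator; does not touch global state
    return [rng.choice(seed_records) for _ in range(n_rows)]


# In the guide this list would come from df_seed.to_dict(orient="records").
seed_records = [
    {"customer_name": "Atlas Engineering Ltd.", "region": "EMEA"},
    {"customer_name": "Samantha Rivera", "region": "NA"},
    {"customer_name": "MedicoPharma Solutions", "region": "JAPAC"},
]

first = pick_seed_examples(seed_records, n_rows=5, seed=0)
second = pick_seed_examples(seed_records, n_rows=5, seed=0)
print(first == second)  # prints True
```

Passing an integer `random_state` to `DataFrame.sample` achieves a similar effect when staying within pandas. The LLM output itself may still vary between runs, depending on the model and its generation settings.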
Synthetic dataset export#
Save the synthesized dataset for downstream use.
file_path = "synthesized_dataset.json"  # example output path; adjust as needed
df_synthesized.to_json(file_path, orient="records", indent=4)
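A quick round trip can confirm that the exported file contains what you expect. The sketch below writes a small stand-in DataFrame to a temporary directory and reads it back with the standard `json` module; `df_demo` and the file name are placeholders, not part of the guide's dataset.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Small stand-in for df_synthesized.
df_demo = pd.DataFrame(
    [
        {"customer_name": "Lucia González", "region": "LAD", "loan_value": 19000},
        {"customer_name": "Jacob Peterson", "region": "NA", "loan_value": 31000},
    ]
)

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = Path(tmp_dir) / "synthesized_demo.json"
    # Same export call as above: one JSON object per row, pretty-printed.
    df_demo.to_json(file_path, orient="records", indent=4)
    with open(file_path, encoding="utf-8") as handle:
        records = json.load(handle)

print(len(records), sorted(records[0]))
```

With `orient="records"`, the file holds a JSON array with one object per row, which most downstream tools can consume directly.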
Recap#
This guide demonstrated property-respecting structured data synthesis, including logic for univariate, joint, interpolated, extrapolated, and coherent features. It also described long text generation and validation with a language model as a judge, including conditional retries. By following these workflows and recommendations, you can create realistic, flexible synthetic datasets to support your WayFlow projects.
Next steps#
Having learned how to synthesize data in WayFlow, you may now proceed to How to Connect Assistants to Your Data to learn how to integrate your synthesized datasets with your assistants for downstream applications and enhanced testing.
Full code#
You can copy the full code for this guide below.
# Copyright © 2025 Oracle and/or its affiliates.
#
# This software is under the Apache License 2.0
# (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0) or Universal Permissive License
# (UPL) 1.0 (LICENSE-UPL or https://oss.oracle.com/licenses/upl), at your option.

# %%[markdown]
# WayFlow Code Example - How to Perform Data Synthesis
# ----------------------------------------------------

# How to use:
# Create a new Python virtual environment and install the latest WayFlow version.
# ```bash
# python -m venv venv-wayflowcore
# source venv-wayflowcore/bin/activate
# pip install --upgrade pip
# pip install "wayflowcore==26.1"
# ```

# You can now run the script
# 1. As a Python file:
# ```bash
# python howto_data_synthesis.py
# ```
# 2. As a Notebook (in VSCode):
# When viewing the file,
#  - press the keys Ctrl + Enter to run the selected cell
#  - or Shift + Enter to run the selected cell and move to the cell below



# %%[markdown]
# data:

# %%
import pandas as pd

seed_data = [
    {
        "customer_type": "organization",
        "customer_name": "Atlas Engineering Ltd.",
        "region": "EMEA",
        "country": "Germany",
        "customer_net_worth": 75200000,
        "loan_value": 12500000,
        "loan_proposed_interest": 4.9,
        "loan_reason": "Business expansion",
        "justification": "Atlas Engineering Ltd. seeks a $12,500,000 loan at a proposed 4.9% interest to expand its operations across the German automotive sector. With a robust net worth of $75,200,000 and a history of successful project delivery, the company is poised for growth. The expansion will create new jobs and increase market share, while our solid financials and collateral provide strong assurance. This aligns with our strategic business roadmap and justifies a favorable interest rate approval.",
    },
    {
        "customer_type": "individual",
        "customer_name": "Samantha Rivera",
        "region": "NA",
        "country": "United States",
        "customer_net_worth": 1580000,
        "loan_value": 80000,
        "loan_proposed_interest": 3.5,
        "loan_reason": "Home renovation",
        "justification": "Samantha Rivera requests a $80,000 loan at 3.5% interest to renovate her primary residence. With a net worth of $1,580,000 and stable income, Samantha’s credit profile is strong. The renovation will enhance her property value, reducing default risk. Given her responsible financial management and the secured nature of the loan, she is a low-risk borrower deserving of these favorable terms.",
    },
    {
        "customer_type": "organization",
        "customer_name": "MedicoPharma Solutions",
        "region": "JAPAC",
        "country": "Singapore",
        "customer_net_worth": 18500000,
        "loan_value": 4500000,
        "loan_proposed_interest": 5.7,
        "loan_reason": "Investment capital",
        "justification": "MedicoPharma Solutions applies for a $4,500,000 loan at a 5.7% rate to fund critical biotech investments. With a net worth of $18,500,000 and consistent year-over-year growth, the company demonstrates fiscal responsibility. The requested capital is earmarked for new laboratory equipment and R&D, supporting innovation and future revenue. The loan structure and interest rate are well-justified by the company's planned implementations and financial record.",
    },
    {
        "customer_type": "individual",
        "customer_name": "Rajeev Malhotra",
        "region": "JAPAC",
        "country": "India",
        "customer_net_worth": 280000,
        "loan_value": 42000,
        "loan_proposed_interest": 4.2,
        "loan_reason": "Education costs",
        "justification": "Rajeev Malhotra is seeking a $42,000 loan at 4.2% interest to pursue a postgraduate program in Bangalore. With a net worth of $280,000 and a stable employment history, Rajeev is committed to investing in his education for career advancement. The loan will cover tuition and related expenses, and his repayment plan is backed by projected post-graduation earnings, warranting an approval at this rate.",
80 },
81 {
82 "customer_type": "organization",
83 "customer_name": "Green Horizons Ltd.",
84 "region": "LAD",
85 "country": "Brazil",
86 "customer_net_worth": 6270000,
87 "loan_value": 1300000,
88 "loan_proposed_interest": 5.2,
89 "loan_reason": "Major appliance",
90 "justification": "Green Horizons Ltd. is requesting a $1,300,000 loan with a 5.2% proposed interest rate to acquire industrial-scale solar panel systems for its new facility. With a net worth of $6,270,000, the organization’s history in sustainable development makes this investment logical. The equipment will reduce operational costs and support environmental commitments, providing a compelling rationale for loan approval.",
91 },
92 {
93 "customer_type": "individual",
94 "customer_name": "Lucia González",
95 "region": "LAD",
96 "country": "Argentina",
97 "customer_net_worth": 98000,
98 "loan_value": 19000,
99 "loan_proposed_interest": 6.0,
100 "loan_reason": "Debt consolidation",
101 "justification": "Lucia González seeks a $19,000 loan at 6.0% interest to consolidate existing debts into a single manageable payment. With a net worth of $98,000 and steady monthly income, Lucia will benefit from simplified repayment terms and reduced overall interest expenses. Her disciplined repayment record supports the requested terms and merits consideration.",
102 },
103 {
104 "customer_type": "organization",
105 "customer_name": "BlueStar Trading GmbH",
106 "region": "EMEA",
107 "country": "Switzerland",
108 "customer_net_worth": 26000000,
109 "loan_value": 6200000,
110 "loan_proposed_interest": 3.3,
111 "loan_reason": "Vehicle purchase",
112 "justification": "BlueStar Trading GmbH is applying for a $6,200,000 loan at 3.3% interest to upgrade its commercial vehicle fleet in Switzerland. Backed by a $26,000,000 net worth and healthy balance sheet, the acquisition will improve logistics and lower operational costs. Their creditworthiness and the use of vehicles as collateral minimize risk, justifying loan approval at this competitive rate.",
113 },
114 {
115 "customer_type": "individual",
116 "customer_name": "Fatima Al-Hassan",
117 "region": "EMEA",
118 "country": "United Arab Emirates",
119 "customer_net_worth": 1200000,
120 "loan_value": 38000,
121 "loan_proposed_interest": 3.9,
122 "loan_reason": "Vacation funding",
123 "justification": "Fatima Al-Hassan requests a $38,000 loan at a 3.9% interest rate to cover a family vacation abroad. With a net worth of $1,200,000 and a solid credit score, she is well-positioned to repay the loan comfortably. Her responsible financial history further supports approval at the proposed rate for this short-term, purpose-specific loan.",
124 },
125 {
126 "customer_type": "organization",
127 "customer_name": "Aussie Urban Properties",
128 "region": "JAPAC",
129 "country": "Australia",
130 "customer_net_worth": 51200000,
131 "loan_value": 9700000,
132 "loan_proposed_interest": 4.5,
133 "loan_reason": "Business expansion",
134 "justification": "Aussie Urban Properties seeks $9,700,000 in funding at a 4.5% interest rate to finance new residential developments in Sydney. With a net worth of $51,200,000 and proven experience in property management, the loan will be efficiently applied. The company’s past successes and robust financial health point to a strong ability to utilize and repay the loan as proposed.",
135 },
136 {
137 "customer_type": "individual",
138 "customer_name": "Jacob Peterson",
139 "region": "NA",
140 "country": "Canada",
141 "customer_net_worth": 405000,
142 "loan_value": 31000,
143 "loan_proposed_interest": 2.7,
144 "loan_reason": "Medical expenses",
145 "justification": "Jacob Peterson is requesting a $31,000 loan at 2.7% interest to cover medical expenses resulting from an unexpected surgery. With a net worth of $405,000 and reliable employment, Jacob demonstrates strong repayment capacity. His consistent repayment track record and insurance partial coverage will mitigate lender risk, supporting approval at this attractive rate.",
146 },
147]
148
149df_seed = pd.DataFrame(seed_data)
150print(df_seed)
151
152
153from typing import Any
154
155
# %%[markdown]
# fake categorical:

# %%
import numpy as np


def fake_categorical(
    rng: np.random.Generator,
    pd_column: pd.Series,
    dropna: bool = False,
    **choice_kwargs: Any,
) -> Any:
    """Sample categories with probabilities equal to their relative frequencies in `pd_column`."""
    value_counts_dict = pd_column.value_counts(normalize=True, dropna=dropna).to_dict()
    categories = np.array(list(value_counts_dict.keys()))
    probabilities = np.array(list(value_counts_dict.values()))
    return rng.choice(categories, p=probabilities, **choice_kwargs)




# %%[markdown]
# fake joint numerical:

# %%
def fake_joint_numerical(
    rng: np.random.Generator,
    df: pd.DataFrame,
    value_col: str,
    cat_cols: str | list[str],
    synthesized_rows: pd.DataFrame,
) -> list[Any]:
    """For each synthesized row, sample `value_col` from the seed rows with matching `cat_cols`."""
    result = []
    cat_array = (
        synthesized_rows[cat_cols].values
        if isinstance(cat_cols, list)
        else synthesized_rows[[cat_cols]].values
    )
    for cats in cat_array:
        # Build a boolean mask selecting seed rows that match every categorical value
        mask = np.ones(len(df), dtype=bool)
        if isinstance(cat_cols, list):
            for col, val in zip(cat_cols, cats):
                mask &= df[col].values == val
        else:
            mask &= df[cat_cols].values == cats[0]
        matching_vals = df.loc[mask, value_col]
        if not matching_vals.empty:
            result.append(rng.choice(matching_vals))
        else:
            # Fallback: overall empirical
            result.append(rng.choice(df[value_col]))
    return result



import random


# %%[markdown]
# seeds:

# %%
from faker import Faker

SEED = 0
rng = np.random.default_rng(SEED)
random.seed(SEED)
fake = Faker()
Faker.seed(SEED)


# %%[markdown]
# generation configuration:

# %%
n_synthesized = 20  # adjust as needed


# %%[markdown]
# univariate distribution:

# %%
synth_region = fake_categorical(rng, df_seed["region"], size=n_synthesized)
print(pd.Series(synth_region).value_counts())
# EMEA 7
# LAD 6
# NA 4
# JAPAC 3
# Name: count, dtype: int64


# %%[markdown]
# coherency constraint:

# %%
region_country_map = df_seed.groupby("region")["country"].apply(list).to_dict()
synth_country = []
for region in synth_region:
    possible_countries = region_country_map.get(region, [])
    synth_country.append(rng.choice(possible_countries))
print(list(zip(synth_region[:10], synth_country[:10])))
# [('NA', 'United States'),
# ('EMEA', 'Germany'),
# ('EMEA', 'Germany'),
# ('EMEA', 'Germany'),
# ('LAD', 'Brazil'),
# ('LAD', 'Argentina'),
# ('NA', 'Canada'),
# ('NA', 'Canada'),
# ('JAPAC', 'Singapore'),
# ('LAD', 'Argentina')]


# %%[markdown]
# joint distribution:

# %%
demo_rows = pd.DataFrame({"region": synth_region})
synth_customer_net_worth = fake_joint_numerical(
    rng, df_seed, "customer_net_worth", ["region"], demo_rows
)
print(pd.Series(synth_customer_net_worth).describe())
# count 2.000000e+01
# mean 1.431880e+07
# std 2.174939e+07
# min 9.800000e+04
# 25% 4.050000e+05
# 50% 3.735000e+06
# 75% 2.600000e+07
# max 7.520000e+07
# dtype: float64


# %%[markdown]
# extrapolation:

# %%
synth_customer_type = fake_categorical(rng, df_seed["customer_type"], size=n_synthesized)
synth_customer_name = [
    fake.name() if ct == "individual" else fake.company() for ct in synth_customer_type
]
print(synth_customer_name[:10])
# ['Norma Fisher',
# 'Sheppard-Tucker',
# 'Sandra Faulkner',
# 'Silva-Odonnell',
# 'Taylor, Taylor and Davis',
# 'Victoria Patel',
# 'Patrick, Barrera and Collins',
# 'Stephanie Sutton',
# 'Castro-Gomez',
# 'Martin Harris']



# %%[markdown]
# full synthesis:

# %%
from wayflowcore.property import AnyProperty, IntegerProperty
from wayflowcore.steps import ToolExecutionStep
from wayflowcore.tools import ServerTool


def synthesize_structured_features(
    df_seed: pd.DataFrame, n_synthesized: int = 10, seed: int = 0
) -> pd.DataFrame:
    rng = np.random.default_rng(seed=seed)
    fake = Faker()
    fake.seed_instance(seed)
    synth_loan_reason = fake_categorical(rng, df_seed["loan_reason"], size=n_synthesized)
    synth_region_full = fake_categorical(rng, df_seed["region"], size=n_synthesized)
    synth_customer_type_full = fake_categorical(rng, df_seed["customer_type"], size=n_synthesized)
    region_country_map = df_seed.groupby("region")["country"].apply(list).to_dict()
    synth_country_full = []
    for region in synth_region_full:
        possible_countries = region_country_map.get(region, [])
        synth_country_full.append(rng.choice(possible_countries))
    synth_rows_for_joint = pd.DataFrame({"region": synth_region_full})
    synth_customer_net_worth_full = fake_joint_numerical(
        rng, df_seed, "customer_net_worth", ["region"], synth_rows_for_joint
    )
    synth_loan_value_full = fake_joint_numerical(
        rng, df_seed, "loan_value", ["region"], synth_rows_for_joint
    )
    synth_loan_proposed_interest_full = fake_joint_numerical(
        rng, df_seed, "loan_proposed_interest", ["region"], synth_rows_for_joint
    )
    synth_customer_name_full = [
        fake.name() if ct == "individual" else fake.company() for ct in synth_customer_type_full
    ]
    df_synthesized = pd.DataFrame(
        {
            "customer_type": synth_customer_type_full,
            "customer_name": synth_customer_name_full,
            "region": synth_region_full,
            "country": synth_country_full,
            "loan_reason": synth_loan_reason,
            "customer_net_worth": synth_customer_net_worth_full,
            "loan_value": synth_loan_value_full,
            "loan_proposed_interest": synth_loan_proposed_interest_full,
        }
    )
    return df_synthesized


def get_synthesize_structured_features_step() -> ToolExecutionStep:
    return ToolExecutionStep(
        tool=ServerTool(
            name="synthesize_structured_features",
            description="Synthesize all structured features",
            func=synthesize_structured_features,
            input_descriptors=[
                AnyProperty(name="df_seed"),
                IntegerProperty(name="n_synthesized", default_value=10),
                IntegerProperty(name="seed", default_value=0),
            ],
            output_descriptors=[AnyProperty(name="df_synthesized")],
        ),
        raise_exceptions=True,
    )




# %%[markdown]
# full synthesis flow:

# %%
from wayflowcore.flowhelpers import create_single_step_flow

flow = create_single_step_flow(step=get_synthesize_structured_features_step())
conv = flow.start_conversation(inputs={"df_seed": df_seed, "n_synthesized": 150, "seed": 0})
status = conv.execute()
df_synthesized = status.output_values["df_synthesized"]
print(df_synthesized)


# %%[markdown]
# verification:

# %%
print("Synthesized region distribution:")
print(df_synthesized["region"].value_counts())
# Synthesized region distribution:
# JAPAC 46
# LAD 38
# EMEA 36
# NA 30
# Name: count, dtype: int64

print("\nSeed region distribution:")
print(df_seed["region"].value_counts())
# Seed region distribution:
# EMEA 3
# JAPAC 3
# NA 2
# LAD 2
# Name: count, dtype: int64


# %%[markdown]
# long text generation:

# %%

# Define I/O constants and step names
JUSTIFICATION_GEN_IO = "JUSTIFICATION_GEN_IO"
JUSTIFICATION_PARSING_IO = "JUSTIFICATION_PARSING_IO"
VALIDATION_IO = "VALIDATION_IO"
VALIDATION_PARSING_IO = "VALIDATION_PARSING_IO"

JUSTIFICATION_GEN_STEP = "JUSTIFICATION_GEN_STEP"
JUSTIFICATION_PARSING_STEP = "JUSTIFICATION_PARSING_STEP"
VALIDATION_STEP = "VALIDATION_STEP"
VALIDATION_PARSING_STEP = "VALIDATION_PARSING_STEP"

# Define prompts
JUSTIFICATION_PROMPT = """
## Context
You are an assistant that helps the business with loan applications.


## Task
Write a justification for a loan application, given certain customer data points.
The justification should be concise and detailed, and include all information that could be relevant for the loan application.
As an intermediate step before writing the justification, provide a reasoning section consisting of facts, arguments or any other relevant data that would help you with writing the justification.


## Example
Below is an example of the input data points and how they are structured, and an output justification. Note that in this example the reasoning section has been omitted.

Customer: {{ seed_example['customer_name'] }} (type = {{ seed_example['customer_type'] }})
Location: {{ seed_example['region'] }}, {{ seed_example['country'] }}
Customer Net Worth: {{ seed_example['customer_net_worth'] }}
Loan: {{ seed_example['loan_value'] }} at {{ seed_example['loan_proposed_interest'] }} interest
Purpose: {{ seed_example['loan_reason'] }}

Justification: {{ seed_example['justification'] }}


## Data
Below are the input data points based on which you have to write the justification.

Customer: {{ generated_row['customer_name'] }} (type = {{ generated_row['customer_type'] }})
Location: {{ generated_row['region'] }}, {{ generated_row['country'] }}
Customer Net Worth: {{ generated_row['customer_net_worth'] }}
Loan: {{ generated_row['loan_value'] }} at {{ generated_row['loan_proposed_interest'] }} interest
Purpose: {{ generated_row['loan_reason'] }}


## Instructions
You must follow these guidelines:
- The tone, length and level of detail in your justification must align with the provided example
- The justification must be solely based on the provided data points
- The justification must be written around the `Purpose` data point
- The language should be professional and business-appropriate
- Include any specific financial details if relevant

The output must strictly follow the exact format below:
Reasoning: <fill this section with any facts or arguments relevant to the justification>
Justification: <fill this section with the justification>


## Output
""".strip()


VALIDATION_PROMPT = """
## Context
You are an assistant that helps the bank evaluate loan applications.


## Task
Evaluate the quality and plausibility of a loan justification against provided application data points of the customer.
The evaluation has to consider the following criteria:
1. Factual accuracy: All amounts, names, and details in the justification match the provided customer data points
2. Logical consistency: The reasoning aligns with customer profile and loan parameters
3. Realism: The justification is plausible for this type of loan application
4. Professional tone: The justification is written in business-appropriate language, and follows a coherent and logical structure
5. Completeness: The justification adequately addresses the loan reason and key factors


## Example
Below is an example of what the input data looks like. It starts with the customer data points and ends with the loan justification.

Customer: {{ seed_example['customer_name'] }} (type = {{ seed_example['customer_type'] }})
Location: {{ seed_example['region'] }}, {{ seed_example['country'] }}
Customer Net Worth: {{ seed_example['customer_net_worth'] }}
Loan: {{ seed_example['loan_value'] }} at {{ seed_example['loan_proposed_interest'] }} interest
Purpose: {{ seed_example['loan_reason'] }}

Justification: {{ seed_example['justification'] }}


## Data
Below is the input data for the current loan application you need to analyze.

Customer: {{ generated_row['customer_name'] }} (type = {{ generated_row['customer_type'] }})
Location: {{ generated_row['region'] }}, {{ generated_row['country'] }}
Customer Net Worth: {{ generated_row['customer_net_worth'] }}
Loan: {{ generated_row['loan_value'] }} at {{ generated_row['loan_proposed_interest'] }} interest
Purpose: {{ generated_row['loan_reason'] }}

Justification: {{ justification }}


## Instructions
You must follow these guidelines:
- All evaluation criteria must be assessed
- Each evaluation criterion must be assessed solely based on the currently provided data points
- If a criterion cannot be confidently assessed based on the provided data points, say so
- Keep the evaluation reasoning short and concise

After all criteria have been assessed, you must end your answer with a verdict. The verdict must be in uppercase letters, and the list of possible verdicts and their formats is provided below:
- VALID (if the justification meets all evaluation criteria)
- INVALID: <reasoning> (if the justification fails any criteria, list them and provide short reasoning)
""".strip()



# %%[markdown]
# parsing functions:

# %%
def parse_cot_justification_output(llm_output: str) -> str:
    # Keep only the text that follows the "Justification:" marker
    parts = llm_output.split("Justification:")
    if len(parts) < 2:
        return "Justification is not provided"
    return parts[1].strip()


def parse_validation_output(llm_output: str) -> bool:
    # "INVALID" contains "VALID" as a substring, so it must be checked first
    if "INVALID" in llm_output:
        return False
    elif "VALID" in llm_output:
        return True
    return False


from wayflowcore.flow import Flow
from wayflowcore.models import LlmModel


# %%[markdown]
# justification generation flow:

# %%
from wayflowcore.controlconnection import ControlFlowEdge
from wayflowcore.models.llmgenerationconfig import LlmGenerationConfig
from wayflowcore.steps import CompleteStep, PromptExecutionStep, RetryStep


def get_justification_generation_step(llm: LlmModel) -> PromptExecutionStep:
    return PromptExecutionStep(
        prompt_template=JUSTIFICATION_PROMPT,
        llm=llm,
        generation_config=LlmGenerationConfig(max_tokens=2048),
        input_mapping={
            "seed_example": "seed_example",
            "generated_row": "generated_row",
        },
        output_mapping={PromptExecutionStep.OUTPUT: JUSTIFICATION_GEN_IO},
    )


def get_parse_justification_step() -> ToolExecutionStep:
    tool = ServerTool(
        name="parse_cot_justification_output",
        description="Parsing the specific LLM output of the justification generation step.",
        parameters={
            "llm_output": {
                "type": "string",
                "description": "Output from justification generation step.",
            }
        },
        func=parse_cot_justification_output,
        output={"type": "string"},
    )
    return ToolExecutionStep(
        tool=tool,
        input_mapping={"llm_output": JUSTIFICATION_GEN_IO},
        output_mapping={ToolExecutionStep.TOOL_OUTPUT: JUSTIFICATION_PARSING_IO},
    )


def get_validation_step(llm: LlmModel) -> PromptExecutionStep:
    return PromptExecutionStep(
        prompt_template=VALIDATION_PROMPT,
        llm=llm,
        generation_config=LlmGenerationConfig(max_tokens=2048),
        input_mapping={
            "seed_example": "seed_example",
            "generated_row": "generated_row",
            "justification": JUSTIFICATION_PARSING_IO,
        },
        output_mapping={PromptExecutionStep.OUTPUT: VALIDATION_IO},
    )


def get_parse_validation_step() -> ToolExecutionStep:
    tool = ServerTool(
        name="parse_validation_output",
        description="Parsing the specific LLM output of the validation step.",
        parameters={
            "llm_output": {
                "type": "string",
                "description": "Output from validation step.",
            }
        },
        func=parse_validation_output,
        output={"type": "bool"},
    )
    return ToolExecutionStep(
        tool=tool,
        input_mapping={"llm_output": VALIDATION_IO},
        output_mapping={ToolExecutionStep.TOOL_OUTPUT: VALIDATION_PARSING_IO},
    )


def get_retry_step(flow: Flow) -> RetryStep:
    return RetryStep(
        flow=flow,
        success_condition=VALIDATION_PARSING_IO,
        max_num_trials=3,
    )


def get_main_flow(llm_justification: LlmModel, llm_validation: LlmModel) -> Flow:
    justification_gen_step = get_justification_generation_step(llm_justification)
    justification_parse_step = get_parse_justification_step()
    validation_step = get_validation_step(llm_validation)
    validation_parse_step = get_parse_validation_step()

    steps = {
        JUSTIFICATION_GEN_STEP: justification_gen_step,
        JUSTIFICATION_PARSING_STEP: justification_parse_step,
        VALIDATION_STEP: validation_step,
        VALIDATION_PARSING_STEP: validation_parse_step,
    }
    transitions = {
        JUSTIFICATION_GEN_STEP: [JUSTIFICATION_PARSING_STEP],
        JUSTIFICATION_PARSING_STEP: [VALIDATION_STEP],
        VALIDATION_STEP: [VALIDATION_PARSING_STEP],
        VALIDATION_PARSING_STEP: [None],
    }

    return Flow(steps=steps, begin_step=justification_gen_step, transitions=transitions)


def get_flow(llm_justification: LlmModel, llm_validation: LlmModel) -> Flow:
    main_flow = get_main_flow(llm_justification, llm_validation)
    retry_step = get_retry_step(main_flow)
    success_step = CompleteStep()
    failure_step = CompleteStep()

    return Flow(
        begin_step=retry_step,
        steps={
            "start": retry_step,
            "success": success_step,
            "failure": failure_step,
        },
        control_flow_edges=[
            ControlFlowEdge(
                source_step=retry_step,
                source_branch=retry_step.BRANCH_NEXT,
                destination_step=success_step,
            ),
            ControlFlowEdge(
                source_step=retry_step,
                source_branch=retry_step.BRANCH_FAILURE,
                destination_step=failure_step,
            ),
        ],
    )


# %%[markdown]
# llm definition:

# %%
from wayflowcore.models import VllmModel

llm = VllmModel(
    model_id="LLAMA_MODEL_ID",
    host_port="LLAMA_API_URL",
)


# %%[markdown]
# justification generation:

# %%
from wayflowcore.property import ListProperty
from wayflowcore.steps import MapStep

flow = create_single_step_flow(
    step=MapStep(
        flow=get_flow(llm, llm),
        parallel_execution=True,
        unpack_input={
            "seed_example": ".seed_example",
            "generated_row": ".generated_row",
        },
        output_descriptors=[ListProperty(JUSTIFICATION_PARSING_IO)],
    )
)

input_sequence = [
    {
        "seed_example": df_seed.sample(n=1).iloc[0].to_dict(),
        "generated_row": row.to_dict(),
    }
    for _, row in df_synthesized.iterrows()
]
conversation = flow.start_conversation(inputs={MapStep.ITERATED_INPUT: input_sequence})
status = conversation.execute()
df_synthesized["justification"] = status.output_values[JUSTIFICATION_PARSING_IO]



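# %%[markdown]
# exporting synthetic dataset:

# %%
# `file_path` is not defined anywhere else in this script; the value below is a
# placeholder (an assumption), so point it at your own output location.
file_path = "synthesized_dataset.json"
df_synthesized.to_json(file_path, orient="records", indent=4)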
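
After exporting, it can be useful to confirm that the file loads back cleanly and that categorical proportions were preserved. The sketch below is illustrative only and is not part of the WayFlow guide: it uses a small hypothetical DataFrame in place of `df_synthesized`, a temporary file with a made-up name, and a hypothetical helper, `max_share_gap`, that reports the largest absolute difference in category shares between two columns.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for df_synthesized (hypothetical values, for illustration only).
df_demo = pd.DataFrame(
    {
        "region": ["EMEA", "NA", "JAPAC", "EMEA"],
        "loan_value": [12500000, 80000, 4500000, 38000],
        "justification": ["text a", "text b", "text c", "text d"],
    }
)


def max_share_gap(a: pd.Series, b: pd.Series) -> float:
    """Largest absolute difference in category shares between two columns."""
    shares_a = a.value_counts(normalize=True)
    shares_b = b.value_counts(normalize=True)
    # Categories missing from one side count as a share of 0.
    return float(shares_a.subtract(shares_b, fill_value=0.0).abs().max())


with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "synthesized_demo.json"  # hypothetical file name
    df_demo.to_json(demo_path, orient="records", indent=4)

    # orient="records" writes a JSON array with one object per row.
    with open(demo_path, encoding="utf-8") as f:
        records = json.load(f)

df_roundtrip = pd.DataFrame(records)

print(len(records), list(df_roundtrip.columns))
print(max_share_gap(df_demo["region"], df_roundtrip["region"]))
```

The same helper could be applied to `df_seed["region"]` and `df_synthesized["region"]` to put a single number on the comparison made in the verification cell; what gap is acceptable depends on the sample size and the use case.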