5 Useful Python Scripts for Synthetic Data Generation


# Introduction

Synthetic data, as the name suggests, is created artificially rather than being collected from real-world sources. It looks like real data but avoids privacy issues and high data collection costs. This lets you easily test software and models while running experiments to simulate performance after release.

While libraries like Faker, SDV, and SynthCity exist, and even large language models (LLMs) are widely used for generating synthetic data, my focus in this article is to avoid relying on these external libraries or AI tools. Instead, you’ll learn how to achieve the same results by writing your own Python scripts. This gives you a better understanding of how to shape a dataset and how biases or errors are introduced. We will start with simple toy scripts to understand the available options. Once you grasp these basics, you can comfortably transition to specialized libraries.

# 1. Generating Simple Random Data

The simplest place to start is with a table. For example, if you need a fake customer dataset for an internal demo, you can run a script to generate comma-separated values (CSV) data:

import csv
import random
from datetime import datetime, timedelta

random.seed(42)  # fixed seed so every run produces the same file

countries = ["Canada", "UK", "UAE", "Germany", "USA"]
plans = ["Free", "Basic", "Pro", "Enterprise"]

def random_signup_date():
    # Pick a uniformly random day between the start and end dates
    start = datetime(2024, 1, 1)
    end = datetime(2026, 1, 1)
    delta_days = (end - start).days
    return (start + timedelta(days=random.randint(0, delta_days))).date().isoformat()

rows = []
for i in range(1, 1001):
    age = random.randint(18, 70)
    country = random.choice(countries)
    plan = random.choice(plans)
    monthly_spend = round(random.uniform(0, 500), 2)

    rows.append({
        "customer_id": f"CUST{i:05d}",
        "age": age,
        "country": country,
        "plan": plan,
        "monthly_spend": monthly_spend,
        "signup_date": random_signup_date()
    })

with open("customers.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)

print("Saved customers.csv")


This script is straightforward: you define fields, choose ranges, and write rows. The random module supports integer generation, floating-point values, random choice, and sampling. The csv module is designed to read and write row-based tabular data. This kind of dataset is suitable for:

Frontend demos
Dashboard testing
API development
Learning Structured Query Language (SQL)
Unit testing enter pipelines

However, this approach has a major weakness: everything is completely random. This often results in data that looks flat or unnatural. Enterprise customers might spend only 2 dollars, while “Free” users might spend 400. Older users behave exactly like younger ones because there is no underlying structure.

In real-world scenarios, data rarely behaves this way. Instead of generating values independently, we can introduce relationships and rules. This makes the dataset feel more realistic while remaining fully synthetic. For instance:

Enterprise customers should almost never have zero spend
Spending ranges should depend on the selected plan
Older users might spend slightly more on average
Certain plans should be more common than others

Let’s add these controls to the script:

import csv
import random

random.seed(42)

plans = ["Free", "Basic", "Pro", "Enterprise"]

def choose_plan():
    # Weighted selection: 45% Free, 30% Basic, 18% Pro, 7% Enterprise
    roll = random.random()
    if roll < 0.45:
        return "Free"
    if roll < 0.75:
        return "Basic"
    if roll < 0.93:
        return "Pro"
    return "Enterprise"

def generate_spend(age, plan):
    # The spend range depends on the plan
    if plan == "Free":
        base = random.uniform(0, 10)
    elif plan == "Basic":
        base = random.uniform(10, 60)
    elif plan == "Pro":
        base = random.uniform(50, 180)
    else:
        base = random.uniform(150, 500)

    # Customers aged 40 and over spend 15% more
    if age >= 40:
        base *= 1.15

    return round(base, 2)

rows = []
for i in range(1, 1001):
    age = random.randint(18, 70)
    plan = choose_plan()
    spend = generate_spend(age, plan)

    rows.append({
        "customer_id": f"CUST{i:05d}",
        "age": age,
        "plan": plan,
        "monthly_spend": spend
    })

with open("controlled_customers.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)

print("Saved controlled_customers.csv")


Now the dataset preserves meaningful patterns. Rather than generating random noise, you are simulating behaviors. Effective controls may include:

Weighted category selection
Realistic minimum and maximum ranges
Conditional logic between columns
Intentionally added rare edge cases
Missing values inserted at low rates
Correlated features instead of independent ones
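
Two of these controls, rare edge cases and low-rate missing values, can be layered on top of rows that have already been generated. The following is a minimal sketch, not part of the original scripts: the function name `add_imperfections`, the `is_outlier` column, and the 3% and 1% rates are all assumptions chosen for illustration.

```python
import random

def add_imperfections(rows, missing_rate=0.03, outlier_rate=0.01, seed=42):
    """Return copies of rows with a few missing ages and inflated spends."""
    rng = random.Random(seed)  # local generator, leaves the global seed alone
    result = []
    for row in rows:
        row = dict(row)  # copy so the input rows are not modified
        # Rare edge case: an unusually large spend, flagged for later checks
        row["is_outlier"] = rng.random() < outlier_rate
        if row["is_outlier"]:
            row["monthly_spend"] = round(row["monthly_spend"] * 10, 2)
        # Missing value: csv.DictWriter writes None as an empty field
        if rng.random() < missing_rate:
            row["age"] = None
        result.append(row)
    return result

# Hypothetical input shaped like the controlled customer rows
sample = [
    {"customer_id": f"CUST{i:05d}", "age": 30, "plan": "Pro", "monthly_spend": 100.0}
    for i in range(1, 1001)
]
noisy = add_imperfections(sample)
print(sum(r["age"] is None for r in noisy), "rows with a missing age")
```

Keeping the flag column makes it easy to confirm later that a pipeline under test actually noticed the injected edge cases.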

# 2. Simulating Processes for Synthetic Data

Simulation-based generation is one of the best ways to create realistic synthetic datasets. Instead of directly filling columns, you simulate a process. For example, consider a small warehouse where orders arrive, stock decreases, and low stock levels trigger backorders and restocking.

import csv
import random
from datetime import datetime, timedelta

random.seed(42)

# Starting stock per product
inventory = {
    "A": 120,
    "B": 80,
    "C": 50
}

rows = []
current_time = datetime(2026, 1, 1)

for day in range(30):
    for product in inventory:
        daily_orders = random.randint(0, 12)

        for _ in range(daily_orders):
            qty = random.randint(1, 5)
            before = inventory[product]

            # Fulfil the order only if enough stock is available
            if inventory[product] >= qty:
                inventory[product] -= qty
                status = "fulfilled"
            else:
                status = "backorder"

            rows.append({
                "time": current_time.isoformat(),
                "product": product,
                "qty": qty,
                "stock_before": before,
                "stock_after": inventory[product],
                "status": status
            })

        # Restock at the end of the day when the level drops below 20
        if inventory[product] < 20:
            restock = random.randint(30, 80)
            inventory[product] += restock
            rows.append({
                "time": current_time.isoformat(),
                "product": product,
                "qty": restock,
                "stock_before": inventory[product] - restock,
                "stock_after": inventory[product],
                "status": "restock"
            })

    current_time += timedelta(days=1)

with open("warehouse_sim.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)

print("Saved warehouse_sim.csv")


This method is effective because the data is a byproduct of system behavior, which typically yields more realistic relationships than direct random row generation. Other simulation ideas include:

Call center queues
Ride requests and driver matching
Loan applications and approvals
Subscriptions and churn
Patient appointment flows
Website traffic and conversion
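
As one illustration of these ideas, a call center queue can be sketched with a single agent who handles calls in arrival order. This is a simplified model under stated assumptions, not a method from the article: the function name `simulate_call_queue`, the exponential gaps between calls, the exponential service times, and the default means are all hypothetical choices.

```python
import random

def simulate_call_queue(n_calls=200, mean_gap_min=4.0, mean_service_min=3.0, seed=42):
    """Simulate one agent answering calls first come, first served."""
    rng = random.Random(seed)
    rows = []
    arrival = 0.0        # minutes since the start of the shift
    agent_free_at = 0.0  # time at which the agent finishes the current call
    for call_id in range(1, n_calls + 1):
        # expovariate takes a rate, so the rate is 1 / mean
        arrival += rng.expovariate(1 / mean_gap_min)
        start = max(arrival, agent_free_at)  # wait if the agent is still busy
        service = rng.expovariate(1 / mean_service_min)
        agent_free_at = start + service
        rows.append({
            "call_id": call_id,
            "arrival_min": round(arrival, 2),
            "wait_min": round(start - arrival, 2),
            "service_min": round(service, 2),
        })
    return rows

calls = simulate_call_queue()
print("Longest wait (minutes):", max(c["wait_min"] for c in calls))
```

Waiting times are never sampled directly here. They emerge from the interaction of arrivals and service, which is the same property that makes the warehouse data realistic.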

# 3. Generating Time Series Synthetic Data

Synthetic data is not limited to static tables. Many systems produce sequences over time, such as app traffic, sensor readings, orders per hour, or server response times. Here is a simple time series generator for hourly website visits with weekday patterns.

import csv
import random
from datetime import datetime, timedelta

random.seed(42)

start = datetime(2026, 1, 1, 0, 0, 0)
hours = 24 * 30  # 30 days of hourly data
rows = []

for i in range(hours):
    ts = start + timedelta(hours=i)
    weekday = ts.weekday()

    # Lower baseline on weekends (Saturday = 5, Sunday = 6)
    base = 120
    if weekday >= 5:
        base = 80

    # Hour-of-day effects: morning peak, evening peak, overnight dip
    hour = ts.hour
    if 8 <= hour <= 11:
        base += 60
    elif 18 <= hour <= 21:
        base += 40
    elif 0 <= hour <= 5:
        base -= 30

    # Gaussian noise around the expected level, clipped at zero
    visits = max(0, int(random.gauss(base, 15)))

    rows.append({
        "timestamp": ts.isoformat(),
        "visits": visits
    })

with open("traffic_timeseries.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["timestamp", "visits"])
    writer.writeheader()
    writer.writerows(rows)

print("Saved traffic_timeseries.csv")

This approach works well because it combines a baseline level, random noise, and cyclic behavior (hour-of-day and weekday effects) while remaining easy to explain and debug.
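
To make a long-term trend explicit, one option is to add a slowly growing component to the expected value before sampling the noise. The sketch below is illustrative only: the function name `visits_with_trend`, the 1% daily growth, and the sine-shaped daily cycle are assumptions rather than part of the script above.

```python
import math
import random
from datetime import datetime, timedelta

def visits_with_trend(hours=24 * 30, daily_growth=0.01, seed=42):
    """Hourly visits = linear trend + daily sine cycle + Gaussian noise."""
    rng = random.Random(seed)
    start = datetime(2026, 1, 1)
    rows = []
    for i in range(hours):
        ts = start + timedelta(hours=i)
        # Trend: the baseline of 100 grows by daily_growth per elapsed day
        trend = 100 * (1 + daily_growth * (i / 24))
        # Cycle: peaks at 12:00 and bottoms out at 00:00
        cycle = 30 * math.sin(2 * math.pi * (ts.hour - 6) / 24)
        visits = max(0, int(rng.gauss(trend + cycle, 10)))
        rows.append({"timestamp": ts.isoformat(), "visits": visits})
    return rows

series = visits_with_trend()
first_week = sum(r["visits"] for r in series[:168]) / 168
last_week = sum(r["visits"] for r in series[-168:]) / 168
print(f"First week mean: {first_week:.1f}, last week mean: {last_week:.1f}")
```

Keeping trend, cycle, and noise as separate terms makes each one easy to switch off when debugging a forecasting or anomaly detection pipeline.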

# 4. Creating Event Logs

Event logs are another useful script style, ideal for product analytics and workflow testing. Instead of one row per customer, you create one row per action.

import csv
import random
from datetime import datetime, timedelta

random.seed(42)

events = ["signup", "login", "view_page", "add_to_cart", "purchase", "logout"]

rows = []
start = datetime(2026, 1, 1)

for user_id in range(1, 201):
    event_count = random.randint(5, 30)
    # Each user starts on a random day within the first 10 days
    current_time = start + timedelta(days=random.randint(0, 10))

    for _ in range(event_count):
        event = random.choice(events)

        # Only purchases carry a value, and only 60% of the time
        if event == "purchase" and random.random() < 0.6:
            value = round(random.uniform(10, 300), 2)
        else:
            value = 0.0

        rows.append({
            "user_id": f"USER{user_id:04d}",
            "event_time": current_time.isoformat(),
            "event_name": event,
            "event_value": value
        })

        # Advance the clock by 1 to 180 minutes before the next event
        current_time += timedelta(minutes=random.randint(1, 180))

with open("event_log.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)

print("Saved event_log.csv")

This format is useful for:

Funnel analysis
Analytics pipeline testing
Business intelligence (BI) dashboards
Session reconstruction
Anomaly detection experiments

A useful technique here is to make events dependent on earlier actions. For example, a purchase should typically follow a login or a page view, making the synthetic log more believable.
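
One way to express this dependency is a small transition table, where the next event is drawn from options that depend on the current event. The sketch below is an assumption-laden illustration: the `TRANSITIONS` table, its probabilities, and the `generate_session` helper are hypothetical and not taken from the script above.

```python
import random

# For each current event: the possible next events and their weights
TRANSITIONS = {
    "login": [("view_page", 0.8), ("logout", 0.2)],
    "view_page": [("view_page", 0.5), ("add_to_cart", 0.25), ("logout", 0.25)],
    "add_to_cart": [("purchase", 0.4), ("view_page", 0.4), ("logout", 0.2)],
    "purchase": [("view_page", 0.3), ("logout", 0.7)],
}

def generate_session(rng, max_events=30):
    """Walk the transition table from login until logout or the event cap."""
    events = ["login"]
    while events[-1] != "logout" and len(events) < max_events:
        options = TRANSITIONS[events[-1]]
        names = [name for name, _ in options]
        weights = [weight for _, weight in options]
        events.append(rng.choices(names, weights=weights, k=1)[0])
    return events

rng = random.Random(42)
sessions = [generate_session(rng) for _ in range(200)]
print(sessions[0])
```

With this table, a purchase can only occur directly after an add_to_cart event, which in turn requires a page view and a login earlier in the session, so funnel analyses on the log produce sensible step orderings.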

# 5. Generating Synthetic Text Data with Templates

Synthetic data is also useful for natural language processing (NLP). You don’t always need an LLM to start; you can build effective text datasets using templates and controlled variation. For example, you can create support ticket training data:

import json
import random

random.seed(42)

# (label, message) pairs
issues = [
    ("billing", "I was charged twice for my subscription"),
    ("login", "I cannot log into my account"),
    ("shipping", "My order has not arrived yet"),
    ("refund", "I want to request a refund"),
]

tones = ["Please help", "This is urgent", "Can you check this", "I need support"]

records = []

for _ in range(100):
    label, message = random.choice(issues)
    tone = random.choice(tones)

    text = f"{tone}. {message}."
    records.append({
        "text": text,
        "label": label
    })

# Write one JSON object per line (JSON Lines)
with open("support_tickets.jsonl", "w", encoding="utf-8") as f:
    for item in records:
        f.write(json.dumps(item) + "\n")

print("Saved support_tickets.jsonl")

This approach works well for:

Text classification demos
Intent detection
Chatbot testing
Prompt evaluation
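
Controlled variation can go a step further by giving each label several templates with fillable slots and by adding light surface noise. The following is a minimal sketch under stated assumptions: the `TEMPLATES` and `PRODUCTS` values, the `make_ticket` helper, and the 30% and 50% noise rates are invented for illustration.

```python
import random

# Several phrasings per label, each with a {product} slot
TEMPLATES = {
    "billing": [
        "I was charged twice for my {product}",
        "The invoice for my {product} looks wrong",
    ],
    "shipping": [
        "My {product} has not arrived yet",
        "The tracking page for my {product} has not updated",
    ],
}
PRODUCTS = ["subscription", "order", "gift card"]

def make_ticket(rng):
    """Build one labeled ticket from a random template and slot value."""
    label = rng.choice(sorted(TEMPLATES))
    text = rng.choice(TEMPLATES[label]).format(product=rng.choice(PRODUCTS))
    if rng.random() < 0.3:
        text = text.lower()  # some users type everything in lowercase
    if rng.random() < 0.5:
        text += "."          # punctuation is inconsistent in real tickets
    return {"text": text, "label": label}

rng = random.Random(42)
tickets = [make_ticket(rng) for _ in range(100)]
print(tickets[0])
```

Because the label is chosen before the text is built, every record stays correctly labeled no matter how much surface variation is added.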

# Final Thoughts

Synthetic data scripts are powerful tools, but they can be implemented incorrectly. Be sure to avoid these common mistakes:

Making all values uniformly random
Forgetting dependencies between fields
Generating values that violate business logic
Assuming synthetic data is inherently safe by default
Creating data that is too “clean” to be useful for testing real-world edge cases
Using the same pattern so frequently that the dataset becomes predictable and unrealistic
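
A small validation pass can catch some of these mistakes, particularly business-logic violations, before a dataset is shared. The sketch below assumes the column names and plan spend ranges from the controlled customer script in section 1, including its 1.15 multiplier for customers aged 40 and over. The names `PLAN_LIMITS` and `find_violations` are hypothetical.

```python
# Allowed monthly spend per plan; upper bounds include the 1.15 age uplift
PLAN_LIMITS = {
    "Free": (0, 11.5),
    "Basic": (10, 69),
    "Pro": (50, 207),
    "Enterprise": (150, 575),
}

def find_violations(rows):
    """Return (customer_id, reason) pairs for rows that break a rule."""
    problems = []
    for row in rows:
        cid = row["customer_id"]
        if not 18 <= row["age"] <= 70:
            problems.append((cid, "age out of range"))
        limits = PLAN_LIMITS.get(row["plan"])
        if limits is None:
            problems.append((cid, "unknown plan"))
        elif not limits[0] <= row["monthly_spend"] <= limits[1]:
            problems.append((cid, "spend outside plan range"))
    return problems

# Hypothetical rows: one valid, one with a bad spend, one with a bad plan
sample_rows = [
    {"customer_id": "CUST00001", "age": 35, "plan": "Pro", "monthly_spend": 120.0},
    {"customer_id": "CUST00002", "age": 52, "plan": "Free", "monthly_spend": 400.0},
    {"customer_id": "CUST00003", "age": 29, "plan": "Gold", "monthly_spend": 20.0},
]
violations = find_violations(sample_rows)
print(violations)
```

Rows that were deliberately generated as edge cases would need to be excluded or flagged before running a check like this, otherwise they will be reported as violations.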

Privacy remains the most critical consideration. While synthetic data reduces exposure to real records, it is not risk-free. If a generator is too closely tied to original sensitive data, leakage can still occur. This is why privacy-preserving methods, such as differentially private synthetic data, are essential.

Kanwal Mehreen is a machine learning engineer and a technical writer with a profound passion for data science and the intersection of AI with medicine. She co-authored the ebook “Maximizing Productivity with ChatGPT”. As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She’s also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.
