

5 Helpful Python Scripts for Synthetic Data Generation
Image by Editor

 

Introduction

 
Synthetic data, as the name suggests, is created artificially rather than being collected from real-world sources. It looks like real data but avoids privacy issues and high data collection costs. This lets you easily test software and models while running experiments to simulate performance after launch.

While libraries like Faker, SDV, and SynthCity exist, and even large language models (LLMs) are widely used for generating synthetic data, my focus in this article is to avoid relying on these external libraries or AI tools. Instead, you'll learn how to achieve the same results by writing your own Python scripts. This gives you a better understanding of how to shape a dataset and how biases or errors are introduced. We will start with simple toy scripts to understand the available options. Once you master these fundamentals, you can comfortably transition to specialized libraries.

 

1. Generating Simple Random Data

 
The easiest place to start is with a table. For example, if you need a fake customer dataset for an internal demo, you can run a script to generate comma-separated values (CSV) data:

import csv
import random
from datetime import datetime, timedelta

random.seed(42)

countries = ["Canada", "UK", "UAE", "Germany", "USA"]
plans = ["Free", "Basic", "Pro", "Enterprise"]

def random_signup_date():
    start = datetime(2024, 1, 1)
    end = datetime(2026, 1, 1)
    delta_days = (end - start).days
    return (start + timedelta(days=random.randint(0, delta_days))).date().isoformat()

rows = []
for i in range(1, 1001):
    age = random.randint(18, 70)
    country = random.choice(countries)
    plan = random.choice(plans)
    monthly_spend = round(random.uniform(0, 500), 2)

    rows.append({
        "customer_id": f"CUST{i:05d}",
        "age": age,
        "country": country,
        "plan": plan,
        "monthly_spend": monthly_spend,
        "signup_date": random_signup_date()
    })

with open("customers.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)

print("Saved customers.csv")

 

Output:
 
Simple Random Data Generation
 
This script is simple: you define fields, choose ranges, and write rows. The random module supports integer generation, floating-point values, random choice, and sampling. The csv module is designed to read and write row-based tabular data. This kind of dataset is suitable for:

  • Frontend demos
  • Dashboard testing
  • API development
  • Learning Structured Query Language (SQL)
  • Unit testing input pipelines

However, there is a major weakness to this approach: everything is completely random. This often results in data that looks flat or unnatural. Enterprise customers might spend only 2 dollars, while "Free" users might spend 400. Older users behave exactly like younger ones because there is no underlying structure.

In real-world scenarios, data rarely behaves this way. Instead of generating values independently, we can introduce relationships and rules. This makes the dataset feel more realistic while remaining fully synthetic. For instance:

  • Enterprise customers should almost never have zero spend
  • Spending ranges should depend on the chosen plan
  • Older users might spend slightly more on average
  • Certain plans should be more common than others

Let's add these controls to the script:

import csv
import random

random.seed(42)

plans = ["Free", "Basic", "Pro", "Enterprise"]

def choose_plan():
    roll = random.random()
    if roll < 0.45:
        return "Free"
    if roll < 0.75:
        return "Basic"
    if roll < 0.93:
        return "Pro"
    return "Enterprise"

def generate_spend(age, plan):
    if plan == "Free":
        base = random.uniform(0, 10)
    elif plan == "Basic":
        base = random.uniform(10, 60)
    elif plan == "Pro":
        base = random.uniform(50, 180)
    else:
        base = random.uniform(150, 500)

    if age >= 40:
        base *= 1.15

    return round(base, 2)

rows = []
for i in range(1, 1001):
    age = random.randint(18, 70)
    plan = choose_plan()
    spend = generate_spend(age, plan)

    rows.append({
        "customer_id": f"CUST{i:05d}",
        "age": age,
        "plan": plan,
        "monthly_spend": spend
    })

with open("controlled_customers.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)

print("Saved controlled_customers.csv")

 

Output:
 
Simple Random Data Generation-2
 

Now the dataset preserves meaningful patterns. Rather than producing random noise, you are simulating behaviors. Effective controls may include:

  • Weighted category selection
  • Realistic minimum and maximum ranges
  • Conditional logic between columns
  • Intentionally added rare edge cases
  • Missing values inserted at low rates
  • Correlated features instead of independent ones
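As a quick sketch of the first two ideas, the standard library's random.choices supports weighted category selection directly, and a simple probability check can inject missing values at a low rate (the weights and the 2% missing rate here are purely illustrative):

```python
import random

random.seed(0)

plans = ["Free", "Basic", "Pro", "Enterprise"]
weights = [0.45, 0.30, 0.18, 0.07]  # weighted category selection

rows = []
for _ in range(1000):
    plan = random.choices(plans, weights=weights, k=1)[0]
    # insert missing values at a low rate (~2% of ages unknown)
    age = random.randint(18, 70) if random.random() > 0.02 else None
    rows.append({"plan": plan, "age": age})

free_share = sum(r["plan"] == "Free" for r in rows) / len(rows)
missing_ages = sum(r["age"] is None for r in rows)
print(free_share, missing_ages)
```

Because the weights are explicit, you can tune the category mix without touching the rest of the generator.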

 

2. Simulating Processes for Synthetic Data

 
Simulation-based generation is one of the best ways to create realistic synthetic datasets. Instead of directly filling columns, you simulate a process. For example, imagine a small warehouse where orders arrive, stock decreases, and low stock levels trigger backorders.

import csv
import random
from datetime import datetime, timedelta

random.seed(42)

stock = {
    "A": 120,
    "B": 80,
    "C": 50
}

rows = []
current_time = datetime(2026, 1, 1)

for day in range(30):
    for product in stock:
        daily_orders = random.randint(0, 12)

        for _ in range(daily_orders):
            qty = random.randint(1, 5)
            before = stock[product]

            if stock[product] >= qty:
                stock[product] -= qty
                status = "fulfilled"
            else:
                status = "backorder"

            rows.append({
                "time": current_time.isoformat(),
                "product": product,
                "qty": qty,
                "stock_before": before,
                "stock_after": stock[product],
                "status": status
            })

        if stock[product] < 20:
            restock = random.randint(30, 80)
            stock[product] += restock
            rows.append({
                "time": current_time.isoformat(),
                "product": product,
                "qty": restock,
                "stock_before": stock[product] - restock,
                "stock_after": stock[product],
                "status": "restock"
            })

    current_time += timedelta(days=1)

with open("warehouse_sim.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)

print("Saved warehouse_sim.csv")

 

Output:
 
Simulation Based Synthetic Data
 
This method is excellent because the data is a byproduct of system behavior, which typically yields more realistic relationships than direct random row generation. Other simulation ideas include:

  • Call center queues
  • Ride requests and driver matching
  • Loan applications and approvals
  • Subscriptions and churn
  • Patient appointment flows
  • Website traffic and conversion
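To show how little code one of these ideas takes, here is a minimal sketch of the subscriptions-and-churn case. The monthly churn probability and signup counts below are assumptions chosen only for illustration:

```python
import random

random.seed(42)

# Simulate 12 months of a subscriber base: new signups arrive each month,
# and every existing subscriber churns with a small monthly probability.
active = 100
history = []
for month in range(1, 13):
    signups = random.randint(10, 30)
    churned = sum(1 for _ in range(active) if random.random() < 0.05)
    active = active + signups - churned
    history.append({"month": month, "signups": signups,
                    "churned": churned, "active": active})

print(history[-1])
```

Each monthly record is, again, a byproduct of the simulated process rather than a directly generated row.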

 

3. Generating Time Series Synthetic Data

 
Synthetic data is not limited to static tables. Many systems produce sequences over time, such as app traffic, sensor readings, orders per hour, or server response times. Here is a simple time series generator for hourly website visits with weekday patterns.

import csv
import random
from datetime import datetime, timedelta

random.seed(42)

start = datetime(2026, 1, 1, 0, 0, 0)
hours = 24 * 30
rows = []

for i in range(hours):
    ts = start + timedelta(hours=i)
    weekday = ts.weekday()

    base = 120
    if weekday >= 5:
        base = 80

    hour = ts.hour
    if 8 <= hour <= 11:
        base += 60
    elif 18 <= hour <= 21:
        base += 40
    elif 0 <= hour <= 5:
        base -= 30

    visits = max(0, int(random.gauss(base, 15)))

    rows.append({
        "timestamp": ts.isoformat(),
        "visits": visits
    })

with open("traffic_timeseries.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["timestamp", "visits"])
    writer.writeheader()
    writer.writerows(rows)

print("Saved traffic_timeseries.csv")

 

Output:
 
Time Series Synthetic Data
 
This approach works well because it incorporates trends, noise, and cyclic behavior while remaining easy to explain and debug.
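If you also want a gradual growth trend on top of the cyclic patterns, one option is to add a small term that grows with the hour index. This condensed sketch keeps only the weekday baseline from the script above; the 0.05 growth rate per hour is an assumption:

```python
import random
from datetime import datetime, timedelta

random.seed(42)

start = datetime(2026, 1, 1)
rows = []
for i in range(24 * 30):
    ts = start + timedelta(hours=i)
    base = 120 if ts.weekday() < 5 else 80  # weekday vs. weekend baseline
    base += i * 0.05                        # slow upward trend over the month
    visits = max(0, int(random.gauss(base, 15)))
    rows.append({"timestamp": ts.isoformat(), "visits": visits})

# With the trend in place, the last week averages more visits than the first.
first_week = sum(r["visits"] for r in rows[:168]) / 168
last_week = sum(r["visits"] for r in rows[-168:]) / 168
print(round(first_week, 1), round(last_week, 1))
```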

 

4. Creating Event Logs

 
Event logs are another useful script style, ideal for product analytics and workflow testing. Instead of one row per customer, you create one row per action.

import csv
import random
from datetime import datetime, timedelta

random.seed(42)

events = ["signup", "login", "view_page", "add_to_cart", "purchase", "logout"]

rows = []
start = datetime(2026, 1, 1)

for user_id in range(1, 201):
    event_count = random.randint(5, 30)
    current_time = start + timedelta(days=random.randint(0, 10))

    for _ in range(event_count):
        event = random.choice(events)

        if event == "purchase" and random.random() < 0.6:
            value = round(random.uniform(10, 300), 2)
        else:
            value = 0.0

        rows.append({
            "user_id": f"USER{user_id:04d}",
            "event_time": current_time.isoformat(),
            "event_name": event,
            "event_value": value
        })

        current_time += timedelta(minutes=random.randint(1, 180))

with open("event_log.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)

print("Saved event_log.csv")

 

Output:
 
Event Log Generation
 
This format is useful for:

  • Funnel analysis
  • Analytics pipeline testing
  • Business intelligence (BI) dashboards
  • Session reconstruction
  • Anomaly detection experiments

A useful technique here is to make events dependent on previous actions. For example, a purchase should usually follow a login or a page view, making the synthetic log more believable.
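One way to sketch that dependency is a tiny Markov-style chain over event names, where each next event is drawn conditioned on the previous one. The transition probabilities below are made up for illustration:

```python
import random

random.seed(42)

# Next-event probabilities conditioned on the previous event;
# "purchase" is only reachable after "add_to_cart".
transitions = {
    "login":       [("view_page", 0.8), ("logout", 0.2)],
    "view_page":   [("view_page", 0.5), ("add_to_cart", 0.3), ("logout", 0.2)],
    "add_to_cart": [("purchase", 0.6), ("view_page", 0.3), ("logout", 0.1)],
    "purchase":    [("view_page", 0.4), ("logout", 0.6)],
    "logout":      [],  # terminal state ends the session
}

def simulate_session():
    events = ["login"]
    while transitions[events[-1]]:
        names, probs = zip(*transitions[events[-1]])
        events.append(random.choices(names, weights=probs, k=1)[0])
    return events

session = simulate_session()
print(session)
```

Every generated session now starts with a login, ends with a logout, and never shows a purchase out of nowhere.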

 

5. Generating Synthetic Text Data with Templates

 
Synthetic data is also useful for natural language processing (NLP). You don't always need an LLM to get started; you can build effective text datasets using templates and controlled variation. For example, you can create support ticket training data:

import json
import random

random.seed(42)

issues = [
    ("billing", "I was charged twice for my subscription"),
    ("login", "I cannot log into my account"),
    ("shipping", "My order has not arrived yet"),
    ("refund", "I want to request a refund"),
]

tones = ["Please help", "This is urgent", "Can you check this", "I need support"]

data = []

for _ in range(100):
    label, message = random.choice(issues)
    tone = random.choice(tones)

    text = f"{tone}. {message}."
    data.append({
        "text": text,
        "label": label
    })

with open("support_tickets.jsonl", "w", encoding="utf-8") as f:
    for item in data:
        f.write(json.dumps(item) + "\n")

print("Saved support_tickets.jsonl")

 

Output:
 
Synthetic Text Data Using Templates
 
This approach works well for:

  • Text classification demos
  • Intent detection
  • Chatbot testing
  • Prompt evaluation
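To push the variation a little further, you can lightly perturb the templated text, for example by randomly lowercasing it or dropping the final period so the dataset is less uniform. The noise rates in this sketch are arbitrary:

```python
import random

random.seed(42)

def add_noise(text):
    # occasionally lowercase the whole message
    if random.random() < 0.3:
        text = text.lower()
    # occasionally drop the trailing period
    if random.random() < 0.2 and text.endswith("."):
        text = text[:-1]
    return text

samples = [add_noise("Please help. I was charged twice for my subscription.")
           for _ in range(10)]
print(samples[0])
```

Even this small amount of noise helps a classifier avoid latching onto template artifacts like capitalization or punctuation.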

 

Final Thoughts

 
Synthetic data scripts are powerful tools, but they can be implemented incorrectly. Be sure to avoid these common mistakes:

  • Making all values uniformly random
  • Forgetting dependencies between fields
  • Generating values that violate business logic
  • Assuming synthetic data is inherently safe by default
  • Creating data that is too "clean" to be useful for testing real-world edge cases
  • Using the same pattern so frequently that the dataset becomes predictable and unrealistic

Privacy remains the most critical consideration. While synthetic data reduces exposure to real records, it is not risk-free. If a generator is too closely tied to the original sensitive data, leakage can still occur. This is why privacy-preserving techniques, such as differentially private synthetic data, are essential.
 
 

Kanwal Mehreen is a machine learning engineer and a technical writer with a profound passion for data science and the intersection of AI with medicine. She co-authored the ebook "Maximizing Productivity with ChatGPT". As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She's also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.
