

5 Helpful Python Scripts for Synthetic Data Generation
Image by Editor

 

Introduction

 
Synthetic data, as the name suggests, is created artificially rather than being collected from real-world sources. It looks like real data but avoids privacy issues and high data collection costs. This lets you easily test software and models while running experiments to simulate performance after launch.

While libraries like Faker, SDV, and SynthCity exist, and even large language models (LLMs) are widely used for generating synthetic data, my focus in this article is to avoid relying on these external libraries or AI tools. Instead, you'll learn how to achieve the same results by writing your own Python scripts. This gives you a better understanding of how to shape a dataset and how biases or errors are introduced. We will start with simple toy scripts to understand the available options. Once you master these fundamentals, you can comfortably transition to specialized libraries.

 

1. Generating Simple Random Data

 
The easiest place to start is with a table. For example, if you need a fake customer dataset for an internal demo, you can run a script to generate comma-separated values (CSV) data:

import csv
import random
from datetime import datetime, timedelta

random.seed(42)

countries = ["Canada", "UK", "UAE", "Germany", "USA"]
plans = ["Free", "Basic", "Pro", "Enterprise"]

def random_signup_date():
    start = datetime(2024, 1, 1)
    end = datetime(2026, 1, 1)
    delta_days = (end - start).days
    return (start + timedelta(days=random.randint(0, delta_days))).date().isoformat()

rows = []
for i in range(1, 1001):
    age = random.randint(18, 70)
    country = random.choice(countries)
    plan = random.choice(plans)
    monthly_spend = round(random.uniform(0, 500), 2)

    rows.append({
        "customer_id": f"CUST{i:05d}",
        "age": age,
        "country": country,
        "plan": plan,
        "monthly_spend": monthly_spend,
        "signup_date": random_signup_date()
    })

with open("customers.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)

print("Saved customers.csv")

 

Output:
 
Simple Random Data Generation
 
This script is simple: you define fields, choose ranges, and write rows. The random module supports integer generation, floating-point values, random choice, and sampling. The csv module is designed to read and write row-based tabular data. This kind of dataset is suitable for:

  • Frontend demos
  • Dashboard testing
  • API development
  • Learning Structured Query Language (SQL)
  • Unit testing input pipelines

However, there is a major weakness to this approach: everything is completely random. This often results in data that looks flat or unnatural. Enterprise customers might spend only 2 dollars, while "Free" users might spend 400. Older users behave exactly like younger ones because there is no underlying structure.

In real-world scenarios, data rarely behaves this way. Instead of generating values independently, we can introduce relationships and rules. This makes the dataset feel more realistic while remaining fully synthetic. For instance:

  • Enterprise customers should almost never have zero spend
  • Spending ranges should depend on the chosen plan
  • Older users might spend slightly more on average
  • Certain plans should be more common than others

Let's add these controls to the script:

import csv
import random

random.seed(42)

plans = ["Free", "Basic", "Pro", "Enterprise"]

def choose_plan():
    roll = random.random()
    if roll < 0.45:
        return "Free"
    if roll < 0.75:
        return "Basic"
    if roll < 0.93:
        return "Pro"
    return "Enterprise"

def generate_spend(age, plan):
    if plan == "Free":
        base = random.uniform(0, 10)
    elif plan == "Basic":
        base = random.uniform(10, 60)
    elif plan == "Pro":
        base = random.uniform(50, 180)
    else:
        base = random.uniform(150, 500)

    if age >= 40:
        base *= 1.15

    return round(base, 2)

rows = []
for i in range(1, 1001):
    age = random.randint(18, 70)
    plan = choose_plan()
    spend = generate_spend(age, plan)

    rows.append({
        "customer_id": f"CUST{i:05d}",
        "age": age,
        "plan": plan,
        "monthly_spend": spend
    })

with open("controlled_customers.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)

print("Saved controlled_customers.csv")

 

Output:
 
Simple Random Data Generation-2
 

Now the dataset preserves meaningful patterns. Rather than producing random noise, you are simulating behaviors. Effective controls may include:

  • Weighted category selection
  • Realistic minimum and maximum ranges
  • Conditional logic between columns
  • Intentionally added rare edge cases
  • Missing values inserted at low rates
  • Correlated features instead of independent ones
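As a quick sketch of the first two ideas, the standard library's random.choices supports weighted category selection directly, and a simple probability check can inject missing values at a low rate (the weights and the 2% missing rate here are purely illustrative):

```python
import random

random.seed(0)

plans = ["Free", "Basic", "Pro", "Enterprise"]
weights = [0.45, 0.30, 0.18, 0.07]  # weighted category selection

rows = []
for _ in range(1000):
    plan = random.choices(plans, weights=weights, k=1)[0]
    # insert missing values at a low rate (~2% of ages unknown)
    age = random.randint(18, 70) if random.random() > 0.02 else None
    rows.append({"plan": plan, "age": age})

free_share = sum(r["plan"] == "Free" for r in rows) / len(rows)
missing_ages = sum(r["age"] is None for r in rows)
print(free_share, missing_ages)
```

Because the weights are explicit, you can tune the category mix without touching the rest of the generator.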

 

2. Simulating Processes for Synthetic Data

 
Simulation-based generation is one of the best ways to create realistic synthetic datasets. Instead of directly filling columns, you simulate a process. For example, imagine a small warehouse where orders arrive, stock decreases, and low stock levels trigger backorders.

import csv
import random
from datetime import datetime, timedelta

random.seed(42)

stock = {
    "A": 120,
    "B": 80,
    "C": 50
}

rows = []
current_time = datetime(2026, 1, 1)

for day in range(30):
    for product in stock:
        daily_orders = random.randint(0, 12)

        for _ in range(daily_orders):
            qty = random.randint(1, 5)
            before = stock[product]

            if stock[product] >= qty:
                stock[product] -= qty
                status = "fulfilled"
            else:
                status = "backorder"

            rows.append({
                "time": current_time.isoformat(),
                "product": product,
                "qty": qty,
                "stock_before": before,
                "stock_after": stock[product],
                "status": status
            })

        if stock[product] < 20:
            restock = random.randint(30, 80)
            stock[product] += restock
            rows.append({
                "time": current_time.isoformat(),
                "product": product,
                "qty": restock,
                "stock_before": stock[product] - restock,
                "stock_after": stock[product],
                "status": "restock"
            })

    current_time += timedelta(days=1)

with open("warehouse_sim.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)

print("Saved warehouse_sim.csv")

 

Output:
 
Simulation Based Synthetic Data
 
This method is excellent because the data is a byproduct of system behavior, which typically yields more realistic relationships than direct random row generation. Other simulation ideas include:

  • Call center queues
  • Ride requests and driver matching
  • Loan applications and approvals
  • Subscriptions and churn
  • Patient appointment flows
  • Website traffic and conversion
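To show how little code one of these ideas takes, here is a minimal sketch of the subscriptions-and-churn case. The monthly churn probability and signup counts below are assumptions chosen only for illustration:

```python
import random

random.seed(42)

# Simulate 12 months of a subscriber base: new signups arrive each month,
# and every existing subscriber churns with a small monthly probability.
active = 100
history = []
for month in range(1, 13):
    signups = random.randint(10, 30)
    churned = sum(1 for _ in range(active) if random.random() < 0.05)
    active = active + signups - churned
    history.append({"month": month, "signups": signups,
                    "churned": churned, "active": active})

print(history[-1])
```

Each monthly record is, again, a byproduct of the simulated process rather than a directly generated row.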

 

3. Generating Time Series Synthetic Data

 
Synthetic data is not limited to static tables. Many systems produce sequences over time, such as app traffic, sensor readings, orders per hour, or server response times. Here is a simple time series generator for hourly website visits with weekday patterns.

import csv
import random
from datetime import datetime, timedelta

random.seed(42)

start = datetime(2026, 1, 1, 0, 0, 0)
hours = 24 * 30
rows = []

for i in range(hours):
    ts = start + timedelta(hours=i)
    weekday = ts.weekday()

    base = 120
    if weekday >= 5:
        base = 80

    hour = ts.hour
    if 8 <= hour <= 11:
        base += 60
    elif 18 <= hour <= 21:
        base += 40
    elif 0 <= hour <= 5:
        base -= 30

    visits = max(0, int(random.gauss(base, 15)))

    rows.append({
        "timestamp": ts.isoformat(),
        "visits": visits
    })

with open("traffic_timeseries.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["timestamp", "visits"])
    writer.writeheader()
    writer.writerows(rows)

print("Saved traffic_timeseries.csv")

 

Output:
 
Time Series Synthetic Data
 
This approach works well because it incorporates trends, noise, and cyclic behavior while remaining easy to explain and debug.
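If you also want a gradual growth trend on top of the cyclic patterns, one option is to add a small term that grows with the hour index. This condensed sketch keeps only the weekday baseline from the script above; the 0.05 growth rate per hour is an assumption:

```python
import random
from datetime import datetime, timedelta

random.seed(42)

start = datetime(2026, 1, 1)
rows = []
for i in range(24 * 30):
    ts = start + timedelta(hours=i)
    base = 120 if ts.weekday() < 5 else 80  # weekday vs. weekend baseline
    base += i * 0.05                        # slow upward trend over the month
    visits = max(0, int(random.gauss(base, 15)))
    rows.append({"timestamp": ts.isoformat(), "visits": visits})

# With the trend in place, the last week averages more visits than the first.
first_week = sum(r["visits"] for r in rows[:168]) / 168
last_week = sum(r["visits"] for r in rows[-168:]) / 168
print(round(first_week, 1), round(last_week, 1))
```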

 

4. Creating Event Logs

 
Event logs are another useful script style, ideal for product analytics and workflow testing. Instead of one row per customer, you create one row per action.

import csv
import random
from datetime import datetime, timedelta

random.seed(42)

events = ["signup", "login", "view_page", "add_to_cart", "purchase", "logout"]

rows = []
start = datetime(2026, 1, 1)

for user_id in range(1, 201):
    event_count = random.randint(5, 30)
    current_time = start + timedelta(days=random.randint(0, 10))

    for _ in range(event_count):
        event = random.choice(events)

        if event == "purchase" and random.random() < 0.6:
            value = round(random.uniform(10, 300), 2)
        else:
            value = 0.0

        rows.append({
            "user_id": f"USER{user_id:04d}",
            "event_time": current_time.isoformat(),
            "event_name": event,
            "event_value": value
        })

        current_time += timedelta(minutes=random.randint(1, 180))

with open("event_log.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)

print("Saved event_log.csv")

 

Output:
 
Event Log Generation
 
This format is useful for:

  • Funnel analysis
  • Analytics pipeline testing
  • Business intelligence (BI) dashboards
  • Session reconstruction
  • Anomaly detection experiments

A useful technique here is to make events dependent on previous actions. For example, a purchase should usually follow a login or a page view, making the synthetic log more believable.
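One way to sketch that dependency is a tiny Markov-style chain over event names, where each next event is drawn conditioned on the previous one. The transition probabilities below are made up for illustration:

```python
import random

random.seed(42)

# Next-event probabilities conditioned on the previous event;
# "purchase" is only reachable after "add_to_cart".
transitions = {
    "login":       [("view_page", 0.8), ("logout", 0.2)],
    "view_page":   [("view_page", 0.5), ("add_to_cart", 0.3), ("logout", 0.2)],
    "add_to_cart": [("purchase", 0.6), ("view_page", 0.3), ("logout", 0.1)],
    "purchase":    [("view_page", 0.4), ("logout", 0.6)],
    "logout":      [],  # terminal state ends the session
}

def simulate_session():
    events = ["login"]
    while transitions[events[-1]]:
        names, probs = zip(*transitions[events[-1]])
        events.append(random.choices(names, weights=probs, k=1)[0])
    return events

session = simulate_session()
print(session)
```

Every generated session now starts with a login, ends with a logout, and never shows a purchase out of nowhere.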

 

5. Generating Synthetic Text Data with Templates

 
Synthetic data is also useful for natural language processing (NLP). You don't always need an LLM to get started; you can build effective text datasets using templates and controlled variation. For example, you can create support ticket training data:

import json
import random

random.seed(42)

issues = [
    ("billing", "I was charged twice for my subscription"),
    ("login", "I cannot log into my account"),
    ("shipping", "My order has not arrived yet"),
    ("refund", "I want to request a refund"),
]

tones = ["Please help", "This is urgent", "Can you check this", "I need support"]

data = []

for _ in range(100):
    label, message = random.choice(issues)
    tone = random.choice(tones)

    text = f"{tone}. {message}."
    data.append({
        "text": text,
        "label": label
    })

with open("support_tickets.jsonl", "w", encoding="utf-8") as f:
    for item in data:
        f.write(json.dumps(item) + "\n")

print("Saved support_tickets.jsonl")

 

Output:
 
Synthetic Text Data Using Templates
 
This approach works well for:

  • Text classification demos
  • Intent detection
  • Chatbot testing
  • Prompt evaluation
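To push the variation a little further, you can lightly perturb the templated text, for example by randomly lowercasing it or dropping the final period so the dataset is less uniform. The noise rates in this sketch are arbitrary:

```python
import random

random.seed(42)

def add_noise(text):
    # occasionally lowercase the whole message
    if random.random() < 0.3:
        text = text.lower()
    # occasionally drop the trailing period
    if random.random() < 0.2 and text.endswith("."):
        text = text[:-1]
    return text

samples = [add_noise("Please help. I was charged twice for my subscription.")
           for _ in range(10)]
print(samples[0])
```

Even this small amount of noise helps a classifier avoid latching onto template artifacts like capitalization or punctuation.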

 

Final Thoughts

 
Synthetic data scripts are powerful tools, but they can be implemented incorrectly. Be sure to avoid these common mistakes:

  • Making all values uniformly random
  • Forgetting dependencies between fields
  • Generating values that violate business logic
  • Assuming synthetic data is inherently safe by default
  • Creating data that is too "clean" to be useful for testing real-world edge cases
  • Using the same pattern so frequently that the dataset becomes predictable and unrealistic

Privacy remains the most critical consideration. While synthetic data reduces exposure to real records, it is not risk-free. If a generator is too closely tied to the original sensitive data, leakage can still occur. This is why privacy-preserving techniques, such as differentially private synthetic data, are essential.
 
 

Kanwal Mehreen is a machine learning engineer and a technical writer with a profound passion for data science and the intersection of AI with medicine. She co-authored the ebook "Maximizing Productivity with ChatGPT". As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She's also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.
