Generating Datasets with Code CoPilot

While there are plenty of public datasets available for use, sometimes it makes sense to generate your own. That is what I did over the past weekend with the help of GPT Code CoPilot. I work in healthcare IT and I am planning to test a few de-identification tools for researchers.

De-identification tools are important because they seek to remove PHI (protected health information) and PII (personally identifiable information) from datasets, making them safe enough to use without compromising the identity of the subjects in the record set. Datasets which have not been de-identified are identifiable or, at best, limited. I began with the following prompt…

Write a bash script which generates a synthetic dataset and encompasses the following fields: First Name, Last Name, Address, City, State, Zip Code, Phone Number, Date of Birth, Sex, Religious Affiliation

GPT provided a decent script that generated a decent dataset in CSV format. However, since I’m in healthcare, I thought it would be more useful to generate synthetic medical records. I asked GPT to expand the previous script to simulate medical record data files. It suggested the following fields:

• First Name
• Last Name
• Address
• City
• State
• Zip Code
• Phone Number
• Date of Birth
• Sex
• Religious Affiliation
• Medical Record Number: Unique identifier for the medical record.
• Primary Diagnosis: The primary medical condition.
• Secondary Diagnoses: Additional related conditions (optional).
• Medications: Current medications prescribed to the individual.
• Allergies: Known allergies.
• Blood Type: Blood group (e.g., A+, O-).
• Height (cm): Height in centimeters.
• Weight (kg): Weight in kilograms.

Before the subsequent scripts can be run, the Python Faker library must be installed with the following command:

pip install faker

Here is the medical record script, which writes its output to medical_records_dataset.csv:

#!/bin/bash

# File to save the generated medical dataset
output_file="medical_records_dataset.csv"

# Number of records to generate
num_records=100

# Create the CSV file with headers
echo "Medical Record Number,First Name,Last Name,Address,City,State,Zip Code,Phone Number,Date of Birth,Sex,Religious Affiliation,Primary Diagnosis,Secondary Diagnoses,Medications,Allergies,Blood Type,Height (cm),Weight (kg)" > $output_file

# Python script to generate data
generate_data_script=$(cat <<EOF
import sys
from faker import Faker
import random

fake = Faker()

# Number of records to generate
num_records = int(sys.argv[1])

# Lists for randomized medical data
diagnoses = [
    "Hypertension", "Diabetes", "Asthma", "Heart Disease", 
    "Depression", "Anxiety", "Migraine", "Arthritis", 
    "Obesity", "Cancer", "Chronic Pain", "Epilepsy"
]
medications = [
    "Ibuprofen", "Metformin", "Lisinopril", "Amlodipine", 
    "Simvastatin", "Omeprazole", "Albuterol", "Levothyroxine", 
    "Atorvastatin", "Losartan"
]
allergies = [
    "Peanuts", "Shellfish", "Pollen", "Dust", 
    "Penicillin", "Latex", "None", "Eggs"
]
blood_types = ["A+", "A-", "B+", "B-", "O+", "O-", "AB+", "AB-"]

# Generate and print each record
for record_num in range(1, num_records + 1):
    medical_record_number = f"MRN{record_num:06d}"
    first_name = fake.first_name()
    last_name = fake.last_name()
    address = fake.street_address()
    city = fake.city()
    state = fake.state()
    zip_code = fake.zipcode()
    phone_number = fake.phone_number()
    date_of_birth = fake.date_of_birth(minimum_age=18, maximum_age=90).strftime("%Y-%m-%d")
    sex = random.choice(["Male", "Female", "Other"])
    religious_affiliation = random.choice([
        "Christianity", "Islam", "Hinduism", "Buddhism", "Judaism", 
        "Atheism", "Agnosticism", "Other"
    ])
    primary_diagnosis = random.choice(diagnoses)
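    # Exclude the primary diagnosis so it is not repeated as a secondary one
    secondary_diagnoses = "; ".join(random.sample([d for d in diagnoses if d != primary_diagnosis], random.randint(0, 2)))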
    current_medications = "; ".join(random.sample(medications, random.randint(0, 3)))
    known_allergies = random.choice(allergies)
    blood_type = random.choice(blood_types)
    height_cm = random.randint(150, 200)
    weight_kg = random.randint(50, 120)

    # Print the data in CSV format
    print(f"{medical_record_number},{first_name},{last_name},{address},{city},{state},{zip_code},{phone_number},{date_of_birth},{sex},{religious_affiliation},{primary_diagnosis},{secondary_diagnoses},{current_medications},{known_allergies},{blood_type},{height_cm},{weight_kg}")
EOF
)

# Use Python to generate the dataset and append to the file
python3 -c "$generate_data_script" $num_records >> $output_file

echo "Medical records dataset generated and saved to $output_file"

Instructions to Run:
Save the script as generate_synthetic_dataset.sh.

Make the script executable:
chmod +x generate_synthetic_dataset.sh

Run the script:
./generate_synthetic_dataset.sh
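
One thing to watch for when opening the output: the script builds each row with a plain f-string, so any generated value that happens to contain a comma or a line break will shift the columns. As a minimal sketch of one way around this (not what GPT produced, and using a made-up sample row), Python's built-in csv module can quote such fields automatically:

```python
import csv
import io

# Hypothetical row: the address contains a comma and the notes contain a line break
row = ["MRN000001", "Jane", "Doe", "1 Example St, Apt 2", "Line one\nline two"]

# Naive join, similar in spirit to the f-string approach:
# the comma inside the address is indistinguishable from a field separator
naive_line = ",".join(row)

# csv.writer quotes any field containing the delimiter, a quote character, or a newline
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
quoted_line = buffer.getvalue()

# Reading the quoted line back recovers the original five fields
parsed = next(csv.reader(io.StringIO(quoted_line, newline="")))

print("fields after naive split:", len(naive_line.split(",")))
print("fields after csv round trip:", len(parsed))
```

In the scripts in this post, the same idea would mean collecting each record's values in a list and passing that list to a csv.writer's writerow() instead of printing an f-string.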

The results were very usable but required a bit of cleanup in LibreOffice Calc. Since I plan to use this dataset to test data de-identification tools, I asked GPT to add a field for SSNs. GPT obliged and assisted me further by suggesting the following additional enhancements over several iterations:

• Add validations to ensure SSNs are unique in the dataset.
• Include optional fields like Email Address or Emergency Contact.
• Include a field for Insurance Provider and Policy Number to further enhance the medical records.
• Add another field to simulate optional Marital Status (e.g., Single, Married, Divorced).
• Generate a companion file with visit history for each patient, linked by Medical Record Number.
• Expand visit history to include optional fields like Doctor’s Name or Test Results.
• Generate additional datasets for billing or insurance claims.

I encourage you to try it out for yourself and see what types of datasets you can come up with.

GPT Code CoPilot was super helpful and exceeded my expectations. The enhancements continued but at some point I felt it was overkill. However, I am glad to know that I can go back to the well to continue to generate an excellent dataset for research testing needs. The code that I ended with is below.

#!/bin/bash

# Files to save the datasets
output_file="medical_records_dataset.csv"
visit_history_file="visit_history_dataset.csv"
billing_file="billing_dataset.csv"

# Number of medical records, visits, and billing records to generate
num_records=100
num_visits=300
num_billing_records=200

# Create the CSV file with headers for medical records
echo "Medical Record Number,First Name,Last Name,Address,City,State,Zip Code,Phone Number,Date of Birth,Sex,Religious Affiliation,Social Security Number,Email Address,Emergency Contact,Insurance Provider,Policy Number,Marital Status,Primary Diagnosis,Secondary Diagnoses,Medications,Allergies,Blood Type,Height (cm),Weight (kg)" > $output_file

# Create the CSV file with headers for visit history
echo "Visit ID,Medical Record Number,Visit Date,Reason for Visit,Doctor's Name,Test Results,Prescriptions,Notes" > $visit_history_file

# Create the CSV file with headers for billing/insurance claims
echo "Claim ID,Medical Record Number,Visit ID,Insurance Provider,Policy Number,Claim Amount,Claim Status,Date Submitted,Date Processed" > $billing_file

# Python script to generate data
generate_data_script=$(cat <<EOF
import sys
from faker import Faker
import random

fake = Faker()

# Number of records, visits, and billing records to generate
num_records = int(sys.argv[1])
num_visits = int(sys.argv[2])
num_billing_records = int(sys.argv[3])

# Lists for randomized medical data
diagnoses = [
    "Hypertension", "Diabetes", "Asthma", "Heart Disease", 
    "Depression", "Anxiety", "Migraine", "Arthritis", 
    "Obesity", "Cancer", "Chronic Pain", "Epilepsy"
]
medications = [
    "Ibuprofen", "Metformin", "Lisinopril", "Amlodipine", 
    "Simvastatin", "Omeprazole", "Albuterol", "Levothyroxine", 
    "Atorvastatin", "Losartan"
]
visit_reasons = [
    "Routine Checkup", "Follow-up", "New Symptoms", 
    "Emergency Visit", "Medication Refill", "Lab Results Discussion"
]
allergies = [
    "Peanuts", "Shellfish", "Pollen", "Dust", 
    "Penicillin", "Latex", "None", "Eggs"
]
blood_types = ["A+", "A-", "B+", "B-", "O+", "O-", "AB+", "AB-"]
insurance_providers = [
    "Blue Cross Blue Shield", "Aetna", "UnitedHealthcare", 
    "Cigna", "Humana", "Kaiser Permanente", "Medicare", 
    "Medicaid", "TRICARE", "State Farm"
]
marital_statuses = ["Single", "Married", "Divorced", "Widowed"]
claim_statuses = ["Submitted", "Approved", "Denied", "Pending Review"]

# Keep track of generated SSNs and Policy Numbers to ensure uniqueness
generated_ssns = set()
generated_policy_numbers = set()

# Function to generate a unique SSN
def generate_unique_ssn():
    while True:
        ssn = fake.ssn()
        if ssn not in generated_ssns:
            generated_ssns.add(ssn)
            return ssn

# Function to generate a unique Policy Number
def generate_unique_policy_number():
    while True:
        policy_number = f"POL{random.randint(10000000, 99999999)}"
        if policy_number not in generated_policy_numbers:
            generated_policy_numbers.add(policy_number)
            return policy_number

# Generate medical records
medical_records = []
for record_num in range(1, num_records + 1):
    medical_record_number = f"MRN{record_num:06d}"
    first_name = fake.first_name()
    last_name = fake.last_name()
    address = fake.street_address()
    city = fake.city()
    state = fake.state()
    zip_code = fake.zipcode()
    phone_number = fake.phone_number()
    date_of_birth = fake.date_of_birth(minimum_age=18, maximum_age=90).strftime("%Y-%m-%d")
    sex = random.choice(["Male", "Female", "Other"])
    religious_affiliation = random.choice([
        "Christianity", "Islam", "Hinduism", "Buddhism", "Judaism", 
        "Atheism", "Agnosticism", "Other"
    ])
    ssn = generate_unique_ssn()
    email_address = fake.email()
    emergency_contact = f"{fake.name()} - {fake.phone_number()}"
    insurance_provider = random.choice(insurance_providers)
    policy_number = generate_unique_policy_number()
    marital_status = random.choice(marital_statuses)
    primary_diagnosis = random.choice(diagnoses)
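    # Exclude the primary diagnosis so it is not repeated as a secondary one
    secondary_diagnoses = "; ".join(random.sample([d for d in diagnoses if d != primary_diagnosis], random.randint(0, 2)))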
    current_medications = "; ".join(random.sample(medications, random.randint(0, 3)))
    known_allergies = random.choice(allergies)
    blood_type = random.choice(blood_types)
    height_cm = random.randint(150, 200)
    weight_kg = random.randint(50, 120)

    medical_records.append({
        "medical_record_number": medical_record_number,
        "insurance_provider": insurance_provider,
        "policy_number": policy_number,
        "record_line": f"{medical_record_number},{first_name},{last_name},{address},{city},{state},{zip_code},{phone_number},{date_of_birth},{sex},{religious_affiliation},{ssn},{email_address},{emergency_contact},{insurance_provider},{policy_number},{marital_status},{primary_diagnosis},{secondary_diagnoses},{current_medications},{known_allergies},{blood_type},{height_cm},{weight_kg}"
    })

# Save medical records to file
with open("$output_file", "a") as f:
    for record in medical_records:
        f.write(record["record_line"] + "\n")

# Generate visit history
visits = []
with open("$visit_history_file", "a") as f:
    for visit_id in range(1, num_visits + 1):
        medical_record = random.choice(medical_records)
        medical_record_number = medical_record["medical_record_number"]
        visit_date = fake.date_between(start_date="-3y", end_date="today").strftime("%Y-%m-%d")
        reason_for_visit = random.choice(visit_reasons)
        doctor_name = fake.name()
        test_results = f"Blood Test: {random.choice(['Normal', 'Abnormal'])}; X-Ray: {random.choice(['Clear', 'Findings'])}"
        prescriptions = "; ".join(random.sample(medications, random.randint(0, 3)))
        notes = fake.text(max_nb_chars=100)
        visit = {
            "visit_id": f"VST{visit_id:06d}",
            "medical_record_number": medical_record_number
        }
        visits.append(visit)
        f.write(f"{visit['visit_id']},{medical_record_number},{visit_date},{reason_for_visit},{doctor_name},{test_results},{prescriptions},{notes}\n")

# Generate billing records
with open("$billing_file", "a") as f:
    for claim_id in range(1, num_billing_records + 1):
        visit = random.choice(visits)
        medical_record_number = visit["medical_record_number"]
        visit_id = visit["visit_id"]
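        # Reuse the patient's own insurance details so each claim matches the medical record
        patient = next(r for r in medical_records if r["medical_record_number"] == medical_record_number)
        insurance_provider = patient["insurance_provider"]
        policy_number = patient["policy_number"]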
        claim_amount = random.uniform(100, 10000)
        claim_status = random.choice(claim_statuses)
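        # Keep the date object so it can be reused as the lower bound for the processed date
        submitted = fake.date_between(start_date="-2y", end_date="today")
        date_submitted = submitted.strftime("%Y-%m-%d")
        date_processed = fake.date_between(start_date=submitted, end_date="today").strftime("%Y-%m-%d")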
        f.write(f"CLM{claim_id:06d},{medical_record_number},{visit_id},{insurance_provider},{policy_number},{claim_amount:.2f},{claim_status},{date_submitted},{date_processed}\n")
EOF
)

# Use Python to generate the dataset and append to the files
python3 -c "$generate_data_script" $num_records $num_visits $num_billing_records

echo "Medical records dataset generated and saved to $output_file"
echo "Visit history dataset generated and saved to $visit_history_file"
echo "Billing dataset generated and saved to $billing_file"
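
Since the visit and billing files refer back to patients by Medical Record Number, a quick consistency check can be worthwhile after generation. The following is a minimal sketch using only Python's standard csv module. The check_links helper and the sample values are hypothetical, made up for illustration, and are not part of what GPT produced.

```python
import csv
import io

def check_links(records_csv, linked_csv, key="Medical Record Number"):
    """Return rows from linked_csv whose key value is missing from records_csv.

    Both arguments are open text files (or any iterable of CSV lines)
    whose first line is a header containing the key column.
    """
    known = {row[key] for row in csv.DictReader(records_csv)}
    return [row for row in csv.DictReader(linked_csv) if row[key] not in known]

# Tiny in-memory stand-ins for the generated files (hypothetical sample values)
records = io.StringIO(
    "Medical Record Number,First Name\n"
    "MRN000001,Jane\n"
    "MRN000002,John\n"
)
visits = io.StringIO(
    "Visit ID,Medical Record Number\n"
    "VST000001,MRN000001\n"
    "VST000002,MRN000009\n"
)

# Any visit pointing at an unknown patient is reported
orphans = check_links(records, visits)
print([row["Visit ID"] for row in orphans])
```

With the real output, the in-memory samples would be replaced by the generated files, for example open("medical_records_dataset.csv", newline="") and open("visit_history_dataset.csv", newline=""). An empty result means every visit points at a known patient.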

Conclusion

This was my first go-round with Code CoPilot, but it will not be my last. This tool is excellent for people who are learning to code or for those who do not want to learn but need a quick solution to handle a task. The resulting datasets did need to be cleaned up manually and adjusted for gaps, but with only 100 records, the cleanup was not horrible. I encourage you to try Code CoPilot and see what you can dream up.