LangGrant Windocks enables deterministic synthetic data, providing statistically accurate masked datasets for development, testing, and analytics.
How Synthetic Data Generation Works
Simple synthetic data generators are built on random number generation. The random numbers are themselves generated from a “seed,” which usually defaults to the current system time. If you give a random number generator the same seed, you get the same series of random numbers each time.
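As a minimal sketch of the seeding behavior described above, the following uses Python's standard random module. The helper name first_values and the seed values 42 and 43 are illustrative assumptions, not part of any LangGrant API.

```python
import random


def first_values(seed, count=5):
    """Return the first `count` random floats produced for a given seed."""
    # A private generator avoids interference from other code that
    # uses the module-level random functions.
    rng = random.Random(seed)
    return [rng.random() for _ in range(count)]


run_a = first_values(42)
run_b = first_values(42)
run_c = first_values(43)

print(run_a == run_b)  # compares two runs that used the same seed
print(run_a == run_c)  # compares runs that used different seeds
```

Using a dedicated random.Random instance rather than the global random.seed() is a design choice here: it keeps the sequence reproducible even if other parts of a program also draw random numbers.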
After a random number is created, the synthetic data is generated as follows:
- Integer types – Scale the generated random number to an integer within a chosen range to get the synthetic value. Choose the range based on your needs (for example, take the minimum and maximum from a source of data if you have one).
- Decimal / float / double – Use the same approach as for integers, with a range of decimal / float / double values.
- Text – Use a library such as Faker to generate text data, specifying the type of text (such as name, address, or phone). To create deterministic synthetic data, maintain an array of values for each type and use a random number generator with the same seed to index into the array of text values.
- Dates/Times – Start from a base date time and use a random number generator to add an offset in hours. Apply a multiplier to the random number that matches how long you want the range of generated date time values to be. Specify the same initial seed to the random number generator to get deterministic synthetic date time values.
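import datetime
import random

# Seed the generator so each run produces the same synthetic data
random.seed(10)

# Use datetime (not date) so that hour offsets are not discarded
startTime = datetime.datetime(2020, 1, 1)
endTime = datetime.datetime(2030, 1, 1)
# Whole hours between startTime and endTime, used as the largest offset
maxHours = int((endTime - startTime).total_seconds() // 3600)

for i in range(10):
    eachDate = startTime + datetime.timedelta(hours=random.randint(0, maxHours))
    print(eachDate)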
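The integer, decimal, and text cases from the list above follow the same pattern as the date example. The sketch below is one possible arrangement using only the standard library. The function synthetic_rows, the FIRST_NAMES pool, and the ranges are hypothetical. In practice the pool might be pre-generated with a library such as Faker, and the ranges derived from source data.

```python
import random

# Hypothetical fixed pool of text values for the "name" type.
FIRST_NAMES = ["Alice", "Bob", "Carol", "Dave", "Erin", "Frank"]


def synthetic_rows(seed, count, int_range=(18, 90), float_range=(0.0, 5000.0)):
    """Build `count` rows of (name, age, balance) from a seeded generator."""
    rng = random.Random(seed)
    rows = []
    for _ in range(count):
        # Integer: drawn from the inclusive range [low, high].
        age = rng.randint(*int_range)
        # Decimal / float: drawn from the float range, rounded to 2 places.
        balance = round(rng.uniform(*float_range), 2)
        # Text: a seeded random index into the fixed array of values.
        name = FIRST_NAMES[rng.randrange(len(FIRST_NAMES))]
        rows.append((name, age, balance))
    return rows


for row in synthetic_rows(seed=10, count=5):
    print(row)
```

Because every column is drawn from the same seeded generator in a fixed order, calling synthetic_rows with the same seed and count returns identical rows, provided the value pool and ranges are unchanged.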
For more information, see the official Synthetic Data Vault documentation. See also the database subsetting guide in the LangGrant documentation.