

Synthetic Data Generator Project for Lake Databases

This repository contains a notebook which creates synthetic data for a Lake Database. It fills data into all the tables of a lake database, typically one created from a database template. The templates are pure metadata and are created empty. The notebook is designed to populate an empty database; if the lake database already contains data, that data will be deleted.

We can pass a size parameter to the creation method. The created tables will be 1x, 10x and 100x the base size in order of magnitude. Each table size is generated with a 10% random variation to avoid all tables having exactly the same length.

This notebook creates artificial data to allow for basic testing and demos of lake databases. It requires:

- A Synapse Workspace with an existing Lake Database.
- A Spark cluster with the Python Faker library installed.

Often we will create an empty lake database from the database industry templates. Creating that database from the lake database templates is a simple way to generate a good starting point for a dummy environment. If you have not created a lake database yet, open the Synapse workspace Gallery and select one of the templates. Select the lake folder used by the database (verify it is empty), and select the supported database storage format, delimited text or Parquet (Parquet is the better choice). Once created, the lake database has no data.

At this point we should create a Synapse Spark cluster with a midsize node (8 cores) and an elastic number of workers from 3 to 6, and add a requirements.txt file to the cluster to install the Python package Faker.

Open the synthetic data generator notebook. You must enter the name of the lake database in the notebook's first cell. Then we can execute the cells of the notebook. The first cell contains configuration and libraries. The second cell contains the Python class definition, the creation of one instance of the generator, and its execution. This process generates the tables and populates them with arbitrary data that fits the type definition.
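Assuming the standard Synapse mechanism of uploading a requirements file to the Spark pool, the requirements.txt mentioned above could be as small as:

```
faker
```

A version pin (for example `faker==19.x`) can be added if reproducible cluster builds are needed.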
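As a rough illustration of the pattern the generator follows, here is a minimal, hypothetical Python sketch: a value factory per column type builds rows that fit a table's type definition. The names (`generate_rows`, `TYPE_FACTORIES`, `random_string`) are illustrative only, and the sketch uses the standard library where the real notebook relies on Faker for realistic values and on Spark to write the data.

```python
import random
import string

def random_string(n=10):
    # Placeholder for Faker-style text values.
    return "".join(random.choices(string.ascii_letters, k=n))

# One value factory per supported column type (illustrative subset).
TYPE_FACTORIES = {
    "string": random_string,
    "int": lambda: random.randint(0, 1_000_000),
    "double": lambda: random.uniform(0.0, 1_000.0),
}

def generate_rows(schema, n_rows):
    """Build n_rows dictionaries matching schema, a list of
    (column_name, column_type) pairs."""
    return [
        {name: TYPE_FACTORIES[ctype]() for name, ctype in schema}
        for _ in range(n_rows)
    ]

rows = generate_rows([("CustomerId", "int"), ("CustomerName", "string")], 5)
```

In the actual notebook the equivalent of `rows` would be turned into a Spark DataFrame and written to the lake database table in the chosen storage format.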
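The size scaling (1x, 10x, 100x with a 10% random variation) could be sketched as below; `table_row_count` is a hypothetical helper for illustration, not code taken from the notebook.

```python
import random

def table_row_count(base_size, factor):
    # Scale the base size by the table's factor (1, 10 or 100) and
    # apply a +/-10% random variation so tables do not all end up
    # with exactly the same length.
    variation = random.uniform(-0.10, 0.10)
    return int(base_size * factor * (1 + variation))

# Row counts for three tables generated from a base size of 1000.
sizes = [table_row_count(1000, f) for f in (1, 10, 100)]
```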
