How to Generate Billions of Rows of Data in EXASOL with Python

Data

How to Generate Billions of Rows of Data in EXASOL with Python

by Archana Chinnaswamy
//

In most Tableau training sessions, we want to use large datasets to provide meaningful insights. It can be hard, however, to get a large dataset, as per our requirements. How about generating billions of rows of dataset in a few hours? I used Python script to generate random data and load into EXASOL.

Pre-requisite

  1. Download EXASOL MSI client, ODBC driver and SDK from here. Make sure you download 64 bit. https://www.exasol.com/portal/display/DOWNLOAD/6.0
  2. After installing EXASOL, log in with your username and password.
  3. Download and install Python (preferably version 2.7, 64 bit) from here: https://www.continuum.io/downloads#windows
  4. Download PIP from here: https://pip.pypa.io/en/stable/installing/
  5. Open command prompt and navigate to the directory where SDK and ODBC files are located.
  6. Run the below commands to connect to the EXASOL server.
python setup.py install

pip install pyodbc

python exasol.py
  1. Create a table in EXASOL according to the requirements before running the Python code.
create table RETAIL.marketshare

(category varchar (30),

order_dt varchar (30),

division varchar (30),

prod_code varchar (30),

sales_amt decimal (12,2),

season varchar (5),

store_brand varchar (30),

prod_brand varchar (30),

store_num int,

units int,

price_grp int,

price_grp1 int,

primary key (prod_code, store_num, store_brand, dt));

 

How It’s Done

Import packages necessary for the program. In this code, I use imports EXASOL, Random, String, Pandas and Numpy.

import exasol

import random

import string

import pandas as pd

import numpy as np

Set a random seed with any number. This helps to generate data in the given probability every time you run the same code. I am specifying the number of rows to be generated with the help of variable “num” and initializing all the variables used in this program.

np.random.seed(123)

num=100000000

sales_amt= []

prod_code= []

date= []

season_1= []

season= []

price1= []

rem= []

price_grp= []

price_grp1= []

I want units and price to be a random integer, for which I have given a range.

units=[random.randint(1,51) for x in range(num)]

price=[round(random.uniform(20,200),2) for X in range(num)]

Store brand, product brand, division and category is randomly generated from the given array. Here I want to generate data non-uniformly, hence giving probabilities for each item in the array. Please note that by default, it generates data uniformly. Then, we will not get variations in graphs in Tableau.

store_brand= ["TJ Max", "JC Penny", "H&M", "American Eagle", "Old Navy", "DSW", "Simon", "Macys", "Sears", "Kmart", "Walmart", "Aeropostale", "Banana Republic", "Dillard's", "Forever 21", "Gap", "Ikea", "Costco", "Target", "Zara", "Dillards"]

store_brand=np.random.choice(store_brand,num,p=[0.18,0.11,0.06,0.12,0.10,0.08,0.07,0.05,0.03,0.02,0.02,0.02,0.03,0.01,0.01,0.01,0.02,0.02,0.01,0.01,0.02])

prod_brand= ['Our Brand', 'CKC Brand', 'Jez Brand', 'Ral Brand', 'Mazi Brand', 'Zakka Brand', 'JB Brand']

prod_brand= np.random.choice(prod_brand,num,p=[0.30,0.13,0.18,0.15,0.05,0.08,0.11])

store_num=[random.randint (1,101) for x in range(num)]

division=["Footwear","Apparel","Equipment"]

division= np.random.choice(division, num, p=[0.30,0.55,0.15])

category= ["Women's Athletic","Men's Athletic","Sleepwear","Women's Apparel","Men's Apparel","Kid's Apparel","Women's Shoes", "Boots", "Men's Shoes", "Outdoor Equipments", "Indoor Equipments"]

category= np.random.choice(category,num,p=[0.10,0.14,0.06,0.22,0.15,0.13,0.08,0.06,0.03,0.02,0.01])

I want product code to be in a certain format with mixed characters and letters.

prod_cd1=[random.choice(string.ascii_uppercase) for X in range(num)]

prod_cd2=[random.choice(string.ascii_uppercase) for X in range(num)]

prod_cd3=[random.choice(string.ascii_uppercase) for X in range(num)]

prod_cd4=[random.randint(0,9) for X in range(num)]

prod_cd5=[random.randint(0,9) for X in range(num)]

prod_cd6=[random.randint(0,9) for X in range(num)]

prod_cd7=[random.randint(0,9) for X in range(num)]

prod_cd8=[random.randint(0,9) for X in range(num)]

prod_cd9=[random.choice(string.ascii_uppercase) for X in range(num)]

for i in range(0,len(prod_cd1)):    prod_code.append(prod_cd1[i]+prod_cd2[i]+prod_cd3[i]+str(prod_cd4[i])+str(prod_cd5[i])+str(prod_cd6[i])+"-"+str(prod_cd7[i])+str(prod_cd8[i])+prod_cd9[i])

Season should be based on the date and then split into Summer, Fall, Winter and Spring. I would also like to show more sales during the holiday season.

year=[random.choice(range(2014,2017)) for X in range(num)]

m=[1,2,3,4,5,6,7,8,9,10,11,12]

month=np.random.choice(m,num,p=[0.14,0.10,0.06,0.01,0.01,0.13,0.08,0.06,0.03,0.02,0.14,0.22])

day=[random.choice(range(1,28)) for X in range(num)]

for j in range(0,len(prod_cd1)):

    date.append(str(month[j]).zfill(2)+"-"+str(day[j]).zfill(2)+"-"+str(year[j]))

    if 3 <= month[j] <= 5:

        season_1.append("SP")

    elif 6 <= month[j] <= 8:

        season_1.append("SU")

    elif 9 <= month[j] <= 11:

        season_1.append("FA")

    else:

        season_1.append("WI")

    season.append(season_1[j]+str(str(year[j])[2:]))

for l in range(0,len(units)):

    if season[l]=="WI":

        units.append(units[l]*10)

    else:

        units.append(units[l])

Now I am calculating sales amount based on the units and unit price. The price has been bucketed to two groups – price rounded to 10’s in one group and price rounded to 1’s in the other group.

for k in range(0,len(price)):

    sales_amt.append(units[k]*price[k])

for y in range(0,len(price)):

    price1.append(sales_amt[y]/units[y])

    rem.append(price1[y] % 10)

    price_grp1.append(int(round(price1[y])))

    if rem[y] < 5:

        price_grp.append(int(price1[y] / 10) * 10)

    else:

        price_grp.append(int((price1[y] + 10) / 10) * 10)

Once all the logic is written, all I do is put it in Panda's data frame.

df=pd.DataFrame({'store_brand': store_brand[0:num],'store_num': store_num[0:num],'division': division[0:num],'category': category[0:num],'prod_code': prod_code[0:num],'date': date[0:num],'season': season[0:num],'units': units[0:num],'sales_amt': sales_amt[0:num],

'price_grp': price_grp[0:num], 'price_grp1': price_grp1[0:num], 'prod_brand': prod_brand[0:num]})

Connect to EXASOL using exasol.connect in the Python script. Be sure to specify driver name, server details, login credentials and table name.

df=pd.DataFrame({'store_brand': store_brand[0:num],'store_num': store_num[0:num],'division': division[0:num],'category': category[0:num],'prod_code': prod_code[0:num],'date': date[0:num],'season': season[0:num],'units': units[0:num],'sales_amt': sales_amt[0:num],

'price_grp': price_grp[0:num], 'price_grp1': price_grp1[0:num], 'prod_brand': prod_brand[0:num]})

 

End Product

Generating data with EXASOL

Final count for data in EXASOL

Generate data using EXASOL

It took me about 15 seconds to load one billion rows of data from EXASOL into Tableau and generate the tree map for product brand. 

More About the Author

Archana Chinnaswamy

Analytics Consultant
How to Load Data from JSON to EXASOL If you’ve been looking for a way to load JSON data into EXASOL with just a click, then you’ll love this post! As data scientists, most ...
EXASOL – Small Business Edition Many of you must have heard about EXASOL. If not, I hope this blog gives you the motivation to discover it’s capabilities for yourself. ...

See more from this author →

InterWorks uses cookies to allow us to better understand how the site is used. By continuing to use this site, you consent to this policy. Review Policy OK

×

Interworks GmbH
Ratinger Straße 9
40213 Düsseldorf
Germany
Geschäftsführer: Mel Stephenson

Kontaktaufnahme: markus@interworks.eu
Telefon: +49 (0)211 5408 5301

Amtsgericht Düsseldorf HRB 79752
UstldNr: DE 313 353 072

×

Love our blog? You should see our emails. Sign up for our newsletter!