Pre-processing Part II. Prepare and Inspect Data.

Update (23 July 2018): The Python package kero is up and ready! The code is available on GitHub.

Recall from Part I that we have functions to generate random data tables, which can be made defective to simulate real-life situations. Let us use the following to create a random data table right away.

import kero.DataHandler.Debuggers as dhdeb
import pandas as pd
import kero.DataHandler.DataVisual as dv

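# Generate a random, partly defective data table and save it as a .csv file.
rdf = dhdeb.check_initiate_random_table(csv_name="check_generate_level_1_report.csv", rate=0.09)
# Read the saved table back in as a data frame.
df = pd.read_csv(r"check_generate_level_1_report.csv")
# (1) Create the visual_lab object.
vlab = dv.visual_lab()
# (2) Separate the clean data from the defective data.
vlab.prepare_for_visual(df)
# (3) Generate the report.
vlab.generate_level_1_report(label_name="mydata")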

Do not worry about dhdeb.check_initiate_random_table(); it is a predefined function that generates a random data table and saves it as a .csv file. The real work begins when we call read_csv() and load the data as a data frame. The three lines using vlab do the following. (1) Create a visual_lab() object, which is the machinery we built for data visualization. (2) Prepare the data for visual inspection. If you look into prepare_for_visual(), you will find that it uses the data_sieve() function, which in turn uses the drop_defective_table() function to remove defective data from the data frame and collect the defects in a separate data frame. (3) Generate reports describing what our data look like. A part of the report looks like this, showing a column in the cleaned data frame.

--> column 0 (first) : [1.0, 3.0, 1.0, 3.0, 3.0, 2.0, 1.0, 2.0, 2.0, 2.0, 3.0]
 length = 11
 unique list = [1.0, 2.0, 3.0]
 size of unique list (include various Nan) = 3
 type list = [<class 'numpy.float64'>]
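To make the fields of this report concrete, here is a minimal sketch that computes a similar per-column summary with plain pandas. It is not how kero builds its report; summarize_column is a hypothetical helper, and dropping NaN before listing unique values is an assumption made here for simplicity (the report above counts NaN variants instead).

```python
import pandas as pd


def summarize_column(series):
    """Describe one column. Hypothetical helper, not part of kero."""
    return {
        "values": series.tolist(),
        "length": len(series),
        # Assumption: NaN entries are dropped before collecting unique values.
        "unique": sorted(series.dropna().unique().tolist()),
        # Names of the element types found in the underlying NumPy array.
        "types": sorted({type(v).__name__ for v in series.to_numpy()}),
    }


# The same values as column 0 in the report above.
col = pd.Series([1.0, 3.0, 1.0, 3.0, 3.0, 2.0, 1.0, 2.0, 2.0, 2.0, 3.0])
summary = summarize_column(col)
print(summary["length"])
print(summary["unique"])
print(summary["types"])
```

A column with mixed element types would show more than one entry under "types", which is usually a sign that the column needs cleaning.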

Refer to the documentation for a more complete description of the reports. Next, we see that vlab stores the original, cleaned and defective data frames, which can be accessed as follows.

print(vlab.df)
print("\nclean df:\n")
print(vlab.cleanD.clean_df)
print("\ncrippled df:\n")
print(vlab.crippledD.crippled_df)
print("\n\n")
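To get a feel for what separating clean rows from defective ones involves, here is a minimal sketch in plain pandas. It is not kero's data_sieve() or drop_defective_table(); split_clean_and_defective is a hypothetical helper, and it assumes that "defective" simply means a row has at least one missing entry.

```python
import numpy as np
import pandas as pd


def split_clean_and_defective(df):
    """Split df row-wise into (clean, defective). Hypothetical helper, not kero's API."""
    # Assumption: a row is defective if any of its entries is missing.
    has_missing = df.isna().any(axis=1)
    clean = df[~has_missing]
    defective = df[has_missing]
    return clean, defective


# A toy table with two defective rows (index 1 and index 2).
toy = pd.DataFrame({
    "first": [1.0, np.nan, 3.0, 2.0],
    "second": ["a", "b", None, "d"],
})
clean_df, crippled_df = split_clean_and_defective(toy)
print(clean_df)
print(crippled_df)
```

Both parts keep their original row index, so a defective row can still be traced back to its position in the original table.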

As the name DataVisual suggests, these functions only help us separate clean data from defective data, so that we can get a sense of what our data contain. In later posts, we may instead patch the empty or defective entries rather than just setting them aside for observation. Let us proceed to the next post, Pre-processing Part III. Build Conjugate Frame.
