build_conj_dataframe()


kero.DataHandler.DataTransform.py

class clean_data:
  def build_conj_dataframe(self, conj_command_set, conj_command_setting_set = None):
    return

This function builds the conjugate data frame. What do we mean by a conjugate data frame? If we have columns A and B in our original data frame, scale column A to A’ and binarize column B to B1 and B2, then we obtain a new data frame consisting of columns A’, B1 and B2. This new data frame is the conjugate data frame. The figure below illustrates how a data frame is transformed into its conjugate.

[Figure conj_frame: a data frame and its conjugate.]
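
To make the idea concrete before turning to kero, here is a minimal pandas sketch (it does not use build_conj_dataframe; the column names A and B are made up for illustration) that builds the conjugate of a two-column frame by hand:

import pandas as pd

# Hypothetical original data frame: continuous column A, categorical column B.
df = pd.DataFrame({"A": [10.0, 15.0, 20.0], "B": ["b1", "b2", "b1"]})

# Scale A to [-1, 1]; this plays the role of the conjugate column A'.
a_min, a_max = df["A"].min(), df["A"].max()
a_scaled = -1 + 2 * (df["A"] - a_min) / (a_max - a_min)

# Binarize B into one Boolean column per class; these play the role of B1 and B2.
b_bool = pd.get_dummies(df["B"], prefix="B").astype(int)

conj = pd.concat([a_scaled.rename("A_scaled"), b_bool], axis=1)
print(conj)  # columns: A_scaled, B_b1, B_b2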

Example Usage

Let us start by creating a random data table, making some of its data points defective, and then extracting the non-defective part of the table.

import kero.DataHandler.RandomDataFrame as RDF
import pandas as pd
import kero.DataHandler.DataTransform as dt
import numpy as np

# Create a random 20-row table with four feature columns and one result column.
rdf = RDF.RandomDataFrame()
col1 = {"column_name": "first", "items": [1, 2, 3]}
itemlist = list(np.linspace(10, 20, 48))
col2 = {"column_name": "second", "items": itemlist}
col3 = {"column_name": "third", "items": ["gg", "not"]}
col4 = {"column_name": "fourth", "items": ["my", "sg", "id", "jp", "us", "bf"]}
col_out = {"column_name": "result", "items": ["classA", "classB", "classC"]}
rdf.initiate_random_table(20, col1, col2, col3, col4, col_out, panda=True)

# Make roughly 8% of the entries defective and save the table to CSV.
rdf.crepify_table(rdf.clean_df, rate=0.08)
rdf.crepified_df.to_csv("testing_conj_df.csv", index=False)

# Reload the table and sieve out the defective parts.
df = pd.read_csv(r"testing_conj_df.csv")
cleanD, _, _ = dt.data_sieve(df)  # returns cleanD, crippD, origD
cleanD.get_list_from_df()
colname_set = df.columns

Up to here, we have only created the random data frame, introduced some defects, and then sieved them out. The first few rows of the table look like this.

   first     second third fourth  result
0      1  10.851064   not     us  classA
1      1  12.127660   not     my  classC
2      3  16.808511    gg     bf  classB
3      2  18.085106   not     jp  classC
4      3  20.000000    gg     us  classB

Now we do the real work. Notice that cleanD is a clean_data object. Moreover, the data frame cleanD.clean_df, which is a property of the clean_data object, must have been initiated for the following to work; data_sieve() does this for you. This is natural, since we want to build the conjugate data frame of clean_df, a data frame that contains no defects. In other words, we do not deal with the defective parts in this example.
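
As a quick sanity check (not part of the original example), you can verify both facts before building the conjugate:

# cleanD should be a clean_data object, and clean_df should already be populated by data_sieve().
print(type(cleanD))
print(cleanD.clean_df.head())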

# conversion choices
# - 1. "discrete_to_bool"
# - 2. "cont_to_scale"
# - 3. "discrete_to_int"
# - 4. ""
conj_command_set = {colname_set[0]: "discrete_to_bool",
                    colname_set[1]: "cont_to_scale",
                    colname_set[2]: "discrete_to_bool",
                    colname_set[3]: "discrete_to_bool",
                    colname_set[4]: "discrete_to_int"}
discrete_to_int_settings = {"classA": 0, "classB": 1, "classC": 2}
cont_to_scale_settings = {"scale": [-1, 1], "mode": "uniform", "original_scale": [10, 20]}
conj_command_setting_set = {colname_set[0]: True,
                            colname_set[1]: cont_to_scale_settings,
                            colname_set[2]: True,
                            colname_set[3]: True,
                            colname_set[4]: discrete_to_int_settings}
cleanD.build_conj_dataframe(conj_command_set, conj_command_setting_set=conj_command_setting_set)

print(df)
print("\n\nCOMPARE : CLEANED\n\n")
print(cleanD.clean_df)
print("\n\nCOMPARE : CONJUGATED\n\n")

print(cleanD.clean_df_conj)

conj_command_set and conj_command_setting_set

Notice that in the code above we transform the data frame according to a set of rules. Some columns are binarized, and one of them is scaled. This is expressed by the conjugate commands, which come with settings and options.

Format:

conj_command_set = {column_name_1: mode_1, …}

conj_command_setting_set = {column_name_1: option_1, …}

Command Descriptions

cont_to_scale

Function invoked: def conj_from_cont_to_scaled(col, scale=None, original_scale=None, mode="uniform")

scale (list of double): [min, max]
original_scale (list of double): [o_min, o_max]
mode (string)

conj_command_setting: {"scale": scale, "mode": mode, "original_scale": o_scale}

mode = "uniform"

If "original_scale" is [o_min, o_max]:

Given a column of doubles col = [x_1, …, x_N], scale each entry in the following manner:

x_k \rightarrow min + (max - min) \times \frac{x_k - o_{min}}{o_{max} - o_{min}}

If "original_scale" is None, i.e. not specified, scale in the same manner, but with o_min = min(col) and o_max = max(col).

Other mode(s) to be added
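
For clarity, here is a plain-Python illustration of the uniform scaling rule above; the helper name scale_uniform is made up and this is not kero's own implementation:

def scale_uniform(col, scale, original_scale=None):
    # Sketch of the "uniform" mode: map [o_min, o_max] linearly onto [min, max].
    lo, hi = scale
    if original_scale is None:
        o_min, o_max = min(col), max(col)  # fall back to the column's own range
    else:
        o_min, o_max = original_scale
    return [lo + (hi - lo) * (x - o_min) / (o_max - o_min) for x in col]

print(scale_uniform([10, 12.5, 15, 20], scale=[-1, 1], original_scale=[10, 20]))
# [-1.0, -0.5, 0.0, 1.0]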

discrete_to_bool

Function invoked: def conj_from_discrete_to_bool(col, drop_one_column=False)

drop_one_column (Bool)

conj_command_setting: drop_one_column

Given a column with header "somecolumn" and entries [x_1, x_2, …, x_N], where each x_k is an element of {class1, class2, …, classM}, replace this column with Boolean columns "somecolumn_class1", …, "somecolumn_classM". If x_k is classj, then the entry in column "somecolumn_classj" is 1 and the entries in the other Boolean columns of that row are 0.

If drop_one_column is True, then the column "somecolumn_classM" is discarded. This prevents the dummy variable trap. Otherwise, nothing further happens.
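
The behavior can be sketched in plain Python as follows; the helper discrete_to_bool below is hypothetical and only illustrates the rule, it is not kero's implementation:

def discrete_to_bool(col, header, drop_one_column=False):
    classes = sorted(set(col))        # class1, ..., classM
    if drop_one_column:
        classes = classes[:-1]        # drop one class to avoid the dummy variable trap
    names = ["{}_{}".format(header, c) for c in classes]
    rows = [[1 if x == c else 0 for c in classes] for x in col]
    return names, rows

names, rows = discrete_to_bool(["gg", "not", "gg"], "third")
print(names)  # ['third_gg', 'third_not']
print(rows)   # [[1, 0], [0, 1], [1, 0]]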

discrete_to_int

Function invoked: def conj_from_discrete_to_int(col, rule=None)

rule (dictionary): {class1: int1, …, classM: intM}

conj_command_setting: {class1: int1, …, classM: intM}

Given a column [x_1, …, x_N] where each x_k is an element of {class1, …, classM}, convert every element x_k = classj to x_k = intj.

Note: technically intj can be of any data type, so this function can be used to map the classes consistently to values of any type.
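
The conversion is simply an element-wise dictionary lookup. A minimal sketch (not kero's implementation), using the same rule as discrete_to_int_settings in the example above:

rule = {"classA": 0, "classB": 1, "classC": 2}
col = ["classB", "classA", "classC", "classA"]
converted = [rule[x] for x in col]
print(converted)  # [1, 0, 2, 0]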

See another example here.

See a more complete example in the loan problem post here.

kero version: 0.1 and above