Pre-processing Part IV. Build Conjugate Frame, another example

Update 04 August 2018. We are currently using the package kero; see the home page for how to install it. Note: we are using DataHandler v0.2.

In the previous part, we built a conjugate data frame. In this part we will build another example.

import numpy as np
import pandas as pd
import kero.DataHandler.RandomDataFrame as RDF
import kero.DataHandler.DataTransform as dt

rdf = RDF.RandomDataFrame()
# col0: note that in this example we also have a column of unique IDs (see with_unique_ID below)
col1 = {"column_name": "first", "items": [1, 2, 3]}
itemlist = list(np.linspace(10, 20, 48))
col2 = {"column_name": "second", "items": itemlist}
col3 = {"column_name": "third", "items": ["gg", "not"]}
col4 = {"column_name": "fourth", "items": ["my", "sg", "id", "jp", "us", "bf"]}
col_out = {"column_name": "result", "items": ["classA", "classB", "classC"]}
rdf.initiate_random_table(20, col1, col2, col3, col4, col_out, panda=True, with_unique_ID="person")
rdf.crepify_table(rdf.clean_df, rate=0.08)
rdf.crepified_df.to_csv("testing_conj_df.csv", index=False)

df = pd.read_csv(r"testing_conj_df.csv")
cleanD, _, _ = dt.data_sieve(df)  # returns cleanD, crippD, origD
colname_set = df.columns

As before, the code above initiates a random table, makes some data points defective and then drops the defective rows, leaving us with clean data. The first 5 rows are shown here.

         ID  first     second third fourth  result
0   person1    1.0  12.978723   not     id  classC
1   person3    3.0  11.914894   not     us  classB
2   person6    3.0  18.936170    gg     my  classC
3   person8    2.0  10.425532    gg     id  classA
4  person10    3.0  10.851064   not     id  classA
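
For readers without kero installed, the same three steps (random table, defective points, dropping the defective rows) can be sketched with plain numpy and pandas. This is only an illustration under our own assumptions: it is not how RandomDataFrame, crepify_table or data_sieve are implemented, and the names clean, defective, holes and cleaned are made up for this sketch. We assume a "defective" data point is simply a missing value.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 20

# Random table with the same columns as the example above.
clean = pd.DataFrame({
    "ID": [f"person{i}" for i in range(1, n_rows + 1)],
    "first": rng.choice([1, 2, 3], size=n_rows),
    "second": rng.choice(np.linspace(10, 20, 48), size=n_rows),
    "third": rng.choice(["gg", "not"], size=n_rows),
    "fourth": rng.choice(["my", "sg", "id", "jp", "us", "bf"], size=n_rows),
    "result": rng.choice(["classA", "classB", "classC"], size=n_rows),
})

# Make some data points defective: blank out roughly 8% of the non-ID cells.
# (Assumption: this mimics rate=0.08; kero may do it differently.)
value_cols = [c for c in clean.columns if c != "ID"]
holes = rng.random((n_rows, len(value_cols))) < 0.08
defective = pd.concat([clean[["ID"]], clean[value_cols].mask(holes)], axis=1)

# Keep only the rows that have no missing value, then renumber them.
cleaned = defective.dropna().reset_index(drop=True)
print(cleaned.head())
```

Rows that lost at least one value disappear from cleaned, which is why the person numbers in the table above skip (person1, person3, person6, ...).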

Now we do the main process. You will be able to see the full data with 20 rows, the cleaned version and the transformed version with the following code.

# conversion choices
# - 1. "discrete_to_bool"
# - 2. "cont_to_scale"
# - 3. "discrete_to_int"
# - 4. ""
conj_command_set = {colname_set[0]: "",
                    colname_set[1]: "discrete_to_bool",
                    colname_set[2]: "cont_to_scale",
                    colname_set[3]: "discrete_to_bool",
                    colname_set[4]: "discrete_to_bool",
                    colname_set[5]: "discrete_to_int"}
discrete_to_int_settings = {"classA": 0, "classB": 1, "classC": 2}
cont_to_scale_settings = {"scale": [-1, 1], "mode": "uniform", "original_scale": [10, 20]}
conj_command_setting_set = {colname_set[0]: None,
                            colname_set[1]: True,
                            colname_set[2]: cont_to_scale_settings,
                            colname_set[3]: True,
                            colname_set[4]: True,
                            colname_set[5]: discrete_to_int_settings}
cleanD.build_conj_dataframe(conj_command_set, conj_command_setting_set=conj_command_setting_set)

print("\n\nCOMPARE : CLEANED\n\n")
print("\n\nCOMPARE : CONJUGATED\n\n")
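
To get a feel for the three conversion choices, here is a small pandas-only sketch. It reflects our reading of the option names and is not kero's implementation: we assume "discrete_to_bool" behaves like one-hot encoding, that "cont_to_scale" with mode "uniform" is a straight linear map from original_scale onto scale, and that "discrete_to_int" is a dictionary lookup. The toy table and the names third_bool, second_scaled, result_int and conj exist only in this sketch.

```python
import pandas as pd

# A toy table with one column for each kind of conversion.
df = pd.DataFrame({
    "third": ["not", "gg", "gg"],
    "second": [10.0, 15.0, 20.0],
    "result": ["classC", "classA", "classB"],
})

# "discrete_to_bool" (assumed): one 0/1 column per category.
third_bool = pd.get_dummies(df["third"], prefix="third").astype(int)

# "cont_to_scale" (assumed): linear map from original_scale [10, 20]
# onto scale [-1, 1].
lo, hi = 10.0, 20.0
a, b = -1.0, 1.0
second_scaled = a + (df["second"] - lo) * (b - a) / (hi - lo)

# "discrete_to_int" (assumed): replace each label by its integer code.
result_int = df["result"].map({"classA": 0, "classB": 1, "classC": 2})

conj = pd.concat([third_bool, second_scaled, result_int], axis=1)
print(conj)
```

With these assumptions the values 10, 15 and 20 in the second column land on -1, 0 and 1, and the third column becomes two columns, third_gg and third_not.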


Just a reminder: colname_set[k] is the name of the k-th column, counting from 0, so colname_set[0] is the ID column. The result is as follows. Hopefully this is easy to use! 🙂


In the next part we will be doing the loan problem as promised in Part I.