clean_data

kero.DataHandler.DataTransform.py 

class clean_data:
  def __init__(self):
    self.clean_df = None
    self.clean_df_conj = None
    self.colname_set = None
    self.clean_list = None
    self.clean_list_conj = None
    self.colname_set_conj = None
    self.column_module_set = None
  def get_list_from_df(self):
    return
  def build_column_module_set(self, conj_command_set, conj_command_setting_set=None):
    return
  def build_conj_dataframe(self, conj_command_set, conj_command_setting_set=None):
    return

Properties and description:

clean_df : Pandas data frame. By convention, the data frame should not contain any defective rows.
clean_df_conj : Pandas data frame. Transformed from the clean data frame, clean_df, in the manner specified by the input commands to the function build_conj_dataframe().
colname_set : List of strings containing the names of the columns of the data frame.
clean_list : List. The list form of clean_df.
clean_list_conj : List. The list form of clean_df_conj.
colname_set_conj : List of strings containing the names of the columns of the conjugated data frame.
column_module_set : List of objects of the class column_module.
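
A minimal sketch of how these properties might be accessed, assuming a clean_data object produced by data_sieve() (documented below) and assuming that get_list_from_df() populates clean_list and colname_set from clean_df:

import kero.DataHandler.DataTransform as dt

cleanD, _, _ = dt.data_sieve(df)  # df is an existing pandas data frame
print(cleanD.clean_df)            # the clean data frame
cleanD.get_list_from_df()         # assumption: fills clean_list and colname_set
print(cleanD.colname_set)
print(cleanD.clean_list)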

kero version: 0.1 and above

original_data

kero.DataHandler.DataTransform.py

class original_data:
  def __init__(self):
    self.original_df = None
    self.repaired_df = None
    self.original_list = None
    self.repaired_list = None
    return
  def initialize_dataframe_repair(self):
    return
  def repair_table(self, repair_command_settings=None):
    return
  def repair_single_column(self, column_key, mode=None):
    return

Properties and description:

original_df : Pandas data frame. The original data frame.
repaired_df : Pandas data frame, with defective rows fixed according to the commands specified via repair_table().
original_list : List. The list form of original_df.
repaired_list : List. The list form of repaired_df.
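
A minimal sketch of the repair workflow, assuming an original_data object produced by data_sieve() (documented below) and default repair commands; this is an illustration of the listed methods, not a verified recipe:

import kero.DataHandler.DataTransform as dt

_, _, origD = dt.data_sieve(df)      # df is an existing pandas data frame
print(origD.original_df)             # the raw table, defects included
origD.initialize_dataframe_repair()  # assumption: prepares repaired_df
origD.repair_table()                 # default repair_command_settings
print(origD.repaired_df)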

kero version: 0.1 and above

data_sieve()

This function splits a data frame into its clean part and its defective part, so that the clean part can be used for processing or analysis.

kero.DataHandler.DataTransform.py

def data_sieve(dataframe):
  return cleanD, crippD, origD

 

dataframe : (pandas data frame) The input pandas data frame.

return cleanD : (clean_data) A clean_data object. This object has the property “clean_df”. cleanD.clean_df is the original data frame with all defective rows removed.

return crippD : (crippled_data) A crippled_data object. This object has the property “crippled_df”. crippD.crippled_df is a pandas data frame made of only the defective rows of the original data frame.

return origD : (original_data) An original_data object. The input dataframe for data_sieve() is stored as the property “original_df” of this origD object.

Example usage 1

import kero.DataHandler.DataTransform as dt
import kero.DataHandler.RandomDataFrame as RDF
import numpy as np

rdf = RDF.RandomDataFrame()
col1 = {"column_name": "first", "items": [1, 2, 3]}
itemlist = list(np.linspace(10, 20, 48))
col2 = {"column_name": "second", "items": itemlist}
col3 = {"column_name": "third", "items": ["gg", "not"]}
col4 = {"column_name": "fourth", "items": ["my", "sg", "id", "jp", "us", "bf"]}

df, _ = rdf.initiate_random_table(20, col1, col2, col3, col4, panda=True, with_unique_ID=None)
df = rdf.crepify_table(df, 0.05)

cleanD, crippD, origD = dt.data_sieve(df)
print(origD.original_df)
print("\nclean part")
print(cleanD.clean_df)
print('\ncrippled part:')
print(crippD.crippled_df)

kero version: 0.1 and above

Heuristic vs schema

I was reading Aronson, Wilson, and Akert’s Social Psychology textbook, chapter 3, and wondered what the difference is between a schema and a heuristic. It might seem trivial, but I will lay down the definitions and do some comparisons anyway.

In social psychology, a schema is “the mental structure that organizes our knowledge about the social world”. Wikipedia furnishes us with more definitions (do look up the references), and I would go with the casual definition of a schema: the way we think about something. But it doesn’t end there. A schema can be very dynamic, and the concept of accessibility is important. If I mention the word “delicious” and the ramen you ate 30 minutes ago was superb, the ramen may be the first thing that comes to mind. You might just as well have thought of the street food you had at the night market yesterday, or anything else, but cognitive scientists and social psychologists would say that priming has occurred. In other words, in your thoughts, the accessibility of that very ramen was boosted; the ramen was brought to the forefront of your mind just because you experienced its deliciousness so recently.

It could be more general, or even absurd out of context. If I mentioned “apple” to you, you might think of a red fruit, or perhaps a green fruit, or a smartphone; or, if you are a day trader who just ate an apple but lost $20k in the last dip of the company’s stock price, you might think about a new trading strategy. Oh, and the self-fulfilling prophecy is a derivative of the schema concept.

AWA’s textbook (let’s call it that for brevity) begins with the idea of heuristics under the section “mental strategies and shortcuts”. AWA introduces judgmental heuristics as the mental shortcuts we use for efficient day-to-day operation. We do not want to think through all possible ways to sate our appetite and run a cost-benefit analysis for each option every lunchtime. We could just think “let’s go to food court A again but have a plate of Korean BBQ chicken today”, etc. I do think it sounds like a schema; let’s find some definitions then. Google says that something heuristic is something that allows people to learn something for themselves. The word comes from a Greek word meaning “to discover”. Hmm, not quite helpful. Wikipedia, though, emphasizes the idea of efficiency: a method not necessarily optimal but sufficient for the immediate goal. From these I would take a heuristic as a way of thinking which puts emphasis on immediacy. To put it into practice right away, I think it is a good idea to stick with calling a heuristic a mental shortcut.

It seems that in the ramen example I have described some sort of heuristic as well. Indeed, AWA says that there is the availability heuristic: we make mental shortcuts from what is available to us. There is also the representative heuristic, which is essentially a way of thinking about something based on generalized features. In applying the representative heuristic, we often use base rate information. The idea is simple: we generalise based on frequently occurring information, regardless of truth; for example, I recently heard some people saying “he’s an Indian, therefore he must be an IT guy”. Base rate information is sometimes good enough, though. Check out the base rate fallacy: studies have shown that when presented with both base rate information and specific information, people tend to ignore the base rate; this is a logical fallacy. There is some nice mathematics involved.
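
To sketch that mathematics (a standard textbook illustration, not AWA’s own example): suppose 1% of a population has a condition, a test detects it 90% of the time, and the test falsely flags 9% of healthy people. By Bayes’ theorem,

P(condition | positive) = (0.9 × 0.01) / (0.9 × 0.01 + 0.09 × 0.99) ≈ 0.09,

so despite the positive result, the chance of actually having the condition is only about 9%. Estimating it at around 90% by attending only to the specific information and ignoring the 1% base rate is exactly the fallacy.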

Before we end, note that you can find people giving different answers about the difference between schema and heuristic; in some discussions, for example, people say they are not even related. Heuristics do get associated with problem solving (see, e.g., Pólya and Duncker, 1945), and this relates to another sense of the word, which is to enable learning by oneself. We do talk about the heuristic method of teaching but not a “schematic” method; then again, “schematic method” does not sound too bad either, if I were to mean a method that follows some scheme rather than a chaotic random guess at a problem. If I were to give one or two more opinions, I would just say that language is dynamic and not one-dimensional. We use a single word to mean multiple things, and multiple words to mean the same thing.

P.S. After the section about heuristics, AWA talks about the Barnum effect and how it is related to the representative heuristic.

multi_cause_rank()

Creates a number table that counts data points whose features (the independent variables) take particular values, and relates them to a dependent variable. See the example below.

kero.DataHandler.DataVisual.py

class visual_lab():
  def multi_cause_rank(self, df, indep_var, dep_var):
    return

df : pandas data frame.
indep_var : list of strings. For indep_var = [column1, column2, …] with possible values

column1 : x1, x2, …

column2 : y1, y2, …

the function constructs a table with (x1,y1), (x1,y2), …, (x2,y1), etc. as the independent variables.

dep_var : string. For dep_var with possible values z1, z2, … and considering indep_var, the tables created have columns z1, z2, … and rows (x1,y1), (x1,y2), …, (x2,y1), where each entry (xN, yM, zL) counts the number of such data points.

return : a number_data object (as used in the example below, with properties such as df_number and df_frac).
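
For intuition, the counting itself can be sketched in plain pandas (an illustration, not kero’s implementation), assuming df, indep_var, and dep_var as above:

import pandas as pd

# count data points per (independent values, dependent value) combination,
# then arrange the dependent values as columns
counts = df.groupby(indep_var + [dep_var]).size().unstack(dep_var, fill_value=0)
print(counts)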

Example Usage 1

import pandas as pd
import kero.DataHandler.RandomDataFrame as RDF
import kero.DataHandler.DataVisual as dv


rdf = RDF.RandomDataFrame()
col1 = {"column_name": "first", "items": [1, 2, 3,4]}
col2 = {"column_name": "second", "items": ["A","B", "C", "D", "E", "F", "G"]}
col3 = {"column_name": "third", "items": ["gg", "not"]}
col4 = {"column_name": "fourth", "items": ["my", "sg", "id", "jp", "us", "bf"]}
col5 = {"column_name": "output", "items": ["classA", "classB", "classC"]}

rdf.initiate_random_table(1200, col1, col2, col3, col4, col5, panda=True)
rdf.crepify_table(rdf.clean_df, rate=0.005)
csv_name = 'check_multicause_rank.csv'
rdf.crepified_df.to_csv(csv_name, index=False)
df = pd.read_csv(csv_name)

vlab = dv.visual_lab()
vlab.prepare_for_visual(df)
print(vlab.cleanD.clean_df.head())

The above creates a random table, as partially shown here.

   first second third fourth  output
0    1.0      E    gg     id  classA
1    2.0      A   not     id  classC
2    3.0      D    gg     bf  classC
3    3.0      A   not     bf  classB
4    2.0      F   not     id  classC

The following code constructs a table counting the number of data points with one feature and one output value. The table entries are fractionalized using the function get_frac_from_number(), and then rows whose fractions are all below 0.9 are dropped using drop_data_points_below().

indep_var = ["first"]
dep_var = "output"
numD1 = vlab.multi_cause_rank(vlab.cleanD.clean_df, indep_var, dep_var)
numD1.drop_data_points_below(frac=0.9)
print("\none dimensional:")
print(numD1.df_number)
print(numD1.df_frac)
print("-->Dropped version:\n", numD1.df_frac_drop)

The output is shown here; for example, there are 90 data points with ‘first’ column value 2.0 and output “classB”.

one dimensional:
       classA  classB  classC
[1.0]    79.0   110.0    90.0
[2.0]   100.0    90.0    99.0
[3.0]    93.0   116.0   107.0
[4.0]    98.0    93.0   102.0
         classA    classB    classC
[1.0]  0.681034  0.948276  0.775862
[2.0]  0.862069  0.775862  0.853448
[3.0]  0.801724  1.000000  0.922414
[4.0]  0.844828  0.801724  0.879310
-->Dropped version:
         classA    classB    classC
[1.0]  0.681034  0.948276  0.775862
[3.0]  0.801724  1.000000  0.922414

Since this table is randomly generated, we do not see any strong relations. A table with strong relations could look like the following, for example, suggesting associations between classA and 1.0, classB and 2.0, and classC and 3.0.

       classA  classB  classC
[1.0]    0.90    0.10    0.05
[2.0]    0.02    0.88    0.24
[3.0]    0.04    0.10    0.99
[4.0]    0.04    0.54    0.10

The following code shows how we can try to observe the relation between two features and the output.

print("\ntwo dimensional:")
indep_var = ["first","second"]
dep_var = "output"
numD2= vlab.multi_cause_rank(vlab.cleanD.clean_df, indep_var, dep_var)
numD2.drop_data_points_below(frac=0.9)
print(numD2.df_number)
print(numD2.df_frac)
print("-->Dropped version:\n", numD2.df_frac_drop)

The tables are partially as follows.

two dimensional:
             classA  classB  classC
[1.0, 'A']     9.0    14.0    11.0
[1.0, 'B']    10.0    18.0    15.0
[1.0, 'C']    14.0    12.0    19.0
[1.0, 'D']    12.0    11.0    15.0
[1.0, 'E']    10.0    21.0    10.0
[1.0, 'F']    15.0    13.0    10.0
[1.0, 'G']     9.0    21.0    10.0
[2.0, 'A']    14.0    18.0    14.0
[2.0, 'B']    14.0    11.0    13.0
[2.0, 'C']    11.0    12.0    11.0
...
              classA    classB    classC
[1.0, 'A']  0.375000  0.583333  0.458333
[1.0, 'B']  0.416667  0.750000  0.625000
[1.0, 'C']  0.583333  0.500000  0.791667
[1.0, 'D']  0.500000  0.458333  0.625000
[1.0, 'E']  0.416667  0.875000  0.416667
[1.0, 'F']  0.625000  0.541667  0.416667
[1.0, 'G']  0.375000  0.875000  0.416667
[2.0, 'A']  0.583333  0.750000  0.583333
[2.0, 'B']  0.583333  0.458333  0.541667
[2.0, 'C']  0.458333  0.500000  0.458333
...
-->Dropped version:
              classA    classB    classC
[4.0, 'D']  0.791667  1.000000  0.708333

kero version: 0.1 and above

get_frac_from_number()

home > kero > Documentation

kero.DataHandler.DataVisual.py

class number_data():
  def get_frac_from_number(self):
    return

The example below shows how the table

     first  second  third  fourth
aa0      73      69     16      99
aa1      99       6     28      77
aa2       8      47      7      95
aa3      88      43     84      34
aa4      26       8     98      56

is converted to the fractional version. Each entry above is divided by the maximum number in the table, which is 99.

        first    second     third    fourth
aa0   0.737374  0.696970  0.161616  1.000000
aa1   1.000000  0.060606  0.282828  0.777778
aa2   0.080808  0.474747  0.070707  0.959596
aa3   0.888889  0.434343  0.848485  0.343434
aa4   0.262626  0.080808  0.989899  0.565657
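
The conversion itself can be sketched in one line of pandas (an illustration, not kero’s implementation), assuming the counts are stored in the data frame df_number:

# divide every entry by the single largest entry in the whole table
df_frac = df_number / df_number.values.max()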

See multi_cause_rank() for a usage example.

kero version: 0.1 and above

drop_data_points_below()

For a table whose entries are either zero or positive integers, we might want to convert them into fractions and then drop the rows whose entries are all below a certain value. See an example under multi_cause_rank().

kero.DataHandler.DataVisual.py

class number_data():
  def drop_data_points_below(self, frac=0.6):
    return

frac : double. If all the entries in a row of the converted table are below this fraction, the row is dropped.
return : None. The dropped version of the table is stored as the property df_frac_drop.
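
The drop itself can be sketched in plain pandas (an illustration, not kero’s implementation): keep a row if at least one of its entries reaches the threshold.

# keep rows where any fractional entry is at least `frac`
df_frac_drop = df_frac[(df_frac >= frac).any(axis=1)]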

Example Usage 1

The following code creates a table as described above.

import kero.DataHandler.RandomDataFrame as RDF
import kero.DataHandler.DataVisual as dv

rdf = RDF.RandomDataFrame()
itemlist = range(100)
col1 = {"column_name": "first", "items": itemlist}
col2 = {"column_name": "second", "items": itemlist}
col3 = {"column_name": "third", "items": itemlist}
col4 = {"column_name": "fourth", "items": itemlist}
N_row = 20
rdf.initiate_random_table(N_row, col1, col2, col3, col4, panda=True, row_name_list='aa')
print(rdf.clean_df)

The output is partially the following.

      first  second  third  fourth
aa0      73      69     16      99
aa1      99       6     28      77
aa2       8      47      7      95
aa3      88      43     84      34
aa4      26       8     98      56

The following code then converts the table into fractions.

numD = dv.number_data()
numD.df_number = rdf.clean_df
numD.get_frac_from_number()
print(numD.df_frac)

         first    second     third    fourth
aa0   0.737374  0.696970  0.161616  1.000000
aa1   1.000000  0.060606  0.282828  0.777778
aa2   0.080808  0.474747  0.070707  0.959596
aa3   0.888889  0.434343  0.848485  0.343434
aa4   0.262626  0.080808  0.989899  0.565657
aa5   0.181818  0.636364  0.171717  0.292929
aa6   0.202020  0.888889  0.373737  0.090909
aa7   0.848485  0.202020  0.525253  0.686869
aa8   0.434343  0.515152  0.171717  0.505051
aa9   0.505051  0.010101  0.212121  0.606061
aa10  0.101010  0.272727  0.373737  0.959596
aa11  0.404040  0.909091  0.878788  0.313131
aa12  0.737374  0.222222  0.040404  0.868687
aa13  0.555556  0.191919  0.202020  0.676768
aa14  0.767677  0.404040  0.666667  0.191919
aa15  0.171717  0.050505  0.131313  0.919192
aa16  0.444444  0.242424  0.434343  0.212121
aa17  0.252525  0.777778  0.151515  0.282828
aa18  0.444444  0.181818  0.262626  0.222222
aa19  0.565657  0.393939  0.373737  0.969697

and we drop the rows whose fractions are all below 0.9 with this code.

numD.drop_data_points_below(frac=0.9)
print("dropping...\n",numD.df_frac_drop)

We are left with the following table. In row ‘aa15’, for example, we can see that aa15 and ‘fourth’ have a relatively strong relationship compared to the other columns.

          first    second     third    fourth
aa0   0.737374  0.696970  0.161616  1.000000
aa1   1.000000  0.060606  0.282828  0.777778
aa2   0.080808  0.474747  0.070707  0.959596
aa4   0.262626  0.080808  0.989899  0.565657
aa10  0.101010  0.272727  0.373737  0.959596
aa11  0.404040  0.909091  0.878788  0.313131
aa15  0.171717  0.050505  0.131313  0.919192
aa19  0.565657  0.393939  0.373737  0.969697

kero version: 0.1 and above

Cartesian Product

Outputs the Cartesian product in the mathematical sense. For example, given x=[1,2] and y=[3,4], the output is [[1,3],[1,4],[2,3],[2,4]].

kero.DataHandler.Generic.py

def cartesian_product(full_set, toggle=True):
  return out
full_set : list of lists.
toggle : Boolean. If True, when full_set consists of a single element, full_set = [x], then return [[x[0]], [x[1]], …]. Otherwise, return [x[0], x[1], …].
return out : list of lists.
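
For reference, equivalent behavior can be sketched with itertools (an illustration, not kero’s implementation):

from itertools import product

def cartesian_product_sketch(full_set, toggle=True):
    # single-list input: wrap each element in a list when toggle is True
    if len(full_set) == 1:
        return [[v] for v in full_set[0]] if toggle else list(full_set[0])
    # general case: Cartesian product, with each combination as a list
    return [list(t) for t in product(*full_set)]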

It is faster to show the results.

from  kero.DataHandler.Generic import *

x = [1, 2]
y = [3, 4]
z = [5, 6, 7]
k = [88, 99, 100, 200]

for x0 in cartesian_product([x,y,z]):
    print(x0)
print("\nbreak\n")
gg = cartesian_product([x, y, z, k])
for x1 in gg:
    print(x1)

The output is as the following.

[1, 3, 5]
[1, 3, 6]
[1, 3, 7]
[1, 4, 5]
[1, 4, 6]
[1, 4, 7]
[2, 3, 5]
[2, 3, 6]
[2, 3, 7]
[2, 4, 5]
[2, 4, 6]
[2, 4, 7]

break

[1, 3, 5, 88]
[1, 3, 5, 99]
[1, 3, 5, 100]
[1, 3, 5, 200]
[1, 3, 6, 88]
[1, 3, 6, 99]
[1, 3, 6, 100]
[1, 3, 6, 200]
[1, 3, 7, 88]
[1, 3, 7, 99]
[1, 3, 7, 100]
[1, 3, 7, 200]
[1, 4, 5, 88]
[1, 4, 5, 99]
[1, 4, 5, 100]
[1, 4, 5, 200]
[1, 4, 6, 88]
[1, 4, 6, 99]
[1, 4, 6, 100]
[1, 4, 6, 200]
[1, 4, 7, 88]
[1, 4, 7, 99]
[1, 4, 7, 100]
[1, 4, 7, 200]
[2, 3, 5, 88]
[2, 3, 5, 99]
[2, 3, 5, 100]
[2, 3, 5, 200]
[2, 3, 6, 88]
[2, 3, 6, 99]
[2, 3, 6, 100]
[2, 3, 6, 200]
[2, 3, 7, 88]
[2, 3, 7, 99]
[2, 3, 7, 100]
[2, 3, 7, 200]
[2, 4, 5, 88]
[2, 4, 5, 99]
[2, 4, 5, 100]
[2, 4, 5, 200]
[2, 4, 6, 88]
[2, 4, 6, 99]
[2, 4, 6, 100]
[2, 4, 6, 200]
[2, 4, 7, 88]
[2, 4, 7, 99]
[2, 4, 7, 100]
[2, 4, 7, 200]

kero version: 0.1 and above

prepare_for_visual()

prepare_for_visual() automatically separates the defective and clean parts of a data frame. It is a method of the visual_lab object.

This method stores the input data frame as the property “df” of the visual_lab object. Then, through the function data_sieve(), it creates three more objects and makes them properties of visual_lab:

  1. original_data
  2. clean_data
  3. crippled_data

These correspond to (1) the object that stores the original data, (2) the object that stores the clean data, i.e. the data whose defective rows have been dropped, and (3) the object that stores the crippled data, i.e. the defective part of the data. Refer to the examples or the respective documentation on how to access the data frames.

kero.DataHandler.DataVisual.py

class visual_lab:
  def prepare_for_visual(self,dataframe):
    return
dataframe : (pandas data frame) The input pandas data frame.

Example usage 1.

import kero.DataHandler.RandomDataFrame as RDF
import kero.DataHandler.DataVisual as dv
import numpy as np

rdf = RDF.RandomDataFrame()
col1 = {"column_name": "first", "items": [1, 2, 3]}
itemlist = list(np.linspace(10, 20, 48))
col2 = {"column_name": "second", "items": itemlist}
col3 = {"column_name": "third", "items": ["gg", "not"]}
col4 = {"column_name": "fourth", "items": ["my", "sg", "id", "jp", "us", "bf"]}

df, _ = rdf.initiate_random_table(20, col1, col2, col3, col4, panda=True, with_unique_ID=None)

The above only creates a random table. The method is then used straightforwardly by feeding it a pandas data frame. This example also shows how the data are accessed.

vlab = dv.visual_lab()
vlab.prepare_for_visual(df)
#### for checking ####
print(vlab.df)
print("\nclean df:\n")
print(vlab.cleanD.clean_df)
print("\ncrippled df:\n")
print(vlab.crippledD.crippled_df)
print("\n\n")
#######################
vlab.generate_level_1_report(label_name="gg")

kero version: 0.1 and above