crepify_table()


This function takes in a pandas DataFrame and makes some data points invalid by removing some of the data. It will save the output into a csv file.

kero.DataHandler.RandomDataFrame.py

class RandomDataFrame:
  def crepify_table(self, dataframe, rate=0.01, column_index_exception=None):
    return df
dataframe: pandas DataFrame.

rate: fraction between 0 and 1, or None.

– default value = 0.01

– if set to None, nothing will happen.

Otherwise, each column whose index is not specified in column_index_exception will be punctured with blanks at the probabilistic rate.

column_index_exception: indices of the columns not to be punctured, e.g. [0, 1].

return df: df is a pandas DataFrame.
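To make the puncturing idea concrete, here is a minimal sketch in plain pandas and NumPy. It is not kero's implementation: the function name puncture_columns, the seed argument and the use of NaN as the "blank" are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd


def puncture_columns(dataframe, rate=0.01, column_index_exception=None, seed=None):
    """Hypothetical helper: blank out cells at random, column by column."""
    if rate is None:
        # Mirrors the documented behaviour: nothing happens when rate is None.
        return dataframe.copy()
    skipped = set(column_index_exception or [])
    rng = np.random.default_rng(seed)
    out = dataframe.copy()
    for idx, name in enumerate(out.columns):
        if idx in skipped:
            continue  # this column is protected from puncturing
        # Each cell is blanked independently with probability `rate`.
        mask = rng.random(len(out)) < rate
        out[name] = out[name].where(~mask)  # masked cells become NaN
    return out


df = pd.DataFrame({"first": [1, 2, 3, 1], "second": ["a", "b", "c", "d"]})
punctured = puncture_columns(df, rate=0.5, column_index_exception=[0], seed=0)
print(punctured)
```

In this sketch the column at index 0 is left untouched, while roughly half of the cells in the other column are blanked. Which cells are hit depends on the seed.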

Example usage 1

import numpy as np
import kero.DataHandler.RandomDataFrame as RDF

rdf = RDF.RandomDataFrame()
output_label = "classification"
csv_name = "check_table_defect_index.csv"
rate = 0.01
with_unique_ID = True

col1 = {"column_name": "first", "items": [1, 2, 3]}
itemlist = list(np.linspace(10, 20, 48))
col2 = {"column_name": "second", "items": itemlist}
col3 = {"column_name": "third", "items": ["gg", "not"]}
col4 = {"column_name": "fourth", "items": ["my", "sg", "id", "jp", "us", "bf"]}

if output_label is not None:
    if output_label=="classification":
        col_out={"column_name": "result", "items": ["classA","classB","classC"]}
        rdf.initiate_random_table(20, col1, col2, col3, col4,col_out, panda=True)
else:
    rdf.initiate_random_table(20, col1, col2, col3, col4, panda=True,with_unique_ID=with_unique_ID)

Up to here we have only created a data frame. Do not worry about output_label: in this example, it tells the code that this is a classification rather than a regression problem. Notice that col1 to col4 each specify a column name and the possible values that the column can take; each is a dictionary. Now we puncture all columns of the data frame.

rdf.crepify_table(rdf.clean_df, rate=rate)
try:
    rdf.crepified_df.to_csv(csv_name, index=False)
except Exception as err:
    print("check_initiate_random_table. Error:", err)

Here is a snapshot of the csv file created by this process.

defecttable

kero version: 0.1 and above

Pre-processing Part I. Generate Random Data.

 

Update (23 July 2018): The python package kero is up and ready! Here is the code in github.

Hiyah! We know this is the age of data science, IoT and blockchains, and there is more interesting stuff happening. As much as I want to, I cannot follow all of it at once, so I shall settle down with data, data and more data! For this post, I would like to demonstrate the usage of a package I wrote called DataHandler, available in my github. We will still use PyCharm and python 3.5.1.

The final objective is of course to process data: given a probably very large volume of data, we train a model and make it predict the outcome of a new set of data. For example, later on I will post the usage of a Convolutional Neural Network (CNN) on the Loan Problem. In the example, we have the data of a number of applicants and their eligibility status. Each applicant is either eligible or not, based on, say, their income etc. We make a CNN model, and then train it with this set of data, usually named “train.csv”. Then we will be given “test.csv”, which is the same as “train.csv” except that we do not know whether each applicant is eligible for the loan or not. We want our CNN model to tell us whether the applicants in “test.csv” are eligible.

The problem is, our “train.csv” is not always clean. It comes with all sorts of problems, for example misspellings or missing data. We will have to do some patch work and fill in the blanks. We may never know what the missing data are, but we can make a reasonable guess, or we may even take the average of the rest of the data and fill the blank with it, just so that our model will not stumble upon empty data and get stuck.
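The "take the average and fill the blank" idea can be sketched with plain pandas. The column names and numbers below are made up for illustration; this is not the DataHandler code, only the general technique.

```python
import numpy as np
import pandas as pd

# Made-up applicant data with two blanks in the "income" column.
train = pd.DataFrame({
    "applicant": ["a1", "a2", "a3", "a4"],
    "income": [3000.0, np.nan, 5000.0, np.nan],
})

# Average of the values that are present (NaN is skipped by default).
mean_income = train["income"].mean()

# Fill every blank with that average.
train["income"] = train["income"].fillna(mean_income)
print(train)
```

Here the two blanks are replaced by 4000.0, the mean of the two known incomes.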

This pre-processing series talks about the DataHandler package that is capable of doing this patch work.

Initiate Random Tables

We want to simulate having data of various shapes and sizes. This package provides a way to systematically initiate random data tables that can hold different data types, such as integers, strings, doubles etc.

Create an empty python project using PyCharm, name it say, myrandtable. Copy every file from folder v0.2 downloaded or cloned from this github repository into it. In this project, create file createtable.py with the following code.

import kero.DataHandler.RandomDataFrame as RDF
import numpy as np

rdf=RDF.RandomDataFrame()
col1={"column_name": "first", "items":[1,2,3]}
itemlist=list(np.linspace(10,20,8))
col2={"column_name": "second", "items": itemlist}
df,_=rdf.initiate_random_table(4,col1,col2,panda=True)
print(df)

Output like this will be printed, with 2 columns, namely “first” and “second”.

       first      second
0      1       17.142857
1      3       12.857143
2      1       17.142857
3      2       10.000000

We have created our random table! See the documentation here, or in the user manual from the repository.

Puncture a table

Now we simulate having data with some defects, such as blanks. In the same project, create a new file createbadtable.py with the following code.

import numpy as np
import kero.DataHandler.RandomDataFrame as RDF

rdf = RDF.RandomDataFrame()
output_label = "classification"
csv_name = "check_table_defect_index.csv"
rate = 0.01
with_unique_ID = True

col1 = {"column_name": "first", "items": [1, 2, 3]}
itemlist = list(np.linspace(10, 20, 48))
col2 = {"column_name": "second", "items": itemlist}
col3 = {"column_name": "third", "items": ["gg", "not"]}
col4 = {"column_name": "fourth", "items": ["my", "sg", "id", "jp", "us", "bf"]}

if output_label is not None:
    if output_label=="classification":
        col_out={"column_name": "result", "items": ["classA","classB","classC"]}
        rdf.initiate_random_table(20, col1, col2, col3, col4,col_out, panda=True)
else:
    rdf.initiate_random_table(20, col1, col2, col3, col4, panda=True,with_unique_ID=with_unique_ID)
# up to here we are creating a dataframe
# now we puncture all columns of the dataframe.
rdf.crepify_table(rdf.clean_df, rate=rate)
try:
    rdf.crepified_df.to_csv(csv_name, index=False)
except Exception as err:
    print("check_initiate_random_table. Error:", err)

The code yields a check_table_defect_index.csv file that looks like the following. The documentation is here.

defecttable

Now we are good! We can tinker with this sort of “realistic” data, manipulate it, patch it etc. into a format suitable for processing by machine learning methods like a Convolutional Neural Network. We shall see other pre-processing methods in the next part.

initiate_random_table()


This function, as the name suggests, systematically initiates a random table. We can specify the column data type etc.

kero.DataHandler.RandomDataFrame.py

class RandomDataFrame:
  def initiate_random_table(self, n_row, *argv, panda=True, with_unique_ID=None):
    return out
n_row (integer): number of row entries.

*argv (dictionary): for each k, argv[k] is a dictionary.

– key: name of column

– value: possible data points in the column

Each column k is populated uniformly at random with x, where x is any one of the elements of value.

panda (Bool):

– True: output tuple (df, []) where df is the random pandas DataFrame

– False: output (x, y) such that x is the matrix that forms a randomly initiated table and y is the set of column names

with_unique_ID (String): if set to None, nothing happens.

If set to a string, a unique ID is added as the first column in this manner:

– if with_unique_ID="person" then the IDs will be person1, person2 etc. in order

row_name_list: if set to None, nothing happens.

If set to a string str, the rows will be named str1, str2, …

return out: tuple, either (df, []) or (x, y) as specified by the argument panda above.
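The uniform random population of each column can be sketched as follows. This is a guess at the general idea and not kero's actual code: the helper name random_table and its seed argument are hypothetical, and only the column dictionary layout ("column_name" and "items") is borrowed from the examples on this page.

```python
import numpy as np
import pandas as pd


def random_table(n_row, *columns, seed=None):
    """Hypothetical helper: build a DataFrame by sampling each column uniformly."""
    rng = np.random.default_rng(seed)
    data = {}
    for col in columns:
        items = list(col["items"])
        # Pick n_row positions uniformly at random, with replacement.
        picks = rng.integers(0, len(items), size=n_row)
        data[col["column_name"]] = [items[i] for i in picks]
    return pd.DataFrame(data)


col1 = {"column_name": "first", "items": [1, 2, 3]}
col2 = {"column_name": "second", "items": list(np.linspace(10, 20, 8))}
df = random_table(4, col1, col2, seed=0)
print(df)
```

Every cell is drawn independently, so a value may repeat within a column, as it does in the example outputs on this page.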

Example usage 1.

import kero.DataHandler.RandomDataFrame as RDF
import numpy as np

rdf=RDF.RandomDataFrame()
col1={"column_name": "first", "items":[1,2,3]}
itemlist=list(np.linspace(10,20,8))
col2={"column_name": "second", "items": itemlist}
df,_=rdf.initiate_random_table(4,col1,col2,panda=True)
print(df)

The output looks like the following.

   first     second
0      3  17.142857
1      3  18.571429
2      1  15.714286
3      1  11.428571

Example usage 2.

See example usage 1 in section “Puncture a table” in this link.

Example usage 3.

import kero.DataHandler.RandomDataFrame as RDF

rdf = RDF.RandomDataFrame()
itemlist = range(100)
col1 = {"column_name": "first", "items": itemlist}
col2 = {"column_name": "second", "items": itemlist}
col3 = {"column_name": "third", "items": itemlist}
col4 = {"column_name": "fourth", "items": itemlist}
N_row = 20
# rdf.initiate_random_table(N_row, col1, col2, col3, col4, panda=True)
# rdf.clean_df.index = ["".join(('aa', str(x))) for x in range(N_row)]
rdf.initiate_random_table(N_row, col1, col2, col3, col4, panda=True, row_name_list='aa')
print(rdf.clean_df)

The example output is partially shown here.

      first  second  third  fourth
aa0      50      14     25       3
aa1      91      28     52      90
aa2      36      90      4      82
aa3      54      82     85      46
aa4      81      57     68      94
aa5      89      52     51      24

Example Usage 4

See example usage 1 in repair_single_column. The example shows how a random table is initiated with unique ID.

kero version: 0.1 and above

Standalone Python Program Part II

Continuing, we will demonstrate how pyinstaller is used to build a project called _py_Haldane. Create a new folder called “myproject2”. The source code is in my github, in the folder _py_Haldane. Clone the repository or download it as a zip, then copy all the files in src into myproject2. The repository contains the manual for installation and usage, but we will explore a little differently here.

We will use PowerShell to build this standalone program as well. Make sure you have installed pyinstaller; otherwise, refer to part I. Now, we will (1) move into the myproject2 folder and then (2) build the standalone program. In my case, we enter the following lines in PowerShell.

cd Desktop/myproject2
pyinstaller main.py

Then, copy the folder Safekeep downloaded from github into myproject2/dist/main. There, run main.exe. In the File name field, insert Haldane_eigen1_3Sept, set the radio button to “eigen” mode and click the Load button. Figure 1 will be shown, displaying the eigenstate of the Haldane model we modified. Easily done!

py_Haldane.png

Figure 1. program _py_Haldane in action.

Next we explore the “time” mode. This program is actually just a viewer. The actual values are generated by python code in my actual Final Year Project. Set the radio button to “time” and File name to “Haldane28Aug17_high_1_”, one of the data sets we have generated and saved in the Safekeep folder. Click Load, then click <<. You will see figure 2(A), marked by index = 0. Then click Next repeatedly to see the time evolution of the state. Figure 2(B) shows the state at index = 25. From the color, we can see that initially, with some energy supply, the electron is initialized at the bottom-left corner and left to evolve. In this particular example, the intensity increases and creeps along the bottom edge.

timepyhaldane.PNG

Figure 2. _py_Haldane showing (A) a modified Haldane system initialized at bottom-left corner (B) a time-evolved state.

Physics explanation

The 2D Haldane model is one of the simplest topologically non-trivial systems. A system like graphene is a 2D sheet of carbon atoms linked in hexagonal structures. Electrons can hop from one carbon atom to another. The eigenstate, i.e. the state that can stay as it is even after time has passed, can have many patterns. When the system is in its topologically non-trivial state, it can take the shape of the so-called edge state: the electron is more likely to be found at the edges (or corners) of the 2D sheet, as in figure 3. The figure is taken directly from here. The site also gives some good information on this topic. The red dots show carbon atoms, while the black dots show the “intensity” of the electron, i.e. the likelihood that an electron will be found at that site: the bigger the black dot, the more likely it is to be found there.

The interpretation of the absolute value of intensity is not clear; by normalizing the amplitude, we can again reinterpret it as the probability of finding the electron at some site relative to the others.
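The normalization step can be sketched in a few lines of NumPy. The amplitudes below are made up and are not taken from the actual Haldane data; the sketch also assumes the intensity is the squared magnitude of a complex amplitude, which is the usual convention but is not spelled out in this post.

```python
import numpy as np

# Made-up complex amplitudes on four sites (illustrative values only).
amplitudes = np.array([3.0 + 0.0j, 0.0 + 4.0j, 0.0, 0.0])

# Intensity at each site is the squared magnitude of the amplitude.
intensity = np.abs(amplitudes) ** 2

# Dividing by the total makes the values sum to 1, so they read as probabilities.
probability = intensity / intensity.sum()
print(probability)
```

With these numbers the electron would be found on the second site with probability 0.64, on the first with probability 0.36, and never on the other two.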

haldanemodel

Figure 3. Edge state of the Haldane model.

Some extra stuff

This project is a side project from my final year of undergraduate study at NTU, Singapore. I have also been learning data science for a while, and will slowly update my progress here. Alright, actually I have to go back to the second part of the 4th ICT again. Signing off!

Standalone Python Program Part I

I am back from 4th ICT! Okay, that’s quite irrelevant, so let’s save it for later: this short post series will show 1. how a standalone program can be built from python files and 2. my final year project sideshow, as the example.

How was your first programming experience? You were probably asked to write a script and print something like “Hello world”, and you did, and you saw “Hello world” somewhere in that something you used. The something you used was probably an IDE, an integrated development environment. It is a cool thing: it helps you organise your files, some have the auto-correct equivalent of the programming world (IntelliSense etc), and it even helps you refactor your code, i.e. when you change the name of, say, variable myinteger to someinteger, you do not need to change all the “myinteger”s in the long code: the IDE does it for you.

All is good and fancy-sounding, but wait, we haven’t done anything. Here, we use pycharm. Create new project, name it, say “myproject”, right-click myproject with the folder icon, then add these 2 new python scripts

main.py

from sidefile import *

print("Hello world")
test_function()
input("Type in something:\n")

sidefile.py

def test_function():
    print("some function")
    return

and then Run > Run or press Alt+Shift+F10. In the console, we see that

Hello world
some function
Type in something:

is printed. All is fine, and we can type in something as requested by the last line. But as an inquisitive coder you may still wonder: how do people build a standalone program? How does a program like Microsoft Word, or perhaps Steam, or perhaps a music player exist as a standalone program, i.e. outside an IDE? Yes, you need to do something to the python files, and such a standalone program is exactly what we want to build here.

First, open a command line that can use python. In this example, we use PowerShell. The following three lines (1) download pyinstaller, (2) move into the myproject folder (wherever it is) and (3) use pyinstaller to build the standalone program. In my case, we enter the following lines in PowerShell:

pip install pyinstaller
cd Desktop/myproject
pyinstaller main.py

Check out the folder myproject/dist/main. There will be main.exe. Double-click it, and we have a very simple standalone program!

In part II, I will demonstrate how this is used to compile and build a more complex standalone program.

Some extra stuff

We use and assume some familiarity with:

  1. python 3.5.1, downloaded along with Anaconda
  2. pycharm community
  3. Windows 10

What is ICT mentioned at the start of this post? It is the in-camp training of the National Service in Singapore. We Singaporeans serve two years in the army, and, upon completion, have to serve 10 more times in the following years. For my vocation, I have gone back every year since my undergraduate study year 1 summer break. It seems like I will have to go every year after this *sigh*. Alright, it can be tiring at times, but there were fun times too!

Hello World!

It’s time to showcase things I’ve built up so far! As a heads up, this blog is gonna chronicle some technical stuff and probably some writings I did over the years. Here is some avant garde art (jkjk) from a python program, made when I was trying to check through some errors in my final year project (2017).

I did think they were somewhat nice, and named them the Pillar of Hope and the Horn of Error. See here for my self introduction.