repair_single_column

home > kero > Documentation

kero.DataHandler.DataTransform.py 

class original_data:
  def repair_single_column(self, column_key, mode=None):
    return

This function is the function of the object original_data. It looks for the defective part of a column and replaces it with an entry depending on the mode this function is set to. For example, in a column of doubles (decimal numbers), setting the mode to “mean” will fill a defective cell in the column with the average of all other cells in the column which are not defective.  To be more concrete, if the column consists of 1, 2, X, 3, where X is defective, the X is replaced with (1+2+3)/3=2.  The modes are as shown below.

column_key String. The name of the column in the panda data frame whose defective values are to be replaced/fixed.
mode String. Replace defective cells with of all non-defective values in the column. The allowed is either of the following.

1. mean: the average value.

2. mean_floor: the floor function of the average value.

3. mean_ceil: the ceiling function of the average value.

4. max_occuring: the value that occurs the most.

5. min_occuring: the value that occurs the least.

6. mid_occuring: the value that occurs median often.

If None, then nothing happens.

Where there is a tie, for example, in the most occurring value, uniform random distribution is used to choose between them.

Example Usage 1

First, we create a randomly generated table with unique ID.

import numpy as np
import pandas as pd
import kero.DataHandler.RandomDataFrame as RDF
import kero.DataHandler.DataTransform as dt

rdf = RDF.RandomDataFrame()
# col0 : NOTE THAT IN THIS EXAMPLE we have column for unique ID
col1 = {"column_name": "first", "items": [1, 2, 3]}
itemlist = list(np.linspace(10, 20, 48))
col2 = {"column_name": "second", "items": itemlist}
col3 = {"column_name": "third", "items": ["gg", "not"]}
col4 = {"column_name": "fourth", "items": ["my", "sg", "id", "jp", "us", "bf"]}
col_out = {"column_name": "result", "items": ["classA", "classB", "classC"]}
rdf.initiate_random_table(20, col1, col2, col3, col4, col_out, panda=True, with_unique_ID="person")
rdf.crepify_table(rdf.clean_df, rate=0.1)
rdf.crepified_df.to_csv("check_repair_single_column.csv", index=False)

df = pd.read_csv(r"check_repair_single_column.csv")
cleanD, _, origD = dt.data_sieve(df)  # cleanD, crippD, origD

print(origD.original_df)

The example table generated is the following.

         ID  first     second third fourth  result
0        NaN    3.0  15.319149   not     us  classC
1    person1    1.0  15.531915    gg     us  classB
2    person2    NaN  19.361702   not     sg  classA
3    person3    3.0  10.425532    gg     id  classA
4    person4    3.0  12.978723    gg     my  classB
5    person5    1.0  15.957447   not    NaN  classB
6    person6    NaN  18.936170   not     us  classC
7    person7    3.0  15.744681   not     bf  classC
8    person8    2.0  12.340426    gg     sg  classA
9    person9    1.0  12.553191   not     bf  classC
10  person10    1.0  19.361702   not     bf  classB
11  person11    1.0  18.085106    gg     id  classA
12  person12    1.0  16.170213   not     my  classC
13  person13    2.0  10.851064    gg     my  classB
14  person14    3.0  12.340426   not     my  classA
15  person15    1.0  19.787234    gg     us  classB
16  person16    1.0  13.617021   NaN    NaN  classC
17  person17    2.0  18.085106    gg     jp  classC
18  person18    3.0  11.489362    gg     sg  classB
19  person19    3.0  20.000000    gg     jp  classC

We try to do repair on the first column, which is a column of integers, with 3 different modes.

origD.initialize_dataframe_repair()
origD.repair_single_column("first", mode="mean")
print("\nCOMPARE: REPAIR by 'mean':\n", origD.repaired_df)
origD.initialize_dataframe_repair()
origD.repair_single_column("first", mode="mean_floor")
print("\nCOMPARE: REPAIR by 'mean_floor':\n", origD.repaired_df)
origD.initialize_dataframe_repair()
origD.repair_single_column("first", mode="mean_ceil")
print("\nCOMPARE: REPAIR by 'mean_ceil':\n", origD.repaired_df)

The different modes produce repaired columns as shown in the following example.

COMPARE: REPAIR by 'mean':
           ID     first     second third fourth  result
0        NaN  3.000000  15.319149   not     us  classC
1    person1  1.000000  15.531915    gg     us  classB
2    person2  1.944444  19.361702   not     sg  classA
3    person3  3.000000  10.425532    gg     id  classA
4    person4  3.000000  12.978723    gg     my  classB
5    person5  1.000000  15.957447   not    NaN  classB
6    person6  1.944444  18.936170   not     us  classC
7    person7  3.000000  15.744681   not     bf  classC
8    person8  2.000000  12.340426    gg     sg  classA
9    person9  1.000000  12.553191   not     bf  classC
10  person10  1.000000  19.361702   not     bf  classB
11  person11  1.000000  18.085106    gg     id  classA
12  person12  1.000000  16.170213   not     my  classC
13  person13  2.000000  10.851064    gg     my  classB
14  person14  3.000000  12.340426   not     my  classA
15  person15  1.000000  19.787234    gg     us  classB
16  person16  1.000000  13.617021   NaN    NaN  classC
17  person17  2.000000  18.085106    gg     jp  classC
18  person18  3.000000  11.489362    gg     sg  classB
19  person19  3.000000  20.000000    gg     jp  classC

The above replaces defective data with mean values. But sometimes the columns can only take integer. In this case, we might want to use the next 2 commands which repair them with the floor and ceiling functions applied to the mean value.

COMPARE: REPAIR by 'mean_floor':
           ID  first     second third fourth  result
0        NaN    3.0  15.319149   not     us  classC
1    person1    1.0  15.531915    gg     us  classB
2    person2    1.0  19.361702   not     sg  classA
3    person3    3.0  10.425532    gg     id  classA
4    person4    3.0  12.978723    gg     my  classB
5    person5    1.0  15.957447   not    NaN  classB
6    person6    1.0  18.936170   not     us  classC
7    person7    3.0  15.744681   not     bf  classC
8    person8    2.0  12.340426    gg     sg  classA
9    person9    1.0  12.553191   not     bf  classC
10  person10    1.0  19.361702   not     bf  classB
11  person11    1.0  18.085106    gg     id  classA
12  person12    1.0  16.170213   not     my  classC
13  person13    2.0  10.851064    gg     my  classB
14  person14    3.0  12.340426   not     my  classA
15  person15    1.0  19.787234    gg     us  classB
16  person16    1.0  13.617021   NaN    NaN  classC
17  person17    2.0  18.085106    gg     jp  classC
18  person18    3.0  11.489362    gg     sg  classB
19  person19    3.0  20.000000    gg     jp  classC

COMPARE: REPAIR by 'mean_ceil':
           ID  first     second third fourth  result
0        NaN    3.0  15.319149   not     us  classC
1    person1    1.0  15.531915    gg     us  classB
2    person2    2.0  19.361702   not     sg  classA
3    person3    3.0  10.425532    gg     id  classA
4    person4    3.0  12.978723    gg     my  classB
5    person5    1.0  15.957447   not    NaN  classB
6    person6    2.0  18.936170   not     us  classC
7    person7    3.0  15.744681   not     bf  classC
8    person8    2.0  12.340426    gg     sg  classA
9    person9    1.0  12.553191   not     bf  classC
10  person10    1.0  19.361702   not     bf  classB
11  person11    1.0  18.085106    gg     id  classA
12  person12    1.0  16.170213   not     my  classC
13  person13    2.0  10.851064    gg     my  classB
14  person14    3.0  12.340426   not     my  classA
15  person15    1.0  19.787234    gg     us  classB
16  person16    1.0  13.617021   NaN    NaN  classC
17  person17    2.0  18.085106    gg     jp  classC
18  person18    3.0  11.489362    gg     sg  classB
19  person19    3.0  20.000000    gg     jp  classC

Now we repair the “fourth” columns. The modes replace defective cells with the values that occurred the most, the least, and the median respectively.

origD.initialize_dataframe_repair()
origD.repair_single_column("fourth", mode="max_occuring")
print("\nCOMPARE: REPAIR by 'max_occuring':\n", origD.repaired_df)
origD.initialize_dataframe_repair()
origD.repair_single_column("fourth", mode="min_occuring")
print("\nCOMPARE: REPAIR by 'min_occuring':\n", origD.repaired_df)
origD.initialize_dataframe_repair()
origD.repair_single_column("fourth", mode="mid_occuring")
print("\nCOMPARE: REPAIR by 'mid_occuring':\n", origD.repaired_df)

The outputs are the following. If there are ties, for example, if maximum occurring values are “us” and “my”, then the values used to replace the defective cells are uniform randomly chosen from them. This applies to other modes as well.

COMPARE: REPAIR by 'max_occuring':
           ID  first     second third fourth  result
0        NaN    3.0  15.319149   not     us  classC
1    person1    1.0  15.531915    gg     us  classB
2    person2    NaN  19.361702   not     sg  classA
3    person3    3.0  10.425532    gg     id  classA
4    person4    3.0  12.978723    gg     my  classB
5    person5    1.0  15.957447   not     us  classB
6    person6    NaN  18.936170   not     us  classC
7    person7    3.0  15.744681   not     bf  classC
8    person8    2.0  12.340426    gg     sg  classA
9    person9    1.0  12.553191   not     bf  classC
10  person10    1.0  19.361702   not     bf  classB
11  person11    1.0  18.085106    gg     id  classA
12  person12    1.0  16.170213   not     my  classC
13  person13    2.0  10.851064    gg     my  classB
14  person14    3.0  12.340426   not     my  classA
15  person15    1.0  19.787234    gg     us  classB
16  person16    1.0  13.617021   NaN     us  classC
17  person17    2.0  18.085106    gg     jp  classC
18  person18    3.0  11.489362    gg     sg  classB
19  person19    3.0  20.000000    gg     jp  classC

COMPARE: REPAIR by 'min_occuring':
           ID  first     second third fourth  result
0        NaN    3.0  15.319149   not     us  classC
1    person1    1.0  15.531915    gg     us  classB
2    person2    NaN  19.361702   not     sg  classA
3    person3    3.0  10.425532    gg     id  classA
4    person4    3.0  12.978723    gg     my  classB
5    person5    1.0  15.957447   not     jp  classB
6    person6    NaN  18.936170   not     us  classC
7    person7    3.0  15.744681   not     bf  classC
8    person8    2.0  12.340426    gg     sg  classA
9    person9    1.0  12.553191   not     bf  classC
10  person10    1.0  19.361702   not     bf  classB
11  person11    1.0  18.085106    gg     id  classA
12  person12    1.0  16.170213   not     my  classC
13  person13    2.0  10.851064    gg     my  classB
14  person14    3.0  12.340426   not     my  classA
15  person15    1.0  19.787234    gg     us  classB
16  person16    1.0  13.617021   NaN     jp  classC
17  person17    2.0  18.085106    gg     jp  classC
18  person18    3.0  11.489362    gg     sg  classB
19  person19    3.0  20.000000    gg     jp  classC

COMPARE: REPAIR by 'mid_occuring':
           ID  first     second third fourth  result
0        NaN    3.0  15.319149   not     us  classC
1    person1    1.0  15.531915    gg     us  classB
2    person2    NaN  19.361702   not     sg  classA
3    person3    3.0  10.425532    gg     id  classA
4    person4    3.0  12.978723    gg     my  classB
5    person5    1.0  15.957447   not     sg  classB
6    person6    NaN  18.936170   not     us  classC
7    person7    3.0  15.744681   not     bf  classC
8    person8    2.0  12.340426    gg     sg  classA
9    person9    1.0  12.553191   not     bf  classC
10  person10    1.0  19.361702   not     bf  classB
11  person11    1.0  18.085106    gg     id  classA
12  person12    1.0  16.170213   not     my  classC
13  person13    2.0  10.851064    gg     my  classB
14  person14    3.0  12.340426   not     my  classA
15  person15    1.0  19.787234    gg     us  classB
16  person16    1.0  13.617021   NaN     sg  classC
17  person17    2.0  18.085106    gg     jp  classC
18  person18    3.0  11.489362    gg     sg  classB
19  person19    3.0  20.000000    gg     jp  classC

kero version: 0.1 and above