Deep Neural Network Regressor


DNN Regressor

  1. Part 1. Synthetic Regression
  2. Part 2.1. Synthetic Regression. Train and save the model.
  3. Part 2.2. Synthetic Regression. Load model and predict.

The following code can be found here, in the folder DNN regression, python, under the names synregMVar.ipynb (Jupyter notebook) and synregMVar.py.

Prerequisites: make sure TensorFlow works on your machine.

Sometimes we have a large volume of data that we can use to predict continuous values. For example, as seen here, given the features of a house, we may want to predict its price. This is in contrast to, say, MNIST, which is a classification problem: given an image of a handwritten digit, we want machine learning to tell us which digit it is. This is not a trivial task for the machine, since, for example, the number 4, when written quickly or blurred, can look like a 9. In the MNIST problem, we are classifying data (images) into either 0, 1, … or 9. For the house pricing problem, however, the price is continuous: it could be $1,234,000 or $623,691 or any other decimal value. This is a regression problem.

Synthetic Regression

We call this synthetic regression because it is not based on any real data. We randomly generate the data points and compute the outputs from arbitrary formulas, as shown below.

[Image: DNN reg.png]

Figure 1. Randomly generated data table.

Our tasks are:

  1. generate a lot of data like the table above and use them to train the machine learning model,
  2. create more data like the above, except without the columns output1 and output2. We then use the model we have trained to predict what output1 and output2 are. At the end of this post, graphs are plotted to show how close the predicted values are to the theoretically correct values.

Preparing Data

This section does only one thing: prepare training and testing data tables like the one shown in Figure 1 above.

To simulate a regression problem, we create a table (data frame) whose columns are either boolean or double, the latter taking a range of continuous values. This code will save regressionMVartest_train.csv and regressionMVartest_test.csv, the data sets for training and testing respectively. Note that on top of the 8 input columns we create two extra columns, computed through the variables temp and temp2 in the code below. These 2 columns, called “output1” and “output2”, are the outputs of the table and depend on the other 8 columns.
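
For reference, reading off “example 1” in the code below (gg[0] through gg[7] hold the columns first, second, third, fourth, bool1, bool2, bool3 and bool4, in that order), the two outputs are computed as

output1 = first^2 + second + third + (bool1 + bool2)*fourth + bool3 + bool4
output2 = first - second^2 + third - 0.5*(bool3 - bool4)*fourth + bool1 - bool2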

I recommend using Jupyter so that we can inspect the output of each step as we go.

Also, note that hidden_unit_set, step_set and activationfn_set are lists of 2 different settings, one for “output1” and the other for “output2”. The variable no_of_data_points = [2500, 1000] specifies the number of training data and test data we want to generate. In this example regressionMVartest_train.csv will have 2500 rows of data and regressionMVartest_test.csv will have 1000 rows of data. Do play around with other variables to obtain optimum prediction (ideally we should do hyper-parameter tuning). For example, set the number of steps in step_set higher or to None in order to allow for more number of model updates.
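
For instance, a heavier configuration might look like the following. This is a hypothetical “example 2” of our own, for experimentation only, not a tuned recommendation:

# example 2 (hypothetical settings, for experimentation)
hiddenunit_set=[[16,8,4],[16,8,4]]          # deeper networks for both outputs
step_set=[12800, 12800]                     # twice as many model updates
activationfn_set=[tf.nn.tanh, tf.nn.tanh]   # a different activation function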

Note that regressionMVartest_test.csv does not have the output columns, since the outputs are to be predicted by the model. However, we know the theoretically correct values of the outputs, since we specify the exact formulas. These values are saved separately in regressionMVartest_test_correctans.csv for double checking.

import kero.DataHandler.RandomDataFrame as RDF
import kero.DataHandler.DataTransform as dt
from kero.DataHandler.Generic import *

import numpy as np
import pandas as pd
import tensorflow as tf
import itertools
import matplotlib.pyplot as plt
import matplotlib
from sklearn.model_selection import train_test_split
from scipy.stats.stats import pearsonr
from pylab import rcParams

# CONTROL PANEL

# example 1
hiddenunit_set=[[8,4],[8,4]]
step_set=[6400, 6400]
activationfn_set=[tf.nn.relu, tf.nn.relu]

no_of_data_points = [2500, 1000] # number of rows for training and testing data sets to be generated.
puncture_rate=0.001

rdf = RDF.RandomDataFrame()
####################################################
# Specify the input variables here
####################################################
FEATURES = ["first","second","third", "fourth","bool1", "bool2", "bool3", "bool4"]
output_label="output1" # !! List all the output column names
output_label2= "output2"
col1 = {"column_name": "first", "items": list(range(4))}
col2 = {"column_name": "second", "items": list(np.linspace(10, 20, 8))}
col3 = {"column_name": "third", "items": list(np.linspace(-100, 100, 1250))}
col4 = {"column_name": "fourth", "items": list(np.linspace(-1, 1, 224))}
col5 = {"column_name": "bool1", "items": [0, 1]}
col6 = {"column_name": "bool2", "items": [0, 1]}
col7 = {"column_name": "bool3", "items": [0, 1]}
col8 = {"column_name": "bool4", "items": [0, 1]}


LABEL = [output_label, output_label2]

for toggley in [0, 1]: # once for train.csv once for test.csv
    rdf.initiate_random_table(no_of_data_points[toggley], col1, col2, col3, col4, col5, col6, col7, col8, panda=True)
    # print("clean\n", rdf.clean_df)

    df_temp = rdf.clean_df
    listform, _ = dt.dataframe_to_list(df_temp)
    
    
    ########################################################
    # Specify the system of equations which determines
    # the output variables.
    ########################################################
    tempcol = []
    tempcol2 = []
    gg = listform[:]

    ########## Specify the name(s) of the output variable(s) ##########
    

    listform = list(listform)
    for i in range(len(listform[0])):
        # example 0 (very easy)
#             temp = gg[0][i] + gg[1][i] + gg[2][i] + gg[3][i] + gg[4][i] + gg[5][i] + gg[6][i] + gg[7][i]
#             temp2 = gg[0][i] - gg[1][i] + gg[2][i] - gg[3][i] + gg[4][i] - gg[5][i] + gg[6][i] - gg[7][i]

        # example 1
        temp = gg[0][i]**2 + gg[1][i] + gg[2][i] + (gg[4][i] + gg[5][i])*gg[3][i] + gg[6][i] + gg[7][i]
        temp2 = gg[0][i] - gg[1][i]**2 + gg[2][i] - gg[3][i]*(0.5*(gg[6][i] - gg[7][i])) + gg[4][i] - gg[5][i] 
        ########################################
        tempcol = tempcol + [temp]
        tempcol2 = tempcol2 + [temp2]
    if toggley==0:
        listform = listform + [tempcol, tempcol2]
        column_name_list = FEATURES + LABEL
    else:
        correct_test_df = pd.DataFrame(np.transpose([tempcol, tempcol2]),columns=LABEL)
        correct_test_df.to_csv("regressionMVartest_test_correctans.csv", index=False)
        column_name_list = FEATURES
    # for i in range(len(listform)):
    #     print(column_name_list[i], '-', listform[i])
    ########################################################
    

    listform = transpose_list(listform)
    # print(listform)
    # print(column_name_list)
    temp_df = pd.DataFrame(listform, columns=column_name_list)
    rdf.clean_df = temp_df
    # print(rdf.clean_df)

    if toggley==0:
        rdf.crepify_table(rdf.clean_df, rate=puncture_rate)
        # print("post crepfify\n", rdf.crepified_df)
        rdf.crepified_df.to_csv("regressionMVartest_train.csv", index=False)
    else:
        rdf.crepify_table(rdf.clean_df, rate=0)
        # print("post crepfify\n", rdf.crepified_df)
        rdf.crepified_df.to_csv("regressionMVartest_test.csv", index=False)

Pre-processing

Alright, we have created some random data tables. In the following code, we first load the csv files we have created and store them as pandas data frames. We load the training data set, extract only the clean part (i.e. drop all the defective data rows), and use it to train our model. The training data is further split into training and testing sets using train_test_split() from scikit-learn. We then have the variables x_train, x_test, y_train and y_test, where x refers to the 8 input columns and y to the 2 outputs; each of x and y is split into a training part and a testing part of the training set. The testing part is used to validate the result of the training along the way.

df_train = pd.read_csv(r"regressionMVartest_train.csv")
print('df train shape =',df_train.shape)
# print("train data:\n", df_train.head())
cleanD_train, crippD_train, _ = dt.data_sieve(df_train)  # cleanD, crippD, origD
cleanD_train.get_list_from_df()
df_train_clean = cleanD_train.clean_df
df_train_crippled = crippD_train.crippled_df

print("clean train data:\n", df_train.head())
print('df train clean shape =',df_train_clean.shape)
if df_train_crippled is not None:
    print('df train crippled shape =',df_train_crippled.shape)
else:
    print('df train: no defect')

colname_set_train = df_train.columns

# prepare
dftr=df_train_clean[:]
print(list(dftr.columns))
train = dftr


COLUMNS = FEATURES + LABEL
feature_cols = [tf.contrib.layers.real_valued_column(k) for k in FEATURES]
training_set = train[COLUMNS]
prediction_set = train[LABEL]


x_train, x_test, y_train, y_test = train_test_split(training_set[FEATURES] , prediction_set, test_size=0.33, random_state=42)
y_train = pd.DataFrame(y_train, columns = LABEL)
training_set = pd.DataFrame(x_train, columns = FEATURES).merge(y_train, left_index = True, right_index = True)
print(training_set.head()) # NOT YET PREPROCESSED
y_test = pd.DataFrame(y_test, columns = LABEL)
testing_set = pd.DataFrame(x_test, columns = FEATURES).merge(y_test, left_index = True, right_index = True)
print(testing_set.head()) # NOT YET PREPROCESSED
print("training size = ", training_set.shape)
print("test size = ", testing_set.shape)

Here we show some rows of the training data.

    first     second      third    fourth  bool1  bool2  bool3  bool4  \
0    0.0  10.000000 -41.072858 -0.345291    1.0    1.0    1.0    1.0   
1    3.0  11.428571  49.719776 -0.076233    0.0    1.0    1.0    1.0   
2    0.0  20.000000  -4.243395 -0.766816    0.0    1.0    0.0    1.0   
3    3.0  17.142857  15.292234 -0.811659    0.0    1.0    0.0    0.0   
4    1.0  20.000000 -55.164131 -0.363229    1.0    0.0    0.0    1.0
     output1     output2  
0 -29.763441 -141.072858  
1  72.072114  -78.892469  
2  15.989789 -405.626803  
3  40.623432 -276.585317  
4 -33.527360 -453.345746

The next part of the code does the main pre-processing. In this example we scale down the columns “second”, “third” and “fourth”. The setting original_scale = [xmin, xmax] tells the algorithm to treat the column as having minimum xmin and maximum xmax; in this example, we specify them through range_second etc. The scale setting specifies the final scale of the column. For example, if the column is a temperature in K ranging from 0 to 1000, we can set original_scale=[0,1000] and scale it down to scale=[0,1]. But if, say, we anticipate further incoming data at higher temperatures, we can set original_scale=[0,5000] instead, and the scaling will be performed accordingly. In this section, we transform the original data frame into the conjugate data frame, i.e. the data frame that the machine can understand better or process more easily. Details can be found here. We perform the transformation for both the training and testing parts of the training data set.
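
Under the hood, this uniform scaling is just a linear map. Here is a minimal sketch of what we assume "cont_to_scale" with mode "uniform" does; the helper scale_uniform below is ours, for illustration only:

def scale_uniform(x, original_scale, scale):
    # linearly map x from [xmin, xmax] onto [smin, smax]
    xmin, xmax = original_scale
    smin, smax = scale
    return (x - xmin) / (xmax - xmin) * (smax - smin) + smin

print(scale_uniform(500.0, [0, 1000], [0, 1]))  # 0.5
print(scale_uniform(15.0, [10, 20], [-1, 1]))   # 0.0, the midpoint of column "second"

With that picture in mind, here are the actual scaling settings: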

range_second = [10,20]
range_third = [-100,100]
range_fourth = [-1,1]
# range_input_set = [range_second, range_third, range_fourth]
range_output1 = [-200,200]
range_output2 = [-600,600]
range_output_set = {'output1': range_output1, 'output2': range_output2}
conj_command_set = {COLUMNS[0]: "",
                    COLUMNS[1]: "cont_to_scale",
                    COLUMNS[2]: "cont_to_scale",
                    COLUMNS[3]: "cont_to_scale",
                    COLUMNS[4]: "",
                    COLUMNS[5]: "",
                    COLUMNS[6]: "",
                    COLUMNS[7]: "",
                    # OUTPUT
                    COLUMNS[8]: "cont_to_scale",
                    COLUMNS[9]: "cont_to_scale",
                   }
scale_output1 = [0,1]
scale_output2 = [0,1]
scale_output_set = {'output1': scale_output1, 'output2': scale_output2}
cont_to_scale_settings_second = {"scale": [-1, 1], "mode": "uniform", "original_scale":range_second}
cont_to_scale_settings_third = {"scale": [0, 1], "mode": "uniform", "original_scale":range_third}
cont_to_scale_settings_fourth = {"scale": [0, 1], "mode": "uniform", "original_scale":range_fourth}
cont_to_scale_settings_output1 = {"scale": scale_output1 , "mode": "uniform", "original_scale":range_output1}
cont_to_scale_settings_output2 = {"scale": scale_output2 , "mode": "uniform", "original_scale":range_output2}
conj_command_setting_set = {COLUMNS[0]: None,
                            COLUMNS[1]: cont_to_scale_settings_second,
                            COLUMNS[2]: cont_to_scale_settings_third,
                            COLUMNS[3]: cont_to_scale_settings_fourth,
                            COLUMNS[4]: None,
                            COLUMNS[5]: None,
                            COLUMNS[6]: None,
                            COLUMNS[7]: None,
                            # OUTPUT
                            COLUMNS[8]: cont_to_scale_settings_output1,
                            COLUMNS[9]: cont_to_scale_settings_output2,
                            }
cleanD_training_set = dt.clean_data()   
cleanD_training_set.clean_df = training_set
cleanD_training_set.build_conj_dataframe(conj_command_set, conj_command_setting_set=conj_command_setting_set)

train_conj = cleanD_training_set.clean_df_conj[:]
print("scaled train=\n", train_conj.head(10))

# Same thing (preprocessing) but for the test part of the training set

cleanD_testing_set = dt.clean_data() 
cleanD_testing_set.clean_df = testing_set
cleanD_testing_set.build_conj_dataframe(conj_command_set, conj_command_setting_set=conj_command_setting_set)

test_conj = cleanD_testing_set.clean_df_conj[:]
print("scaled test=\n", test_conj.head(10))
print(test_conj.shape)
print(testing_set.shape)

Just a reminder: what are we doing above? Our data may come in all sorts of shapes and values. What we are doing is pre-processing, i.e. scaling the numbers down, typically to the range -1 to 1, so that the algorithm can start off treating every feature fairly**. Here we show some rows of the transformed table.

    first    second     third    fourth  bool1  bool2  bool3  bool4   output1  \
0    1.0 -0.142857  0.739792  0.139013    1.0    0.0    1.0    1.0  0.661305   
1    2.0  0.428571  0.343475  0.865471    0.0    1.0    0.0    0.0  0.476422   
2    0.0 -0.428571  0.633307  0.228700    0.0    0.0    0.0    0.0  0.598796   
3    1.0  1.000000  0.392314  0.941704    1.0    1.0    0.0    0.0  0.503074   
4    0.0 -0.428571  0.034428  0.448430    0.0    0.0    0.0    0.0  0.299357   
5    3.0  1.000000  0.973579  0.627803    1.0    0.0    1.0    0.0  0.812428   
6    1.0  0.714286  0.351481  0.385650    1.0    0.0    1.0    0.0  0.476597   
7    1.0  0.714286  0.052042  0.789238    1.0    1.0    0.0    1.0  0.330342   
8    2.0  0.428571  0.898319  0.170404    0.0    1.0    0.0    1.0  0.752868   
9    1.0 -0.142857  0.498799  0.035874    0.0    1.0    0.0    0.0  0.535293   

    output2  
0  0.371564  
1  0.229848  
2  0.384463  
3  0.149552  
4  0.284649  
5  0.248823  
6  0.189594  
7  0.139000  
8  0.322047  
9  0.329732

Training Machine Learning Model

The next portion does the model training. This is where the Deep Neural Network comes in. Notice that we define input_fn(), an artefact of TensorFlow's tf.contrib.learn API, which requires a function to be defined for feeding the data set into the algorithm. Do not sweat over it.

# Model
tf.logging.set_verbosity(tf.logging.ERROR)
regressor_set = []
for i in range(len(LABEL)): # in our example, we create 2 regressors, one for output1 and the other for output2.
    regressor = tf.contrib.learn.DNNRegressor(feature_columns=feature_cols, 
                                              activation_fn = activationfn_set[i], hidden_units=hiddenunit_set[i])#,
#                                             optimizer = tf.train.GradientDescentOptimizer(learning_rate= 0.1))
    regressor_set.append(regressor)

# Reset the index of training
training_set.reset_index(drop = True, inplace =True)
def input_fn(data_set, one_label, pred = False):
    # one_label is the element of LABEL
    if pred == False:
        
        feature_cols = {k: tf.constant(data_set[k].values) for k in FEATURES}
        labels = tf.constant(data_set[one_label].values)
        
        return feature_cols, labels

    if pred == True:
        feature_cols = {k: tf.constant(data_set[k].values) for k in FEATURES}
        
        return feature_cols

# TRAINING HERE
for i in range(len(LABEL)):
    regressor_set[i].fit(input_fn=lambda: input_fn(train_conj, LABEL[i]), steps=step_set[i])
    
# Evaluation on the test set created by train_test_split
print("Final Loss on the testing set: ")
predictions_prev_set = []
for i in range(len(LABEL)):
    ev = regressor_set[i].evaluate(input_fn=lambda: input_fn(test_conj,LABEL[i]), steps=1)
    loss_score1 = ev["loss"]
    print(LABEL[i],"{0:f}".format(loss_score1))
    # Predictions
    y = regressor_set[i].predict(input_fn=lambda: input_fn(test_conj,LABEL[i]))
    predictions_prev = list(itertools.islice(y, test_conj.shape[0]))
    predictions_prev_set.append(predictions_prev)
print("predictions_prev_set length = ",len(predictions_prev_set))

Recall that we have separated our training data set into training and testing parts. We use the training part for DNN model training, and the testing part to test the result of that training. The plots are shown below; since they plot real against predicted values, the closer the points are to the line y=x, the better the prediction. Pearson correlation coefficients are used as auxiliary measures: a good prediction has a coefficient close to 1.0. However, a prediction with a systematic error can still have a coefficient of 1.0, for example when all the values are shifted by a constant. Note that the predictions come out in the scaled (conjugate) form, so we transform them back to the original range before comparing them against the real values.
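
As a quick illustration of that caveat (this snippet is ours, not part of the notebook): shift every prediction by a constant and the Pearson coefficient stays exactly 1.0.

from scipy.stats.stats import pearsonr

reality   = [1.0, 2.0, 3.0, 4.0]
predicted = [11.0, 12.0, 13.0, 14.0]  # every prediction off by +10
print(pearsonr(predicted, reality))   # (1.0, 0.0): perfect correlation despite the bias

Keep that in mind when reading the coefficients below.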

corrcoeff_set = []
predictions_set = []
reality_set = []
print("pearson correlation coefficients  =  ")

for i in range(len(LABEL)):
    # print(LABEL[i]," : ",init_scale_max ,init_scale_min)
    # need to inverse transform #
    # since prediction is in conj form
    initial_scale = [range_output_set[LABEL[i]][0],range_output_set[LABEL[i]][1]]
    orig_scale = [scale_output_set[LABEL[i]][0],scale_output_set[LABEL[i]][1]]
    pred_inv = dt.conj_from_cont_to_scaled(predictions_prev_set[i], scale=initial_scale, mode="uniform",original_scale=orig_scale)
    #############################
    predictions = pd.DataFrame(pred_inv,columns = ['Prediction'])
    predictions_set = predictions_set + [pred_inv] # a list, or column
    # predictions.head()
    
    reality = testing_set[LABEL[i]].values # a list, or column
    reality_set = reality_set + [reality]
    # reality.head()
    corrcoeff=pearsonr(list(predictions.Prediction), list(reality))
    corrcoeff_set.append(corrcoeff)
    print(LABEL[i], " : ", corrcoeff)

matplotlib.rc('xtick', labelsize=20) 
matplotlib.rc('ytick', labelsize=20) 
for i in range(len(LABEL)):    
    fig, ax = plt.subplots()
#     plt.style.use('ggplot')
    plt.scatter(predictions_set[i], reality_set[i],s=3, c='r', lw=0) # ,'ro'
    plt.xlabel('Predictions', fontsize = 20)
    plt.ylabel('Reality', fontsize = 20)
    plt.title('Predictions x Reality on dataset Test: '+LABEL[i], fontsize = 20)

    ax.plot([reality_set[i].min(), reality_set[i].max()], [reality_set[i].min(), reality_set[i].max()], 'k--', lw=4)

The predictions for output 1 and 2 are shown here.

[Image: symvaroutput.PNG]

Finally, the following code saves the output of the prediction in another separate csv called synregMVar_submission.csv. The result is compared with the theoretically correct outputs and plotted below.

df_test = pd.read_csv(r"regressionMVartest_test.csv")
print('df test shape =',df_test.shape)
# print("\ntest data", df_test.head())
cleanD_test, crippD_test, _ = dt.data_sieve(df_test) 
cleanD_test.get_list_from_df()
df_test_clean = cleanD_test.clean_df
df_test_crippled = crippD_test.crippled_df

print("\ntest data", df_test.head())
if df_test_crippled is not None:
    print('df test crippled shape =',df_test_crippled.shape)
else:
    print('df test: no defect')
print('df test clean shape =',df_test_clean.shape)

conj_command_set_test = {FEATURES[0]: "",
                    FEATURES[1]: "cont_to_scale",
                    FEATURES[2]: "cont_to_scale",
                    FEATURES[3]: "cont_to_scale",
                    FEATURES[4]: "",
                    FEATURES[5]: "",
                    FEATURES[6]: "",
                    FEATURES[7]: "",
                   }
conj_command_setting_set_test = {FEATURES[0]: None,
                            FEATURES[1]: cont_to_scale_settings_second,
                            FEATURES[2]: cont_to_scale_settings_third,
                            FEATURES[3]: cont_to_scale_settings_fourth,
                            FEATURES[4]: None,
                            FEATURES[5]: None,
                            FEATURES[6]: None,
                            FEATURES[7]: None,
                            }

# Same thing (preprocessing) but for the test set

cleanD_test.build_conj_dataframe(conj_command_set_test, conj_command_setting_set=conj_command_setting_set_test)

test_predict_conj = cleanD_test.clean_df_conj[:]
print("scaled test=\n", test_predict_conj.head(10))
print(df_test.shape)
print(test_predict_conj.shape)

filename = "synregMVar_submission.csv"
df_test_correct_ans = pd.read_csv(r"regressionMVartest_test_correctans.csv")

y_predict_inv_set = []
for i in range(len(LABEL)):
    y_predict = regressor_set[i].predict(input_fn=lambda: input_fn(test_predict_conj , LABEL[i], pred = True))
    # need to transform back
    y_predict_before = list(itertools.islice(y_predict, df_test.shape[0]))
    # !!
    initial_scale = [range_output_set[LABEL[i]][0],range_output_set[LABEL[i]][1]]
    orig_scale = [scale_output_set[LABEL[i]][0],scale_output_set[LABEL[i]][1]]
    y_predict_inv = dt.conj_from_cont_to_scaled(y_predict_before, scale=initial_scale, mode="uniform",original_scale=orig_scale)
    y_predict_inv_set = y_predict_inv_set + [y_predict_inv]
    
    fig2, ax2 = plt.subplots()
    real_test = np.array(list(df_test_correct_ans[LABEL[i]]))
    plt.scatter(y_predict_inv, real_test,s=3, c='r', lw=0) # ,'ro'
    plt.xlabel('Predictions', fontsize = 20)
    plt.ylabel('Reality', fontsize = 20)
    plt.title('Predictions x Reality on dataset Test: '+LABEL[i], fontsize = 20)

    ax2.plot([real_test.min(), real_test.max()], [real_test.min(), real_test.max()], 'k--', lw=4)
y_predict_inv_set = transpose_list(y_predict_inv_set)
# print(y_predict_inv_set)
y_predict_for_csv = pd.DataFrame(y_predict_inv_set, columns = LABEL)
y_predict_for_csv.to_csv(filename, index=False)

[Image: symvaroutput_subm]

In the next installment, we split similar code into two parts. The first part trains the model and saves it externally. In the second part, the model is loaded, used for prediction, trained further, and then used for prediction again. Cheers!

Some comments

** We scale the data with the intention of letting the algorithm train better. However, this is contestable and further research is necessary. Some columns may carry more weight than others, and it might be the case that the more significant or “heavier” columns are easier to process if they are scaled to larger values.