Deep Neural Network Regression part 2.1


DNN Regressor

  1. Part 1. Synthetic Regression
  2. Part 2.1. Synthetic Regression. Train and save the model.
  3. Part 2.2. Synthetic Regression. Load model and predict.

The following code can be found here in the folder DNN regression, python, under the names synregMVar_init.ipynb (Jupyter notebook) and synregMVar_init.py.

In the previous example, we used a Deep Neural Network Regressor to solve a regression problem. Read it first to get the overall idea of what we are doing here. In this example, we again create random data for training and prediction in a similar way. The difference is that we split the problem into part 2.1 and part 2.2; we go through part 2.1 here. We use a small number of training data points to train the model, save the model, and load it again in part 2.2. We will see that the model performs poorly at first; we then train it further and use it for prediction with better performance.

import kero.DataHandler.RandomDataFrame as RDF
import kero.DataHandler.DataTransform as dt
from kero.DataHandler.Generic import *

import numpy as np
import pandas as pd
import tensorflow as tf
import itertools
import matplotlib.pyplot as plt
import matplotlib
from sklearn.model_selection import train_test_split
from scipy.stats.stats import pearsonr
from pylab import rcParams

Here, three hidden layers are used for each of the two outputs (one DNNRegressor per output). Increasing the number of steps for each output allows the algorithm to keep updating the model over the course of training.

hiddenunit_set = [[32, 16, 8], [32, 16, 8]]
step_set = [2400, 2400] # [None, None]
activationfn_set = [tf.nn.relu, tf.nn.relu]

Preparing Data

The following section creates the random data. We generate 1000 points for the training set, but intentionally set test_size_frac=0.98, i.e. only 0.02 * 1000 = 20 points are used for training and the remaining 980 for testing. This is to demonstrate how the model performs poorly if we do not train it further with more data.

no_of_training_data = 1000
no_of_test_data = 1000
test_size_frac = 0.98 # fraction of training data used for validation
puncture_rate = 0.001

no_of_data_points = [no_of_training_data, no_of_test_data] # number of rows for training and testing data sets to be generated.
rdf = RDF.RandomDataFrame()
####################################################
# Specify the input variables here
####################################################
FEATURES = ["first","second","third", "fourth","bool1", "bool2", "bool3", "bool4"]
output_label="output1" # !! List all the output column names
output_label2= "output2"

col1 = {"column_name": FEATURES[0], "items": list(range(4))}
col2 = {"column_name": FEATURES[1], "items": list(np.linspace(10, 20, 8))}
col3 = {"column_name": FEATURES[2], "items": list(np.linspace(-100, 100, 1250))}
col4 = {"column_name": FEATURES[3], "items": list(np.linspace(-1, 1, 224))}
col5 = {"column_name": FEATURES[4], "items": [0, 1]}
col6 = {"column_name": FEATURES[5], "items": [0, 1]}
col7 = {"column_name": FEATURES[6], "items": [0, 1]}
col8 = {"column_name": FEATURES[7], "items": [0, 1]}

LABEL = [output_label, output_label2]
for toggley in [0, 1]: # once for train.csv, once for test.csv
    rdf.initiate_random_table(no_of_data_points[toggley], col1, col2, col3, col4, col5, col6, col7, col8, panda=True)
    # print("clean\n", rdf.clean_df)

    df_temp = rdf.clean_df
    listform, _ = dt.dataframe_to_list(df_temp)
    
    
    ########################################################
    # Specify the system of equations which determines
    # the output variables.
    ########################################################
    tempcol = []
    tempcol2 = []
    gg = listform[:]

    ########## Specify the name(s) of the output variable(s) ##########
    

    listform = list(listform)
    for i in range(len(listform[0])):
        # example 0 (very easy)
#             temp = gg[0][i] + gg[1][i] + gg[2][i] + gg[3][i] + gg[4][i] + gg[5][i] + gg[6][i] + gg[7][i]
#             temp2 = gg[0][i] - gg[1][i] + gg[2][i] - gg[3][i] + gg[4][i] - gg[5][i] + gg[6][i] - gg[7][i]

        # example 1
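        # In terms of the columns (gg[0]..gg[7] are first, second, third,
        # fourth, bool1, bool2, bool3, bool4), the two outputs below are:
        #   output1 = first**2 + second + third + (bool1 + bool2)*fourth + bool3 + bool4
        #   output2 = first - second**2 + third - fourth*0.5*(bool3 - bool4) + bool1 - bool2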
        temp = gg[0][i]**2 + gg[1][i] + gg[2][i] + (gg[4][i] + gg[5][i])*gg[3][i] + gg[6][i] + gg[7][i]
        temp2 = gg[0][i] - gg[1][i]**2 + gg[2][i] - gg[3][i]*(0.5*(gg[6][i] - gg[7][i])) + gg[4][i] - gg[5][i] 
        ########################################
        tempcol = tempcol + [temp]
        tempcol2 = tempcol2 + [temp2]
    if toggley==0:
        listform = listform + [tempcol, tempcol2]
        column_name_list = FEATURES + LABEL
    else:
        correct_test_df = pd.DataFrame(np.transpose([tempcol, tempcol2]),columns=LABEL)
        correct_test_df.to_csv("regressionMVartest_test_correctans.csv", index=False)
        column_name_list = FEATURES
    # for i in range(len(listform)):
    #     print(column_name_list[i], '-', listform[i])
    ########################################################
    

    listform = transpose_list(listform)
    # print(listform)
    # print(column_name_list)
    temp_df = pd.DataFrame(listform, columns=column_name_list)
    rdf.clean_df = temp_df
    # print(rdf.clean_df)

    if toggley==0:
        rdf.crepify_table(rdf.clean_df, rate=puncture_rate)
        # print("post crepify\n", rdf.crepified_df)
        rdf.crepified_df.to_csv("regressionMVartest_train.csv", index=False)
    else:
        rdf.crepify_table(rdf.clean_df, rate=0)
        # print("post crepify\n", rdf.crepified_df)
        rdf.crepified_df.to_csv("regressionMVartest_test.csv", index=False)

Pre-processing

We pre-process the data for training. In this part of the code, the training set is split into a training part and a testing part of 20 and 980 data points respectively, as mentioned before.

df_train = pd.read_csv(r"regressionMVartest_train.csv")
print('df train shape =',df_train.shape)
# print("train data:\n", df_train.head())
cleanD_train, crippD_train, _ = dt.data_sieve(df_train)  # returns cleanD, crippD, origD
cleanD_train.get_list_from_df()
colname_set_train = df_train.columns
df_train_clean = cleanD_train.clean_df
df_train_crippled = crippD_train.crippled_df
print("clean train data:\n", df_train_clean.head())
print('df train clean shape =',df_train_clean.shape)
if df_train_crippled is not None:
    print('df train crippled shape =',df_train_crippled.shape)
else:
    print('df train: no defect')
# prepare
dftr = df_train_clean[:]
train = dftr
print(FEATURES, " -size = ", len(FEATURES))
feature_cols = [tf.contrib.layers.real_valued_column(k) for k in FEATURES]

# Training set and Prediction set with the features to predict
training_set = train[FEATURES]
prediction_set = train[LABEL]

# Train and Test 
x_train, x_test, y_train, y_test = train_test_split(training_set[FEATURES] , prediction_set, test_size=test_size_frac, random_state=42)
y_train = pd.DataFrame(y_train, columns = LABEL)
training_set = pd.DataFrame(x_train, columns = FEATURES).merge(y_train, left_index = True, right_index = True)

training_sub = training_set[FEATURES]
print(training_set.head()) # NOT YET PREPROCESSED
# Same thing but for the test part of the training set
y_test = pd.DataFrame(y_test, columns = LABEL)
testing_set = pd.DataFrame(x_test, columns = FEATURES).merge(y_test, left_index = True, right_index = True)
print(testing_set.head()) # NOT YET PREPROCESSED
print("training size = ", training_set.shape)
print("test size = ", testing_set.shape)

Here we pre-process the training part of the training set.

range_second = [10,20]
range_third = [-100,100]
range_fourth = [-1,1]
# range_input_set = [range_second, range_third, range_fourth]
range_output1 = [-200,200]
range_output2 = [-600,600]
range_output_set = {'output1': range_output1, 'output2': range_output2}


conj_command_set = {FEATURES[0]: "",
                    FEATURES[1]: "cont_to_scale",
                    FEATURES[2]: "cont_to_scale",
                    FEATURES[3]: "cont_to_scale",
                    FEATURES[4]: "",
                    FEATURES[5]: "",
                    FEATURES[6]: "",
                    FEATURES[7]: "",
                    # OUTPUT
                    LABEL[0]: "cont_to_scale",
                    LABEL[1]: "cont_to_scale",
                   }
scale_output1 = [0,1]
scale_output2 = [0,1]
scale_output_set = {'output1': scale_output1, 'output2': scale_output2}
cont_to_scale_settings_second = {"scale": [-1, 1], "mode": "uniform", "original_scale":range_second}
cont_to_scale_settings_third = {"scale": [0, 1], "mode": "uniform", "original_scale":range_third}
cont_to_scale_settings_fourth = {"scale": [0, 1], "mode": "uniform", "original_scale":range_fourth}
cont_to_scale_settings_output1 = {"scale": scale_output1 , "mode": "uniform", "original_scale":range_output1}
cont_to_scale_settings_output2 = {"scale": scale_output2 , "mode": "uniform", "original_scale":range_output2}
conj_command_setting_set = {FEATURES[0]: None,
                            FEATURES[1]: cont_to_scale_settings_second,
                            FEATURES[2]: cont_to_scale_settings_third,
                            FEATURES[3]: cont_to_scale_settings_fourth,
                            FEATURES[4]: None,
                            FEATURES[5]: None,
                            FEATURES[6]: None,
                            FEATURES[7]: None,
                            # OUTPUT
                            LABEL[0]: cont_to_scale_settings_output1,
                            LABEL[1]: cont_to_scale_settings_output2,
                            }
cleanD_training_set = dt.clean_data()   
cleanD_training_set.clean_df = training_set
cleanD_training_set.build_conj_dataframe(conj_command_set, conj_command_setting_set=conj_command_setting_set)

train_conj = cleanD_training_set.clean_df_conj[:]
print("scaled train=\n", train_conj.head(10))

Now we define the model and run the training.

# Model
tf.logging.set_verbosity(tf.logging.ERROR)
regressor_set = []
for i in range(len(LABEL)):
    regressor = tf.contrib.learn.DNNRegressor(feature_columns=feature_cols,
                                              activation_fn=activationfn_set[i],
                                              hidden_units=hiddenunit_set[i],
                                              model_dir=LABEL[i])
    # optionally: optimizer=tf.train.GradientDescentOptimizer(learning_rate=0.1)
    regressor_set.append(regressor)

# Reset the index of the training set
training_set.reset_index(drop=True, inplace=True)

def input_fn(data_set, one_label, pred=False):
    # one_label is one element of LABEL.
    # Builds the dict of feature tensors expected by tf.contrib.learn;
    # during training it also returns the corresponding label tensor.
    feature_cols = {k: tf.constant(data_set[k].values) for k in FEATURES}
    if pred:
        return feature_cols
    labels = tf.constant(data_set[one_label].values)
    return feature_cols, labels

# TRAINING HERE
for i in range(len(LABEL)):
    regressor_set[i].fit(input_fn=lambda: input_fn(train_conj, LABEL[i]), steps=step_set[i])

We are done here! The models are saved in the output1 and output2 folders (the model_dir we passed to each regressor). We will load the models in the next part, in this link.
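
As a preview of part 2.2: with tf.contrib.learn, re-creating an estimator with the same model_dir should be enough to restore its latest checkpoint, so a sketch of reloading might look like this (test_conj is a hypothetical pre-processed test frame, prepared the same way as train_conj):

restored_set = []
for i in range(len(LABEL)):
    restored = tf.contrib.learn.DNNRegressor(feature_columns=feature_cols,
                                             activation_fn=activationfn_set[i],
                                             hidden_units=hiddenunit_set[i],
                                             model_dir=LABEL[i])  # restores from the saved folder
    restored_set.append(restored)

# Predictions without re-training:
# y_pred = list(restored_set[0].predict(input_fn=lambda: input_fn(test_conj, None, pred=True)))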