[TensorFlow Study Group] Week 2 Hands-on Task List

Practice assignment

#21

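The work consisted of the following three steps:

  • 1 Feature analysis and preprocessing of the Titanic dataset
  • 2 Logistic regression with TensorFlow
  • 3 Training and accuracy testing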

1 Dataset feature analysis and preprocessing

First, take a look at the features of the dataset.
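One way to obtain such an overview is with pandas' own summary methods. The sketch below uses a tiny hand-made DataFrame named toy (hypothetical values, not the real Kaggle rows) so that it runs without train.csv; on the real file the same calls would be applied to the frame returned by pd.read_csv.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for train.csv (all values are made up).
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Survived": [0, 1, 1, 0],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
})

# Non-null count per column: fewer entries than rows means there are gaps.
non_null = toy.notnull().sum()
# Number of missing values per column.
missing = toy.isnull().sum()
# Non-numeric columns are the candidates for categorical encoding.
categorical = [c for c in toy.columns
               if not pd.api.types.is_numeric_dtype(toy[c])]

print(non_null)
print(missing)
print(categorical)
```

Columns with a high missing count (Cabin here) and non-numeric columns (Sex here) are the ones that need special handling in preprocessing.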

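  • The analysis yields the following information about the data:
  1. The training data contains 891 passengers in total, but some attributes are incomplete, for example:

    - Age is recorded for only 714 passengers

    - Cabin is known for only 204 passengers

  2. Some attributes are categorical

    For example, Sex (male, female) must be converted into a numeric feature

  3. Some attributes/features are irrelevant

    For example, PassengerId has nothing to do with survival and should be dropped

  4. The numeric features are on different scales and need to be normalized
  • Accordingly, the following preprocessing operations are needed:
  1. Handle missing attribute values:

    - Fill in the missing Age values with a RandomForestRegressor fitted on the rows where Age is known

    - Cabin has too many missing values, so imputed values would likely be inaccurate; drop the column

  2. Convert categorical attributes into numeric features

    For example, Sex can be encoded with "1" for male and "0" for female. (The full code at the end one-hot encodes Sex, Pclass and Embarked with pd.get_dummies.)

  3. Drop irrelevant attributes/features

    Drop PassengerId and Name

    (I also dropped Ticket here: there are too many distinct ticket values to see any correlation with survival for now, so I set it aside.)

  4. Normalize the feature values (the full code standardizes them with StandardScaler)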
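Steps 2 to 4 can be sketched on a few made-up rows. This is only an illustration under stated assumptions: the frame toy and its values are hypothetical, Sex is encoded as the 1/0 mapping described in the text rather than one-hot, and standardization is written out by hand with the population standard deviation, which is what StandardScaler computes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows (made-up values) using the dataset's column names.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Sex": ["male", "female", "female", "male"],
    "Age": [20.0, 30.0, 40.0, 50.0],
    "Fare": [7.0, 71.0, 8.0, 53.0],
})

# Drop a feature that carries no information about survival.
toy = toy.drop(columns=["PassengerId"])

# Binary encoding of a categorical feature: male -> 1, female -> 0.
toy["Sex"] = toy["Sex"].map({"male": 1, "female": 0})

# Standardize numeric features to zero mean and unit variance
# (np.std uses ddof=0, i.e. the population standard deviation).
for col in ["Age", "Fare"]:
    values = toy[col].to_numpy(dtype=float)
    toy[col + "_scaled"] = (values - values.mean()) / values.std()
toy = toy.drop(columns=["Age", "Fare"])

print(toy)
```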

2 Logistic regression with TensorFlow

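  • Logistic regression uses the sigmoid function $\sigma(z) = \frac{1}{1 + e^{-z}}$. For a sample $x = (x_1, x_2, \dots, x_n)^T$, the binary classification function can be written as

    $h_\theta(x) = \sigma(\theta_0 + \theta_1 x_1 + \dots + \theta_n x_n)$

where $\theta = (\theta_0, \theta_1, \dots, \theta_n)^T$ are the parameters to be learned. With $w = (\theta_1, \dots, \theta_n)^T$ and $b = \theta_0$, this is the familiar

$y = \sigma(w^T x + b)$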
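The sigmoid-based prediction can be sketched in a few lines of NumPy. The function names sigmoid and predict_proba, and the parameter and sample values, are hypothetical; the sample is assumed to carry a leading constant 1 so that the first parameter plays the role of the bias.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(theta, x):
    """Estimated probability that the label is 1 (x[0] is assumed to be 1)."""
    return sigmoid(np.dot(theta, x))

theta = np.array([0.0, 1.0, -1.0])  # hypothetical parameters, bias first
x = np.array([1.0, 2.0, 2.0])       # hypothetical sample with leading 1
p = predict_proba(theta, x)
print(p)
```

A sample is then assigned to class 1 when the probability exceeds a threshold, typically 0.5.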

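The parameters theta are found in two steps:

  1. First define an overall loss function of the form

$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{cost}(x_i, y_i)$

where $m$ is the number of samples. Note: in practice, the per-sample loss $\mathrm{cost}(x_i, y_i)$ is usually taken to be the negative log-likelihood, i.e. $\mathrm{cost}(x_i, y_i) = -y_i \log h_\theta(x_i) - (1 - y_i) \log(1 - h_\theta(x_i))$, with $h_\theta(x_i)$ the model's predicted probability for sample $i$.

  2. Iteratively optimize theta on the training samples to find the set of values that minimizes the loss function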
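The two steps can be illustrated with plain NumPy: a mean negative log-likelihood loss, followed by repeated gradient-descent updates. This is a sketch under assumptions, not the TensorFlow code used in this post; the toy data, the learning rate and the function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(theta, X, y):
    """Mean negative log-likelihood over all samples."""
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient_step(theta, X, y, lr):
    """One gradient-descent update; the gradient is X^T (h - y) / m."""
    h = sigmoid(X @ theta)
    grad = X.T @ (h - y) / len(y)
    return theta - lr * grad

# Hypothetical toy data: the first column is the constant 1 (bias term).
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

theta = np.zeros(2)
initial_loss = log_loss(theta, X, y)
for _ in range(200):
    theta = gradient_step(theta, X, y, lr=0.1)
final_loss = log_loss(theta, X, y)

print(initial_loss, final_loss, theta)
```

With all-zero parameters every prediction is 0.5, so the starting loss equals log 2; the updates then push it down.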

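Therefore, for this problem the loss used in TensorFlow training is the softmax cross-entropy between the logits and the one-hot labels, summed over the batch: cost = tf.reduce_sum(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y))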
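For two classes, softmax cross-entropy on one-hot labels is the same quantity as the binary log loss, with the sigmoid applied to the difference of the two logits. The NumPy sketch below checks this numerically; softmax_cross_entropy is a hand-written helper for illustration (not the TensorFlow op), and the logits and labels are made-up values.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Row-wise cross-entropy between softmax(logits) and one-hot labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(labels * log_probs).sum(axis=1)

logits = np.array([[0.5, 1.5], [2.0, -1.0]])  # hypothetical model outputs
y = np.array([1.0, 0.0])                      # binary labels
labels = np.stack([1 - y, y], axis=1)         # one-hot rows: [class 0, class 1]

per_sample = softmax_cross_entropy(logits, labels)
cost = per_sample.sum()                       # summed over the batch

# The same loss via the sigmoid of the logit difference.
p = 1.0 / (1.0 + np.exp(-(logits[:, 1] - logits[:, 0])))
binary = -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(per_sample, binary, cost)
```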

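  • Training a model with TensorFlow roughly follows these steps:
  1. Set the hyperparameters, e.g. the learning rate and the number of epochs
  2. Define the graph: variables, model and optimizer, e.g. x, y and the loss function
  3. Initialize the variables: init = tf.global_variables_initializer() (called tf.initialize_all_variables() in older releases, where it is now deprecated)
  4. Create a session and start training.

See the full code at the end of this post for details.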
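To make the shape of these steps visible without any framework, here is a NumPy analogue of the same workflow: hyperparameters, parameters initialized to zero, then a mini-batch loop that accumulates the cost per epoch. It is a sketch, not the TensorFlow graph itself; the toy data and hyperparameter values are hypothetical, and the gradient is written by hand instead of being derived by an optimizer.

```python
import numpy as np

# 1. Hyperparameters (hypothetical values for this toy problem).
learning_rate = 0.1
training_epochs = 100
batch_size = 4

# Hypothetical toy data: 8 samples, 2 features, labels as one-hot rows.
X = np.array([[-2.0, -1.0], [2.0, 1.0], [-1.0, -2.0], [1.0, 2.0],
              [-2.0, -2.0], [2.0, 2.0], [-1.0, -1.0], [1.0, 1.0]])
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])
Y = np.stack([1 - y, y], axis=1).astype(float)

# 2./3. Define the model parameters and initialize them to zero.
W = np.zeros((2, 2))
b = np.zeros(2)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# 4. Training loop: one gradient step per mini-batch.
costs = []
for epoch in range(training_epochs):
    epoch_cost = 0.0
    for start in range(0, len(X), batch_size):
        xb = X[start:start + batch_size]
        yb = Y[start:start + batch_size]
        probs = softmax(xb @ W + b)
        epoch_cost += -np.sum(yb * np.log(probs))  # summed cross-entropy
        grad = probs - yb                          # d(cost) / d(logits)
        W -= learning_rate * (xb.T @ grad)
        b -= learning_rate * grad.sum(axis=0)
    costs.append(epoch_cost)

train_acc = float(np.mean(softmax(X @ W + b).argmax(axis=1) == y))
print(costs[0], costs[-1], train_acc)
```

In TensorFlow 1.x the gradient and update lines are replaced by an optimizer op that is run inside a session, but the loop structure is the same.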


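3 Training and accuracy testing

The model training process is as follows:

The loss as a function of the epoch number is as follows:

The final accuracy measured on the test data was 82%.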
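Accuracy here means the fraction of samples whose highest-scoring class matches the label, and for a test figure it should be computed on the held-out rows rather than on the training rows. A minimal NumPy sketch of the metric, with made-up logits and labels and a hypothetical helper name:

```python
import numpy as np

def accuracy(logits, one_hot_labels):
    """Fraction of rows whose arg-max class matches the one-hot label."""
    predicted = np.argmax(logits, axis=1)
    actual = np.argmax(one_hot_labels, axis=1)
    return float(np.mean(predicted == actual))

# Hypothetical model outputs and labels for five held-out passengers.
logits = np.array([[2.0, -1.0], [0.1, 0.3], [1.2, 0.4], [-0.5, 0.9], [0.0, 1.0]])
labels = np.array([[1, 0], [0, 1], [0, 1], [0, 1], [1, 0]])

test_accuracy = accuracy(logits, labels)
print(test_accuracy)
```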

Full code

from __future__ import print_function, division
import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn
import random
from sklearn.ensemble import RandomForestRegressor
import sklearn.preprocessing as preprocessing
from numpy import array
from sklearn.model_selection import train_test_split

def set_missing_ages(data):  # fill in missing Age values with a RandomForestRegressor
    age_df = data[['Age', 'Fare', 'Parch', 'SibSp', 'Pclass']]
    known_age = age_df[age_df.Age.notnull()].values
    unknown_age = age_df[age_df.Age.isnull()].values
    # y: the target, Age
    y = known_age[:, 0]
    # X: the predictor features
    X = known_age[:, 1:]
    # fit the regressor on the rows where Age is known
    rfr = RandomForestRegressor(random_state=0, n_estimators=2000, n_jobs=-1)
    rfr.fit(X, y)

    # predict Age for the rows where it is missing and write it back
    predictedAges = rfr.predict(unknown_age[:, 1:])
    data.loc[(data.Age.isnull()), 'Age'] = predictedAges
    return data

def attribute_to_number(data):
    dummies_Pclass = pd.get_dummies(data['Pclass'], prefix='Pclass')
    dummies_Embarked = pd.get_dummies(data['Embarked'], prefix='Embarked')
    dummies_Sex = pd.get_dummies(data['Sex'], prefix='Sex')
    data = pd.concat([data, dummies_Pclass,dummies_Embarked, dummies_Sex], axis=1)
    data.drop(['Pclass','Sex', 'Embarked'], axis=1, inplace=True)
    return data


def Scales(data):
    # standardize each numeric column to zero mean and unit variance
    scaler = preprocessing.StandardScaler()
    for col in ['Age', 'Fare', 'SibSp', 'Parch']:
        # StandardScaler expects a 2-D input, hence the double brackets
        data[col + '_scaled'] = scaler.fit_transform(data[[col]]).ravel()
    data.drop(['Parch', 'SibSp', 'Fare', 'Age'], axis=1, inplace=True)
    return data


def DataPreProcess(in_data):  # data preprocessing
    in_data.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
    data_ages_fitted = set_missing_ages(in_data)  # fill in missing Age values
    data = attribute_to_number(data_ages_fitted)  # categorical attributes -> numeric features
    data_scaled = Scales(data)  # standardize numeric features

    # split into features X and label y (y stays a column vector of shape [n, 1])
    data_y = data_scaled[['Survived']].values
    data_X = data_scaled.drop(['Survived'], axis=1).values.astype(np.float32)

    return data_X, data_y


def LR(data_X, data_y):  # Logistic Regression in TensorFlow (1.x graph API)
    X_train, X_test, y_train, y_test = train_test_split(data_X, data_y, test_size=0.4, random_state=0)

    # one-hot encode the labels: column 0 = not survived, column 1 = survived
    y_train = np.hstack([1 - y_train, y_train])
    y_test = np.hstack([1 - y_test, y_test])

    learning_rate = 0.001
    training_epochs = 50
    batch_size = 50
    display_step = 10

    n_samples = X_train.shape[0]  # number of training samples
    n_features = X_train.shape[1]  # number of features
    n_class = 2

    x = tf.placeholder(tf.float32, [None, n_features])
    y = tf.placeholder(tf.float32, [None, n_class])

    W = tf.Variable(tf.zeros([n_features, n_class]), name="weight")
    b = tf.Variable(tf.zeros([n_class]), name="bias")

    # logits (unnormalized class scores)
    pred = tf.matmul(x, W) + b

    # accuracy
    correct_prediction = tf.equal(tf.argmax(pred, 1), tf.argmax(y, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

    # cross entropy, summed over the batch
    cost = tf.reduce_sum(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y))

    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

    init = tf.global_variables_initializer()

    # train
    with tf.Session() as sess:
        sess.run(init)
        total_batch = n_samples // batch_size
        for epoch in range(training_epochs):
            avg_cost = 0
            for i in range(total_batch):
                batch_x = X_train[i * batch_size: (i + 1) * batch_size]
                batch_y = y_train[i * batch_size: (i + 1) * batch_size]
                _, c = sess.run([optimizer, cost], feed_dict={x: batch_x, y: batch_y})
                avg_cost += c / total_batch  # accumulate the mean batch cost over the epoch
            plt.plot(epoch + 1, avg_cost, 'co')

            if (epoch + 1) % display_step == 0:
                print("Epoch:", "%04d" % (epoch + 1), "cost=", avg_cost)

        print("Optimization Finished!")
        # evaluate on the held-out test split
        print("Testing Accuracy:", accuracy.eval({x: X_test, y: y_test}))

        plt.xlabel("Epoch")
        plt.ylabel("Cost")
        plt.show()


if __name__ == "__main__":
    data = pd.read_csv("/home/yimi/LearnTF/logistic/data/train.csv")
    data_X,data_y = DataPreProcess(data)
    LR(data_X,data_y)

References:

[1] https://www.cnblogs.com/zhizhan/p/5238908.html
[2] http://blog.csdn.net/u010099080/article/details/53054519
[3] https://www.cnblogs.com/peghoty/p/3857839.html

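Because of limitations of the editor here, some content does not display well. Everything has also been posted to my personal blog; you are welcome to visit ^_^ http://blog.csdn.net/yimi_ac/article/details/79008555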