Python 机器学习 Note 6
16 Jul 2017 |
Python
今天的笔记是中国大学mooc Pyhton机器学习应用 的程序设计作业。算是对之前学习 分类 的巩固。
题目要求
本次实验为分类任务,实验数据在附件中,共有2个文件,data_train.txt、data_test.txt,分别用于训练和测试,在训练文件中数据有55列,前54列是样本的特征(输入数据),最后一列是样本的类别(label),类别共有7种,对应为1~7。测试数据中没有类别(label),需要根据模型预测相应的类别(label),这些预测结果需要上传,并根据预测结果给出相应的分数。
作业要求:
数据处理
主要是数据处理上,pandas读取数据文件选择指定列卡了很久。
如果简单粗暴地用 df[:54]
选取数据,所选取的是行而非列。
read_table参数之colsuse
后来发现在pandas的 read_table() 函数中有参数 colsuse 。
官方文档中的说明如下:
usecols :array-like or callable, default None
Return a subset of the columns. If array-like, all elements must either be positional (i.e. integer indices into the document columns) or strings that correspond to column names provided either by the user in names or inferred from the document header row(s). For example, a valid array-like usecols parameter would be [0, 1, 2] or [‘foo’, ‘bar’, ‘baz’].
If callable, the callable function will be evaluated against the column names, returning names where the callable function evaluates to True. An example of a valid callable argument would be lambda x: x.upper() in [‘AAA’, ‘BBB’, ‘DDD’]. Using this parameter results in much faster parsing time and lower memory usage.
使用方法
选取多列
可以像文档中所说直接列出列表,也可以用range直接生成列表。
df = pd . read_table ( file , delimiter = ' ' , header = None , usecols = range ( 0 , 54 ))
选取指定列
直接用[n]指定需读取的列。
程序
#coding=utf-8
import numpy as np
import pandas as pd
#导入数据预处理模块
from sklearn.preprocessing import Imputer
from sklearn.cross_validation import train_test_split
from sklearn.metrics import classification_report
#导入三种分类器
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
print ( "Initialized." )
def load_datasets1 ( data_path ):
feature = np . ndarray ( shape = ( 0 , 54 ))
#定义列数量与特征维度一致为54
label = np . ndarray ( shape = ( 0 , 1 ))
#定义列数量与标签维度一致为1
for file in data_path :
df = pd . read_table ( file , delimiter = ' ' , header = None , usecols = range ( 0 , 54 ))
#以空格分隔符读取数据文件的前54列,文件中不包含表头
feature = np . concatenate (( feature , df ))
#print(len(feature))
dl = pd . read_table ( file , delimiter = ' ' , header = None , usecols = [ 54 ])
#以空格分隔符读取数据文件的最后一列,文件中不包含表头
#将读取的数据的1-54行合并到特征集中
label = np . concatenate (( label , dl ))
#print(len(label))
#将读取的数据的第55行合并到标签集中
label = np . ravel ( label )
#将标签归整为一维向量
return feature , label
def load_datasets2 ( data_path ):
feature = np . ndarray ( shape = ( 0 , 54 ))
df = pd . read_table ( data_path [ 0 ], delimiter = ' ' , header = None )
feature = np . concatenate (( feature , df ))
return feature
if __name__ == '__main__' :
train_path = [ 'data_train.txt' ]
test_path = [ 'data_test.txt' ]
x_train , y_train = load_datasets1 ( train_path [: 1 ])
x_test = load_datasets2 ( test_path [: 1 ])
print ( len ( x_test ))
print ( 'Start training knn...' )
#训练k近邻分类器
knn = KNeighborsClassifier () . fit ( x_train , y_train )
print ( 'Training done' )
answer_knn = knn . predict ( x_test )
#预测
print ( answer_knn )
print ( 'Prediction done' )
print ( 'Start training DT...' )
#训练决策树分类器
dt = DecisionTreeClassifier () . fit ( x_train , y_train )
print ( 'Training done' )
answer_dt = dt . predict ( x_test )
print ( answer_dt )
print ( 'Prediction done' )
print ( 'Start training Bayes' )
#训练贝叶斯分类器
gnb = GaussianNB () . fit ( x_train , y_train )
print ( 'Training done' )
answer_gnb = gnb . predict ( x_test )
print ( answer_gnb )
print ( 'Prediction done' )
#把预测结果label分别保存在文件中
f1 = open ( 'model_1.txt' , 'w+' )
for i in answer_knn :
f1 . write ( str ( i ) + ' \n ' )
f1 . close ()
f2 = open ( 'model_2.txt' , 'w+' )
for i in answer_knn :
f2 . write ( str ( i ) + ' \n ' )
f2 . close ()
f3 = open ( 'model_3.txt' , 'w+' )
for i in answer_knn :
f3 . write ( str ( i ) + ' \n ' )
f3 . close ()
参考资料:
中国大学mooc
pandas.read_table()
文件读写
想做BCI有很久,一直没有动手。这两天看了恶补一些资料,对脑电的基本知识、应用和信号提取做一个综述。