Preface
Using naive Bayes for a classification problem.
Preparing the dataset
We'll just reuse the contact-lenses dataset from before; text datasets are too much hassle to preprocess.
Reading the dataset
def file2DataSet(_filename):
    f = open(_filename)
    arrayOfLines = f.readlines()
    f.close()
    myDataSet = []
    myCategory = []
    for line in arrayOfLines:
        line = line.strip()
        line = line.replace("  ", " ")  # collapse the double-space separators in lenses.data
        listFromLine = line.split(" ")
        # drop the leading index column; attribute values and the class label are ints
        myDataSet.append([int(v) for v in listFromLine[1:-1]])
        myCategory.append(int(listFromLine[-1]))
    return myDataSet, myCategory
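To make the parsing step concrete, here is a minimal self-contained sketch of what each line goes through. The sample line below only imitates the two-space-separated layout of lenses.data; the values are made up, not a real record.

```python
# Sketch of the per-line parsing in file2DataSet, on one made-up line.
def parse_line(line):
    line = line.strip().replace("  ", " ")   # collapse the double-space separator
    fields = line.split(" ")
    # drop the leading index column; attributes become ints, last field is the class
    return [int(v) for v in fields[1:-1]], int(fields[-1])

attrs, label = parse_line("1  2  1  1  1  3")
print(attrs, label)   # → [2, 1, 1, 1] 3
```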
Then split it into training and test sets at a 7:3 ratio:
myDataSetForTrain, myDataSetForTest = split(myDataSet, (math.ceil(len(myDataSet) * 0.7),))
myCategoryForTrain, myCategoryForTest = split(myCategory, (math.ceil(len(myCategory) * 0.7),))
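The split above assumes `split` comes from numpy and `math.ceil` marks the cut point. A quick sketch on toy data (not the real lenses records) to show what the index tuple does:

```python
# 7:3 split with numpy.split: the tuple (cut,) marks where to cut.
import math
from numpy import split, array

data = list(range(10))               # 10 toy samples
cut = math.ceil(len(data) * 0.7)     # first 7 go to training
train, test = split(array(data), (cut,))
print(len(train), len(test))         # → 7 3
```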
Computing probabilities
Since the training set has several attributes, each with several possible values, the attributes first have to be expanded into 0/1 features. Even after that expansion the attributes are not mutually independent, so in the end the data isn't an especially good fit for naive Bayes anyway.
def trainNBO(_dataSetForTrain, _categoryForTrain):
    numTrainData = len(_dataSetForTrain)
    numAttributes = len(_dataSetForTrain[0])
    # one counter slot per (attribute, value) pair, values in {1, 2, 3}
    p1Num = zeros(numAttributes * 3)
    p2Num = zeros(numAttributes * 3)
    p3Num = zeros(numAttributes * 3)
    p1Denom = 0.0; p2Denom = 0.0; p3Denom = 0.0
    for x in range(numTrainData):
        if _categoryForTrain[x] == 1:
            for y in range(numAttributes):
                p1Num[3 * y + _dataSetForTrain[x][y] - 1] += 1
            p1Denom += numAttributes
        elif _categoryForTrain[x] == 2:
            for y in range(numAttributes):
                p2Num[3 * y + _dataSetForTrain[x][y] - 1] += 1
            p2Denom += numAttributes
        elif _categoryForTrain[x] == 3:
            for y in range(numAttributes):
                p3Num[3 * y + _dataSetForTrain[x][y] - 1] += 1
            p3Denom += numAttributes
    p1Vec = p1Num / p1Denom
    p2Vec = p2Num / p2Denom
    p3Vec = p3Num / p3Denom
    # with the 4 lenses attributes, pXDenom / (4 * numTrainData) reduces to the class prior count(class) / N
    return p1Vec, p2Vec, p3Vec, (p1Denom / (4 * numTrainData)), (p2Denom / (4 * numTrainData)), (p3Denom / (4 * numTrainData))
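One caveat with the pure frequency counts above: any attribute value never seen for a class ends up with probability 0. The usual remedy (not applied in the code above) is Laplace smoothing plus working in log space. A hedged sketch on a single toy attribute with values in {1, 2, 3}:

```python
# Laplace-smoothed variant of the frequency estimate: add 1 to every count
# and the number of possible values (3) to the denominator, so no slot is
# ever exactly zero. Toy counts, illustrative only.
from numpy import array, log

raw_counts = array([2.0, 0.0, 0.0])                    # value 1 seen twice
smoothed = (raw_counts + 1) / (raw_counts.sum() + 3)   # Laplace smoothing
print(smoothed)            # → 0.6, 0.2, 0.2 — no zeros left
log_probs = log(smoothed)  # log space avoids underflow when combining many terms
```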
Processing the test set
Same approach as with the training set.
def prepareDataSetForTest(_dataSetForTest):
    numTestData = len(_dataSetForTest)
    numAttributes = len(_dataSetForTest[0])
    preparedDataSetForTest = zeros((numTestData, numAttributes * 3))
    for x in range(numTestData):
        for y in range(numAttributes):
            preparedDataSetForTest[x][3 * y + _dataSetForTest[x][y] - 1] += 1
    return preparedDataSetForTest
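The expansion is easiest to see on a single row: attribute value v of attribute y lands in slot 3*y + v - 1, so every attribute becomes three 0/1 slots. A sketch with a made-up row, assuming values in {1, 2, 3}:

```python
# One-hot expansion of a single toy test row, same indexing as above.
from numpy import zeros

row = [2, 1, 3]                   # 3 attributes, illustrative values
encoded = zeros(len(row) * 3)
for y, v in enumerate(row):
    encoded[3 * y + v - 1] = 1
print(encoded)                    # slots 1, 3, and 8 are set
```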
Classification
def classifyNB(_vec2classify, _p1Vec, _p2Vec, _p3Vec, _pClass1, _pClass2, _pClass3):
    p1 = sum(_vec2classify * _p1Vec) + log(_pClass1)
    p2 = sum(_vec2classify * _p2Vec) + log(_pClass2)
    p3 = sum(_vec2classify * _p3Vec) + log(_pClass3)
    if p1 > p2 and p1 > p3:
        return 1
    if p2 > p1 and p2 > p3:
        return 2
    return 3
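The decision rule is just an argmax over the three scores, with ties falling through to class 3. A self-contained sketch on made-up probability vectors and priors (none of these numbers come from the real training run):

```python
# Sketch of the scoring in classifyNB on illustrative numbers.
from numpy import array, log

vec = array([1, 0, 1, 0, 0, 1])                   # a one-hot encoded sample
p1Vec = array([0.3, 0.1, 0.1, 0.2, 0.2, 0.1])
p2Vec = array([0.1, 0.2, 0.4, 0.1, 0.1, 0.1])
p3Vec = array([0.2, 0.2, 0.2, 0.2, 0.1, 0.1])
scores = [sum(vec * p) + log(prior)
          for p, prior in [(p1Vec, 0.3), (p2Vec, 0.4), (p3Vec, 0.3)]]
print(scores.index(max(scores)) + 1)              # → 2
```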
Testing
import math
from numpy import *

myDataSet, myCategory = file2DataSet("E:/dataSet/tree/lenses.data")
myDataSetForTrain, myDataSetForTest = split(myDataSet, (math.ceil(len(myDataSet) * 0.7),))
myCategoryForTrain, myCategoryForTest = split(myCategory, (math.ceil(len(myCategory) * 0.7),))
p1Vec, p2Vec, p3Vec, pClass1, pClass2, pClass3 = trainNBO(myDataSetForTrain, myCategoryForTrain)
myDataSetForTest = prepareDataSetForTest(myDataSetForTest)
errorCount = 0
numForTest = len(myDataSetForTest)
for x in range(numForTest):
    result = classifyNB(myDataSetForTest[x], p1Vec, p2Vec, p3Vec, pClass1, pClass2, pClass3)
    print("the classifier came back, with: %d, the real answer is: %d" % (result, myCategoryForTest[x]))
    if result != myCategoryForTest[x]:
        errorCount += 1.0
print("the total error rate is: %f" % (errorCount / float(numForTest)))
Result:
(ml) PS D:\ml\bayes> python .\bayes.py
the classifier came back, with: 3, the real answer is: 3
the classifier came back, with: 3, the real answer is: 3
the classifier came back, with: 3, the real answer is: 1
the classifier came back, with: 3, the real answer is: 3
the classifier came back, with: 3, the real answer is: 2
the classifier came back, with: 3, the real answer is: 3
the classifier came back, with: 3, the real answer is: 3
the total error rate is: 0.285714
The result is mediocre; it may also have something to do with the training set containing too many class-3 samples.
Postscript
Nothing to add.