Preface

Continuing with my studies.

matplotlib would not install through conda, so I downloaded a wheel by hand and installed that instead:

    pip install E:\dataSet\matplotlib-3.5.0-cp39-cp39-win_amd64.whl

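Preparing the dataset

This post uses the contact lenses dataset that the documentation uses.

Reading the dataset

The dataset looks like this: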

    1  1  1  1  1  3
    2  1  1  1  2  2
    3  1  1  2  1  3
    4  1  1  2  2  1
    5  1  2  1  1  3
    6  1  2  1  2  2
    7  1  2  2  1  3
    8  1  2  2  2  1
    9  2  1  1  1  3
    10 2  1  1  2  2
    11 2  1  2  1  3
    12 2  1  2  2  1
    13 2  2  1  1  3
    14 2  2  1  2  2
    15 2  2  2  1  3
    16 2  2  2  2  3
    17 3  1  1  1  3
    18 3  1  1  2  3
    19 3  1  2  1  3
    20 3  1  2  2  1
    21 3  2  1  1  3
    22 3  2  1  2  2
    23 3  2  2  1  3
    24 3  2  2  2  3

According to the dataset description:

    7. Attribute Information:
       -- 3 Classes
         1 : the patient should be fitted with hard contact lenses,
         2 : the patient should be fitted with soft contact lenses,
         3 : the patient should not be fitted with contact lenses.

        1. age of the patient: (1) young, (2) pre-presbyopic, (3) presbyopic
        2. spectacle prescription: (1) myope, (2) hypermetrope
        3. astigmatic: (1) no, (2) yes
        4. tear production rate: (1) reduced, (2) normal

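The first column is the row index, the last column is the class label, and the four columns in between are the feature values. Reading the file: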
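
    def file2DataSet(filename):
        # Read every line of the data file
        with open(filename) as fr:
            arrayOfLines = fr.readlines()
        myDataSet = []
        for line in arrayOfLines:
            # Collapse the double-space separators, then split on single spaces
            line = line.strip().replace("  ", " ")
            if not line:
                continue  # skip blank lines
            listFromLine = line.split(" ")
            # Drop the index column; the values stay as strings ("1", "2", "3")
            myDataSet.append(listFromLine[1:])
        return myDataSet
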
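
As a quick check of the column layout described above, here is a minimal sketch that parses a few rows from an in-memory string rather than from the file. It assumes the columns are separated by whitespace only and uses `str.split()` with no argument, which treats any run of spaces as one separator. The names `parse_rows` and `sample` are made up for this illustration and are not part of the code in this post.

```python
# Hypothetical helper for illustration only.
def parse_rows(text):
    """Split each non-empty line on whitespace and drop the leading index column."""
    rows = []
    for line in text.splitlines():
        fields = line.split()  # no argument: any run of whitespace is one separator
        if fields:
            rows.append(fields[1:])  # fields[0] is the row index
    return rows

# The first three rows of the lenses data shown earlier.
sample = """1  1  1  1  1  3
2  1  1  1  2  2
3  1  1  2  1  3"""

rows = parse_rows(sample)
print(rows)
```

Each parsed row should hold five strings: four feature values followed by the label.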
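Splitting the dataset

Building a decision tree means deciding which feature to test at each level; put simply, which question to ask first. To decide, compute the entropy of the dataset itself and the entropy of the dataset after splitting on each feature, then use the information gain to pick the feature. Splitting on a feature keeps the rows whose value for that feature matches and removes that column: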
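
    def splitDataSet(dataSet, axis, value):
        returnDataSet = []
        for row in dataSet:
            if row[axis] == value:
                # Keep the row, but without the column that was split on
                reducedRow = row[:axis]
                reducedRow.extend(row[axis + 1:])
                returnDataSet.append(reducedRow)
        return returnDataSet
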
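
To make the effect of a split concrete, here is a small sketch on a toy dataset. It is a compact rewrite of the same idea rather than the function above; `split_data_set` and the rows in `toy` are invented for the example.

```python
def split_data_set(data_set, axis, value):
    """Keep rows whose column `axis` equals `value`, with that column removed."""
    return [row[:axis] + row[axis + 1:] for row in data_set if row[axis] == value]

# Toy rows (made-up values): two feature columns followed by a label column.
toy = [
    ["1", "1", "yes"],
    ["1", "2", "no"],
    ["2", "1", "no"],
]

subset = split_data_set(toy, 0, "1")
print(subset)
```

Only the two rows whose first feature is "1" survive, and each of them is one column shorter than before.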
Computing entropy

Shannon entropy:
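
    import math

    def calcShannonEnt(dataSet):
        numberOfRows = len(dataSet)
        # Count how often each label (last column) occurs
        labelCounts = {}
        for row in dataSet:
            currentLabel = row[-1]
            if currentLabel not in labelCounts.keys():
                labelCounts[currentLabel] = 0
            labelCounts[currentLabel] += 1
        # H = -sum(p * log2(p)) over the label frequencies
        shannonCount = 0
        for label in labelCounts:
            prob = float(labelCounts[label]) / numberOfRows
            shannonCount -= prob * math.log2(prob)
        return shannonCount
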
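
As a worked example, the label column of the lenses data listed earlier contains 4 rows of class 1, 5 of class 2 and 15 of class 3. The sketch below computes the entropy of that distribution, which should come out at roughly 1.326 bits, and of a pure set, which is 0. The function `shannon_entropy` is a stand-alone illustration, not the function from this post; it uses the equivalent form sum of p * log2(1 / p).

```python
import math
from collections import Counter

def shannon_entropy(labels):
    """Entropy in bits: sum of p * log2(1 / p) over the label frequencies."""
    total = len(labels)
    return sum((n / total) * math.log2(total / n) for n in Counter(labels).values())

# Label counts taken from the lenses table: 4 hard, 5 soft, 15 no lenses.
lens_labels = ["1"] * 4 + ["2"] * 5 + ["3"] * 15
mixed = shannon_entropy(lens_labels)
pure = shannon_entropy(["3"] * 6)  # every row has the same label
print(round(mixed, 4), pure)
```

The more evenly the labels are mixed, the higher the value; a subset with a single label needs no further splitting.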
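Choosing the feature to split on

Compute the information gain of each feature (the Shannon entropy before the split minus the size-weighted entropy of the subsets after it), then take the feature with the largest gain: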
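
    def chooseBestFeatureToSplit(dataSet):
        numberOfFeatures = len(dataSet[0]) - 1  # the last column is the label
        baseEntropy = calcShannonEnt(dataSet)
        bestInfoGain = 0.0
        bestFeature = -1
        for x in range(numberOfFeatures):
            featureList = [row[x] for row in dataSet]
            uniqueValue = set(featureList)
            # Weighted entropy of the subsets produced by splitting on feature x
            newEntropy = 0
            for value in uniqueValue:
                subDataSet = splitDataSet(dataSet, x, value)
                prob = float(len(subDataSet)) / len(dataSet)
                newEntropy += prob * calcShannonEnt(subDataSet)
            infoGain = baseEntropy - newEntropy
            if infoGain > bestInfoGain:
                bestInfoGain = infoGain
                bestFeature = x
        return bestFeature
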
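
To see the selection in numbers, the sketch below recomputes the information gain of all four features on the 24 rows listed earlier. It is self-contained and written independently of the functions above; `entropy`, `info_gain`, `label_column` and the shortened feature names are assumptions made for the illustration. The rows are rebuilt with `itertools.product`, relying on the fact that the table enumerates every feature combination in order.

```python
import math
from collections import Counter
from itertools import product

def entropy(rows):
    """Shannon entropy (bits) of the label stored in the last column."""
    counts = Counter(row[-1] for row in rows)
    return sum((n / len(rows)) * math.log2(len(rows) / n) for n in counts.values())

def info_gain(rows, axis):
    """Entropy before the split minus the size-weighted entropy after it."""
    remainder = 0.0
    for value in {row[axis] for row in rows}:
        subset = [row for row in rows if row[axis] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder

# Labels copied row by row from the table (rows 1-8, 9-16, 17-24).
label_column = "32313231" "32313233" "33313233"
# The feature columns cycle through every combination, last column fastest.
rows = [list(combo) + [label]
        for combo, label in zip(product("123", "12", "12", "12"), label_column)]

names = ["age", "prescription", "astigmatic", "tear production rate"]  # shorthand
gains = {name: info_gain(rows, axis) for axis, name in enumerate(names)}
best = max(gains, key=gains.get)
for name, gain in gains.items():
    print(f"{name}: {gain:.4f}")
print("best:", best)
```

The tear production rate should have by far the largest gain (about 0.549 bits), which is consistent with it ending up at the root of the tree in the result further down.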
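Building the decision tree

The first thing to work out is how leaf nodes are created. There are two cases. In the first, every row in the dataset already has the same label, so that label is returned directly.

In the second, no features are left to split on and only the label column remains; in other words, rows with identical feature values carry different labels. Here we can count the occurrences and return the most frequent label: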
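
    def majorityCnt(labels):
        # Count each label, then sort by count in descending order
        labelsCount = {}
        for label in labels:
            if label not in labelsCount.keys():
                labelsCount[label] = 0
            labelsCount[label] += 1
        sortedLabelsCount = sorted(labelsCount.items(), key=lambda label: label[1], reverse=True)
        return sortedLabelsCount[0][0]
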
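
The same majority vote can also be written with the standard library's `collections.Counter`. This is only an alternative sketch; `majority_label` is a hypothetical name and the sample labels are made up.

```python
from collections import Counter

def majority_label(labels):
    """Return the most frequent label in the list."""
    return Counter(labels).most_common(1)[0][0]

winner = majority_label(["3", "1", "3", "2", "3"])
print(winner)
```

`most_common(1)` returns a one-element list of `(label, count)` pairs, so `[0][0]` picks out the label itself.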
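Next, build the decision tree. One thing to watch when copying a list: a plain assignment of the form b = a does not copy anything, it only creates a second reference to the same list, so changes made through one name show up under the other. To get a real copy, write something like b = a[:] instead.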
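
The difference between a reference and a copy is easy to demonstrate. In this minimal sketch the variable names are made up for the illustration, and the slice `[:]` is just one way to make a shallow copy.

```python
original = ["age of the patient", "spectacle prescription", "astigmatic"]

alias = original        # a second name for the same list object
del alias[0]            # this also removes the item from `original`
print(original)

shallow = original[:]   # a new list holding the same items
del shallow[0]          # `original` is not affected this time
print(original, shallow)
```

This matters in the recursive tree construction because entries are deleted from the feature lists at every level; without a copy per branch, sibling branches would see each other's deletions.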
The decision tree:
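
    def createTree(dataSet, features, valueForFeatures, labels):
        classList = [row[-1] for row in dataSet]
        # Leaf case 1: every row has the same label
        if classList.count(classList[0]) == len(classList):
            return labels[int(classList[0]) - 1]
        # Leaf case 2: no features left, so fall back to the majority label
        # (labels are strings, so convert before using the value as an index)
        if len(dataSet[0]) == 1:
            return labels[int(majorityCnt(classList)) - 1]
        bestFeature = chooseBestFeatureToSplit(dataSet)
        bestFeatureLabel = features[bestFeature]
        bestFeatureValues = valueForFeatures[bestFeature]
        myTree = {bestFeatureLabel: {}}
        del(features[bestFeature])
        del(valueForFeatures[bestFeature])
        featureValues = [row[bestFeature] for row in dataSet]
        uniqueValue = set(featureValues)
        for value in uniqueValue:
            # Give each branch its own copies, since the recursion deletes entries
            subFeatures = features[:]
            subValueForFeatures = valueForFeatures[:]
            subTree = createTree(splitDataSet(dataSet, bestFeature, value),
                                 subFeatures, subValueForFeatures, labels)
            myTree[bestFeatureLabel][bestFeatureValues[int(value) - 1]] = subTree
        return myTree
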
Finally, put everything together:

    labels = ["hard",
              "soft",
              "not be fitted with contact lenses"
              ]
    features = ["age of the patient",
                "spectacle prescription",
                "astigmatic",
                "tear production rate"
                ]
    valueForFeatures = [
        ["young", "pre-presbyopic", "presbyopic"],
        ["myope", "hypermetrope"],
        ["no", "yes"],
        ["reduced", "normal"]
    ]
    myDataSet = file2DataSet("E:/dataSet/tree/lenses.data")
    print(createTree(myDataSet, features, valueForFeatures, labels))

Result:

    {
        "tear production rate": {
            "normal": {
                "astigmatic": {
                    "yes": {
                        "spectacle prescription": {
                            "hypermetrope": {
                                "age of the patient": {
                                    "presbyopic": "not be fitted with contact lenses",
                                    "pre-presbyopic": "not be fitted with contact lenses",
                                    "young": "hard"
                                }
                            },
                            "myope": "hard"
                        }
                    },
                    "no": {
                        "age of the patient": {
                            "presbyopic": {
                                "spectacle prescription": {
                                    "hypermetrope": "soft",
                                    "myope": "not be fitted with contact lenses"
                                }
                            },
                            "pre-presbyopic": "soft",
                            "young": "soft"
                        }
                    }
                }
            },
            "reduced": "not be fitted with contact lenses"
        }
    }

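
A nested dictionary like this can be read by following one branch per tested feature until a string is reached. The sketch below shows one way that lookup might be done; the `classify` function, the `NONE` shorthand and the `patient` dictionary are assumptions added for illustration, and the tree literal is copied from the result above.

```python
NONE = "not be fitted with contact lenses"  # shorthand for the long label

tree = {
    "tear production rate": {
        "reduced": NONE,
        "normal": {"astigmatic": {
            "yes": {"spectacle prescription": {
                "myope": "hard",
                "hypermetrope": {"age of the patient": {
                    "young": "hard",
                    "pre-presbyopic": NONE,
                    "presbyopic": NONE,
                }},
            }},
            "no": {"age of the patient": {
                "young": "soft",
                "pre-presbyopic": "soft",
                "presbyopic": {"spectacle prescription": {
                    "hypermetrope": "soft",
                    "myope": NONE,
                }},
            }},
        }},
    },
}

def classify(node, patient):
    """Follow the branch for each tested feature until a leaf (a string) is reached."""
    while isinstance(node, dict):
        feature = next(iter(node))  # every inner node has exactly one feature key
        node = node[feature][patient[feature]]
    return node

patient = {
    "age of the patient": "young",
    "spectacle prescription": "myope",
    "astigmatic": "yes",
    "tear production rate": "normal",
}
result = classify(tree, patient)
print(result)
```

This patient corresponds to row 4 of the data (1 1 2 2), whose label is 1, so the lookup should end at "hard".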
Afterword

Drawing the tree with matplotlib can wait until later.