How to Visualize a Decision Tree Implemented in Python

A basic decision tree visualization demo.
Download: decision-tree-demo-src.zip (2.36 MB)
This project is mainly intended for teaching and helps new data mining learners understand decision trees better. With data customization and visual graph features, learners can experiment with their own training data and get a quick start with a basic decision tree. ID3 was chosen as the decision-tree generating algorithm; the algorithm follows the book Data Mining: Concepts and Techniques, Second Edition, by Jiawei Han, published by China Machine Press.
Thanks to the SWT and JGraphX projects. All contributions to this project are warmly welcome.
Source file list (condensed): Eclipse project files (.classpath, .project), sample training data (training0.csv – training3.csv), the Java sources and generated Javadoc (package-list, inherit.gif, …), screenshots (class-diagram-1.png, class-diagram-2.png, data-loaded.png, decision-tree.png, main.png, training-data-demo.jpg), bundled libraries (jgraphx.jar, swt.jar), and the LICENSE file.
Python Implementation of a Decision Tree
Implementing a decision tree: a decision tree is a predictive classification model learned from training samples. As with any tree data structure, the key elements are branches and nodes, and building a decision tree is mainly a matter of deciding how to branch and what each node holds.
The root node holds the full dataset. To create the first branch we have to split the original dataset, so we first look for the feature that separates the data best, i.e. the most discriminative feature. Once that feature is chosen, the dataset is partitioned by its values into child nodes, and each child node is then split recursively in the same way until the whole tree is built. Two questions therefore need answering: how to find the most discriminative feature at each split, and when to stop the recursion.
The discriminative power of a feature is quantified with information entropy, where entropy is defined as the expected value of the information. If the items to be classified may fall into several classes, the information of the symbol x_i is defined as
l(x_i) = -log2 p(x_i)
where p(x_i) is the probability of choosing that class.
To compute the entropy, we take the expected value of the information over all classes:
H = -Σ_{i=1}^{n} p(x_i) log2 p(x_i)
where n is the number of classes; H is the information entropy. The difference between the entropy of the original, unsplit data and the (weighted) entropy after a split is called the information gain. The larger the information gain, the more the split moves the data from disorder toward order, and hence the more discriminative the feature is.
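As a quick numeric check of the formula (an added example, not from the original text): the five-sample toy dataset used in the code below has two 'yes' and three 'no' labels, so its entropy is about 0.971 bits:
from math import log
H = -(2.0/5) * log(2.0/5, 2) - (3.0/5) * log(3.0/5, 2)
print(round(H, 3))   # 0.971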
Recursively building the decision tree
Once the most discriminative feature has been found via the information entropy, the data can be partitioned on that feature and the tree built recursively. The recursion stops when either every feature available for splitting has been used, or every sample under a branch belongs to the same class. This is the ID3 algorithm: at each step it takes the attribute whose split reduces the information entropy fastest (i.e. has the largest information gain) as the split criterion.
The implementation below uses some slightly more advanced dictionary and list operations; comments have been added to make it easier to revisit later:
# -*- coding: utf-8 -*-
"""
Created on Thu Oct  6 10:21:05 2016

@author: Administrator
"""
from math import log
import numpy as np
import operator

# build the toy training set (dataSet) and the feature labels
def createDataset():
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    return dataSet, labels
# compute the Shannon entropy of a dataset
def calcShannonEnt(dataSet):
    numEntris = len(dataSet)
    # labelcounts maps each class label to the number of samples in that class
    labelcounts = {}
    for featVec in dataSet:
        # the last element of each sample is its class label
        currentlabel = featVec[-1]
        if currentlabel not in labelcounts.keys():
            # first time this class is seen: initialise its counter
            labelcounts[currentlabel] = 0
        # count the current sample towards its class
        labelcounts[currentlabel] += 1
    shannonEnt = 0.0
    # accumulate the Shannon entropy
    for key in labelcounts:
        prob = float(labelcounts[key]) / numEntris
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt
# split the dataset (dataset, index of the splitting feature, feature value to keep)
def spiltDataSet(dataSet, axis, value):
    # Python passes lists by reference, so build a new list here
    # instead of modifying the caller's data
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            # drop the current feature (it has already been used for this split)
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
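# Added example (not in the original post): a quick check of the split helper.
# Keeping the samples whose first feature equals 1 and dropping that feature:
# print(spiltDataSet(createDataset()[0], 0, 1))   # [[1, 'yes'], [1, 'yes'], [0, 'no']]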
# choose the best feature to split on (dataset)
def chooseBestFeatureToSplit(dataset):
    numFeatures = len(dataset[0]) - 1
    # entropy of the original, unsplit dataset
    bestEntropy = calcShannonEnt(dataset)
    # best information gain found so far
    bestInfoGain = 0.0
    # index of the best feature found so far
    bestFeature = -1
    for i in range(numFeatures):
        # collect the values of the i-th feature
        featList = [example[i] for example in dataset]
        # a set removes duplicate feature values
        uniqueVals = set(featList)
        newEntropy = 0.0
        for value in uniqueVals:
            # split the dataset on each value of the current feature
            subDataSet = spiltDataSet(dataset, i, value)
            # accumulate the weighted entropy of this split
            prob = len(subDataSet) / float(len(dataset))
            newEntropy += prob * calcShannonEnt(subDataSet)
        # information gain of splitting on the current feature
        infoGain = bestEntropy - newEntropy
        # keep the feature with the largest information gain
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature
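# Added example (not in the original post): on the toy dataset above, splitting on
# feature 0 ('no surfacing') gives a gain of about 0.420 bits versus about 0.171 bits
# for feature 1 ('flippers'), so chooseBestFeatureToSplit(createDataset()[0]) returns 0.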
# If all features have been used up but a branch still contains samples of
# different classes, decide the branch's class by majority vote:
# the branch is labelled with whichever class occurs most often under it.
def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount.keys():
            classCount[vote] = 0
        classCount[vote] += 1
    # sort the (class, count) pairs by count (the second field), descending
    # (in Python 2 this was classCount.iteritems(); items() works in Python 3)
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
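# Added example (not in the original post):
# majorityCnt(['no', 'yes', 'no']) returns 'no' (two votes against one).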
# recursively build the decision tree
def creatertree(dataset, labels):
    classList = [example[-1] for example in dataset]
    # if every entry in classList equals the first one, all samples under this
    # branch belong to the same class: stop recursing
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # all splitting features have been used: stop recursing and take a majority vote
    if len(dataset[0]) == 1:
        return majorityCnt(classList)
    # pick the most discriminative feature
    bestFeat = chooseBestFeatureToSplit(dataset)
    bestFeatLabel = labels[bestFeat]
    # the tree is represented as nested dictionaries
    myTree = {bestFeatLabel: {}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataset]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]
        # build each subtree recursively
        myTree[bestFeatLabel][value] = creatertree(spiltDataSet(dataset, bestFeat, value), subLabels)
    return myTree
# classification function (decision tree, feature labels, sample to classify)
def classify(inputTree, featLabels, testVec):
    firstSides = list(inputTree.keys())
    # the first key is the feature tested at the root of this (sub)tree
    firstStr = firstSides[0]
    # note the Python 3 / Python 2 difference: in Python 2.7 the two lines above
    # were simply firstStr = inputTree.keys()[0]
    secondDict = inputTree[firstStr]
    # index of firstStr in the feature label list
    featIndex = featLabels.index(firstStr)
    for key in secondDict.keys():
        # compare the sample's value for this feature with each branch value
        if testVec[featIndex] == key:
            # checking whether a value is a dict is enough to tell internal nodes from leaves
            if type(secondDict[key]) == dict:
                # internal node: keep descending the tree
                classLabel = classify(secondDict[key], featLabels, testVec)
            else:
                # leaf node: its value is the predicted class
                classLabel = secondDict[key]
    return classLabel
# return one of two hard-coded example trees (handy for testing without rebuilding)
def retrieveTree(i):
    listOfTrees = [{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}},
                   {'no surfacing': {0: 'no', 1: {'flippers': {0: {'head': {0: 'no', 1: 'yes'}}, 1: 'no'}}}}]
    return listOfTrees[i]

myDat, labels = createDataset()
myTree = retrieveTree(0)
mylabel = classify(myTree, labels, [1, 1])
print(mylabel)
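As an added end-to-end sketch (not part of the original post), the tree can also be built directly from the training data instead of fetching the hard-coded copy from retrieveTree. Note that creatertree() deletes entries from the label list it is given, so pass it a copy and keep the original list for classify():
myDat, labels = createDataset()
builtTree = creatertree(myDat, labels[:])
print(builtTree)                            # {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
print(classify(builtTree, labels, [1, 0]))  # 'no'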
Decision Trees with Python 3.4
#!/usr/bin/env python
# coding=utf-8
import numpy as np
from sklearn import tree
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import classification_report
# note: in newer scikit-learn versions train_test_split lives in sklearn.model_selection
from sklearn.cross_validation import train_test_split
import pydot
from sklearn.externals.six import StringIO
# assumed import for the Image(...) call further below (only useful inside IPython/Jupyter)
from IPython.display import Image


def loadDataSet():
    data = []
    label = []
    with open('D:python/fat.txt') as file:
        for line in file:
            tokens = line.strip().split(' ')
            data.append([float(tk) for tk in tokens[:-1]])
            label.append(tokens[-1])
    x = np.array(data)
    print('x:')
    print(x)
    label = np.array(label)
    y = np.zeros(label.shape)
    y[label == 'fat'] = 1
    print('y:')
    print(y)
    return x, y


def decisionTreeClf():
    x, y = loadDataSet()
    # split into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
    print('x_train:')
    print(x_train)
    print('x_test:')
    print(x_test)
    print('y_train:')
    print(y_train)
    print('y_test:')
    print(y_test)
    # use information entropy as the split criterion
    clf = tree.DecisionTreeClassifier(criterion='entropy')
    print(clf)
    clf.fit(x_train, y_train)
    # export the fitted tree in Graphviz dot format, both to a file and to a string buffer
    dot_data = StringIO()
    with open("iris.dot", 'w') as f:
        f = tree.export_graphviz(clf, out_file=f)
    tree.export_graphviz(clf, out_file=dot_data)
    # render the dot description to a PDF (and a PNG for notebook display) with pydot
    graph = pydot.graph_from_dot_data(dot_data.getvalue())
    graph[0].write_pdf("ex.pdf")
    Image(graph[0].create_png())
    # print how much each feature contributes to the classification
    print(clf.feature_importances_)
    # print the predictions on the training set
    answer = clf.predict(x_train)
    print('x_train:')
    print(x_train)
    print('answer:')
    print(answer)
    print('y_train:')
    print(y_train)
    print('training accuracy:')
    print(np.mean(answer == y_train))
    # precision and recall
    precision, recall, thresholds = precision_recall_curve(y_train, clf.predict(x_train))
    # for this tiny dataset the fully grown tree predicts 0/1 probabilities,
    # so the second column of predict_proba can be fed to classification_report
    answer = clf.predict_proba(x)[:, 1]
    print(classification_report(y, answer, target_names=['thin', 'fat']))


decisionTreeClf()

The dataset file fat.txt contains (each row: height, weight, label):
1.5 50 thin
1.5 60 fat
1.6 40 thin
1.6 60 fat
1.7 60 thin
1.7 80 fat
1.8 60 thin
1.8 90 fat
1.9 70 thin
1.9 80 fat

Required Python packages: pyparsing (2.1.10), scikit-learn (0.18.1), pygraphviz (1.3.1), plus pydot, which the code above imports. pygraphviz/pydot are the visualization packages; the Graphviz tool itself (e.g. graphviz-2.38.msi on Windows) must also be installed.
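If installing Graphviz and pydot is inconvenient, a lighter-weight alternative (an added note, not from the original article) is the matplotlib-based plotting that newer scikit-learn versions (0.21 and later) provide as sklearn.tree.plot_tree. A minimal sketch, assuming such a version and a classifier fitted as above; the feature names are my own guesses for the two columns of fat.txt:
import matplotlib.pyplot as plt
from sklearn import tree

def show_tree(clf):
    # draw the fitted tree directly with matplotlib; no dot file or Graphviz needed
    fig, ax = plt.subplots(figsize=(8, 6))
    tree.plot_tree(clf,
                   feature_names=['height', 'weight'],  # assumed column names for fat.txt
                   class_names=['thin', 'fat'],
                   filled=True,
                   ax=ax)
    plt.show()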
Decision trees are among the most frequently used data mining algorithms. A decision tree classifier is like a flowchart made of decision blocks and terminating blocks: the terminating blocks are the classification results (the leaves of the tree), while a decision block tests the value of one feature (the block has one branch per possible value of that feature). Growing a decision tree is a process of repeatedly partitioning a dataset, and the guiding principle of every partition is to make disordered data more ordered. If a training set has 20 features, which one should be used as the split criterion? This has to be decided quantitatively, and one family of quantitative criteria comes from information theory. Information-theory-based decision tree algorithms include ID3, C4.5 and CART, where C4.5 and CART are derived from ID3. This article introduces the ID3 and C4.5 algorithms, implements them in Python, and uses a decision tree to decide whether a patient can wear contact lenses.
This post studies a Python implementation of the decision tree algorithm and how to draw the resulting tree with matplotlib.
For reference, (1) the least-squares regression tree generation algorithm and (2) the CART generation algorithm, including equation (5.25), are given in Statistical Learning Methods (see the references); the algorithm boxes themselves are not reproduced here.
1. Compute the Shannon entropy of a given dataset
# imports used by the functions in this post
from math import log
import operator
import matplotlib.pyplot as plt

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts.keys(): labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key])/numEntries
        shannonEnt -= prob * log(prob,2)
    return shannonEnt
2. Split the dataset on a given feature
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
3. Choose the best way (feature) to split the dataset
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0; bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet)/float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy
        if (infoGain > bestInfoGain):
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature
4. Majority vote: return the most common class name
def majorityCnt(classList):
    classCount={}
    for vote in classList:
        if vote not in classCount.keys(): classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
5. Recursively build the tree
def createTree(dataSet,labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel:{}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value),subLabels)
    return myTree
6. Get the number of leaf nodes
def getNumLeafs(myTree):
    numLeafs = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__=='dict':
            numLeafs += getNumLeafs(secondDict[key])
        else:
            numLeafs += 1
    return numLeafs
7. Get the depth (number of levels) of the tree
def getTreeDepth(myTree):
    maxDepth = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__=='dict':
            thisDepth = 1 + getTreeDepth(secondDict[key])
        else:
            thisDepth = 1
        if thisDepth > maxDepth: maxDepth = thisDepth
    return maxDepth
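As an added quick check (not from the original post), the two helpers can be tried on a small hand-written tree:
sampleTree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
print(getNumLeafs(sampleTree))   # 3
print(getTreeDepth(sampleTree))  # 2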
8. Compute the midpoint between a parent and a child node and add a simple text label there
def plotMidText(cntrPt, parentPt, txtString):
    xMid = (parentPt[0]-cntrPt[0])/2.0 + cntrPt[0]
    yMid = (parentPt[1]-cntrPt[1])/2.0 + cntrPt[1]
    createPlot.ax1.text(xMid, yMid, txtString, va="center", ha="center", rotation=30)
9. Draw a node using a text annotation
decisionNode = dict(boxstyle="sawtooth", fc="0.8")
leafNode = dict(boxstyle="round4", fc="0.8")
arrow_args = dict(arrowstyle="<-")

def plotNode(nodeTxt, centerPt, parentPt, nodeType):
    createPlot.ax1.annotate(nodeTxt, xy=parentPt,
                            xycoords='axes fraction',
                            xytext=centerPt, textcoords='axes fraction',
                            va="center", ha="center", bbox=nodeType, arrowprops=arrow_args)
10. Plot the whole tree recursively
def plotTree(myTree, parentPt, nodeTxt):
    numLeafs = getNumLeafs(myTree)
    depth = getTreeDepth(myTree)
    firstStr = list(myTree.keys())[0]
    cntrPt = (plotTree.xOff + (1.0 + float(numLeafs))/2.0/plotTree.totalW, plotTree.yOff)
    plotMidText(cntrPt, parentPt, nodeTxt)
    plotNode(firstStr, cntrPt, parentPt, decisionNode)
    secondDict = myTree[firstStr]
    plotTree.yOff = plotTree.yOff - 1.0/plotTree.totalD
    for key in secondDict.keys():
        if type(secondDict[key]).__name__=='dict':
            plotTree(secondDict[key],cntrPt,str(key))
        else:
            plotTree.xOff = plotTree.xOff + 1.0/plotTree.totalW
            plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)
            plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
    plotTree.yOff = plotTree.yOff + 1.0/plotTree.totalD
11. Create the figure and draw the tree
def createPlot(inTree):
    fig = plt.figure(1, facecolor='white')
    axprops = dict(xticks=[], yticks=[])
    createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)
    plotTree.totalW = float(getNumLeafs(inTree))
    plotTree.totalD = float(getTreeDepth(inTree))
    plotTree.xOff = -0.5/plotTree.totalW; plotTree.yOff = 1.0;
    plotTree(inTree, (0.5,1.0), '')
    plt.show()
fr = open('lenses.txt')
lenses = [inst.strip().split('\t') for inst in fr.readlines()]
lensesLabels = ['age','prescript','astigmatic','tearRate']
lensesTree = createTree(lenses,lensesLabels)
createPlot(lensesTree)
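If lenses.txt is not at hand, the plotting code can also be tried on a small hand-written tree (an added example, not from the original post):
createPlot({'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}})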
12. The lenses.txt data
lenses.txt (shipped with Machine Learning in Action) is a tab-separated file of 24 samples; each row lists the four feature values age, prescript, astigmatic and tearRate followed by the recommended lens type (no lenses, soft or hard).
References:
(1) Machine Learning in Action
(2) Statistical Learning Methods