kmeans algorithm Java implementation: how to implement it in Python


Source: WebSpider crawl  Time: 2017-07-08 09:41  Tags: kmeans algorithm application examples

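A Kmeans++ algorithm example implemented in Python
1. Starting with Kmeans
Kmeans is a very basic clustering algorithm built on iteration; its principle is not covered here. The following shows how to use the kmeans function in matlab.
Create 7 two-dimensional data points:
x=[randn(3,2)*.4;randn(4,2)*.5+ones(4,1)*[4 4]];
Use the kmeans function:
class = kmeans(x, 2);
x holds the data points, one per row; 2 specifies 2 centers, i.e. the clustering result should have 2 clusters. class will be a column vector with 7 elements, one for each of the 7 data points in order, and each element's value is the cluster number of its data point. After one particular run, the value of class was:
2 2 2 1 1 1 1
This means the first three data points of x belong to cluster 2 and the last four belong to cluster 1. The kmeans function can also be used like this:
>> [class, C, sumd, D] = kmeans(x, 2)
class still gives the cluster of each data point; C contains the final centers, one per row; sumd holds, for each center, the sum of the distances between it and the data points in its cluster; each row of D also corresponds to one data point, and its values are the distances from that data point to each center in turn. By default kmeans uses the squared Euclidean distance (reference [3]). The distance can also be the Manhattan distance (L1 distance) or another type, selected by passing an additional parameter.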
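For readers following along in Python, the four outputs described above can be reproduced with a few lines of NumPy. This is only a sketch under assumptions: the helper name kmeans_outputs is made up here, labels are 0-based rather than MATLAB's 1-based, and the starting centers are fixed by hand instead of chosen by MATLAB's own initialisation.

```python
import numpy as np

def kmeans_outputs(x, centers, n_iter=100):
    """Plain Lloyd iterations from given starting centers (hypothetical helper).

    Returns labels, centers, sumd and D, mirroring [class, C, sumd, D].
    Assumes no cluster becomes empty during the iterations.
    """
    centers = np.asarray(centers, dtype=float).copy()
    for _ in range(n_iter):
        # D[i, j] = squared Euclidean distance from point i to center j
        D = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = D.argmin(axis=1)
        new_centers = np.array([x[labels == j].mean(axis=0)
                                for j in range(len(centers))])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    D = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = D.argmin(axis=1)
    # sumd[j] = sum of squared distances inside cluster j
    sumd = np.array([D[labels == j, j].sum() for j in range(len(centers))])
    return labels, centers, sumd, D

rng = np.random.default_rng(0)
# 3 points near (0, 0) and 4 points near (4, 4), as in the matlab example
x = np.vstack([rng.standard_normal((3, 2)) * 0.4,
               rng.standard_normal((4, 2)) * 0.5 + 4.0])
labels, C, sumd, D = kmeans_outputs(x, centers=x[[0, -1]])
print(labels)
print(C)
```

Each row of D lists the squared distances from one data point to the two centers, so D.argmin(axis=1) recovers the labels.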
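kmeans has several drawbacks (noted in many references):
1. The final number of clusters k (i.e. the number of centers, or seed points) is not necessarily known in advance, so choosing a suitable k is a problem.
2. The quality of the initial seed points affects the clustering result.
3. It is sensitive to noise and outliers.
4. And so on.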
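The second drawback is easy to see on a toy data set. The sketch below is illustrative only: lloyd_1d is a made-up helper, the data are nine hand-picked 1-D values in three groups, and the two sets of seed points are chosen by hand to contrast a good start with a poor one.

```python
def lloyd_1d(data, centers, n_iter=50):
    """Plain k-means iterations on 1-D data; returns (centers, sse)."""
    centers = list(centers)
    for _ in range(n_iter):
        groups = [[] for _ in centers]
        for v in data:
            nearest = min(range(len(centers)),
                          key=lambda j: (v - centers[j]) ** 2)
            groups[nearest].append(v)
        # keep the old center if a cluster ends up empty
        new_centers = [sum(g) / len(g) if g else c
                       for g, c in zip(groups, centers)]
        if new_centers == centers:
            break
        centers = new_centers
    # sum of squared distances to the nearest center
    sse = sum(min((v - c) ** 2 for c in centers) for v in data)
    return centers, sse

data = [0.0, 0.5, 1.0, 10.0, 10.5, 11.0, 20.0, 20.5, 21.0]
good_centers, good_sse = lloyd_1d(data, [0.0, 10.0, 20.0])  # one seed per group
bad_centers, bad_sse = lloyd_1d(data, [0.0, 0.5, 1.0])      # all seeds in one group
print(good_centers, good_sse)
print(bad_centers, bad_sse)
```

Both runs converge, but the second one stops with two centers sharing the first group and one center covering the other two groups, so its total squared error is far larger.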
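2. The basic idea of the kmeans++ algorithm
The main work of kmeans++ lies in choosing the seed points. The basic principle is to make the seed points as far apart from each other as possible while still limiting the influence of noise. The basic idea is:
1. Randomly choose one point from the input data set (k clusters are required) as the first cluster center.
2. For every point x in the data set, compute its distance D(x) to the nearest cluster center (among those already chosen).
3. Choose a new data point as a new cluster center, such that a point with a larger D(x) has a larger probability of being chosen.
4. Repeat steps 2 and 3 until k cluster centers have been chosen.
5. Run the standard k-means algorithm with these k initial cluster centers.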
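Steps 1 to 4 can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the implementation discussed later: kpp_seeds and sq_dist are names made up here, D(x) is taken to be the squared Euclidean distance (the weighting used in the original k-means++ paper), and the weighted draw is delegated to random.choices. Step 5, the ordinary k-means run, is left out.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kpp_seeds(points, k, rng=random):
    """Pick k seed points following steps 1-4 (hypothetical helper)."""
    centers = [rng.choice(points)]                 # step 1
    while len(centers) < k:                        # step 4
        # step 2: D(x) = squared distance to the nearest chosen center
        d = [min(sq_dist(p, c) for c in centers) for p in points]
        # step 3: draw one point with probability proportional to D(x);
        # this raises ValueError if every D(x) is zero
        centers.append(rng.choices(points, weights=d, k=1)[0])
    return centers

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0),
          (5.1, 4.9), (10.0, 0.0), (9.9, 0.2)]
seeds = kpp_seeds(points, 3, random.Random(0))
print(seeds)
```

A point that is already a center has D(x) = 0, so it cannot be drawn again as long as the data contain no duplicates.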
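Suppose the data set X has n data points, denoted X(1), X(2), ..., X(n). In step 2 we compute, for each data point in turn, the distance to its nearest seed point (cluster center), which gives the set D made up of D(1), D(2), ..., D(n). To avoid noise, we should not simply take the largest element of D; instead we should pick one of the larger elements and use its data point as the next seed point.
How do we pick one of the larger elements? Here is one approach (the original source has not been found; it is mentioned in reference [2] and elsewhere, and is rephrased here in a way the author finds easier to understand). Think of each element D(x) of the set D as a line segment L(x) whose length is the element's value. Join these segments end to end in the order L(1), L(2), ..., L(n) to form one long line L. L(1), L(2), ..., L(n) are called the sub-lines of L. By basic probability, if we pick a point on L at random, the sub-line it falls on is likely to be one of the longer ones, and the data point belonging to that sub-line can be used as the seed point. Both kmeans++ implementations below follow this principle.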
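The line picture maps directly onto a cumulative sum followed by a search. The following sketch assumes nothing beyond the standard library; pick_by_length is a made-up name, and the four lengths are arbitrary example values.

```python
import random
from bisect import bisect_right
from itertools import accumulate

def pick_by_length(lengths, u):
    """Index of the sub-line containing the point at fraction u (0 <= u < 1) of L."""
    ends = list(accumulate(lengths))   # right end of every sub-line on L
    target = u * ends[-1]              # position of the random point on L
    # first sub-line whose right end lies beyond the target
    return bisect_right(ends, target)

lengths = [1.0, 0.0, 7.0, 2.0]         # D(1)..D(4); the second point is a center
counts = [0] * len(lengths)
rng = random.Random(42)
for _ in range(10000):
    counts[pick_by_length(lengths, rng.random())] += 1
print(counts)
```

The sub-line of length 7 is chosen most often but not always, which is exactly the behaviour wanted: large distances are favoured, yet a single far-away outlier is not guaranteed to become a seed. A sub-line of length 0 is never chosen.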
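3. Python version of kmeans++
Implementations of Kmeans++ in many programming languages can be found at http://rosettacode.org/wiki/K-means%2B%2B_clustering . The following is the Python implementation (the explanatory comments in kpp and lloyd were added by the author of this article):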
from math import pi, sin, cos
from collections import namedtuple
from random import random, choice
from copy import copy

try:
    # psyco is an optional Python 2 JIT compiler; skip it when unavailable
    import psyco
    psyco.full()
except ImportError:
    pass

FLOAT_MAX = 1e100

class Point:
    __slots__ = ["x", "y", "group"]
    def __init__(self, x=0.0, y=0.0, group=0):
        self.x, self.y, self.group = x, y, group

def generate_points(npoints, radius):
    points = [Point() for _ in range(npoints)]
    # note: this is not a uniform 2-d distribution
    for p in points:
        r = random() * radius
        ang = random() * 2 * pi
        p.x = r * cos(ang)
        p.y = r * sin(ang)
    return points

def nearest_cluster_center(point, cluster_centers):
    """Distance and index of the closest cluster center"""
    def sqr_distance_2D(a, b):
        return (a.x - b.x) ** 2 + (a.y - b.y) ** 2

    min_index = point.group
    min_dist = FLOAT_MAX
    for i, cc in enumerate(cluster_centers):
        d = sqr_distance_2D(cc, point)
        if min_dist > d:
            min_dist = d
            min_index = i
    return (min_index, min_dist)

# points are the data points; cluster_centers holds the nclusters
# initialized centers, all of which start out as the object (0,0,0)
def kpp(points, cluster_centers):
    cluster_centers[0] = copy(choice(points))  # pick the first center at random
    # d has length len(points) and stores each point's distance to its nearest center
    d = [0.0 for _ in range(len(points))]
    for i in range(1, len(cluster_centers)):  # i = 1 ... len(cluster_centers)-1
        total = 0
        for j, p in enumerate(points):
            # smallest distance between the j-th data point p and the chosen centers
            d[j] = nearest_cluster_center(p, cluster_centers[:i])[1]
            total += d[j]
        # pick a random position on the line formed by joining all d[j]
        total *= random()
        for j, di in enumerate(d):
            total -= di
            if total > 0:
                continue
            cluster_centers[i] = copy(points[j])
            break
    for p in points:
        p.group = nearest_cluster_center(p, cluster_centers)[0]

# points are the data points, nclusters is the given number of clusters
def lloyd(points, nclusters):
    # initialize the requested number of centers, all as (0,0,0)
    cluster_centers = [Point() for _ in range(nclusters)]
    # call k++ init to choose the initial seed points
    kpp(points, cluster_centers)
    # kmeans starts here
    lenpts10 = len(points) >> 10
    changed = 0
    while True:
        # group element for centroids are used as counters;
        # the coordinate sums are reset as well before accumulating
        for cc in cluster_centers:
            cc.x = 0
            cc.y = 0
            cc.group = 0
        for p in points:
            # number of data points in the same cluster as this center
            cluster_centers[p.group].group += 1
            cluster_centers[p.group].x += p.x
            cluster_centers[p.group].y += p.y
        for cc in cluster_centers:
            # compute the new center
            cc.x /= cc.group
            cc.y /= cc.group
        # find closest centroid of each PointPtr
        # changed counts the data points whose cluster changed
        changed = 0
        for p in points:
            min_i = nearest_cluster_center(p, cluster_centers)[0]
            if min_i != p.group:
                changed += 1
                p.group = min_i
        # stop when 99.9% of points are good
        if changed <= lenpts10:
            break
    for i, cc in enumerate(cluster_centers):
        cc.group = i
    return cluster_centers

def print_eps(points, cluster_centers, W=400, H=400):
    Color = namedtuple("Color", "r g b")
    colors = []
    for i in range(len(cluster_centers)):
        colors.append(Color((3 * (i + 1) % 11) / 11.0,
                            (7 * i % 11) / 11.0,
                            (9 * i % 11) / 11.0))
    max_x = max_y = -FLOAT_MAX
    min_x = min_y = FLOAT_MAX
    for p in points:
        if max_x < p.x: max_x = p.x
        if min_x > p.x: min_x = p.x
        if max_y < p.y: max_y = p.y
        if min_y > p.y: min_y = p.y
    scale = min(W / (max_x - min_x),
                H / (max_y - min_y))
    cx = (max_x + min_x) / 2
    cy = (max_y + min_y) / 2
    print("%%!PS-Adobe-3.0\n%%%%BoundingBox: -5 -5 %d %d" % (W + 10, H + 10))
    print("/l {rlineto} def /m {rmoveto} def\n" +
          "/c { .25 sub exch .25 sub exch .5 0 360 arc fill } def\n" +
          "/s { moveto -2 0 m 2 2 l 2 -2 l -2 -2 l closepath " +
          "   gsave 1 setgray fill grestore gsave 3 setlinewidth" +
          " 1 setgray stroke grestore 0 setgray stroke }def")
    for i, cc in enumerate(cluster_centers):
        print("%g %g %g setrgbcolor" %
              (colors[i].r, colors[i].g, colors[i].b))
        for p in points:
            if p.group != i:
                continue
            print("%.3f %.3f c" % ((p.x - cx) * scale + W / 2,
                                   (p.y - cy) * scale + H / 2))
        print("\n0 setgray %g %g s" % ((cc.x - cx) * scale + W / 2,
                                       (cc.y - cy) * scale + H / 2))
    print("\n%%%%EOF")

def main():
    npoints = 30000
    k = 7  # number of clusters
    points = generate_points(npoints, 10)
    cluster_centers = lloyd(points, k)
    print_eps(points, cluster_centers)

if __name__ == "__main__":
    main()
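The code above implements the algorithm for two-dimensional data, so a Point object has three attributes: its value on the x axis, its value on the y axis, and the identifier of the cluster it belongs to. The function lloyd is the overall implementation of kmeans++: it first chooses suitable seed points through kpp and then clusters the data set with the kmeans algorithm. The implementation of kpp follows steps 2, 3 and 4 of the basic idea of kmeans++ described above.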
4. matlab version of kmeans++
function [L,C] = kmeanspp(X,k)
%KMEANS Cluster multivariate data using the k-means++ algorithm.
%   [L,C] = kmeans_pp(X,k) produces a 1-by-size(X,2) vector L with one class
%   label per column in X and a size(X,1)-by-k matrix C containing the
%   centers corresponding to each class.
%   Version:
%   Authors: Laurent Sorber (Laurent.Sorber@cs.kuleuven.be)
L = [];
L1 = 0;
while length(unique(L)) ~= k
    % The k-means++ initialization.
    C = X(:,1+round(rand*(size(X,2)-1))); % size(X,2) is the number of data points in X, C is the set of centers
    L = ones(1,size(X,2));
    for i = 2:k
        D = X-C(:,L); %-1
        D = cumsum(sqrt(dot(D,D,1))); % accumulate the distance between each data point and its center
        if D(end) == 0, C(:,i:k) = X(:,ones(1,k-i+1)); return; end
        C(:,i) = X(:,find(rand < D/D(end),1)); % the second argument of find is the number of indices to return
        [~,L] = max(bsxfun(@minus,2*real(C'*X),dot(C,C,1).')); % an impressive line: it classifies every data point
    end
    % The k-means algorithm.
    while any(L ~= L1)
        L1 = L;
        for i = 1:k, l = L==i; C(:,i) = sum(X(:,l),2)/sum(l); end
        [~,L] = max(bsxfun(@minus,2*real(C'*X),dot(C,C,1).'),[],1);
    end
end
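This function is implemented in a slightly unusual way: the parameter X is the data set, but each column is treated as one data point; the parameter k is the requested number of clusters. The return value L gives the cluster of each data point, and the return value C holds the final centers (one center per column). A quick test:
>> x=[randn(3,2)*.4;randn(4,2)*.5+ones(4,1)*[4 4]]
>> [L, C] = kmeanspp(x',2)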
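Since the question is how to do this in Python, here is a rough NumPy port of the matlab function. It is a sketch under assumptions, not a line-for-line translation: labels are 0-based, the rng parameter is added here for reproducibility, a cluster that becomes empty keeps its old center instead of turning into NaN, and the degenerate case where all remaining distances are zero raises ValueError rather than returning early. The seven sample points are made up for the demonstration.

```python
import numpy as np

def kmeanspp(X, k, rng=None):
    """k-means++ on X with one data point per column; returns (L, C)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[1]

    def assign(C):
        # argmax of 2*c.x - c.c selects the nearest center for every column
        return np.argmax(2 * C.T @ X - np.sum(C * C, axis=0)[:, None], axis=0)

    while True:
        # k-means++ initialisation
        C = X[:, [rng.integers(n)]]
        L = np.zeros(n, dtype=int)
        for _ in range(1, k):
            diff = X - C[:, L]
            D = np.cumsum(np.sqrt(np.sum(diff * diff, axis=0)))
            if D[-1] == 0:
                raise ValueError("fewer distinct points than clusters")
            # first index whose cumulative distance exceeds rand * total
            new = np.searchsorted(D, rng.random() * D[-1], side="right")
            C = np.hstack([C, X[:, [new]]])
            L = assign(C)
        # ordinary k-means iterations
        L_prev = None
        while L_prev is None or np.any(L != L_prev):
            L_prev = L
            for i in range(k):
                members = L == i
                if members.any():
                    C[:, i] = X[:, members].mean(axis=1)
            L = assign(C)
        if len(np.unique(L)) == k:
            return L, C

# three points near (0, 0) and four near (4, 4), one point per column
X = np.array([[0.0, 0.2, -0.1, 4.0, 4.3, 3.8, 4.1],
              [0.1, -0.2, 0.0, 4.0, 3.9, 4.2, 4.4]])
L, C = kmeanspp(X, 2, rng=np.random.default_rng(1))
print(L)
print(C)
```

Which of the two groups receives label 0 depends on the random draw; only the grouping itself is meaningful.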
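Now let us work through this implementation step by step, and brush up on some matlab along the way.
The unique function returns the distinct values in a matrix, for example:
>> unique([1 3 3 4 4 5])
ans =
     1     3     4     5
>> unique([1 3 3 ; 4 4 5])
ans =
     1
     3
     4
     5
So the loop while length(unique(L)) ~= k ends once k clusters have been obtained. In general this loop finishes after a single pass, because the real work happens inside it.
rand returns a random number in the interval (0,1). On the line marked with the comment %-1, C is expanded in a way similar to the following:
>> C =[];
>> C(1,1) = 1
C =
     1
>> C(2,1) = 2
C =
     1
     2
>> C(:,[1 1 1 1])
ans =
     1     1     1     1
     2     2     2     2
>> C(:,[1 1 1 1 2])
Index exceeds matrix dimensions.
Each element 1 in the second index of C refers to the first column of C. The error Index exceeds matrix dimensions. appears for the value 2 because C itself has no second column. Once C does have a second column:
>> C(2,2) = 3;
>> C(2,2) = 4;
>> C(:,[1 1 1 1 2])
ans =
     1     1     1     1     0
     2     2     2     2     4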
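The same column expansion exists in NumPy as integer-array indexing. A small sketch, with the caveat that NumPy indices start at 0 while matlab's start at 1:

```python
import numpy as np

C = np.array([[1.0],
              [2.0]])                # a single center stored as one column
L = np.array([0, 0, 0, 0])           # every point currently belongs to center 0
expanded = C[:, L]                   # the column is repeated once per point
print(expanded)

try:
    C[:, [0, 0, 0, 0, 1]]            # column 1 does not exist yet
    raised = False
except IndexError:
    raised = True

C2 = np.array([[1.0, 0.0],
               [2.0, 4.0]])          # now there is a second column
expanded2 = C2[:, [0, 0, 0, 0, 1]]
print(expanded2)
```

With this indexing, X - C[:, L] subtracts from every data point the center it is currently assigned to, which is what the line marked %-1 does.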
The dot function multiplies two matrices element by element and then sums the result along one dimension:
>> TT = [1 2 3 ; 4 5 6];
>> dot(TT,TT)
ans =
    17    29    45
>> dot(TT,TT,1 )
ans =
    17    29    45
cumsum is the cumulative sum function:
>> cumsum([1 2 3])
ans =
     1     3     6
>> cumsum([1 2 3; 4 5 6])
ans =
     1     2     3
     5     7     9
The max function can return two values; the second is the index of the maximum:
>> [~, L] = max([1 2 3])
L =
     3
>> [~,L] = max([1 2 3;2 3 4])
L =
     2     2     2
Here ~ is a placeholder.
About the bsxfun function, the official documentation says:
C = bsxfun(fun,A,B) applies the element-by-element binary operation specified by the function handle fun to arrays A and B, with singleton expansion enabled
The parameter fun is a function handle; for function handles see reference [9]. Here is an example of bsxfun:
>> A= [1 2 3;2 3 4]
A =
     1     2     3
     2     3     4
>> B=[6;7]
B =
     6
     7
>> bsxfun(@minus,A,B)
ans =
    -5    -4    -3
    -5    -4    -3
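For comparison, each of these matlab helpers has a close NumPy counterpart. The sketch below reuses the example values from this section; note that np.argmax returns 0-based positions, and that ordinary NumPy broadcasting plays the role of bsxfun.

```python
import numpy as np

TT = np.array([[1, 2, 3],
               [4, 5, 6]])

uniq = np.unique([1, 3, 3, 4, 4, 5])        # like unique(...)
col_dot = np.sum(TT * TT, axis=0)           # like dot(TT,TT,1)
running = np.cumsum([1, 2, 3])              # like cumsum([1 2 3])
running_cols = np.cumsum(TT, axis=0)        # like cumsum on a matrix (down the columns)
arg = np.argmax(np.array([[1, 2, 3],
                          [2, 3, 4]]), axis=0)  # like [~,L] = max(...), 0-based

A = np.array([[1, 2, 3],
              [2, 3, 4]])
B = np.array([[6],
              [7]])
diff = A - B                                # like bsxfun(@minus,A,B)

print(uniq, col_dot, running, arg)
print(running_cols)
print(diff)
```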
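Back to this statement from kmeanspp:
[~,L] = max(bsxfun(@minus,2*real(C'*X),dot(C,C,1).'));
The argument of max is a matrix with n columns, where n is the number of data points; each column holds the negated distances between the corresponding data point and each center. This distance is somewhat unusual, though: it is a variant of the Euclidean distance.
Suppose the data points are 2-dimensional, a data point is (x1,y1) and a center is (c1,d1). Then bsxfun(@minus,2*real(C'*X),dot(C,C,1).') computes the value 2*c1*x1 + 2*d1*y1 - c1.^2 - d1.^2 for this pair, which can be rewritten as x1.^2 + y1.^2 - (c1-x1).^2 - (d1-y1).^2. Within one column every entry involves the same data point, so the term x1.^2 + y1.^2 can be ignored, and what remains is the negated squared Euclidean distance. This also explains why max is appropriate: the cluster of a data point is decided by its nearest center, and once the distances are negated the nearest center is the one with the largest value.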
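The identity is easy to check numerically. The sketch below uses random 2-D points and centers generated only for the test (columns are points, as in the matlab function) and confirms that the largest score and the smallest squared distance select the same center.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 50))     # 50 data points, one per column
C = rng.standard_normal((2, 3))      # 3 centers, one per column

# score[j, i] = 2*c_j.x_i - c_j.c_j, the matrix handed to max in matlab
score = 2 * C.T @ X - np.sum(C * C, axis=0)[:, None]

# sq_dist[j, i] = squared Euclidean distance between center j and point i
sq_dist = ((X[:, None, :] - C[:, :, None]) ** 2).sum(axis=0)

# score equals |x_i|^2 minus the squared distance, column by column
identity_holds = np.allclose(score, np.sum(X * X, axis=0)[None, :] - sq_dist)
same_choice = np.array_equal(score.argmax(axis=0), sq_dist.argmin(axis=0))
print(identity_holds, same_choice)
```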
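Classifying and visualizing pitch with K-means clustering - Python - 伯乐在线
The Galvanize data science curriculum covers a range of machine learning topics popular among data scientists in the tech industry, but the skills students acquire at Galvanize are not limited to the most popular tech industry applications. For example, audio signal and music analysis is discussed less often, yet it is an interesting application of machine learning concepts. Borrowing topics from the Galvanize curriculum, this tutorial shows how to use the k-means clustering algorithm to classify and visualize pitches from a recording, using the Python packages covered in the sections below: NumPy/SciPy, scikit-learn and Plotly.
What is K-means clustering
The k-means clustering algorithm is a common technique for grouping related items in an unlabeled data set. Given a value of k, the algorithm assigns every data point to the cluster of its nearest center, dividing the whole data set into k groups. k-means has a wide range of applications, such as identifying effective locations for cell towers or choosing clothing sizes for a manufacturer. This tutorial shows how to apply k-means to classify audio by pitch.
A brief introduction to pitch
A musical note is a superposition of sine waves of different frequencies, and identifying the pitch of a note means identifying the frequencies of the sine waves that sound most prominent.
The simplest note contains only one sine wave:
The plotted power spectrum, which shows the magnitude of each component frequency, shows a single frequency for the waveform above.
The sounds produced by mainstream instruments are made up of many sine wave components, so they sound more complex than the pure sine wave shown above. The same note (E3) played on a guitar has a waveform that looks like this:
Its power spectrum shows a larger set of underlying frequencies:
k-means can use the power spectra of sample audio segments to classify the segments by pitch. Given a collection of power spectra with n different frequencies each, k-means groups the sample spectra so that the Euclidean distance in n-dimensional space from each spectrum to the center of its group is minimized.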
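To make the idea of a power spectrum concrete, the sketch below synthesizes a note and reads its strongest frequencies off the FFT. The numbers are assumptions chosen for a clean result: an 8000 Hz sample rate, one second of audio, a 440 Hz fundamental and a weaker overtone at 880 Hz.

```python
import numpy as np

sample_rate = 8000                               # samples per second (assumed)
t = np.arange(sample_rate) / sample_rate         # one second of time stamps
# a fundamental plus one weaker overtone, standing in for a real instrument
note = np.sin(2 * np.pi * 440.0 * t) + 0.5 * np.sin(2 * np.pi * 880.0 * t)

magnitudes = np.abs(np.fft.rfft(note))           # strength of every frequency
frequencies = np.fft.rfftfreq(len(note), d=1.0 / sample_rate)

peak = frequencies[np.argmax(magnitudes)]        # strongest frequency
top_two = sorted(frequencies[np.argsort(magnitudes)[-2:]])
print(peak, top_two)
```

The strongest component sits at the fundamental, and the overtone shows up as the second peak, which is the kind of structure the guitar spectrum above contains in richer form.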
Creating a data set from a recording with NumPy/SciPy
This tutorial uses a short recording with 3 different pitches, each played on a guitar for 2 seconds.
With SciPy's wavfile module, a .wav file can easily be converted into a NumPy array.
import scipy.io.wavfile as wav
filename = 'Guitar - Major Chord - E Gsharp B.wav'
# wav.read returns the sample_rate and a numpy array containing each audio sample from the .wav file
sample_rate, recording = wav.read(filename)
The recording should be split into several short segments so that the pitch of each segment can be classified independently.
def split_recording(recording, segment_length, sample_rate):
    # number of samples per segment; slice indices must be integers
    samples_per_segment = int(segment_length * sample_rate)
    segments = []
    index = 0
    while index < len(recording):
        segment = recording[index:index + samples_per_segment]
        segments.append(segment)
        index += samples_per_segment
    return segments
segment_length = .5 # length in seconds
segments = split_recording(recording, segment_length, sample_rate)
The power spectrum of each segment can be obtained with the Fourier transform, which converts waveform data from the time domain to the frequency domain. The following code shows how to do this with NumPy's Fourier transform module.
import numpy as np

def calculate_normalized_power_spectrum(recording, sample_rate):
    # np.fft.fft returns the discrete fourier transform of the recording
    fft = np.fft.fft(recording)
    number_of_samples = len(recording)
    # sample_length is the length of each sample in seconds
    sample_length = 1./sample_rate
    # fftfreq is a convenience function which returns the list of frequencies measured by the fft
    frequencies = np.fft.fftfreq(number_of_samples, sample_length)
    positive_frequency_indices = np.where(frequencies > 0)
    # positive frequencies returned by the fft
    frequencies = frequencies[positive_frequency_indices]
    # magnitudes of each positive frequency in the recording
    magnitudes = abs(fft[positive_frequency_indices])
    # some segments are louder than others, so normalize each segment
    magnitudes = magnitudes / np.linalg.norm(magnitudes)
    return frequencies, magnitudes
A few helper functions create an empty NumPy array and put our sample power spectra into it.
def create_power_spectra_array(segment_length, sample_rate):
    number_of_samples_per_segment = int(segment_length * sample_rate)
    time_per_sample = 1./sample_rate
    frequencies = np.fft.fftfreq(number_of_samples_per_segment, time_per_sample)
    positive_frequencies = frequencies[frequencies > 0]
    # zero rows, one column per positive frequency
    power_spectra_array = np.empty((0, len(positive_frequencies)))
    return power_spectra_array

def fill_power_spectra_array(splits, power_spectra_array, fs):
    filled_array = power_spectra_array
    for segment in splits:
        freqs, mags = calculate_normalized_power_spectrum(segment, fs)
        # append the spectrum of this segment as a new row
        filled_array = np.vstack((filled_array, mags))
    return filled_array

power_spectra_array = create_power_spectra_array(segment_length,sample_rate)
power_spectra_array = fill_power_spectra_array(segments, power_spectra_array, sample_rate)
power_spectra_array is our training data set; it contains one power spectrum for each 0.5-second segment of the recording.
Running k-means with Scikit-learn
Scikit-learn has an easy-to-use k-means implementation. Our audio sample contains 3 different pitches, so k is set to 3.
from sklearn.cluster import KMeans
kmeans = KMeans(3, max_iter = 1000, n_init = 100)
kmeans.fit_transform(power_spectra_array)
predictions = kmeans.predict(power_spectra_array)
predictions is an array containing the group labels (arbitrary integers) of the 12 audio segments.
print(predictions)
=> [2 2 2 2 0 0 0 0 1 1 1 1]
This array shows that consecutive audio segments were correctly grouped together, matching what one hears in the recording.
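Without the guitar recording at hand, the whole pipeline can still be tried on a synthetic signal. The sketch below is an approximation under stated assumptions: three pure tones at the made-up frequencies 100, 200 and 300 Hz replace the guitar notes, the sample rate is 8000 Hz, np.fft.rfft is used as a shortcut for the positive-frequency half of the spectrum, and random_state is fixed only to make the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

sample_rate = 8000
segment_length = 0.5                       # seconds, as in the tutorial
tones_hz = [100.0, 200.0, 300.0]           # made-up pitches, 2 seconds each

t = np.arange(2 * sample_rate) / sample_rate
recording = np.concatenate([np.sin(2 * np.pi * f * t) for f in tones_hz])

# cut the recording into 0.5-second segments
step = int(segment_length * sample_rate)
segments = [recording[i:i + step] for i in range(0, len(recording), step)]

def normalized_spectrum(segment):
    """Magnitudes of the positive frequencies, scaled to unit length."""
    magnitudes = np.abs(np.fft.rfft(segment))[1:]   # drop the 0 Hz bin
    return magnitudes / np.linalg.norm(magnitudes)

spectra = np.vstack([normalized_spectrum(s) for s in segments])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
predictions = kmeans.fit_predict(spectra)
print(predictions)
```

The label values themselves are arbitrary; what matters is that the four segments of each tone share one label.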
Visualizing the results with Plotly
To understand the predictions better, we plot the power spectrum of every sample and color each one according to its k-means group.
# legacy Plotly (pre-4.0) imports for the names used below
import plotly.plotly as py
from plotly.graph_objs import Scatter, Line, Layout, XAxis, YAxis, Data, Figure

# find x-values for plot (frequencies)
number_of_samples = int(segment_length*sample_rate)
sample_length = 1./sample_rate
frequencies = np.fft.fftfreq(number_of_samples, sample_length)

# create plot
traces = []
for pitch_id, color in enumerate(['red','blue','green']):
    # one line per segment assigned to this group
    for power_spectrum in power_spectra_array[predictions == pitch_id]:
        trace = Scatter(x=frequencies[0:500],
                        y=power_spectrum[0:500],
                        mode='lines',
                        showlegend=False,
                        line=Line(shape='linear',
                                  color=color,
                                  opacity = .01,
                                  width = 1))
        traces.append(trace)
layout = Layout(xaxis=XAxis(title='Frequency (Hz)'),
                yaxis=YAxis(title = 'Amplitude (normalized)'),
                title = 'Power Spectra of Sample Audio Segments')
data_to_plot = Data(traces)
fig = Figure(data=data_to_plot, layout=layout)
# py.iplot plots inline using IPython Notebook
py.iplot(fig, filename = 'K-Means Classification of Power Spectrum')
In the figure below, each thin colored line is the power spectrum of one of the 12 audio segments in the sample .wav file. The colors show the pitch group predicted by k-means for each segment. The blue, green and red spectra peak at 82.41 Hz (E), 103.83 Hz (G#) and 123.47 Hz (B) respectively, which are the notes of the audio sample. The strongest frequencies in the sample are the low ones, so only the lowest 500 frequencies measured by the FFT (fast Fourier transform) are included in the chart.
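The note names attached to these peak frequencies can be checked with the usual equal-temperament formula. The sketch assumes A4 = 440 Hz and scientific pitch notation for the octave number (other conventions may number the octaves differently); note_name is a made-up helper.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def note_name(frequency_hz, a4_hz=440.0):
    """Nearest equal-tempered note for a frequency (hypothetical helper)."""
    # MIDI note number: 69 is A4, and each semitone is a factor of 2**(1/12)
    midi = round(69 + 12 * math.log2(frequency_hz / a4_hz))
    return f"{NOTE_NAMES[midi % 12]}{midi // 12 - 1}"

for peak_hz in (82.41, 103.83, 123.47):
    print(peak_hz, note_name(peak_hz))
```

The three peaks map to E, G# and B, the notes of an E major chord, which matches the file name of the recording.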
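Plotting the amplitudes of the 2 strongest overtones shared by the 3 sampled pitches makes this natural clustering obvious.
Learn More at Galvanize!
k-means is one of the many machine learning topics in the Galvanize Data Science Immersive program. If you are interested, you can learn more there.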