How to implement Canopy clustering in Python - Python Tutorial

资源魔 2020-07-23 20:08:51

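The Canopy algorithm was proposed in 2000 by Andrew McCallum, Kamal Nigam and Lyle Ungar as a preprocessing step for k-means and hierarchical clustering. A well-known weakness of k-means is that the value of k has to be tuned by hand. The Elbow Method and the Silhouette Coefficient can be used later to settle on a final k, but both are "after the fact" checks. Canopy instead performs a coarse clustering up front, which gives k-means both the number of initial cluster centers and the initial center points.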

Packages used:

import math
import random
import numpy as np
from datetime import datetime
from pprint import pprint as p
import matplotlib.pyplot as plt

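1. First, I preset a two-dimensional dataset (two-dimensional so that it can be plotted on a plane later).

Higher-dimensional data works as well. The core canopy algorithm is written as a class, so it can be called directly on data of any dimension. This only suits small batches; for large volumes of data, look at Mahout and Hadoop instead.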

# Randomly generate 500 two-dimensional points in [0, 1)
dataset = np.random.rand(500, 2)
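
Because the points are random, every run produces a different dataset and therefore different canopies. For repeatable experiments it may help to fix the seed. The following is only a sketch, assuming NumPy 1.17 or later (for np.random.default_rng); make_dataset and its parameters are hypothetical names that do not appear in the original code.

```python
import numpy as np


def make_dataset(n_points=500, n_dims=2, seed=0):
    """Return n_points uniform random points in [0, 1)^n_dims (hypothetical helper)."""
    rng = np.random.default_rng(seed)  # seeded generator: same points on every run
    return rng.random((n_points, n_dims))


dataset = make_dataset()             # 500 two-dimensional points
dataset_5d = make_dataset(n_dims=5)  # the same idea for higher-dimensional data
```

Calling make_dataset() twice with the same seed returns identical arrays, which makes it easier to compare different t1/t2 settings on the same points.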

2. Next, define a class with the following attributes:

class Canopy:
    def __init__(self, dataset):
        self.dataset = dataset
        self.t1 = 0  # outer threshold, set via setThreshold()
        self.t2 = 0  # inner threshold, set via setThreshold()

Add a function that sets the initial values of t1 and t2 and checks which one is larger:

    # Set the initial thresholds
    def setThreshold(self, t1, t2):
        if t1 > t2:
            self.t1 = t1
            self.t2 = t2
        else:
            print('t1 needs to be larger than t2!')
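
When the check fails, the version above only prints a message and leaves t1 and t2 at 0, so the problem shows up later in clustering(). One possible alternative is to raise an exception instead. This is a sketch only; validate_thresholds is a hypothetical standalone helper and is not part of the original class.

```python
def validate_thresholds(t1, t2):
    """Return (t1, t2) if t1 > t2, otherwise raise ValueError (hypothetical helper)."""
    if not t1 > t2:
        raise ValueError(
            't1 needs to be larger than t2, got t1=%r, t2=%r' % (t1, t2))
    return t1, t2


t1, t2 = validate_thresholds(0.6, 0.4)  # passes: 0.6 > 0.4
```

Whether to print or raise is a design choice; raising stops the program at the point where the bad values were supplied.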

3. Distance calculation. I use the Euclidean distance to measure the distance between points.

    # Use the Euclidean distance as the distance measure
    def euclideanDistance(self, vec1, vec2):
        return math.sqrt(((vec1 - vec2)**2).sum())
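
The method above handles one pair of points per call. Since the dataset is a NumPy array, the distance from one center to every remaining point could also be computed in a single call. A minimal sketch, where distances_to_all is a hypothetical helper name and the three sample points are made up for illustration:

```python
import numpy as np


def distances_to_all(center, points):
    """Euclidean distance from one center to every row of points (hypothetical helper)."""
    return np.linalg.norm(points - center, axis=1)


center = np.array([0.0, 0.0])
points = np.array([[3.0, 4.0], [1.0, 0.0], [0.0, 0.0]])
dists = distances_to_all(center, points)  # one distance per row of points
```

Each entry of dists matches what euclideanDistance would return for the corresponding row, and the same call works unchanged for data with more than two dimensions.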

4. Then write a function that picks a random index based on the length of dataset.

    # Pick a random index based on the current length of dataset
    def getRandIndex(self):
        return random.randint(0, len(self.dataset) - 1)

5. The core algorithm

    def clustering(self):
        if self.t1 == 0:
            print('Please set the threshold.')
        else:
            canopies = []  # holds the final clustering result
            while len(self.dataset) != 0:
                rand_index = self.getRandIndex()
                current_center = self.dataset[rand_index]  # pick a random center point, called P
                current_center_list = []  # canopy container for P
                delete_list = []  # deletion container for P
                self.dataset = np.delete(
                    self.dataset, rand_index, 0)  # remove the randomly chosen center P
                for datum_j in range(len(self.dataset)):
                    datum = self.dataset[datum_j]
                    distance = self.euclideanDistance(
                        current_center, datum)  # distance from P to each remaining point
                    if distance < self.t1:
                        # closer than t1: add the point to P's canopy
                        current_center_list.append(datum)
                    if distance < self.t2:
                        delete_list.append(datum_j)  # closer than t2: mark for deletion
                # remove the marked points from the dataset by index
                self.dataset = np.delete(self.dataset, delete_list, 0)
                canopies.append((current_center, current_center_list))
            return canopies
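
The same loop can be sketched as a standalone function that measures all distances from P in one vectorized step. This is an illustrative variant under stated assumptions, not the original implementation: canopy_sketch and its seed parameter are hypothetical names, randomness comes from np.random.default_rng (NumPy 1.17+), and an invalid threshold pair raises instead of printing. Like the method above, it only draws canopy members from the points that are still in the pool.

```python
import numpy as np


def canopy_sketch(points, t1, t2, seed=0):
    """Return a list of (center, members) tuples (hypothetical standalone variant)."""
    if not t1 > t2:
        raise ValueError('t1 needs to be larger than t2')
    rng = np.random.default_rng(seed)
    remaining = np.asarray(points, dtype=float)
    canopies = []
    while len(remaining) > 0:
        idx = rng.integers(len(remaining))             # pick a random center P
        center = remaining[idx]
        remaining = np.delete(remaining, idx, axis=0)  # take P out of the pool
        dist = np.linalg.norm(remaining - center, axis=1)
        members = remaining[dist < t1]                 # within t1: join P's canopy
        remaining = remaining[dist >= t2]              # within t2: drop from the pool
        canopies.append((center, members))
    return canopies


points = np.random.default_rng(1).random((200, 2))
canopies = canopy_sketch(points, t1=0.6, t2=0.4)
centers = np.array([c for c, _ in canopies])
```

Two properties follow from the loop: any two centers are at least t2 apart, and every input point is either a center or lies within t2 of some center. A point can appear in more than one canopy, because points between t2 and t1 from P stay in the pool.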

To make the later visualization easier, canopies is defined here as a list; a dict would work as well.

6. The main() function

def main():
    t1 = 0.6
    t2 = 0.4
    gc = Canopy(dataset)
    gc.setThreshold(t1, t2)
    canopies = gc.clustering()
    print('Get %s initial centers.' % len(canopies))    
    #showCanopy(canopies, dataset, t1, t2)
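
The introduction describes canopy as a way to give k-means its number of centers and its starting points. That hand-off could look roughly like the sketch below, assuming the canopy centers have been collected into an array, for example with np.array([c for c, _ in canopies]). kmeans_from_centers is a hypothetical helper that runs plain Lloyd iterations in NumPy; the six sample points and two starting centers are made up and merely stand in for real canopy output.

```python
import numpy as np


def kmeans_from_centers(points, init_centers, n_iter=20):
    """Plain Lloyd iterations starting from the given centers (hypothetical helper)."""
    centers = np.array(init_centers, dtype=float)
    for _ in range(n_iter):
        # distance of every point to every center, shape (n_points, k)
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)  # index of the nearest center per point
        for k in range(len(centers)):
            if np.any(labels == k):   # an empty cluster keeps its old center
                centers[k] = points[labels == k].mean(axis=0)
    return centers, labels


# Two tight groups of points; init_centers stands in for canopy centers.
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]])
init_centers = np.array([[0.1, 0.0], [0.9, 1.0]])
centers, labels = kmeans_from_centers(points, init_centers)
```

Here k is simply the number of rows in init_centers, so the number of canopies determines the number of k-means clusters.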

Canopy clustering visualization code

def showCanopy(canopies, dataset, t1, t2):
    fig = plt.figure()
    sc = fig.add_subplot(111)
    colors = ['brown', 'green', 'blue', 'y', 'r', 'tan', 'dodgerblue', 'deeppink', 'orangered', 'peru', 'blue', 'y', 'r',
              'gold', 'dimgray', 'darkorange', 'peru', 'blue', 'y', 'r', 'cyan', 'tan', 'orchid', 'peru', 'blue', 'y', 'r', 'sienna']
    markers = ['*', 'h', 'H', '+', 'o', '1', '2', '3', ',', 'v', 'H', '+', '1', '2', '^',
               '<', '>', '.', '4', 'H', '+', '1', '2', 's', 'p', 'x', 'D', 'd', '|', '_']
    for i in range(len(canopies)):
        canopy = canopies[i]
        center = canopy[0]
        components = canopy[1]
        # cycle through the lists so many canopies cannot cause an IndexError
        marker = markers[i % len(markers)]
        color = colors[i % len(colors)]
        sc.plot(center[0], center[1], marker=marker,
                color=color, markersize=10)
        t1_circle = plt.Circle(
            xy=(center[0], center[1]), radius=t1, color='dodgerblue', fill=False)
        t2_circle = plt.Circle(
            xy=(center[0], center[1]), radius=t2, color='skyblue', alpha=0.2)
        sc.add_artist(t1_circle)
        sc.add_artist(t2_circle)
        for component in components:
            sc.plot(component[0], component[1],
                    marker=marker, color=color, markersize=1.5)
    maxvalue = np.amax(dataset)
    minvalue = np.amin(dataset)
    plt.xlim(minvalue - t1, maxvalue + t1)
    plt.ylim(minvalue - t1, maxvalue + t1)
    plt.show()

The resulting plot is shown below: