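In data analysis and machine learning, clustering is an unsupervised learning method that partitions a dataset into groups, or clusters, so that points within the same cluster are highly similar to one another while points in different clusters are dissimilar. The five metrics below offer a solid basis for judging how good a clustering result is.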
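1. Adjusted Rand Index (ARI)
The Adjusted Rand Index measures how well a clustering agrees with ground-truth labels. It counts the pairs of points that the two labelings treat the same way and corrects that count for the agreement expected by chance. Its maximum is 1, reached when the clustering matches the true labels exactly (up to a renaming of the clusters). Values near 0 indicate agreement no better than random assignment, and negative values indicate agreement worse than chance.
ARI can be computed as follows (labels are assumed to be non-negative integers):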
from math import comb

def adjusted_rand_index(true_labels, predicted_labels):
    n = len(true_labels)
    # Contingency table: rows are true classes, columns are predicted clusters.
    contingency_table = [[0] * (max(predicted_labels) + 1) for _ in range(max(true_labels) + 1)]
    for t, p in zip(true_labels, predicted_labels):
        contingency_table[t][p] += 1
    # Pairs of points sharing a cell, a row (true class), or a column (predicted cluster).
    sum_cells = sum(comb(c, 2) for row in contingency_table for c in row)
    sum_rows = sum(comb(sum(row), 2) for row in contingency_table)
    sum_cols = sum(comb(sum(col), 2) for col in zip(*contingency_table))
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case, e.g. both labelings form a single cluster
        return 1.0
    return (sum_cells - expected) / (max_index - expected)
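To make the chance correction concrete, here is a minimal sketch that derives both the raw Rand index and ARI from brute-force pair counts, using the pair-counting form of ARI. The helper names `pair_counts` and `rand_and_adjusted_rand` and the toy label lists are illustrative assumptions, not part of any library.

```python
from itertools import combinations


def pair_counts(true_labels, predicted_labels):
    """Classify every pair of points by whether each labeling groups it together."""
    both = true_only = pred_only = neither = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = predicted_labels[i] == predicted_labels[j]
        if same_true and same_pred:
            both += 1
        elif same_true:
            true_only += 1
        elif same_pred:
            pred_only += 1
        else:
            neither += 1
    return both, true_only, pred_only, neither


def rand_and_adjusted_rand(true_labels, predicted_labels):
    a, b, c, d = pair_counts(true_labels, predicted_labels)
    rand = (a + d) / (a + b + c + d)
    if b == 0 and c == 0:  # the two partitions are identical
        return rand, 1.0
    # Pair-counting form of ARI.
    ari = 2 * (a * d - b * c) / ((a + b) * (b + d) + (a + c) * (c + d))
    return rand, ari


truth = [0, 0, 0, 1, 1, 1]
renamed = [1, 1, 1, 0, 0, 0]  # same partition, cluster names swapped
split = [0, 0, 1, 1, 2, 2]    # a partition that cuts across the true classes

print(rand_and_adjusted_rand(truth, renamed))  # perfect agreement despite the renaming
print(rand_and_adjusted_rand(truth, split))    # ARI falls well below the raw Rand index
```

On the second comparison the raw Rand index stays fairly high because most pairs are separated in both labelings, while ARI discounts the agreement that would arise by chance.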
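2. Kendall's Tau
Kendall's tau is a nonparametric statistic that measures the rank correlation between two orderings of the same items. In cluster analysis it is meaningful only when the values being compared carry an order, for example ordinal class labels, or the rankings that two different metrics assign to a set of candidate clusterings. Ordinary cluster ids are arbitrary names, so for those a permutation-invariant measure such as ARI is the safer choice.
Kendall's tau (the tau-a variant) can be computed as follows: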
def kendall_tau(true_labels, predicted_labels):
    n = len(true_labels)
    concordant = 0
    discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            # The sign of the product shows whether both sequences order the pair the same way.
            s = (true_labels[i] - true_labels[j]) * (predicted_labels[i] - predicted_labels[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    # Tau-a: tied pairs count as neither concordant nor discordant.
    return (concordant - discordant) / (n * (n - 1) / 2)
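Because tau compares orderings, it reacts to how clusters are numbered. The sketch below, which assumes a small stand-alone tau-a helper named `tau_a` and toy label lists chosen for illustration, scores the same partition twice: once with matching cluster ids and once with the ids reversed.

```python
from itertools import combinations


def tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs; ties count as neither."""
    score = 0
    n_pairs = 0
    for i, j in combinations(range(len(x)), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        score += (product > 0) - (product < 0)  # +1 concordant, -1 discordant, 0 tie
        n_pairs += 1
    return score / n_pairs


truth = [0, 0, 1, 1, 2, 2]
same_names = [0, 0, 1, 1, 2, 2]
renamed = [2, 2, 1, 1, 0, 0]  # identical partition, cluster ids reversed

print(tau_a(truth, same_names))
print(tau_a(truth, renamed))
```

The two calls return values of equal magnitude and opposite sign even though the grouping of points is identical, which is why tau suits ordinal data and rankings rather than arbitrary cluster ids.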
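3. Silhouette Coefficient
The silhouette coefficient is an internal metric: it needs only the data points and their cluster assignments, not ground-truth labels. For each point it compares cohesion (the mean distance to the other points in its own cluster) with separation (the mean distance to the points of the nearest other cluster). Values range from -1 to 1. Values close to 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest that points have been assigned to the wrong cluster.
The silhouette coefficient can be computed as follows: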
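from math import dist as euclidean_distance

def silhouette_coefficient(points, predicted_labels):
    clusters = {}
    for idx, label in enumerate(predicted_labels):
        clusters.setdefault(label, []).append(idx)
    silhouette_scores = []
    for i, label in enumerate(predicted_labels):
        own = clusters[label]
        if len(own) == 1:  # singleton clusters score 0 by convention
            silhouette_scores.append(0.0)
            continue
        # a: mean distance to the rest of the own cluster; b: smallest mean distance to another cluster
        a = sum(euclidean_distance(points[i], points[k]) for k in own if k != i) / (len(own) - 1)
        b = min(sum(euclidean_distance(points[i], points[k]) for k in m) / len(m) for other, m in clusters.items() if other != label)
        silhouette_scores.append((b - a) / max(a, b))
    return sum(silhouette_scores) / len(points)
Here points is a sequence of coordinate tuples, and euclidean_distance is math.dist, which returns the Euclidean distance between two points. The function requires at least two clusters and assumes the points are not all identical.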
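For reference, the usual per-point definition can be written as below. The notation is an assumption made for this sketch: C_I is the cluster containing point i, d is the distance function, and N is the number of points.

```latex
a(i) = \frac{1}{|C_I| - 1} \sum_{j \in C_I,\, j \neq i} d(i, j), \qquad
b(i) = \min_{J \neq I} \frac{1}{|C_J|} \sum_{j \in C_J} d(i, j)

s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad
\bar{s} = \frac{1}{N} \sum_{i=1}^{N} s(i)
```

As a quick check of the arithmetic, a point with a(i) = 1 and b(i) = 4 has s(i) = (4 - 1) / 4 = 0.75. A point in a cluster of size one is conventionally given s(i) = 0.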
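4. Davies-Bouldin Index
The Davies-Bouldin index is an internal metric for which smaller values indicate better clustering. For each cluster it finds the most similar other cluster, where similarity is the ratio of the two clusters' within-cluster scatter to the distance between their centroids, and then averages these worst-case ratios over all clusters. It therefore reflects both cohesion and separation.
The Davies-Bouldin index can be computed as follows (points is a sequence of coordinate tuples, and labels are assumed to be the integers 0 to k-1 with no empty cluster):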
from math import dist

def davies_bouldin_index(points, predicted_labels):
    n_clusters = max(predicted_labels) + 1
    members = [[p for p, l in zip(points, predicted_labels) if l == c] for c in range(n_clusters)]
    # Centroid of each cluster (coordinate-wise mean).
    centroids = [tuple(sum(x) / len(m) for x in zip(*m)) for m in members]
    # Scatter: mean distance from a cluster's points to its centroid.
    scatter = [sum(dist(p, c) for p in m) / len(m) for m, c in zip(members, centroids)]
    total = 0.0
    for i in range(n_clusters):
        # Worst-case similarity ratio between cluster i and any other cluster.
        total += max((scatter[i] + scatter[j]) / dist(centroids[i], centroids[j])
                     for j in range(n_clusters) if j != i)
    return total / n_clusters
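For reference, the standard definition can be written as below. The symbols are assumptions made for this sketch: C_i is cluster i, c_i its centroid, and k the number of clusters.

```latex
S_i = \frac{1}{|C_i|} \sum_{x \in C_i} \lVert x - c_i \rVert, \qquad
M_{ij} = \lVert c_i - c_j \rVert

\mathrm{DB} = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{S_i + S_j}{M_{ij}}
```

As a quick check of the arithmetic, two clusters that each have scatter 1 and whose centroids are 10 apart give a ratio of (1 + 1) / 10 = 0.2 for both clusters, so DB = 0.2.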
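5. Calinski-Harabasz Index
The Calinski-Harabasz index, also known as the variance ratio criterion, is an internal metric for which larger values indicate better clustering. It is the ratio of between-cluster dispersion to within-cluster dispersion, each normalized by its degrees of freedom, so it rewards clusters that are both compact and far apart.
The Calinski-Harabasz index can be computed as follows (points is a sequence of coordinate tuples, labels are assumed to be the integers 0 to k-1 with no empty cluster, and the number of clusters must be at least 2 and smaller than the number of points):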
from math import dist

def calinski_harabasz_index(points, predicted_labels):
    n = len(points)
    n_clusters = max(predicted_labels) + 1
    members = [[p for p, l in zip(points, predicted_labels) if l == c] for c in range(n_clusters)]
    overall_centroid = tuple(sum(x) / n for x in zip(*points))
    sum_within_cluster = 0.0
    sum_between_cluster = 0.0
    for m in members:
        centroid = tuple(sum(x) / len(m) for x in zip(*m))
        # Within-cluster dispersion: squared distances of points to their own centroid.
        sum_within_cluster += sum(dist(p, centroid) ** 2 for p in m)
        # Between-cluster dispersion: squared centroid offset, weighted by cluster size.
        sum_between_cluster += len(m) * dist(centroid, overall_centroid) ** 2
    return (sum_between_cluster / (n_clusters - 1)) / (sum_within_cluster / (n - n_clusters))
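For reference, the standard definition can be written as below. The symbols are assumptions made for this sketch: C_q is cluster q with n_q points and centroid c_q, c is the centroid of the whole dataset, n is the number of points, and k is the number of clusters.

```latex
B = \sum_{q=1}^{k} n_q \,\lVert c_q - c \rVert^2, \qquad
W = \sum_{q=1}^{k} \sum_{x \in C_q} \lVert x - c_q \rVert^2

\mathrm{CH} = \frac{B / (k - 1)}{W / (n - k)}
```

As a quick check of the arithmetic, with B = 90, W = 10, k = 3, and n = 13 the index is (90 / 2) / (10 / 10) = 45.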
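Taken together, these five metrics give a fairly complete picture of clustering quality. ARI requires ground-truth labels, Kendall's tau applies when the values being compared are ordinal, and the silhouette coefficient, Davies-Bouldin index, and Calinski-Harabasz index need only the data and the cluster assignments. In practice, choose the metric that matches the information available and the goal of the analysis.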
