In data analysis and machine learning, clustering is an unsupervised learning method that partitions a dataset into groups (clusters) so that points within the same cluster are highly similar while points in different clusters are dissimilar. To judge how good a clustering is, the following five scoring metrics provide a useful reference.

1. Adjusted Rand Index (ARI)

The Adjusted Rand Index is a widely used external clustering metric: it measures how well the clustering agrees with ground-truth labels, corrected for chance agreement. ARI ranges from roughly -1 to 1. A value near 1 means the clustering closely matches the true labels, a value near 0 means the agreement is no better than random labeling, and negative values mean worse than random.

A reference implementation of ARI:

from math import comb
from collections import Counter

def adjusted_rand_index(true_labels, predicted_labels):
    n = len(true_labels)
    # Contingency table: how many points share each (true, predicted) label pair
    contingency = Counter(zip(true_labels, predicted_labels))
    index = sum(comb(c, 2) for c in contingency.values())
    sum_rows = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_cols = sum(comb(c, 2) for c in Counter(predicted_labels).values())
    expected = sum_rows * sum_cols / comb(n, 2)   # chance-level agreement
    max_index = (sum_rows + sum_cols) / 2
    return (index - expected) / (max_index - expected)
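To make the pair-counting behind ARI concrete, here is a small self-contained walkthrough on a toy labeling (the data and variable names are illustrative):

```python
from math import comb
from collections import Counter

y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1]

# Contingency table: points sharing each (true, predicted) label pair
table = Counter(zip(y_true, y_pred))          # {(0,0): 2, (0,1): 1, (1,1): 3}
index = sum(comb(c, 2) for c in table.values())                   # 4 agreeing pairs
sum_rows = sum(comb(c, 2) for c in Counter(y_true).values())      # 6
sum_cols = sum(comb(c, 2) for c in Counter(y_pred).values())      # 7
expected = sum_rows * sum_cols / comb(len(y_true), 2)             # 2.8
max_index = (sum_rows + sum_cols) / 2                             # 6.5
ari = (index - expected) / (max_index - expected)
print(round(ari, 4))  # -> 0.3243
```

With only one point moved between clusters, agreement is well above chance but far from perfect, which the value around 0.32 reflects.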

2. Kendall's Tau

Kendall's tau is a non-parametric statistic that measures the correlation between two rankings. Adapted to clustering, a tau-style statistic can compare two partitions by counting pairs of points that both labelings group the same way (concordant: both together, or both apart) versus pairs they group differently (discordant).

A tau-style pair-counting implementation:

def kendall_tau(true_labels, predicted_labels):
    n = len(true_labels)
    concordant = 0
    discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            same_true = true_labels[i] == true_labels[j]
            same_pred = predicted_labels[i] == predicted_labels[j]
            # Concordant: the pair is grouped the same way in both partitions
            if same_true == same_pred:
                concordant += 1
            else:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
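The pair-counting idea can be checked by hand on a toy labeling (a self-contained sketch; data and names are illustrative):

```python
from itertools import combinations

y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1]

concordant = discordant = 0
for i, j in combinations(range(len(y_true)), 2):
    same_true = y_true[i] == y_true[j]
    same_pred = y_pred[i] == y_pred[j]
    if same_true == same_pred:
        concordant += 1   # both partitions group the pair the same way
    else:
        discordant += 1

tau = (concordant - discordant) / (concordant + discordant)
print(concordant, discordant, round(tau, 4))  # -> 10 5 0.3333
```

Of the 15 point pairs, 10 are treated identically by both partitions and 5 are not, giving a score of 1/3.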

3. Silhouette Coefficient

The silhouette coefficient is another common clustering metric; it balances cohesion (how close a point is to its own cluster) against separation (how far it is from the nearest other cluster). Unlike the two metrics above, it is an internal metric: it needs the data points themselves and the cluster assignments, not ground-truth labels. Values range from -1 to 1, and values closer to 1 indicate better clustering.

A reference implementation of the silhouette coefficient:

def silhouette_coefficient(points, labels):
    n = len(points)
    scores = []
    for i in range(n):
        same = [k for k in range(n) if labels[k] == labels[i] and k != i]
        # a: mean distance from point i to the other points in its own cluster
        a = sum(euclidean_distance(points[i], points[k]) for k in same) / len(same)
        # b: smallest mean distance from point i to any other cluster
        b = min(
            sum(euclidean_distance(points[i], points[k]) for k in range(n) if labels[k] == c)
            / sum(1 for k in range(n) if labels[k] == c)
            for c in set(labels) if c != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / n

Here, the euclidean_distance function computes the Euclidean distance between two points.
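That helper is not defined in the article; a minimal sketch, assuming points are given as equal-length coordinate sequences:

```python
import math

def euclidean_distance(p, q):
    # Straight-line distance between two equal-length coordinate sequences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean_distance((0, 0), (3, 4)))  # -> 5.0
```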

4. Davies-Bouldin Index

The Davies-Bouldin index is another internal metric, computed from the data points and cluster assignments alone; smaller values indicate better clustering. It balances within-cluster scatter against between-cluster separation.

A reference implementation of the Davies-Bouldin index:

def davies_bouldin_index(points, labels):
    clusters = sorted(set(labels))
    centroids, scatters = [], []
    for c in clusters:
        members = [p for p, l in zip(points, labels) if l == c]
        centroid = tuple(sum(xs) / len(members) for xs in zip(*members))
        centroids.append(centroid)
        # Scatter: mean distance of the cluster's points to its centroid
        scatters.append(sum(euclidean_distance(p, centroid) for p in members) / len(members))

    total = 0.0
    for i in range(len(clusters)):
        # For each cluster, take the worst (largest) similarity ratio to another cluster
        total += max(
            (scatters[i] + scatters[j]) / euclidean_distance(centroids[i], centroids[j])
            for j in range(len(clusters)) if j != i
        )
    return total / len(clusters)
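The index can be worked out by hand on two tight, well-separated clusters (a self-contained sketch; the data and variable names are illustrative):

```python
import math

points = [(0, 0), (0, 1), (10, 0), (10, 1)]
labels = [0, 0, 1, 1]

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

clusters = sorted(set(labels))
centroids, scatters = [], []
for c in clusters:
    members = [p for p, l in zip(points, labels) if l == c]
    centroid = tuple(sum(xs) / len(members) for xs in zip(*members))
    centroids.append(centroid)
    # Mean distance of the cluster's points to its centroid: 0.5 for both clusters
    scatters.append(sum(dist(p, centroid) for p in members) / len(members))

db = sum(
    max((scatters[i] + scatters[j]) / dist(centroids[i], centroids[j])
        for j in range(len(clusters)) if j != i)
    for i in range(len(clusters))
) / len(clusters)
print(db)  # -> 0.1
```

The centroids sit 10 apart while each cluster's scatter is only 0.5, so the index (0.5 + 0.5) / 10 = 0.1 is small, as expected for a clean clustering.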

5. Calinski-Harabasz Index

The Calinski-Harabasz index (also called the variance ratio criterion) is an internal metric for which larger values indicate better clustering. It rewards clusterings whose between-cluster dispersion is large relative to their within-cluster dispersion.

A reference implementation of the Calinski-Harabasz index:

def calinski_harabasz_index(points, labels):
    n = len(points)
    clusters = sorted(set(labels))
    k = len(clusters)
    overall = tuple(sum(xs) / n for xs in zip(*points))

    between = 0.0  # between-cluster dispersion (centroids vs. overall mean)
    within = 0.0   # within-cluster dispersion (points vs. their centroid)
    for c in clusters:
        members = [p for p, l in zip(points, labels) if l == c]
        centroid = tuple(sum(xs) / len(members) for xs in zip(*members))
        between += len(members) * sum((a - b) ** 2 for a, b in zip(centroid, overall))
        within += sum(sum((a - b) ** 2 for a, b in zip(p, centroid)) for p in members)

    return (between / (k - 1)) / (within / (n - k))
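On the same two tight, well-separated clusters used above, the variance ratio can be verified by hand (a self-contained sketch; data and names are illustrative):

```python
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
labels = [0, 0, 1, 1]

n, k = len(points), len(set(labels))
overall = tuple(sum(xs) / n for xs in zip(*points))  # (5.0, 0.5)

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

between = within = 0.0
for c in sorted(set(labels)):
    members = [p for p, l in zip(points, labels) if l == c]
    centroid = tuple(sum(xs) / len(members) for xs in zip(*members))
    between += len(members) * sq_dist(centroid, overall)   # 50 per cluster -> 100
    within += sum(sq_dist(p, centroid) for p in members)   # 0.25 per point -> 1.0

ch = (between / (k - 1)) / (within / (n - k))
print(ch)  # -> 200.0
```

The between-cluster dispersion (100) dwarfs the within-cluster dispersion (1.0), so the ratio is large, matching the intuition that larger is better.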

In summary, these five metrics together give a fairly complete picture of clustering quality. In practice, choose the one that fits your setting: ARI and the tau-style pair statistic require ground-truth labels (external evaluation), while the silhouette coefficient, Davies-Bouldin index, and Calinski-Harabasz index need only the data points and the cluster assignments (internal evaluation).