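Semi-Supervised Learning With Label Propagation and Hands-On Practice in Python

[Translated from: Semi-Supervised Learning With Label Propagation]

[Note: I am a big fan of Jason Brownlee PhD's articles, so in my spare time I translate them and work through the examples. This post is a record of that practice, and I hope it helps anyone who needs it.]

Semi-Supervised Learning With Label Propagation

Semi-supervised learning refers to algorithms that attempt to make use of both labeled and unlabeled training data. Semi-supervised learning algorithms differ from supervised learning algorithms, which can only learn from labeled training data. A popular approach to semi-supervised learning is to create a graph that connects the examples in the training dataset and to propagate known labels along the edges of the graph to label the unlabeled examples. One example of this approach is the label propagation algorithm for classification predictive modeling.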

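In this tutorial, you will discover how to apply the label propagation algorithm to a semi-supervised learning classification dataset. After completing this tutorial, you will know:

An intuition for how the label propagation semi-supervised learning algorithm works.
How to develop a semi-supervised classification dataset and establish a performance baseline with a supervised learning algorithm.
How to develop and evaluate a label propagation algorithm and use the model output to train a supervised learning algorithm.

Tutorial Overview

This tutorial is divided into three parts; they are:

Label Propagation Algorithm
Semi-Supervised Classification Dataset
Label Propagation for Semi-Supervised Learning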

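Label Propagation Algorithm

Label propagation is a semi-supervised learning algorithm. It was proposed in a 2002 technical report by Xiaojin Zhu and Zoubin Ghahramani titled "Learning From Labeled And Unlabeled Data With Label Propagation." The intuition for the algorithm is that a graph is created that connects all examples (rows) in the dataset based on their distance, such as Euclidean distance. Nodes in the graph then take on soft labels, or label distributions, based on the labels or label distributions of the examples connected nearby in the graph.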
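The intuition above can be sketched in a few lines of NumPy. This is only a toy illustration under stated assumptions, not the scikit-learn implementation or the exact procedure from the report: the function name propagate_labels, the RBF edge weighting with gamma=20.0, the fixed iteration count, and the six-point dataset are all choices made for this example.

```python
# toy sketch of label propagation on a fully connected graph (illustrative only)
import numpy as np

def propagate_labels(X, y, gamma=20.0, n_iter=100):
    """Return a hard label for every row; y uses -1 for unlabeled rows."""
    classes = np.unique(y[y != -1])
    # edge weights: closer examples get larger weights (RBF on squared Euclidean distance)
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-gamma * sq_dist)
    # row-normalize so each row describes how a node averages over its neighbors
    T = W / W.sum(axis=1, keepdims=True)
    # soft labels: one-hot rows for labeled examples, zeros for unlabeled examples
    labeled = y != -1
    F = np.zeros((len(X), len(classes)))
    F[labeled] = (y[labeled][:, None] == classes[None, :]).astype(float)
    clamp = F[labeled].copy()
    for _ in range(n_iter):
        # each node takes the weighted average of its neighbors' label distributions
        F = T @ F
        # known labels are reset (clamped) after every step
        F[labeled] = clamp
    return classes[F.argmax(axis=1)]

# two small clusters on a line, with one labeled example at each end
X = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
y = np.array([0, -1, -1, -1, -1, 1])
labels = propagate_labels(X, y)
print(labels)
```

In this toy case the two unlabeled points in each cluster end up with the label of the labeled point in the same cluster, because the edge weights between the clusters are tiny compared with those inside each cluster.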

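Now that we are familiar with the label propagation algorithm, let's look at how we might use it on a project. First, we must define a semi-supervised classification dataset.

Semi-Supervised Classification Dataset

In this section, we will define a dataset for semi-supervised learning and establish a performance baseline on it. First, we can define a synthetic classification dataset using the make_classification() function. We will define the dataset with two classes (binary classification), two input variables, and 1,000 examples.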

# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)

Next, we will split the dataset into train and test datasets with an even 50-50 split (that is, 500 rows in each).

# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)

Finally, we will split the training dataset in half again: one portion will keep its labels, and the other will be treated as unlabeled.

# split train into labeled and unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)

Tying this together, the complete example of preparing the semi-supervised learning dataset is listed below.

# prepare semi-supervised learning dataset
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
# split train into labeled and unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
# summarize training set size
print('Labeled Train Set:', X_train_lab.shape, y_train_lab.shape)
print('Unlabeled Train Set:', X_test_unlab.shape, y_test_unlab.shape)
# summarize test set size
print('Test Set:', X_test.shape, y_test.shape)

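Running the example prepares the dataset and then summarizes the shape of each of the three portions.

The results confirm that we have a test dataset of 500 rows, a labeled training dataset of 250 rows, and 250 rows of unlabeled data.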

Labeled Train Set: (250, 2) (250,)
Unlabeled Train Set: (250, 2) (250,)
Test Set: (500, 2) (500,)

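A supervised learning algorithm will only have 250 rows from which to train a model. A semi-supervised learning algorithm will have the 250 labeled rows as well as the 250 unlabeled rows, which can be used in numerous ways to improve on the labeled training dataset.

Next, we can establish a performance baseline on the semi-supervised learning dataset using a supervised learning algorithm fit only on the labeled training data. This is important because we would expect a semi-supervised learning algorithm to outperform a supervised learning algorithm fit on the labeled data alone. If this is not the case, then the semi-supervised learning algorithm does not have skill. In this case, we will use a logistic regression algorithm fit on the labeled portion of the training dataset.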

# define model
model = LogisticRegression()
# fit model on labeled dataset
model.fit(X_train_lab, y_train_lab)

The model can then be used to make predictions on the entire hold out test dataset and evaluated using classification accuracy.

# make predictions on hold out test set
yhat = model.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))

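Tying this together, the complete example of evaluating a supervised learning algorithm on the semi-supervised learning dataset is listed below.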

# baseline performance on the semi-supervised learning dataset
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
# split train into labeled and unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
# define model
model = LogisticRegression()
# fit model on labeled dataset
model.fit(X_train_lab, y_train_lab)
# make predictions on hold out test set
yhat = model.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))

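Running the example fits the model on the labeled training dataset, evaluates it on the holdout dataset, and prints the classification accuracy.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that the algorithm achieves a classification accuracy of about 84.8 percent. We would expect an effective semi-supervised learning algorithm to achieve better accuracy than this.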

Accuracy: 84.800
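The advice about comparing average results can be sketched as follows. This is a minimal sketch under assumptions that are not part of the original tutorial: the dataset is kept fixed, only the seed used for the two splits is varied (five seeds, chosen arbitrarily), and the mean and standard deviation of the baseline accuracy are reported.

```python
# repeat the baseline evaluation over several split seeds (illustrative sketch)
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
# define dataset (same settings as in the tutorial)
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
scores = []
# assumption: five different seeds for the splits; the number is arbitrary
for seed in range(1, 6):
    # split into train and test
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=seed, stratify=y)
    # split train into labeled and unlabeled
    X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=seed, stratify=y_train)
    # fit the baseline model on the labeled portion only
    model = LogisticRegression()
    model.fit(X_train_lab, y_train_lab)
    # evaluate on the hold out test set
    scores.append(accuracy_score(y_test, model.predict(X_test)))
# summarize the spread of scores across splits
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores)*100, std(scores)*100))
```

A spread of scores gives a sense of how much of a difference between two models could be explained by the choice of split alone.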

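Label Propagation for Semi-Supervised Learning

The label propagation algorithm is available in the scikit-learn Python machine learning library via the LabelPropagation class. The model can be fit like any other classification model by calling the fit() function and used to make predictions on new data via the predict() function.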

# define model
model = LabelPropagation()
# fit model on training dataset
model.fit(..., ...)
# make predictions on hold out test set
yhat = model.predict(...)

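Importantly, the training dataset provided to the fit() function must include labeled examples that are integer encoded (as per normal) and unlabeled examples marked with a label of -1. The model will then determine a label for the unlabeled examples as part of fitting the model. After the model is fit, the estimated labels for the labeled and unlabeled data in the training dataset are available via the "transduction_" attribute on the LabelPropagation class.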

# get labels for entire training dataset data
tran_labels = model.transduction_
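As a small illustration of the -1 convention, the sketch below fits LabelPropagation with its default settings on a made-up six-row dataset (the data values are an assumption for this example only). It reads back the hard labels from transduction_ and the soft labels from the label_distributions_ attribute.

```python
# minimal sketch of the -1 "unlabeled" convention on a toy dataset
import numpy as np
from sklearn.semi_supervised import LabelPropagation
# toy inputs: two small clusters on a line (made up for illustration)
X_toy = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
# one labeled example per cluster; -1 marks the unlabeled rows
y_toy = np.array([0, -1, -1, -1, -1, 1])
# define and fit the model on labeled and unlabeled rows together
toy_model = LabelPropagation()
toy_model.fit(X_toy, y_toy)
# hard labels for every row in the training data
toy_labels = toy_model.transduction_
# soft labels: one row per example, one column per class
toy_dist = toy_model.label_distributions_
print(toy_labels)
print(toy_dist.shape)
```

The rows that were passed in with a label of -1 come back with a class label, and the rows that were already labeled keep the label they were given.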

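Now that we are familiar with how to use the label propagation algorithm in scikit-learn, let's look at how we might apply it to our semi-supervised learning dataset. First, we must prepare the training dataset. We can concatenate the input data of the training dataset into a single array.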

# create the training dataset input
X_train_mixed = concatenate((X_train_lab, X_test_unlab))

We can then create a list of -1 values (meaning unlabeled), one for each row in the unlabeled portion of the training dataset.

# create "no label" for unlabeled data
nolabel = [-1 for _ in range(len(y_test_unlab))]

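This list can then be concatenated with the labels from the labeled portion of the training dataset so that it lines up with the input array for the training dataset.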

# recombine training dataset labels
y_train_mixed = concatenate((y_train_lab, nolabel))

We can now train the LabelPropagation model on the entire training dataset.

# define model
model = LabelPropagation()
# fit model on training dataset
model.fit(X_train_mixed, y_train_mixed)

Next, we can use the model to make predictions on the holdout dataset and evaluate the model using classification accuracy.

# make predictions on hold out test set
yhat = model.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))

Tying this together, the complete example of evaluating label propagation on the semi-supervised learning dataset is listed below.

# evaluate label propagation on the semi-supervised learning dataset
from numpy import concatenate
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import LabelPropagation
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
# split train into labeled and unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
# create the training dataset input
X_train_mixed = concatenate((X_train_lab, X_test_unlab))
# create "no label" for unlabeled data
nolabel = [-1 for _ in range(len(y_test_unlab))]
# recombine training dataset labels
y_train_mixed = concatenate((y_train_lab, nolabel))
# define model
model = LabelPropagation()
# fit model on training dataset
model.fit(X_train_mixed, y_train_mixed)
# make predictions on hold out test set
yhat = model.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))

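Running the example fits the model on the entire training dataset, evaluates it on the holdout dataset, and prints the classification accuracy.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome. In this case, we can see that the label propagation model achieves a classification accuracy of about 85.6 percent, which is slightly higher than the roughly 84.8 percent achieved by the logistic regression fit only on the labeled training dataset.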

Accuracy: 85.600

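Another approach we can use with the semi-supervised model is to take the estimated labels for the training dataset and fit a supervised learning model on them.

Recall that we can retrieve the labels for the entire training dataset from the label propagation model as follows: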

# get labels for entire training dataset data
tran_labels = model.transduction_
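Because this dataset is synthetic, the true labels of the "unlabeled" rows were held back in y_test_unlab rather than lost, so the inferred labels can be checked directly. The sketch below does this; it is an extra sanity check that is not part of the original tutorial, and in a real project those true labels would normally not be available. The names n_lab and inferred are introduced only for this example.

```python
# check the inferred labels against the held-back true labels (illustrative sketch)
from numpy import concatenate
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import LabelPropagation
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
# split train into labeled and unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
# build the mixed training set: labeled rows first, then unlabeled rows marked -1
X_train_mixed = concatenate((X_train_lab, X_test_unlab))
nolabel = [-1 for _ in range(len(y_test_unlab))]
y_train_mixed = concatenate((y_train_lab, nolabel))
# fit label propagation and get labels for the entire training dataset
model = LabelPropagation()
model.fit(X_train_mixed, y_train_mixed)
tran_labels = model.transduction_
# the unlabeled rows were appended after the labeled rows, so slice them off the end
n_lab = len(y_train_lab)
inferred = tran_labels[n_lab:]
# compare inferred labels with the true labels that were held back
score = accuracy_score(y_test_unlab, inferred)
print('Inferred Label Accuracy: %.3f' % (score*100))
```

The accuracy of the inferred labels puts a rough ceiling on how much a supervised model trained on them can be expected to gain from the extra rows.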

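We can then use these labels, along with all of the input data, to train and evaluate a supervised learning algorithm, such as a logistic regression model. The hope is that the supervised learning model fit on the entire training dataset will achieve even better performance than the semi-supervised learning model alone.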

# define supervised learning model
model2 = LogisticRegression()
# fit supervised learning model on entire training dataset
model2.fit(X_train_mixed, tran_labels)
# make predictions on hold out test set
yhat = model2.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))

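Tying this together, the complete example of using the estimated training set labels to train and evaluate a supervised learning model is listed below.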

# evaluate logistic regression fit on label propagation for semi-supervised learning
from numpy import concatenate
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import LabelPropagation
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
# split train into labeled and unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
# create the training dataset input
X_train_mixed = concatenate((X_train_lab, X_test_unlab))
# create "no label" for unlabeled data
nolabel = [-1 for _ in range(len(y_test_unlab))]
# recombine training dataset labels
y_train_mixed = concatenate((y_train_lab, nolabel))
# define model
model = LabelPropagation()
# fit model on training dataset
model.fit(X_train_mixed, y_train_mixed)
# get labels for entire training dataset data
tran_labels = model.transduction_
# define supervised learning model
model2 = LogisticRegression()
# fit supervised learning model on entire training dataset
model2.fit(X_train_mixed, tran_labels)
# make predictions on hold out test set
yhat = model2.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))

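Running the example fits the semi-supervised model on the entire training dataset, then fits a supervised learning model on the entire training dataset with the inferred labels, evaluates it on the holdout dataset, and prints the classification accuracy.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome. In this case, we can see that this hierarchical approach of the semi-supervised model followed by a supervised model achieves a classification accuracy of about 86.2 percent on the holdout dataset, even better than the roughly 85.6 percent achieved by the semi-supervised learning model used alone.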

Accuracy: 86.200

Finally, here are some API references:

sklearn.semi_supervised.LabelPropagation API.
Section 1.14. Semi-Supervised, Scikit-Learn User Guide.
sklearn.model_selection.train_test_split API.
sklearn.linear_model.LogisticRegression API.
sklearn.datasets.make_classification API.