A Survey on the 20-Year Journey of Semi-Supervised Learning
This post is a brief review of the Springer survey paper “A survey on semi-supervised learning” (van Engelen and Hoos 2020), which provides an in-depth treatment of the key approaches and algorithms practiced over the past two decades. The paper also proposes a new taxonomy for semi-supervised classification algorithms, shedding light on the most prominent and successful work in the area.
This article covers the background and assumptions of semi-supervised learning, concisely walks through the newly proposed taxonomy, and closes with future scope and directions. My aim is to give new researchers a solid picture of twenty years of evolution in semi-supervised techniques by summarizing this 68-page survey.
1. Basic Concepts and Assumptions
1.1 Assumptions of semi-supervised learning
An important precondition of semi-supervised learning is that the underlying data distribution p(x) contains information about the posterior distribution p(y|x). Under this condition, we can use unlabeled data to gain information about p(x), and thereby about p(y|x). However, p(x) and p(y|x) do not always interact in the same way, which is the source of the different semi-supervised learning assumptions (Chapelle et al. 2006b). The most widely used are the smoothness assumption (if two samples x1 and x2 are close in the input space, their labels y1 and y2 should be the same), the low-density assumption (the decision boundary should not pass through high-density areas of the input space), and the manifold assumption (data points lying on the same low-dimensional manifold should have the same label). Fig. 1 shows a visual representation of these assumptions.

1.2 Connection to Clustering
According to Chapelle et al. (2006b), data points that belong to the same cluster come from the same class; this is called the cluster assumption of semi-supervised learning. The survey (van Engelen and Hoos 2020), however, argues that the cluster assumption is a generalization of the other assumptions. In other words, if the labeled and unlabeled data points cannot be clustered meaningfully, no semi-supervised learning method can improve on a supervised method.
2. Taxonomy of Semi-supervised learning methods
A wide variety of semi-supervised classification algorithms has been proposed over the past two decades. These methods differ on several criteria, such as the semi-supervised learning assumptions they rely on and the way they make use of unlabeled data. The proposed taxonomy of existing semi-supervised learning methods is visualized in Fig. 2. At the highest level, it distinguishes between inductive and transductive methods, which differ in the nature of their optimization procedures: inductive methods attempt to build a classification model, whereas transductive methods aim to obtain label predictions for the provided unlabeled data points only. The next two sections briefly discuss these two families along with their subdivisions.

3. Inductive methods
By definition, inductive methods build a classifier that can generate predictions for any point in the input space. Unlabeled data may be used to train this classifier, but once training is complete, predictions for unseen data points are made independently of each other. This yields the familiar supervised objective in a semi-supervised setting: a model is built in the training phase and can later be used to predict the labels of new data points.
3.1 Wrapper methods
Wrapper methods (Zhu 2008) form the first branch on the inductive side of the taxonomy. A significant advantage of wrapper methods is that they can be combined with almost any supervised base learner. They utilize one or more supervised base learners and iteratively train them on labeled data from the original data set together with previously unlabeled data augmented with predictions from earlier iterations of the learners; the latter is commonly known as pseudo-labeled data. The procedure typically alternates between two steps: training and pseudo-labeling. In the training step, one or more supervised classifiers are trained on the labeled data and the pseudo-labeled data from earlier iterations. In the pseudo-labeling step, the resulting classifiers are used to infer labels for the remaining unlabeled data, and the most confident predictions are added as pseudo-labels in the next iteration.
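To make the train/pseudo-label alternation concrete, here is a minimal sketch of a generic pseudo-labeling wrapper (not code from the survey); the base learner, confidence threshold, and iteration budget are arbitrary assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pseudo_label_wrapper(X_lab, y_lab, X_unlab, threshold=0.95, max_iter=10):
    """Alternate between training a base learner and absorbing its most confident predictions."""
    X_train, y_train = X_lab.copy(), y_lab.copy()
    remaining = X_unlab.copy()
    model = LogisticRegression(max_iter=1000)
    for _ in range(max_iter):
        model.fit(X_train, y_train)                 # training step
        if len(remaining) == 0:
            break
        proba = model.predict_proba(remaining)      # pseudo-labeling step
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break                                   # nothing is confident enough; stop early
        X_train = np.vstack([X_train, remaining[confident]])
        y_train = np.concatenate([y_train, model.classes_[proba[confident].argmax(axis=1)]])
        remaining = remaining[~confident]
    return model

# Toy usage on synthetic data: 30 labeled points, the rest treated as unlabeled
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, random_state=0)
clf = pseudo_label_wrapper(X[:30], y[:30], X[30:])
```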
3.1.1 Self-training
Self-training is the first of the three types of wrapper methods. It was first proposed by Yarowsky (1995) as an approach to word sense disambiguation, predicting the meaning of ambiguous words from their surrounding context. Self-training uses a single supervised classifier that is re-trained iteratively on its own most confident predictions. Self-training or self-learning approaches (Triguero et al. 2015) are widely used as one of the most basic pseudo-labeling approaches, and the procedure is typically iterated until no unlabeled data remain or some other stopping criterion is met.
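Recent scikit-learn versions ship a ready-made self-training wrapper; a minimal usage sketch (the dataset, label-hiding scheme, and threshold are illustrative assumptions, not the survey's setup) might look like this:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
rng = np.random.RandomState(0)
y_mixed = y.copy()
y_mixed[rng.rand(len(y)) < 0.9] = -1      # hide 90% of the labels; -1 marks "unlabeled"

# Wrap a probabilistic base classifier and re-train it on its confident predictions
model = SelfTrainingClassifier(SVC(probability=True, gamma="auto"), threshold=0.75)
model.fit(X, y_mixed)
print((model.predict(X) == y).mean())     # sanity check against the full label set
```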
3.1.2 Co-training
Co-training, an extension of self-training to multiple classifiers, is the second type of wrapper method. Here the classifiers are iteratively re-trained on each other's most confident predictions. For co-training to succeed, the classifiers should be sufficiently different, which is usually accomplished by having them operate on different subsets of the data objects or features. In the literature, this condition is usually referred to as the diversity criterion (Wang and Zhou 2010).
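A rough, simplified sketch of two-view co-training follows; the base learners, the number of points transferred per round, and the shared label pool are assumptions made for brevity rather than a faithful reproduction of any specific published algorithm:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_training(X1, X2, y, n_rounds=10, per_round=5):
    """Two classifiers on two feature views teach each other; -1 marks an unlabeled point."""
    y = y.copy()
    clf1, clf2 = GaussianNB(), GaussianNB()
    for _ in range(n_rounds):
        labeled = y != -1
        clf1.fit(X1[labeled], y[labeled])
        clf2.fit(X2[labeled], y[labeled])
        for clf, X in ((clf1, X1), (clf2, X2)):
            unlabeled = np.where(y == -1)[0]
            if len(unlabeled) == 0:
                return clf1, clf2
            proba = clf.predict_proba(X[unlabeled])
            # transfer this view's most confident predictions to the shared label pool
            best = np.argsort(proba.max(axis=1))[-per_round:]
            y[unlabeled[best]] = clf.classes_[proba[best].argmax(axis=1)]
    return clf1, clf2

# Toy usage: split the features of a synthetic dataset into two "views"
from sklearn.datasets import make_classification
X, y_true = make_classification(n_samples=400, n_features=20, random_state=0)
y_part = y_true.copy()
y_part[40:] = -1                                  # keep only 40 labels
clf1, clf2 = co_training(X[:, :10], X[:, 10:], y_part)
```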
3.1.3 Boosting
The last sub-type of wrapper methods consists of pseudo-labeled boosting methods. There are two main branches of supervised ensemble learning: bagging and boosting (Zhou 2012); in boosting, each base learner is conditioned on the previous base learners. Like conventional boosting methods, pseudo-labeled boosting methods build an ensemble by sequentially constructing individual classifiers, but each classifier is trained on both the labeled data and unlabeled data pseudo-labeled with the most confident predictions of the previously trained classifiers.
3.2 Unsupervised pre-processing
Unsupervised preprocessing, the second category of inductive methods, either extracts useful features from the unlabeled data, pre-clusters the data, or determines the initial parameters of a supervised learning procedure in an unsupervised manner. Like wrapper methods, these techniques can be combined with any supervised classifier, but unlike wrapper methods, the supervised classifier is trained only on the originally labeled data. The unsupervised step consists of the automated extraction or transformation of sample features from the unlabeled data (feature extraction), the unsupervised clustering of the data (cluster-then-label), or the initialization of the parameters of the learning procedure (pre-training).
3.2.1 Feature Extraction
Feature extraction methods play a vital role in building classifiers: they try to find a transformation of the input data that improves classification performance or makes the classifier computationally more efficient. The survey discusses only a small selection of prominent techniques and refers readers to dedicated literature such as Guyon and Elisseeff (2006) and Sheikhpour et al. (2017).
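As a simple illustration of this idea (my own toy sketch, not an example from the survey), one can learn a transformation such as PCA on all available data, labeled and unlabeled alike, and then train the supervised classifier only on the labeled portion:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=50, random_state=0)
labeled = np.zeros(len(y), dtype=bool)
labeled[:50] = True                              # pretend only 50 points carry labels

pca = PCA(n_components=10).fit(X)                # transformation learned from ALL the data
clf = LogisticRegression(max_iter=1000).fit(pca.transform(X[labeled]), y[labeled])
print(clf.score(pca.transform(X[~labeled]), y[~labeled]))
```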
3.2.2 Cluster-then-label
Many semi-supervised learning algorithms use techniques inspired by clustering to guide the classification process. Cluster-then-label procedures form a group of methods that explicitly join the clustering and classification steps: first, an unsupervised or semi-supervised clustering algorithm is applied to all the input data, and the resulting clusters are then used to guide the classification process. Demiriz et al. (1999) were among the first to cluster the data in a semi-supervised way, favouring clusters with low label impurity, i.e., a high degree of label consistency within each resulting cluster. Later, Goldberg et al. (2009) first clustered the labeled data together with a subset of the unlabeled data.
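A bare-bones cluster-then-label sketch (my own illustration under the assumption that each cluster is simply assigned the majority label of its labeled members) could look like this:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=600, centers=3, random_state=0)
labeled = np.zeros(len(y), dtype=bool)
labeled[:15] = True                               # only a handful of points keep their labels

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)   # cluster ALL the data
y_pred = np.empty(len(y), dtype=int)
for c in range(kmeans.n_clusters):
    members = kmeans.labels_ == c
    known = members & labeled
    # every member of the cluster gets the majority label of its labeled members
    # (fall back to class 0 if a cluster happens to contain no labeled point)
    votes = np.bincount(y[known]) if known.any() else np.array([0])
    y_pred[members] = votes.argmax()
print((y_pred[~labeled] == y[~labeled]).mean())   # accuracy on the unlabeled points
```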
3.2.3 Pre-training
Pre-training approaches are closely connected to deep learning: they naturally fit settings where each layer of a hierarchical model provides a latent representation of the input data. In pre-training methods, the unlabeled data are used to guide the initial parameters of the model, and thus its initial decision boundary, before supervised training is applied. Training deep networks with many parameters from a random initialization is difficult, with slow convergence and poor generalization, which unsupervised pre-training was shown to alleviate (Erhan et al. 2010). The survey mainly covers pre-training approaches stemming from the first decade of the 2000s.
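The following is a minimal PyTorch sketch of the two-stage idea, assuming toy tensor shapes (e.g. flattened 28x28 inputs) rather than any dataset or architecture from the survey: first pre-train an encoder as an autoencoder on unlabeled data, then fine-tune it with a classification head on the labeled data.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the data; shapes and sizes are assumptions for illustration only
x_unlab = torch.rand(256, 784)                                      # many unlabeled inputs
x_lab, y_lab = torch.rand(32, 784), torch.randint(0, 10, (32,))     # few labeled ones

encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 64), nn.ReLU())
decoder = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 784))

# Stage 1: unsupervised pre-training of the encoder as an autoencoder
autoencoder = nn.Sequential(encoder, decoder)
opt = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    nn.functional.mse_loss(autoencoder(x_unlab), x_unlab).backward()
    opt.step()

# Stage 2: supervised fine-tuning of the pre-trained encoder plus a fresh head
classifier = nn.Sequential(encoder, nn.Linear(64, 10))
opt = torch.optim.Adam(classifier.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    nn.functional.cross_entropy(classifier(x_lab), y_lab).backward()
    opt.step()
```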
3.3 Intrinsically semi-supervised
Intrinsically semi-supervised methods are the last class of inductive methods. Here, unlabeled data is directly incorporated into the objective function or optimization procedure of the learning method. Most of these methods are direct extensions of supervised learning methods, obtained by transforming the objective function to include unlabeled data. There are intrinsically semi-supervised extensions of many leading supervised learning approaches, including SVMs, Gaussian processes, and neural networks. For example, semi-supervised support vector machines (S3VMs) extend supervised SVMs by maximizing the margin over both labeled and unlabeled data. Based on the semi-supervised learning assumptions they rely on, these methods are divided into four categories: maximum-margin methods, perturbation-based methods, manifold-based techniques, and generative models.
3.3.1 Maximum Margin
Maximum-margin methods are among the earliest intrinsically semi-supervised classification methods. These methods aim to maximize the distance between the given data points and the decision boundary, and they are based on the semi-supervised low-density assumption: if the margin between all data points and the decision boundary is large, except perhaps for some outliers, the boundary necessarily falls in a low-density area (see also Ben-David et al. 2009).
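Up to notation and constant choices, the S3VM objective mentioned above is commonly written for l labeled and n − l unlabeled points as

$$
\min_{\mathbf{w},\, b}\ \tfrac{1}{2}\lVert \mathbf{w} \rVert^2
+ C \sum_{i=1}^{l} \max\bigl(0,\ 1 - y_i(\mathbf{w}^\top \mathbf{x}_i + b)\bigr)
+ C' \sum_{j=l+1}^{n} \max\bigl(0,\ 1 - \lvert \mathbf{w}^\top \mathbf{x}_j + b \rvert\bigr),
$$

where the first two terms are the usual supervised SVM objective on the labeled points and the last term pushes the unlabeled points away from the decision boundary, implementing the low-density assumption.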
3.3.2 Perturbation-based
Perturbation-based methods directly incorporate the smoothness assumption, namely the idea that a predictive model should be robust to small local perturbations of its input. This implies that when a data point is perturbed with a small amount of noise, the predictions for the noisy and clean inputs should not differ much. Since this expected similarity does not depend on the true label of the data point, unlabeled data can be used very naturally. Owing to this straightforward incorporation of unlabeled data, perturbation-based methods are often implemented with neural networks, whose recent successes in many application areas (e.g., Collobert et al. 2011; Krizhevsky et al. 2012; LeCun et al. 2015) have accelerated this line of research.
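A minimal consistency-regularization sketch in PyTorch is shown below; the network, noise level, and loss weighting are assumptions made for illustration and do not reproduce any specific method from the survey. The unlabeled term simply asks the model to give similar predictions for a point and a slightly perturbed copy of it.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 3))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy stand-ins: a small labeled batch and a larger unlabeled batch (shapes assumed)
x_lab, y_lab = torch.rand(16, 20), torch.randint(0, 3, (16,))
x_unlab = torch.rand(128, 20)

for _ in range(200):
    opt.zero_grad()
    # supervised term on the labeled batch
    sup = nn.functional.cross_entropy(model(x_lab), y_lab)
    # consistency term: predictions for a point and its noisy copy should agree,
    # which requires no labels (smoothness assumption)
    noisy = x_unlab + 0.05 * torch.randn_like(x_unlab)
    cons = nn.functional.mse_loss(torch.softmax(model(x_unlab), dim=1),
                                  torch.softmax(model(noisy), dim=1))
    (sup + 10.0 * cons).backward()
    opt.step()
```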
3.3.3 Manifolds
Manifold methods relax the idea that every small change in the input should leave the prediction unchanged. They rely on the manifold assumption: (a) the input space is composed of multiple lower-dimensional manifolds on which all data points lie, and (b) data points lying on the same lower-dimensional manifold have the same label. The survey discusses two general types of methods based on this assumption. First, manifold regularization techniques, which penalize differences in predictions for data points that are close along a manifold; second, manifold approximation techniques, which explicitly estimate the manifolds and then optimize an objective function over them.
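One common way to make the first idea concrete, following the well-known Laplacian/manifold regularization formulation (the trade-off weights γ_A and γ_I below are hyperparameters, and the notation is mine rather than the survey's), is the objective

$$
\min_{f}\ \sum_{i=1}^{l} \ell\bigl(f(\mathbf{x}_i), y_i\bigr)
+ \gamma_A \lVert f \rVert_K^2
+ \gamma_I \sum_{i,j=1}^{n} W_{ij}\,\bigl(f(\mathbf{x}_i) - f(\mathbf{x}_j)\bigr)^2,
$$

where W is a similarity (graph adjacency) matrix over all n points, so the last term penalizes predictions that differ for data points that are close on the estimated manifold.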
3.3.4 Generative Models
All the procedures discussed so far are discriminative: they tackle the classification problem without explicitly modeling the data-generating distribution. In contrast, generative approaches model the process that generates the data; for classification problems, the data-generating process is modeled conditionally on the given label y. According to the paper, some of the widely used generative approaches are mixture models, generative adversarial networks, and variational autoencoders.
4. Transductive methods
Transductive algorithms form the second major category of semi-supervised learning methods. Unlike inductive algorithms, transductive algorithms do not produce a predictor that operates over the entire input space. Instead, they yield a set of predictions for the provided unlabeled data only; consequently, there is no distinction between a training phase and a testing phase.
Transductive algorithms are given labeled data (XL, yL) and unlabeled data XU, and produce as output exclusively the predictions ŷU for the unlabeled data. Since they typically propagate information through a graph defined over both the labeled and unlabeled data points, these methods are collectively known as graph-based methods (Zhu 2008). Topics covered by the survey include a general framework for graph-based methods, inference in graphs, and probabilistic label assignments with Markov random fields.
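As a quick taste of graph-based transduction (my own toy example, with an arbitrary dataset, kernel, and number of revealed labels), scikit-learn's LabelSpreading builds a similarity graph over all points and propagates the few known labels across it:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelSpreading

X, y = make_moons(n_samples=300, noise=0.05, random_state=0)
y_mixed = np.full_like(y, -1)                      # -1 marks an unlabeled point
for cls in (0, 1):
    y_mixed[np.where(y == cls)[0][:3]] = cls       # reveal only three labels per class

# Build a k-NN similarity graph over ALL points and propagate the known labels
model = LabelSpreading(kernel="knn", n_neighbors=7).fit(X, y_mixed)
mask = y_mixed == -1
print((model.transduction_[mask] == y[mask]).mean())   # accuracy on the unlabeled points
```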

5. Conclusions and future scope
The survey covers a large portion of the field of semi-supervised learning, from the early 2000s to recent publications, and presents an up-to-date taxonomy of semi-supervised classification methods. Against this backdrop, the authors highlight a few points as future directions:
- One important issue to address in the near future is the possible performance degradation caused by introducing unlabeled data: many semi-supervised techniques outperform their supervised counterparts or base learners only in specific cases (Li and Zhou 2015; Singh et al. 2009).
- Recent studies have shown that perturbation-based methods with neural networks consistently outperform their supervised counterparts. This considerable advantage should be explored further, and semi-supervised neural networks deserve more attention in the field.
- Recently, automated machine learning (AutoML) has been widely used to improve the robustness of models; such approaches include meta-learning and neural architecture search for automatic algorithm selection and hyperparameter optimization. While AutoML techniques have been successfully applied to supervised learning, research on their semi-supervised counterparts is lagging behind. Studying this area further could bring striking results to semi-supervised approaches.
- Another important step towards the advancement of semi-supervised approaches is the availability of standardized software packages or libraries dedicated to this domain. Currently, some generic packages, such as the KEEL software package, do include a semi-supervised learning module (Triguero et al. 2017).
- One vital area that needs serious attention from researchers is building a stronger connection between clustering and classification. Essentially, both can be seen as special cases of semi-supervised learning in which only unlabeled or only labeled data, respectively, is considered. The recent surge of interest in generative models is a promising sign for this direction.
To conclude, unlabeled data now plays a vital role in the progress of machine learning and is driving a paradigm shift; semi-supervised learning is well placed to get the best out of it.
Key References:
- van Engelen, J. E., & Hoos, H. H. (2020). A survey on semi-supervised learning. Machine Learning, 109(2), 373–440.
- Bengio, Y., Delalleau, O., & Le Roux, N. (2006). Chapter 11. Label propagation and quadratic criterion. In O. Chapelle, B. Schölkopf, & A. Zien (Eds.), Semi-supervised learning (pp. 193–216). Cambridge: The MIT Press.
- Goldberg, A. B., Zhu, X., Singh, A., Xu, Z., & Nowak, R. D. (2009). Multi-manifold semi-supervised learning. In Proceedings of the 12th international conference on artificial intelligence and statistics (pp. 169–176).
- Haffari, G. R., & Sarkar, A. (2007). Analysis of semi-supervised learning with the Yarowsky algorithm. In Proceedings of the 23rd conference on uncertainty in artificial intelligence (pp. 159–166).
- Triguero, I., García, S., & Herrera, F. (2015). Self-labeled techniques for semi-supervised learning: Taxonomy, software and empirical study. Knowledge and Information Systems, 42(2), 245–284.
- Jebara, T., Wang, J., & Chang, S. F. (2009). Graph construction and b-matching for semi-supervised learning. In Proceedings of the 26th annual international conference on machine learning (pp. 441–448).
- Guyon, I., & Elisseeff, A. (2006). An introduction to feature extraction. In I. Guyon, M. Nikravesh, S. Gunn, & L. A. Zadeh (Eds.), Feature extraction (pp. 1–25). Berlin: Springer.
- Sheikhpour, R., Sarram, M. A., Gharaghani, S., & Chahooki, M. A. Z. (2017). A survey on semi-supervised feature selection methods. Pattern Recognition, 64, 141–158.
- Bennett, K. P., & Demiriz, A. (1999). Semi-supervised support vector machines. In Advances in neural information processing systems (pp. 368–374).
- Urner, R., Ben-David, S., & Shalev-Shwartz, S. (2011). Access to unlabeled data can speed up prediction time. In Proceedings of the 27th international conference on machine learning (pp. 641–648).
- Bruna, J., Zaremba, W., Szlam, A., & LeCun, Y. (2014). Spectral networks and locally connected networks on graphs. In International conference on learning representations.