TL;DR: There is no such thing as unsupervised learning. Every choice made when formulating a learning problem can be broadly thought of as supervision. The existence of data (regardless of its kind) can also be thought of as supervision, and so can the inductive biases imposed by experts, such as the model class and the learning algorithm. Incorporating supervision effectively is the essence of machine learning.

Introduction

All machine learning starts with data. Nonetheless, there is a widespread idea that some data is better than other data; that some learning is supervised and other learning is unsupervised; that unsupervised learning does not require supervision. In this post, I want to question these beliefs and rethink supervision. I claim that the existence of data is, by itself, supervision; that supervision is anything that steers the model in a particular direction; that supervision is not limited to data; that all machine learning is supervised in the truest sense; and that categorizing learning algorithms or problems as supervised, unsupervised, semi-supervised, self-supervised, transfer, reinforcement, and so on, is of little value at best and misleading at worst.

Supervised or unsupervised?

Left-to-right language modelling is a popular machine learning problem: given a prefix of words (a context), predict the next word. Should this problem be considered supervised or unsupervised learning? I think it should be considered supervised learning because whatever supervision is necessary to train the model is present in the data. Why should we think of it as unsupervised learning just because the data wasn't labelled or generated for the specific purpose of training these models?
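To make the point concrete, here is a minimal sketch (plain Python, not any specific library's API) of how raw text already yields standard (input, label) pairs for next-word prediction; the toy corpus and context size are, of course, placeholders.

```python
# A minimal sketch: raw text already provides (context, next-word) pairs,
# i.e., ordinary supervised examples, without any extra annotation.

def make_next_word_examples(corpus, context_size=3):
    """Turn a list of tokenized sentences into (prefix, target) training pairs."""
    examples = []
    for sentence in corpus:
        for i in range(1, len(sentence)):
            prefix = tuple(sentence[max(0, i - context_size):i])  # the "input"
            target = sentence[i]                                   # the "label"
            examples.append((prefix, target))
    return examples

corpus = [["the", "cat", "sat", "on", "the", "mat"]]
for prefix, target in make_next_word_examples(corpus):
    print(prefix, "->", target)
```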

Clustering, which we often think of as the prototypical unsupervised learning task, depends on the choice of vector space in which the data points lie. This vector space is not readily available and must be chosen by the expert. Shouldn't this choice invalidate the claim that clustering is unsupervised learning? What makes this choice less worthy of being called supervision than the choices made by human data annotators?
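A small sketch of this point, assuming scikit-learn is available: the same toy records, encoded under two different expert-chosen feature maps, can be partitioned very differently by the same clustering algorithm.

```python
# A sketch, assuming scikit-learn: the "unsupervised" result hinges on the
# expert's choice of vector space (here, raw units vs. standardized features).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
# Toy records: (income in dollars, age in years).
records = np.column_stack([rng.normal(50_000, 20_000, 200), rng.normal(40, 12, 200)])

raw = records                                                  # choice 1: raw units (income dominates distances)
standardized = (records - records.mean(0)) / records.std(0)    # choice 2: standardized features

labels_raw = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(raw)
labels_std = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(standardized)

# Agreement between the two partitions (1.0 would mean identical clusterings).
print(adjusted_rand_score(labels_raw, labels_std))
```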

A popular approach is to do supervised learning on a proxy task that is of little interest in itself but is sufficiently related to tasks of interest that it induces useful representations, allowing us to learn with less additional supervision downstream. These representations can then be used to achieve higher performance (e.g., by fine-tuning the parameters of the pretrained model on data for the task of interest). Examples include visual feature extractors learned through image classification on ImageNet, or sentence encoders learned through left-to-right or masked language modelling on large internet text corpora. I fail to see how this qualifies as unsupervised learning. While this data is more readily available than a handcrafted dataset for a specific task, it is still a supervised task with supervised data. Maybe it is called unsupervised or transfer learning when we use the pretrained model for a different task that we care about. Even so, the pretrained model was initialized in a supervised manner.
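A sketch of what this looks like in practice, assuming PyTorch and a recent torchvision (the weight enum and the `fc` layer name are specific to resnet18; the downstream class count is hypothetical). The "free" benefit of the backbone comes from supervised ImageNet training done beforehand.

```python
# A sketch, assuming PyTorch/torchvision: a backbone pretrained on supervised
# ImageNet classification (the proxy task), fine-tuned on a downstream task.
import torch.nn as nn
from torchvision import models

backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Swap the classification head for the task of interest and fine-tune.
num_downstream_classes = 10  # hypothetical downstream task
backbone.fc = nn.Linear(backbone.fc.in_features, num_downstream_classes)

# Optionally freeze the backbone, keeping the supervision already baked into it.
for name, param in backbone.named_parameters():
    if not name.startswith("fc"):
        param.requires_grad = False
```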

Supervision through model class design

While we like to think that most of the information lies in the data, the model class has a huge impact on the performance of the final trained model. Choices about the model class from which the learned model will be drawn, such as which initial features to use and how to encode them, can lead to large differences in performance. In deep learning alone, there are likely thousands of papers devoted to model class variations.
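A small sketch, assuming scikit-learn: the same data fit with two model classes, where the only difference is an expert-chosen feature encoding. The choice itself injects information that the raw data does not announce.

```python
# A sketch, assuming scikit-learn: same data, two model classes. The expert's
# choice of features acts as supervision about the structure of the problem.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = np.sin(x) + rng.normal(0, 0.1, 200)  # the underlying signal is periodic

# Model class 1: linear in the raw input.
m1 = LinearRegression().fit(x[:, None], y)

# Model class 2: linear in expert-chosen periodic features (an inductive bias).
feats = np.column_stack([np.sin(x), np.cos(x)])
m2 = LinearRegression().fit(feats, y)

print("raw input R^2:     ", m1.score(x[:, None], y))
print("periodic feats R^2:", m2.score(feats, y))
```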

Consider the goal of training a conversational agent from existing conversations between two people. In one case, we treat all conversations as coming from a single, non-specific source, e.g., by training a model that conditions on the conversational context and produces a response. In this case, the speakers are not explicitly modelled. In the other case, we take into account the personas of the people chatting (e.g., you may have side information about the people in your training dataset that captures their likes and dislikes, or you may have access to past conversations or utterances from each person).

Based on the persona information, it is likely that we can train a generation model that better captures what it means for two people to engage in conversation and how their interests, desires, and personalities interact and steer the conversation. In the first case, personas are not modelled beyond what might arise from the textual context of the conversation. In the second case, the model has access to past interactions from which to induce a persona. This also endows the model with a steerable aspect: its user can provide these interactions to induce suitable personas.

The second case uses more supervision than the first, but perhaps not so much more that it becomes considerably harder to train, as it is conceivable that conversations could easily be annotated with this information. Fundamentally, the difference between these learning formulations is the information that is conditioned on for generation, as sketched below. While perhaps a modest difference at first sight, I find it reasonable to believe that the first model would default to generations that are common across all people and all conversational contexts, while the second model would better capture the dynamics of the conversation (e.g., how people tend to talk about shared interests) and therefore generate more meaningful conversations. If, additionally, we can provide the model with information about the situational context in which the conversation happens (e.g., a dating website versus a political forum), it can do an even better job of capturing those dynamics. All this information is present in the data, but it is only available as supervision for training if the model effectively encodes it and incorporates it for prediction.
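A sketch of the two formulations, using a hypothetical text serialization (the field names and prompt format are mine, not from any particular dataset): the only difference is what gets placed in the conditioning context of the sequence model.

```python
# A sketch with a hypothetical serialization format: the two formulations differ
# only in the information placed in the conditioning context.

def context_only_example(turns):
    """Formulation 1: condition on the dialogue history alone."""
    prompt = "\n".join(f"{speaker}: {text}" for speaker, text in turns[:-1])
    return {"input": prompt, "target": turns[-1][1]}

def persona_conditioned_example(turns, personas):
    """Formulation 2: additionally condition on per-speaker persona information."""
    persona_block = "\n".join(f"persona[{s}]: {p}" for s, p in personas.items())
    prompt = persona_block + "\n" + "\n".join(f"{s}: {t}" for s, t in turns[:-1])
    return {"input": prompt, "target": turns[-1][1]}

turns = [("A", "Any plans for the weekend?"), ("B", "Probably a long bike ride.")]
personas = {"A": "works long hours, likes hiking", "B": "avid cyclist"}
print(context_only_example(turns))
print(persona_conditioned_example(turns, personas))
```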

Embodiment and affordances

When we play a video game (e.g., an RPG), do we learn affordances from scratch? Almost certainly not—we rely heavily on experience transferred from the real world, e.g., doors in the game should have similar semantics to doors in real life. Based on the game we are playing, we expect certain objects to be important and certain actions to be possible. The learning problem that a person must solve when picking up a video game is much shallower than learning an arbitrary mapping from visuals to actions. It hinges strongly on expectations of how the game should work and what its objectives are. Our expectations are already supervised by evolution and by experience in the real world, so it is foolish to think that humans learn to play video games in a tabula rasa manner.

Our understanding of everyday objects is a function of how they relate to us and how they support our existence. For example, a power outage results in an abrupt change of perceived affordances. You suddenly become aware that your refrigerator and computer no longer offer the same affordances, and if the outage were to continue indefinitely, the new affordances would become permanent.

The relation of objects to us is clear from our embodied existence, but this supervision is not fully present in the data that we currently use to train machine learning models. Getting supervision about affordances is, in my opinion, perhaps one of the greatest barriers to training open-domain robots. Text is fairly poor at providing supervision about manipulation because current datasets rarely pair text with manipulation data. Furthermore, for humans, most of the supervision about manipulation is hardwired, with the rest acquired early in life. For example, the supervision about manipulation that you might find in text is already expressed in terms of fairly high-level constructs (e.g., peel, cut, and simmer in the case of recipes), which do not contain enough information to train a manipulator from scratch. Humans can learn from text and videos because we know how to transfer words and diagrams on a page to behavior that we must ourselves execute in the real world, i.e., we know how to consume supervision presented in this form. Other animals are also able to learn from demonstrations.

Symbolic manipulation and world models are useful from an energy minimization perspective. If you can do the exact same computation (meaning a computation of equal value in supporting your existence), you would much rather do it with a low-resolution model than with a high-resolution model that contains a lot of information that is useless to the problem you want to solve. Abstraction therefore comes from a careful balancing act between the faithfulness of a representation and the effort needed to maintain it (e.g., selective blindness and our inability to see hollow faces). Symbolic representations likely arise from our continuous world as a way of reasoning about its state. For example, if some aspect is always in the same configuration, we do not need to pay attention to it; we know that it will be in that configuration, making perception less burdensome.

I think that robotics is not yet capable enough, not because we lack the technology to make it so, but because we haven't fully internalized the bitter lesson. Why don't we tackle robotics heavily through data collection? Construct a large-scale, two-person data collection experiment where one person teleoperates the robot and the other passes the instructions along. In this setting, affordances are properly demonstrated and paired with text (and potentially other multimodal information). Combining this with large-scale language models might be enough to induce supervision for many more actions than those that were demonstrated. Maybe supervision can be induced by stringing together supervision chains at different levels. For a behavior to be exhibited, a supervision chain must be present somehow, from links resulting from the agent's experience in the real world to links induced over long time ranges that can be ascribed to evolution. This is similar to how pretraining works in machine learning: first you learn to do image classification on ImageNet, which is enough to learn a good visual feature extractor, which can then be used to learn other tasks more efficiently, building on the supervision of the proxy task to solve the task of interest.
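As a rough illustration of the kind of paired record such a collection effort might produce (all field names here are hypothetical, not an existing dataset schema):

```python
# A sketch of one record in a hypothetical teleoperation dataset where language
# instructions are paired with demonstrated low-level behavior.
from dataclasses import dataclass, field
from typing import List

@dataclass
class TeleopEpisode:
    instruction: str                                           # what the instructor asked for, in words
    observations: List[bytes] = field(default_factory=list)    # e.g., encoded camera frames
    actions: List[List[float]] = field(default_factory=list)   # teleoperated joint/gripper commands
    outcome: str = ""                                          # e.g., "success", "dropped object"

episode = TeleopEpisode(
    instruction="put the mug on the top shelf",
    actions=[[0.1, -0.3, 0.7, 0.0]],
    outcome="success",
)
print(episode.instruction, len(episode.actions))
```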

Is human learning unsupervised?

We are incredibly constrained by our environment and experience. Our learning is constrained. Our bodies and brains are products of evolution. Whatever adaptations we have exist because they conferred evolutionary advantages. If these adaptations were useless, nature would not spend energy on them. The existence of an organism in its present form is a testament to the evolutionary durability of its adaptations. It does not make sense to regard an organism separately from its physical form and the environment in which it is embedded, or to regard its skills without considering the purpose they serve in supporting its existence. In this sense, the evolutionary process by which organisms survive or die is supervision in itself. The adaptations developed through this route are supervised by this evolutionary process.

For humans, these adaptations are broadly the systems of vision, speech, hearing, smell, taste, and touch, and the cognitive abilities that these systems make possible. These systems are underdeveloped in a newborn and are then shaped during the child's development. This development is influenced by the environment, requiring specific resources and stimuli for cognitive abilities to develop optimally. Nonetheless, the process by which development happens, while dependent on external stimuli, is constrained by the genotype. The existence of a mechanism that adjusts the organism to the conditions it encounters in its environment can likely also be seen as an evolutionary advantage. Evolution dictates which traits are propagated to the next generation and which traits are dampened. Life does not exist in an evolutionary vacuum. The feasibility of a particular genotype depends on the state of the environment in which its phenotypes will find themselves. The genotypes (and corresponding phenotypes) that we find around us were able, at least to some extent, to weather or thrive in the conditions they encountered. Whatever adaptations currently exist are a reflection of this.

It is hard to decouple intelligent behavior from the substrate that runs it and from the environment that the agent must successfully navigate. From a human standpoint, the systems of perception and actuation are adapted to guarantee evolutionary robustness. This is true for systems that we use directly (eyes, hands, …) and for others that we think of as more complex (memory, the ability to acquire new concepts, the ability to reason with models of the environment, symbolic manipulation). Whatever adaptations exist must capture more energy for the agent than they require to operate. We can find commonalities between our evolutionary adaptations and those of other species, and deduce that these adaptations share similar purposes—to robustly guarantee the survival of the organism across time and expected conditions. This is why I think it is nonsensical to talk about the great sample efficiency shown by humans when learning new tasks. People learn tasks that rely heavily on their current capabilities, which are already greatly supervised by evolution, so comparing human learning to machine learning is comparing apples to oranges. Human learning, rather than being unsupervised, relies on really long chains of supervision.

Pleasure and pain

Pleasure and pain are cheap proxies for survivability in primitive times. The existence of an evolutionary process is supervision in itself. Organisms that successfully adapt persist, with their traits propagated into future generations, while those that fail to adapt perish and disappear. The existence of a trait must mean that it contributed positively (or at least non-negatively) to evolutionary fitness. This means that human existence is not unconstrained; far from it. We are allowed the experiences and free will that our primitive evolutionary hardware permits. What comes naturally to us as a species is a result of evolution: food tastes (and smells) good, spoilt food tastes (and smells) bad, behaviors that led to bad outcomes in a primitive world cause pain, and those that led to positive outcomes cause pleasure. Being pleasure-seeking and pain-avoiding are proxies good and important enough to be hardwired. The caveat is that this only works while pleasure is scarce and evolutionarily safe; if either condition is violated, you get into the domain of addiction.

This is an interesting adversarial example. If the environment changes enough, behavior that was previously beneficial might get us into deep trouble. This seems far-fetched, but there are many examples in modern society (e.g., addictions in general: food, alcohol, internet, gaming, drugs, pornography, …). While behaviors that cause pain are easy to avoid, behaviors that cause pleasure are hard to quit, especially if they are cheap. If it feels good, why shouldn't you do it? You wouldn't do it if you anticipated a greater amount of pain caused by it, but we are not good at estimating pain that is far off in the future or that accumulates gradually (e.g., overeating enough to end up one hundred pounds overweight). We need to fight ourselves to avoid such behaviors in modern times, whereas in primitive times there would never have been a need for the same amount of self-control, simply because the same amount of pleasure was never available in the environment; e.g., food was never abundant enough that eating yourself to death would be a concern. In those conditions, being incredibly pleasure-seeking might indeed have been the way to maximize the probability of survival and of propagating your genes. In modern times, some of these behaviors are just destructive pleasure traps.

Much of modern society is structured around monetizing pleasure. Moderation and meaning don't get discussed because they don't align with a company's bottom line, and frankly, in most cases, why should companies have to decide what is desirable? As long as their products get bought, that is good enough for them. If your goal is to make the most money possible, you should make your product as addictive as possible. What I fear is that creating products this way creates huge externalities for society. Primitive systems of decision making play a huge role in the decisions we make, whether we recognize it or not. For example, is drinking a beverage with 40g of sugar per 400mL of liquid really a choice that we want to have? Are consumers being protected by being given such a choice? Maybe in the future we will have lawsuits with greatly expanded scope, like the recent ones against the pharmaceutical industry and, many years ago, the tobacco industry. Anything that causes self-reported addictive behavior without a corresponding meaningful positive long-term effect should be subject to strong regulation, regardless of whether it is formally considered a drug. What would it take for us to take a clear look at some of the systems and products throughout society and decide that they exploit behavioral traits that should not be exploited for commercial purposes? For example, is infinite scrolling an ethical feature? It drives engagement and addictiveness, but is it something that people feel positively about? How can we quantify that? We have conquered the environment, but can we conquer ourselves? Are we, like the beetle, building our own proverbial beer bottle?

Impact of this discussion on our approach to machine learning

I think that, ultimately, the goal of machine learning is to build systems that match or surpass humans in all aspects of cognition. Our discussion informs how to diagnose shortcomings, i.e., why a trained model has specific problems and what additional supervision may be necessary to achieve the desired result. If a model performs unsatisfactorily, it means that it did not have the supervision necessary to learn to solve general instances as it should.

Affordances that we recognize from the real, embodied world have to be relearned statistically from supervision. For example, if you only ever read books and had no physical experience of the world, your representations of concepts would be very different from those of someone with embodied experience. I am firmly optimistic about the performance of the systems we can develop. I believe that either a system has enough supervision to learn the task, or it lacks that supervision and therefore fails unexpectedly (simply by exhibiting behavior that differs from the one we expect). Poor performance is not a statement about the limitations of the model, but rather about the limitations of the information that was provided (either for training or during prediction), and it can therefore be remedied with additional care in generating the appropriate supervision; i.e., whatever the model is doing might be a legitimate way of solving the task using only the supervision provided.

Broadening the concept of supervision gives us an appreciation of what other living agents may be doing. It is meaningless to talk about sensory data without the embodied experience of the agent acquiring that data. We need to think deeply about data generation and try to understand aspects that are mismatched with our day-to-day expectations for data collection. For example, all data is generated with some purpose. The photos and videos that you generate for Facebook and Instagram probably do a very poor job of covering the typical images you acquire throughout the day, which serve a more functional purpose (e.g., forking food from the plate to your mouth).

In some cases, the correct thing to do might be to collect new data or augment the existing data with additional supervision. This is not always pursued because current practice tends to focus overly on learning algorithms and architectural innovations rather than on the data itself (i.e., data as a design component of the solution). Another way of improving a model is to look for unmodelled aspects that may be necessary for better performance. If a model is not working as it should, good questions to ask are: what inductive bias is missing from the current implementation? How can it be added? Is this going to solve all the problems? What additional problems will likely show up? Piecing together supervision may also work well, where different models are trained on different proxy tasks and then brought together on a task of interest with additional fine-tuning.
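A sketch of that last idea, assuming PyTorch; `text_encoder` and `image_encoder` stand in for any separately pretrained models (the stand-ins below are hypothetical placeholders, not a specific library's API).

```python
# A sketch, assuming PyTorch: two encoders pretrained on different proxy tasks,
# combined by a small head that is fine-tuned on the task of interest.
import torch
import torch.nn as nn

class FusionModel(nn.Module):
    def __init__(self, text_encoder, image_encoder, text_dim, image_dim, num_classes):
        super().__init__()
        self.text_encoder = text_encoder    # pretrained on a text proxy task
        self.image_encoder = image_encoder  # pretrained on a vision proxy task
        self.head = nn.Linear(text_dim + image_dim, num_classes)  # learned on the task of interest

    def forward(self, text_inputs, image_inputs):
        t = self.text_encoder(text_inputs)
        v = self.image_encoder(image_inputs)
        return self.head(torch.cat([t, v], dim=-1))

# Stand-in encoders for illustration; in practice these would be pretrained models.
text_encoder = nn.Linear(32, 16)
image_encoder = nn.Linear(64, 16)
model = FusionModel(text_encoder, image_encoder, text_dim=16, image_dim=16, num_classes=3)
out = model(torch.randn(4, 32), torch.randn(4, 64))
print(out.shape)  # (4, 3)
```

Depending on how much downstream supervision is available, fine-tuning can update only the head or the encoders as well.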

Conclusions

This discussion should make us cautious about drawing too many connections between machine learning and human learning. The inductive biases that go into machine learning models are still very far from those present in human learning, so comparing cognitive abilities is bound to be flawed. The most obvious mismatches are the lack of development through evolution, the lack of embodiment, and the lack of the energy constraints that human agents and other agents in the real world are subject to. While we do not need to achieve human-level intelligence using the same principles, reasoning through the principles that might have led to human cognition gives us insights into the nature of the data and what machine learning models trained on it ought to capture. In my view, intelligence is fundamentally about behavior, because behavior is the only thing observable to us. The best guide for determining whether we have achieved it or not is to be careful about what a specific machine learning approach captures and what human learning captures, and to progressively improve results. I think that an approach either has enough supervision to induce a certain level of conceptual understanding or it does not. If it does not, we may be able to identify and include additional supervision to improve the model and bring it closer to the desired behavior.