Written by Scott Wilson
A constant inspiration throughout the history of artificial intelligence research is the wonder of human intelligence. Behind every step, every investigation, and almost every breakthrough is the goal of making machines a little bit more like human beings when it comes to just figuring stuff out.
Human problem-solving is both a mystery and a wonder. Psychologists have been studying how people reason for as long as psychology has existed as a field… and philosophers have explored the same question for even longer.
Humans are learning machines from birth. Wired into our genetic makeup are ways to probe problems, experiment with them, and learn from the results. It’s the basis of all knowledge acquisition and most reasoning skill.
The tenacity and creativity with which people approach learning experiences is the gold standard in AI. And the closest technique the field has developed to how we do it is reinforcement learning.
What Is Reinforcement Learning?
When looking at the various kinds of machine learning models, reinforcement learning instantly jumps out as an intuitively human kind of reasoning.
Supervised learning is a kind of easy mode for machines that can’t be applied to general problems. It takes previously labeled examples and drills on them like a student using flashcards. That’s great if there is someone to make flashcards for you; but most of the challenges we face don’t come with such ready-made answers. In fact, many novel problems may have no examples to work from at all.
Unsupervised learning does away with the need to label data, but it doesn’t go very far in actual problem-solving or reasoning. While it can identify attributes that are common to examples or tease out trends within data, it has no way of assigning meaning to those attributes or understanding why they might be important. It’s still down to outside agents to evaluate those findings.
Reinforcement Learning: An Introduction
Reinforcement learning functions at a very human level: an RL algorithm attempts to act on information it is given and assesses the result based on a programmed reward function.
An extremely simple reinforcement learning example can be visualized through a Roomba.
A robot vacuum cleaner has a very basic task from a human perspective. But for a machine, spatial concepts like rooms, roving pets, and potted plants are anything but simple. Navigating a new house with shifting obstacles is a process that requires learning.
Early robots would encounter an obstacle and keep bouncing off it. But a reward function could be introduced to make hitting something a negative and not hitting anything a positive. A virtual treat would be given for completing a room without hitting something. Perhaps another treat, slightly less tasty, would be given for finishing the room as fast as possible. More bumping could be faster, but still not score as high.
Equipped with a neural network to provide a sort of memory, the robot could now try new routes and know which to avoid by calculating the total reward. The best path would be the fastest route with the fewest hits. After some number of tries, the machine will have learned the optimal way around the house.
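To make that concrete, here is a minimal sketch of tabular Q-learning, one of the simplest reinforcement learning algorithms, applied to a toy grid-world stand-in for a floor plan. The grid layout, reward values, and hyperparameters are all invented for illustration; a real robot vacuum would use something far more sophisticated.

```python
import random

# Toy 4x4 "floor plan": a few obstacle cells and one goal cell, all made up.
GRID_SIZE = 4
OBSTACLES = {(1, 1), (2, 3)}                  # furniture the robot should learn to avoid
GOAL = (3, 3)                                 # "room finished" cell
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # right, left, down, up
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2         # learning rate, discount, exploration rate
Q = {}                                        # Q[(state, action)] -> estimated total reward

def step(state, action):
    """Apply an action and return (next_state, reward, done)."""
    nxt = (state[0] + action[0], state[1] + action[1])
    if not (0 <= nxt[0] < GRID_SIZE and 0 <= nxt[1] < GRID_SIZE) or nxt in OBSTACLES:
        return state, -5.0, False             # bump: negative reward, stay in place
    if nxt == GOAL:
        return nxt, 10.0, True                # room finished: big positive reward
    return nxt, -1.0, False                   # small cost per move rewards faster routes

def choose(state):
    """Epsilon-greedy: mostly exploit the best-known action, sometimes explore."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q.get((state, a), 0.0))

for episode in range(500):
    state = (0, 0)
    for _ in range(200):                      # cap steps so an episode can't run forever
        action = choose(state)
        nxt, reward, done = step(state, action)
        best_next = max(Q.get((nxt, a), 0.0) for a in ACTIONS)
        # Q-learning update: nudge the estimate toward reward + discounted future value
        old = Q.get((state, action), 0.0)
        Q[(state, action)] = old + ALPHA * (reward + GAMMA * best_next - old)
        state = nxt
        if done:
            break
```

After enough episodes, the highest-valued actions in Q trace out a short path that steers around the obstacle cells, the electronic version of the robot settling on its best route.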
If this doesn’t remind you of a kid trying to figure out how to get a cookie jar down from the top of a refrigerator, you may be a machine yourself.
Reinforcement learning is essentially classic reward-based behavioral conditioning in electronic form.
The best part of the reinforcement learning approach is that it doesn’t need to end. If the furniture is rearranged, the robot can absorb that information and adjust… just like people do.
A Short Look at Reinforcement Learning Algorithms
Of course, a simple example is far simpler than the reality of AI reinforcement learning as it exists today.
While the goal of all reinforcement learning methods is to build an algorithm that maximizes rewards for desired behavior, there are many different methods and techniques used toward that end.
The four basic approaches to reinforcement learning fall generally into these areas:
- Value - The algorithm uses the value function to judge the payoff of learning actions
- Policy - The algorithm is given a basic policy to start with during the learning process; it may work to optimize this policy over time
- Imitation - While not explicitly given a policy, the algorithm is given examples of desired behavior to imitate
- Model - These algorithms are given an explicit virtual model of the environment to learn within
There are various combinations and crossover approaches to these. In general, they can also be subdivided into model-based reinforcement learning and model-free reinforcement learning… in other words, whether they are given (or learn) a model of the system, or are expected to learn through pure trial-and-error experimentation.
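To illustrate that distinction, here is a small, hypothetical sketch: a model-free agent updates value estimates only from transitions it actually experiences, while a model-based agent consults a transition model to plan ahead. The `model(state, action)` function and both signatures are assumptions made up for this example.

```python
# Model-free: learn values purely from experienced transitions (no model required).
def td_update(V, state, reward, next_state, alpha=0.1, gamma=0.9):
    """One temporal-difference update from a single observed transition."""
    V[state] += alpha * (reward + gamma * V[next_state] - V[state])

# Model-based: use a known (or learned) model of transitions and rewards to plan.
def one_step_lookahead(V, state, actions, model, gamma=0.9):
    """Pick the action with the best expected value under the model.
    `model(state, action)` is assumed to return (next_state, reward)."""
    def backup(action):
        next_state, reward = model(state, action)
        return reward + gamma * V[next_state]
    return max(actions, key=backup)
```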
In all cases, the process is iterative. RL algorithms experiment with different approaches as they go. Each has a balance between exploring new potential solutions that may be better than the current best reward, and optimizing solutions that have already scored high rewards. This is the explore vs exploit trade-off.
Essentially, it’s a kind of gamble: do you stand pat on a 16 in blackjack or hit on it? There are optimal statistical answers to such questions when it comes to RL. Finding those answers is what ML engineers do.
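One common way to manage that trade-off is an epsilon-greedy rule: most of the time the agent exploits its best-known option, but a small fraction of the time it explores a random one. Here is a minimal sketch on a toy multi-armed bandit; the payout probabilities are made up for illustration.

```python
import random

# Hypothetical slot machines with unknown payout probabilities (invented here)
TRUE_PAYOUTS = [0.2, 0.5, 0.7]
EPSILON = 0.1                      # 10% of pulls are exploratory
counts = [0, 0, 0]                 # times each arm was pulled
estimates = [0.0, 0.0, 0.0]        # running estimate of each arm's value

for pull in range(10_000):
    if random.random() < EPSILON:
        arm = random.randrange(3)              # explore: try a random arm
    else:
        arm = estimates.index(max(estimates))  # exploit: pull the best-looking arm
    reward = 1.0 if random.random() < TRUE_PAYOUTS[arm] else 0.0
    counts[arm] += 1
    # Incremental mean update of the value estimate for the pulled arm
    estimates[arm] += (reward - estimates[arm]) / counts[arm]

print(estimates)  # should converge near [0.2, 0.5, 0.7]
```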
Multi-Agent Reinforcement Learning Simulates AI Interaction
If it’s not complicated enough to build up a solid RL model for a single agent, imagine how you would reinforce learning for multiple agents at the same time. That’s multi-agent reinforcement learning, an investigation of how shared environments impact behavior and learning.
MARL isn’t primarily a way to make any single model train faster; its value is the insight it offers into how AI will function in real-world applications. That’s because the advent of AI in almost every industry is going to lead to a large number of autonomous AI agents acting in various capacities. How will an AI learning to process invoices interact with another AI running accounts payable? What will both of them do when they tie into a reporting AI trying to develop projections for management purposes?
MARL is how AI researchers will learn to craft complex and complementary behavior through machine learning, rather than setting off a firestorm of conflicts and weirdness.
Inverse Reinforcement Learning Lets Algorithms Find Their Own Rewards
Taking RL a step further, researchers are exploring how to train algorithms without having to specify a reward function first. Instead, the algorithm infers the reward for itself from observed behavior.
This is accomplished by having the RL agent observe another agent, or a person, performing a task and attempting to infer from those observations what reward is being pursued. By modeling the demonstrator’s behavior, the algorithm should be able to identify the goals behind it.
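As a deliberately crude sketch of that intuition (not a full IRL algorithm such as maximum-entropy IRL, which iterates between reward and policy updates), the toy example below assumes the unknown reward is linear in a handful of made-up state features and points the reward weights toward whatever the demonstrator’s visits emphasize relative to a baseline policy.

```python
import numpy as np

# Hypothetical setup: each state has a small feature vector phi(s), and we assume
# the unknown reward is linear in those features: R(s) = w . phi(s).
rng = np.random.default_rng(0)
phi = rng.random((10, 3))                       # 10 states, 3 made-up features each

expert_states = [1, 3, 3, 7, 9]                 # states visited in expert demonstrations
baseline_states = rng.integers(0, 10, size=20)  # states a naive policy wanders through

mu_expert = phi[expert_states].mean(axis=0)     # expert's average feature vector
mu_baseline = phi[baseline_states].mean(axis=0)

# Point the reward weights toward whatever the expert's behavior emphasizes
# relative to the baseline, then score every state under that inferred reward.
w = mu_expert - mu_baseline
inferred_reward = phi @ w
print(inferred_reward.round(2))  # higher values = states the expert seems to prefer
```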
While IRL is still in its early days, it offers a potentially powerful way to create robust reward functions… which, as you will see in the next section, is critical to the success of RL.
Reinforcement Learning From Human Feedback: Bringing the Personal to Artificial Intelligence
Although reinforcement learning is a kind of magic in that it allows machines to learn on their own, many concepts are simply inefficient for a machine to figure out unaided. In other cases, engineers may want to train AI on concepts that are still inherently human: humor, for example, is almost impossible to define in a reward function.
Reinforcement learning from human feedback (RLHF) is a way to use RL while incorporating direct human judgement into the reward function.
RLHF can be incorporated at various stages of training.
Pre-training may incorporate specifically created data to prime the process with optimal results.
Direct feedback happens when the model presents results directly to a human trainer, who then corrects or accepts the result as accurate.
Fine-tuning occurs after the bulk of model training has been completed. This is essentially a form of supervised training to reinforce learning specifically on finer points that additional RL might not easily achieve independently.
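At the heart of most RLHF pipelines is a reward model trained on those human judgements, typically expressed as pairwise comparisons: the human marks which of two outputs is better, and the reward model learns to score the preferred one higher. Below is a minimal sketch of that pairwise loss in PyTorch; the embedding size, network shape, and random stand-in data are all assumptions for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical reward model: maps an embedding of a model response to a scalar score.
reward_model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Stand-in data: pairs of response embeddings where a human preferred the first one.
preferred = torch.randn(64, 16)
rejected = torch.randn(64, 16)

for _ in range(200):
    score_pref = reward_model(preferred)
    score_rej = reward_model(rejected)
    # Pairwise (Bradley-Terry style) loss: preferred responses should score higher.
    loss = -torch.nn.functional.logsigmoid(score_pref - score_rej).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Once trained, a reward model like this stands in for the human, scoring new outputs so the RL step can run at machine speed.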
PPO Reinforcement Learning Helps Keep Algorithms on Track
Proximal Policy Optimization (PPO) puts some guardrails around the RL process to keep algorithms from going too far off the rails. Essentially, it constrains how much the policy can change in any single update, a safeguard that wouldn’t otherwise be present in the RL training process.
PPO is designed to let an RL algorithm take the largest improvement steps it safely can without overcorrecting.
As an RL algorithm experiments with the explore/exploit dichotomy, it has a choice about how dramatically it will alter its next move. Large steps risk overshooting optimal choices, taking longer or leading the algorithm astray. Small steps risk taking too long to find the right balance.
PPO imposes boundaries that keep each update within a trusted range, heading off the algorithm’s natural inclination toward steps that are either too big or too small.
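At the center of PPO is a clipped objective: each update is scaled by the ratio between the new and old policy’s probability for an action, and that ratio is clipped so a single step can’t move the policy too far. Here is a minimal sketch of that loss in PyTorch; the placeholder tensors stand in for a real rollout batch.

```python
import torch

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective from PPO (negated so it can be minimized)."""
    ratio = torch.exp(new_log_probs - old_log_probs)       # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Taking the minimum keeps the update conservative in both directions
    return -torch.min(unclipped, clipped).mean()

# Placeholder tensors standing in for a rollout of 32 actions
new_lp = torch.randn(32, requires_grad=True)
old_lp = new_lp.detach() + 0.05 * torch.randn(32)
adv = torch.randn(32)
loss = ppo_clipped_loss(new_lp, old_lp, adv)
loss.backward()
```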
It’s proven so effective that it’s now the default RL algorithm used by OpenAI. Most organizations using RLHF now implement it with PPO as well.
Reinforcement Learning With Human Feedback Has Its Own Drawbacks
RLHF may seem like an ideal combination of training AI, but it does come with some drawbacks.
First, it’s expensive: humans are tremendously slow compared to machines, and their time, especially in the quantities required, can be quite costly.
Next, human judgement is inconsistent. For example, human evaluators training an AI model on humor will probably deliver a lot of contradictory information just based on different ideas of what’s funny.
Finally, humans bring human biases to the table when they offer feedback. In a sense this is a benefit, since it is an accurate reflection of the world they represent and the expectations they set. But it’s also a drawback, since some of the worst qualities of human judgement come from built-in biases.
When RLHF draws on human evaluators without sufficient diversity, it can lead to the model overfitting to those evaluators’ preferences. That loses the flexibility and power that AI should bring to the table.
Reinforcement Learning Comes With Trade-Offs and Limitations
Like any tool in the modern AI toolbox, there are trade-offs involved with the reinforcement learning technique.
Some machine learning problems don’t have consistent or clear objectives. It’s tough to define a reward function when you can’t tell the machine going in exactly what it should achieve.
In other cases, it’s absolutely necessary for algorithms to be explainable… they need to be clearly programmed and trained from the ground up in ways that humans can understand, rather than training themselves by unknown means.
And sometimes the constellation of potential choices exceeds even the capacity of high-performance computer systems. An RL model may just not be able to explore the potential solution set in any reasonable timeframe.
Deep Reinforcement Learning Requires Careful Thought Around Rewards
Something that researchers found very early on with reinforcement learning is that reward functions have to be very carefully tuned to reflect the desired outcomes. And it can be challenging to do that when computers get creative.
In one famous instance, a researcher experimenting with RL in video games used a boat racing game to test an agent in developing pathfinding and competitive skills. But it turned out that the points model used by the game offered rewards for hitting target buoys along the route. And it also turned out that doing this could garner a higher score more quickly than actually completing the race.
The RL agent simply found an area with repopulating buoys it could smack into repeatedly, occasionally catching fire and never getting anywhere near the finish. It racked up a high score, but learned nothing the researchers had intended.
The Paperclip Problem, a thought experiment in which an AI optimized to make paperclips runs amok turning every tool and source of metal into paperclip fodder, is a more extreme illustration of what a poorly designed reward function can produce.
More complex learning environments offer even more pitfalls for reward tuning. In scenarios where multiple, even competing, goals exist, the ideal reward function may simply be unattainable.
Reinforcement Learning Can’t Cover Every Possible AI Use Case
RL also typically requires a large amount of experience to learn from, and it can’t function very well on problems where that kind of data simply doesn’t exist. Any novel or low-frequency task AI is expected to perform probably can’t be automated with RL techniques.
Reinforcement learning also falls down in situations where the consequences of a particular action or attempted solution may not be detectable until much later in the process. For example, modeling environmental issues that are highly dependent on small differences in initial conditions, but which do not play out for hundreds of years, doesn’t provide feedback tied to actions in a useful timeframe.
In other cases, RL is inadvisable due to the stakes of the experiments. In the realm of robotics, for example, using reinforcement learning to teach a robot to pick up eggs will result in an awful lot of broken eggs before it lands on appropriate techniques.
Treating an autonomous vehicle like a Roomba when it comes to reinforcement learning would just create a lot of property damage.
For all that RL isn’t a silver bullet for every machine learning challenge, however, it remains one of the most powerful techniques that AI engineers are using today.
Finding Fixes in Reinforcement Learning Limitations Is Where the Future of AI Is Heading
Yet the entire field of AI revolves around finding solutions to limitations that once seemed insurmountable. Reinforcement learning itself was once thought to be a dead end; only new processing power and the availability of large amounts of data rescued it from obscurity.
Similarly, AI researchers and scientists are finding ways around the limitations. For example, despite the apparent difficulty of using RL in autonomous vehicles, researchers are starting to train driving models in virtual environments where they can crash and burn as often as they like to figure out lane changes before turning them loose in real vehicles.
Other experimental approaches include:
- Developing multiple reward systems to encourage models to pick up more complex behaviors
- Optimizing learning trajectories to increase efficiency in training
- Creating ensemble models that combine multiple strategies to improve sample efficiency
The future of reinforcement learning is bright. But it takes highly skilled researchers and engineers to make it happen.
AI Researchers Reinforce Learning for Themselves Through Advanced Degrees in AI
Being a part of that future requires the right education. For reinforcement learning, that typically means an advanced degree in computer science, machine learning, or artificial intelligence.
Basic concepts and the essential core math behind reinforcement learning will be covered even in undergraduate programs, like a Bachelor of Science in Artificial Intelligence.
But to really get to where the cutting edge developments are happening in RL, a Master of Science in Machine Learning or similar graduate degree is a must.
These programs often include coursework specific to reinforcement learning, such as:
- Reinforcement learning Python programming
- Deep reinforcement learning
- Fundamentals of reinforcement learning
These cover advanced topics in RL like policy gradient design and the mechanics of actually programming RL algorithms.
They also offer more general AI education that helps develop an understanding of when and how to apply RL to address common challenges in artificial intelligence.
Degree programs at this level don’t just put students in classrooms so they can digest information delivered from on high. They also include research requirements and hands-on projects that allow you to develop real-world theories and expertise in reinforcement learning.
Certificates and Certifications Help Polish up Essential Reinforcement Learning Skillsets
If you already hold a degree in artificial intelligence or computer science, and have that general background, you can punch up your RL education with certificate courses or other focused studies. A Graduate Certificate in Machine Learning will probably only take a few months to earn, but offers the same advanced coursework in RL that you would find in a full degree in the field.
There are also free MOOC (Massive Open Online Course) options like the Hugging Face Deep Reinforcement Learning course.
Reinforcement learning is one of the most important skillsets in machine learning today. But landing the top jobs in AI can take more than just having the right knowledge; a professional certification demonstrating that you have mastered it can matter just as much.
Various machine learning certifications will offer future employers that assurance, or show that you have mastered tools that are important in most RL implementations. For example, the Amazon Web Services Machine Learning Specialty Certification verifies your ability to use AWS for many of the building and modeling steps necessary to train an ML algorithm with RL.
Whatever platform they are used on, you can be sure that reinforcement learning and its various branches and combinations will continue to be a critical piece of making machines a little more life-like in the years to come. And that’s all part of what will make artificial intelligence a key part of the future of human life.