Machine Learning in Robotics
Decibels Lab
Engaged in R&D of Electric Power-train subsystems and providing technology training services for automotive needs.
by Bharath Kumar P. Last updated on 29/Oct/2021
Posted on 29/Oct/2021
The field of Machine Learning (ML) can be understood as a science of learning systems. Machine Learning aims to develop algorithms that learn from data and improve with experience. It has become a central subdiscipline of Artificial Intelligence and utilizes methods from statistical learning theory for efficient data analysis. The great success of ML for data analysis has led to its adoption in many commercial and scientific domains: it has become a central tool for the dominant IT companies to exploit their data, as well as for applications in bioinformatics and the neurosciences.
The Machine Learning & Robotics Lab aims to push Machine Learning methods towards intelligent real-world systems, in particular robots autonomously learning to interact with and manipulate their environment. Unlike standard data analysis methods, the system needs to actively collect data and derive models of the environment that enable goal-directed decision making and planning.
Imitation Learning for a Robotic Arm:
Humans are able to learn how to perform a task by simply observing their peers performing it once; this is highly desirable behavior for robots, as it would allow the next generation of robotic systems, even in households, to be easily taught tasks without additional technology or long interaction times. Endowing a robot with the ability to learn from a single human demonstration, rather than through teleoperation, would allow for more seamless human-robot interaction.
Domain-Adaptive Meta-Learning (DAML) is a recent approach that uses an end-to-end method for one-shot imitation of humans, leveraging a large amount of prior meta-training data collected for many different tasks. This approach required thousands of examples across many tasks during meta-training: these examples are videos of a person physically performing the tasks and teleoperated robot demonstrations, meaning that a prolonged, active human presence is required while collecting the dataset.
Image Credits: arXiv
Imitation learning aims to learn tasks by observing a demonstrator, and can broadly be classified into two key areas: behavior cloning, where an agent learns a mapping from observations to actions given demonstrations, and inverse reinforcement learning, where an agent attempts to estimate a reward function that describes the given demonstrations.
The majority of work in behavior cloning operates on a set of configuration-space trajectories that can be collected via teleoperation, kinesthetic teaching, sensors on a human demonstrator, through motion planners, or even by observing humans directly. Expanding further on the latter, learning by observing humans has previously been achieved through hand-designed mappings between human actions and robot actions, visual activity recognition, and explicit hand tracking, and more recently by a system that infers actions from a single video of a human via an end-to-end trained system.
Image Credits: arXiv
Data Collection:
Many approaches to human imitation rely on training in the real world. This has many disadvantages, the most evident being the amount of time and effort needed to collect data for the training dataset. In the case of DAML, thousands of demonstrations had to be recorded, which relies on an active human presence to obtain both human and robot demonstrations, as the robot still has to be controlled in some way. For instance, in the DAML placing experiment, a total of 2586 demonstrations were collected to form the training dataset, meaning tens of research hours dedicated to collecting data, with no guarantee that the dataset allows the network to generalize well enough. Training in simulation provides much more flexibility and availability of data: data generation can be easily parallelized and does not require constant human intervention. Additionally, there have been many successful examples of systems trained in simulation and then run in the real world.
Training:
The task-embedding network and the control network each use a convolutional neural network (CNN) consisting of 4 convolution layers, each with 16 filters of size 5×5, followed by 3 fully-connected layers of 200 neurons each. Every layer is followed by layer normalization and an ELU activation function, except for the final layer, whose output is linear for both the task-embedding and the control network.
The input consists of 125 × 125 RGB images and the robot's proprioceptive data, including the joint angles. The proprioceptive data are concatenated with the features extracted by the CNN layers of the control network before being passed through the fully-connected layers. The output of the task-embedding network is a vector of length 20 (the embedding size). The output of the control network corresponds to the velocities applied to the 6 joints of a Kinova Mico 6-DoF arm.
Image Credits: arXiv
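As a rough sketch of this architecture (not the authors' code), the control network might look as follows in PyTorch. The convolution stride, the absence of padding, and the use of GroupNorm as a stand-in for layer normalization over convolutional feature maps are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class ControlNetwork(nn.Module):
    """Minimal sketch of the control network described above.
    Assumptions: stride-2 convolutions without padding, GroupNorm(1, .)
    as layer normalization over feature maps. The task-embedding network
    is analogous but ends in a 20-dimensional linear output."""

    def __init__(self, proprio_dim=6, num_joints=6):
        super().__init__()
        convs, in_ch = [], 3
        for _ in range(4):  # 4 conv layers, 16 filters of size 5x5 each
            convs += [nn.Conv2d(in_ch, 16, kernel_size=5, stride=2),
                      nn.GroupNorm(1, 16),
                      nn.ELU()]
            in_ch = 16
        self.conv = nn.Sequential(*convs)

        # Infer the flattened feature size for a 125x125 RGB input.
        with torch.no_grad():
            feat_dim = self.conv(torch.zeros(1, 3, 125, 125)).flatten(1).shape[1]

        # 3 fully-connected layers of 200 neurons; final output is linear.
        self.fc = nn.Sequential(
            nn.Linear(feat_dim + proprio_dim, 200), nn.LayerNorm(200), nn.ELU(),
            nn.Linear(200, 200), nn.LayerNorm(200), nn.ELU(),
            nn.Linear(200, num_joints),  # joint velocities for the 6-DoF arm
        )

    def forward(self, image, proprio):
        feats = self.conv(image).flatten(1)
        # Concatenate proprioceptive data with the CNN features.
        return self.fc(torch.cat([feats, proprio], dim=1))
```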
Multi-Robot Path Planning Method Using Reinforcement Learning:
Reinforcement learning is the training of machine learning models to make a sequence of decisions, so that a robot learns to achieve a goal in an uncertain, potentially complex environment by selecting actions based on the observed environment, without an accurate system model. When labeled training data is not provided, the system instead learns from the rewards it receives for the actions it takes. Reinforcement learning, which includes actor-critic methods and Q-learning, has many applications, such as scheduling, chess, image-based robot control, and path planning. Most existing studies evaluate reinforcement learning exclusively in simulation or games. In multi-robot control, reinforcement learning and genetic algorithms have drawbacks that must be compensated for. In contrast to controlling multiple motors in a single robot arm, reinforcement learning for multiple robots solving one task or several tasks is a relatively unexplored area.
When selecting its next action, the robot draws on the data it has already learned, and after several rounds of learning it moves to the nearest target. By interacting with the environment, the robots exhibit new and complex behaviors rather than only pre-existing ones. Existing analytical methods struggle to adapt to complex, dynamic systems and environments. Here, reinforcement learning is performed on images using deep Q-learning and a CNN, and the same data as for the actual multi-robot system is used to compare it with existing algorithms.
Reinforcement Learning:
Reinforcement learning is the training of machine learning models to make a sequence of decisions through trial and error in a dynamic environment. The robots learn to achieve a goal in an uncertain, potentially complex environment by being rewarded or penalized for their actions.
Image Credits: MDPI
When a robot moves in a discrete, restricted environment, it chooses one of a finite set of behaviors at every time interval and is assumed to be in a Markov state; the state changes according to a transition probability.
Image Credits: MDPI
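In standard MDP notation (a generic reconstruction, not the paper's figure), the Markov assumption states that the next state depends only on the current state and action:

```latex
% Markov property: the transition probability depends only on the current state and action
P(s_{t+1} = s' \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0)
  = P(s_{t+1} = s' \mid s_t, a_t)
```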
At every time interval t, the robot observes the state s from the environment and then takes an action. It receives a stochastic reward r, which depends on the state and the action, and the agent seeks the optimal policy that maximizes the expected return R_t.
Image Credits: MDPI
The discount factor means that rewards received t time intervals in the future count for less than rewards received now. The action-value function Va is calculated using the policy function π and the policy value function Vp. The state-value function, i.e., the expected return when starting from state s and following the policy, is expressed by the following equation.
Image Credits: MDPI
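For reference, the standard form of the discounted return and the state-value function (standard RL notation, given here for clarity rather than reproduced from the original figure) is:

```latex
% Discounted return from time t with discount factor 0 <= \gamma < 1
R_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k}

% State-value function: expected return when starting in state s and following policy \pi
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\, \sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \;\middle|\; s_t = s \right]
```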
It can be shown that at least one optimal policy exists. The goal of Q-learning is to obtain an optimal policy without requiring initial conditions. For the policy, the Q value is defined as follows.
Image Credits: MDPI
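In its usual form (again, standard notation rather than the paper's exact equation), the action-value function is:

```latex
% Action-value (Q) function: expected return after taking action a in state s, then following \pi
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\, r_t + \gamma\, V^{\pi}(s_{t+1}) \;\middle|\; s_t = s,\; a_t = a \right]
```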
The new Q(s_t, a_t) is calculated from the current Q(s_t, a_t) value together with the received reward and the Q value of the next state s_{t+1}.
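The corresponding one-step Q-learning update, with learning rate α (a generic form, shown here for clarity rather than copied from the paper), is:

```latex
% One-step Q-learning update with learning rate \alpha and discount factor \gamma
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \left[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]
```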
Reinforcement Learning-Based Path Planning:
Storing the experience gathered at each time step over multiple episodes in a dataset is called memory regeneration (experience replay). At each update, learning samples are drawn from this reconstructed memory with a certain probability. Data efficiency is improved by reusing experience data and by reducing the correlations between samples.
Image Credits: MDPI
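A minimal sketch of such a replay memory is shown below; the buffer capacity, batch size, and transition fields are illustrative assumptions rather than values from the paper.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size buffer that stores (state, action, reward, next_state, done)
    transitions and returns uniformly random mini-batches for training."""

    def __init__(self, capacity=50_000):
        self.buffer = deque(maxlen=capacity)  # oldest experience is discarded first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Random sampling breaks the temporal correlation between consecutive samples.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```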
Rather than treating each pixel independently, a CNN is used to interpret the information in the images. The convolution layers pass the image's feature information to the neural network by considering local regions of the image and preserving the spatial relationships between objects on the screen; the CNN extracts only the feature information from the raw image data. The reconstructed memory stores the agent's experience and samples from it randomly when training the neural network. This prevents the network from learning only from its most recent behavior in the environment, so experience is retained and continually updated. In addition, a goal (target) value is used to calculate the loss over all actions during learning.
Image Credits: MDPI
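To illustrate how the goal (target) value enters the loss, here is a generic deep Q-learning loss computation in PyTorch. The use of a separate target network, the MSE loss, and the tensor layout of the batch are assumptions of this sketch, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """One deep Q-learning loss computation: the target network provides the
    bootstrapped goal value, the online network provides the prediction.
    Assumes batch elements are already stacked into tensors."""
    states, actions, rewards, next_states, dones = batch

    # Q(s, a) predicted by the online network for the actions actually taken.
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Goal value: r + gamma * max_a' Q_target(s', a'), with no bootstrap at episode end.
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
        q_target = rewards + gamma * q_next * (1.0 - dones)

    return F.mse_loss(q_pred, q_target)
```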
Environment:
The proposed algorithm learns based on the search time of the A* algorithm and assumes learning success when it reaches the target position more quickly than the A* algorithm.
Image Credits: MDPI
In the graph, the red line depicts the score of the proposed algorithm and the blue line the score when learning separate Q parameters per mobile robot. The experimental results confirm that the proposed algorithm's score rises slightly more slowly at the beginning of learning but still reaches the final goal score.
Image Credits: MDPI
In terms of the per-episode target arrival speed, the proposed algorithm progresses more slowly in learning than the per-model Q parameters for a similar target position.
Image Credits: MDPI
The average score of each generation increases gradually as learning progresses, yielding steadily better results.
Experiments with the proposed algorithm in a dynamic environment are performed to compare its learning results with the D* algorithm. D* and its variants have been widely used for mobile robot navigation because of their adaptation to dynamic environments. During navigation, the robot repeatedly uses new information to replan a new shortest path from its current coordinates.
For the D* algorithm, if an obstacle lies on the robot's path, the path is checked again as the robot moves along it. In the same situation, different paths may result depending on the behavior of the moving obstacle.
Image Credits: MDPI
Image Credits: https://youtu.be/JW39VQZwxKw
References: