Paper Reproduction and Improvement

Experimental Environment

  • OS: Arch Linux 5.16.1-arch1-1
  • Software: Python 3.10, CUDA 11.5
  • Hardware: Intel Core i5 (8th Gen), NVIDIA GeForce GTX 1060

Experimental Problems

  1. Based on the paper and its reproduction code, add comments explaining the key steps of the code.
  2. Improve the deep reinforcement learning algorithm in the reproduction code. The improved part must differ from the policy gradient algorithm used in the original reproduction code.
  3. Apply the improved algorithm to practical tasks, and illustrate the necessity and superiority of the improved algorithm for those tasks.

Solution

What is Neural Architecture Search (NAS)

As the name implies, it is a search problem. To solve a problem with reinforcement learning methods such as Q-learning, we need to design a network structure to train. Obviously, the quality of the training data and hyperparameters such as the learning rate affect the training result, but the network structure matters just as much, so it is important to design a good one. In the past, this task was done manually. Now we want to hand it over to a neural network. Yes, let a neural network generate a neural network!

A neural network can be abstracted as a graph. The direction of each edge represents the direction in which data flows, and each node carries an activation function, such as softmax or tanh. All the generating network has to do is produce a directed graph and place an activation function on every node.

What is Efficient Neural Architecture Search (ENAS)

Well, it is still a neural architecture search problem, but an efficient one. How should we understand that? To generate better and better networks, we need to train the generating network, which we call the Controller. For a network it generates, how do we evaluate whether it is good or bad, i.e., compute a reward? Train that network! Because training a network is expensive in time, which slows down the training of the Controller, we need to speed things up. For example, if a network generated by the Controller needs 10 hours to train until it converges, then we need 10 hours to get the reward for one action of the Controller. Imagine playing a game where, after every action, you have to wait 10 hours for feedback before you can make the next move. Isn't that uncomfortable?

So, how do we speed it up? We can start with the networks that have already been trained. The purpose of training a network is to obtain good parameters. Note that the structures of the generated networks are often similar, so some parameters can be retained instead of retrained, which speeds up the convergence of the network. The key to the improvement, therefore, is parameter sharing.

The Framework of ENAS

According to the paper, this is a search problem, so there is a search space. For example, the following Figure shows a complete graph with four nodes.

The Controller is a Recurrent Neural Network (RNN), which can produce outputs of flexible length and remembers what it has output before. Its task is to select the edges and the activation functions on the nodes. The following Figure shows an example of the Controller's output: ['tanh', 1, 'ReLU', 2, 'ReLU', 1, 'tanh']

The first activation function belongs to Node 1. The next two entries describe Node 2: the data from Node 1 flows to Node 2, and the activation function on Node 2 is ReLU, and so on. Node 0 is the input, and its data always flows to Node 1. Nodes with no outgoing edge are treated as end nodes: we average the data of all end nodes and pass it to the output node to get the result.

From the search space we can see that all the parameters (W, b) of the neural network live on the edges. When we select an edge once and train it, and later select the same edge again, the previously trained parameters can be reused.

So, here are the steps of the whole framework:

  1. The Controller generates a network, i.e., outputs a sequence such
    as ['tanh', 1, 'ReLU', 2, 'ReLU', 1, 'tanh'].
  2. Build the network according to the sequence.
  3. Train the network.
  4. Evaluate the network and obtain its accuracy.
  5. The Controller receives the accuracy as its reward and learns from it (a minimal sketch of this update follows the list).
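
To make step 5 concrete, here is a minimal, self-contained sketch of a REINFORCE update. The toy controller, sizes, and names are for illustration only and are not the reproduction code.

import torch
import torch.nn as nn

# Toy illustration of the policy-gradient (REINFORCE) update in step 5;
# the "controller" here is just a linear layer, not the real RNN Controller.
torch.manual_seed(0)
controller = nn.Linear(4, 3)                    # toy controller with 3 possible actions
optimizer = torch.optim.Adam(controller.parameters(), lr=1e-2)

state = torch.randn(1, 4)                       # dummy controller input
dist = torch.distributions.Categorical(logits=controller(state))
action = dist.sample()                          # e.g. which edge / activation to pick

reward = 0.8                                    # stand-in for the child network's accuracy
loss = -(dist.log_prob(action) * reward).sum()  # maximise the reward-weighted log-probability
optimizer.zero_grad()
loss.backward()
optimizer.step()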

Problems in the paper reproduction code

Frankly speaking, based on the introduction above, this code is NOT a reproduction of the ENAS parameter-sharing paper at all.

It can't run

This shocked me the most, although the reasons it can't run are not fatal.

For example, in training.py, line 67, there is a variable named val_freq that is never defined. Also, training.py and childNet.py contain code that uses the numpy package without importing it.

It doesn’t have parameter sharing

I noticed this issue and read the article written by the author carefully. It doesn't mention parameter sharing. Good.

Every time the Controller generates a network and trains it, the code calls net.apply(weight_reset) to reset the parameters. See childNet.py, line 167.

It's just a simple NAS, and in fact a simplified version, as I will explain next.

Its search space is slightly different

Its search space consists of hidden units instead of nodes. That is, its Controller generates a sequence such as [2, 8, 'tanh', 16, 4, 'ReLU', 8, 2], which means there are six hidden units whose output sizes are 2, 8, 16, 4, 8, and 2. Suppose the input size is 10: the data becomes size 2 after the first hidden unit, size 8 after the second, is activated by the tanh function, becomes size 16 after the third hidden unit, and so on.

See more in childNet.py, the __init__ function of Net.

Yes, this is a serial network structure, which, in graph terms, is a chain!
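
To make the chain concrete, here is a minimal sketch of how such a sequence could be turned into a serial network. The helper build_chain is hypothetical and only illustrates the idea; the real construction is in childNet.py.

import torch
import torch.nn as nn

# Hypothetical helper illustrating the chain-style search space.
ACTIVATIONS = {'tanh': nn.Tanh(), 'ReLU': nn.ReLU()}

def build_chain(sequence, input_size, num_classes):
    layers = []
    in_size = input_size
    for token in sequence:
        if isinstance(token, int):                   # a hidden unit: Linear(in_size, token)
            layers.append(nn.Linear(in_size, token))
            in_size = token
        else:                                        # an activation function, given by name
            layers.append(ACTIVATIONS[token])
    layers.append(nn.Linear(in_size, num_classes))   # final classification layer
    return nn.Sequential(*layers)

net = build_chain([2, 8, 'tanh', 16, 4, 'ReLU', 8, 2], input_size=10, num_classes=2)
print(net(torch.randn(1, 10)).shape)                 # torch.Size([1, 2])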

There are some differences between the code and the article

It encodes the hidden units and activation functions into one vector. For every output probability vector the Controller produces, it samples according to that probability, which means that at every step it may select either a hidden unit or an activation function.

However, the article says that the input to the Controller alternates:

The lists will be switched depending on the nature of the iteration
step, so, for an odd number, the number of hidden units will be fed as
the input and for an even number, the type of the activation function
chosen is fed as the input to the memory cell of the recurrent neural
network.

If the code matched what the article says, the Controller's outputs should alternate between a hidden-layer size and an activation function. Obviously, they don't.

See more in policy.py, forward function.
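
For comparison, here is a sketch of what the alternating scheme described in the article could look like. This is my own illustration, not the code in policy.py.

import torch
import torch.nn as nn

# Hypothetical controller that alternates between emitting a hidden-unit
# size and an activation function, as the article describes.
HIDDEN_CHOICES = [2, 4, 8, 16]
ACT_CHOICES = ['tanh', 'ReLU']

class AlternatingController(nn.Module):
    def __init__(self, hidden_size=32):
        super().__init__()
        self.hidden_size = hidden_size
        self.cell = nn.LSTMCell(hidden_size, hidden_size)
        # one embedding table and one output head per token type
        self.embed_units = nn.Embedding(len(HIDDEN_CHOICES), hidden_size)
        self.embed_act = nn.Embedding(len(ACT_CHOICES), hidden_size)
        self.head_units = nn.Linear(hidden_size, len(HIDDEN_CHOICES))
        self.head_act = nn.Linear(hidden_size, len(ACT_CHOICES))

    def forward(self, num_layers=3):
        h = torch.zeros(1, self.hidden_size)
        c = torch.zeros(1, self.hidden_size)
        x = torch.zeros(1, self.hidden_size)           # start token
        tokens = []
        for step in range(2 * num_layers):
            h, c = self.cell(x, (h, c))
            if step % 2 == 0:                          # first pick a hidden-unit size...
                idx = torch.distributions.Categorical(logits=self.head_units(h)).sample()
                tokens.append(HIDDEN_CHOICES[idx.item()])
                x = self.embed_units(idx)              # feed the choice back as the next input
            else:                                      # ...then pick an activation function
                idx = torch.distributions.Categorical(logits=self.head_act(h)).sample()
                tokens.append(ACT_CHOICES[idx.item()])
                x = self.embed_act(idx)
        return tokens

print(AlternatingController()(num_layers=2))           # e.g. [8, 'tanh', 2, 'ReLU']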

Redundant dependencies in requirements.txt

The requirements.txt lists many packages that the code does not depend on at all; fewer than ten packages are actually needed.

Improvement

The main idea of the code is: define a class PolicyNet, called the Controller, and a class childNet, which wraps a network. The Controller generates a network description and hands it to childNet; childNet parses the description, trains the network, and returns the network's accuracy to the Controller as the reward. The Controller learns by reinforcement learning, and the process repeats.

Add the parameter sharing feature

To add this feature, we need to construct a graph. But here is a problem.

All the nodes in the graph MUST have the SAME input and output sizes. For example, the following Figure shows the graph we construct with 5 nodes.

If we select an edge from Node 1 to Node 3, the output size of Node 1 and the input size of Node 3 MUST be the same. The same holds for Node 1 and Node 2, and therefore for Node 2 and Node 3, and so on. All nodes must have the same input size and output size, except for the input size of Node 1, because the data from Node 0, which is our input, only ever flows to Node 1.

Also, every edge goes from a smaller-numbered node to a larger-numbered one, which ensures the graph is valid, that is, acyclic.

I construct the Net when defining childNet:

# edge from the input Node 0 to Node 1
input_edge = nn.Linear(num_features, hidden_features).to(self.__device)
# one shared edge for every pair of hidden nodes (st, ed) with st < ed
hidden_edge = [nn.Linear(hidden_features, hidden_features).to(self.__device) for _ in range(layer_limit * (layer_limit - 1) // 2)]
# edge from the averaged end nodes to the output classes
output_edge = nn.Linear(hidden_features, num_classes).to(self.__device)
# all edges live in one ModuleList, so their parameters persist across generated networks
self.net = nn.ModuleList([input_edge, *hidden_edge, output_edge]).to(self.__device)

The parameters on the edges are not reset at every training, so the parameters are shared.
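
A quick toy check of what sharing means (illustrative only; the weight_reset body below is an assumption about what the original code does, not a quote of it):

import torch
import torch.nn as nn

# A shared edge keeps its trained weights between generated architectures.
shared_edge = nn.Linear(4, 4)
before = shared_edge.weight.clone()

# one "training" step on this edge with a dummy loss
opt = torch.optim.SGD(shared_edge.parameters(), lr=0.1)
shared_edge(torch.randn(2, 4)).sum().backward()
opt.step()

# the next architecture that selects this edge starts from the trained weights
assert not torch.equal(before, shared_edge.weight)

# the original reproduction instead wipes them with net.apply(weight_reset);
# a typical weight_reset looks like this (assumed, not the repo's exact code)
def weight_reset(m):
    if isinstance(m, nn.Linear):
        m.reset_parameters()

shared_edge.apply(weight_reset)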

Change the chain into a graph

Given the network description named layers from the Controller, childNet parses it and trains the network.

For example, suppose layers is ["tanh", 1, "ReLU"]. Entries at odd positions (0-based) are source nodes, and entries at even positions are activation functions.

for pos, layer in enumerate(layers):
    if layer == 'EOS':
        break
    if pos & 1:
        if isinstance(layer, str):
            raise Exception("Network WRONG! Expect int but found str")
        if layer not in value_node:
            raise Exception("Network WRONG! Index Large")
        # data flows from layer to index_node through the shared edge
        val[index_node] = self.net[self.hidden_edge(layer, index_node)](val[layer])
        # now index_node has a value
        value_node.add(index_node)
        # the data from layer has been passed on to index_node, so layer can no longer be an end node
        if layer in end_node:
            end_node.remove(layer)
    else:
        if isinstance(layer, int):
            raise Exception("Network WRONG! Expect str but found int")
        # activate the value of index_node
        val[index_node] = activation_functions[layer](val[index_node])
        # the activation finishes this node, so move on to the next one
        index_node = index_node + 1
The function hidden_edge(st, ed) returns the network edge that goes from Node st to Node ed.
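
For illustration, one consistent way to index these edges inside self.net is a triangular index like the one below; the exact formula in childNet.py may differ, this only shows the idea.

# Illustrative triangular indexing of the shared edges stored in self.net
# (the exact formula in childNet.py may differ; this only shows the idea).
def hidden_edge(st, ed):
    assert 1 <= st < ed, "edges must go from a smaller node to a larger one"
    # edges ending at node ed occupy a block of (ed - 1) slots;
    # the +1 skips the input edge stored at position 0 of the ModuleList
    return 1 + (ed - 1) * (ed - 2) // 2 + (st - 1)

# with layer_limit = 4 this enumerates indices 1..6, i.e. 4 * 3 // 2 = 6 hidden edges
print([hidden_edge(st, ed) for ed in range(2, 5) for st in range(1, ed)])  # [1, 2, 3, 4, 5, 6]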

Actually, we need to control the data flow through the network manually because it is no longer serial, so we must record the value at EVERY node instead of just the current one.

And finally, average the values of all end nodes to get the result, finishing one whole forward pass.

# just let result be the same variable type as x
result = val[1]
for end in end_node:
    # sum up the values of all end nodes
    result = result + val[end]
# remove the initial val[1], which was only used to set the type
result = result - val[1]

# calculate the mean
result = result / len(end_node)

# calculate the output (classes)
return self.net[-1](result)

It looks a little complex because the network structure is no longer serial and we have to do the forward pass manually.

Use Priority Replay

There is something special about this problem: we only get the reward after the whole sequence of actions has been made, so an actor-critic network may not help much to improve the Controller. Besides Policy Gradient, which the Controller uses, there are other methods such as Proximal Policy Optimization (PPO), Deterministic Policy Gradient (DPG), and Deep Deterministic Policy Gradient (DDPG), which is suited to continuous actions. Here I implement a method in a similar spirit: reusing past experience.

Because some experiences, such as high-reward actions, may be important, we should occasionally let the Controller learn from them again so that it performs better. So every few episodes, the Controller samples from the memory pool.

# every Memory_replay episodes, replay experiences from the memory pool;
# otherwise sample a fresh architecture from the Controller
if i > Memory_replay and i % Memory_replay == 0:
    m_index, prob, actions, m_reward, _ = memory_pool.sample(batch_size)
else:
    prob, actions = policy(training)

Every experience has a priority defined by its reward and is sampled probabilistically according to that priority.

We use a data structure called a SumTree for the implementation, as shown in the following Figure. The idea is simple: cut a line segment into many parts and place the experiences on it, where the length of each part depends on the priority of its experience. Then randomly choose a position on the line and sample the experience that the position falls into. A tree structure maintains the segment as experiences are added and priorities are updated.
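
Here is a compact sketch of such a SumTree; it is an illustrative implementation, and the one in my code may differ in details.

import random

# Leaves hold priorities, internal nodes hold the sum of their children,
# so sampling a random position on the "line" takes O(log n).
class SumTree:
    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity)       # 1-based binary tree stored in an array
        self.data = [None] * capacity
        self.next = 0                             # next leaf to overwrite

    def add(self, priority, experience):
        leaf = self.next + self.capacity
        self.data[self.next] = experience
        self.update(leaf, priority)
        self.next = (self.next + 1) % self.capacity

    def update(self, leaf, priority):
        change = priority - self.tree[leaf]
        while leaf >= 1:                          # propagate the change up to the root
            self.tree[leaf] += change
            leaf //= 2

    def sample(self):
        s = random.uniform(0, self.tree[1])       # random position on [0, total priority)
        node = 1
        while node < self.capacity:               # descend until a leaf is reached
            left = 2 * node
            if s <= self.tree[left]:
                node = left
            else:
                s -= self.tree[left]
                node = left + 1
        return self.data[node - self.capacity], self.tree[node]

tree = SumTree(4)
for r in [0.1, 0.5, 0.2, 0.9]:
    tree.add(priority=r, experience={'reward': r})
print(tree.sample())                              # high-priority experiences come up more often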

Using CUDA to speed up training

It is very common to use a GPU to speed up training, and thanks to PyTorch it is very easy to train on the GPU: transfer the data from RAM to the GPU and let the GPU do the computation.
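
The pattern is the standard PyTorch one: pick a device and move both the model and every batch onto it (a generic sketch, not a quote of the project code).

import torch
import torch.nn as nn

# Generic PyTorch pattern: fall back to the CPU when no GPU is available.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = nn.Linear(300, 2).to(device)    # move the model parameters to the device
x = torch.randn(4, 300).to(device)      # move each input batch to the same device
logits = model(x)                       # the computation now runs on that device
print(logits.device)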

According to the actual measurements, described below, it takes about 16 hours to complete 500 episodes of training on the CPU and only about 2 hours on the GPU, nearly eight times faster.

Result

To better show the effect of the improvements, I test them on sentiment classification, a typical problem in the field of Natural Language Processing (NLP).

The data (click to download)
are movie reviews, each labeled as positive or negative. We need to train a neural network so that, given a movie review, the network can tell us whether it is positive or negative.

The task of the Controller is to generate a good network structure.

And our task is to train the Controller so that it generates a good network with good parameters.

After cleaning the data and encoding it with word2vec, we get a vector representation of each review that can be fed into the network.
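
As a rough sketch of this preprocessing step (assuming gensim's Word2Vec and simple averaging of word vectors; the actual cleaning and encoding in the project may differ):

from gensim.models import Word2Vec
import numpy as np

# Hypothetical preprocessing: tokenise, train word2vec, then average the
# word vectors of a review into one fixed-size input vector.
reviews = ["this movie was great", "a boring and predictable plot"]
labels = [1, 0]

tokenized = [r.lower().split() for r in reviews]
w2v = Word2Vec(tokenized, vector_size=100, window=5, min_count=1)

def encode(tokens, model):
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.vector_size)

X = np.stack([encode(t, w2v) for t in tokenized])   # shape (num_reviews, 100)
print(X.shape)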

Due to time constraints and the performance of the GPU, I use the first 1000 reviews, of which the first 600 are used as training data and the remaining 400 as test data. After 500 episodes with batch size 4, using three versions (the original version, the improved version, and the improved + replay version), here are the results.

The Figure above shows the accuracy of the networks generated by the Controller, and the following Figure shows the reward and loss of the Controller.

It is clear that the original version is rather funny: although it trains very fast (only 30 minutes on the CPU), it seems to learn nothing; I only fixed the bugs mentioned above and ran it. The version with replay performs slightly better than the version without replay, and at episode 500 the version with replay still shows an upward trend.

Conclusion

From these results I can see the necessity of my improvements: modifying the network structure, adding parameter sharing, and adding memory replay. The poor performance at the start is probably due to insufficient training caused by the limited device performance. But thanks to parameter sharing, the parameters already trained can be reused the next time a new network is generated, so the accuracy and the reward improve. Also, with reinforcement learning, the Controller learns to produce better network structures. The NLP experiment shows that the improved Controller is useful for finding a good network structure.

Using a neural network to generate a neural network is an amazing thing. There are still things that could be improved, such as loading trained models and using a more efficient policy gradient method.

You can access the code on my GitHub.

References

  • Xuanyi Dong, David Jacob Kedziora, Katarzyna Musial, and Bogdan Gabrys. Automated deep learning: Neural architecture search is not the end. CoRR, abs/2112.09245, 2021.
  • Zhenhan Huang, Chunheng Jiang, Pin-Yu Chen, and Jianxi Gao. Network graph based neural architecture search. CoRR, abs/2112.07805, 2021.
  • Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, and Jeff Dean. Efficient neural architecture search via parameter sharing. CoRR, abs/1802.03268, 2018.
  • Kaixiong Zhou, Qingquan Song, Xiao Huang, and Xia Hu. Auto-GNN: Neural architecture search of graph neural networks. CoRR, abs/1909.03184, 2019.