I aim to enter at least one machine learning competition per year. Because of my growing interest in reinforcement learning, I felt ready to take on a challenge in this domain. Kaggle offers machine learning competitions, but not in reinforcement learning, so I went to the AICrowd website, which hosts a number of RL competitions.
This time I took an interest in the MineRL competition on "Sample Efficient Reinforcement Learning using Human Priors". It is a competition based on the game Minecraft in which the agent has to obtain a diamond. As a reminder, to get one, the agent has to craft several items such as sticks and pickaxes and be able to navigate a mine. So the key challenges here will be long-term planning and efficient exploration. We don't want the agent to try every crafting recipe!
To allow gradual experimentation, the organizers prepared sub-task environments such as navigating to a destination or chopping several tree blocks. They also provided a dataset of human play for each sub-task. After a quick look at the data, I found that each human demonstration consists of a video, a JSON file, and an NPZ (NumPy zipped archive) file.
Before training an agent, it is a good idea to explore the dataset, since we will rely on it. Let's see what we've got!
# Environments and tasks
There are five kinds of tasks: Navigate, Navigate Extreme (which spawns the agent in an extreme hills biome), Obtain Pickaxe, Obtain Diamond, and Treechop.
All except Treechop come in two variants, Sparse and Dense. In the "Dense" variant, the agent receives a reward at every tick based on how much closer it has moved to the target.
Note: The environment spawns the agent in a survival map.
We have 1525 videos split among these environments.
To explore the video files, I started by looking at the metadata, which I extracted with the hachoir library. Below you can see the distribution of resolutions and durations across all the videos. Most of the videos have a resolution of 64x64, but some have a higher one. 64x64 is a convenient input size for a neural network. For the 256x192 videos, I still have to decide whether to resize them, discard them, or do something else.
hachoir can extract more, such as the frame rate, but in this case it wasn't available in the metadata.
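As a rough sketch of the metadata step, the snippet below reads width, height, and duration with hachoir and tallies the resolutions. The helper names (`read_video_metadata`, `resolution_histogram`) are my own, and I am assuming the fields are exposed under the `width`, `height`, and `duration` keys at the top level of the metadata, which may not hold for every container format.

```python
from collections import Counter


def read_video_metadata(path):
    """Return (width, height, duration_seconds) for one video.

    Any field hachoir cannot find comes back as None.
    """
    # Imported here so the tallying helper below works without hachoir.
    from hachoir.metadata import extractMetadata
    from hachoir.parser import createParser

    parser = createParser(path)
    if parser is None:  # unknown or unreadable file
        return None, None, None
    with parser:
        metadata = extractMetadata(parser)
    if metadata is None:
        return None, None, None

    def value(key):
        # Assumption: the key lives at the top level of the metadata.
        return metadata.get(key) if metadata.has(key) else None

    duration = value("duration")  # a datetime.timedelta when present
    seconds = duration.total_seconds() if duration is not None else None
    return value("width"), value("height"), seconds


def resolution_histogram(records):
    """Count videos per (width, height) from (width, height, seconds) tuples."""
    return Counter((width, height) for width, height, _ in records)
```

Calling `resolution_histogram(read_video_metadata(p) for p in video_paths)` then gives the counts behind a resolution bar chart, where `video_paths` is whatever list of files you collected.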
Next, I needed to know a bit more about the content of the videos. For that, I extracted all the frames from the videos, which turns the problem into an image analysis one. Then, to visualize the distribution, I made an embedding projection. More precisely, I sampled 1024 images, retrieved the output of a pre-trained ResNet18 for each, and projected the result to 2D using UMAP (Uniform Manifold Approximation and Projection). We get the following result for the video demonstrations of the ObtainDiamond environment.
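The pipeline above could look roughly like the sketch below. It is not the exact code I ran: the function names are illustrative, and I am assuming here that "the output" of ResNet18 means the 512-dimensional pooled features (the final classification layer is replaced by an identity), using torchvision for the model and umap-learn for the projection.

```python
import random


def sample_paths(paths, k=1024, seed=0):
    """Pick up to k frame paths, reproducibly."""
    paths = sorted(paths)
    return random.Random(seed).sample(paths, min(k, len(paths)))


def embed_images(paths, batch_size=64):
    """Return one ResNet18 feature vector per image as a NumPy array."""
    # Heavy dependencies are imported lazily, only when embedding is needed.
    import torch
    from PIL import Image
    from torchvision.models import ResNet18_Weights, resnet18

    weights = ResNet18_Weights.DEFAULT
    model = resnet18(weights=weights)
    model.fc = torch.nn.Identity()  # assumption: keep the pooled 512-d features
    model.eval()
    preprocess = weights.transforms()  # resize, crop, normalize for ImageNet

    chunks = []
    with torch.no_grad():
        for start in range(0, len(paths), batch_size):
            images = [
                preprocess(Image.open(p).convert("RGB"))
                for p in paths[start:start + batch_size]
            ]
            chunks.append(model(torch.stack(images)))
    return torch.cat(chunks).numpy()


def project_2d(features, seed=0):
    """Project feature vectors to 2D with UMAP."""
    import umap

    return umap.UMAP(n_components=2, random_state=seed).fit_transform(features)
```

With `frames` being a list of extracted frame files, `project_2d(embed_images(sample_paths(frames)))` yields one 2D point per sampled image, ready for a scatter plot with thumbnails.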
That's not bad, but the pictures overlap, and we cannot inspect the hidden ones. After a quick search, I found how to fit the projection onto a square grid using the Jonker-Volgenant algorithm (a linear assignment problem solver).
There is no light in a mine, and the visualization reminded me of that. Part of the dataset is simply too dark to make anything out. The agent should be able to craft a torch and place it in order to see the diamond block in the dark!
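Here is a minimal sketch of the grid-fitting idea, assuming SciPy is available. I use `scipy.optimize.linear_sum_assignment` as the linear assignment solver (SciPy documents it as a modified Jonker-Volgenant algorithm); the function name `snap_to_grid` and the squared-distance cost are illustrative choices of mine.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist


def snap_to_grid(points_2d, side):
    """Assign each 2D point to a distinct cell of a side x side grid.

    Returns an integer array of shape (n, 2) holding (column, row) per point.
    """
    points = np.asarray(points_2d, dtype=float)
    if len(points) != side * side:
        raise ValueError("need exactly side * side points")

    # Rescale the projection to the unit square so it is comparable to the grid.
    low = points.min(axis=0)
    span = points.max(axis=0) - low
    span[span == 0] = 1.0  # avoid dividing by zero on degenerate axes
    points = (points - low) / span

    # Grid cell centres, enumerated row by row.
    axis = np.linspace(0.0, 1.0, side)
    grid = np.array([(x, y) for y in axis for x in axis])

    # Minimise the total squared distance between points and their cells.
    cost = cdist(points, grid, metric="sqeuclidean")
    rows, cols = linear_sum_assignment(cost)

    cells = np.empty(len(points), dtype=int)
    cells[rows] = cols
    return np.column_stack((cells % side, cells // side))
```

For the 1024 sampled images, `side=32` gives a 32x32 mosaic in which no thumbnail hides another.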
# JSON files
Plotting the content of the JSON files is interesting because they record whether or not the human succeeded at the task (a failure may be due to a bug or something else).
As we can see, some of the data should be discarded. Around 20% of the demonstrations in the dataset did not accomplish the task. I will remove them to avoid learning from misleading demonstrations.
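The cleaning step can be sketched as follows. I am assuming each demonstration folder holds a file named `metadata.json` with a boolean `success` field; both names are parameters so they can be adapted if the actual layout differs.

```python
import json
from pathlib import Path


def split_by_success(dataset_root, filename="metadata.json", success_key="success"):
    """Split demonstration folders into (successful, failed) lists.

    A demonstration counts as failed when the key is missing or falsy.
    """
    kept, dropped = [], []
    for meta_path in sorted(Path(dataset_root).rglob(filename)):
        with open(meta_path, encoding="utf-8") as handle:
            meta = json.load(handle)
        target = kept if meta.get(success_key) else dropped
        target.append(meta_path.parent)
    return kept, dropped
```

The ratio `len(dropped) / (len(kept) + len(dropped))` is then the share of demonstrations to discard.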
# Sequence visualisation
NPZ is a NumPy archive format that bundles several named NumPy arrays into a single file. Here it holds sequence data: the action, inventory, and reward saved for each frame.
I plotted some charts for each, but they are more informative when combined with the associated frame. I used the HoloViews library with Bokeh to navigate frame by frame. The main advantage of this library is that I can add an HTML <div> block and put in anything that can be copy-pasted later.
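A quick way to see what such an archive contains is to list its arrays with their shapes and types. This is a small sketch using `numpy.load`; the function name is mine, and the array names you get back depend on the dataset, so none are hard-coded here.

```python
import numpy as np


def summarize_npz(path_or_file):
    """Map each array name in an .npz archive to its (shape, dtype)."""
    with np.load(path_or_file) as archive:
        return {
            name: (archive[name].shape, str(archive[name].dtype))
            for name in archive.files
        }
```

If the data really is saved per frame, the first dimension of every array should be the same sequence length (or differ by a fixed offset), which is an easy sanity check to run on the summary.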
This visualization doesn't give an overall view of the dataset, but it will allow me to inspect a sequence of action-observation pairs more carefully if needed.
# About the state and the actions spaces
The state space depends on the environment considered. For example, in the "navigate" environment, a compass angle is returned; in the "obtain pickaxe" environment, the observation includes the item held in the right hand.
Regarding the action space, we find directional movement along with jump, sprint, sneak, the camera, and so on. All these controls are discrete except for the camera, which expects continuous values.
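To make the discrete/continuous split concrete, here is a small sketch that separates an action into its two parts. The dictionary layout and key names (`forward`, `camera`, ...) are assumptions for illustration, and the list of discrete controls is not exhaustive.

```python
import numpy as np

# Hypothetical key names for the binary controls mentioned above.
DISCRETE_KEYS = ("forward", "back", "left", "right", "jump", "sprint", "sneak")


def split_action(action):
    """Split an action dict into (discrete_vector, camera_vector).

    Missing discrete keys default to 0 and a missing camera to no movement.
    """
    discrete = np.array(
        [int(action.get(key, 0)) for key in DISCRETE_KEYS], dtype=np.int64
    )
    # The camera is the only continuous control: two floating-point values.
    camera = np.asarray(action.get("camera", (0.0, 0.0)), dtype=np.float32)
    return discrete, camera
```

Keeping the two parts apart makes it possible to treat the buttons as classification targets and the camera as a regression target.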
# What now?
Most of these visualizations can be reused in other machine learning projects, so I gathered them into a repository I called Joligraf.
The competition poses the problem as an imitation learning problem, but as a baseline, I will first treat it as a supervised learning problem. Since we have data, I will start by finding a good state representation, or by augmenting it.
More on imitation learning and what I found coming soon! To be continued…