A solution to the Deep Learning reproducibility crisis
Deep learning has seen a rapid increase in popularity over the last 5 years.
Because the field is still developing rapidly, it has attracted an influx of researchers and institutes. The surge was so rapid that in 2018, tickets for a niche machine learning conference, NIPS (now NeurIPS), sold out in less than 12 minutes.
Given the large opportunity that lies ahead, many labs and independent researchers alike are under massive pressure to publish or perish. This has led to a large number of deep learning papers being published every day. According to a recent survey, about 100 new papers are released on arXiv each day, most of which claim state-of-the-art performance.
This rapid influx of papers brings various challenges. Researchers who performed a meta-analysis of state-of-the-art Reinforcement Learning algorithms found that many of the results were questionable and largely dependent on random seeds. Conferences such as NeurIPS and ICLR are catching on to this and introduced reproducibility challenges in 2019.
However, I believe this problem can be solved with a combination of tools that already exist today.
The following is a description of what this might look like.
1. The conference provides researchers with a container that has up-to-date deep learning frameworks pre-installed (e.g., PyTorch, TensorFlow, MXNet). Researchers will have to ensure that their code can be deployed on this container before submitting their paper. This should help mitigate issues that arise from differences in development environments and framework versions.
2. This container would expose a way to specify a system-wide random seed that is used during training and testing. This would allow the conference editors to test the approach with several random seeds.
3. For the final submission, authors train and test their approach on pre-defined datasets and pre-trained weights that are provided by the conference. This ensures that there are no mistakes pertaining to data generation. Certain kinds of research may require training on new datasets, but those are likely easier to handle on a case-by-case basis.
4. During the review process, conference organizers choose a small number of random hidden seeds. After submission, these seeds are used to train the networks several times through the container's seed setting described in step 2. The resources for this part will be provided by the authors. Since this is done after submission, it shouldn’t substantially affect researcher productivity.
5. After the networks are trained, they are evaluated several times on a standardized online leaderboard. Statistics such as the mean, standard deviation, and runtime are reported. The resources for evaluation could be provided by conference sponsors. Given that large companies benefit the most from commercializing deep networks at scale, it doesn’t seem far-fetched for them to sponsor a small amount of compute for evaluations that help reduce the noise in Deep Learning research.
6. The good thing about this approach is that authors can choose to make their container public or keep it private. Many industrial research labs have to protect their IP and thus cannot release code. This approach accommodates that as well.
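Steps 2, 4, and 5 above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a proposed conference API: `set_global_seed`, `train_and_evaluate`, and `evaluate_submission` are hypothetical names, the "training" is a placeholder that returns a noisy score, and the seed values are made up. It uses only the standard library so it runs anywhere; a real container would also have to seed whichever frameworks it ships.

```python
import random
import statistics
import time


def set_global_seed(seed: int) -> None:
    """Hypothetical container hook: seed every RNG the container knows about."""
    random.seed(seed)
    # Assumption: a real container would also seed the installed frameworks
    # here, for example numpy.random.seed(seed) and torch.manual_seed(seed).


def train_and_evaluate(seed: int) -> float:
    """Stand-in for an author's entry point; returns a test score."""
    set_global_seed(seed)
    # Placeholder for real training: a score that depends only on the seed.
    return 0.90 + random.gauss(0.0, 0.01)


def evaluate_submission(seeds):
    """Run the submission once per hidden seed and summarize the results."""
    scores, runtimes = [], []
    for seed in seeds:
        start = time.perf_counter()
        scores.append(train_and_evaluate(seed))
        runtimes.append(time.perf_counter() - start)
    return {
        "scores": scores,
        "mean": statistics.mean(scores),
        # The sample standard deviation needs at least two runs.
        "std": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "mean_runtime_s": statistics.mean(runtimes),
    }


if __name__ == "__main__":
    hidden_seeds = [11, 23, 47]  # hypothetical seeds picked by the organizers
    print(evaluate_submission(hidden_seeds))
```

Reporting the spread across seeds alongside the mean is what lets a reviewer tell a real improvement from a lucky seed. Running the same seeds twice should give identical scores; if it doesn't, some source of randomness is not under the container's control.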
This is merely a first sketch of a solution, and it will likely have to go through several iterations before it solves the problem well. However, this is something we should all start thinking about more, since science cannot advance without solid ground to build on. Constructive criticism and other ideas are welcome. Either leave a comment below or send an email to firstname.lastname@example.org.