From the outset of this project, we were interested in using Papers With Code data (1446 research leaderboards) to bridge the gap between results reported in papers and the results you get by running the associated open source code.
“Reproducibility” is a big word - and we do not claim to have solved this problem (yet!) - but as a step in the right direction we came up with a narrow definition of epsilon reproducibility:
Epsilon Reproducibility: A paper is epsilon-reproduced if abs(code result - paper result) < epsilon
We made epsilon task-specific - for example, Image Classification on ImageNet has a different epsilon than Language Modelling on WikiText-103 - and chose the levels by looking at how far implementation results tended to deviate from paper results.
We also had to decide what to do if the code results were above the paper results: does this count as not reproduced? We opted for treating it as “reproduced with benefits” - hence the checkmark with the plus - but this is arguably still problematic; it doesn’t answer the question of why the implementation has better results.
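The classification above can be sketched in a few lines of Python. The task names, epsilon values, and the rule that “reproduced with benefits” applies when the code beats the paper by more than epsilon are all illustrative assumptions, not the actual Papers With Code thresholds or logic:

```python
# Illustrative per-task epsilons (hypothetical values, not the real thresholds)
TASK_EPSILON = {
    "image-classification-imagenet": 0.5,    # accuracy, percentage points
    "language-modelling-wikitext-103": 1.0,  # perplexity
}

def reproducibility_status(task, paper_result, code_result, higher_is_better=True):
    """Classify a (paper result, code result) pair for one task."""
    eps = TASK_EPSILON[task]
    diff = code_result - paper_result
    if not higher_is_better:
        diff = -diff  # flip sign so positive diff always means "code is better"
    if abs(diff) < eps:
        return "reproduced"                # within epsilon of the paper
    if diff > 0:
        return "reproduced with benefits"  # code beats paper by more than epsilon
    return "not reproduced"                # code falls short by more than epsilon

print(reproducibility_status("image-classification-imagenet", 76.1, 76.3))
# -> reproduced (difference of 0.2 points is within the 0.5 epsilon)
```

Note the `higher_is_better` flag: for metrics like perplexity, where lower is better, a code result *below* the paper result is the “with benefits” case.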
We would be very interested in community feedback, in particular:
- Thoughts on the concept of epsilon-reproducibility
- How should we set epsilon for each task?
- What to do with code results that exceed paper results?