The assessment procedure that NWO applies within the Open Competition of the Exact and Natural Sciences domain (hereafter: ENW-M) is sound. That is the conclusion of a scientifically grounded investigation into the ENW-M program of the Open Competition. Based on the results, NWO sees no reason for major adjustments to the program's assessment process, but it does see opportunities to improve the process further in certain areas.
As a research funder, NWO has the task of selecting the best research in the Netherlands through competition. Scientists and research institutions can submit an application for funding of research projects as soon as NWO publishes a call for proposals. In awarding research grants, quality is always decisive: NWO has the quality of the research proposal and its expected scientific or societal impact assessed. This is done, at NWO's invitation, by an external scientific committee consisting of experts in the field. Additional external experts, so-called referees, can also be asked to provide a judgment. The assessment procedure results in a ranking, in which the best research proposals are awarded funding. Approximately 8,000 research proposals are assessed annually, with thousands of scientists involved as assessors.
Double-blind study
The reason for the investigation into the ENW-M assessment procedure within the Open Competition was the re-evaluation of one round of this program, after the original assessment had been declared invalid. Following the re-evaluation, some of the differences between the priorities of the new and the original assessment committees turned out to be greater than expected. This prompted ENW to examine the committee stage of the assessment procedure using a scientifically grounded research method. The research was completed at the end of 2024.
The aim of the research was to determine the extent to which assessment by committees in this program leads to a reproducible outcome. A double-blind study was set up in which the same set of research proposals and referee reports was assessed by two different committees. The outcomes of each pair of committees were compared and analyzed using different methods and assumptions to determine how reliably the committees prioritize the applications. In collaboration with researchers from TU Delft (prof. dr. Geurt Jongbloed and colleagues), a statistical model was also developed, based on historical data, to establish what degree of agreement between priorities can reasonably be expected across assessments; it is, after all, human work. Finally, the differences in priorities were analyzed substantively together with the committee members involved.
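The text does not specify which agreement measures were used in the study. Purely as an illustration, the sketch below (in Python, with invented data) shows one common way to quantify agreement between two committee rankings of the same proposals: Kendall's tau rank correlation, plus the overlap of the top-20% sets discussed in the outcomes below.

```python
# Hypothetical illustration: quantifying agreement between two committee
# rankings of the same proposals. The measures used in the actual study are
# not specified in this text.
from scipy.stats import kendalltau

# Ranks assigned to the same ten proposals by two independent committees
# (1 = best); the numbers are invented.
committee_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
committee_b = [2, 1, 4, 3, 7, 5, 9, 6, 10, 8]

tau, p_value = kendalltau(committee_a, committee_b)
print(f"Kendall's tau: {tau:.2f} (p = {p_value:.3f})")

# Overlap of the top-20% sets identified by the two committees.
k = max(1, len(committee_a) // 5)
top_a = {i for i, rank in enumerate(committee_a) if rank <= k}
top_b = {i for i, rank in enumerate(committee_b) if rank <= k}
print(f"Top-20% overlap: {len(top_a & top_b)} of {k} proposals")
```

A tau of 1 would mean the committees produced identical rankings, 0 means no systematic agreement, and negative values mean reversed orderings.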
Outcomes
All of this led to the following outcomes:
- The different rankings resulting from the double-blind assessments, ordering proposals from best to worst, show differences, but also agree reasonably well in broad outline. The 20% of proposals ranked best and the 20% ranked worst are identified fairly consistently by both committees. The differences occur mainly in the middle part of the ranking.
- The statistical model from TU Delft shows that the observed differences between the priorities are generally in line with expectations. The model also shows that the agreement between the committees in the double-blind study is in all cases better than the agreement between two random rankings (see the sketch after this list).
- Differences in prioritization often stem from committee members interpreting concepts such as high-risk/high-gain differently, or from differences in how they appreciate the applicant's rebuttal.
- The observed differences in the rankings are often already visible in the preliminary priorities based on the initial individual scores, before the committees convened. Only in a few cases do group dynamics within the committee, such as one dominant member steering the discussion, appear to have played a decisive role.
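As referenced in the second outcome above, the agreement between committees was compared with the agreement between random rankings. The TU Delft model itself was fitted on historical data and is not reproduced here; the following hypothetical sketch only shows how such a random baseline can in principle be simulated.

```python
# Hypothetical sketch: comparing observed committee agreement with a baseline
# of random rankings. This is not the TU Delft model, which was based on
# historical assessment data.
import random
from scipy.stats import kendalltau

def random_agreement(n_proposals: int, n_simulations: int = 10_000) -> list[float]:
    """Kendall's tau for pairs of independent, uniformly random rankings."""
    ranks = list(range(1, n_proposals + 1))
    taus = []
    for _ in range(n_simulations):
        a = random.sample(ranks, n_proposals)
        b = random.sample(ranks, n_proposals)
        taus.append(kendalltau(a, b)[0])
    return taus

observed_tau = 0.55  # invented value standing in for a real committee pair
null_taus = random_agreement(n_proposals=30)

# Fraction of random ranking pairs that agree at least as strongly as observed:
p = sum(t >= observed_tau for t in null_taus) / len(null_taus)
print(f"P(random agreement >= observed) ~ {p:.4f}")
```

A very small fraction here would indicate that the observed agreement between committees is far better than chance, which is the qualitative finding reported above.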
Based on the research, it can be concluded that the reproducibility of assessment by committees is sufficiently high. The assessment procedure by committees within ENW-M can therefore be considered sound. Based on the results, NWO sees no reason for major adjustments to the program's assessment process, but acknowledges that there is room to further improve reliability. NWO recognizes that committee assessment involves difficult choices about accepting and rejecting applications, choices that are expected to be fully objective. Assessing applications nevertheless remains human work, in which a certain degree of chance can play a role.
Based on the outcomes of the report, NWO is focusing on improving the instructions for the assessment committees and will formulate clearer definitions of the assessment criteria. In addition, NWO will explore whether score rubrics, initially specifically for ENW-M, can help reduce differences in interpretation and definition of the criteria. Rubrics are tools for committee assessment with which a committee member can explicitly record a judgment on each aspect of an assessment criterion and so arrive at a total assessment of that criterion. NWO will work on these adjustments in the coming months.
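The text describes rubrics only in outline, and NWO's actual rubrics are still to be explored. The following sketch is therefore purely hypothetical: invented aspects and weights for a single criterion, illustrating how per-aspect judgments can be combined into a total assessment.

```python
# Hypothetical sketch of a score rubric for one assessment criterion.
# Aspects, scale, and weights are invented; NWO's actual rubrics are,
# per the text, still being explored.
from dataclasses import dataclass

@dataclass
class RubricAspect:
    name: str
    score: int      # e.g. 1 (poor) to 5 (excellent), tied to fixed descriptors
    weight: float   # relative importance of the aspect within the criterion

def criterion_score(aspects: list[RubricAspect]) -> float:
    """Combine per-aspect judgments into a total assessment of the criterion."""
    total_weight = sum(a.weight for a in aspects)
    return sum(a.score * a.weight for a in aspects) / total_weight

quality = [
    RubricAspect("Originality of the research question", score=4, weight=0.4),
    RubricAspect("Soundness of the methodology", score=5, weight=0.4),
    RubricAspect("Feasibility of the work plan", score=3, weight=0.2),
]
print(f"Scientific quality: {criterion_score(quality):.1f} / 5")
```

The design intent of a rubric is that each aspect is judged against fixed descriptors rather than a single holistic impression, which is how it could help reduce the interpretational differences the research identified.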