One paper, nine reviews


As researchers, what can we do about rejections of our papers that we justifiably consider to be unfair? What can we do about the lack of accountability that our current system allows?

This year, I submitted a paper to the Machine Translation tracks of three 2019 conferences (NAACL, ACL and WMT). The 9 reviews I received differed wildly, not only in their scores (4, 5, 4, 1.5, 2.5, 1.5, 5, 4, 2), but also in their quality and in the conscientiousness of the reviewers. I was so shocked by the absurd variability in the reviews that I felt it was important to share this experience, parts of the reviews and what I would personally like to see changed in the reviewing process. I know this has been discussed before and I am not suggesting that this is a new topic. However, I hope that this drop in the ocean can help the wave of change that is long overdue in the reviewing process.

N.B. I would have gladly shared the reviews in full, but am concerned that this violates current ACL policies.

Some background on the article itself

To provide some background to the reviews, the paper is a resource paper, presenting a novel dataset containing spontaneous written dialogues between English and French speakers, mediated entirely by Machine Translation (MT) systems. The participants provided fine-grained annotations of perceived MT quality (at the sentence level) as the dialogue was taking place, and manual reference translations were produced for all sentences once all dialogues had been collected. We focus on the collection method itself as a way of collecting and comparing manual judgments of MT quality, and also on the qualities of the resulting corpus, which can be used as a test set for the MT of dialogue (and MT in general), but also for the analysis of spontaneous bilingual exchanges. The paper can be read in full here and the corpus is freely available here. Please feel free to check it out and use it! And although this post focuses on the problems of the reviewing system, I would like to thank those reviewers who did give us some good feedback and constructive criticism.

Here is a summary of the reviews:

Conference #1: NAACL2019

Summary: Frustrating rejection but interested and serious reviewers.

        Decision:                               Reject
        Overall scores:                     4, 5, 4 (out of 6)
        Meta-review:                        None
        Reviewer-bidding phase:     Yes

The reviewers were generally positive and considered that it was a novel resource. For example, reviewer 1 cited the following strength (amongst others):

the methodology used is replicable and it presents a novelty on data creation and machine translation assessment

and reviewer 3 started by saying:

This is a very interesting paper

They gave some good constructive criticism, which I tried to take into account in the subsequent submissions of the article (space permitting). NAACL was particularly competitive this year and I was not alone in being rejected with high scores. The fact that papers with lower scores than this got accepted (possibly in other tracks) was frustrating, particularly due to the lack of meta-review. However, I understand that scores are not always everything (and thankfully so in some cases).

Conference #2: ACL2019

Summary: Insultingly low scores from reviewers who consider resources to be sub-research.

        Decision:                                Reject
        Overall scores:                      1.5, 1.5, 2.5 (out of 5)
        Meta-review:                         Yes
        Reviewer-Bidding phase:      No

The main criticism here (completely different from that of the NAACL reviewers) was that the research was not novel, and the meta-review basically repeated this criticism:

  • no novel research (Reviewer 1)
  • there is no novel research discussed in this paper (Reviewer 2)
  • not propose novel ideas (Reviewer 3)

I should add that the reviewers did not actually point out resources that do the same thing. Moreover, the following comment by reviewer 3 does suggest that the problem is not so much one of scientific novelty per se, but that the reviewers do not consider resource construction to be a valid scientific contribution:

The paper is mostly a description of the corpus and its collection and contains little scientific contribution

This was echoed by the reviewers’ calls to submit to LREC, stating its lack of suitability for ACL:

This paper is not suitable for ACL in my opinion..  It is very suitable for LREC and for MT specific conferences and workshops.

I have heard of many similar cases from other researchers trying to publish resource papers. Not only are resources extremely valuable for everyone, but they also demand scientific skill to design and construct. The problem is that certain members of the community seriously underestimate the work involved in resource creation. It was clear to me that the three reviewers were not in my target audience and were interested neither in corpus creation nor in the importance of such resources. The three positive NAACL reviews (from reviewers who had actually bid on the paper) indicate that the opinions expressed in these ACL reviews were certainly not unanimous.

On a side note, these were not just average rejection scores (a paper nowadays can be reliably rejected with scores around 3). These low scores are ones that I would usually reserve for papers that are REALLY bad: papers that contain tonnes of false facts, are totally unclear or are intentionally misleading, and have little to offer the community. Given the very positive reviews from the other conferences (and even the textual content of these ACL reviews), I assume that the paper is far from falling into this category.

The main difference between the NAACL and ACL reviewer setups was the lack of reviewer bidding for ACL. The reviewers assigned to the paper did not choose to read my article and therefore were most likely not the right people to review it (particularly dangerous for a resource paper). There was a resource track at ACL (proof that resource papers are meant to be welcome). I had consciously made the decision to submit to the MT track rather than to resources, because I thought that:

  1. those who would be most interested in the article would be MT researchers
  2. my intended audience would be in the MT track rather than the resources track, since reviewers were encouraged to review for one track only
  3. the positive NAACL reviews showed it was relevant to the MT track

In hindsight, this was probably a mistake (the resource track seemed much more welcoming and conscientious).

Instead, the 3 reviews for the paper (plus the meta-review, which does not pick up on the anti-resource bias) gave 3 incredibly low scores, totally out of sync with the previous positive reviews from NAACL (and the subsequent reviews from WMT). I know that randomness due to subjectivity does exist, but the incredible divide between the scores of these two conferences illustrates how damaging it can be for a paper not to be assigned the right reviewers.
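
To make the divide concrete, the overall scores from the two conferences can be put on a common scale. The snippet below is only a rough sketch: it assumes that both scales start at 1 (the figures above only give the maxima, 6 for NAACL and 5 for ACL) and rescales each score linearly to the range 0 to 1. The function name `normalise` is purely illustrative.

```python
from statistics import mean


def normalise(score, max_score, min_score=1):
    """Linearly rescale a score to [0, 1].

    min_score=1 is an assumption: only the maximum of each scale is known.
    """
    return (score - min_score) / (max_score - min_score)


# Overall scores as reported above.
naacl = [normalise(s, max_score=6) for s in (4, 5, 4)]
acl = [normalise(s, max_score=5) for s in (1.5, 1.5, 2.5)]

print(f"NAACL mean (rescaled): {mean(naacl):.2f}")
print(f"ACL mean (rescaled):   {mean(acl):.2f}")
```

Under these assumptions, the NAACL mean comes out at roughly 0.67 and the ACL mean at roughly 0.21, for the same paper.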

Conference #3: WMT2019

So the paper was resubmitted to WMT (I did not resubmit to EMNLP due to their restriction on double submissions this year).

        Decision:                                Reject
        Overall scores:                      5, 2, 4 (out of 5)
        Meta-review:                         None
        Reviewer-Bidding phase:      No

Two very positive reviews (reviewers 1 and 3) came out of this review process (also with some constructive criticism). They got why the resource would be useful. Reviewer 1 found it

interesting for several use cases

and thought that the

[d]escription and experimental evaluation are very thorough

Reviewer 3 thought that:

The paper is well-writen and the contributions are clear. The paper itself contains valuable analysis and the corpus provides opportunities for future research.

The scores for individual factors (relevance, soundness, originality, impact, clarity, meaningful comparison) were similar for these two reviewers. The three values on each line below are those of reviewers 1, 2 and 3 respectively:

Relevance (1-5):                      5, 4, 5
Soundness / Correctness (1-5):        5, 3, 5
Originality / Innovativeness (1-5):   4, 2, 3
Impact of Ideas or Results (1-5):     4, 2, 3
Clarity (1-5):                        5, 5, 5
Meaningful Comparison (1-5):          5, 1, 5
Overall Recommendation (1-5):         5, 2, 4

However, reviewer 2’s review sticks out like a sore thumb. It is short, so here it is:

This paper presents a new English-French test set for the evaluation of MT for informal, written bilingual dialogue. It contains 144 spontaneous dialogues with 5,700+ sentences. The motivations are (i) a unique resource for evaluating MT models, and (ii) a corpus for the analysis of MT-mediated communication.

The idea of building a corpus with typical errors by MT is not bad. However, the contribution of this paper is only this small corpus… This would be a too small contribution so as to present at WMT. In my opinion it would need more technical contributions. It is encouraging to see new methods, which are described at Section 5, developed.

The summary part of the review (its first paragraph) is actually a copy-paste of the paper’s abstract, with a couple of nouns replaced by “it”. So I have no qualms about copyright in this respect, since it is actually copied from the article. The review itself is shockingly dismissive and not of a quality I would expect from any ACL or ACL co-located conference. The reviewer’s inability to summarise the article and the vagueness of the remarks suggest that the reviewer probably did not read the paper at all. Moreover, the scores are totally inconsistent with those of the other two reviews. For example, instead of the score of 5/5 that the other two reviewers gave for ‘meaningful comparison’ (the paper does contain a related work section), reviewer 2 gives it a score of 1. If I had seen this review as a chair, I would have immediately dismissed it as a review that should not be taken into account. And although it may not have been the only reason for the paper’s rejection, I am incredibly concerned that this review did indeed have an impact on the final decision.

The paper was rejected without meta-review.

What is the effect?

Most academics have felt the frustration of being rejected due to unfair or inconsistent reviewers. As time goes on, I personally get the impression that the current system of peer review, flooded with biased, dismissive and lazy reviews, is more broken than ever. We should all be talking about this. Bad reviewing practices lead to unjust decisions that do not reflect good scientific judgment. These decisions can have a direct impact on the future of (especially young) researchers, whose future permanent positions may depend on having an extra article accepted (or not) in a rank A conference. We talk a lot about creating a welcoming and non-offensive environment at conferences (e.g. anti-harassment policy), which is great, but we seem to be ignoring the effect that unprofessional and closed-minded reviewing can have on researchers, both in the short term (on their mental health) and in the long term (on their future employment).

It is also a waste of time for both authors and reviewers. Doing research (and in particular creating resources) is time-consuming, and because of anonymity restrictions, this specific research paper and corpus have been unreleasable for over half a year. Constant rejection due to unjust reviews prolongs the time during which these resources cannot be used, and forces the author into an endless loop of resubmission (or into abandoning the research and the hope of publication).

Reviewers are wasting their valuable time re-reviewing papers rejected because of other reviewers’ inadequate reviews. The work has been judged valuable by a number of reviewers, but rejected because a couple of others are closed-minded or have insufficient integrity, and the same paper must again be reviewed by 3 more reviewers in the next cycle. I have personally reviewed for several conferences this year (including ACL and WMT) and spent at least several hours on each review. Some may say that this is too much, but I consider it important to provide fair and thorough feedback to the best of my ability. All it takes is a single reviewer who is dismissive, inaccurate or in some cases does not actually bother to read the paper, for a paper to be rejected.

What can the solutions be?

  • Appreciation of diverse research topics (and in particular of resources):
    • Create a space for resource papers within specific research tracks rather than relying on a separate track. Those who are most likely to use the resources under review are very likely to be reviewing in specific tracks (such as MT). Flagging these articles as “resource papers” could help assign reviewers who are competent and willing to review these papers within topic-specific tracks.
    • Provide some rough guidelines for reviewing different types of papers. What do we expect from an experimental paper, from a resource paper, etc.?
    • Keep the bidding phase. Automatic assignment of articles in large tracks such as MT leads to a fairly random assignment of articles, particularly for younger reviewers who do not have a backlog of articles to be used for automatic assignment. For ACL, I reviewed 5 very diverse papers, none of which were ones I would usually have chosen. As a consequence, the reviews took me far longer than usual, since I had to read up on the topics in order to provide a more informed review. My own paper was assigned reviewers who were totally uninterested.
  • Improved transparency of the review and decision process:
    • I am more and more in favour of OpenReview. Having reviews available to everyone would make identifying unfair or poor reviews easier, and anyone who is interested could reply to unfair reviews.
    • Meta-reviews provide some insight into decisions made but could be made far more systematic – I only received a meta-review from ACL (that did not pick up on the extreme nature of the reviews). For WMT, I did not receive such a meta-review, despite the highly diverging scores.
    • A clear indication of which arguments led to the decision being taken. Do area chairs ignore reviews that are clearly inadequate, and if not, why not? These should be clearly identified as dud reviews and this information fed back to authors as well as to the reviewer in question. If a reviewer was told that their review was considered inadequate/inappropriate, this could make them question their attitude towards the reviewing process.
  • Accountability of reviewers:
    • Should reviewers be deanonymised? This is clearly a big change that also has its drawbacks. However, it would encourage reviewers to be constructive in their reviews and discourage shoddy ones. It could also encourage interesting discussion between authors and reviewers at conferences.
    • How about reviewing the reviews? OK, not a full-scale review, which would be unfeasible, but unfair or incorrect statements should be flagged up, and in many cases the best people to do this are the other reviewers of the paper (I realise that this would not have saved my paper from the anti-resource reviewers at ACL, but still…). A system is already in place for inter-reviewer discussion (but this often happens ages after the reviews are completed). Reviewers could systematically receive an email update when another review is added, along with a prompt to “review” the new review (particularly if the scores are different). Simply indicating which parts of the review they agree with or heavily disagree with would provide a lot more clarity for area chairs and help catch out those dud reviews or any unreasonable comments within a review.
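
As a rough illustration of when such a prompt to “review” a review could be triggered, the sketch below flags criteria on which the reviewers diverge strongly and points to the reviewer furthest from the median. It uses the WMT scores reported earlier in this post. The function name `flag_divergent` and the threshold of 3 points are hypothetical choices made for illustration and are not part of any existing conference system.

```python
from statistics import median

# WMT scores per criterion, in the order reviewer 1, reviewer 2, reviewer 3.
wmt_scores = {
    "Relevance": (5, 4, 5),
    "Soundness / Correctness": (5, 3, 5),
    "Originality / Innovativeness": (4, 2, 3),
    "Impact of Ideas or Results": (4, 2, 3),
    "Clarity": (5, 5, 5),
    "Meaningful Comparison": (5, 1, 5),
    "Overall Recommendation": (5, 2, 4),
}


def flag_divergent(scores, threshold=3):
    """Return {criterion: reviewer number} for strongly diverging criteria.

    A criterion is flagged when the gap between its highest and lowest
    score is at least `threshold` (an arbitrary, hypothetical cut-off).
    The reviewer reported is the one furthest from the median score.
    """
    flagged = {}
    for criterion, values in scores.items():
        if max(values) - min(values) >= threshold:
            mid = median(values)
            outlier = max(range(len(values)), key=lambda i: abs(values[i] - mid))
            flagged[criterion] = outlier + 1  # reviewers are numbered from 1
    return flagged


for criterion, reviewer in flag_divergent(wmt_scores).items():
    print(f"{criterion}: reviewer {reviewer} diverges from the others")
```

With these scores and this threshold, the two flagged criteria are “Meaningful Comparison” and “Overall Recommendation”, both pointing to reviewer 2. A real system would of course need a human (the area chair or the other reviewers) to judge whether the divergence is justified.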

Yet again, I know many of these have been discussed before. But I do think that we cannot let these things go unnoticed. I have not experienced such severe problems for a single article before: the inconsistency of the reviews, the bias of reviewers assigning undeservedly low scores and a lazy reviewer who actually has an impact on the rejection of an article. This is not something that just happened once, and it will continue to happen unless the community changes the way we allow poor reviewers to get away with it.
