Date: May 24, Tuesday, 14:00-15:00 IST

Panelists:

Alane Suhr (PhD student, Cornell)

Amanda Stent (ARR EiC)

Graham Neubig (ARR co-CTO)

Hal Daumé III

Hinrich Schütze (moderator)

Marie-Catherine de Marneffe (NAACL PC)

Preslav Nakov (ACL PC)

Yoav Goldberg (EMNLP PC)

Minutes: Abdullatif Köksal

Chair of the Business Meeting: Timothy Baldwin

Business Meeting

Question from Audience:

Q: I know the reason for having such a panel but will the consequences be shared?

A: (Tim?) The survey will be public. Through this panel we are gathering input for the decision-making process, and the outcomes will depend on that process.

 

Q: Will the results of this survey be public?

A: (Hinrich) They will be public. (addition after panel: subject to data privacy considerations)

 

Q: There seem to be a couple of different ways of giving feedback on ARR, for example the thread on OpenReview, and somehow things get pushed from OpenReview to ARR, or the other way around. Where is the line between what the OpenReview software is doing and what ARR is responsible for, and how can we keep that transparent and keep track of it?

A: Tim says that Hinrich will respond to that.

 

Online Question from Raj Dabre: Perhaps a bad question so I apologize in advance: What are the arguments against making reviews and reviewer names public? Would it make people feel more accountable, objective and less mean? Would it deter people from reviewing?

A: This is a question about unblinding reviewers, i.e., the ICLR model vs. the ACL way of using OpenReview. There is a strong commitment to double-blind reviewing in our community.

Hinrich: A junior person might not feel free to write an open review of a paper. It’s about protecting the reviewers. It’s also orthogonal to rolling review, so the ACL Exec can certainly change it if that’s the general feeling in the community.

A: The results of the original reviewing survey show a strong commitment to double-blind reviewing. Unless that has changed radically over time (...), and I don’t think it has.

 

Q (Andre Martins): It’s important to distinguish between making the identity of reviewers public and making the reviews public. Reviews can be made visible, as at other ML conferences. In my experience that was beneficial; reviews can be helpful without revealing reviewer identities.

A: Thanks for the clarification. We should discuss this more broadly. Maybe we might not release reviews publicly. Iryna Gurevych's group conducts research and develops tools to support the reviewing process. Releasing reviews vs. releasing reviewer identities are somewhat orthogonal questions. There are also questions about deanonymizing reviews, and that is a bit of a gray area. But happy to have a more general discussion if there are more questions about this.

 

Q: I was hoping that we would talk more about the Professional Conduct Committee. Do you see any future direction for it, beyond the numbers that you can find in the reports?

A (Emily Bender): I’m no longer co-chair of the PCC, but this question might be for me. I don’t know much about the future plans of the PCC, but it’s beneficial to have a body that people can come to. This work is best done with a lot of privacy, so you’re not going to hear details about what happened with cases, to protect everyone involved. If we shared details publicly, people would have a harder time coming forward, because they couldn’t be sure of privacy. For those reasons, it’s difficult to tell you anything specific, but it’s beneficial to have a body there to speak with.

 

Tim: Now, we’ll hand it over to the panel.

Hinrich: Thanks everyone for attending. [Introducing all attendees and starting with slides]

 

My personal perspective: One really important question is how we frame this discussion. What are the big issues here? There are three: common infrastructure vs. a fragmented system, rolling review vs. from-scratch review, and ARR vs. other forms of rolling review.

 

  1. Common infrastructure vs Fragmented system:
    Common infrastructure: Built by Graham Neubig and his team; it keeps track of reviewers and submissions. I think it’s really a game changer for future review quality, because now we have statistics about how good reviews are and whether they are getting better, and we can improve things based on that evidence, for example by mentoring and by encouraging good reviewers to keep reviewing. We can also reward and incentivize good reviewers, which matters because people are tired of reviewing, due to COVID and to sheer volume.

    Fragmented system: Each venue manages reviewing separately. I don’t think that’s a good way to go.
  2. Rolling review vs From-scratch review
    Rolling review: It means revise and resubmit. The advantages are shorter cycles, less reviewer effort, and hopefully reduced overhead in general.

    From-scratch review: In the case of bad reviews, a reset with completely new reviewers is easy (this can also be made possible within ARR). Individual venues also have more control.

  3. ARR vs Other form of rolling review
    ARR: It has been much more challenging than I anticipated. But there’s a solid plan going forward now. If we are going with rolling review, then we should probably go with ARR.

There’s the survey that Tim mentioned. [The survey link is shared in the chat.] It has been widely distributed on social media. I encourage all of you to participate in the survey; we need your input to know what your opinions are.

Agenda

Opening statements
Panel discussion
Questions from audience

 

Opening Statements

Amanda Stent: I’m one of the founding co-editors-in-chief of ACL Rolling Review, and previously a member of the ACL Reviewing Committee.

[Starting with a brief history of ACL]. In 2019, the ACL Exec appointed an ACL Reviewing Committee. The problem statement was the rapid growth in submissions and the increasing popularity of preprints, which were causing several problems in the ACL reviewing system.
In 2020, the committee put forward two sets of proposals. The short-term proposal was Findings, which at first many people hated and now love, and which has an incredibly high h-index. Other short-term items were best reviewer awards, which are happening at ACL, and reviewer training.

 

The long-term proposal was ACL Rolling Review. That summer there was a community survey, and discussions in the ACL business meeting and on social media. In the following months, the ACL Exec recruited a CTO for ARR, that’s Graham, and the editors-in-chief. ARR started in March.

ARR currently supports the following:

      Revise and resubmit. Sticky reviews.

      Reviewer training and reviewer recognition.

      Anonymous preprints through OpenReview; more than 900 papers have opted into this option in the last year.

      Comprehensive review statistics and submissions-over-time statistics.

      Responsible NLP research and ethics-review training. Review mentoring and training.

 

[Thanks to the volunteers, reviewers, and authors.]

 

I want to share statistics coming from Goran. You can find more statistics in the Q1 quarterly report.
ACL and NAACL 22:

ACL:
Most papers were committed to ACL within the last three months before the ACL deadline.

11.6% were revised once (i.e., submitted twice).

The average review score increased by 0.34 and the average meta-review score by 0.78.

The same pattern is observed for Findings papers.

 

NAACL 22:

27% of the papers accepted at NAACL were revised once, and 1% were revised twice (i.e., submitted three times).

The average review score increased by 0.5 and the average meta-review score by 1 point.

Similar patterns are observed for Findings papers.

Sometimes scores go down, but on average they increase.

Through revise-and-resubmit, we see that we have reduced reviewer overload: reviewers don’t need to spend as much time reviewing a paper the second time. When authors request different reviewers, the new reviewers can still see the reviews from the first round.

For ACL and NAACL, the average scores are similar, showing consistency.

 

We try our best to make sure that reviewers have published 5 papers in the last 5 years, so that they are current experts. But ultimately it’s up to all of us to make sure review quality is high: we all have to do our share.

 

Graham Neubig, CTO of ARR:

 

Our platform is now OpenReview. We run our ARR scripts in each cycle, together with the tech team. This takes about 15-20 hours of work every month, including gathering all of the data from the scripts, pulling data from previous months, and so on. On top of that, when we implement new features we write the code and fix the bugs. It is a lot of work from volunteers, mostly PhD students, but hopefully it gradually gets better. We’re also discussing with OpenReview whether we can work together with them. If anyone likes reviewing and coding, we could also use more tech team members.
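[Minute-taker's note: for a rough sense of what such a per-cycle script involves, here is a minimal illustrative sketch, assuming the openreview-py client; the invitation ID and credentials below are hypothetical placeholders, not ARR's actual configuration.]

    # Illustrative only: pull one cycle's submissions and print a quick summary.
    # The invitation ID and credentials are placeholders, not ARR's real setup.
    import openreview

    client = openreview.Client(
        baseurl="https://api.openreview.net",
        username="arr-tech@example.org",  # placeholder account
        password="********",              # placeholder password
    )

    # Fetch every submission note under a hypothetical monthly-cycle invitation.
    submissions = client.get_all_notes(
        invitation="aclweb.org/ACL/ARR/2022/May/-/Submission"
    )

    print(f"{len(submissions)} submissions this cycle")
    for note in submissions[:5]:
        # Each note's content dict holds the metadata entered at submission time.
        print(note.id, note.content.get("title", ""))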

 

Hal Daume:

I’m Hal and I want to thank the organizers, because I’m not officially associated with ARR. I think the direction ACL reviewing is currently headed in is buggy in both design and implementation, and it disadvantages interdisciplinary work. I currently believe we should stop, take a step back, and turn off ARR. To clarify, I’m critical of the system, not the people. Thanks to them for their enormous efforts!

 

The Future of Reviewing document, coordinated by Hinrich, includes high-level objectives covering aspects that are important to the ACL community. On the surface these are rather unobjectionable; a large part of the design problem is how to resolve the tensions among them.

 

Example for tension:

Timeliness: Authors want timely, good reviews, while reviewers want time to write reviews.

Fairness: In a published, IRB-approved randomized controlled trial, we found that marking a paper as a resubmission leads reviewers to score it 0.75 points lower on a ten-point scale.

 

Resolving tensions is an important aspect of this.

 

Design Proposal:

  1. Turn off ARR.
  2. Year 1: Develop the compute infrastructure to support the system. I would use semi-structured interviews instead of surveys: surveys mask these tensions, but you can identify them with semi-structured interviews.
  3. Year 2: Trial with one conference and a few workshops.
  4. Year 3: Turn it off and redesign based on the unknown-unknowns discovered.
  5. Then consider deploying to all *ACL conferences

 

I would identify direct and indirect stakeholders in the first year and conduct semi-structured interviews to identify benefits and harms, centered on human values.

 

Agency:

Agency is at the core of many complaints about ARR. Focus on three different stakeholders.

Authors, Reviewers, (S)ACs/PCs with respect to Agency:

Authors -> Can they decide which venue is most suitable for their paper?

Reviewers -> Can they control when and what they review, and how much they are involved in the final decision?

(S)ACs/PCs -> Can they control the reviewing process at their conferences? Can PCs try new things, like an additional page?

 

Preslav Nakov (ACL Program Chairs):

This is the first conference that adopted ARR as the only way of submitting. We didn’t have an obligation to do that; we could have run a hybrid mode. But after discussing this with ARR and the NAACL PC chairs, we thought it would be very useful to use ARR. I would like to share the results of our survey of senior area chairs.

SAC Survey

      Quality of reviews:
A majority of SACs (62.5%) are positive and generally satisfied with the reviews; 27.5% are negative.

Comments:
Make at least one of the reviewers a senior researcher.
No two reviewers from the same institution.

These issues are fixable. Also, we have lots of positive comments.

      Scoring scale:
Every conference changes the reviewing scale, and we had our own. We wanted to check whether it is useful for reviewing.
67.5% positive, 12.5% negative.
Most people are happy. Some people said they had to read the review text to determine the right score. This might be the case at other conferences too.

      Meta review scoring:
52.5% positive, 25% negative.

Most people are happy, but there is a mismatch between the review scoring scale and the meta-review scoring scale.

      (Lack of) Direct Communication with Action Editors and Reviewers
The question is how much this impacted decisions.
52% of SACs said they were not affected; 25% said they were.

      Are Comments to the SACs useful?
They were found useful 65% of the time.

      Lack of areas
The lack of areas didn’t impact decisions (68.4% agreement).

Some comments said that reviewers didn’t know how to evaluate papers without areas. If we keep areas, it would be easier to calibrate scores.

      User friendliness of OpenReview
47.5% (positive) vs 37.5% (negative)

      Two-stage review process
Reviewing in ARR, decisions at ACL.
58% of SACs like it.

      Quality of reviews is better or worse than past:
50% same
27% worse
14% better

On average, it is about the same.

 

Final recommendations:

Clear division of labor between ARR and PC chairs of *ACL conferences.

Areas

Main *ACL conferences should be the focus, rather than workshops, at least in the beginning

Guaranteed reviews by a deadline

Dedicated OpenReview person would be helpful

We need more human involvement and more automation

 

Marie-Catherine de Marneffe (NAACL Program Chairs):

I want to echo what Amanda said. All the volunteers worked very hard, and all reviews were on time for more than 5000 papers. There were lots of complaints, but the outcome was good in the end. But this panel is about the future of reviewing, so let’s focus on that.

 

On agency, I agree with Hal’s point. We need to keep in mind that when a paper is rejected from one venue, the burden of reviewing shifts to another conference.

 

Alane Suhr:

I’m Alane, a 6th-year PhD student at Cornell. I have reviewed for ACL conferences and workshops, organized workshops, and was an AC at NAACL last year. I actually haven’t touched ARR before, so I don’t have experience with it.

I appreciate having back-and-forth dialogue as a reviewer and as an author. But as a reviewer or AC, discussions sometimes go back and forth too much; a balance is needed.

I like the structured review form in ARR, which I find helpful; at NeurIPS, for example, they have open-ended text boxes without much structure. As a reviewer, I like to bid on papers, and I also like automated paper-matching scores. I do think the track system has been nice, but it’s a bit tough for multidisciplinary work; some flexibility would be nice.

 

As an AC and as a reviewer, I like back-and-forth conversation between ACs and reviewers, and between reviewers and authors.

 

Yoav Goldberg:

I like the comments on agency. I didn’t like Hal’s proposal of building some other platform from scratch; that is very hard to design. ARR’s problems are not a failure of principle: the important part is the details and how to get them right. The main issue with ARR is the details, and we can’t really design a new system from scratch and make it work. I think the process should be much more gradual than it is now.

 

Hinrich: Is there someone who wants to respond to Hal's proposal?

No? Let’s start with questions.

 

Question/Answer Session from Audience

Question (Pascale Fung): I’d like to address the past comments about the process. We are trying a new system from scratch, but OpenReview and the infrastructure were not ready; this has never been done before. In a way, that has been very stressful for everyone involved. To some extent I agree with Hal: I advocate pausing the process. But a lot of technical work has been done; perhaps we need to build it better, but we need more support from OpenReview. We also need support from the ACL Exec, actual financial support, to implement the ARR process. It cannot be a manual process.

 

For example, in the ethics-review process, we cannot make other people implement things for us. It’s not a process problem, it’s an infrastructure problem, and I believe we should focus on that. Under Amanda’s leadership, we’re improving the ARR process. I don’t know if we should pause or not; that is almost impossible, because this ARR process gives us the experience. I find the agency issues are not particularly related to ARR, and timeliness is largely an infrastructure issue. I don’t see your point that ARR reduces agency: authors can choose which conferences to submit to under ARR too; that is implemented now, and it was a technical issue. We should focus on the technical part.

Answer:

Yoav: I think it’s not only infrastructure. There are many small details in ARR that should and could be discussed, and in something more formal than 20 minutes in a panel. Tracks or no tracks, author response or no author response, what kind of cycles, how much it is decoupled from the conferences: these are not infrastructure issues. The people in ARR have a certain ideology, but this discussion needs to be opened up to other people.

Amanda: There’s a survey which you are encouraged to fill out. It addresses all of your questions.

Yoav: It’s good, but it’s still not a process. In the end, the people who make the decisions are three people on the ARR team. We need a broader structure in place to support broader discussions, not only among a few people; I don’t think I necessarily reflect what people want either.

Hinrich: I don’t understand that comment because survey results will be public.

Yoav: Who decides the questions in the survey? Who acts on the results? There should be more people in this process than the people currently running ARR. It should have a wider audience.

Graham: I’d like to point out that the ARR process was based on the first survey. There were lots of parameters that were underspecified and that didn’t become clear before running the process. I basically agree with Yoav: we should take a survey. If there are concerns about the power of the ARR editors-in-chief, then the ACL Exec can decide the parameters that were previously underspecified. The ACL Exec is elected by the membership; if the membership doesn’t like them, it can vote them out.

Marie: I thought that was the plan: the ACL Exec and the ACL Reviewing Committee, this panel, recommendations from the NAACL and ACL program chairs. We are getting input from people, and then decisions will be made. Maybe Hinrich can explain what the plan is.

Hinrich: The survey contains all the decision points from the ACL and NAACL program chairs. The survey can then tell us what the community wants, and we can make changes. In the worst case, we can stop the experiment. I think there’s a clear decision process defined.

Amanda: We have already made a large number of changes to ARR in response to feedback we have received from the community: we changed the review forms and meta-review forms, added author response, added senior area chairs, added ethics reviewing, and so on. We faithfully execute what the community decides.

 

Question (Alexander Koller?): I really liked the idea of rolling review; thanks to the organizers. You took a big risk, and I think as a community we will be better off in a few years.
I would like to pick up on the agency point that Hal raised. It’s not a secondary problem, it’s the main problem: agency makes reviewers better. As a reviewer, I would like to be assigned related papers. With areas, the risk of bad assignments was lower. Back in the old days we even had bidding, where there was actual agency, and reviewers made good contributions. As a reviewer, it’s really good to be asked by an AC, “Can you review for my track at this conference?” It makes me feel like I have a choice, and the AC builds a personal relationship with me; when I’m late, they can actually ask me in a real conversation. The subtleties of how the reviewing system is set up, beyond technical details, really matter for the interpersonal relationships among reviewers, ACs, and authors. I’m advocating that we give reviewers the feeling that they’re involved in the process. High review quality cannot happen without agency.

Answer:

Preslav: Agency is very important to us as PC chairs. We made a commitment to authors that any paper submitted by a specific deadline would have reviews and meta-reviews ready for the ACL acceptance decision. ARR actually gave us this agency: we could go in and supervise the entire process. We said that we want our ACs (if they want to) to also be inside the rolling review, and we were able to implement many things. Hal’s remark about agency is important. To me, if a conference agrees to use ARR only, the PC chairs have to be in the same model; of course the final decision will still be within ARR, but they will be very involved. If a conference goes with the hybrid mode (like EMNLP), they probably don’t need that agency.

 

Amanda: Agency is a fine thing, and the intent seems good, but agency is also a function of privilege and access. I’ll just give two examples.

When an Action Editor invites reviewers, they invite people that they know. That may unintentionally exclude (by geography, by affiliation) a large portion of our community.

 

Agency is great as long as everyone intends and actually behaves with integrity. You can refer to the Communications of the ACM article by Michael Littman from 2020 for an example of how peer reviewing can be subverted with ill intent very easily. We must balance agency with the integrity of our process and equitability (accountability?). These are all good values that must be balanced.

 

Question (from Chat): If the reviewers are changed between resubmissions, what good is it?

Answer (Hinrich): It’s a problem, but sometimes reviewers are not available and you have to assign another reviewer.

Question (from Chat): Why do the SACs and ARR not contact the ACs with a clear deadline, as was done before ARR?

Answer (Hinrich): That’s probably a problem that can be worked out.

 

Question (from Chat): How do you measure the quality of reviews?

Answer (Goran Glavas): We are very soon going to launch review feedback forms, through which authors will be able to give direct feedback on several aspects of review quality. It also depends on the interaction with OpenReview for them to make the form available, but it’s about to happen.

 

Question: I think, when we judge the SAC responses, we need to distinguish ARR from the usual ACL reviewing. In ARR, the main goal is to help improve papers: what’s wrong with the paper and what’s good about it, independent of acceptance or rejection. In my experience, the feedback is about how to improve the paper. For example, if a paper needs a small fix, in the normal process you can still promise to fix it by the camera-ready deadline; in ARR, the reviewers can say, “Fix that experiment and submit again.” That distinction should be considered in the ARR process.

 

Question: I think ARR reviewing is implemented in a way that is inconsistent with reviewing for conferences.

Even though ARR instantiates a model quite different from our traditional conferences, we still have PCs, SACs, and ACs. When there’s a mismatch, it’s very difficult to check hundreds of different meta-reviews. The conferences have to change their reviewing processes so that there is less awkwardness between recommendations made in one system and decisions made in the other. I find it extremely difficult.

Answer:

Hinrich: That’s one of the fundamental design choices in ARR. Reviews are not custom-made for one conference. So, I think there’s going to be a tension between the output of rolling review and what the conference needs.

 

Graham: I just have one technical comment. Sebastian Riedel has recently been working on recruiting senior action editors for ARR. The senior action editors have a consistent set of action editors below them, so certain senior action editors are in charge of certain action editors. That’s one of the strengths of the OpenReview system. If there is an inconsistency, it can be resolved by the SACs of the conference, for example. I think that’s something at least technically fixable.

 

Amanda: What Graham is saying is that we will have something like tracks. And also, for the first time in my career, we will be able to see review and meta-review scores over time.

 

Hinrich: Thanks to the ARR team and the ACL and NAACL PCs. It has been a difficult time for them, but we have a great conference here. Please fill out the survey; we will take it from there.