How to Use the Grading Tool

A step-by-step guide to effectively grade LLM generated proofs.

1

Accessing the Grading Interface

To begin, navigate to the main page. Once there, enter your Judge ID given by the organizers and click "Start Grading".

Screenshot of accessing the grading interface Fig 1: Main dashboard with "Start Grading".
2

Navigate to a problem

Once logged in, you will see a navigation bar on the left side. This shows your assigned questions. Note that your assigned questions will change on a daily basis. Do not worry if you cannot judge all questions you got assigned in one day. Clicking on any question will guide you to the problem and model attempts. Note that to see all your questions, you have to navigate the sidebar (e.g., by clicking "<- All Competitions")

The navigation bar indicates how much you progressed on each question (each of which has multiple model attempts). The color coding is as follows:

  • ❌: Model attempt has not been graded yet.
  • ⏳: Partial progress on the model attempt has been made. For example, a score was given without comment.
  • ✅: The model attempt has been graded.

While not required, we highly recommend (and hugely prefer) that you evaluate all model attempts for a question before progressing to the next question. Feel free, however, to grade the assigned questions in any order you prefer.

Screenshot file directory Fig 2: File Directory
3

Read problem

At the top of the page, you will see the problem statement with some additional metadata. Read the problem carefully. If you believe the problem statement is faulty, please follow these instructions:

  • Mark the checkbox "The problem statement is incorrect or incomplete"
  • Fill out the feedback form by describing the problem
  • Submit the issue by pressing "Save"

If you reconsider after submitting the issue, you can always uncheck the checkbox (this will automatically save the change). If another judge has submitted an issue, you will see this as a red box above the problem statement.

Screenshot of submitting a grade and feedback Fig 3: Problem statement interface.
4

Read solution

For your reference, we also provide a ground-truth solution if possible. Note that these (automatically extracted) solutions can contain mistakes, so be careful with its interpretation. The solution can guide you in understanding the problem, but there are always multiple ways to prove something, all of which should be counted as correct. If you believe the ground-truth solution is faulty, you can follow the same instructions as above to indicate this. However, for ground-truth solutions, this is not required, as they are solely for your reference and it is expected that some provided solutions will contain mistakes. If another judge has submitted an issue, you will see this as a red box above the solution.

Screenshot of submitting a grade and feedback Fig 4: Ground-truth solution interface.
5

Read model attempt

For each problem, you will grade multiple attempts. In this interface, you can navigate between attempts by selecting "Run 1", "Run 2", etc. At the top of each solution, you will see

  • problem statement (top)
  • model attempt (bottom left)
  • LLM Judge (bottom right, if available)

The LLM judgment contains a summary of the proof along with at most four issues the judge found in the proof. Some notes about this:

  • Incorrect: Do not rely on the output of the model alone to judge a proof.
  • No issues != correct proof: The model could have missed important issues.
  • Optional: Feel free to completely ignore the LLM judgment.
  • Cautious: The model was prompted to be very cautious, likely leading to overreporting small issues.
  • Clicking Cited Text: You can click the cited text to navigate to that part of the proof. The algorithm behind this is not 100% fool-proof, so might scroll you down to a wrong point sometimes.

There is a divider between the model attempt and LLM judgment that allows you to make one of them larger.

Screenshot of submitting a grade and feedback Fig 5: Model outputs interface.
6

Grade the solution

A solution should be considered correct even if it would earn 5+/7 points in a full grading. Examples of small penalties worth 1 point are if the solution:

  • Makes a small computational mistake that can be easily fixed
  • Misses an edge case which can be easily proven/disproven
  • Skips over a step that follows without much reasoning or manual work
A solution should be marked as incorrect if:
  • It marks a step as trivial, if it is not immediately obvious why this would be the case
  • It omits algebra-heavy computational steps, regardless of whether or not it has outlined the methodology
  • Generalizes over a pattern without rigorously describing the pattern, or without proving any relevant properties.
  • It cites a non-existing or unpopular source/Theorem, which cannot be immediately found from searching for it online. Thus, any theorems that can be immediately found and have a Wikipedia article are allowed.
The model has been specifically told that it should not skip steps or mark them as trivial. Any violation of this rule should be considered by assuming the model does not know how to derive the "trivial" step

7

Fill in the grading form

You will grade whether the model attempt is correct (score=1) or incorrect (score=0). Small errors in the model attempt that can be easily fixed (e.g. typos, small oversights) should be graded as correct. You are required to provide a score and feedback, describing the reasoning behind your score. You can additionally add the following:

  • Annotation: To add an annotation, select any text in the model attempt and press "+ Add annotation". Describe your annotation (e.g., first mistake, arithmetic error, ...) in the box that appears. By pressing "Save" the annotation will be stored. You can delete annotations afterwards by clicking "❌" in the box.
  • Indicate uncertainty: In the case that you are not sure about your grade, you can check the checkbox. Please explain in your feedback why you are doubting in this case.
    • Good Example: The model makes a mistake that is borderline. You believe that the mistake is small enough to award a full grade, but are not 100% sure.
    • Bad Example: You did not read fully through the model solution, and on first check everything looked fine, but you could have missed something.
  • Indicate Out-of-Depth: In the case the solution uses or requires knowledge that you do not possess or understand, you can mark this checkbox. The problem will then be assigned to another judge. You do not have to grade the solution.
    • Good Example: You were assigned a problem that requires nuanced understanding of continuity under limits, but you do know the rigorous definitions of this topic. Studying it would take too long.
    • Bad Example: The model uses a well-known theorem you did not know. You did not check on Google/Wikipedia what the statement says.
  • Indicate Tedious / Very Long Solution: Models sometimes produce very long and very tedious solutions. In this case, try your best to grade the solution, but we trust in your judgment to indicate when a solution is too long or tedious to grade. The solution will not be assigned to another judge. You do not have to grade the solution.
    • Good Example: The model tries to perform a solution using dynamic programming that contains dozens of lines of convoluted computations.
    • Bad Example: The model solves a geometric problem by coordinate bashing the solution. The calculations are long, but with some effort, you can disentangle the solution.

Press "Save" after you have made your grade. This will store your grading. Since we go for as many problems as possible, only add annotations if you think they are needed or useful, and easy to create.

Screenshot of submitting a grade and feedback Fig 6: Form interface.