<template>
  <div>
    <!--begin::DocumentVQA-->
    <div class="row">
      <div class="col-lg-6 col-xxl-6">
        <b-card title="Rules" tag="article" class="mb-2">
          <b-card-text>
            All benchmark submissions must conform to the following rules,
            which are intended to guarantee fair comparison, reproducibility,
            and transparency.
            <ol class="ml-10 mt-5">
              <li>
                All results should be automatically obtainable starting from
                either the raw PDF documents or the JSON files we provide. In
                particular, it is not permitted to rely on in-house manual
                annotation or on the source files from which our PDFs were
                generated, even where such files are available.
              </li>
              <li>
                Although we provide the output of several OCR engines wherever
                applicable, participants may use software from outside this
                list. In such cases, participants are highly encouraged to
                donate their OCR results to the community, and we commit to
                hosting them alongside the other variants. Because OCR is an
                essential factor affecting the final score, submissions are
                expected to state which OCR software was used and its version.
              </li>
              <li>
                Any dataset can be used for unsupervised pretraining. Supervised
                pretraining is limited to datasets that carry no risk of
                information leakage. For example, one cannot train models on
                datasets constructed from Wikipedia tables unless it is
                guaranteed that the same data does not appear in
                WikiTableQuestions or TabFact.
              </li>
              <li>
                Participants are encouraged either to use datasets that are
                already publicly available or to release any private data used
                for pretraining. At a minimum, a private dataset must be
                described in terms of its size and creation process in enough
                detail to reproduce the results.
              </li>
              <li>
                Training on a development set is not allowed. We assume that
                participants select the model to submit using training loss or
                validation score. We do not release the test sets, and we keep
                them secret by limiting the number of evaluations that can be
                performed per day on the benchmark's website.
              </li>
              <li>
                Although we allow submissions limited to one category, e.g., QA
                or KIE, we highly encourage complete evaluations of models that
                handle all of the tasks with a single architecture.
              </li>
              <li>
                Since a different random initialization or data order can
                result in considerably higher scores, we require a bulk
                submission of at least three results obtained with different
                random seeds.
              </li>
              <li>
                Every submission must be accompanied by a description. Including
                a link to the source code is recommended.
              </li>
            </ol>
          </b-card-text>
        </b-card>
      </div>
      <div class="col-lg-6 col-xxl-6">
        <b-card title="Scoring" tag="article" class="mb-2">
          <b-card-text>
            <p>
              Anyone proposing a benchmark that aggregates varied existing
              tasks has to decide whether to unify the metrics used across the
              suite or to keep the original measures.
            </p>
            <p>
              To provide an objective means of comparison with previously
              published results, we decided to retain the metrics originally
              formulated for each task.
            </p>
            <p>
              For the overall score, we consider an arithmetic mean of the
              different metrics desirable because it is simple to calculate and
              analyze. This choice is partially justified because ANLS, F1, and
              accuracy are all interrelated variants of the same measure.
            </p>
            <p>
              To account for the different number of tasks within each type, we
              perform two-step averaging: we first average the scores within
              each category and then average the aggregated scores of the
              Document QA, KIE, and Table QA groups.
            </p>
          </b-card-text>
        </b-card>
      </div>
    </div>
    <!--end::DocumentVQA-->
  </div>
</template>

<script>
import { SET_BREADCRUMB } from "@/core/services/store/breadcrumbs.module";

export default {
  name: "leaderboard",
  components: {
  },
  mounted() {
    this.$store.dispatch(SET_BREADCRUMB, [{ title: "Evaluation Protocol" }]);
  }
};
</script>
