Definition

Evaluation tools allow the performance of an AI app submitted to a challenge to be measured.

An evaluation tool artifact is typically stored in a file named evaluation_tool.yml.

The schema of this file is described below.

Bonseyes AI App Evaluation Protocol Metadata

The file must contain an object with the following properties (no additional properties are allowed):

  • metadata : Metadata for AI Artifacts

  • interface (string) : ID of the AI Class for which this evaluation protocol has been designed

  • image (string) : Name of the docker image of the tool

  • parameters (string) : Path to the schema file used to validate the tool parameters

  • input_datasets (object) : Input datasets expected by the tool. Each entry is keyed by the dataset name and is an object with the following property:

      • description (string) : Description of the dataset

  • metrics (object) : Metrics produced by the evaluation protocol. Each entry describes a metric and is an object with the following properties (no additional properties are allowed):

      • title (string) : Short description of the metric

      • description (string) : Detailed description of the metric

      • unit (string) : Unit of the measurement

  • is_generic_tool (boolean) : Generic evaluation tools can be used for different challenges (of the same class). Such a tool can lack some properties that exist in the challenge manifest where the tool is referred to (e.g. a generic tool does not have to provide slots for the datasets to be used, while these can be "overridden" by a concrete challenge and the datasets defined there).

Create Evaluation Tool

This guide assumes you have already completed the prerequisite setup steps.

An evaluation tool is a piece of code that is able to test the performance of an AI app running on a target hardware platform. To do so it sends data to the AI app using an HTTP API exposed on the target and performs various measurements. It then stores the results in an output directory. The results must minimally include a JSON file with a series of metrics but can also include, for instance, a PDF report. Evaluation tools are usually developed together with challenges to express the acceptance criteria for the AI apps that answer them.

At its core an evaluation tool consists of an executable (or script) that runs on the developer workstation and is capable of reading data, sending it to the AI app via an HTTP API, collecting the results and building a report. In order to guarantee portability this script is packaged inside a docker container so that all its dependencies are available on the system that will run it.

To create an evaluation tool three steps are necessary:

  1. Create the evaluation tool script

  2. Dockerize the evaluation tool script

  3. Create the evaluation tool metadata

The rest of this page describes these steps.

Create the evaluation tool script

The evaluation tool script must be an executable that accepts the following command line parameters:

  • --output-dir [path to directory] : path to a directory where the results must be stored

  • --target-url [url] : URL where the AI app HTTP API is exposed

  • --dataset-dir [path to directory] or --dataset-dir [dataset name] [path to directory] (optional) : path where the input datasets are accessible; the first syntax is used if only one dataset is passed, the second if multiple datasets are passed.

  • --parameters [path to JSON file] (optional) : path to a JSON file containing the parameters for the script provided by the challenge writer

  • --cache-dir [path to directory] (optional): path to directory where intermediate results can be stored

The exact process carried out by the evaluation tool and the parameters accepted depend on the type of evaluation tool being developed. An evaluation tool must always output a JSON file benchmark.json that contains an object; each property of the object is one of the metrics measured by the tool, with the metric name as the key and the measured result as the value.

The parameters actually required depend on the type of evaluation tool being developed. The --dataset-dir parameter is used to pass datasets distributed with the challenge (or the output of the corresponding data tools). The --parameters parameter is used by generic evaluation tools that can be re-used in multiple challenges. The --cache-dir parameter is used by tools that support resuming their execution after a transient error and can also be used during development to cache evaluation results.
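The sketch below shows one possible skeleton for such a script, assuming it is written in Python; the dataset handling, the metric names (accuracy, latency) and the measurement logic itself are placeholders and not part of any Bonseyes API.

#!/usr/bin/env python3
"""Illustrative skeleton of an evaluation tool script (placeholder logic)."""
import argparse
import json
import os


def main():
    parser = argparse.ArgumentParser(description="Example evaluation tool")
    parser.add_argument("--output-dir", required=True,
                        help="directory where the results must be stored")
    parser.add_argument("--target-url", required=True,
                        help="URL where the AI app HTTP API is exposed")
    parser.add_argument("--dataset-dir", nargs="+", action="append", default=[],
                        help="either PATH or NAME PATH, may be repeated")
    parser.add_argument("--parameters",
                        help="JSON file with parameters from the challenge writer")
    parser.add_argument("--cache-dir",
                        help="directory where intermediate results can be stored")
    args = parser.parse_args()

    # Load the optional parameters provided by the challenge writer.
    parameters = {}
    if args.parameters:
        with open(args.parameters) as fp:
            parameters = json.load(fp)

    # Send the test data to the AI app at args.target_url and compute the
    # metrics here (see the HTTP sketch below); placeholder values for now.
    metrics = {"accuracy": 0.0, "latency": 0.0}

    # An evaluation tool must always write its metrics to benchmark.json.
    os.makedirs(args.output_dir, exist_ok=True)
    with open(os.path.join(args.output_dir, "benchmark.json"), "w") as fp:
        json.dump(metrics, fp)


if __name__ == "__main__":
    main()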

The evaluation script must invoke the AI app's HTTP interface. Each AI app class has its own HTTP interface. The pre-defined interfaces (image classification, object detection, …) expect the data to be sent as the body of a POST request and return a JSON object with the result of the evaluation. By setting the custom HTTP header x-metrics to all it is possible to obtain performance details of the AI app.
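As an illustration, a request to an AI app implementing an image classification interface could look roughly like the sketch below; the target URL, the root endpoint path and the structure of the returned JSON are assumptions here and must be taken from the interface definition of the AI class being targeted.

import requests

# Hypothetical example: POST one image to the AI app under test. The URL
# would normally come from the --target-url parameter, and the endpoint
# path and response fields depend on the interface definition.
with open("test_image.jpg", "rb") as fp:
    response = requests.post(
        "http://target-device:8080/",     # assumed value of --target-url
        data=fp.read(),                   # image sent as the body of the POST
        headers={"x-metrics": "all"},     # also request performance details
    )

response.raise_for_status()
result = response.json()                  # JSON object with the evaluation result
print(result)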

Dockerize the evaluation tool script

Once the evaluation tool script is ready it is necessary to create a docker image containing all the required dependencies.

This depends on the way the tool has been written; for a Python-based tool the Dockerfile could look as follows:

FROM python
COPY requirements.txt /app/
RUN pip3 install -r /app/requirements.txt
COPY evaluation.py /app/
ENTRYPOINT ["python3", "/app/evaluation.py"]

It is mandatory to set the script as the entry point of the container with the ENTRYPOINT directive as this is the way the container will be invoked by the Bonseyes tooling.

The docker image then needs to be built (and potentially pushed to a remote registry where other users can download it):

$ docker build -t path/to/registry/and/repository/for/tool .
$ docker push path/to/registry/and/repository/for/tool

Create the evaluation tool metadata

In order to use the evaluation tool in a challenge some metadata needs to be defined. This allows the system to find the tool and provides documentation for the tool.

The metadata must be stored in an artifact package; this can be a dedicated package or it can be stored along with the challenge. To create a new package you can follow the instructions in /pages/dev_guides/package/create_package.

The evaluation tool metadata consists of a main file typically called evaluation_tool.yml and an optional schema for the parameters typically named parameters.yml. The main evaluation tool file has the following structure:

metadata:
  title: Title for the evaluation tool
  description: |
    Multiline description of the evaluation tool
image: path/to/registry/and/repository/for/tool
parameters: relative path to the parameters schema file (optional)
interface: com_bonseyes/interfaces#image_classification
input_datasets:
  test_images:
    description: Images and corresponding ground truth used to perform the accuracy test
metrics:
  accuracy:
    title: Accuracy
    description: Percentage of images correctly classified
    unit: percentage
  latency:
    title: Latency
    description: Average time to perform the inference on an image
    unit: ms

The metadata object contains a description of the tool. The image property points to the image containing the dockerized tool script. The interface property specifies the interface that the AI app under test must implement. The metrics property specifies the metrics that are generated by the tool. The input_datasets property describes the datasets that the tool requires as input; corresponding entries must be present in the data section of the evaluation procedure of the challenge.
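During development it can be handy to check that the metrics written to benchmark.json match the ones declared in the metadata. The sketch below is only an illustrative helper, not part of the Bonseyes tooling; it assumes PyYAML is installed and the file paths are examples.

import json

import yaml  # PyYAML, assumed to be available


def check_metrics(metadata_path, benchmark_path):
    """Compare the metrics declared in evaluation_tool.yml with the ones
    actually written to benchmark.json by the tool."""
    with open(metadata_path) as fp:
        declared = set(yaml.safe_load(fp).get("metrics", {}))
    with open(benchmark_path) as fp:
        produced = set(json.load(fp))

    missing = declared - produced
    unexpected = produced - declared
    if missing or unexpected:
        raise ValueError(f"missing metrics: {missing}, unexpected metrics: {unexpected}")


# Example paths, adjust to your package layout and output directory.
check_metrics("evaluation_tool.yml", "output/benchmark.json")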