Definition

Evaluation tools allow the performance of an AI app submitted to a challenge to be measured.

An evaluation tool artifact is typically stored in a file named evaluation_tool.yml.

The schema of this file is described below.

Bonseyes AI App Evaluation Protocol Metadata

The file must contain an object with the following properties (no additional properties are allowed):

  • metadata : Metadata for AI Artifacts

  • interface (string) : ID of the AI Class for which this evaluation protocol has been designed

  • image (string) : Name of the docker image of the tool

  • parameters (string) : Path to the schema file used to validate the tool parameters

  • input_datasets (object) : Input datasets expected by the tool. Each entry is keyed by the dataset name and is an object with the following property:

      • description (string) : Description of the dataset

  • metrics (object) : Metrics produced by the evaluation protocol. Each entry describes a metric and is an object with the following properties (no additional properties are allowed):

      • title (string) : Short description of the metric

      • description (string) : Detailed description of the metric

      • unit (string) : Unit of the measurement

  • is_generic_tool (boolean) : Generic evaluation tools can be used for different challenges (of the same class). Such a tool can lack some properties that exist in the challenge manifest where the tool is referred to (e.g. a generic tool does not have to provide slots for the datasets to be used, while these can be "overridden" by a concrete challenge and the datasets defined there).

Create Evaluation Tool

This guide assumes you have already completed the prerequisite setup steps.

An evaluation tool is a piece of code that is able to test the performance of an AI app running on a target hardware platform. To do so it sends data to the AI app using an HTTP API exposed on the target and performs various measurements. It then stores the results in an output directory. The results must minimally include a JSON file with a series of metrics but can also include, for instance, a PDF report. Evaluation tools are usually developed together with challenges to express the acceptance criteria for the AI apps that answer them.

At its core an evaluation tool consists of an executable (or script) that runs on the developer workstation and is capable of reading data, sending it to the AI app via an HTTP API, collecting the results and building a report. In order to guarantee portability this script is packaged inside a docker container so that all its dependencies are available on the system that will run it.

To create an evaluation tool three steps are necessary:

  1. Create the evaluation tool script

  2. Dockerize the evaluation tool script

  3. Create the evaluation tool metadata

The rest of this page describes these steps.

Create the evaluation tool script

The evaluation tool script must be an executable that accepts the following command line parameters:

  • --output-dir [path to directory] : path to a directory where the results must be stored

  • --target-url [url] : URL where the AI app HTTP API is exposed

  • --dataset-dir [path to directory] or --dataset-dir [dataset name] [path to directory] (optional) : path where the input datasets are accessible; the first syntax is used if only one dataset is passed, the second if multiple datasets are passed.

  • --parameters [path to JSON file] (optional) : path to a JSON file containing the parameters for the script provided by the challenge writer

  • --cache-dir [path to directory] (optional): path to directory where intermediate results can be stored

The exact process carried out by the evaluation tool and the parameters accepted depend on the type of evaluation tool being developed. An evaluation tool must always output a JSON file benchmark.json that contains an object; each property of the object is one of the metrics measured by the tool, with the metric name as the key and the measured result as the value.

The parameters actually required depend on the type of evaluation tool being developed. The --dataset-dir parameter is used to pass datasets distributed with the challenge (or the output of the corresponding data tools). The --parameters parameter is used by generic evaluation tools that can be re-used in multiple challenges. The --cache-dir parameter is used by tools that support resuming their execution after a transient error and can also be used during development to cache evaluation results.
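The sketch below shows one possible skeleton for such a script, assuming it is written in Python; the dataset handling, the metric names (accuracy, latency) and the measurement logic itself are placeholders and not part of any Bonseyes API.

#!/usr/bin/env python3
"""Illustrative skeleton of an evaluation tool script (placeholder logic)."""
import argparse
import json
import os


def main():
    parser = argparse.ArgumentParser(description="Example evaluation tool")
    parser.add_argument("--output-dir", required=True,
                        help="directory where the results must be stored")
    parser.add_argument("--target-url", required=True,
                        help="URL where the AI app HTTP API is exposed")
    parser.add_argument("--dataset-dir", nargs="+", action="append", default=[],
                        help="either PATH or NAME PATH, may be repeated")
    parser.add_argument("--parameters",
                        help="JSON file with parameters from the challenge writer")
    parser.add_argument("--cache-dir",
                        help="directory where intermediate results can be stored")
    args = parser.parse_args()

    # Load the optional parameters provided by the challenge writer.
    parameters = {}
    if args.parameters:
        with open(args.parameters) as fp:
            parameters = json.load(fp)

    # Send the test data to the AI app at args.target_url and compute the
    # metrics here (see the HTTP sketch below); placeholder values for now.
    metrics = {"accuracy": 0.0, "latency": 0.0}

    # An evaluation tool must always write its metrics to benchmark.json.
    os.makedirs(args.output_dir, exist_ok=True)
    with open(os.path.join(args.output_dir, "benchmark.json"), "w") as fp:
        json.dump(metrics, fp)


if __name__ == "__main__":
    main()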

The evaluation script must invoke the AI app's HTTP interface. Each AI app class has its own HTTP interface. The pre-defined interfaces (image classification, object detection, …) expect the data to be sent as the body of a POST request and return a JSON object with the result of the evaluation. By setting the custom HTTP header x-metrics to all it is possible to obtain performance details of the AI app.
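As an illustration, a request to an AI app implementing an image classification interface could look roughly like the sketch below; the target URL, the root endpoint path and the structure of the returned JSON are assumptions here and must be taken from the interface definition of the AI class being targeted.

import requests

# Hypothetical example: POST one image to the AI app under test. The URL
# would normally come from the --target-url parameter, and the endpoint
# path and response fields depend on the interface definition.
with open("test_image.jpg", "rb") as fp:
    response = requests.post(
        "http://target-device:8080/",     # assumed value of --target-url
        data=fp.read(),                   # image sent as the body of the POST
        headers={"x-metrics": "all"},     # also request performance details
    )

response.raise_for_status()
result = response.json()                  # JSON object with the evaluation result
print(result)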

Dockerize the evaluation tool script

Once the evaluation tool script is ready it is necessary to create a docker image containing all the required dependencies.

This depends on the way the tool has been written; for a Python-based tool the Dockerfile could look as follows:

FROM python
COPY requirements.txt /app/
RUN pip3 install -r /app/requirements.txt
COPY evaluation.py /app/
ENTRYPOINT ["python3", "/app/evaluation.py"]

It is mandatory to set the script as the entry point of the container with the ENTRYPOINT directive as this is the way the container will be invoked by the Bonseyes tooling.

The docker image then needs to be built (and potentially pushed to a remote registry where other users can download it):

$ docker build -t path/to/registry/and/repository/for/tool .
$ docker push path/to/registry/and/repository/for/tool

Create the evaluation tool metadata

In order to use the evaluation tool in a challenge some metadata needs to be defined. This allows the system to find the tool and provides documentation for the tool.

The metadata must be stored in an artifact package; this can be a dedicated package or it can be stored along with the challenge. To create a new package you can follow the instructions in /pages/dev_guides/package/create_package.

The evaluation tool metadata consists of a main file typically called evaluation_tool.yml and an optional schema for the parameters typically named parameters.yml. The main evaluation tool file has the following structure:

metadata:
  title: Title for the evaluation tool
  description: |
    Multiline description of the evaluation tool
image: path/to/registry/and/repository/for/tool
parameters: relative path to the parameters schema file (optional)
interface: com_bonseyes/interfaces#image_classification
input_datasets:
  test_images:
    description: Images and corresponding ground truth used to perform the accuracy test
metrics:
  accuracy:
    title: Accuracy
    description: Percentage of images correctly classified
    unit: percentage
  latency:
    title: Latency
    description: Average time to perform the inference on an image
    unit: ms

The metadata object contains a description of the tool. The image property points to the image containing the dockerized tool script. The interface property specifies the interface that the AI app under test must implement. The metrics property specifies the metrics that are generated by the tool. The input_datasets property describes the datasets that the tool requires as input; corresponding entries must be present in the data section of the evaluation procedure of the challenge.
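During development it can be handy to check that the metrics written to benchmark.json match the ones declared in the metadata. The sketch below is only an illustrative helper, not part of the Bonseyes tooling; it assumes PyYAML is installed and the file paths are examples.

import json

import yaml  # PyYAML, assumed to be available


def check_metrics(metadata_path, benchmark_path):
    """Compare the metrics declared in evaluation_tool.yml with the ones
    actually written to benchmark.json by the tool."""
    with open(metadata_path) as fp:
        declared = set(yaml.safe_load(fp).get("metrics", {}))
    with open(benchmark_path) as fp:
        produced = set(json.load(fp))

    missing = declared - produced
    unexpected = produced - declared
    if missing or unexpected:
        raise ValueError(f"missing metrics: {missing}, unexpected metrics: {unexpected}")


# Example paths, adjust to your package layout and output directory.
check_metrics("evaluation_tool.yml", "output/benchmark.json")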