Definition
Evaluation tools test the performance of an AI app in a challenge.
An evaluation tool artifact is typically stored in a file evaluation_tool.yml.
The schema of this file is described below.
Bonseyes AI App Evaluation Protocol Metadata

The metadata is an object with the following properties; no additional properties are allowed:

- ID of the AI class for which this evaluation protocol has been designed (type: string)
- Name of the Docker image of the tool (type: string)
- Path to the schema file used to validate the tool parameters (type: string)
- Input datasets expected by the tool (type: object); each entry is an object with:
  - Description of the dataset (type: string)
- Metrics produced by the evaluation protocol (type: object); each entry is the description of a metric, an object with the following properties and no additional ones:
  - Short description of the metric (type: string)
  - Detailed description of the metric (type: string)
  - Unit of the measurement (type: string)
- Whether the tool is generic (type: boolean). Generic evaluation tools can be used for different challenges (of the same class). A generic tool can lack some properties that exist in the challenge manifest where the tool is referred to (e.g. a generic tool does not have to provide slots for the datasets to be used, while this can be "overridden" by a concrete challenge and the datasets defined there).
Create Evaluation Tool
This guide assumes you have already done the following:
Set up the local environment as explained in Prerequisites
An evaluation tool is a piece of code that is able to test the performance of an AI app running on a target hardware. To do so it sends data to the AI app using an HTTP API exposed on the target and performs various measurements. It then stores the results in an output directory. This must minimally include a JSON file with a series of metrics but can also include, for instance, a PDF report. Evaluation tools are usually developed together with challenges to express the acceptance criteria for the AI apps that answer them.
At its core, an evaluation tool consists of an executable (or script) that runs on the developer workstation and is capable of reading data, sending it to the AI app via an HTTP API, collecting the results and building a report. In order to guarantee portability, this script is packaged inside a Docker container so that all its dependencies are available on the system that will run it.
To create an evaluation tool three steps are necessary:
Create the evaluation tool script
Dockerize the evaluation tool script
Create the evaluation tool metadata
The rest of this page describes these steps.
Create the evaluation tool script
The evaluation tool script must be an executable that can consume the following command line parameters:
--output-dir [path to directory]: path to a directory where the data must be stored
--target-url [url]: URL where the AI app HTTP API is exposed
--dataset-dir [path to directory] or --dataset-dir [dataset name] [path to directory] (optional): path where the input datasets are accessible; the first syntax is used if only one dataset is passed, the second if multiple are passed
--parameters [path to JSON file] (optional): path to a JSON file containing the parameters for the script provided by the challenge writer
--cache-dir [path to directory] (optional): path to a directory where intermediate results can be stored
The exact process carried out by the evaluation tool and the parameters accepted depend on the type of evaluation tool
being developed. An evaluation tool must always output a JSON file benchmark.json
that contains an object; each property of the object is one of the metrics measured by the tool and its value is the measured result.
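For example, a tool measuring the accuracy and latency metrics described later on this page could write a benchmark.json along these lines (the values are illustrative):

```json
{
  "accuracy": 98.2,
  "latency": 14.3
}
```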
The parameters actually required depend on the type of evaluation tool being developed. The --dataset-dir
parameter is used to
pass datasets distributed with the challenge (or the output of the corresponding data tools). The --parameters
parameter is used by generic evaluation tools that can be re-used in multiple challenges. The --cache-dir
parameter is for
tools that support resuming their execution after a transient error and can be used during development to cache
evaluation results.
The evaluation script must invoke the AI app's HTTP interface. Each AI app interface defines its own HTTP API. The
pre-defined interfaces (image classification, object detection, …) expect the data to be sent as the body of a POST
request and they return a JSON object with the result of the evaluation. By setting the custom HTTP header x-metrics
to all
it is possible to obtain performance details of the AI app.
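Putting these pieces together, a minimal evaluation script could be sketched as follows. This is only a sketch: the argument names follow the list above, while the endpoint behaviour and the metric computation are assumptions to be adapted to the actual AI app interface.

```python
# Sketch of an evaluation tool script (stdlib only). The CLI flags
# follow the documented list; everything else is illustrative.
import argparse
import json
import os
import urllib.request


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--output-dir", required=True)
    parser.add_argument("--target-url", required=True)
    # Accepts "--dataset-dir PATH" or "--dataset-dir NAME PATH",
    # possibly repeated when multiple datasets are passed.
    parser.add_argument("--dataset-dir", nargs="+", action="append", default=[])
    parser.add_argument("--parameters")
    parser.add_argument("--cache-dir")
    return parser.parse_args(argv)


def classify(target_url, sample_bytes):
    # Pre-defined interfaces expect the sample as the body of a POST
    # request; the x-metrics header asks the AI app for performance details.
    request = urllib.request.Request(
        target_url,
        data=sample_bytes,
        headers={"x-metrics": "all"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def write_benchmark(output_dir, metrics):
    # benchmark.json maps each metric name to its measured value.
    with open(os.path.join(output_dir, "benchmark.json"), "w") as fp:
        json.dump(metrics, fp)


def main(argv=None):
    args = parse_args(argv)
    # A real tool would iterate over the dataset directories, call
    # classify() per sample and aggregate the metrics here.
    write_benchmark(args.output_dir, {"accuracy": 0.0, "latency": 0.0})
```

In the real tool, main() would be invoked when the script starts, so that the Docker ENTRYPOINT described below can run it directly.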
Dockerize the evaluation tool script
Once the evaluation tool script is ready it is necessary to create a docker image containing all the required dependencies.
This depends on the way the tool has been written; for a Python-based tool the Dockerfile could look as follows:
FROM python
COPY requirements.txt /app/
RUN pip3 install -r /app/requirements.txt
COPY evaluation.py /app/
ENTRYPOINT ["python3", "/app/evaluation.py"]
It is mandatory to set the script as the entry point of the container with the ENTRYPOINT
directive, as this is how
the container will be invoked by the Bonseyes tooling.
The Docker image then needs to be built (and potentially pushed to a remote registry where other users can download it):
$ docker build -t path/to/registry/and/repository/for/tool .
$ docker push path/to/registry/and/repository/for/tool
Create the evaluation tool metadata
In order to use the evaluation tool in a challenge some metadata needs to be defined. This allows the system to find the tool and provides documentation for the tool.
The metadata must be stored in an artifact package; this can be a dedicated package or it can be stored along with the challenge. To create a new package you can follow the instructions in /pages/dev_guides/package/create_package.
The evaluation tool metadata consists of a main file typically called evaluation_tool.yml
and an optional schema
for the parameters typically named parameters.yml
. The main evaluation tool file has the following structure:
metadata:
title: Title for the evaluation tool
description: |
Multiline description of the evaluation tool
image: path/to/registry/and/repository/for/tool
parameters: relative path to the parameters schema file (optional)
interface: com_bonseyes/interfaces#image_classification
input_datasets:
test_images:
description: Images and corresponding ground truth used to perform the accuracy test
metrics:
accuracy:
title: Accuracy
description: Percentage of images correctly classified
unit: percentage
latency:
title: Latency
description: Average time to perform the inference on an image
unit: ms
The metadata
object contains a description of the tool. The image
property points to the image
containing the dockerized tool script. The interface
property specifies the interface that the AI app under test
must implement. The metrics
property specifies a list of metrics that are generated by the tool. The
input_datasets
property is used to describe the datasets that the tool requires as input, corresponding entries
must be present in the data
section of the evaluation procedure of the challenge.
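For completeness, the optional parameters schema file could look along these lines, assuming it follows JSON Schema conventions written in YAML; the parameter name shown is purely illustrative and not part of the Bonseyes specification:

```yaml
# Hypothetical parameters.yml sketch; "iterations" is an invented example.
type: object
properties:
  iterations:
    type: integer
    description: Number of times each sample is sent to average the latency
additionalProperties: false
```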