A Datatool is a containerized, Python-based utility that extracts data from a given source dataset and translates both unstructured data (images, audio, video files, etc.) and semi-structured data (CSV, JSON, TXT, XML, XLS, etc.) into a common representation. It also enables loading the translated data into Python data structures through a standard API.
By performing this ETL (Extract, Transform, Load) step, the datatool generates a standard data structure and annotation definitions, which data scientists and analysts then consume for various purposes, e.g., data analysis and predictive modeling.
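As an illustration of what such a common representation can look like, the sketch below shows a hypothetical standardized sample record; the field names (`sample_id`, `annotations`, etc.) are assumptions for this example, not the datatool's actual schema:

```python
# A hypothetical standardized record as a datatool might emit it.
# All field names here are illustrative, not the real datatool schema.
sample = {
    "sample_id": "img_000123",
    "source_file": "images/000123.jpg",   # original unstructured input
    "media_type": "image",
    "annotations": [
        # one entry per annotated object; bbox as [x, y, width, height]
        {"label": "car", "bbox": [34, 50, 120, 90]},
        {"label": "person", "bbox": [10, 20, 40, 80]},
    ],
}

def labels(record):
    """Collect all annotation labels of a standardized record."""
    return [a["label"] for a in record["annotations"]]

print(labels(sample))  # → ['car', 'person']
```

Because every sample follows the same structure regardless of the original source format, downstream consumers can process the output uniformly.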
The purpose of this document is to give an overview of the datatool workflow to obtain an structured dataset from raw data.
To be able to complete this workflow:
- Make sure you have set up the local environment as explained in Prerequisites.
- Have Docker 19.03 installed in your local environment.
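A quick way to verify the Docker prerequisite from Python is sketched below; the helper names are illustrative, and the version comparison only looks at the major and minor components:

```python
import shutil

def meets_minimum(version, minimum=(19, 3)):
    """Compare a dotted version string such as '19.03.8' against a
    (major, minor) minimum, ignoring the patch component."""
    major_minor = tuple(int(p) for p in version.split(".")[:2])
    return major_minor >= minimum

def docker_on_path():
    """True if the Docker CLI is available in the local environment."""
    return shutil.which("docker") is not None

# Example: 19.03.8 satisfies the Docker 19.03 requirement, 18.09.1 does not.
print(meets_minimum("19.03.8"))  # True
print(meets_minimum("18.09.1"))  # False
```

The version string itself can be obtained with `docker version --format '{{.Client.Version}}'`.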
A Datatool workflow is as follows:
1. Documentation: Overall documentation, including example samples and statistics:
   - Documentation (README.md) with example samples and annotations.
   - Exploratory Data Analysis (EDA) report containing the annotation statistics, trends, and interactions.
2. Main Interface: Process the source input data and create the final standardized dataset:
   - Script datatool.py to process the source input data and generate the standardized output.
   - Script datatool_patch.py to apply any available post-processing (cleaning, geometric transformations, etc.) to the generated datatool output.
   - Dependencies: the list of Python modules needed by the datatool.
3. Exploratory Data Analysis: Create or re-create an Exploratory Data Analysis report on the datatool output to get statistical insights:
   - Script create_EDA_report.sh to download and run the EDA report tool on the datatool output and generate the EDA report.
4. Visualize Annotations on Datatool Output: Draw samples at random from the datatool output and visualize their annotations; this is useful for validating and verifying the annotations:
   - Script visualize_annotations.py to visualize annotations on randomly drawn samples from the datatool output.
   - Dependencies: visualize_annotations/requirements.txt lists the Python modules needed by the script.
5. Data Loader Example: Example of how to easily load the datatool output for model training/validation using the datatool API:
   - Script example_dataloaders/example_dataloader_pytorch.py provides a data loader example for PyTorch using the datatool API.
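To sketch the idea behind such a data loader (the class name and record layout below are illustrative, not the datatool API itself): PyTorch's `torch.utils.data.Dataset` is a map-style dataset built on the `__len__`/`__getitem__` protocol, so a loader over datatool output can look like this:

```python
class DatatoolDataset:
    """Minimal map-style dataset sketch over datatool output records.

    PyTorch's torch.utils.data.Dataset relies on this same
    __len__/__getitem__ protocol; the record layout is illustrative.
    """

    def __init__(self, records):
        self.records = list(records)

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        record = self.records[idx]
        # A real loader would read and decode the media file here;
        # this sketch just returns the path and the annotation labels.
        labels = [a["label"] for a in record["annotations"]]
        return record["source_file"], labels

ds = DatatoolDataset([
    {"source_file": "images/000123.jpg",
     "annotations": [{"label": "car"}, {"label": "person"}]},
])
path, labels = ds[0]  # → 'images/000123.jpg', ['car', 'person']
```

Wrapped in a `torch.utils.data.DataLoader`, such a dataset can then feed batches to model training or validation.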
Currently, two Datatool examples are available: