Obtain Datasets

Validation of the model is always performed against specific data combined into datasets. To obtain trustworthy results, a dataset must satisfy the following requirements:

  • Format: the data should be compatible with the model domain (Computer Vision or Natural Language Processing).

  • Content: a dataset must be representative. The data needs to be aligned with the model use case (for example, images of people for face detection).

  • Size: a dataset should contain a sufficient number of items (100+).
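
The size requirement can be checked before a dataset is imported. The following is a minimal sketch, assuming the dataset is a folder of image files. The function names `count_images` and `is_large_enough`, the extension list, and the `MIN_ITEMS` constant are illustrative assumptions and are not part of the DL Workbench.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to the formats your dataset uses.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp"}
# Minimum size taken from the requirement above (100+ items).
MIN_ITEMS = 100


def count_images(dataset_dir):
    """Count image files under dataset_dir, including subfolders."""
    return sum(
        1
        for path in Path(dataset_dir).rglob("*")
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS
    )


def is_large_enough(dataset_dir, min_items=MIN_ITEMS):
    """Return True when the folder holds at least min_items images."""
    return count_images(dataset_dir) >= min_items
```

A count alone does not show that the data is representative, so the Content requirement still needs a manual review.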

Note

You can use Datumaro to make the process of creating your dataset easier. Datumaro is a free framework and CLI tool for building, transforming, and analyzing datasets and annotations.

Image Dataset

Image datasets can be either Annotated or Not Annotated:

  • A Not Annotated dataset contains only images. It still supports most DL Workbench features, such as measuring performance and optimizing and visualizing the model.

  • An Annotated dataset contains images together with information about what each image shows. It extends what you can do with a model: you can measure accuracy and optimize the model within a controllable accuracy drop.

Text Dataset

A text dataset should be a table in CSV or TSV format. The Text Classification use case requires at least two columns, Text and Label. The Textual Entailment task requires a CSV table with three columns: Premise, Hypothesis, and Label. The HuggingFace datasets library provides access to a variety of text datasets.
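
The column requirements above can be verified with a short script before the file is used. This is a minimal sketch, assuming the file has a header row that names its columns. The function `missing_columns`, the task keys, and the case-insensitive matching are illustrative assumptions, not DL Workbench behavior.

```python
import csv
import io

# Column names taken from the text above; matching is case-insensitive here
# as an assumption, since the exact header spelling is not specified.
REQUIRED_COLUMNS = {
    "text_classification": ["text", "label"],
    "textual_entailment": ["premise", "hypothesis", "label"],
}


def missing_columns(file_obj, task, delimiter=","):
    """Return the required columns that are absent from the header row."""
    reader = csv.reader(file_obj, delimiter=delimiter)
    header = next(reader, [])  # an empty file yields an empty header
    present = {name.strip().lower() for name in header}
    return [col for col in REQUIRED_COLUMNS[task] if col not in present]


# Usage with in-memory data; pass delimiter="\t" for TSV files.
sample = io.StringIO(
    "Premise,Hypothesis,Label\nA man sleeps.,A person rests.,entailment\n"
)
print(missing_columns(sample, "textual_entailment"))  # prints []
```

For a file on disk, open it with `open(path, newline="", encoding="utf-8")` and pass the file object in the same way.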