Tree-Structured Parzen Estimator (TPE)

Sequential model-based optimization methods (SMBO) [1] are formalization of Bayesian optimization. The sequential refers to running trials one after another, each time trying better hyperparameters by applying Bayesian reasoning and updating a surrogate model. There are five main aspects of SMBO:

  1. Domain: A domain of hyperparameters or search space
  2. Objective function: An objective function which takes hyperparameters as input and outputs a score that needs to minimize (or maximize)
  3. Surrogate function: The representative surrogate model of the objective function
  4. Selection function: A selection criterion to evaluate which hyperparameters to choose next trial from the surrogate model
  5. History: A history consisting of (score, hyperparameter) pairs used by the algorithm to update the surrogate model

There are several variants of SMBO, which differ in how to build a surrogate and the selection criteria (steps 3–4). The TPE [1] builds a surrogate model by applying Bayes's rule.

NOTE: TPE requires many iterations to converge to an optimal solution, and it is recommended to run it for at least 200 iterations. Because every iteration requires evaluation of a generated model, which means accuracy measurements on a dataset and latency measurements using benchmark, this process may take from 24 hours up to few days to complete, depending on a model. Due to this, even though the TPE supports all OpenVINO™-supported models, it is being continuously validated only on a subset of models:

  • SSD MobileNet V1 COCO
  • Mobilenet V2 1.0 224
  • Faster R-CNN ResNet 50 COCO
  • Faster R-CNN Inception V2 COCO
  • YOLOv3 TF Full COCO

Algorithm overview

TPE algorithm consists of multiple (the following) steps:

  1. Define a domain of hyperparameter search space,
  2. Create an objective function which takes in hyperparameters and outputs a score (e.g., loss, RMSE, cross-entropy) that we want to minimize,
  3. Get couple of observations (score) using randomly selected set of hyperparameters,
  4. Sort the collected observations by score and divide them into two groups based on some quantile. The first group (x1) contains observations that gave the best scores and the second one (x2) - all other observations,
  5. Two densities l(x1) and g(x2) are modeled using Parzen Estimators (also known as kernel density estimators) which are a simple average of kernels centered on existing data points,
  6. Draw sample hyperparameters from l(x1), evaluating them in terms of l(x1)/g(x2), and returning the set that yields the minimum value under l(x1)/g(x1) corresponding to the greatest expected improvement. These hyperparameters are then evaluated on the objective function.
  7. Update the observation list in step 3
  8. Repeat step 4-7 with a fixed number of trials

Parameters

TPE accepts following parameters:

/* Optimizer used to find "optimal" hyperparameters */
"optimizer": {
"name": "Tpe", // Optimizer name
"params": {
"eval_subset_size": 2000, // [Optional] Subset of test data to tune the model. If not specified, the whole dataset will be used.
"max_trials": 100, // Maximum number of trails
"max_minutes": 10, // [Optional] Trials time limit. When it expires, the last trial is completed and the best result is returned.
"stop_on_target": true, // [Optional] Flag to stop TPE trials when accuracy_loss and latency_reduce targets are reached.
// If false or not specified TPE will continue until max_trials or max_minutes is reached even if targets are reached earlier.
"trials_load_method": "cold_start", // Start from scratch or reuse previous results, supported options [cold_start, warm_start, fine_tune, eval]
"accuracy_loss": 0.1, // Accuracy threshold (%)
"latency_reduce": 1.5, // Target latency improvement versus original model
"accuracy_weight": 1.0, // Accuracy weight in loss function
"latency_weight": 1.0, // Latency weight in loss function
// An optional list of reference metrics values.
// If not specified, all metrics will be calculated from the original model
"metrics": [
{
"name": "accuracy", // Metric name
"baseline_value": 0.72 // Baseline metric value of the original model
}
],
"benchmark": {
// Latency measurement benchmark configuration (https://docs.openvinotoolkit.org/latest/_inference_engine_samples_benchmark_app_README.html)
"performance_count": false,
"batch_size": 0,
"nthreads": 4,
"nstreams": 0,
"nireq": 0,
"api_type": "sync",
"niter": 4,
"duration_seconds": 30,
"benchmark_app_dir": "<path to benchmark_app>" // Path to benchmark_app If not specified, Python base benchmark will be used. Use benchmark_app to reduce jitter in results.
}
}
}

trials_load_method should be used in following manner:

Accuracy and latency components in the loss function are designed to be balanced equally, so that the algorithm is able to achieve an expected accuracy target and provide best possible latency improvement. Changing accuracy_weight, which is left open for experimentation, is discouraged, but it is recommended to change latency_weight to 0 for configurations that do not change latency result, for example, when tuning parameters that only change numeric values of the parameters, such as quantization ranges, and do not change graph structure or data types.

Advantages :

  1. TPE supports a wide variety of variables in parameter search space e.g., uniform, log-uniform, quantized log-uniform, normally-distributed real value, categorical.
  2. Less time to tune. Extremely computationally efficient than conventional methods.
  3. Scope to define the tuning time of quantization.
  4. Scope to define quantization search space and strategies from a wide range of OpenVINO™ quantization algorithms.
  5. Scope to define error tolerance and desired latency improvement. This algorithm guaranteed to get best possible accuracy and latency from optimal parameter combination (quantization algorithms and strategies).

Multiple node configuration

Domain for hyperparameter search space can be vast. Searching best result is time-consuming process. Current implementation allow to use multiple machines to work together. For more information goto Multi-node description.

Drawbacks:

TPE does not model interactions between hyperparameters.

Reference:

[1] J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl, “Algorithms for Hyper-Parameter Optimization,” in Advances in Neural Information Processing Systems 24, J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2011, pp. 2546–2554.