TPE multi-node configuration based on a MongoDB database

The purpose of the multi-node configuration is to reduce the execution time of the TPE algorithm. The execution model is to run an identical script on many machines and share trials, the loss function configuration, evaluated parameters, and search space objects between them.

The multi-node configuration supports two modes: master and peer. In peer mode, every machine measures latency for the models it evaluates, so the nodes need to be homogeneous. In master mode, only one machine, the server, measures latency for evaluated models, so the nodes can be heterogeneous. Client machines are responsible for generating a model from the parameters chosen by TPE, evaluating its accuracy, and sending it to the server node for latency measurement.

The first iteration is executed on every node. During this iteration, the loss function configuration is created.

The search space is created on the server node after the 2nd trial is executed. Once the server node has created the search space configuration, client nodes start evaluating their own trials. When an evaluation ends, the node inserts its result into the shared database.

The final result can be taken from any of the nodes.

Configuration of nodes

This configuration distinguishes two types of nodes. The server is responsible for creating the "search space" and the "loss function config". Client nodes read the data produced by the server and use it to search for and evaluate the best result. The server and client .json config files need the following changes:

"optimizer": {
    "name": "Tpe",
    "params": {
        "multinode": {
            "name": "node_name", ← optional
            "type": "server", ← for server node
            "type": "client", ← for client node
            "server_addr": "<server_ip_addr>:<server_port_number>",
            "tag": "group_name", ← optional
            "mode": "peer" ← optional
        },
        "max_trials": 10,
        "trials_load_method": "cold_start",
        ...,
    }
}
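As a rough illustration, the per-node files can be derived from one base configuration. The sketch below is only an assumption about how one might script this: `make_node_config`, the node names, and the address `192.0.2.10:27017` are hypothetical placeholders, and only the key names come from the fragment above.

```python
import copy
import json


def make_node_config(base, node_type, server_addr, name=None, tag=None, mode=None):
    """Return a copy of `base` with a "multinode" section for one node."""
    if node_type not in ("server", "client"):
        raise ValueError("type must be 'server' or 'client'")
    multinode = {"type": node_type, "server_addr": server_addr}
    # Optional keys are written only when a value is supplied.
    for key, value in (("name", name), ("tag", tag), ("mode", mode)):
        if value is not None:
            multinode[key] = value
    config = copy.deepcopy(base)  # leave the base configuration untouched
    config["optimizer"]["params"]["multinode"] = multinode
    return config


# Base configuration without the multinode section (values from the fragment above).
base = {
    "optimizer": {
        "name": "Tpe",
        "params": {"max_trials": 10, "trials_load_method": "cold_start"},
    }
}

addr = "192.0.2.10:27017"  # placeholder address, not a real deployment
server_cfg = make_node_config(base, "server", addr, name="node0", mode="peer")
client_cfg = make_node_config(base, "client", addr, name="node1", mode="peer")
print(json.dumps(client_cfg, indent=2))
```

Generating both files from the same base keeps everything except the "multinode" section identical across nodes, which matches the "identical script on many machines" model.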

parameters:

How to run TPE in multi node configuration

Every node needs an environment prepared in the same way as for a regular run. Models and datasets should be prepared in the configuration files. Once the "multinode" parameter is added to the configuration file and MongoDB is up and running, run the tool on every node that should be part of the search group.

Run the server first because it needs to prepare data for the other nodes. When the server node starts its 2nd trial, the clients start their own search.

Steps needed to run the multi-node configuration:

  1. Add the 'multinode' parameters to your base configuration file, with type "server" on one node and type "client" on the rest.
  2. Create the **'pot'** database in the MongoDB instance.
  3. Run the server node first (the server performs cleanup on the database).
  4. Run the client nodes.
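Before launching, it can help to check that the group has exactly one server and to derive the start order from the steps above. This is a minimal sketch, not part of the tool: `launch_order` and the node entries are hypothetical, and the rule that a group has exactly one server is read from step 1.

```python
def launch_order(multinode_sections):
    """Return the sections ordered server first, then clients.

    Raises ValueError unless exactly one section has type "server".
    """
    servers = [s for s in multinode_sections if s["type"] == "server"]
    clients = [s for s in multinode_sections if s["type"] == "client"]
    if len(servers) != 1:
        raise ValueError(f"expected exactly one server node, found {len(servers)}")
    return servers + clients


# Hypothetical "multinode" sections collected from three node config files.
nodes = [
    {"name": "node1", "type": "client", "server_addr": "192.0.2.10:27017"},
    {"name": "node0", "type": "server", "server_addr": "192.0.2.10:27017"},
    {"name": "node2", "type": "client", "server_addr": "192.0.2.10:27017"},
]

order = [n["name"] for n in launch_order(nodes)]
print(order)  # ['node0', 'node1', 'node2']
```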

How to select mode

When the mode parameter is set to peer, all nodes (clients and server) search for the best result and evaluate models (for accuracy and latency). In this mode the nodes should be homogeneous, so that the latency measured for the same model is similar on every node. This allows the loss to be calculated correctly and, as a result, gives better and faster convergence.

Master mode is more reliable for latency measurement. Only the server node measures latency, on behalf of the other nodes, but it does not evaluate accuracy and does not take part in searching for the best result. That is why this mode is slightly slower than peer mode.

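Select master mode when the nodes are heterogeneous, or when reliable latency measurement matters more than execution time.

Select peer mode when all nodes are homogeneous and shorter execution time is the priority.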

How it works

All synchronization is done through the MongoDB database. There is no direct communication between the server and the clients. When a client needs the loss function configuration, fp32 metrics, or the search space, it waits until the server pushes this data to the database.
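The waiting pattern described above can be sketched with an in-memory stand-in for the shared database. Everything here is an assumption made for illustration: `SharedStore`, `wait_for`, the key names, and the polling loop are hypothetical, and the tool may synchronize through MongoDB in a different way.

```python
import threading
import time


class SharedStore:
    """In-memory stand-in for the shared database (not MongoDB itself)."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def push(self, key, value):
        with self._lock:
            self._data[key] = value

    def get(self, key):
        with self._lock:
            return self._data.get(key)


def wait_for(store, key, poll_interval=0.01, timeout=5.0):
    """Poll the store until `key` appears, as a client waiting for the server."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        value = store.get(key)
        if value is not None:
            return value
        time.sleep(poll_interval)
    raise TimeoutError(f"server did not publish {key!r} within {timeout} s")


store = SharedStore()
results = {}


def server():
    # The server publishes the data that clients depend on (placeholder values).
    time.sleep(0.05)
    store.push("loss_function_config", {"placeholder": True})
    store.push("search_space", {"placeholder": True})


def client(name):
    # No direct server-client communication: the client only reads the store.
    loss_cfg = wait_for(store, "loss_function_config")
    space = wait_for(store, "search_space")
    results[name] = (loss_cfg, space)


threads = [threading.Thread(target=server)]
threads += [threading.Thread(target=client, args=(f"client{i}",)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))
```

Because clients depend only on the store, the clients in this sketch block until the server has published its data, regardless of the order in which the threads start.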

Results

Time in minutes for TPE execution of 100 trials on ssd-mobilenetv1 with the COCO dataset.

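| No. of nodes | master mode | peer mode |
|--------------|-------------|-----------|
| 1            | n/a         | 747       |
| 2            | n/a         | 414       |
| 3            | 304         | 205       |
| 4            | 211         | no data   |
| 5            | 160         | no data   |
| 6            | 125         | no data   |
| 7            | 110         | 96        |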
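To make the scaling easier to read, the times in the table above can be converted to speedups relative to the single-node run. This is a small worked calculation, not an additional measurement. Using the 1-node time (747 minutes, listed in the peer column) as the baseline for master mode as well is an assumption, since master mode has no 1-node entry.

```python
baseline = 747  # minutes, 1 node (peer column of the table)

# Measured times in minutes, copied from the table; missing entries are omitted.
peer = {2: 414, 3: 205, 7: 96}
master = {3: 304, 4: 211, 5: 160, 6: 125, 7: 110}


def speedups(times, baseline):
    """Map node count to baseline / time, rounded to two decimals."""
    return {nodes: round(baseline / t, 2) for nodes, t in times.items()}


peer_speedup = speedups(peer, baseline)
master_speedup = speedups(master, baseline)

for nodes in sorted(set(peer) | set(master)):
    print(nodes, master_speedup.get(nodes, "-"), peer_speedup.get(nodes, "-"))
```

With these numbers, 3 nodes give roughly 2.46x in master mode and 3.64x in peer mode, and 7 nodes give roughly 6.79x and 7.78x respectively.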