The purpose of the multi-node configuration is to reduce the execution time of the TPE algorithm. Its execution model is to run an identical script on many machines and share the trials, loss function configuration, evaluated parameters and search space objects between them.

The multi-node configuration works in one of two modes: master or peer. In peer mode every machine calculates the latency of the models it evaluates, so the nodes need to be homogeneous. In master mode only one machine, called the server, calculates latency for the evaluated models, so the nodes can be heterogeneous. Client machines are responsible for generating a model based on the parameters chosen by TPE, evaluating its accuracy and sending it to the server node for latency measurement.

The first iteration is executed on every node. During this iteration the loss function configuration is created.

The search space is created on the server node after the second trial has been executed. Once the server node has created the search space configuration, client nodes start evaluating their own trials. When an evaluation ends, the node inserts its result into the shared database.

The final result can be taken from any of the nodes.

Configuration of nodes

This configuration distinguishes two types of nodes. The server is responsible for creating the "search space" and the "loss function config". Client nodes read the data produced by the server and use it to search for and evaluate the best result. Changes to the server and client .json config files:

"optimizer": {
    "name": "Tpe",
    "params": {
        "multinode": {
            "name": "node_name", ← optional
            "type": "server", ← for server node
            "type": "client", ← for client node
            "server_addr": "<server_ip_addr>:<server_port_number>",
            "tag": "group_name", ← optional
            "mode": "peer" ← optional
        },
        "max_trials": 10,
        "trials_load_method": "cold_start",
        ...
    }
}
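
Because the server and client files differ only in the "type" value, both can be derived from one base configuration. The following is a minimal sketch, not part of the tool: the helper name make_multinode_config, the placeholder address 192.0.2.10:27017 and the assumption that the base file contains only the "optimizer" fragment shown above are all illustrative.

```python
import copy
import json


def make_multinode_config(base, node_type, server_addr, name=None, tag=None, mode=None):
    """Return a copy of `base` with a "multinode" block for one node.

    Hypothetical helper: it only mirrors the keys shown in the fragment above.
    """
    if node_type not in ("server", "client"):
        raise ValueError('"type" must be "server" or "client"')
    if mode is not None and mode not in ("peer", "master"):
        raise ValueError('"mode" must be "peer" or "master"')

    multinode = {"type": node_type, "server_addr": server_addr}
    # Optional keys are written only when a value is given.
    for key, value in (("name", name), ("tag", tag), ("mode", mode)):
        if value is not None:
            multinode[key] = value

    config = copy.deepcopy(base)  # leave the base configuration untouched
    config.setdefault("optimizer", {}).setdefault("params", {})["multinode"] = multinode
    return config


# Base fragment without the "multinode" block (assumed layout).
base = {
    "optimizer": {
        "name": "Tpe",
        "params": {"max_trials": 10, "trials_load_method": "cold_start"},
    }
}

# 192.0.2.10 is a documentation-only placeholder address.
server_cfg = make_multinode_config(base, "server", "192.0.2.10:27017", tag="group_name", mode="peer")
client_cfg = make_multinode_config(base, "client", "192.0.2.10:27017", tag="group_name", mode="peer")

print(json.dumps(server_cfg, indent=4))
```

Generating both files from one base keeps "server_addr", "tag" and "mode" identical across the group, which the shared database relies on.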

Parameters:

"name": Name saved in the trials.csv file, mainly for debugging purposes.
"type": Can be "server" or "client".
"server_addr": Address of the MongoDB instance in the form <server_ip_addr>:<server_port_number>. <server_ip_addr> is the IP address of the machine where the MongoDB database is configured; it can differ from the IP address of every node used for TPE execution. <server_port_number> is the port number of the MongoDB database, 27017 by default.
"tag": Name for a group of systems working together. Without this tag, systems are grouped by the model they are working on. If more than one group of systems works on the same model using the same MongoDB database, their results will collide.
"mode": Selects "peer" or "master" mode.

How to run TPE in multi-node configuration

Every node needs an environment prepared in the same way as for a regular run. Models and datasets should be specified in the configuration files. Once the "multinode" parameter has been added to the configuration file and MongoDB is up and running, run the tool on every node that should be part of the searching group of machines.

The server should be started first because it needs to prepare data for the other nodes. When the server node starts its second trial, the clients start their own search.

Steps needed to run multi-node configuration:

Add the 'multinode' parameters to your base configuration file, with type "server" for one node and type "client" for the rest of the nodes,
Create the **'pot'** database in the MongoDB instance,
Run the server node first (the server performs cleanup on the database),
Run the client nodes.

How to select mode

When the mode parameter is set to peer, all nodes (clients and server) search for the best result and evaluate models (for both accuracy and latency). In this mode nodes should be homogeneous, so that the latency measured for the same model is similar on every node. This allows the loss to be calculated correctly and, as a result, gives better and faster convergence.

Master mode is more reliable for latency calculation. Only the server node calculates latency, on behalf of all the other nodes; it does not evaluate accuracy and does not take part in the search for the best result. That is why this mode is slightly slower than peer mode.

Select master mode when:

machines with different hardware configurations (memory, CPU) are used,
the layer option is set in the configuration file (latency sensitive),
latency is the main factor to be improved,
four or more nodes are used.

Select peer mode when:

the machines used are homogeneous,
the range estimator option is set in the configuration file (accuracy sensitive),
accuracy is the main factor to be improved,
only a limited number of nodes (1-3) is used.
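
The two checklists above can be written down as a small helper that reports which criteria point to which mode. This is a sketch for illustration only: the function mode_criteria and its argument names are hypothetical, it covers three of the listed criteria, and it deliberately does not pick a winner when the criteria disagree.

```python
def mode_criteria(homogeneous, num_nodes, latency_is_priority):
    """Group the criteria from the lists above by the mode they favour.

    Hypothetical helper; the thresholds are taken from the lists above
    (4+ nodes for master mode, 1-3 nodes for peer mode).
    """
    favours = {"master": [], "peer": []}
    favours["peer" if homogeneous else "master"].append("hardware")
    favours["master" if num_nodes >= 4 else "peer"].append("node count")
    favours["master" if latency_is_priority else "peer"].append("optimization target")
    return favours


# Six different machines, tuning mainly for latency: every criterion favours master mode.
print(mode_criteria(homogeneous=False, num_nodes=6, latency_is_priority=True))

# Three identical machines, tuning mainly for accuracy: every criterion favours peer mode.
print(mode_criteria(homogeneous=True, num_nodes=3, latency_is_priority=False))
```

When the result is mixed, note that peer mode depends on latency measurements being comparable across nodes, so heterogeneous hardware is the criterion that weighs most heavily against it.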

How it works

All synchronization is done through the MongoDB database. There is no direct communication between the server and the clients. When a client needs information about the loss function configuration, fp32 metrics or search space, it has to wait until the server pushes this data to the database.
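
The waiting behaviour described above can be illustrated with a polling loop. The sketch below is an assumption about how such a wait could look, not the tool's actual implementation: a plain dictionary stands in for the MongoDB database, and the names wait_for_key, shared_store and server_publish are hypothetical.

```python
import threading
import time


def wait_for_key(store, key, poll_interval=0.01, timeout=2.0):
    """Poll `store` until `key` appears, then return its value.

    Raises TimeoutError if the key is not published within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if key in store:
            return store[key]
        time.sleep(poll_interval)
    raise TimeoutError(f"{key!r} was not published within {timeout} s")


shared_store = {}  # stand-in for the shared database (assumption)


def server_publish():
    # The "server" pushes the search space a little later than the client asks for it.
    shared_store["search_space"] = {"status": "ready"}


server = threading.Timer(0.05, server_publish)
server.start()

# The "client" never talks to the server directly; it only watches the store.
search_space = wait_for_key(shared_store, "search_space")
server.join()
print(search_space)
```

The same pattern applies to the loss function configuration and fp32 metrics: the client blocks on the shared store, so starting clients before the server only makes them wait.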

Results

TPE execution time in minutes for 100 trials on ssd-mobilenetv1 with the COCO dataset.

No. of nodes	master mode	peer mode
1	n/a	747
2	n/a	414
3	304	205
4	211	no data
5	160	no data
6	125	no data
7	110	96
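
To make the scaling easier to read, the times above can be converted into speedups relative to the single-node peer run (747 minutes). The snippet below is a small worked calculation over the table values only; the choice of the single-node peer run as the baseline is an assumption made here for illustration.

```python
# Minutes for 100 trials, copied from the table above.
# None marks the "n/a" and "no data" cells.
minutes = {
    1: {"master": None, "peer": 747},
    2: {"master": None, "peer": 414},
    3: {"master": 304, "peer": 205},
    4: {"master": 211, "peer": None},
    5: {"master": 160, "peer": None},
    6: {"master": 125, "peer": None},
    7: {"master": 110, "peer": 96},
}

BASELINE = minutes[1]["peer"]  # single-node run used as the reference


def speedup(value):
    """Speedup over the baseline, rounded to two decimals; None if no measurement."""
    return None if value is None else round(BASELINE / value, 2)


speedups = {
    nodes: {mode: speedup(value) for mode, value in row.items()}
    for nodes, row in minutes.items()
}

for nodes, row in speedups.items():
    print(nodes, row["master"], row["peer"])
```

With three nodes, peer mode is about 3.6 times faster than a single node, while master mode is about 2.5 times faster, which is consistent with the server not taking part in the search.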