Train and Test

  • Train

    • Train with your PC

    • Train with multiple GPUs

    • Train with multiple machines

      • Multiple machines in the same network

      • Multiple machines managed with slurm

  • Test

    • Test with your PC

    • Test with multiple GPUs

    • Test with multiple machines

      • Multiple machines in the same network

      • Multiple machines managed with slurm

Train

Train with your PC

You can use tools/train.py to train a model on a single machine with a CPU and optionally a GPU.

Here is the full usage of the script:

python tools/train.py ${CONFIG_FILE} [ARGS]

Note

By default, MMPose prefers GPU to CPU. If you want to train a model on CPU, set CUDA_VISIBLE_DEVICES to an empty value or -1 to make the GPUs invisible to the program.

CUDA_VISIBLE_DEVICES=-1 python tools/train.py ${CONFIG_FILE} [ARGS]

Arguments:

  • CONFIG_FILE: The path to the config file.
  • --work-dir WORK_DIR: The target folder to save logs and checkpoints. Defaults to a folder with the same name as the config file under ./work_dirs.
  • --resume [RESUME]: Resume training. If a path is specified, resume from that checkpoint; otherwise, try to automatically resume from the latest checkpoint in the work directory.
  • --amp: Enable automatic mixed-precision training.
  • --no-validate: Disable checkpoint evaluation during training. Not recommended.
  • --auto-scale-lr: Automatically rescale the learning rate according to the ratio of the actual batch size to the original batch size.
  • --cfg-options CFG_OPTIONS: Override some settings in the used config. Key-value pairs in xxx=yyy format will be merged into the config file. If the value to be overwritten is a list, it should be of the form key="[a,b]" or key=a,b. The argument also allows nested list/tuple values, e.g. key="[(a,b),(c,d)]". Note that the quotation marks are necessary and that no whitespace is allowed. (See the example after this list.)
  • --show-dir SHOW_DIR: The directory to save the result visualization images generated during validation.
  • --show: Visualize the prediction results in a window.
  • --interval INTERVAL: The interval of samples to visualize.
  • --wait-time WAIT_TIME: The display time of every window (in seconds). Defaults to 1.
  • --launcher {none,pytorch,slurm,mpi}: Options for the job launcher.
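
Below is a concrete invocation sketch. The config path, work directory, and overridden batch size are placeholders for illustration; substitute a config from your own setup and only override keys that actually exist in it:

python tools/train.py configs/example_pose_config.py \
    --work-dir work_dirs/example_run \
    --amp \
    --cfg-options train_dataloader.batch_size=32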

Train with multiple GPUs

We provide a shell script to start a multi-GPU task with torch.distributed.launch.

bash ./tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM} [PY_ARGS]

Arguments:

  • CONFIG_FILE: The path to the config file.
  • GPU_NUM: The number of GPUs to be used.
  • [PY_ARGS]: The other optional arguments of tools/train.py; see the table above.

You can also pass extra arguments to the launcher through environment variables. For example, change the communication port of the launcher to 29666 with the following command:

PORT=29666 bash ./tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM} [PY_ARGS]

If you want to start multiple training jobs on different GPUs, you can launch them by specifying different communication ports and visible devices:

CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 bash ./tools/dist_train.sh ${CONFIG_FILE1} 4 [PY_ARGS]
CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 bash ./tools/dist_train.sh ${CONFIG_FILE2} 4 [PY_ARGS]

Train with multiple machines

Multiple machines in the same network

If you launch a training job on multiple machines connected via Ethernet, you can run the following commands:

On the first machine:

NNODES=2 NODE_RANK=0 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR bash tools/dist_train.sh $CONFIG $GPUS

On the second machine:

NNODES=2 NODE_RANK=1 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR bash tools/dist_train.sh $CONFIG $GPUS

Compared with multi-GPU training on a single machine, you need to specify some extra environment variables:

Environment variables:

  • NNODES: The total number of machines.
  • NODE_RANK: The index of the local machine.
  • PORT: The communication port; it should be the same on all machines.
  • MASTER_ADDR: The IP address of the master machine; it should be the same on all machines.

Training across machines is usually slow without high-speed networking such as InfiniBand.
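
For example, assuming the master machine's IP address is 10.1.1.10 and port 29500 is free on both machines (these values and the config path are placeholders; use your own), a two-machine run with 8 GPUs per machine would look like this:

# On the master machine (rank 0)
NNODES=2 NODE_RANK=0 PORT=29500 MASTER_ADDR=10.1.1.10 bash tools/dist_train.sh configs/example_pose_config.py 8

# On the second machine (rank 1)
NNODES=2 NODE_RANK=1 PORT=29500 MASTER_ADDR=10.1.1.10 bash tools/dist_train.sh configs/example_pose_config.py 8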

Multiple machines managed with slurm

If you run MMPose on a cluster managed with slurm, you can use the script slurm_train.sh.

[ENV_VARS] ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${WORK_DIR} [PY_ARGS]

Here are the argument descriptions of the script.

Arguments:

  • PARTITION: The partition to use in your cluster.
  • JOB_NAME: The name of your job; you can name it as you like.
  • CONFIG_FILE: The path to the config file.
  • WORK_DIR: The target folder to save logs and checkpoints.
  • [PY_ARGS]: The other optional arguments of tools/train.py; see the table above.

Here are the environment variables that can be used to configure the slurm job.

Environment variables:

  • GPUS: The total number of GPUs to be used. Defaults to 8.
  • GPUS_PER_NODE: The number of GPUs to be allocated per node. Defaults to 8.
  • CPUS_PER_TASK: The number of CPUs to be allocated per task (usually one GPU corresponds to one task). Defaults to 5.
  • SRUN_ARGS: The other arguments of srun. See the srun documentation for available options.
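
For example, the following sketch launches a 16-GPU training job across two 8-GPU nodes. The partition name, job name, config path, and work directory are placeholders; adjust them for your cluster:

GPUS=16 GPUS_PER_NODE=8 CPUS_PER_TASK=5 ./tools/slurm_train.sh gpu example_job configs/example_pose_config.py work_dirs/example_run --amp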

Test

Test with your PC

You can use tools/test.py to test a model on a single machine with a CPU and optionally a GPU.

Here is the full usage of the script:

python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [ARGS]

Note

By default, MMPose prefers GPU to CPU. If you want to test a model on CPU, set CUDA_VISIBLE_DEVICES to an empty value or -1 to make the GPUs invisible to the program.

CUDA_VISIBLE_DEVICES=-1 python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [ARGS]

Arguments:

  • CONFIG_FILE: The path to the config file.
  • CHECKPOINT_FILE: The path to the checkpoint file (it can be an HTTP link; checkpoints are available in the MMPose model zoo).
  • --work-dir WORK_DIR: The directory to save the file containing evaluation metrics.
  • --out OUT: The path to save the file containing evaluation metrics.
  • --dump DUMP: The path to dump all outputs of the model for offline evaluation. (See the example after this list.)
  • --cfg-options CFG_OPTIONS: Override some settings in the used config. Key-value pairs in xxx=yyy format will be merged into the config file. If the value to be overwritten is a list, it should be of the form key="[a,b]" or key=a,b. The argument also allows nested list/tuple values, e.g. key="[(a,b),(c,d)]". Note that the quotation marks are necessary and that no whitespace is allowed.
  • --show-dir SHOW_DIR: The directory to save the result visualization images.
  • --show: Visualize the prediction results in a window.
  • --interval INTERVAL: The interval of samples to visualize.
  • --wait-time WAIT_TIME: The display time of every window (in seconds). Defaults to 1.
  • --launcher {none,pytorch,slurm,mpi}: Options for the job launcher.
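
Below is a concrete invocation sketch. The config path, checkpoint path, and output locations are placeholders; substitute your own files:

python tools/test.py configs/example_pose_config.py work_dirs/example_run/epoch_210.pth \
    --work-dir work_dirs/example_eval \
    --dump work_dirs/example_eval/predictions.pkl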

Test with multiple GPUs

We provide a shell script to start a multi-GPU task with torch.distributed.launch.

bash ./tools/dist_test.sh ${CONFIG_FILE} ${CHECKPOINT_FILE} ${GPU_NUM} [PY_ARGS]

Arguments:

  • CONFIG_FILE: The path to the config file.
  • CHECKPOINT_FILE: The path to the checkpoint file (it can be an HTTP link; checkpoints are available in the MMPose model zoo).
  • GPU_NUM: The number of GPUs to be used.
  • [PY_ARGS]: The other optional arguments of tools/test.py; see the table above.

You can also pass extra arguments to the launcher through environment variables. For example, change the communication port of the launcher to 29666 with the following command:

PORT=29666 bash ./tools/dist_test.sh ${CONFIG_FILE} ${CHECKPOINT_FILE} ${GPU_NUM} [PY_ARGS]

If you want to start multiple test jobs on different GPUs, you can launch them by specifying different communication ports and visible devices:

CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 bash ./tools/dist_test.sh ${CONFIG_FILE1} ${CHECKPOINT_FILE} 4 [PY_ARGS]
CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 bash ./tools/dist_test.sh ${CONFIG_FILE2} ${CHECKPOINT_FILE} 4 [PY_ARGS]

Test with multiple machines

Multiple machines in the same network

If you launch a test job on multiple machines connected via Ethernet, you can run the following commands:

On the first machine:

NNODES=2 NODE_RANK=0 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR bash tools/dist_test.sh $CONFIG $CHECKPOINT_FILE $GPUS

On the second machine:

NNODES=2 NODE_RANK=1 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR bash tools/dist_test.sh $CONFIG $CHECKPOINT_FILE $GPUS

Compared with multi-GPU testing on a single machine, you need to specify some extra environment variables:

Environment variables:

  • NNODES: The total number of machines.
  • NODE_RANK: The index of the local machine.
  • PORT: The communication port; it should be the same on all machines.
  • MASTER_ADDR: The IP address of the master machine; it should be the same on all machines.

Testing across machines is usually slow without high-speed networking such as InfiniBand.

Multiple machines managed with slurm

If you run MMPose on a cluster managed with slurm, you can use the script slurm_test.sh.

[ENV_VARS] ./tools/slurm_test.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${CHECKPOINT_FILE} [PY_ARGS]

Here are the argument descriptions of the script.

Arguments:

  • PARTITION: The partition to use in your cluster.
  • JOB_NAME: The name of your job; you can name it as you like.
  • CONFIG_FILE: The path to the config file.
  • CHECKPOINT_FILE: The path to the checkpoint file (it can be an HTTP link; checkpoints are available in the MMPose model zoo).
  • [PY_ARGS]: The other optional arguments of tools/test.py; see the table above.

Here are the environment variables that can be used to configure the slurm job.

Environment variables:

  • GPUS: The total number of GPUs to be used. Defaults to 8.
  • GPUS_PER_NODE: The number of GPUs to be allocated per node. Defaults to 8.
  • CPUS_PER_TASK: The number of CPUs to be allocated per task (usually one GPU corresponds to one task). Defaults to 5.
  • SRUN_ARGS: The other arguments of srun. See the srun documentation for available options.
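
For example, the following sketch evaluates a checkpoint on a single 8-GPU node, passing an extra srun option through SRUN_ARGS. The partition name, job name, config path, and checkpoint path are placeholders; adjust them for your cluster:

GPUS=8 GPUS_PER_NODE=8 SRUN_ARGS="--exclusive" ./tools/slurm_test.sh gpu example_eval configs/example_pose_config.py work_dirs/example_run/epoch_210.pth --work-dir work_dirs/example_eval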