Welcome to MMEval’s documentation!¶
Introduction¶
MMEval is a machine learning evaluation library that supports efficient and accurate distributed evaluation on a variety of machine learning frameworks.
Major features:
Comprehensive metrics for various computer vision tasks (NLP will be covered soon!)
Efficient and accurate distributed evaluation, backed by multiple distributed communication backends
Support for multiple machine learning frameworks via a dynamic input dispatching mechanism
Installation and Usage¶
Installation¶
MMEval requires Python 3.6+ and can be installed via pip.
pip install mmeval
To install the dependencies required for all the metrics provided in MMEval, use the following command:
pip install 'mmeval[all]'
How to use¶
There are two ways to use MMEval’s metrics, using Accuracy as an example:
from mmeval import Accuracy
import numpy as np
accuracy = Accuracy()
The first way is to directly call the instantiated Accuracy object to calculate the metric.
labels = np.asarray([0, 1, 2, 3])
preds = np.asarray([0, 2, 1, 3])
accuracy(preds, labels)
# {'top1': 0.5}
The second way is to calculate the metric after accumulating data from multiple batches.
for i in range(10):
    labels = np.random.randint(0, 4, size=(100, ))
    predicts = np.random.randint(0, 4, size=(100, ))
    accuracy.add(predicts, labels)

accuracy.compute()
# {'top1': ...}
Support Matrix¶
Supported distributed communication backends¶
MPI4Py | torch.distributed | Horovod | paddle.distributed | oneflow.comm |
---|---|---|---|---|
MPI4PyDist | TorchCPUDist, TorchCUDADist | TFHorovodDist | PaddleDist | OneFlowDist |
Supported metrics and ML frameworks¶
Note
The following table lists the metrics implemented by MMEval and the corresponding machine learning framework support. A check mark indicates that the data type of the corresponding framework (e.g. Tensor) can be directly passed for computation.
Note
MMEval is tested with PyTorch 1.6+, TensorFlow 2.4+, Paddle 2.2+ and OneFlow 0.8+.
Metric | numpy.ndarray | torch.Tensor | tensorflow.Tensor | paddle.Tensor | oneflow.Tensor |
---|---|---|---|---|---|
Accuracy | ✔ | ✔ | ✔ | ✔ | ✔ |
SingleLabelMetric | ✔ | ✔ | ✔ | ||
MultiLabelMetric | ✔ | ✔ | ✔ | ||
AveragePrecision | ✔ | ✔ | ✔ | ||
MeanIoU | ✔ | ✔ | ✔ | ✔ | ✔ |
VOCMeanAP | ✔ | ||||
OIDMeanAP | ✔ | ||||
COCODetection | ✔ | ||||
ProposalRecall | ✔ | ||||
F1Score | ✔ | ✔ | ✔ | ||
HmeanIoU | ✔ | ||||
PCKAccuracy | ✔ | ||||
MpiiPCKAccuracy | ✔ | ||||
JhmdbPCKAccuracy | ✔ | ||||
EndPointError | ✔ | ✔ | ✔ | ||
AVAMeanAP | ✔ | ||||
StructuralSimilarity | ✔ | ||||
SignalNoiseRatio | ✔ | ||||
PeakSignalNoiseRatio | ✔ | ||||
MeanAbsoluteError | ✔ | ||||
MeanSquaredError | ✔ |
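For example, since the table above marks Accuracy as supporting torch.Tensor, PyTorch tensors can be passed to it directly without converting them to NumPy first. A minimal sketch, assuming PyTorch is installed:

import torch

from mmeval import Accuracy

accuracy = Accuracy()
labels = torch.tensor([0, 1, 2, 3])
preds = torch.tensor([0, 2, 1, 3])
print(accuracy(preds, labels))
# Expected to match the NumPy example above: {'top1': 0.5}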
Implementing a Metric¶
To implement a metric in MMEval, you should implement a subclass of BaseMetric that overrides the add and compute_metric methods.
In the evaluation process, each metric updates self._results to store intermediate results after each call of add. When computing the final metric result, self._results will be synchronized between processes.
An example of implementing a simple Accuracy metric:
import numpy as np
from mmeval.core import BaseMetric

class Accuracy(BaseMetric):

    def add(self, predictions, labels):
        self._results.append((predictions, labels))

    def compute_metric(self, results):
        predictions = np.concatenate(
            [res[0] for res in results])
        labels = np.concatenate(
            [res[1] for res in results])
        correct = (predictions == labels)
        accuracy = sum(correct) / len(predictions)
        return {'accuracy': accuracy}
Use Accuracy:
# stateless call
accuracy = Accuracy()
metric_results = accuracy(predictions=[1, 2, 3, 4], labels=[1, 2, 3, 1])
print(metric_results)
# {'accuracy': 0.75}

# Accumulate batch
for i in range(10):
    predicts = np.random.randint(0, 4, size=(10,))
    labels = np.random.randint(0, 4, size=(10,))
    accuracy.add(predicts, labels)

metric_results = accuracy.compute()
accuracy.reset()  # clear the intermediate results
Using Distributed Evaluation¶
Distributed evaluation generally uses a strategy of data parallelism, where each process executes the same program to process different data.
The supported distributed communication backends in MMEval can be viewed via list_all_backends.
import mmeval
print(mmeval.core.dist.list_all_backends())
# ['non_dist', 'mpi4py', 'tf_horovod', 'torch_cpu', 'torch_cuda', ...]
This section shows how to use MMEval in combination with torch.distributed and MPI4Py for distributed evaluation, using the CIFAR-10 dataset as an example. The related code can be found at mmeval/examples/cifar10_dist_eval.
Prepare the evaluation dataset and model¶
First of all, we need to load the CIFAR-10 test data; we can use the dataset classes provided by Torchvision. In addition, to be able to slice the dataset according to the number of processes in a distributed evaluation, we need to introduce the DistributedSampler.
import torchvision as tv
from torch.utils.data import DataLoader, DistributedSampler

def get_eval_dataloader(rank=0, num_replicas=1):
    dataset = tv.datasets.CIFAR10(
        root='./', train=False, download=True,
        transform=tv.transforms.ToTensor())
    dist_sampler = DistributedSampler(
        dataset, num_replicas=num_replicas, rank=rank)
    data_loader = DataLoader(dataset, batch_size=1, sampler=dist_sampler)
    return data_loader, len(dataset)
Secondly, we need to prepare the model to be evaluated; here we use resnet18 from Torchvision.
import torch
import torchvision as tv

def get_model(pretrained_model_fpath=None):
    model = tv.models.resnet18(num_classes=10)
    if pretrained_model_fpath is not None:
        model.load_state_dict(torch.load(pretrained_model_fpath))
    return model.eval()
Single process evaluation¶
After preparing the test data and the model, the model predictions can be evaluated using the mmeval.Accuracy metric. The following is an example of a single process evaluation.
import tqdm
import torch
from mmeval import Accuracy

eval_dataloader, total_num_samples = get_eval_dataloader()
model = get_model()

# Instantiate `Accuracy` and calculate the top1 and top3 accuracy
accuracy = Accuracy(topk=(1, 3))

with torch.no_grad():
    for images, labels in tqdm.tqdm(eval_dataloader):
        predicted_score = model(images)
        # Accumulate batch data, intermediate results will be saved in
        # `accuracy._results`.
        accuracy.add(predictions=predicted_score, labels=labels)

# Invoke `accuracy.compute` for metric calculation
print(accuracy.compute())

# Invoke `accuracy.reset` to clear the intermediate results saved in
# `accuracy._results`
accuracy.reset()
Distributed evaluation with torch.distributed¶
There are two distributed communication backends implemented in MMEval for torch.distributed: TorchCPUDist and TorchCUDADist.
There are two ways to set up a distributed communication backend for MMEval:
from mmeval.core import set_default_dist_backend
from mmeval import Accuracy
# 1. Set the global default distributed communication backend.
set_default_dist_backend('torch_cpu')
# 2. Initialize the evaluation metrics by passing `dist_backend`.
accuracy = Accuracy(dist_backend='torch_cpu')
Together with the above code for single process evaluation, the distributed evaluation can be implemented by adding the distributed environment startup and initialization.
import tqdm
import torch
from mmeval import Accuracy

def eval_fn(rank, process_num):
    # Distributed environment initialization
    torch.distributed.init_process_group(
        backend='gloo',
        init_method='tcp://127.0.0.1:2345',
        world_size=process_num,
        rank=rank)

    eval_dataloader, total_num_samples = get_eval_dataloader(rank, process_num)
    model = get_model()

    # Instantiate `Accuracy` and set up a distributed communication backend
    accuracy = Accuracy(topk=(1, 3), dist_backend='torch_cpu')

    with torch.no_grad():
        for images, labels in tqdm.tqdm(eval_dataloader, disable=(rank != 0)):
            predicted_score = model(images)
            accuracy.add(predictions=predicted_score, labels=labels)

    # Specify the number of dataset samples by `size` in order to remove
    # duplicate samples padded by the `DistributedSampler`.
    print(accuracy.compute(size=total_num_samples))
    accuracy.reset()

if __name__ == "__main__":
    # Number of distributed processes
    process_num = 3
    # Launching distributed evaluation with spawn
    torch.multiprocessing.spawn(
        eval_fn, nprocs=process_num, args=(process_num, ))
Distributed evaluation with MPI4Py¶
MMEval has decoupled its distributed communication capability, so although the above example uses a PyTorch model and data loading, we can still use distributed communication backends other than torch.distributed to implement distributed evaluation.
The following will show how to use MPI4Py as a distributed communication backend for distributed evaluation.
First, you need to install MPI4Py and openmpi; it is recommended to use conda to install them:
conda install openmpi
conda install mpi4py
Then modify the above code to use MPI4Py as the distributed communication backend:
# cifar10_eval_mpi4py.py
import tqdm
from mpi4py import MPI
import torch
from mmeval import Accuracy

def eval_fn(rank, process_num):
    eval_dataloader, total_num_samples = get_eval_dataloader(rank, process_num)
    model = get_model()
    accuracy = Accuracy(topk=(1, 3), dist_backend='mpi4py')

    with torch.no_grad():
        for images, labels in tqdm.tqdm(eval_dataloader, disable=(rank != 0)):
            predicted_score = model(images)
            accuracy.add(predictions=predicted_score, labels=labels)

    print(accuracy.compute(size=total_num_samples))
    accuracy.reset()

if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    eval_fn(comm.Get_rank(), comm.Get_size())
Using mpirun as the distributed launch method:
# Launch 3 processes with mpirun
mpirun -np 3 python3 cifar10_eval_mpi4py.py
MMCls¶
BaseMetric in MMEval follows the design of the mmengine.evaluator module and introduces a distributed communication backend abstraction to meet the needs of diverse distributed communication libraries.
Therefore, MMEval naturally supports evaluation in the OpenMMLab 2.0 algorithm libraries, and evaluation metrics from MMEval can be used in the OpenMMLab 2.0 algorithm libraries without modification.
For example, to use mmeval.Accuracy in MMCls, just configure the metric to be Accuracy in the config:
val_evaluator = dict(type='Accuracy', topk=(1, ))
test_evaluator = val_evaluator
MMEval’s support for the OpenMMLab 2.0 algorithm libraries is being gradually improved, and the supported metrics can be viewed in the support matrix.
TensorPack¶
TensorPack is a neural net training interface on TensorFlow, with a focus on speed and flexibility.
There are many examples of classic models and tasks provided in the TensorPack repository. This section shows how to use mmeval.COCODetection for evaluation in TensorPack-FasterRCNN, and the related code can be found at mmeval/examples/tensorpack.
First you need to install TensorFlow and TensorPack, then follow the preparation steps in the TensorPack-FasterRCNN example to install the dependencies, prepare the COCO dataset, and download the pre-trained model weights to be evaluated.
Scripts for model evaluation are provided in predict.py, and the model can be evaluated with the following commands:
./predict.py --evaluate output.json --load /path/to/Trained-Model-Checkpoint --config SAME-AS-TRAINING
MMEval provides an evaluation script for TensorPack-FasterRCNN that uses mmeval.COCODetection. The script needs to be placed in the TensorPack-FasterRCNN example directory, and then the evaluation can be executed with the following commands.
# run evaluation
python tensorpack_mmeval.py --load <model_path>
# launch multi-GPU evaluation with mpirun
mpirun -np 8 python tensorpack_mmeval.py --load <model_path>
We tested this evaluation script on COCO-MaskRCNN-R50C41x and got the same evaluation results as the TensorPack report.
Model | mAP (box) | mAP (mask) | Configurations |
---|---|---|---|
COCO-MaskRCNN-R50C41x | 36.2 | 31.8 | MODE_FPN=False |
PaddleSeg¶
PaddleSeg is a semantic segmentation algorithm library based on Paddle that supports many downstream tasks related to semantic segmentation.
This section shows how to use mmeval.MeanIoU for evaluation in PaddleSeg, and the related code can be found at mmeval/examples/paddleseg.
First you need to install Paddle and PaddleSeg; you can refer to the installation documentation in PaddleSeg. In addition, you need to download the pre-trained model to be evaluated and prepare the evaluation data according to the configuration.
Scripts for model evaluation are provided in the PaddleSeg repo, and the model can be evaluated with the following command:
python val.py --config <config_path> --model_path <model_path>
Note that the val.py script in PaddleSeg only supports single-GPU evaluation, not multi-GPU evaluation yet.
MMEval provides an evaluation script for PaddleSeg that uses mmeval.MeanIoU, which can be executed with the following commands:
# run evaluation
python ppseg_mmeval.py --config <config_path> --model_path <model_path>
# run evaluation with multiple GPUs
python ppseg_mmeval.py --config <config_path> --model_path <model_path> --launcher paddle --num_process <num_gpus>
We tested this evaluation script on fastfcn_resnet50_os8_ade20k_480x480_120k and got the same evaluation results as the val.py in PaddleSeg.
Config | Weights | mIoU | aAcc | Kappa | mDice |
---|---|---|---|---|---|
fastfcn_resnet50_os8_ade20k_480x480_120k | model.pdparams | 0.4373 | 0.8074 | 0.7928 | 0.5772 |
BaseMetric Design¶
During the evaluation process, each GPU usually runs inference on a different slice of the dataset in a data-parallel fashion to speed up evaluation.
Most of the time, the metric results computed on each data subset cannot simply be reduced to obtain the metric result of the whole dataset.
Therefore, the usual practice is to save the inference results, or the intermediate results of the metric calculation, obtained by each process, then perform an all-gather operation across all processes, and finally calculate the metric result over the entire evaluation dataset.
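As a rough numeric illustration of this point (the numbers are made up): suppose two processes evaluate subsets of different sizes; averaging their local accuracies does not equal the accuracy over the whole dataset.

# Rank 0 answered 2 of 3 samples correctly, rank 1 answered 1 of 1 correctly.
correct_per_rank = [2, 1]
total_per_rank = [3, 1]

# Averaging the per-process accuracies gives (2/3 + 1/1) / 2, about 0.83 ...
mean_of_local = sum(c / t for c, t in zip(correct_per_rank, total_per_rank)) / 2
# ... but the accuracy over the whole dataset is 3 / 4 = 0.75.
global_accuracy = sum(correct_per_rank) / sum(total_per_rank)
print(mean_of_local, global_accuracy)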
The above operations are handled by BaseMetric in MMEval, and its interface design is outlined below.
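The sketch below is simplified: method bodies are omitted and argument defaults may differ from the actual MMEval implementation.

class BaseMetric:
    """Simplified sketch of mmeval.core.BaseMetric, for illustration only."""

    def __init__(self, dist_collect_mode='unzip', dist_backend=None):
        # Intermediate results appended by `add`.
        self._results = []

    def add(self, *args, **kwargs):
        """To be overridden: store intermediate results in `self._results`."""
        raise NotImplementedError

    def compute_metric(self, results):
        """To be overridden: compute the metric from the gathered results."""
        raise NotImplementedError

    def compute(self, size=None):
        """Gather `self._results` across processes, optionally trim padded
        samples according to `size`, and call `compute_metric`."""
        ...

    def reset(self):
        """Clear the stored intermediate results."""
        self._results.clear()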
The add and compute_metric methods are interfaces that need to be implemented by users. For more details, please refer to Custom Evaluation Metrics.
It can be seen from the BaseMetric interface that the main function of BaseMetric is to provide distributed evaluation. The basic process is as follows:

1. The user calls the add method to save the inference results or the intermediate results of the metric calculation in the BaseMetric._results list.
2. The user calls the compute method; BaseMetric synchronizes the data in the _results list across processes and calls the user-defined compute_metric method to calculate the metrics.
In addition, BaseMetric takes into account that, in distributed evaluation, some processes may pad duplicate data samples to ensure that every process receives the same number of samples (e.g. the DistributedSampler in PyTorch). Such padding affects the correctness of the metric calculation.
To deal with this problem, BaseMetric.compute can receive a size parameter, which represents the actual number of samples in the evaluation dataset. After _results has been synchronized across processes, the padded samples are removed according to dist_collect_mode so that the metric is calculated correctly.
Note
Be aware that the intermediate results stored in _results should correspond one-to-one with the samples, because the padded samples need to be removed to obtain an accurate result.
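As a rough illustration in plain Python (this is not MMEval code), assume five real samples were dealt out round-robin to two processes, as PyTorch's DistributedSampler does, and the second process was padded with a duplicate so that both processes hold the same number of samples:

# Per-process intermediate results gathered by the backend; the trailing 's0'
# on the second process is a padded duplicate.
gathered = [['s0', 's2', 's4'], ['s1', 's3', 's0']]
size = 5  # actual number of samples in the evaluation dataset

# An 'unzip'-style collection interleaves the per-process results back into
# dataset order and then truncates to `size`, dropping the padded duplicate.
collected = [sample for group in zip(*gathered) for sample in group][:size]
print(collected)
# ['s0', 's1', 's2', 's3', 's4']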
Distributed Communication Backend¶
The distributed communication requirements of MMEval during distributed evaluation mainly include the following:

All-gather the intermediate results of the metric saved in each process.
Broadcast the metric result calculated by the rank 0 process to all processes.
In order to flexibly support multiple distributed communication libraries, MMEval abstracts the above distributed communication requirements and defines a distributed communication interface BaseDistBackend:
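A simplified sketch of what this interface looks like (the actual definition lives in mmeval.core.dist_backends and may differ in detail):

# Simplified sketch of the BaseDistBackend interface, for illustration only.
from abc import ABCMeta, abstractmethod
from typing import Any, List

class BaseDistBackend(metaclass=ABCMeta):

    @property
    @abstractmethod
    def is_initialized(self) -> bool:
        """Whether the distributed environment has been initialized."""

    @property
    @abstractmethod
    def rank(self) -> int:
        """The rank index of the current process group."""

    @property
    @abstractmethod
    def world_size(self) -> int:
        """The world size of the current process group."""

    @abstractmethod
    def all_gather_object(self, obj: Any) -> List[Any]:
        """All-gather a picklable object and return a list with one item per process."""

    @abstractmethod
    def broadcast_object(self, obj: Any, src: int = 0) -> Any:
        """Broadcast a picklable object from the source process to all processes."""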
To implement a distributed communication backend, you need to inherit BaseDistBackend and implement the above interfaces, where:
is_initialized: identifies whether the initialization of the distributed communication environment has been completed.
rank: the rank index of the current process group.
world_size: the world size of the current process group.
all_gather_object: performs the all_gather operation on any Python object that can be serialized by Pickle.
broadcast_object: broadcasts any Python object that can be serialized by Pickle.
Take the implementation of MPI4PyDist as an example:
import os
from typing import Any, List

from mpi4py import MPI

from mmeval.core.dist_backends import BaseDistBackend

class MPI4PyDist(BaseDistBackend):
    """A distributed communication backend for mpi4py."""

    @property
    def is_initialized(self) -> bool:
        """Returns True if the distributed environment has been initialized."""
        return 'OMPI_COMM_WORLD_SIZE' in os.environ

    @property
    def rank(self) -> int:
        """Returns the rank index of the current process group."""
        comm = MPI.COMM_WORLD
        return comm.Get_rank()

    @property
    def world_size(self) -> int:
        """Returns the world size of the current process group."""
        comm = MPI.COMM_WORLD
        return comm.Get_size()

    def all_gather_object(self, obj: Any) -> List[Any]:
        """All gather the given object from the current process group and
        return a list consisting of the gathered objects from each process."""
        comm = MPI.COMM_WORLD
        return comm.allgather(obj)

    def broadcast_object(self, obj: Any, src: int = 0) -> Any:
        """Broadcast the given object from the source process to the current
        process group."""
        comm = MPI.COMM_WORLD
        return comm.bcast(obj, root=src)
Some distributed communication backends have been implemented in MMEval, which can be viewed in the support matrix.
Multiple Dispatch¶
MMEval aims to support multiple machine learning frameworks. One of the simplest solutions would be to implement the computation of all metrics in NumPy: since all machine learning frameworks have Tensor data types that can be converted to numpy.ndarray, this satisfies most evaluation requirements.
However, there may be some problems in some cases:
NumPy has some common operators that have not been implemented yet, such as topk, which can affect the computational speed of the evaluation metrics.
It is time-consuming to move a large number of Tensors from CUDA devices to CPU memory.
In addition, if the metric computation is required to be differentiable, then the Tensor data type of the respective machine learning framework needs to be used for the computation.
To deal with the above, MMEval’s evaluation metrics provide some implementations computed with specific machine learning frameworks, which can be found in the support matrix.
Meanwhile, in order to dispatch between the different calculation methods, MMEval adopts a dynamic multiple dispatch mechanism based on type hints, which selects the corresponding calculation method according to the input data types.
A simple example of multiple dispatch based on type hints is as below:
from mmeval.core import dispatch
@dispatch
def compute(x: int, y: int):
print('this is int')
@dispatch
def compute(x: str, y: str):
print('this is str')
compute(1, 1)
# this is int
compute('1', '1')
# this is str
Currently, we use plum-dispatch to implement the multiple dispatch mechanism in MMEval. On top of plum-dispatch, some speed optimizations have been made, and support for typing.ForwardRef has been added.
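As a hedged sketch of how the same mechanism applies to framework-specific tensor types (the tensor_sum functions below are illustrative and not part of MMEval; PyTorch is assumed to be installed):

import numpy as np
import torch

from mmeval.core import dispatch

@dispatch
def tensor_sum(x: np.ndarray):
    # NumPy implementation.
    return float(x.sum())

@dispatch
def tensor_sum(x: torch.Tensor):
    # PyTorch implementation; the tensor can stay on its original device.
    return float(x.sum().item())

print(tensor_sum(np.ones(3)))     # 3.0, dispatched to the NumPy version
print(tensor_sum(torch.ones(3)))  # 3.0, dispatched to the PyTorch version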
Warning
Due to the dynamically typed nature of Python, determining the exact type of a variable at runtime can be time-consuming, especially for large nested data structures. Therefore, the dynamic multiple dispatch mechanism based on type hints may have some performance problems; for more information, see wesselb/plum/issues/53.
mmeval.core¶
base_metric¶
BaseMetric: Base class for metric.
mmeval.core.dist_backends¶
dist_backends¶
BaseDistBackend: The base backend of distributed communication used by mmeval metrics.
A base backend for Tensor-based distributed communication, e.g. for PyTorch.
A dummy distributed communication backend for non-distributed environments.
MPI4PyDist: A distributed communication backend for mpi4py.
TorchCPUDist: A CPU distributed communication backend for torch.distributed.
TorchCUDADist: A CUDA distributed communication backend for torch.distributed.
TFHorovodDist: A distributed communication backend for horovod.tensorflow.
PaddleDist: A distributed communication backend for paddle.distributed.
OneFlowDist: A distributed communication backend for oneflow.
mmeval.fileio¶
File Backend¶
Abstract class of storage backends.
Raw local storage backend.
HTTP and HTTPS storage backend.
Lmdb storage backend.
Memcached storage backend.
Petrel storage backend (for internal usage).
Register a backend.
File Handler¶
A base class for file handlers.
A JSON handler that parses JSON data from a file object.
A Pickle handler that parses pickle data from a file object.
A YAML handler that parses YAML data from a file object.
A decorator that registers a handler for some file extensions.
File IO¶
Load data from json/yaml/pickle files.
Check whether a file path exists.
Read bytes from a given file path.
Return a file backend based on the prefix of the URI or backend_args.
Download data from a given file path.
Read text from a given file path.
Check whether a file path is a directory.
Check whether a file path is a file.
Concatenate all file paths.
Scan a directory to find the interested directories or files in arbitrary order.
Parse File¶
Load a text file and parse the content as a dict.
Load a text file and parse the content as a list of strings.
mmeval.metrics¶
Metrics¶
Accuracy: Top-k accuracy evaluation metric.
SingleLabelMetric (alias).
MultiLabelMetric (alias).
AveragePrecision: Calculate the average precision with respect to classes.
MeanIoU: MeanIoU evaluation metric.
COCODetection: COCO object detection task evaluation metric.
ProposalRecall: Proposal recall evaluation metric.
VOCMeanAP: Pascal VOC evaluation metric.
OIDMeanAP: Open Images Dataset detection evaluation metric.
F1Score: Compute F1 scores.
HmeanIoU: HmeanIoU metric.
EndPointError: EndPointError evaluation metric.
PCKAccuracy: PCK accuracy evaluation metric, which is widely used in pose estimation.
MpiiPCKAccuracy: PCKh accuracy evaluation metric for the MPII dataset.
JhmdbPCKAccuracy: PCK accuracy evaluation metric for the JHMDB dataset.
AVAMeanAP: AVA evaluation metric.
StructuralSimilarity: Calculate structural similarity (SSIM).
SignalNoiseRatio: Signal-to-Noise Ratio.
PeakSignalNoiseRatio: Peak Signal-to-Noise Ratio.
MeanAbsoluteError: Mean Absolute Error metric for images.
MeanSquaredError: Mean Squared Error metric for images.
Bilingual Evaluation Understudy (BLEU) metric.
Sum of Absolute Differences (SAD) metric for images.
Gradient error for evaluating alpha matte prediction.
Mean Squared Error metric for image matting.
Connectivity error for evaluating alpha matte prediction.
DOTA evaluation metric.
Calculate the ROUGE score, used for automatic summarization.
Calculate the Natural Image Quality Evaluator (NIQE) metric.
Perplexity measures how well a language model predicts a text sample.
Calculate character-level recall and precision.
EPE evaluation metric.
AUC evaluation metric.
NME evaluation metric.
Calculate word-level accuracy.