Welcome to MMEval’s documentation!

Introduction

MMEval is a machine learning evaluation library that supports efficient and accurate distributed evaluation on a variety of machine learning frameworks.

Major features:

  • Comprehensive metrics for various computer vision tasks (NLP will be covered soon!)

  • Efficient and accurate distributed evaluation, backed by multiple distributed communication backends

  • Support for multiple machine learning frameworks via a dynamic input dispatching mechanism

[Figure: MMEval architecture overview]

Installation and Usage

Installation

MMEval requires Python 3.6+ and can be installed via pip.

pip install mmeval

To install the dependencies required by all the metrics provided in MMEval, use the following command:

pip install 'mmeval[all]'

How to use

There are two ways to use MMEval’s metrics, using Accuracy as an example:

from mmeval import Accuracy
import numpy as np

accuracy = Accuracy()

The first way is to directly call the instantiated Accuracy object to calculate the metric.

labels = np.asarray([0, 1, 2, 3])
preds = np.asarray([0, 2, 1, 3])
accuracy(preds, labels)
# {'top1': 0.5}

The second way is to calculate the metric after accumulating data from multiple batches.

for i in range(10):
    labels = np.random.randint(0, 4, size=(100, ))
    predicts = np.random.randint(0, 4, size=(100, ))
    accuracy.add(predicts, labels)

accuracy.compute()
# {'top1': ...}

Support Matrix

Supported distributed communication backends

MPI4Py     | torch.distributed           | Horovod       | paddle.distributed | oneflow.comm
MPI4PyDist | TorchCPUDist, TorchCUDADist | TFHorovodDist | PaddleDist         | OneFlowDist

Supported metrics and ML frameworks

Note

The following table lists the metrics implemented by MMEval and the corresponding machine learning framework support. A check mark indicates that the data type of the corresponding framework (e.g. Tensor) can be directly passed for computation.

Note

MMEval is tested with PyTorch 1.6+, TensorFlow 2.4+, Paddle 2.2+, and OneFlow 0.8+.

Metric numpy.ndarray torch.Tensor tensorflow.Tensor paddle.Tensor oneflow.Tensor
Accuracy
SingleLabelMetric
MultiLabelMetric
AveragePrecision
MeanIoU
VOCMeanAP
OIDMeanAP
COCODetection
ProposalRecall
F1Score
HmeanIoU
PCKAccuracy
MpiiPCKAccuracy
JhmdbPCKAccuracy
EndPointError
AVAMeanAP
StructuralSimilarity
SignalNoiseRatio
PeakSignalNoiseRatio
MeanAbsoluteError
MeanSquaredError
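
For example, a check mark in the torch.Tensor column means that PyTorch tensors can be passed to the metric directly. A minimal sketch (assuming PyTorch is installed and that Accuracy accepts torch.Tensor inputs, as the table indicates):

import torch
from mmeval import Accuracy

accuracy = Accuracy()
labels = torch.tensor([0, 1, 2, 3])
preds = torch.tensor([0, 2, 1, 3])
accuracy(preds, labels)
# {'top1': 0.5}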

Implementing a Metric

To implement a metric in MMEval, you should implement a subclass of BaseMetric that overrides the add and compute_metric methods.

In the evaluation process, each metric will update self._results to store intermediate results after each call of add. When computing the final metric result, the self._results will be synchronized between processes.

An example of implementing a simple Accuracy metric:

import numpy as np
from mmeval.core import BaseMetric

class Accuracy(BaseMetric):

    def add(self, predictions, labels):
        self._results.append((predictions, labels))

    def compute_metric(self, results):
        predictions = np.concatenate(
            [res[0] for res in results])
        labels = np.concatenate(
            [res[1] for res in results])
        correct = (predictions == labels)
        accuracy = sum(correct) / len(predictions)
        return {'accuracy': accuracy}

Using Accuracy:

# stateless call
accuracy = Accuracy()
metric_results = accuracy(predictions=[1, 2, 3, 4], labels=[1, 2, 3, 1])
print(metric_results)
# {'accuracy': 0.75}

# Accumulate batch
for i in range(10):
    predicts = np.random.randint(0, 4, size=(10,))
    labels = np.random.randint(0, 4, size=(10,))
    accuracy.add(predicts, labels)

metric_results = accuracy.compute()
accuracy.reset()  # clear the intermediate results

Using Distributed Evaluation

Distributed evaluation generally uses a strategy of data parallelism, where each process executes the same program to process different data.

The supported distributed communication backends in MMEval can be viewed via list_all_backends.

import mmeval

print(mmeval.core.dist.list_all_backends())
# ['non_dist', 'mpi4py', 'tf_horovod', 'torch_cpu', 'torch_cuda', ...]

This section shows how to use MMEval with torch.distributed and MPI4Py for distributed evaluation, using the CIFAR-10 dataset as an example. The related code can be found at mmeval/examples/cifar10_dist_eval.

Prepare the evaluation dataset and model

First of all, we need to load the CIFAR-10 test data; we can use the dataset classes provided by Torchvision.

In addition, to be able to slice the dataset according to the number of processes in a distributed evaluation, we need to introduce the DistributedSampler.

import torchvision as tv
from torch.utils.data import DataLoader, DistributedSampler

def get_eval_dataloader(rank=0, num_replicas=1):
    dataset = tv.datasets.CIFAR10(
        root='./', train=False, download=True,
        transform=tv.transforms.ToTensor())
    dist_sampler = DistributedSampler(
        dataset, num_replicas=num_replicas, rank=rank)
    data_loader = DataLoader(dataset, batch_size=1, sampler=dist_sampler)
    return data_loader, len(dataset)

Secondly, we need to prepare the model to be evaluated; here we use resnet18 from Torchvision.

import torch
import torchvision as tv

def get_model(pretrained_model_fpath=None):
    model = tv.models.resnet18(num_classes=10)
    if pretrained_model_fpath is not None:
        model.load_state_dict(torch.load(pretrained_model_fpath))
    return model.eval()

Single process evaluation

After preparing the test data and the model, the model predictions can be evaluated using the mmeval.Accuracy metric. The following is an example of a single process evaluation.

import tqdm
import torch
from mmeval import Accuracy

eval_dataloader, total_num_samples = get_eval_dataloader()
model = get_model()
# Instantiate `Accuracy` and calculate the top1 and top3 accuracy
accuracy = Accuracy(topk=(1, 3))

with torch.no_grad():
    for images, labels in tqdm.tqdm(eval_dataloader):
        predicted_score = model(images)
        # Accumulate batch data, intermediate results will be saved in
        # `accuracy._results`.
        accuracy.add(predictions=predicted_score, labels=labels)

# Invoke `accuracy.compute` for metric calculation
print(accuracy.compute())
# Invoke `accuracy.reset` to clear the intermediate results saved in
# `accuracy._results`
accuracy.reset()

Distributed evaluation with torch.distributed

There are two distributed communication backends implemented in MMEval for torch.distributed, TorchCPUDist and TorchCUDADist.

There are two ways to set up a distributed communication backend for MMEval:

from mmeval.core import set_default_dist_backend
from mmeval import Accuracy

# 1. Set the global default distributed communication backend.
set_default_dist_backend('torch_cpu')

# 2. Initialize the evaluation metrics by passing `dist_backend`.
accuracy = Accuracy(dist_backend='torch_cpu')

Together with the above code for single process evaluation, the distributed evaluation can be implemented by adding the distributed environment startup and initialization.

import tqdm
import torch
from mmeval import Accuracy


def eval_fn(rank, process_num):
    # Distributed environment initialization
    torch.distributed.init_process_group(
        backend='gloo',
        init_method='tcp://127.0.0.1:2345',
        world_size=process_num,
        rank=rank)

    eval_dataloader, total_num_samples = get_eval_dataloader(rank, process_num)
    model = get_model()
    # Instantiate `Accuracy` and set up a distributed communication backend
    accuracy = Accuracy(topk=(1, 3), dist_backend='torch_cpu')

    with torch.no_grad():
        for images, labels in tqdm.tqdm(eval_dataloader, disable=(rank!=0)):
            predicted_score = model(images)
            accuracy.add(predictions=predicted_score, labels=labels)

    # Specify the number of dataset samples by size in order to remove
    # duplicate samples padded by the `DistributedSampler`.
    print(accuracy.compute(size=total_num_samples))
    accuracy.reset()


if __name__ == "__main__":
    # Number of distributed processes
    process_num = 3
    # Launching distributed with spawn
    torch.multiprocessing.spawn(
        eval_fn, nprocs=process_num, args=(process_num, ))

Distributed evaluation with MPI4Py

MMEval decouples the distributed communication capability from any specific framework. While the above example uses PyTorch for the model and data loading, we can still use distributed communication backends other than torch.distributed to implement distributed evaluation.

The following will show how to use MPI4Py as a distributed communication backend for distributed evaluation.

First, you need to install openmpi and MPI4Py; it is recommended to install them with conda.

conda install openmpi
conda install mpi4py

Then modify the above code to use MPI4Py as the distributed communication backend:

# cifar10_eval_mpi4py.py

import tqdm
from mpi4py import MPI
import torch
from mmeval import Accuracy


def eval_fn(rank, process_num):
    eval_dataloader, total_num_samples = get_eval_dataloader(rank, process_num)
    model = get_model()
    accuracy = Accuracy(topk=(1, 3), dist_backend='mpi4py')

    with torch.no_grad():
        for images, labels in tqdm.tqdm(eval_dataloader, disable=(rank!=0)):
            predicted_score = model(images)
            accuracy.add(predictions=predicted_score, labels=labels)

    print(accuracy.compute(size=total_num_samples))
    accuracy.reset()


if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    eval_fn(comm.Get_rank(), comm.Get_size())

Use mpirun as the distributed launch method:

# Launch 3 processes with mpirun
mpirun -np 3 python3 cifar10_eval_mpi4py.py

MMCls

BaseMetric in MMEval follows the design of the mmengine.evaluator module and introduces a distributed communication backend abstraction to meet the needs of diverse distributed communication libraries.

Therefore, MMEval naturally supports evaluation with the OpenMMLab 2.0 algorithm libraries, and the MMEval metrics can be used in the OpenMMLab 2.0 algorithm libraries without modification.

For example, to use mmeval.Accuracy in MMCls, just set the metric type to Accuracy in the config:

val_evaluator = dict(type='Accuracy', topk=(1, ))

test_evaluator = val_evaluator

MMEval’s support for the OpenMMLab 2.0 algorithm libraries is being gradually improved, and the supported metrics can be viewed in the support matrix.

TensorPack

TensorPack is a neural network training interface on TensorFlow, with a focus on speed and flexibility.

There are many examples of classic models and tasks provided in the TensorPack repository. This section shows how to use mmeval.COCODetection for evaluation in TensorPack-FasterRCNN, and the related code can be found at mmeval/examples/tensorpack.

First you need to install TensorFlow and TensorPack, then follow the preparation steps in the TensorPack-FasterRCNN example to install the dependencies and prepare the COCO dataset, as well as download the pre-trained model weights to be evaluated.

Scripts for model evaluation are provided in predict.py, and the model can be evaluated with the following commands:

./predict.py --evaluate output.json --load /path/to/Trained-Model-Checkpoint --config SAME-AS-TRAINING

MMEval provides an evaluation script for TensorPack-FasterRCNN that uses mmeval.COCODetection. This evaluation script needs to be placed in the TensorPack-FasterRCNN example directory, and then the evaluation can be executed with the following command.

# run evaluation
python tensorpack_mmeval.py --load <model_path>

# launch multi-gpus evaluation by mpirun
mpirun -np 8 python tensorpack_mmeval.py --load <model_path>

We tested this evaluation script on COCO-MaskRCNN-R50C41x and got the same evaluation results as reported by TensorPack.

Model                 | mAP (box) | mAP (mask) | Configurations
COCO-MaskRCNN-R50C41x | 36.2      | 31.8       | MODE_FPN=False

PaddleSeg

PaddleSeg is a semantic segmentation algorithm library based on Paddle that supports many downstream tasks related to semantic segmentation.

This section shows how to use mmeval.MeanIoU for evaluation in PaddleSeg, and the related code can be found at mmeval/examples/paddleseg.

First, you need to install Paddle and PaddleSeg; you can refer to the installation documentation of PaddleSeg. In addition, you need to download the pre-trained model to be evaluated and prepare the evaluation data according to the configuration.

Scripts for model evaluation are provided in the PaddleSeg repo, and the model can be evaluated with the following commands:

python val.py --config <config_path> --model_path <model_path>

Note that the val.py script in PaddleSeg only supports single-GPU evaluation; it does not support multi-GPU evaluation yet.

MMEval provides an evaluation script for PaddleSeg that uses mmeval.MeanIoU, which can be executed with the following command:

# run evaluation
python ppseg_mmeval.py --config <config_path> --model_path <model_path>

# run evaluation with multi-gpus
python ppseg_mmeval.py --config <config_path> --model_path <model_path> --launcher paddle --num_process <num_gpus>

We tested this evaluation script on fastfcn_resnet50_os8_ade20k_480x480_120k and got the same evaluation results as the val.py in PaddleSeg.

Config                                   | Weights        | mIoU   | aAcc   | Kappa  | mDice
fastfcn_resnet50_os8_ade20k_480x480_120k | model.pdparams | 0.4373 | 0.8074 | 0.7928 | 0.5772

BaseMetric Design

During the evaluation process, each GPU usually infers results on a subset of the dataset in a data-parallel fashion to speed up evaluation.

Most of the time, we cannot simply reduce the metric results computed on each subset of the dataset to obtain the metric result of the whole dataset.

Therefore, the usual practice is to save the inference results or the intermediate results of the metric calculation obtained by each process, then perform an all-gather operation across all processes, and finally calculate the metric results over the entire evaluation dataset.

The above operations are handled by BaseMetric in MMEval, and its interface design is shown below:

classDiagram
    class BaseMetric
    BaseMetric : +{BaseDistBackend} dist_comm
    BaseMetric : +str dist_collect_mode
    BaseMetric : +dict dataset_meta
    BaseMetric : #list _results
    BaseMetric : +reset()
    BaseMetric : +compute()
    BaseMetric : +{abstractmethod} add()
    BaseMetric : +{abstractmethod} compute_metric()

The add and compute_metric methods are interfaces that need to be implemented by users. For more details, please refer to Custom Evaluation Metrics.

It can be seen from the BaseMetric interface that the main function of BaseMetric is to provide distributed evaluation. The basic process is as follows:

  1. The user calls the add method to save the inference result or the intermediate result of the metric calculation in the BaseMetric._results list.

  2. The user calls the compute method, and BaseMetric synchronizes the data in the _results list across processes and calls the user-defined compute_metric method to calculate the metrics.
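
As a rough mental model of these two steps, the toy class below (illustrative only; it is not the real BaseMetric implementation, and the class name and constructor are made up) sketches how add and compute could interact with a BaseDistBackend-style communicator:

# Simplified sketch of the add/compute flow (illustrative, not BaseMetric itself).
class DistributedMetricSketch:

    def __init__(self, dist_comm):
        # `dist_comm` is assumed to expose the BaseDistBackend interface.
        self.dist_comm = dist_comm
        self._results = []

    def add(self, result):
        # Save per-batch intermediate results locally.
        self._results.append(result)

    def compute_metric(self, results):
        # To be overridden by a concrete metric.
        raise NotImplementedError

    def compute(self, size=None):
        # 1. All-gather the per-process intermediate results.
        gathered = self.dist_comm.all_gather_object(self._results)
        # 2. Flatten the per-rank lists; the real BaseMetric orders and trims
        #    them according to `dist_collect_mode` and `size`.
        results = [item for rank_results in gathered for item in rank_results]
        # 3. Compute the metric on rank 0 and broadcast it to every process.
        metric = self.compute_metric(results) if self.dist_comm.rank == 0 else None
        return self.dist_comm.broadcast_object(metric, src=0)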

In addition, BaseMetric also handles the case where, in distributed evaluation, some processes pad repeated data samples so that all processes have the same number of samples (e.g. DistributedSampler in PyTorch). Such padding would otherwise affect the correctness of the computed metrics.

To deal with this problem, BaseMetric.compute can receive a size parameter, which represents the actual number of samples in the evaluation dataset. After _results is synchronized across processes, the padded samples are removed according to dist_collect_mode to achieve correct metric calculation.
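
As a toy illustration of why size matters (this is not MMEval internals; it assumes a non-shuffled DistributedSampler, which assigns every world_size-th sample to each rank and pads the tail with samples from the beginning of the dataset):

# Toy illustration of DistributedSampler-style padding and trimming (not MMEval code).
world_size = 3
dataset = list(range(10))           # 10 real samples
padded = dataset + dataset[:2]      # padded to 12 so that 12 % 3 == 0

# Each rank processes every `world_size`-th sample of the padded index list.
per_rank = [padded[rank::world_size] for rank in range(world_size)]

# Gather across ranks, restore dataset order by interleaving, then trim to `size`.
gathered = [sample for group in zip(*per_rank) for sample in group]
restored = gathered[:len(dataset)]  # drops the 2 padded duplicates
assert restored == dataset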

Note

Be aware that the intermediate results stored in _results should correspond one-to-one with the samples, since the padded samples need to be removed to obtain the most accurate result.

Distributed Communication Backend

The distributed communication capabilities required by MMEval during distributed evaluation mainly include the following:

  • All-gather the intermediate results of the metric saved in each process.

  • Broadcast the metric result calculated by the rank 0 process to all processes.

In order to flexibly support multiple distributed communication libraries, MMEval abstracts the above distributed communication requirements and defines a distributed communication interface BaseDistBackend:

classDiagram
    class BaseDistBackend
    BaseDistBackend : +bool is_initialized
    BaseDistBackend : +int rank
    BaseDistBackend : +int world_size
    BaseDistBackend : +all_gather_object()
    BaseDistBackend : +broadcast_object()

To implement a distributed communication backend, you need to inherit BaseDistBackend and implement the above interfaces, where:

  • is_initialized: identifies whether the initialization of the distributed communication environment has been completed.

  • rank: the rank index of the current process group.

  • world_size: the world size of the current process group.

  • all_gather_object: performs the all_gather operation on any Python object that can be serialized by Pickle.

  • broadcast_object: broadcasts any Python object that can be serialized by Pickle.

Take the implementation of MPI4PyDist as an example:

import os
from typing import Any, List

from mpi4py import MPI

from mmeval.core.dist_backends import BaseDistBackend


class MPI4PyDist(BaseDistBackend):
    """A distributed communication backend for mpi4py."""

    @property
    def is_initialized(self) -> bool:
        """Returns True if the distributed environment has been initialized."""
        return 'OMPI_COMM_WORLD_SIZE' in os.environ

    @property
    def rank(self) -> int:
        """Returns the rank index of the current process group."""
        comm = MPI.COMM_WORLD
        return comm.Get_rank()

    @property
    def world_size(self) -> int:
        """Returns the world size of the current process group."""
        comm = MPI.COMM_WORLD
        return comm.Get_size()

    def all_gather_object(self, obj: Any) -> List[Any]:
        """All gather the given object from the current process group and
        returns a list consisting gathered object of each process."""
        comm = MPI.COMM_WORLD
        return comm.allgather(obj)

    def broadcast_object(self, obj: Any, src: int = 0) -> Any:
        """Broadcast the given object from source process to the current
        process group."""
        comm = MPI.COMM_WORLD
        return comm.bcast(obj, root=src)

Some distributed communication backends have been implemented in MMEval, which can be viewed in the support matrix.

Multiple Dispatch

MMEval aims to support multiple machine learning frameworks. One of the simplest solutions is to implement the computation of all metrics in NumPy.

Since all machine learning frameworks have Tensor data types that can be converted to numpy.ndarray, this can satisfy most of the evaluation requirements.

However, there may be some problems in some cases:

  • NumPy lacks some common operators, such as topk, which can slow down the computation of the evaluation metrics.

  • It is time-consuming to move a large number of Tensors from CUDA devices to CPU memory.

Alternatively, if the metric computation needs to be differentiable, the Tensor data type of the respective machine learning framework has to be used for the computation.

To deal with the above, MMEval’s evaluation metrics provide some implementations computed with specific machine learning frameworks, which can be found in the support matrix.

Meanwhile, in order to dispatch among the different metric calculation implementations, MMEval adopts a dynamic multiple dispatch mechanism based on type hints, which can select the corresponding calculation method according to the input data types.

A simple example of multiple dispatch based on type hints is shown below:

from mmeval.core import dispatch

@dispatch
def compute(x: int, y: int):
    print('this is int')

@dispatch
def compute(x: str, y: str):
    print('this is str')

compute(1, 1)
# this is int

compute('1', '1')
# this is str

Currently, we use plum-dispatch to implement the multiple dispatch mechanism in MMEval. On top of plum-dispatch, some speed optimizations have been made, and it has been extended to support typing.ForwardRef.
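
Building on this mechanism, framework-specific implementations can be registered with forward references, so that an optional framework is only imported when its tensors are actually passed in. The sketch below is illustrative (the _accuracy helper is hypothetical, not an MMEval function; it assumes that @dispatch resolves the 'torch.Tensor' string annotation as a forward reference, as described above):

import numpy as np

from mmeval.core import dispatch

@dispatch
def _accuracy(predictions: np.ndarray, labels: np.ndarray):
    # NumPy implementation.
    return float((predictions == labels).mean())

@dispatch
def _accuracy(predictions: 'torch.Tensor', labels: 'torch.Tensor'):
    # PyTorch implementation; the string annotation is a forward reference,
    # so torch only needs to be importable when torch tensors are passed in.
    return (predictions == labels).float().mean().item()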

Warning

Due to the dynamically typed nature of Python, determining the exact type of a variable at runtime can be time-consuming, especially for large nested data structures. Therefore, the dynamic multiple dispatch mechanism based on type hints may have some performance problems; for more information, see wesselb/plum/issues/53.

mmeval.core

base_metric

BaseMetric

Base class for metric.

dist

list_all_backends

Returns a list of all distributed backend names.

set_default_dist_backend

Set the given distributed backend as the default distributed backend.

get_dist_backend

Returns distributed backend by the given distributed backend name.
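
A minimal usage sketch of these functions (the backend name 'non_dist' is taken from the list_all_backends output shown earlier; get_dist_backend is assumed to return an object exposing the BaseDistBackend interface):

import mmeval
from mmeval.core import set_default_dist_backend
from mmeval.core.dist import get_dist_backend

print(mmeval.core.dist.list_all_backends())
# ['non_dist', 'mpi4py', ...]

# Make a backend the global default for newly created metrics.
set_default_dist_backend('non_dist')

# Fetch a backend by name.
backend = get_dist_backend('non_dist')
print(backend.rank, backend.world_size)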

dispatch

dispatch

A Dispatcher inherited from plum.Dispatcher that resolves typing.ForwardRef.

mmeval.core.dist_backends

dist_backends

BaseDistBackend

The base backend of distributed communication used by mmeval Metric.

TensorBaseDistBackend

A base backend for Tensor-based distributed communication, e.g. PyTorch.

NonDist

A dummy distributed communication backend for non-distributed environments.

MPI4PyDist

A distributed communication backend for mpi4py.

TorchCPUDist

A CPU distributed communication backend for torch.distributed.

TorchCUDADist

A CUDA distributed communication backend for torch.distributed.

TFHorovodDist

A distributed communication backend for horovod.tensorflow.

PaddleDist

A distributed communication backend for paddle.distributed.

OneFlowDist

A distributed communication backend for oneflow.
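
All of these backends expose the BaseDistBackend interface described earlier. A minimal sketch with the dummy NonDist backend (assuming it can be instantiated without arguments; in a non-distributed run it should report a single-process world):

from mmeval.core.dist_backends import NonDist

backend = NonDist()
print(backend.rank, backend.world_size)
# expected: 0 1

# In a single-process world, all_gather_object should just wrap the object in a list.
print(backend.all_gather_object({'top1': 0.5}))
# expected: [{'top1': 0.5}]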

mmeval.fileio

File Backend

BaseStorageBackend

Abstract class of storage backends.

LocalBackend

Raw local storage backend.

HTTPBackend

HTTP and HTTPS storage backend.

LmdbBackend

Lmdb storage backend.

MemcachedBackend

Memcached storage backend.

PetrelBackend

Petrel storage backend (for internal usage).

register_backend

Register a backend.

File Handler

BaseFileHandler

A base class for file handler.

JsonHandler

A JSON handler that parses JSON data from file objects.

PickleHandler

A Pickle handler that parses pickle data from file objects.

YamlHandler

A YAML handler that parses YAML data from file objects.

register_handler

A decorator that registers a handler for some file extensions.
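
As a sketch of how a custom handler might be registered (the method names below follow the common mmcv/mmengine file-handler convention and are an assumption here; check BaseFileHandler for the exact abstract methods):

from mmeval.fileio import BaseFileHandler, register_handler

@register_handler('txt')
class TxtHandler(BaseFileHandler):
    """A hypothetical plain-text handler registered for the .txt extension."""

    def load_from_fileobj(self, file, **kwargs):
        # Assumed abstract method: parse the content of an opened file object.
        return file.read()

    def dump_to_fileobj(self, obj, file, **kwargs):
        # Assumed abstract method: write the object to an opened file object.
        file.write(str(obj))

    def dump_to_str(self, obj, **kwargs):
        # Assumed abstract method: serialize the object to a string.
        return str(obj)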

File IO

load

Load data from json/yaml/pickle files.

exists

Check whether a file path exists.

get

Read bytes from a given filepath with ‘rb’ mode.

get_file_backend

Return a file backend based on the prefix of uri or backend_args.

get_local_path

Download data from filepath and write the data to local path.

get_text

Read text from a given filepath with ‘r’ mode.

isdir

Check whether a file path is a directory.

isfile

Check whether a file path is a file.

join_path

Concatenate all file paths.

list_dir_or_file

Scan a directory to find the interested directories or files in arbitrary order.

Parse File

dict_from_file

Load a text file and parse the content as a dict.

list_from_file

Load a text file and parse the content as a list of strings.
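
A minimal usage sketch of the file IO and parsing helpers listed above (the file paths are hypothetical):

from mmeval import fileio

# Load structured data; the handler is chosen from the file extension.
results = fileio.load('results.json')

# Basic path queries and raw reads.
if fileio.exists('annotations.txt') and fileio.isfile('annotations.txt'):
    text = fileio.get_text('annotations.txt')
    lines = fileio.list_from_file('annotations.txt')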

mmeval.metrics

Metrics

Accuracy

Top-k accuracy evaluation metric.

SingleLabelMetric

alias of mmeval.metrics.precision_recall_f1score.SingleLabelPrecisionRecallF1score

MultiLabelMetric

alias of mmeval.metrics.precision_recall_f1score.MultiLabelPrecisionRecallF1score

AveragePrecision

Calculate the average precision with respect to classes.

MeanIoU

MeanIoU evaluation metric.

COCODetection

COCO object detection task evaluation metric.

ProposalRecall

Proposals recall evaluation metric.

VOCMeanAP

Pascal VOC evaluation metric.

OIDMeanAP

Open Images Dataset detection evaluation metric.

F1Score

Compute F1 scores.

HmeanIoU

HmeanIoU metric.

EndPointError

EndPointError evaluation metric.

PCKAccuracy

PCK accuracy evaluation metric, which is widely used in pose estimation.

MpiiPCKAccuracy

PCKh accuracy evaluation metric for MPII dataset.

JhmdbPCKAccuracy

PCK accuracy evaluation metric for Jhmdb dataset.

AVAMeanAP

AVA evaluation metric.

StructuralSimilarity

Calculate StructuralSimilarity (structural similarity).

SignalNoiseRatio

Signal-to-Noise Ratio.

PeakSignalNoiseRatio

Peak Signal-to-Noise Ratio.

MeanAbsoluteError

Mean Absolute Error metric for image.

MeanSquaredError

Mean Squared Error metric for image.

BLEU

Bilingual Evaluation Understudy metric.

SumAbsoluteDifferences

Sum of Absolute Differences metric for image.

GradientError

Gradient error for evaluating alpha matte prediction.

MattingMeanSquaredError

Mean Squared Error metric for image matting.

ConnectivityError

Connectivity error for evaluating alpha matte prediction.

DOTAMeanAP

DOTA evaluation metric.

ROUGE

Calculate Rouge Score used for automatic summarization.

NaturalImageQualityEvaluator

Calculate the Natural Image Quality Evaluator (NIQE) metric.

Perplexity

Perplexity measures how well a language model predicts a text sample.

CharRecallPrecision

Calculate the char level recall & precision.

KeypointEndPointError

EPE evaluation metric.

KeypointAUC

AUC evaluation metric.

KeypointNME

NME evaluation metric.

WordAccuracy

Calculate the word level accuracy.
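
All of the metrics above follow the same add/compute interface shown earlier. A quick sketch with MeanIoU (assuming it accepts a num_classes argument and integer label maps; see the API reference for the exact signature):

import numpy as np
from mmeval import MeanIoU

num_classes = 4
mean_iou = MeanIoU(num_classes=num_classes)

# Integer label maps of shape (batch, height, width) -- an assumed input format.
labels = np.random.randint(0, num_classes, size=(2, 10, 10))
preds = np.random.randint(0, num_classes, size=(2, 10, 10))
print(mean_iou(preds, labels))
# a dict of IoU-related results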

mmeval.utils

misc

try_import

Try to import a module.

has_method

Check whether the object has a method.

is_seq_of

Check whether it is a sequence of some type.

is_list_of

Check whether it is a list of some type.

is_tuple_of

Check whether it is a tuple of some type.

is_filepath

Check if the given object is Path-like.
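
A minimal usage sketch of these helpers (assuming try_import returns the imported module on success and None on failure):

from mmeval.utils import try_import, has_method, is_seq_of, is_list_of

# Optional-dependency import: returns the module or None (assumed behavior).
torch = try_import('torch')
if torch is not None:
    print(torch.__version__)

print(is_list_of([1, 2, 3], int))     # True
print(is_seq_of((1.0, 2.0), float))   # True
print(has_method([], 'append'))       # True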
