Adaptive multi-level conditional random fields for detection and segmentation of small enhanced pathology in medical images

Medical Image Analysis, In Press, Corrected Proof, Available online 11 July 2015

Highlights

 

  • Introducing a probabilistic Adaptive Multi-level Conditional Random Field (AMCRF) to address the problem of small enhanced pathology segmentation.
  • Incorporating higher order cliques to better model the variable interactions.
  • Exploring the effect of multiple higher order textural patterns in order to detect structures of interest.
  • Investigating the effect of several different parameter learning and inference algorithms for the proposed graphical model.
  • Testing the proposed model on large multi-center clinical trials of Relapsing-Remitting MS patients, where results show 90% sensitivity with a 16% false detection rate.

Abstract

Detection and segmentation of large structures in an image or within a region of interest have received great attention in the medical image processing domain. However, the problem of small pathology detection and segmentation remains an unresolved challenge due to the small size of these pathologies, their low contrast, and their variable position, shape and texture. In many contexts, early detection of these pathologies is critical for diagnosis and for assessing the outcome of treatment. In this paper, we propose a probabilistic Adaptive Multi-level Conditional Random Field (AMCRF) incorporating higher order cliques for detecting and segmenting such pathologies. In the first level of our graphical model, a voxel-based CRF is used to identify candidate lesions. In the second level, in order to further remove falsely detected regions, a new CRF is developed that incorporates higher order textural features, which are invariant to rotation and local intensity distortions. At this level, higher order textures are considered together with the voxel-wise cliques to refine boundaries; the model is therefore adaptive. The proposed algorithm is tested in the context of detecting enhancing Multiple Sclerosis (MS) lesions in brain MRI, where the problem is further complicated as many of the enhancing voxels are associated with normal structures (i.e. blood vessels) or noise in the MRI. The algorithm is trained and tested on large multi-center clinical trials of Relapsing-Remitting MS patients. The effect of several different parameter learning and inference techniques is further investigated. When tested on 120 cases, the proposed method reaches a lesion detection rate of 90%, with very few false positive lesion counts on average, ranging from 0.17 for very small (3–5 voxels) to 0 for very large (50+ voxels) regions. The proposed model is further tested on a very large clinical trial containing 2770 scans, where a high sensitivity of 91% with an average false positive count of 0.5 is achieved. The incorporation of contextual information at different scales is also explored. Finally, superior performance is shown in comparison with a Support Vector Machine (SVM), a Random Forest and a variant of an MRF.


Keywords: Probabilistic graphical models, CRF, Automatic detection and segmentation, Multiple sclerosis, MRI.

1. Introduction

The task of pathology segmentation in medical imaging is a challenging problem due, in part, to the shortage of robust shape, size and location priors, and to the difficulty of modeling intensities and texture patterns given their large variability over a population. There exists a wide and diverse set of contexts where it is important to first detect and then segment (possibly very small) pathologies among other candidates that can be quite similar in appearance (Baek et al., 2012; Johnson et al., 2013; Karimaghaloo et al., 2012b). In many cases, early detection of these pathologies can be crucial in disease staging and in assessing treatment outcome. This includes domains where contrast-enhancing agents, such as Gadolinium, are injected into patients, resulting in images where new pathological activity, such as within cancer cells or lesions, becomes enhanced in some imaging modalities and thereby easier to see (Fig. 1(a)–(c)). In these contexts, the problem is more difficult than in a typical pathology segmentation setting because, among other reasons, other healthy structures are often enhanced as well, rendering the primary task the detection of all the pathologies of interest. These structures can vary substantially in size, location and texture and can be as small as 3 or 4 voxels, leaving little margin for error. In fact, failing to detect an existing pathology (i.e. a False Negative – FN) or incorrectly labeling a healthy structure as pathology (i.e. a False Positive – FP) has serious ramifications for both the diagnosis and the assessment of treatment effect. The problem is further complicated because the contrast between the target and the background can be very low. Some non-probabilistic approaches, such as those of Datta et al. (2007) and He and Narayana (2002), have been proposed to address this issue. However, they depend on prior segmentation of other structures in order to remove the FPs.


Fig. 1 First row shows examples of small pathologies requiring detection. Arrows point to: (a) enhanced MS lesions in brain MRI ( Karimaghaloo et al., 2012b ), (b) ductal carcinoma in breast cancer ( Johnson et al., 2013 ) and (c) hepatocellular carcinoma in liver cancer ( Baek et al., 2012 ). Second row shows examples of large structure segmentations in medical imaging: (d) shows different abdominal organs to be segmented from CT (liver in cyan, spleen in green, right kidney in yellow) ( Linguraru et al., 2010 ), (e) shows segmentation of the left ventricle ( Ayed et al., 2009 ) and (f) shows segmentation of a brain tumor (in red) ( Subbanna et al., 2013 ). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article).


Fig. 2 Adaptive multi-level CRF framework. (a) Shows different stages of the algorithm. Numbers on the arrows indicate the order of the process. (b) Shows a test image. (c) Shows the result of the voxel-level CRF together with the bounding box surrounding each candidate (stage II). (d) Shows the final results of the AMCRF model (stage IV) with the incorporation of higher order features along with the voxel-wise interactions.

In the field of computer vision, probabilistic graphical models in the form of random fields have provided a principled way of capturing neighboring data dependencies. As a result, methods such as Markov Random Fields (MRFs) have been extensively used to model many segmentation problems (Blake et al., 2011; Li and Singh, 2009). However, computation of the joint distribution becomes intractable in the generative MRF, resulting in simplifying assumptions (such as the independence of observations). Furthermore, the incorporation of data-dependent interactions is not straightforward in a traditional MRF. Hence, discriminative variants of the MRF, i.e. Conditional Random Fields (CRFs) (Lafferty et al., 2001), have been proposed and are widely used both in computer vision and in medical imaging (Ayed et al., 2009; Bhole et al., 2014; Boix et al., 2012; He et al., 2004; Hu et al., 2008; Kohli et al., 2009; Kumar and Hebert, 2006; Ladicky et al., 2009; Ladickỳ et al., 2010; Shotton et al., 2006). However, most of these methods focus on segmenting a central object or a healthy structure from the surrounding tissue in a known general region of interest (Fig. 1(d) and (e)). In these contexts, rich features can often be extracted based on intensity or texture patterns that render the object distinctive from the surrounding background. Moreover, location, size and shape models can be learned and exploited in order to further improve the segmentation results. In the context of pathology segmentation where the pathology of interest is large and there is only one in the image (e.g. brain tumors – Fig. 1(f)), techniques have managed to exploit some prior knowledge and texture information to delineate the pathology, particularly if one can leverage texture homogeneity within sub-regions (Bauer et al., 2011; Hao et al., 2012; Lee et al., 2008; Subbanna et al., 2013).

There has been some work (Karimaghaloo et al., 2010, 2012a, 2012b) where adaptations of CRFs were proposed for the context of small enhanced pathology segmentation and were shown to outperform standard MRF, SVM and linear regression models. While Karimaghaloo et al. (2010) and Karimaghaloo et al. (2012b) incorporate mainly local, voxel-level features, Karimaghaloo et al. (2012a) includes some higher order terms, but the features used are not expressive enough to characterize the context and hence FPs still remain. Intensities at each voxel might be distorted due to the presence of noise or other artifacts. Hence, higher order textural patterns that are robust to local intensity distortions should be incorporated into the model to remove the FPs.

In this work, we propose an Adaptive Multi-level Conditional Random Field (AMCRF) classifier for the task of small enhanced pathology (commonly known as enhancing lesion) segmentation. The proposed model ( Fig. 2 ) works at two different levels of graphical modeling. In the first stage, we introduce a voxel-level CRF model, with cliques of up to size 3, to generate candidate lesions. At this level, the classifier is tuned to be highly sensitive at the expense of additional FP detections. Voxel-level labels are used to group together and identify candidate lesions. In the second stage, as we are left with only relatively few candidates, the model can efficiently incorporate more computationally expensive higher order features. As opposed to traditional hierarchical graphical models, a novel adaptive CRF is developed to both remove FP lesion candidates and refine the boundaries of the detected lesions. To this end, voxel-wise interactions and additional higher order features are optimized together at the second stage of inference. The method extends preliminary work ( Karimaghaloo et al., 2013 ) in several ways, including exploring the effect of different texture models (independently and combined): local intensity histogram descriptors (the spin image) ( Lazebnik et al., 2005 ), the Rotationally Invariant Feature Transform (RIFT) ( Lazebnik et al., 2005 ), and the Local Binary Pattern (LBP) ( Ojala et al., 2002 ). These textural descriptors encode intensity patterns and gradient orientations around a reference point and are invariant to rotation and local intensity distortions. Moreover, the relatively simple graphical structure used in Karimaghaloo et al. (2013) is replaced with a more complete model in which higher order nodes and their corresponding pairwise edges are included to better capture variable interactions.

A CRF-based segmentation approach is proposed in Hao et al. (2012) for breast lesion segmentation, where different hypotheses based on all image cues are included to train a single CRF framework. However, their framework relies heavily on texture homogeneity within sub-regions, and performance is only shown on relatively large breast lesions; the efficacy of their approach in detecting small lesions is therefore not clear. The form of the higher order cliques defined in our model is very similar to the work of Boix et al. (2012) and Ladickỳ et al. (2010) in computer vision on object detection (e.g. buildings, pedestrians). However, their models focus on the problem of multi-class segmentation, and the type of higher order cliques they exploit requires complex learning and inference algorithms. The higher order term proposed in this work is specifically tailored to binary classification problems and can easily be decomposed into pairwise interactions. As a result, conventional learning and inference methods are readily applicable.


Fig. 3 An example of a brain MR image with MS Gad-enhancing lesions. (a) and (b) show the pre- and post-contrast T1w MR images. (c) Shows the binary mask of all voxels having sufficient enhancement to be considered as enhancing lesion candidates. Based on common criteria for clinical trials, the binary mask depicts all voxels showing more than or equal to 20% enhancement when comparing the T1w pre- and post-contrast images. Here, the true lesions are marked with green while other non-lesional enhancements are shown with red. (d) Shows the only two manually labeled active lesions marked in green. (e) and (f) show zoomed images of one of the enhanced lesions in the pink square in (d) without and with manual labels respectively. Note the small size and low contrast of the lesion. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article).

Exact parameter learning in CRF models is often intractable due to the difficulty of computing the partition function (Alahari et al., 2010; Nowozin et al., 2010). Therefore, different approximation methods have been proposed to address this problem. Methods such as pseudo negative log-likelihood (PNLL) ( Kumar and Hebert, 2006 ) (or piecewise negative pseudo-likelihood; Sutton and McCallum, 2007 ) aim to approximate the likelihood function by factorizing it over the individual nodes (or edges) in the graph. High performance is often achieved when these methods are used together with Iterated Conditional Modes (ICM) ( Besag, 1986 ) inference, since both consider local marginals to approximate the original intractable problem ( Kumar and Hebert, 2006 ). There is also a wide range of approaches that rely on approximate inference methods (e.g. tree-reweighted belief propagation (TRBP); Kolmogorov, 2006 ) to infer the marginal probabilities induced by the partition function during learning. Recently, with the emergence of efficient decoding methods such as graph cuts ( Boykov and Kolmogorov, 2004 ) and TRBP, maximum margin learning approaches have also gained considerable interest ( Szummer et al., 2008 ). We compare some of these approaches for learning and inference and assess their effect on the final segmentation results.

In this paper, the application we focus on is the detection and segmentation of enhancing Multiple Sclerosis (MS) lesions in Magnetic Resonance Images (MRI) of patient brains. MS is the most common neurological disease of young adults and is characterized by a wide range of symptoms; at present, there is no cure. T1-weighted (T1w) MRI scans obtained after injection of Gadolinium-based contrast agents are sensitive in detecting areas of acute inflammation known as Gad-enhancing lesions 1 which are typically very small ( Fig. 3 ). The number of these lesions is an important biomarker of disease activity and can be used in the development and evaluation of new treatments for MS ( Sormani et al., 2009 ). However, detecting these lesions inherits all the aforementioned difficulties: they are typically very small, and the contrast agents cause blood vessels and other healthy structures to become similarly highlighted, making the problem analogous to searching for a “needle in a haystack” ( Fig. 3 (c)). All of this renders modeling difficult, and maintaining low FP and FN counts very challenging. Clinical protocol stipulates that the identification of such pathologies be performed manually or semi-manually. However, in addition to being time-consuming, manual detection and segmentation are prone to intra- and inter-expert variability, making analysis based on the results unreliable. Therefore, there is great interest in accurately detecting these pathologies automatically ( García-Lorenzo et al., 2013 ).

The proposed algorithm is trained and tested on very large multi-center clinical trials of Relapsing-Remitting MS (RRMS) patients. At the target sensitivity of 90%, average false positive counts per scan range from 0.17 for very small lesions (3–5 voxels) to 0 for very large ones (over 50 voxels). Moreover, the effect of different learning and inference approaches is investigated, and various higher order textural patterns at different scales are further examined. Finally, the performance of the AMCRF model is compared with state-of-the-art segmentation techniques such as the Support Vector Machine (SVM), Random Forest (RF) and a variant of the MRF, with superior performance achieved by the proposed model.

2. Background

Let X and Y denote the input image and its associated labels, respectively. Assume each image consists of P voxels. The observation vectors and labels are written as $X = \{x_1, \ldots, x_P\}$ and $Y = \{y_1, \ldots, y_P\}$. Each $x_i$ represents the feature vector at voxel i (e.g. the intensities of the different MR sequences available), and $y_i$ assigns voxel i to one of L available classes (i.e. $y_i \in \{1, \ldots, L\}$). Given a test image, the goal of probabilistic classification is to infer the posterior distribution of its labels given the observations, i.e. p(Y|X). Therefore, each voxel in X is assigned to the class which maximizes the joint posterior probability of the labels. Representing an image by a graph, in which voxels and their dependencies correspond to nodes and edges respectively, the CRF models the posterior distribution as a Gibbs distribution:

(1)  $p(Y|X) = \frac{1}{Z(X)} \exp\left(-\sum_{c \in C} \lambda_c\, E(y_c|X)\right)$

where E(y_c|X) > 0 is the clique energy associated with a configuration y_c and $\lambda_c$ is its corresponding parameter. Z(X) is the partition function and C is the set of all cliques in the graph. The exponent $\sum_{c \in C} \lambda_c E(y_c|X)$ is the graph energy, which can be decomposed into individual clique energies. y_c represents the joint labeling of clique c ($y_c = \{y_i : i \in c\}$). Most classification problems in vision or medical imaging are formulated as a pairwise CRF whose energy term consists of cliques of up to size two:

(2)  $E(Y|X) = \sum_{i} \lambda_i\, E(y_i|X) + \sum_{(i,j) \in \mathcal{N}} \lambda_{ij}\, E(y_i, y_j|X)$

where $\mathcal{N}$ denotes the set of neighboring voxel pairs.
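For intuition, a pairwise CRF of this form can be evaluated exactly on a toy graph by enumerating all label configurations. The sketch below (plain NumPy, with made-up energies rather than anything from the paper) computes the Gibbs posterior by brute force, including the partition function Z(X):

```python
import itertools
import numpy as np

def pairwise_crf_posterior(unary, pairwise, edges):
    """Brute-force posterior for a tiny binary pairwise CRF.

    unary[i, l]    : energy of assigning label l to node i
    pairwise[l, m] : energy of the label pair (l, m) on any edge
    edges          : list of (i, j) node pairs
    """
    n = unary.shape[0]
    configs = list(itertools.product([0, 1], repeat=n))
    energies = []
    for y in configs:
        e = sum(unary[i, y[i]] for i in range(n))
        e += sum(pairwise[y[i], y[j]] for i, j in edges)
        energies.append(e)
    energies = np.array(energies)
    p = np.exp(-energies)
    return configs, p / p.sum()   # division by p.sum() is the partition function Z(X)

# 3-node chain: node 0 prefers label 1, node 2 prefers label 0,
# and a Potts-style pairwise term favors equal neighboring labels.
unary = np.array([[2.0, 0.5], [1.0, 1.0], [0.5, 2.0]])
pairwise = np.array([[0.0, 1.0], [1.0, 0.0]])
configs, post = pairwise_crf_posterior(unary, pairwise, [(0, 1), (1, 2)])
y_map = configs[int(np.argmax(post))]     # decoding: most likely configuration
```

This exhaustive sum is exactly what becomes intractable for real images, where the number of configurations grows as L^n in the number of voxels.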

We now describe the AMCRF model, which is an adaptation of the classical CRF.

3. AMCRF

The voxel and lesion levels of the model are explained first ( Fig. 2 (a)). We then elaborate on how we automatically learn the model parameters using the training data.

3.1. Voxel-level CRF

At this level, a voxel-based CRF with cliques of size up to three (as shown in Fig. 4 (b)) is used to detect candidate lesions. The corresponding posterior is defined as:

(3)  $p(Y|X) = \frac{1}{Z(X)} \exp\left(-\sum_{i=1}^{P}\left[\lambda_v^{\phi}\,\phi(y_i, X) + \sum_{j \in N_i}\left(\lambda_v^{\varphi}\,\varphi(y_i, y_j, X) + \lambda_v^{\delta}\,\delta^2(y_i, y_j)\right) + \sum_{(j,k) \in T_i} \lambda_v^{\psi}\,\psi(y_i, y_j, y_k, X)\right]\right)$

where $T_i$ denotes the in-plane triplet cliques containing voxel i.

where ϕ, φ and ψ represent the voxel-level potentials for the unary, pairwise and triplet cliques respectively. The voxel-level parameters, λv, modulate the effect of each term in the final decision and are learned at the training stage ( Section 3.3 ). Ni represents the 2nd order neighborhood of voxel i (i.e. eight in-plane and two out-of-plane neighbors – Fig. 4 (a)). The triplet cliques are only considered for the in-plane nodes due to the larger out-of-plane resolution of our data. The term δ2( · ) penalizes discrepancies in the labels of neighboring pairs: it is zero if the two labels are equal and one otherwise. It should be noted that a separate weight is assigned to each clique configuration; as a result, there are two, four and eight λ’s associated with the unary, pairwise and triplet cliques respectively. To prevent clutter, however, the dependency of λ on the labeling configuration is not explicitly shown in Eq. (3) . It should also be noted that, unlike many conventional models, our pairwise term is not a likelihood model capturing the intensity differences of neighboring voxels, but rather a probabilistic multi-way classifier assigning probabilities to the different labeling configurations of the neighboring voxels. This is particularly useful in settings where the structures to be detected are small and the delta term might cause over-smoothing. Considering the small size of these lesions, the unary potential term alone might not prevent over-smoothing; the presence of the pairwise and triplet data-dependent terms, however, helps penalize smoothing over the small lesions. In cases where neighboring voxels have similar labels, the data-dependent terms have low costs (close to zero), so they do not increase the overall energy of the field.


Fig. 4 (a) The 2nd order neighborhood system used for the voxel-wise interactions. (b) The unary, pairwise and triplet cliques used to model the neighboring interactions in (a).

ϕ, φ and ψ are modeled as:

(4)  $\phi(y_i, X) = -\log P(y_i \mid x_i)$, $\quad \varphi(y_i, y_j, X) = -\log P(y_i, y_j \mid x_i, x_j)$, $\quad \psi(y_i, y_j, y_k, X) = -\log P(y_i, y_j, y_k \mid x_i, x_j, x_k)$

where y_i is the label of the ith voxel while y_j and y_k are the labels of its neighboring voxels in the different cliques ( Fig. 4 (b)). Similarly, x_i is the observation vector at the ith voxel and x_j and x_k are the neighboring observation vectors. Any classifier can be used to model the probabilities in Eq. (4) . In this work, we choose the Random Forest (RF) ( Breiman, 2001 ). The RF classifier is computationally efficient during both training and testing and also provides probabilistic outputs.
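As a concrete illustration of turning classifier outputs into clique energies, the sketch below trains a scikit-learn Random Forest on synthetic voxel features and exposes the unary potential −log P(y_i | x_i). The features, labels and hyperparameters are invented for the example, and scikit-learn is only a stand-in for whatever RF implementation the authors used:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy stand-in for voxel feature vectors x_i (e.g. intensities of several MR
# sequences plus spatial location and tissue priors) with binary labels y_i.
X_train = rng.normal(size=(500, 9))
y_train = (X_train[:, 0] + 0.5 * X_train[:, 1] > 0).astype(int)

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

def unary_potential(x):
    """phi(y_i, X) = -log P(y_i | x_i), probabilities taken from the forest."""
    p = rf.predict_proba(x.reshape(1, -1))[0]
    p = np.clip(p, 1e-6, None)      # guard against log(0)
    return -np.log(p)               # one energy value per label

phi = unary_potential(X_train[0])   # vector of energies, one per class
```

The pairwise and triplet potentials would be built the same way, with the classifier trained on concatenated neighbor features and joint label configurations as classes.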

At the end of the voxel-level analysis, a probability of being a lesion is obtained at each voxel. These probabilities are thresholded to yield a binary labeling (0 or 1). At this stage, the goal is to maintain a high sensitivity at the expense of additional FPs; the threshold is selected accordingly. Candidate lesions are then determined as sets of neighboring voxels (defined by 26-connectedness in three dimensions) labeled as 1. Some of these detected regions are enhancing lesions while the rest are false positives ( Fig. 2 (c)). A patch P whose size is proportional to the size of the detected region is considered around each candidate, and the region inside the patch is forwarded to the next level (more details on the patch size selection are given in Section 4.2 ).
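The grouping of thresholded voxels into 26-connected candidates can be sketched with scipy.ndimage. The threshold and minimum size below are illustrative, not values used in the paper:

```python
import numpy as np
from scipy import ndimage

def extract_candidates(prob, threshold=0.3, min_size=3):
    """Threshold a voxel-wise lesion-probability volume and group the
    surviving voxels into candidates using 26-connectivity in 3D."""
    binary = prob >= threshold
    structure = np.ones((3, 3, 3), dtype=bool)        # 26-connected neighborhood
    labels, n = ndimage.label(binary, structure=structure)
    candidates = []
    for lab in range(1, n + 1):
        voxels = np.argwhere(labels == lab)
        if len(voxels) >= min_size:                   # drop isolated speckle
            lo, hi = voxels.min(axis=0), voxels.max(axis=0)
            candidates.append((lo, hi, len(voxels)))  # bounding box + size
    return candidates

# Two blobs in a toy 10x10x10 probability volume
prob = np.zeros((10, 10, 10))
prob[2:4, 2:4, 2:4] = 0.9      # 2x2x2 candidate (8 voxels)
prob[7, 7, 7] = 0.8            # single voxel: filtered out by min_size
cands = extract_candidates(prob)
```

The returned bounding boxes play the role of the patches forwarded to the lesion-level CRF.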

3.2. Lesion-level: adaptive CRF with texture features

At this level, a new CRF model is built for the region inside the patch, where higher order textural descriptors are used together with the voxel-wise interactions to distinguish the enhancing lesions from the rest of the candidates and also to modify their boundaries if needed. It should be noted that computing the textural descriptors is often computationally expensive, but in our proposed multi-level framework they are computed only at this level, where we are left with a few candidates. Let $t_k$ indicate the kth texture type and $f_k$ indicate its corresponding higher order feature extracted from the region inside the patch. The lesion-level CRF is defined as:

(5)  $p(Y_P|X_P) = \frac{1}{Z(X_P)} \exp\left(-\sum_{i=1}^{n_p}\left[\lambda_l^{\phi}\,\phi(y_i, X_P) + \sum_{j \in N_i}\left(\lambda_l^{\varphi}\,\varphi(y_i, y_j, X_P) + \lambda_l^{\delta}\,\delta^2(y_i, y_j)\right) + \sum_{(j,k) \in T_i} \lambda_l^{\psi}\,\psi(y_i, y_j, y_k, X_P)\right] - \sum_{k}\left[\lambda_l^{\Psi}\,\Psi_k(b_k) + \sum_{i=1}^{n_p} \lambda_l^{\Omega}\,\Omega_k(b_k, y_i)\right]\right)$

where $X_P$ and $Y_P$ indicate the observations and labels inside the patch and $\lambda_l$ indicates the set of modulating parameters. $n_p$ denotes the total number of voxels inside the patch. Note that the first four terms of the energy in Eq. (5) are similar to those used at the voxel level ( Eq. (3) ). The only difference is the use of separate parameters, which are learned together with the parameters of the last two terms to allow for adaptive modification of the region boundaries.

A new binary variable $b_k$ ( Fig. 5 ) is defined for each higher order feature $f_k$. Ψk and Ωk represent the higher order unary and pairwise terms corresponding to the kth texture and are defined as:

(6)  $\Psi_k(b_k) = -\log P(b_k \mid f_k)$

(7)  $\Omega_k(b_k, y_i) = \delta^2(b_k, y_i)$

where $P(b_k = 1 \mid f_k)$ represents the likelihood of detecting a pathology inside the patch given $f_k$, and the pairwise term $\Omega_k$ propagates the value of $b_k$ to the voxel labels inside the patch. In principle, any textural feature can be used for $f_k$. In this work, we chose 3 different descriptors: the spin image, RIFT and LBP. These lead to histograms encoding the appearance pattern inside the patch based on the intensity and on the magnitude and orientation of the gradients. In the following, we briefly explain each descriptor.


Fig. 5 Schematic view of our higher order clique defined over the voxels inside the patch. A new variable, $b_k$, is defined over all the variables inside the patch and its value is propagated to them via the pairwise edges. The in-plane diagonal edges and the out-of-plane edges are not shown to avoid clutter.

Spin image is a 2D histogram encoding spatial and intensity information in the neighborhood of a particular reference pixel. The neighborhood is a circular patch centered on the reference pixel (i.e. the center of the patch). The two dimensions of the histogram are the distance from the center pixel and the normalized intensity. To generate the spin image, the intensity values of each patch are first normalized to the [0, 1] range, providing invariance to spatial intensity changes (e.g. as a result of bias fields in MRI). Every pixel in the patch contributes to all histogram bins according to its location and normalized intensity using a Parzen window weighting.
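A minimal hard-binned version of the spin image can be written as follows. The descriptor described above uses soft Parzen-window weighting rather than hard assignment, and the bin counts here are arbitrary:

```python
import numpy as np

def spin_image(patch, n_dist=4, n_int=8):
    """2D (distance, normalized-intensity) histogram around the patch center.

    Hard-binned sketch; Parzen-window weighting would instead spread each
    pixel's vote smoothly over neighboring bins.
    """
    patch = patch.astype(float)
    lo, hi = patch.min(), patch.max()
    norm = (patch - lo) / (hi - lo + 1e-12)   # intensity normalized to [0, 1]
    h, w = patch.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    yy, xx = np.mgrid[0:h, 0:w]
    dist = np.hypot(yy - cy, xx - cx)          # distance from the center pixel
    hist, _, _ = np.histogram2d(
        dist.ravel(), norm.ravel(),
        bins=[n_dist, n_int],
        range=[[0.0, dist.max() + 1e-12], [0.0, 1.0]],
    )
    return hist / hist.sum()                   # normalized 2D descriptor

desc = spin_image(np.arange(49, dtype=float).reshape(7, 7))
```

Because only distance from the center and normalized intensity are recorded, rotating the patch or applying an affine intensity change leaves the descriptor (up to binning) unchanged.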

RIFT is a 2D histogram descriptor that encodes spatial information and gradient orientations in the neighborhood of a particular reference pixel (i.e. the center of the patch). It differs from SIFT in that it considers the gradient orientations relative to the radial direction, and is therefore rotation invariant. Each pixel inside the patch contributes to all histogram bins using a Parzen window weight. The Parzen window makes the spin image and RIFT insensitive to small deformations and intensity distortions.

LBP encodes local patterns at circular neighborhoods of a particular reference point (i.e. center of the patch). A label is assigned to every point in the image by thresholding its neighborhood points with its value and considering the result as a binary number. The histogram of the labels is considered as the descriptor ( Ojala et al., 2002 ).
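A basic 8-neighbor LBP can be sketched in a few lines. Note that the descriptor used in the paper is the rotation-invariant variant of Ojala et al. (2002), which additionally collapses all rotations of a code into one bin; this sketch computes only the plain code for clarity:

```python
import numpy as np

def lbp_histogram(patch):
    """Plain 8-neighbor Local Binary Pattern histogram for a 2D patch.

    Each interior pixel is thresholded against its 8 neighbors; the
    resulting bits form a code in [0, 255], and the normalized histogram
    of codes is the descriptor.
    """
    p = patch.astype(float)
    center = p[1:-1, 1:-1]
    # 8 neighbor offsets, visited clockwise from the top-left
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(center, dtype=np.int32)
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = p[1 + dy:p.shape[0] - 1 + dy, 1 + dx:p.shape[1] - 1 + dx]
        codes |= (neighbor >= center).astype(np.int32) << bit
    hist = np.bincount(codes.ravel(), minlength=256)
    return hist / hist.sum()

h = lbp_histogram(np.arange(36, dtype=float).reshape(6, 6))
```

Because the code at each pixel depends only on sign comparisons with its neighbors, the histogram is unaffected by monotonic intensity changes, which is the robustness property exploited here.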

It should be noted that all three proposed textures are rotation invariant, which is a desirable property in our context since lesions can have different orientations. The spin image is complementary to RIFT, as it encodes the intensity pattern of the given patch while RIFT is based on the orientation of the gradients inside the patch. Contrary to the spin image and RIFT, which encode information relative to a reference point, the LBP encodes the pattern around each voxel inside the patch by assigning a code to it based on its eight neighbors. As such, it captures more local patterns than the other two, making it more sensitive to the pattern of small patches. Furthermore, all three features are scale invariant and also invariant to affine intensity distortions. The former is desirable since lesions can have different sizes; the latter makes the results robust to intensity distortions caused by bias field inhomogeneities.

Due to the larger spacing of our data set between axial slices, all higher order textures (spin image, LBP and RIFT) are computed on 2D axial slices. The final value of $P(b_k \mid f_k)$ in Eq. (6) is obtained by a weighted sum of the probabilities in the individual slices the candidate spans. The weights are the ratio of the candidate's area in each slice to its overall volume.
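The slice-weighted combination can be illustrated with hypothetical numbers for a candidate spanning three axial slices:

```python
import numpy as np

# Hypothetical per-slice lesion probabilities for one candidate, and the
# number of its voxels lying in each slice.
slice_probs = np.array([0.9, 0.7, 0.4])
slice_areas = np.array([12, 20, 4])        # in-slice voxel counts

weights = slice_areas / slice_areas.sum()  # area in slice / overall volume
p_lesion = float(weights @ slice_probs)    # volume-weighted probability
```

Slices containing more of the candidate thus dominate the final probability, while a thin cap of the lesion in an adjacent slice contributes little.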

During training, we apply the voxel-level model to a subset of the training data to obtain a set of lesion candidates. Textural features are computed for each candidate and saved according to whether it is a true or false detection. Our analysis showed that in order to distinguish between these textures, it is more efficient to use a histogram-based distance metric, such as the Earth Mover's Distance (EMD) ( Rubner et al., 2000 ), as opposed to comparing their attributes one by one (as done by methods such as random forest). For this reason, we have adopted a kernel-based classifier, the Relevance Vector Machine (RVM) ( Tipping, 2001 ), where the kernel matrix is computed based on the EMD distances between the textures. As three types of textures are used, three RVM classifiers are learned, one per texture. These classifiers are then used to evaluate $P(b_k \mid f_k)$ in Eqs. (6) and (7) .
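An EMD-based kernel matrix between 1D histograms can be sketched with SciPy's wasserstein_distance, which computes the 1D EMD when bin centers are passed as values and histogram masses as weights. The exponential form and gamma value are illustrative, and since an RVM is not part of common libraries such as scikit-learn, in practice this matrix would be handed to whichever kernel classifier is available:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def emd_kernel(hists, gamma=1.0):
    """Kernel matrix K[a, b] = exp(-gamma * EMD(h_a, h_b)) between
    1D histograms of equal length."""
    bins = np.arange(len(hists[0]))            # bin centers as 1D locations
    n = len(hists)
    K = np.zeros((n, n))
    for a in range(n):
        for b in range(n):
            d = wasserstein_distance(bins, bins, hists[a], hists[b])
            K[a, b] = np.exp(-gamma * d)
    return K

# All mass in bin 0 vs. all mass in bin 3: EMD = 3, so K off-diagonal = exp(-3)
h1 = np.array([1.0, 0.0, 0.0, 0.0])
h2 = np.array([0.0, 0.0, 0.0, 1.0])
K = emd_kernel([h1, h2, h1])
```

Unlike a bin-by-bin comparison, the EMD accounts for how far mass must move between bins, which is why it suits histogram descriptors whose peaks can shift slightly between candidates.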

3.3. Parameter learning and inference

We now consider the problem of learning the parameters of the AMCRF model given a set of labeled training instances. The parameters are learned separately for the voxel- and lesion-level CRF (note that the lesion-level CRF includes its own set of voxel-level parameters). The standard maximum log-likelihood (or minimum negative log-likelihood – NLL) approach to find the optimum parameters at each level for M training samples is:

(8)  $\lambda^{*} = \arg\min_{\lambda} \sum_{m=1}^{M} \left( \sum_{c \in C} \lambda_c\, E(y_c^{(m)} \mid X^{(m)}) + \log Z(X^{(m)}) \right)$

where λ is the set of model parameters and the partition function is evaluated as:

(9)  $Z(X) = \sum_{Y' \in \mathcal{Y}} \exp\left(-\sum_{c \in C} \lambda_c\, E(y'_c \mid X)\right)$

where $\mathcal{Y}$ is the set of all possible image configurations, with cardinality $L^n$ (L: the number of classes, n: the total number of image voxels). Once the parameters are learned, the task of inference is to compute the joint (or the marginal) distribution of all the output variables. Decoding, or finding the most likely configuration, can then be applied to find the labeling that maximizes the joint distribution of the output variables.

Due to the complexity of the partition function computation, different approximate learning and inference techniques have been proposed (Nowozin et al., 2010; Szeliski et al., 2008). In this paper, as a preliminary choice, we use PNLL learning together with ICM inference, mainly due to their simplicity and efficient performance. However, in order to properly investigate the impact of other learning and inference/decoding techniques, comparisons with other methods are also included in the following sections.
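ICM itself reduces to coordinate-wise greedy energy minimization: each node is repeatedly set to the label minimizing its local conditional energy until no label changes. A sketch for a small pairwise model, with toy energies invented for the example:

```python
import numpy as np

def icm(unary, edges, pairwise, max_iters=20):
    """Iterated Conditional Modes (Besag, 1986) for a pairwise model.

    unary[i, l]    : unary energy of label l at node i
    pairwise[l, m] : symmetric energy of label pair (l, m) on an edge
    """
    n, L = unary.shape
    nbrs = {i: [] for i in range(n)}
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    y = unary.argmin(axis=1)             # start from the unary-only labeling
    for _ in range(max_iters):
        changed = False
        for i in range(n):
            local = unary[i].copy()      # conditional energy of each label at i
            for j in nbrs[i]:
                local += pairwise[:, y[j]]
            new = int(local.argmin())
            if new != y[i]:
                y[i], changed = new, True
        if not changed:                  # converged to a local minimum
            break
    return y

unary = np.array([[0.0, 2.0], [1.0, 1.0], [2.0, 0.0]])
pairwise = np.array([[0.0, 1.5], [1.5, 0.0]])   # Potts-style smoothness
labels = icm(unary, [(0, 1), (1, 2)], pairwise)
```

Like the PNLL objective it is paired with here, each update looks only at a node's local neighborhood, which is what makes the combination cheap; the price is convergence to a local rather than global minimum of the energy.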

4. Experimental results: enhancing MS lesions

The performance of our proposed framework is validated on the problem of Gad-enhancing MS lesion detection in brain MRI.

4.1. Data, ground truth and pre-processing

The training and test data were acquired from multi-center clinical trials with RRMS patients with varying numbers of Gad-enhancing lesions, located in different areas of the brain White Matter (WM). Each acquisition was composed of five sequences: pre- and post-contrast T1 images along with T2, Proton Density (PD) and Fluid Attenuation Inversion Recovery (FLAIR). Each sequence is a non-isotropic 3D volume containing 3 mm thick axial slices with 1 mm by 1 mm in-plane resolution. Our voxel-wise observation vector $x_i$ consists of the intensities of the above five MRI modalities, the spatial location of each voxel and three tissue priors: the white matter, partial volume and T2 MS lesion priors. The white matter prior is estimated by registering the ICBM (MNI) average brain atlas ( Fonov et al., 2011 ) to the patient images. The partial volume prior models mixtures of the cerebrospinal fluid and gray matter priors from the ICBM152 healthy tissue atlas. The T2 MS lesion prior was available from Elliott et al. (2013) , where manual T2 MS lesion segmentations for a series of clinical trials (a total of 3714 RRMS scans) were all nonlinearly registered to ICBM space to provide a probabilistic lesion atlas. All registrations were performed using the method of Collins et al. (1995) . Furthermore, all intra-subject sequences are registered to a common coordinate space (the stereotaxic space) using a normalized cross-correlation-based technique ( Collins et al., 1995 ).

Given that different MRI scanners are used at different data acquisition sites, the imaging organization specifies the required pulse sequences for each site (prior to the start of the clinical trial) with the objective of obtaining images as similar as possible. Before using the images, a few pre-processing steps are also applied to the data. These include the removal of non-brain portions of the image, correction for non-uniformity effects and bringing the MR images into a common spatial and intensity space. Skull stripping is performed using the Brain Extraction Tool (BET) (Smith, 2002) while intensity non-uniformity (NU) correction is performed using N3 (Sled et al., 1998). An inter-patient intensity normalization is performed to map each patient into a global intensity space. This allows us to learn classifiers based on intensity across many subjects. This normalization was done using the method of Nyul and Udupa (1999), using intensity deciles as control points.
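The decile-based normalization can be sketched as a piecewise-linear landmark mapping (a simplified illustration in the spirit of Nyul and Udupa (1999), not the authors' implementation; `ref_deciles` is a hypothetical array of 11 standard-space intensity landmarks):

```python
import numpy as np

def nyul_normalize(image, ref_deciles):
    """Map the image's intensity deciles onto reference landmarks with
    piecewise-linear interpolation between consecutive control points."""
    percentiles = np.linspace(0, 100, 11)     # deciles (plus min and max)
    src = np.percentile(image, percentiles)   # this image's own landmarks
    # np.interp applies the piecewise-linear mapping voxel by voxel
    return np.interp(image, src, ref_deciles)
```

After this mapping, corresponding tissue intensities from different scanners fall into comparable ranges, which is what allows classifiers to be learned on intensity across many subjects.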

It should be noted that the training and testing data used in this work, along with the corresponding manual labeling, were provided by a company which offers various image analysis services (such as lesion segmentation/detection) to facilitate clinical trials for MS treatments. The manual labeling of gad lesions is performed in this company based on the criterion that enhanced lesions should show at least 20% enhancement when comparing the pre- and post-contrast T1w images. The manual labelings are obtained by a group of expert readers. Each volume is labeled by two different readers from this group. The two then meet to come to a consensus. The manual labeling is performed in the standard stereotaxic space. The expert readers consult all three views (axial, sagittal and coronal) using display software specifically designed for lesion labeling. If necessary, the resulting labels are verified by other expert readers. The labelings provided based on this protocol have been used for evaluating drug efficacy in many real clinical trials for MS around the world. This labeling is used as the ground truth in this work, both for training the model and for evaluating the results of the automatic technique.

4.2. Patch size selection

The patch size considered around the detected regions of the voxel-level CRF is determined as follows (see Fig. 6 ). If a denotes the length of a given side of the detected region’s bounding box, the length a′ of the corresponding side of the patch is set as:

(10) a′ = a + 4 if a < 4, and a′ = 2a otherwise.

This way, if any side of the bounding box is very small (<4 voxels), 4 voxels are added to that side to include enough of the surrounding information. For larger sides, twice the size of the bounding box results in regions large enough to include the neighboring contextual patterns. We examined a few different sizes ( Section 4.10 ) and the aforementioned patch size was an empirical choice that consistently yielded the best performance on the training data.
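The rule in Eq. (10) amounts to a one-line helper (a sketch; `a` is a bounding-box side length in voxels):

```python
def patch_side(a):
    """Side length of the patch for a bounding-box side of a voxels:
    pad very small sides by 4 voxels, otherwise double the side."""
    return a + 4 if a < 4 else 2 * a
```

For example, a 2-voxel side grows to a 6-voxel patch side, while a 10-voxel side grows to 20.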

gr6

Fig. 6 Illustration of the patch size considered around a detected region (marked in green). The bounding box and the patch are shown with green and red rectangles. a and b indicate the size of the two sides of the bounding box. Here, it is assumed that none of the sides of the bounding box is smaller than 4 voxels. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article).

4.3. Training and testing setup

Our training data contains brain MR scans from 1760 cases (each having the five sequences mentioned in Section 4.1 ) from 180 different centers, with Gad-enhancing lesions manually labeled. Different parts of the proposed model are learned with different subsets of the training data to avoid over-fitting. To that end, the training data is divided into four disjoint subsets: s1, s2, s3 and s4. First, we use s1 to learn the parameters of the voxel-level CRF, i.e. the parameters of the RF classifier used in the unary, pairwise and triplet cliques. s1 contains 200 MRI volumes, which results in 11 × 10⁶ voxels, 41 × 10⁶ pairs of voxels and 27 × 10⁶ triplets of voxels. Learning the voxel-level CRF was then completed by learning the λv modulating parameters using s2, which contains 150 MRI volumes. We then ran the voxel-level CRF on s3 to generate candidate lesions and used these candidates to train the RVM texture classifiers. In order to increase the robustness of the model, the training data used for learning the texture classifier should contain a large amount of variability. As such, s3 is chosen as the largest subset of the training data, containing 1260 MRI volumes. Finally, the parameters of the higher order terms together with the voxel-wise interactions were learned using s4, which contains 150 MRI volumes.

In order to examine the robustness of the proposed method with respect to the variabilities between different trials, we choose our test data from a separate clinical trial. This data comprises 120 scans from an MS clinical trial, different from the training trial, acquired from 24 centers. In the last section of the experiments ( Section 4.12 ), we further test the proposed model on a very large clinical trial containing 2770 scans captured from 142 centers.

4.4. Qualitative results

We first present some illustrative examples of the proposed model. Fig. 7 shows an example of an FP that is successfully removed by the AMCRF model. We see that the voxel-level CRF has failed to remove the FP while the AMCRF model, with the incorporation of textural features, has successfully removed it. In Fig. 8 , we show the adaptive nature of AMCRF, in that it changes the boundaries of the lesion found in the voxel-level CRF. The new boundaries better match the manual labeling. In other words, the lesion-level CRF does not only accept or remove candidates; it also adaptively changes their boundaries.

gr7

Fig. 7 Comparison between intermediate steps of the algorithm in removing an FP (shown in cyan). (a) and (b) show the T1w MR image with contrast (T1c) and the binary mask of all candidate lesions. (c) and (d) show zoomed views of the FP in the binary mask and T1c. The performance of the intermediate steps of the voxel-level CRF (CRFv) is shown in (e) CRFv with unary cliques, (f) CRFv with unary and pairwise cliques, (g) CRFv with unary, pairwise and triplet cliques. (h) shows the AMCRF performance using all three textures with cliques of up to size three. It is observed that the higher order textures remove the FP in (h). PNLL and ICM are used for parameter learning and inference. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article).

gr8

Fig. 8 Comparison between intermediate steps of the algorithm in adaptively capturing the boundaries of a TP. (a) and (b) show the T1w MR image with contrast and the binary mask of all candidate lesions. (c) and (d) show zoomed views of the TP in the binary mask and the manual labeling. The performance of the intermediate steps of the voxel-level CRF (CRFv) is shown in (e) CRFv with unary cliques, (f) CRFv with unary and pairwise cliques, (g) CRFv with unary, pairwise and triplet cliques. (h) shows the AMCRF performance using all three textures. AMCRF has adaptively corrected the boundaries of the lesion in (h). Yellow and green respectively show the manual label and the TP. PNLL and ICM are used for parameter learning and inference. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article).

4.5. Quantitative results

In the context of Gad-enhancing lesion segmentation, due to the pathological nature of the lesions, their exact boundaries are often ambiguous and even an expert has been shown to label the boundaries differently if asked to perform the task twice. Therefore, the overall Gad-enhancing lesion count is the outcome measure typically used in clinical trials. As such, the primary focus of our work is to detect individual lesions. Kappa or Dice coefficients (computed based on the number of voxels) have previously been used to measure the performance of lesion segmentation algorithms (Akselrod-Ballin, Galun, Basri, Brandt, Gomori, Filippi, Valsasina, 2006, Anbeek, Vincken, van Osch, Bisschops, van der Grond, 2004, and Van Leemput, Maes, Vandermeulen, Colchester, Suetens, 2001). However, these metrics have a strong bias toward larger lesions, as small lesions contribute little to an overall volumetric measure (Elliott, Collins, Arnold, Arbel and García-Lorenzo, Francis, Narayanan, Arnold, Collins, 2013). Hence, such measures do not reflect missed regions that are small. In fact, missing a few voxels of a big lesion and missing a small lesion entirely lead to the same value under these metrics. As a result, voxel-wise metrics are not directly applicable to our problem, so we use lesion-wise metrics (such as counts) to evaluate our classifier.
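A toy calculation (with hypothetical voxel counts) makes the bias concrete: missing a small lesion entirely costs no more Dice than missing a few voxels of a large one.

```python
def dice(tp, fp, fn):
    """Voxel-wise Dice coefficient: 2*TP / (2*TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn)

# Hypothetical scan with a 200-voxel lesion and a 4-voxel lesion.
# Case A: 4 boundary voxels of the large lesion are missed.
# Case B: the 4-voxel lesion is missed entirely.
case_a = dice(tp=200, fp=0, fn=4)
case_b = dice(tp=200, fp=0, fn=4)
# Identical Dice, yet lesion-wise sensitivity is 2/2 in case A
# and only 1/2 in case B.
```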

Prior to measuring the performance at any step, detected regions of size 1 or 2 voxels are removed according to a clinical protocol which stipulates that MS lesions should have at least 3 voxels (the same criterion was used for the manual labeling). We consider a detected region as a true positive (TP) if it has at least one voxel overlapping with our ground truth; otherwise it is counted as a false positive (FP). Any labeling in the ground truth that is not detected by our method is defined as a false negative (FN). Overall sensitivity and false discovery rate are evaluated as SENS = TP/(TP + FN) and FDR = FP/(TP + FP). By varying the acceptance threshold on the final probability at each voxel, we plot sensitivity vs. false discovery rate and sensitivity vs. average false positive count per scan.
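The counting scheme above can be sketched as follows (an illustrative simplification: regions are given as sets of voxel coordinates, and connected-component extraction is omitted):

```python
def lesion_counts(detected, ground_truth, min_size=3):
    """Lesion-wise evaluation: `detected` and `ground_truth` are lists
    of voxel-coordinate sets, one set per connected region. Regions of
    1-2 voxels are discarded first; a detection counts as TP if it
    shares at least one voxel with any ground-truth region."""
    detected = [d for d in detected if len(d) >= min_size]
    tp = sum(1 for d in detected if any(d & g for g in ground_truth))
    fp = len(detected) - tp
    fn = sum(1 for g in ground_truth if not any(g & d for d in detected))
    sens = tp / (tp + fn) if tp + fn else 0.0
    fdr = fp / (tp + fp) if tp + fp else 0.0
    return tp, fp, fn, sens, fdr
```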

4.6. Evaluating the effect of different potential functions and texture types

Experiments were performed to evaluate the effect of each term in the proposed AMCRF model. As the main goal of these experiments is to assess the effect of different textural features, the PNLL and ICM methods were adopted for learning and inference, primarily due to their fast and yet efficient performance. The effect of other learning and inference techniques is evaluated later, in Section 4.7 . Results are presented in Fig. 9 . It is observed that incorporating cliques of up to size three improves the sensitivity of the results compared to cliques of up to size two 2 (see the dashed curves in Fig. 9 ). Moreover, exploiting the higher order texture features at the lesion level improves on the results of the voxel level. This can be seen when comparing the solid lines to the dashed curves in Fig. 9 . Embedding the combination of all three stationary texture features (RIFT, spin image and LBP) with unary, pairwise and triplet cliques into the AMCRF leads to the best performance. It should be noted that the highest achievable sensitivity of the final result of the AMCRF model is determined by the working point of the voxel-level CRF; this is 0.95 in our experiments (the saturation point of the curves).

gr9

Fig. 9 (a) shows plots of sensitivity vs. false discovery rates, (b) shows plots of sensitivity vs. average false positive counts per scan. They compare the performance of the voxel-wise CRF (CRFv) with unary, pairwise and triplet cliques (U., P. and T.) as well as the final classification results with combinations of RIFT, spin image and LBP features (R., S. and L.). Individual points on each curve are generated by varying the acceptance threshold on the final probability at each voxel. All AMCRF curves use cliques of up to size three. Here, PNLL and ICM are used for learning and inference. It is observed that the AMCRF model using the combination of the three higher order textures yields the best performance.

4.7. Evaluating the effect of different learning and inference methods

So far we have seen the performance of PNLL learning and ICM inference for the proposed graphical approach. In this section, we further investigate some other combinations of learning and inference methods, summarized in Fig. 11 (c), to assess their effect on the final results. Plots of sensitivity vs. false discovery rates and sensitivity vs. average false positive counts are presented in Fig. 11 (a) and (b). All curves correspond to the AMCRF model including all three textures. The 2nd order neighborhood with up to pairwise cliques ( Fig. 10 ) is adopted for all the models in the second to fourth rows of Fig. 11 (c). This is due to the complexities which arise in these approaches, as the inference step requires minimizing significantly more complex energy functions when considering triplet interactions. At convergence, both TRBP and ICM yield a posterior probability at each voxel. Decoding can be achieved by choosing the winning label for each voxel. For the (NLL, TRBP) combination in Fig. 11 (c), approximating the marginal probabilities in the learning procedure and computing the joint posterior of all the voxels in the final inference are both performed with TRBP. Maximum margin approaches require a decoding method in the learning procedure to update their constraints. TRBP and graph cuts are used in this paper due to their high efficiency ( Szeliski et al., 2008 ).

gr10

Fig. 10 The unary and pairwise cliques used to model the neighboring interactions in Fig. 4 (a) when learning and inference approaches other than (PNLL, ICM) are used.

gr11

Fig. 11 Comparing the effect of different learning and inference techniques for the AMCRF classifier when including all textures (RIFT, spin image and LBP). (a) Shows plots of sensitivity vs. false discovery rates. (b) Shows plots of sensitivity vs. average false positive counts per scan. Zoomed-in views of the regions of interest are also shown for (a) and (b). Arrows point to the resulting working point when using the graph cut decoding. It is observed that maximum margin learning along with TRBP inference (the cyan curves) yields slightly higher sensitivities than the rest. The different combinations of learning and inference techniques used are summarized in (c). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article).

It is observed that almost all of the combinations yield comparable results. However, maximum margin learning together with TRBP inference yields slightly higher sensitivities than the rest. This is perhaps due to the strong performance of the TRBP optimization in reaching energies very close to the global optimum. Maximum margin learning has been shown to be more effective than NLL (and PNLL) methods as it does not depend on computing the partition function ( Szummer et al., 2008 ). Since graph cuts does not yield a probabilistic output, there is only one working point corresponding to this method (marked with an arrow). The slight degradation in the graph cuts performance compared to the rest might be due to the fact that it can only be applied to submodular functions. Therefore, the energy function optimized by the graph cut is slightly modified to ensure this criterion is met. 3 Also, while the graph cuts technique is guaranteed to give the global minimum energy solution for binary classification, this solution is not always the best one for many real-world problems (as also pointed out by other work (Meltzer, Yanover, Weiss, 2005 and Szeliski, Zabih, Scharstein, Veksler, Kolmogorov, Agarwala, Tappen, Rother, 2008)). For example, in the context of pathology segmentation, higher sensitivity at the expense of a few false positives is preferred over lower sensitivity with lower false positive counts. This is because false positives can easily be removed when a human expert reviews the results, as opposed to the harder task of looking for possible false negatives. Hence, thresholds corresponding to higher sensitivities are usually preferred in this context. The deterministic result obtained by graph cuts, even though it might correspond to the lowest energy, does not offer the option of choosing a working point.

For all of the aforementioned combinations, both learning and inference are performed using MATLAB on a 2.66 GHz CPU. The code is not specifically optimized for fast processing. In all cases, inference is generally fast, taking less than 1 min per scan on average. However, training the AMCRF model (like all CRF models) is computationally more expensive and takes much longer. Specifically, it takes around 95, 120, 144 and 100 h to train (PNLL, ICM), (NLL, TRBP), (Max. Marg., TRBP) and (Max. Marg., GC), respectively, on a complete dataset of 1760 scans. The most time-consuming part for all models is training the voxel-level CRF where, for each training case, the graphical model is defined over all voxels in a 3D MRI volume.

Considering the training complexity of the aforementioned approaches and the fact that they yield almost equal performance, it would make sense to select the model that is less computationally expensive (i.e. PNLL, ICM) for this context.

4.8. Evaluating the effect of lesion size

Large lesions that span multiple slices are much easier to capture than those with only a few voxels in a single slice. In this section, we examine the performance of the AMCRF algorithm as a function of lesion size. Table 1 shows sensitivity and false discovery rates for different lesion sizes, from very small (3–5 voxels) to very large (101+ voxels), when operating at an overall target sensitivity of 0.90. Results correspond to the AMCRF model with the incorporation of all three textures and using maximum margin learning with TRBP inference. Sensitivities for different lesion sizes range from 75% for very small lesions (3–5 voxels) to 100% for very large ones (101+ voxels), while the average FP count per scan ranges from 0.17 to 0.

Table 1 Performance of the AMCRF model (with incorporation of all three textures) for different lesion sizes (e.g. 3–5 voxels, 6–10 voxels, and so on) at a target sensitivity of 90%. Lesion sizes are computed based on the manual labeling. The following are reported in each category: TP = total number of true positive counts, #les = total number of lesions based on the manual labeling (i.e. TP + FN), SENS = sensitivity (i.e. TP/#les), FP = total number of false positive counts, FDR = false discovery rate, AvgFP = average number of FP counts per scan. AvgFP in each category is obtained by dividing the number of false detections with different sizes (3–5, 6–10, etc.) by the total number of cases (i.e. 120). TP sizes are based on the size of the corresponding lesions in the manual labeling.

  Overall 3–5 6–10 11–20 21–50 51–100 101+
#les 236 65 45 39 52 20 15
TP 212 49 41 37 51 19 15
SENS 0.90 0.75 0.91 0.95 0.98 0.95 1
FP 38 20 8 8 2 0 0
FDR 0.16 0.29 0.16 0.18 0.04 0 0
AvgFP 0.32 0.17 0.07 0.07 0.02 0 0
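For reference, the size categories used in Table 1 (and later in Table 3) can be encoded as a trivial helper (illustrative only; lesions smaller than 3 voxels are removed before counting):

```python
def size_bin(n_voxels):
    """Size category for a lesion of n_voxels (lesions have >= 3 voxels)."""
    for lo, hi in [(3, 5), (6, 10), (11, 20), (21, 50), (51, 100)]:
        if lo <= n_voxels <= hi:
            return f"{lo}-{hi}"
    return "101+"
```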
gr12

Fig. 12 Segmentation results of two enhanced lesions from two different patients. The T1w image after contrast is shown in (a) (and (h)), where each enhanced lesion is marked with an arrow. The remaining images are zoomed views of the enhanced lesions. The T1w before and after contrast are shown in (b) and (c) ((i) and (j)). The normalized enhancement map, computed as (T1c − T1pre)/T1pre, is shown in (d) (and (k)), where brighter voxels correspond to stronger enhancements. The enhancement map thresholded to only show voxels with the required amount of enhancement (i.e. more than 20% in our work) for gad lesions is shown in (e) (and (l)). The segmentation result is shown in (f) (and (m)), where the TP, FP and FN voxels are marked in green, blue and red. The manual labeling is shown in (g) (and (n)). The values of Dice, sensitivity and false discovery rate are reported for each case. It is observed that although almost all of the voxels are captured by the automatic technique, both cases have low Dice values due to over-segmentation compared to the manual labeling. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article).

gr13

Fig. 13 Volume of detected regions in the manual labeling versus the automatic technique for the 120 test scans. Each blue point in the plot represents one scan. The line y = x is shown in red for comparison. Generally, the automatic technique tends to over-segment. It should be noted that on average there are about 5 × 10⁵ voxels in each scan of our dataset, out of which around 0.55 × 10⁵ are enhanced, mostly due to healthy structures (as can be seen from the plot, at most only around 650 voxels are due to gad lesions). In the most extreme case, the automatic technique over-segments by 324 voxels (in the scan marked with an arrow), which is well below the total number of enhanced voxels. This removes any concern that there might be a case where the entire brain has been labeled as lesion. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article).

4.9. Segmentation analysis

Although the total number of gad-enhancing lesions is the main criterion used in clinical trials for new MS treatments, in this section we also analyze the segmentation results of the detected regions. Table 2 shows the voxel-based sensitivity, false discovery rate and Dice values for all true positive regions when categorized by size (note that the sizes of the false positive regions are already reported in Table 1 ). In all cases, the sensitivity value is quite high, meaning that even though we require only one voxel of overlap with the manual labeling, in fact almost all of the voxels marked in the manual labeling are detected. It should be noted that the manual raters were instructed to label the main voxels of each lesion, and not to spend time refining the boundaries. Therefore, the automatic technique tends to over-segment compared to the manual labeling. This is why the voxel-based false detection rates are not very low.

Table 2 Voxel-based segmentation analysis of the lesions detected by the AMCRF model (with incorporation of all three textures) for different lesion sizes (e.g. 3–5 voxels, 6–10 voxels, and so on) at the same working point as Table 1 . Lesion sizes are computed based on the manual labeling. The following metrics are computed in each column based on the TP, FP and FN voxels in the true positive detections: average sensitivity (Avg SENS), average false discovery rate (Avg FDR) and average Dice score (Avg DICE).

  Overall 3–5 6–10 11–20 21–50 51–100 101+
Avg SENS (voxel-based) 0.96 0.95 0.94 0.97 0.97 0.98 0.98
Avg FDR (voxel-based) 0.38 0.45 0.42 0.35 0.33 0.30 0.35
Avg DICE (voxel-based) 0.61 0.54 0.55 0.63 0.65 0.68 0.65

Fig. 12 shows qualitative examples of the segmentation results of two detected regions. Although almost all voxels of the two shown lesions are captured by the automatic technique, both cases have relatively low Dice values due to the over-segmentation. In this case, Dice is not really an appropriate metric to use but the results are shown for completeness.

We also compared the overall volume detected by the automatic technique to that of the manual labeling. Results are presented in Fig. 13 .

4.10. Evaluating the effect of different scales

An evaluation of different scales of the considered area around the pathology of interest is presented in this section. To that end, in addition to the original patch scale (referred to as s0 in Fig. 14 (a)), two other scales are considered around each detected region of the voxel level. The first scale, s1, is obtained by expanding s0 by one voxel in each direction. The primary goal here is to investigate the sensitivity of the higher order features to the selected patch center (note that higher order features are computed relative to the patch center). Therefore, for each texture type, nine textures are computed for the region inside this scale with the centers shown in Fig. 14 (d). These nine textures are then combined to form a histogram representing that scale. This results in three signatures (one for each texture type) for s1. The signatures essentially encode the number of times certain clusters of patches appear relative to a dictionary (more details on how signatures are obtained can be found in Lazebnik et al. (2005) ). The second scale, s2, is 1.5 times bigger than s0. The aforementioned procedure is repeated ( Fig. 14 (f) and (g)), resulting in three signatures for s2.
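The three scales can be sketched as simple bounding-box arithmetic (a hypothetical 2-D helper; the paper's exact rounding and centering conventions for s2 are not specified):

```python
def scale_boxes(x0, y0, w, h):
    """Return (x, y, width, height) boxes for scales s0, s1 and s2.
    s1 expands s0 by one voxel in each direction; s2 is 1.5x the size
    of s0, kept (approximately) centered on it."""
    s0 = (x0, y0, w, h)
    s1 = (x0 - 1, y0 - 1, w + 2, h + 2)
    w2, h2 = round(1.5 * w), round(1.5 * h)
    s2 = (x0 - (w2 - w) // 2, y0 - (h2 - h) // 2, w2, h2)
    return s0, s1, s2
```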

gr14

Fig. 14 Illustrations of different scales considered around the detected region of the voxel-level (the green region in (a), (b) and (e)). (a) Shows the original scale, s0. This scale is represented by one texture per each texture type. (b) Shows the first scale: s1. This scale is represented by nine textures derived per each texture type using the centers shown in (d). The support area for three of these textures is shown in (c) each with a different color. (e) Shows the second scale: s2. Similar to s1, this scale is also represented by nine textures derived per each texture type using the centers shown in (g). The support area for three of these textures is shown in (f) each with a different color. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article).

gr15

Fig. 15 Evaluating the effect of higher order textures at different scales. (a) Shows plots of sensitivity vs. false discovery rates. (b) Shows plots of sensitivity vs. average false positive counts per scan. Maximum margin learning together with TRBP decoding (Max.Marg, TRBP) are used for all curves.

These signatures are integrated into the lesion-level CRF as follows:

FORMULA:

(11) [Eq. (5), with its last two (higher order) terms additionally summed over the scales s]

The only difference between Eqs. (11) and (5) is the summation over different scales in the last two terms. fk,s represents the kth feature extracted at scale s, and Ψk,s and Ωk,s are the corresponding unary and pairwise higher order potentials. 4 Similar to the analysis performed for the original scale ( Eq. (5) ), an RVM classifier is learned for each signature type at s1 and s2 (i.e. three RVM classifiers per scale) and the resulting probabilities are used in computing Ψk,s and Ωk,s, similar to Eqs. (6) and (10).

Fig. 15 shows the results when only the original scale is used (previously shown in Fig. 11 ), and when it is combined with s1 and s2. It is observed that the original scale outperforms the other two. This is perhaps because the majority of the lesions are small and larger neighborhoods suppress the lesion-related textures.

4.11. Comparison to MRF, RF and SVM

Finally, we compare the performance of the proposed AMCRF model with a few other techniques, namely an Adaptive Multi-level MRF (AMMRF), a Random Forest classifier (RF) and a Support Vector Machine (SVM). AMMRF is implemented similarly to AMCRF, with the only difference being that the Ising model is used for the pairwise interactions (as is typical for standard MRFs) instead of the data-dependent terms used by the AMCRF. Both RF and SVM are widely used i.i.d. classifiers that do not consider any interactions between data samples. In this experiment, both the RF and SVM classifiers are trained using subset s1 described in Section 4.3 .

gr16

Fig. 16 Comparing the performance of the proposed AMCRF model to the Adaptive Multi-level MRF (AMMRF), Random Forest (RF) and Support Vector Machine (SVM). AMMRF is similar to AMCRF except that the Ising model is adopted for the pairwise interactions. (a) Shows plots of sensitivity vs. false discovery rates. (b) Shows plots of sensitivity vs. average false positive counts per scan. Maximum margin learning together with TRBP decoding (Max.Marg, TRBP) is used for both AMCRF and AMMRF. Note that SVM and RF are unary models whereas AMCRF and AMMRF are both multi-level approaches where the final sensitivity depends on the selected working point of the voxel level. As a result, the corresponding curves never reach a sensitivity of 1.

gr17

Fig. 17 Evaluating the performance of the AMCRF model on a large dataset. Plots of sensitivity vs. false discovery rate and sensitivity vs. average false positive count are shown in (a) and (b). A sensitivity of 91% at a false discovery rate of 0.33 and an average FP count of 0.46 is achieved.

A non-linear SVM with radial basis functions is used. 5 Quadratic programming optimization is exploited for finding its coefficients. A few different values are tested for the meta-parameter C in the range of 0.1 to 10. The one giving the smallest error on a validation set is chosen (C= 1.5).
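The validation-based choice of C can be sketched generically (the callables `train_fn` and `error_fn` are hypothetical stand-ins for SVM training and validation-set error; the paper's actual SVM implementation is not shown):

```python
def select_C(train_fn, error_fn, candidates):
    """Return the candidate C whose trained model has the smallest
    validation error, as in the meta-parameter search described above."""
    best_C, best_err = None, float("inf")
    for C in candidates:
        err = error_fn(train_fn(C))
        if err < best_err:
            best_C, best_err = C, err
    return best_C
```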

The RF classifier 6 is essentially the classifier used to model the probability term of the unary function ( Eq. (4) ). It consists of one hundred binary trees constructed in a top-down manner. The same voxel-wise features used in training the AMCRF are used for training the RF and SVM classifiers (i.e. the intensities of the five MRI modalities, the spatial location of each voxel and three tissue priors: the white matter, partial volume and T2 MS lesion priors). Each tree uses 10⁴ voxels drawn at random from the available training voxels, and each node in a given tree uses half of the available features at random. At each node, the feature that best splits the given training examples is chosen by a greedy search. The goodness of a split is measured using the entropy-based information gain metric.
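The entropy-based information gain used to score candidate node splits can be sketched as (a minimal illustration, not the paper's full RF training code):

```python
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(parent, left, right):
    """Reduction in entropy achieved by splitting `parent` into
    `left` and `right` at a tree node."""
    n = len(parent)
    return entropy(parent) - (len(left) / n) * entropy(left) \
                           - (len(right) / n) * entropy(right)
```

At each node, the greedy search simply evaluates this gain for every candidate feature threshold and keeps the best.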

It should be noted that the computational cost for training the SVM classifier is almost twice that of the RF (∼4 vs. ∼2 h), which is mainly due to the quadratic programming optimization.

Fig. 16 shows a comparison of the results. Maximum margin learning together with TRBP inference is used for both AMCRF and AMMRF. As expected, both AMMRF and AMCRF outperform the RF and SVM due to the strength gained from the graphical representation. AMMRF yields lower sensitivity for any given false discovery rate due to the over-smoothing effect of the Ising model.

4.12. Testing on a large dataset

In this section we evaluate the performance of the proposed model on a very large clinical trial containing 2770 scans. Fig. 17 shows the plots of sensitivity versus the false discovery rate and the sensitivity versus the average false positive counts. Results correspond to the AMCRF model with the incorporation of all three textures and using the maximum margin learning with TRBP inference. A sensitivity of 91% at the average false positive count of 0.46 and false discovery rate of 33% is achieved.

Similar to the size-based analysis presented in Table 1, Table 3 shows sensitivity, false discovery rates and average false positive counts for different lesion sizes, from very small (3–5 voxels) to very large (101+ voxels), for the AMCRF model when operating at an overall sensitivity of 91%. A similar trend is observed: as the size increases, the sensitivity increases from 79% for the very small lesions to 99% for the very large ones. Likewise, the false discovery rate decreases from 63% for the small regions to 7% for the large ones. Hence, most of the false detections are at smaller sizes, and there is also the risk of an imperfect ground truth at those sizes, as the human readers might have missed these small lesions.

Table 3 Performance of the AMCRF model on a large dataset for different lesion sizes (e.g. 3–5 voxels, 6–10 voxels, and so on) at a target sensitivity of 91%. Lesion sizes are computed based on the manual labeling. The following are reported in each category: TP = total number of true positive counts, #les = total number of lesions based on the manual labeling (i.e. TP + FN), SENS = sensitivity (i.e. TP/#les), FP = total number of false positive counts, FDR = false discovery rate, AvgFP = average number of FP counts per scan. AvgFP in each category is obtained by dividing the number of false detections with different sizes (3–5, 6–10, etc.) by the total number of cases (i.e. 2770).

        Overall  3–5   6–10  11–20  21–50  51–100  101+
#les    2807     541   618   622    630    241     155
TP      2557     427   563   571    609    233     154
SENS    0.91     0.79  0.91  0.92   0.97   0.97    0.99
FP      1255     732   270   152    62     28      11
FDR     0.33     0.63  0.32  0.21   0.09   0.11    0.07
AvgFP   0.46     0.27  0.10  0.06   0.02   0.01    0.004
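The size-stratified analysis above requires assigning each lesion, by its voxel count in the manual labeling, to one of the size bins. A small Python sketch of this bookkeeping (the bin edges follow Table 3; the names `size_bin` and `per_bin_sensitivity` are illustrative, not from the paper):

```python
import bisect

# Size bins used in Table 3 (lower edges, in voxels); labels use ASCII hyphens
BIN_EDGES = [3, 6, 11, 21, 51, 101]
BIN_LABELS = ["3-5", "6-10", "11-20", "21-50", "51-100", "101+"]

def size_bin(n_voxels):
    """Map a lesion's voxel count (>= 3) to its size category."""
    idx = bisect.bisect_right(BIN_EDGES, n_voxels) - 1
    return BIN_LABELS[idx]

def per_bin_sensitivity(lesions):
    """lesions: iterable of (n_voxels, detected) pairs from the manual labeling.

    Returns {bin label: fraction of that bin's lesions that were detected}.
    """
    hits, totals = {}, {}
    for n_voxels, detected in lesions:
        b = size_bin(n_voxels)
        totals[b] = totals.get(b, 0) + 1
        hits[b] = hits.get(b, 0) + int(detected)
    return {b: hits.get(b, 0) / totals[b] for b in totals}
```

For example, `per_bin_sensitivity([(4, True), (4, False), (12, True)])` yields a sensitivity of 0.5 for the 3-5 bin and 1.0 for the 11-20 bin.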

5. Conclusion

We presented AMCRF, a multi-level adaptive CRF framework that finds small enhanced pathologies with a minimum size of 3 voxels. The small size and varied shapes, locations and textures of these pathologies, together with the presence of numerous healthy tissue enhancements, make the problem more challenging than segmentation of large healthy structures. In addition to voxel-level interactions, the model includes higher order textures to correctly discriminate the enhanced pathologies from the pool of other enhancements. The performance is evaluated in the context of MS enhancing lesion segmentation, where the model is trained and tested on a very large multi-center clinical trial and the effect of each component of the model is extensively evaluated. The performance of different combinations of learning and inference techniques is further tested. On 120 cases, at the overall target sensitivity of 90%, the average false detection count per scan ranges from 0.17 for very small lesions to 0 for very large ones. The proposed model is further tested on a very large clinical trial containing 2770 scans, where a high sensitivity along with a low false discovery rate is achieved. Finally, the performance of the AMCRF model is compared with several other techniques, namely Support Vector Machines (SVM), Random Forests (RF) and a variant of MRF, with superior performance achieved by the proposed model. In conclusion, the high sensitivity along with the few false positive counts of the proposed model offers a fast and accurate solution for use in real clinical trials.

Acknowledgment

This work was supported by a Canadian National Science and Engineering Research Council Strategic Grant (STPGP 350547–07) and a Canadian National Science and Engineering Research Council collaborative Research and Development Grant (CRDPJ 411455–10).


Footnotes

a Centre for Intelligent Machines, McGill University, Montreal, Canada

b NeuroRx Research, Montreal, Canada

* Corresponding author.

1 The terms Gad-enhancing lesions, enhancing lesions and lesions are used interchangeably in this paper.

2 It should be noted that when computing the statistics of the voxel-level CRF, thresholds resulting in acceptance of all of the voxels in a given MRI volume were excluded, because under our TP and FP definitions such a volume would be counted as a single TP, yielding an overall sensitivity of one and a false discovery rate of zero. For this reason, the maximum achievable sensitivity for the voxel-level curves is not one.

3 Data dependent terms are modified for the graph cuts model so as to have zero cost for certain label assignments. This results in over-smoothing of some of the small lesions, which is reflected in the lower sensitivity of the graph cut solution.

4 Note that features for s0 include RIFT, spin image and LBP textures whereas features in s1 and s2 consist of the signatures derived from RIFT, spin image and LBP textures.

5 MATLAB built-in svmtrain and svmclassify functions are used for training and classification.

6 A MATLAB-based implementation is used, obtained from http://uk.mathworks.com/matlabcentral/fileexchange/31036-random-forest.

