FUZZY CLUSTERING ALGORITHM AND ITS APPLICATION ON CARCINOMA TISSUE

0Associated
Cases 
0Associated
Defendants 
0Accused
Products 
2Forward
Citations 
0
Petitions 
2
Assignments
First Claim
1. A fuzzy Cmeans (FCM) clustering algorithm for processing spectral images of a tissue sample, wherein the algorithm automatically and simultaneously estimates the optimal values of K (number of nonredundant FCM clusters), and m (fuzziness index), based on the redundancy between FCM clusters.
2 Assignments
0 Petitions
Accused Products
Abstract
This invention relates to a method for identifying and classifying carcinomas on the skin of a subject by a FTIR or Raman spectrometer coupled with a microimaging system.
5 Citations
View as Search Results
Solid catalyst components for olefin polymerization and methods of making and using the same  
Patent #
US 9,663,595 B2
Filed 08/05/2014

Current Assignee
Braskem America Inc.

Sponsoring Entity
W R Grace Company  CONN

Learning pixel visual context from object characteristics to generate rich semantic images  
Patent #
US 10,445,557 B2
Filed 08/03/2017

Current Assignee
Definiens AG

Sponsoring Entity
Definiens AG

METHOD, SYSTEM AND APPARATUS FOR REALTIME CLASSIFICATION OF MUSCLE SIGNALS FROM SELF SELECTED INTENTIONAL MOVEMENTS  
Patent #
US 20100293115A1
Filed 08/21/2007

Current Assignee
BLOORVIEW KIDS REHABILITATION HOSPITAL

Sponsoring Entity
BLOORVIEW KIDS REHABILITATION HOSPITAL

Systems and methods for classification of biological datasets  
Patent #
US 20080118124A1
Filed 10/18/2007

Current Assignee
Trustees Of The University Of Pennsylvania

Sponsoring Entity
Trustees Of The University Of Pennsylvania

Complete mitochondrial genome sequences as a diagnostic tool for the health sciences  
Patent #
US 20050026167A1
Filed 12/11/2003

Current Assignee
Genesis Genomics Inc

Sponsoring Entity
Genesis Genomics Inc

17 Claims
 1. A fuzzy Cmeans (FCM) clustering algorithm for processing spectral images of a tissue sample, wherein the algorithm automatically and simultaneously estimates the optimal values of K (number of nonredundant FCM clusters), and m (fuzziness index), based on the redundancy between FCM clusters.
 6. A method for characterizing the tumor heterogeneity of a lesion comprising:
 a) scanning a lesion on a tissue sample by a FTIR or Raman spectrometer coupled with a microimaging system;
b) acquiring and storing spectra of a series of digital images of the lesion;
c) clustering the spectra by fuzzy Cmeans (FCM) clustering algorithm wherein the algorithm automatically and simultaneously estimates the optimal values of K (number of nonredundant FCM clusters), and m (fuzziness index), based on the redundancy between FCM clusters.  View Dependent Claims (7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
 a) scanning a lesion on a tissue sample by a FTIR or Raman spectrometer coupled with a microimaging system;
1 Specification
This application is entitled to priority U.S. Provisional Patent Application No. 61/282767, filed Mar. 26, 2010. The content the application is hereby incorporated by reference in its entirety.
The biochemical changes related to carcinogenesis between cancerous and surrounding tissue areas are subtle. As a consequence, spectral images, such as IR and Raman spectra, need to be processed by powerful digital signal processing and pattern recognition methods in order to highlight these changes. To date, unsupervised “hard” clustering techniques including Kmeans (KM) or agglomerative hierarchical (AH) clustering have been usually applied to create colorcoded images allowing to localize tumoral tissue surrounded by other tissue structures (normal, inflammatory, fibrotic . . . ).
The particularity of “hard” clustering methods is that each pixel (spectrum) is assigned to only one cluster. Consequently, they neither allow to consider the progressive transition between noncancerous tissues and cancer lesions, nor to reveal every nuance of intratumoral heterogeneity. See Wolthuis, R.; Travo, A.; Nicolet, C.; Neuville, A.; Gaub, M. P.; Guennot, D.; Ly, E.; Manfait, M.; Jeannesson, P.; Piot, O. Analytical Chemistry 2008, 80, 84618469.
To overcome this drawback, fuzzy clustering methods such as fuzzy Cmeans (FCM) can be used instead of “hard” clustering algorithms. See Bezdek, J. C. Pattern recognition with fuzzy objective function algorithms; Plenum: New York, USA, 1981. Indeed, FCM allows each pixel to be assigned to every cluster with an associated membership value varying between 0 (no class membership) and 1 (highest degree of cluster membership). In IR spectroscopy, FCM has been used for data analyzing.
However, such as for “hard” KM clustering, the number of clusters K must be defined a priori by the user. The FCM results are thus dependent from the operatorexperience. In addition, FCM outcomes are dependent on another important parameter, called the fuzziness index m in the fuzzy logic literature. When m=1, FCM becomes identical to KM and when m increases, the clustering becomes fuzzier. At very high values of m, data will have an equal membership for all the clusters. In IR or Raman data processing, this can lead to create redundant cluster images, in which only some pixels differ from one cluster to another. However, the fuzziness index is classically fixed to 2 in the literature. The choice of an efficient tradeoff between K and m, necessary to fully exploit the information content of hyperspectral images, is still an open problem. See Mansfield, J. R.; Sowa, M. G.; Scarth, G. B.; Somorjai, R. L.; Mantsch, H. H., Analytical Chemistry 1997, 69, 33703374; and Richter, T.; Steiner, G.; AbuId, M. H.; Salzer, R.; Bergmann, R.; Rodig, H.; Johannsen, B., Vibrational Spectroscopy 2002, 28, 103110. Indeed, as recently shown for colorectal adenocarcinoma, when the (K, m) couple is not optimized, FCM clustering proved to be less efficient than AH clustering in terms of tissue histopathological recognition. See Lasch, P.; Haensch, W.; Naumann, D.; Diem, M., Biochimica et Biophysica Acta 2004, 1688, 176186.
The present invention offers a novel algorithm dedicated to spectral images of tumoral tissue, which can automatically estimate the optimal values of K, number of nonredundant FCM clusters, and m, fuzziness index, without any a priori knowledge of the dataset. This innovative algorithm is based on the redundancy between FCM clusters. This algorithm is particularly well adapted to localize tumoral areas and to highlight transition areas between tumor and surrounding tissue structures. For the infiltrative tumors, a progressive gradient in the membership values of the pixels of the peritumoral tissue is also revealed.
The present invention provides a fuzzy Cmeans (FCM) clustering algorithm for processing spectral images of a tissue sample. The algorithm automatically and simultaneously estimates the optimal values of K (number of nonredundant FCM clusters), and m (fuzziness index), based on the redundancy between FCM clusters.
The present invention also provides a method for characterizing the tumor heterogeneity of a lesion. According to the present invention, the characterization was conducted by the following steps: a) scanning a lesion on a tissue sample by a FTIR or Raman spectrometer coupled with a microimaging system; b) acquiring and storing spectra of a series of digital images of the lesion; c) clustering the spectra by fuzzy Cmeans (FCM) clustering algorithm. Further, the algorithm automatically and simultaneously estimates the optimal values of K (number of nonredundant FCM clusters), and m, (fuzziness index) based on the redundancy between FCM clusters.
The developed algorithm was applied on the IR datasets acquired on 13 biopsies of formalin fixed paraffinembedded human skin carcinomas: squamous cell carcinomas (SCC, n=3), basal cell carcinomas (BCC, n=4) and Bowen'"'"'s diseases (n=6). The samples were obtained from the tumor bank of the Pathology Department of the University Hospital of Reims (France). Ten micronthick slices were cut from samples and mounted, without any particular preparation, on a calcium fluoride (CaF2) (Crystran Ltd., Dorset, UK) window for FTIR imaging. Adjacent slices were cut and stained with hematoxylin and eosin (H&E) for conventional histology.
FTIR hyperspectral images were recorded with a Spectrum Spotlight 300 FTIR imaging system coupled to a Spectrum one FTIR spectrometer (Perkin Elmer Life Sciences, France) with a spatial resolution of 6.25 μtm and a spectral resolution of 4 cm^{−1}. The device was equipped with a nitrogencooled mercury cadmium telluride 16pixelline detector for imaging. Spectral images, also called datasets, were collected using 16 accumulations. Prior to each acquisition, a reference spectrum of the atmospheric environment and the CaF2 window was recorded with 240 accumulations. This reference spectrum was subsequently subtracted from each dataset automatically by a builtin function from the Perkin Elmer Spotlight software. Each image pixel represented an IR spectrum, which was the absorbance of one measurement point (6.25×6.25 μm^{2}) over 451 wavenumbers uniformly distributed between 900 and 1800 cm^{−1}. This spectral range, characterized as the fingerprint region, actually corresponded to the most informative region for the biological samples.
The samples were analyzed without previous chemical dewaxing, the recorded FTIR hyperspectral image must be digitally corrected for paraffin spectral contribution. To this end, an automated processing method based on extended multiplicative signal correction (EMSC) was applied on each recorded dataset. The details of the corresponding analytical method was fully described by Ly, E.; Piot, O.; Wolthuis, R.; Durlach, A.; Bernard, P.; and Manfait, M., (Analyst 2008, 133, 197205), which is herein adopted in its entirety. Briefly, a mean spectrum I was computed by averaging all Q recorded spectra I_{q }of each dataset. Light scattering effects were modeled with a fourthorder polynomial function P. The interference matrix M was composed of the average spectrum of paraffin and the first 9 principal components extracted from a FTIR spectral image recorded on a pure paraffin block, in order to take into account the spectral variability of the paraffin. Each recorded spectrum I_{q }is fitted with I, P, and M by using a least square approach:
I_{q}=α_{q}I+β_{q}P+γ_{q}M+e_{q}, q=1, . . . , Q.
The residue e_{q}, giving an estimation of the accuracy of the fitting model, is used to obtain the EMSCcorrected spectra:
I_{q}^{corr}=I+e_{q}/α_{q}.
After the application of EMSCbased preprocessing, paraffin contribution was neutralized and permitted to retain in the datasets only the spectral variability of the tissue and to normalize the corrected spectra around the mean spectrum. Two IR spectra before and after EMSCbased preprocessing are shown in
In addition, this preprocessing made it possible to discard from the analysis outliers spectra with poor signaltonoise ratio. The corresponding pixels were whitecolored at the clustering colorcoded images for better visualization.
The main objective of clustering is to find similarities between spectral datasets and then group similar spectra together in order to reveal areas of interest within tissue sections. In cancer research, clustering methods allow creating highly contrasted colorcoded images permitting to localize tumoral areas within a complex tissue. Details of the clustering method is described by Ly, E.; Piot, O.; Wolthuis, R.; Durlach, A.; Bernard, P.; and Manfait, M., (Analyst 2008, 133, 197205) and by Lasch, P.; Haensch, W.; Naumann, D.; and Diem, M. (Biochimica et Biophysica Acta 2004, 1688, 176186), which are adopted herein in their entirety.
KM clustering is a nonhierarchical partition clustering method. The aim of KM was to minimize an objective function based on a distance measure between each spectrum and the centroid of the cluster to which the spectrum was affected. This algorithm iteratively partitioned the data into K distinct clusters. Here, KM clustering was performed several times (n>10) to make sure a stable solution was reached, and to overcome the random initialization dependence. In this study, KM was applied using the Matlab Statistics Toolbox with the classical Euclidean distance. The process was continued until no spectrum was reassigned from one iteration to the following, otherwise it was stopped after 10^{4 }iterations.
AH clustering is a hierarchical partition clustering, in which each object (spectrum in our case) is one cluster at the beginning of the algorithm. At each iteration step, AH regroups the two clusters that are the most similar into a new cluster. The algorithm is stopped when the all spectra are combined into one single cluster. For Q spectra, the number of iterations equals to Q−1. AH clustering process is independent of initialization. However, like for KM, in AH clustering, the number of clusters K is empirically chosen. Compared to KM, AH clustering is significantly more time and resourceconsuming.
In order to reduce the computational time of AH clustering on our large dataset, we used here an efficient hybrid hierarchical agglomerative clustering (HHAC) technique that combined KM and AH clusterings using Euclidean distance and Ward'"'"'s algorithm, which was described by Vijaya, P. A.; Murty, M. N.; Subramanian, D. K. in Lecture Notes in Computer Science 2005, 3776/2005, 583588 and adopted herein in its entirety. KM was first applied to reduce the datasets to 1000 cluster centers. AH was then carried out on these 1000 KM centroids.
The FCM clustering is based on the minimization of the objective function J_{m}:
I_{m}=Σ_{q−1}^{Q}Σ_{k=1}^{K}u_{qk}^{m}∥I_{q}^{corr}−v_{k}∥^{2 }
defined as the sum of the within cluster errors (computed as the Euclidian distance, i.e. L2 norm, ∥.∥, between the Q available corrected spectra I_{q}^{corr }and the K cluster centroids v_{k}), weighted by the membership values u_{qk}. The cluster centroids and the membership values that minimize this objective function are obtained by using an iterative optimization procedure (see Bezdek, J. C. Pattern recognition with fuzzy objective function algorithms; Plenum: New York, USA, 1981). The weight is controlled by the fuzziness index m. Therefore, contrary to “hard” clustering, FCM permits to affect each spectrum I_{q}^{corr }to every cluster k (k=1, . . . , K) with the associated membership value u_{qk }varying between 0 and 1; the sum of the K cluster membership values for each spectrum being equal to 1, i.e. Σ_{k=1}^{K}u_{qk}=1.
Here we applied the FCM function from the Matlab Statistics Toolbox. A maximum number of 500 iterations and a setting of 10^{−5 }for the minimal amount of improvements (at the level of the sum of each spectrum/centroid distance) were used as the stopping criteria. However, FCM required to fix the number of clusters K and the fuzziness index m. An inappropriate choice of these parameters could lead to an uninterpretable clustering of the data. The development of an automatic method to optimally estimate these parameters was thus essential.
This innovative algorithm (RBA), based on the FCM clusters redundancy, aimed at determining an optimal couple (K_{opt}, m_{opt}) without any a priori knowledge of the dataset. We had chosen here the intercorrelation coefficient R_{ij}(K,m) between two clusters i and j as the measure of redundancy:
where c(i,j)=Σ_{q=1}^{Q}(u_{qi}−ū_{i})(u_{qj}−ū_{j}) is the covariance between the membership values of clusters i and j given by FCM for a couple (K,m), c(i,i)=Σ_{q−1}^{Q}(u_{qi}−ū_{i})^{2 }and c(j,j)=Σ_{q=1}^{Q}(u_{qj}−−ū_{j})^{2 }are the variances of the membership values of cluster i and j, with the means
The RBA is composed of three steps. Firstly, the iterative process for the reduction of the number of clusters was performed. For this step, N different values of the fuzziness index belonging to the set m={m_{1}, . . . , m_{n}, . . . , m_{N}} and L different values of the threshold belonging to the set s={s_{1}, . . . , s_{l}, . . . , s_{L}} were considered. m is composed of N different values of the fuzziness index m, uniformly distributed around the classical value m=2, while s is composed of L different values of threshold uniformly distributed into the high correlation coefficient range 50% to 95%. FCM clustering started with m_{1}, s_{L }and a large value of the number of clusters K, i.e. K=K_{max}. In a general manner, for a triplet of the values (m_{n}, s_{1}, K), the intercorrelation coefficients R_{ij}(K,m_{n}), with 1≦i,j≦K, were computed. If one of the R_{ij}(K,m_{n}) values was superior to s_{1}, a new FCM was run with K=K−1. Otherwise, if all the values of R_{ij}(K,m_{n}) were less than the threshold value s_{1}, the number of nonredundant clusters K_{nr}^{s}^{l}(m_{n}) (corresponding to the last value of K) was obtained. The subscript “nr” is used in the following to denote the nonredundancy of clusters.
By performing this procedure for the different values of m and a fixed threshold s_{l}, a curve of the number of nonredundant clusters K_{nr}^{s}^{l}(m) was obtained as a function of m. The iterative process of the reduction of the number of clusters for the next m (i.e. m_{n+1 }which belongs to the set m) should restart with an initial value of K equals to the number of nonredundant clusters estimated for the previous m, i.e. K=K_{nr}^{s}^{l}(m_{n}). However, the FCM algorithm being randomly initialized, the estimated number of nonredundant clusters could vary from one clustering to another. In order to take this possible variation into account, the initial value of K for the next m was set to the number of nonredundant clusters for the previous m plus two, i.e. K=K_{nr}^{s}^{l}(m_{n})+2, however without exceeding K_{max}. By executing this procedure for the all values of the set s, the resulting K_{nr}^{s}^{l}(m) curves were obtained for each threshold value s_{i}. The global procedure is depicted in
Secondly, the RBA consists in the optimal estimation of the number of clusters from the obtained curves. As presented in the Results and discussion section, these curves decreased rapidly and become stable at the {circumflex over (K)}_{opt}^{s}^{l }value, where “̂”denotes (here and hereafter) an estimator. Whatever the threshold s_{l }was, we usually observed that the breakings in these curves appeared for close values {circumflex over (K)}_{opt}^{s}^{l }and often for the same value. A majority voting algorithm is used to identify the final optimal value {circumflex over (K)}_{opt }of the number of clusters.
Finally, the optimal value {circumflex over (m)}_{opt }of the fuzziness index is computed by averaging the smallest values {circumflex over (m)}_{opt}^{s}^{l }for which the curves K_{nr}^{s}^{l}(m) presented a break at {circumflex over (k)}_{opt}:
{circumflex over (m)}_{opt}=mean_{s}_{l}_{EB}({circumflex over (m)}_{opt}^{s}^{l}), with {circumflex over (m)}_{opt}^{s}^{l}=min(arg(K_{nr}^{s}^{l}(m)={circumflex over (K)}_{opt}^{s}^{l})).
Hereafter, FCM clustering performed with these RBAoptimized parameters will be defined as FCMRBA.
The FCMRBA clustering was assessed on EMSCpreprocessed FTIR hyperspectral images acquired on thin tissue sections of 13 human skin carcinomas. The results were compared with KM, HHAC and classical FCM outcomes. To improve the reading of this section, we presented these comparative results for an infiltrative SCC. In addition, FCMRBA clustering data were given for noninfiltrative states of a superficial BCC and a Bowen'"'"'s disease, whereas corresponding KM, HHAC and FCM outcomes were presented in
The H&Estained histological image of the studied SCC sample, on which the tumor is outlined, is provided in
To highlight the distinctive histological regions of this paraffinembedded tissue section, KM clustering was applied with an empirical choice of 11 clusters. The resulting colorcoded image is shown in
Comparison of KM and HHAC images with the corresponding H&Estained section permitted an assignment of the clusters. As shown here for KM clustering (
These results indicate that “hard” clustering algorithms were able to retrieve the histological structures and especially to localize tumoral areas within the tissue section. However, the choice of the number of clusters was a difficult problem that is usually empirically resolved. When less than 11 clusters were chosen, the histological regions identified by clustering algorithms were mixed and the intratumor heterogeneity was no more revealed. With more than 11 clusters, no further interpretable information was obtained. Furthermore, the principal drawback of these “hard” clustering methods was that the cluster membership grade of each individual spectrum equaled to 0 or 1, which did not permit to differentiate the nuances of pixel membership. Consequently, these techniques did not allow to consider progressive transitions likely to exist at he invasion front of a tumor or between heterogeneous intratumoral areas.
Classical FCM clustering
The results obtained by using the FCM algorithm without optimized parameters on the same dataset are shown in
A visual comparison of the clusters presented in
These results demonstrated that classical FCM created noninformative redundant images in which only few pixels differed from one cluster to another. Therefore, it was essential to choose the optimal couple of K and m parameters to obtain a biologicallyrelevant clustering.
Simultaneous determination of optimal K and m parameters was performed using an innovative algorithm (RBA). In our investigation, a value of K_{max}=20, a set of fuzziness indices m={1.4, 1.5, . . . , 2.5}, and a set of thresholds s={0.5, 0.55, . . . , 0.95} were tested. The curves K_{nr}^{s}^{l}(m), representing the number of nonredundant clusters as a function of m obtained by this method for the different values of the threshold s_{l }are shown in
It has to be mentioned, that in our case, classical validity indices used to determine the optimal number of FCM clusters K failed to correlate with standard histopathology. Indeed, the partition coefficient and classification entropy (see Bezdek, J. C. Pattern recognition with fuzzy objective function algorithms; Plenum: New York, USA, 1981) applied with m=2 give an aberrant value of K=2 that did not permit to reveal the different tissue structures. These data reinforced the relevancy of our developed RBA in terms of tissue structure differentiation.
The images generated by the FCMRBA are depicted in
After having analyzed a SCC sample as a model of an infiltrative skin cancer, the FCMRBA outcomes were presented for a superficial BCC and a Bowen'"'"'s disease samples, both representative of noninvasive skin cancers. The optimization of FCM parameters by RBA are shown for these samples in
As shown in
For the Bowen'"'"'s disease sample, FCMRBA revealed 5 clusters that were assigned to the following histological structures: epidermis (cluster 1), dermis (clusters 2, 3 and 4) and tumor (cluster 5). Visual comparative analysis of clusters 1 and 5 indicated that the tumor was welllocalized within the normal epidermis. In addition, FCMRBA did not reveal the presence of a gradient in the membership values of the pixels at the tumor/neighboring epidermis interface. Contrary to the SCC and BCC studied samples, this absence of interconnectivity was in accordance with the fact that Bowen'"'"'s diseases corresponded to welllocalized in situ carcinomas.
Spectral microimaging associated with clustering techniques showed a great potential for the direct analysis of paraffinembedded tissue sections of human skin cancers. Our results demonstrated that FCM clustering is more powerful than classical “hard” clustering (KM and hierarchical classification) to reveal biologicallyrelevant information related to the tumor heterogeneity and invasiveness. Thus, we developed an original algorithm dedicated to the simultaneous determination of the optimal FCM parameters (number of clusters K, and fuzziness index m). This novel data processing makes FTIR or Raman microimaging a promising tool, independent of the intraobserver variability, for applications in routine diagnostic medicine.