# Minitab Correlation Spoilage Homework

Use the Spearman correlation coefficient to examine the strength and direction of the monotonic relationship between two continuous or ordinal variables. In a monotonic relationship, the variables tend to move in the same relative direction, but not necessarily at a constant rate. To calculate the Spearman correlation, Minitab ranks the raw data. Then, Minitab calculates the correlation coefficient on the ranked data.
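The rank-then-correlate procedure described above can be sketched in a few lines. This is a minimal illustration, not Minitab's implementation; it assumes tied values receive the average of their ranks, and the helper names (`average_ranks`, `pearson`, `spearman`) are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Ordinary product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Monotonic but non-linear data (y = x**3), chosen only for illustration.
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]
rho_increasing = spearman(x, y)
rho_decreasing = spearman(x, [-v for v in y])
```

Because only the ordering matters, the cubic relationship gives a coefficient of 1 in absolute value even though the raw data are far from linear.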

Strength

The correlation coefficient can range in value from −1 to +1. The larger the absolute value of the coefficient, the stronger the relationship between the variables.

For the Spearman correlation, an absolute value of 1 indicates that the rank-ordered data are perfectly linear. For example, a Spearman correlation of −1 means that the highest value for Variable A is associated with the lowest value for Variable B, the second highest value for Variable A is associated with the second lowest value for Variable B, and so on.

Direction

The sign of the coefficient indicates the direction of the relationship. If both variables tend to increase or decrease together, the coefficient is positive, and the line that represents the correlation slopes upward. If one variable tends to increase as the other decreases, the coefficient is negative, and the line that represents the correlation slopes downward.

The following plots show data with specific Spearman correlation coefficient values to illustrate different patterns in the strength and direction of the relationships between variables.

It is never appropriate to conclude that changes in one variable cause changes in another based on correlation alone. Only properly controlled experiments enable you to determine whether a relationship is causal.

## Authors

I. Dalezios*
Department of Food Science & Technology, Cornell University, Geneva, NY, USA

K.J. Siebert
Department of Food Science & Technology, Cornell University, Geneva, NY, USA

*Present address: Laboratory of Dairy Research, Department of Food Science and Technology, Agricultural University of Athens, Athens, Greece.

Correspondence: K.J. Siebert, Department of Food Science & Technology, Cornell University, Geneva, NY 14456, USA (e-mail: kjs3@cornell.edu).

## Abstract

Aims: The goal of this study was to evaluate three pattern recognition methods for use in the identification of lactic acid bacteria.

Methods and Results: Lactic acid bacteria (21 unknown isolates and 30 well-characterized strains), including the Lactobacillus, Lactococcus, Streptococcus, Pediococcus and Oenococcus genera, were tested for 49 phenotypic responses (acid production on carbon sources). The results were scored in several ways. Three procedures, k-nearest neighbour analysis (KNN), k-means clustering and fuzzy c-means clustering (FCM), were applied to the data.

Conclusions: k-Nearest neighbour analysis performed better with five-point-scaled than with binary data, indicating that intermediate values are helpful to classification. k-Means clustering performed slightly better than KNN and was best with fuzzified data. The best overall results were obtained with FCM. Genus level classification was best with FCM using an exponent of 1·25.

Significance and Impact of the Study: The three pattern recognition methods offer some advantages over other approaches to organism classification.

## INTRODUCTION

The lactic acid bacteria (LAB) have a long history of importance in food technology (Stiles and Holzapfel 1997). The classification was originally constituted by grouping together all the bacteria that produce lactic acid regardless of other characteristics. The LAB are generally considered to comprise Gram-positive and usually catalase-negative bacteria that grow under microaerophilic to strictly anaerobic conditions and which do not form spores (Klein et al. 1998), although there are some exceptions. The LAB include a wide variety of cell types and physiological and biochemical behaviour. The taxonomy has undergone frequent changes as knowledge has improved. Recent rearrangements have led to the following LAB genera of food interest: Carnobacterium, Enterococcus, Lactobacillus, Lactococcus, Leuconostoc, Oenococcus, Pediococcus, Streptococcus, Tetragenococcus, Vagococcus and Weissella (Stiles and Holzapfel 1997). One of the problems in identifying members of the LAB is the ready transfer (via plasmids) of some characteristics between strains and species, leading to different behaviour between various strains of some species. Another classification problem is posed by organisms that exhibit similar phenotypic behaviour, but which are actually quite different genotypically. The large number of species encompassed by the LAB (e.g. there are over 70 validated species in the Lactobacillus genus) also makes identification difficult because there are many possible misidentifications. Attempts to classify LAB to the species level have usually met with much less success than was the case with other bacterial groups.

The identification of an unknown micro-organism requires comparison of its characteristics with either a classification scheme or with reference organisms. Such comparisons may be based upon experimental or literature data. The characters (measurements) used may represent phenotypic or genotypic assessments, or a combination of the two. Phenotypic characters that have frequently been used for bacterial classification include cell and colony morphology, methyl esters of fatty acids (Decallonne et al. 1991), patterns of proteins in the cell wall (Gatti et al. 1999) or entire cell (Tsakalidou et al. 1994) and a variety of physiological and biochemical tests (Klein et al. 1998). Genotypic characters that have been used include ribotyping (Zhong et al. 1998), DNA/DNA hybridization (Schillinger 1999), DNA homology (often represented as C + G percentage) and restriction fragment length polymorphism (Ståhl et al. 1990).

Cell morphology includes shape and tendency to form clumps or chains. Physiological tests include sensitivity to temperature, salt, etc. Biochemical tests include the ability to assimilate or ferment particular carbohydrates, to produce gas on certain substrates, to produce particular metabolic products (e.g. the D- or L-form of lactic acid), to grow on selective media, etc. Phenotypic tests have a long history of use in numerical taxonomy and often they have been applied to LAB in a large battery. In one study 50 carbohydrate fermentation tests, together with 22 other biochemical tests, were employed (Dykes et al. 1994). In other reports multiple attributes (as many as 79–108 characters), including biochemical and physiological tests together with cell and colony morphology, were used (Döring et al. 1988; Borch and Molin 1988; Lee et al. 1982; Schillinger and Lücke 1987; Shaw and Harding 1984). Other tests that are presumably ultimately phenotypically based include pyrolysis gas chromatography (MacFie et al. 1978) and direct probe mass spectrometry (Shaw et al. 1984).

Comparison of an unknown to reference organisms or to various groupings in a classification scheme is generally based on judging the similarity between a set of observations of the unknown and the reference. Each pattern represents an operational taxonomic unit (OTU); this can be a genus, a species, a strain or whatever is being considered. Similarity is judged by organism behaviour on a number of different, independent measurements. Often results of characterization tests are scored as having two possible outcomes, e.g. Gram-positive or Gram-negative, growth or no growth under some condition, ferments or does not ferment some carbon source, etc. This is called binary scoring, as each test has only two possible outcomes (corresponding to 0 or 1). Sometimes an intermediate result is forced into one of the two choices, for example negative (none or trace) vs positive (weak, definite or strong positive). Characterization of organisms assessed with binary tests can be accomplished with either a classification (decision) tree approach or by pattern matching.

A decision tree is a serial, sequential process. At each branch, one of two outcomes for a test leads to another test. Finally, when the last branch is traversed, the sample is identified (to the extent that this is possible). This is slow, but conservative of materials and labour. It assumes invariant behaviour of samples and reference strains and no error in the results. Decision trees have been applied to LAB as complete or partial identification schemes. They have also been used as a first stage before either another decision tree (Schillinger and Lücke 1987) or a multivariate procedure (Wijtzes et al. 1997).

Another approach to identification/characterization involves matching patterns of character responses. In this case a data matrix that contains responses for multiple characters for an organism is used. This requires determination of all the characters to be used before the matching step. This parallel approach is inherently less time consuming than a sequential approach like a decision tree, but it does consume materials and perform tests that may not be useful in a particular situation.

Matching can, in some cases, be done by visual inspection of patterns. This has been done particularly with some of the genetically based tests, where patterns in electrophoresis gels are compared (Zhong et al. 1998). Visual inspection is suitable for comparing modest numbers of patterns but becomes overwhelming with larger numbers of possibilities, so computer programs are typically used to search for matches (Heyndrickx et al. 1996).

One widely used approach to pattern matching in microbiology is the calculation of pair-wise similarity coefficients between an unknown and each reference OTU (Sneath 1972). Most often this has been done as the simple matching similarity coefficient, SSM (Sokal and Sneath 1963). This expresses agreement in test results (whether the test outcome is negative or positive) for each character between a test and a reference organism and has been widely used in numeric taxonomy (Dykes et al. 1994; Shaw and Harding 1984; Borch and Molin 1988). The process is repeated for each reference organism and the identity of the test organism is generally considered to be that of the reference organism that provides the best match (highest SSM). A somewhat less frequently used approach is the Jaccard similarity coefficient, SJ (Sneath 1972). This considers agreement for only the tests with a positive response. Similarity coefficients of either type can be used to produce a dendrogram representing the similarity of patterns and relationships. This has been attractive, as it resembles the way taxonomic relationships have often been displayed.
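The two coefficients described above can be sketched for binary-scored profiles as follows. The profiles and the function names are hypothetical illustrations, not data or code from the study; the Jaccard form assumed here is the usual one that ignores characters negative in both organisms.

```python
def simple_matching(a, b):
    """S_SM: fraction of characters on which two binary profiles agree."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def jaccard(a, b):
    """S_J: shared positives divided by characters positive in either profile."""
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return both / either if either else 0.0


def best_match(unknown, references, coefficient=simple_matching):
    """Return the name of the reference profile with the highest similarity."""
    return max(references, key=lambda name: coefficient(unknown, references[name]))


# Hypothetical six-character binary profiles (not data from the study).
references = {
    "ref_A": [1, 1, 0, 0, 1, 0],
    "ref_B": [0, 1, 1, 0, 0, 0],
}
unknown = [1, 1, 0, 0, 0, 0]

ssm_a = simple_matching(unknown, references["ref_A"])  # agrees on 5 of 6 characters
sj_a = jaccard(unknown, references["ref_A"])           # 2 shared positives of 3
identity = best_match(unknown, references)
```

Shared negatives raise S_SM but leave S_J unchanged, which is why the two coefficients can rank references differently when most tests are negative.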

A number of mostly non-hierarchical multivariate methods have found application in pattern matching. Principal components analysis (PCA) is a good dimensionality reduction and visualization tool and has found application in numerical taxonomy (MacFie et al. 1978; Ståhl et al. 1990), but it attempts to optimize variance explanation rather than discrimination. Hierarchical cluster analysis can be used to produce dendrograms representing OTU similarity. This typically uses the Euclidean (vector) distance of the OTU patterns from one another in multidimensional space as an expression of similarity. Patterns that are close together (short vector length) are considered more similar than those that are further apart. In addition to Euclidean distance, Pythagorean distance (Euclidean distance squared) and Mahalanobis distance (which takes into account correlation as well as distance) have occasionally been used as the similarity measures (MacFie et al. 1978; Shaw et al. 1985; Shaw and Harding 1984; Borch and Molin 1988). The SIMCA pattern recognition procedure (Ståhl et al. 1990) and non-linear mapping (Shaw et al. 1985) have also been used. With the multivariate methods, unlike the case for similarity coefficients, it is readily possible to employ not only binary scored data, but also semiquantitative or continuous property results.

One of the problems in making identifications is imprecision. Non-reproducibility of results can occur in several ways. A sample intended to represent a reference strain could be in error through contamination, mutation, misidentification or possibly changes in taxonomy. Errors can also occur in obtaining results. These include sampling errors, laboratory determination errors that lead to growth failures (false negatives), contamination leading to a positive response (false positives), scoring errors, data entry errors, etc. Another problem with identification is variability between strains when the OTU is species; it is well known that different strains of the same species may behave differently on individual character assessments. One type of sampling error could occur if a reference sample used to build a classification scheme is not a good representation of the OTU behaviour seen in nature (suppose that four of five strains of a species give a positive response for a character but, by chance, the reference chosen is the fifth one). With a decision tree any single error can and probably will result in incorrect identification. Similarity matching coefficients and multivariate approaches are more tolerant of errors. A detailed analysis of the effects of error on similarity coefficients has been made (Sneath and Johnson 1972).

Another source of imprecision and variability is an intermediate response in a test. Growth responses are sometimes characterized as ‘weak’ or ‘delayed’ or ‘slow’ rather than a clear positive or negative. This type of response cannot be represented by binary scoring schemes and must be forced to fit into one of the two binary choices for calculation of similarity coefficients. The multivariate methods can employ data representing multiple states (either semiquantitative or continuous property). This means that test outcomes scored as positive, negative or intermediate, or on some point scale, can be employed.

Methods that are both tolerant of error and which allow imprecise matching are thus called for. Ideally, the method should indicate how confident one can be of an identification and which, if any, possible alternate identifications are close.

Non-hierarchical cluster analysis methods have some attraction. With hierarchical clustering, each sample is joined together with the others and remains joined at all greater similarity levels. With non-hierarchical clustering, the grouping of samples can be rearranged at successive levels. That is, a sample clustered with another at one level may not be clustered with the same sample at another level. This type of clustering cannot be represented by a dendrogram. In both hierarchical and non-hierarchical clustering, each sample is considered to have membership in one and only one group at each stage; this can be expressed in terms of the ‘membership value’ of a sample. The membership value in one cluster will be 1 and its membership value in all other clusters will be 0.

With fuzzy clustering, samples can be represented as having partial membership in more than one group (Zadeh 1965). The sense of this is that a certain probability of one result outcome (as, for example, when 70% of the strains of a species behave one way for a character, and the rest another) can be represented. This may be a more realistic way to represent data with uncertainties (especially where variation in behaviour between strains occurs when the OTU is species). In fuzzy clustering this is represented by membership values between 0 and 1, where the total membership (summing to 1) may be distributed across more than one cluster, e.g. a sample may have a membership value of 0·7 in one cluster, 0·2 in a second and 0·1 in a third. This may also be advantageous for matching test results that do not fit well into binary scoring schemes. While fuzzy clustering has been employed in ecology, engineering and medicine (Equiha 1990; Marengo et al. 1991; Zhang et al. 1995; Foody 1996; Friedrichs et al. 1996), no prior reports of its application to microbial classification were discovered.

The overall process by which fuzzy logic is used to classify samples into clusters has two stages (Bezdek 1981; Jang and Gulley 1995). In the first stage the input data are fuzzified. Data are operated upon by two or more if-then rules using appropriate membership functions. The output of each rule is a fuzzy set. The fuzzy set outputs are then aggregated into a single output fuzzy set. This is defuzzified, or resolved, into a single number. In the second stage the fuzzified results are grouped together into clusters. The membership grade for each sample in each cluster is calculated in an iterative procedure. The weighting exponent m controls the extent of membership sharing between fuzzy clusters. As m approaches 1, fuzzy c-means clustering (FCM) converges toward k-means clustering (KMC) (with membership values of either ‘0’ or ‘1’). With larger values of m, the membership assignments are ‘fuzzier’ (tending toward lower membership values spread across more clusters). No theoretical basis for an optimal choice for m has emerged to date (Bezdek 1981). As a result m-values are chosen arbitrarily.
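The effect of the weighting exponent m can be illustrated with the standard FCM membership update, in which a sample's membership in cluster i is 1 / Σ_j (d_i / d_j)^(2/(m−1)), where d_i is its distance to centre i. The sketch below assumes that standard form; the distances are hypothetical and the function name is illustrative only.

```python
def fcm_memberships(distances, m):
    """Membership of one sample in each cluster, given its distance to each centre.

    Uses the standard FCM update u_i = 1 / sum_j (d_i / d_j) ** (2 / (m - 1)).
    """
    if m <= 1:
        raise ValueError("m must be greater than 1")
    # A sample sitting exactly on a centre belongs fully to that cluster.
    zeros = [d == 0 for d in distances]
    if any(zeros):
        return [1.0 / sum(zeros) if z else 0.0 for z in zeros]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_j) ** power for d_j in distances) for d_i in distances]


# Hypothetical distances from one sample to three cluster centres.
distances = [1.0, 2.0, 4.0]
crisp_like = fcm_memberships(distances, m=1.25)  # nearly all membership in cluster 0
fuzzier = fcm_memberships(distances, m=1.7)      # membership spread more widely
```

With these distances, m = 1·25 puts more than 0·99 of the membership in the nearest cluster, whereas m = 1·7 shares noticeably more with the other two.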

The objective of this study was to compare the ability of three pattern recognition techniques, k-nearest neighbour analysis (KNN), KMC (a non-hierarchical clustering technique) and FCM (a fuzzy clustering method) for classification of LAB. It was thought that the ability of fuzzy logic to exploit information contained in weak or variable phenotypic responses could be advantageous.

## MATERIALS AND METHODS

### Strains

The 30 known strains of LAB used in this study are listed in Table 1. An additional 21 unidentified LAB isolates, designated RK1–RK21, were supplied by Dr Carole Rehkugler (Department of Microbiology, Cornell University, Ithaca, NY, USA).

Table 1.   Known lactic acid bacteria strains used

All the strains were grown on de Man, Rogosa, Sharpe (MRS) agar (Difco Laboratories, Detroit, MI, USA) plates at either 30 or 37°C. Colonies were isolated and subcultured twice in MRS broth (Difco Laboratories) and then stored in 10% w/v skim milk (Difco Laboratories) at −18°C.

### Carbohydrate metabolism tests

Media. API 50 CH strips and API 50 CHL medium (bioMerieux Vitek, Hazelwood, MO, USA) were used to study the carbohydrate fermentation ability of the test organisms. These strips test behaviour on 49 sole carbon sources including carbohydrates, heterosides, polyalcohols and uronic acids (see Table 2).

Table 2.   Carbon sources in the API 50 CH system

### Inoculum preparation and fermentation tests

Samples were taken out of the −18°C freezer, thawed at room temperature and incubated at 30 or 37°C, depending on the optimum growth temperature of the organism. Unknowns were first cultured at both 30 and 37°C. If no difference in growth was seen, 30°C was used for future work. Organisms were subcultured three times in MRS broth and collected in the late log phase by centrifugation for 10 min in a Sorvall RC5C refrigerated (4°C) centrifuge (DuPont Co., Newtown, CT, USA) with an SA600 rotor (5200 g). The biomass was washed twice with saline solution (0·8% w/v) followed by centrifugation at 5200 g for 10 min.

The API 50 CHL medium was inoculated into the tube portions of the API strips according to the manufacturer’s instructions. The cupules were filled with mineral oil in order to exclude air (oxygen). The strips were then incubated at 30 or 37°C, as appropriate for the organism. Observations of the indicator colour were made at 12, 24, 36 and 48 h. A colour change is indicative of fermentation of the corresponding carbon source. Results were recorded on a scale of zero to four depending on the colour intensity produced by the pH indicator (resulting from acidification of the medium). A score of zero was given to negative reactions and four was awarded to positive reactions of maximum intensity. Integers from one to three were assigned to intermediate strength responses. With a few exceptions, the carbon source fermentation profiles of the 30 known strains were determined twice, with the replications 4–6 weeks apart. The unknown strains were each analysed once.

### Data preprocessing and processing

The data analysis was carried out on the 48-h time point results. The data from the five-point (0–4) scale were converted into binary form (setting scores of 0–2 to ‘0’ and those of 3–4 to ‘1’) and also fuzzified (see next section). This resulted in three representations of the data set. The structural characteristics of the data space were explored with PCA using the SCAN Software for Chemometric Analysis Release 1 for Windows (MINITAB, State College, PA, USA). The PCA was carried out on the correlation matrix of the complete data set using the nonlinear iterative partial least squares (NIPALS) method.
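The binary conversion described above is a simple threshold on the five-point score. A minimal sketch, with a hypothetical function name and made-up scores:

```python
def to_binary(score):
    """Map a five-point API score to binary: 0-2 -> 0, 3-4 -> 1."""
    if score not in (0, 1, 2, 3, 4):
        raise ValueError(f"score must be an integer from 0 to 4, got {score!r}")
    return 1 if score >= 3 else 0


# Hypothetical 48-h scores for one organism on six carbon sources.
five_point = [0, 1, 2, 3, 4, 2]
binary = [to_binary(s) for s in five_point]
```

Note that scores of 1 and 2, which record a visible but weak colour change, become indistinguishable from a clear negative once thresholded.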

### Fuzzy inference

A simple fuzzy inference rule set based on Mamdani’s system (Mamdani and Assilian 1975) was developed using data from the literature and the API identification table. MATLAB (The MathWorks, Natick, MA, USA) and the MATLAB Fuzzy Logic Toolbox were employed.

The inference system consisted of one input variable (here designated as ‘Colour’), which represented the fermentation result for a carbon source. Each input was processed by one of five input membership functions (one for each reaction-intensity score from the API kit) and by nine rules. Three membership functions were then used to generate one output (‘Fermentation’) corresponding to a linguistic description of the result.

The rules of the fuzzy system used for each carbon source were:

1 if (Colour is 0) then (Fermentation is negative) (1);

2 if (Colour is 1) then (Fermentation is negative) (0·8);

3 if (Colour is 1) then (Fermentation is possible) (0·2);

4 if (Colour is 2) then (Fermentation is negative) (0·2);

5 if (Colour is 2) then (Fermentation is possible) (0·6);

6 if (Colour is 2) then (Fermentation is positive) (0·2);

7 if (Colour is 3) then (Fermentation is possible) (0·2);

8 if (Colour is 3) then (Fermentation is positive) (0·8) and

9 if (Colour is 4) then (Fermentation is positive) (1)

where the numbers in parentheses are the weights assigned to each rule.
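The nine weighted rules can be written down as a small table, which makes it easy to check that the weights attached to each Colour score sum to 1. The sketch below is only a transcription of the rule list; treating Colour as an exact match on the integer score is a simplifying assumption, and the names `RULES` and `rule_weights` are hypothetical.

```python
# Each rule: (Colour score, Fermentation label, weight), transcribed from the rule list.
RULES = [
    (0, "negative", 1.0),
    (1, "negative", 0.8),
    (1, "possible", 0.2),
    (2, "negative", 0.2),
    (2, "possible", 0.6),
    (2, "positive", 0.2),
    (3, "possible", 0.2),
    (3, "positive", 0.8),
    (4, "positive", 1.0),
]


def rule_weights(colour):
    """Return {label: weight} for the rules that fire on a given Colour score."""
    return {label: w for c, label, w in RULES if c == colour}


weights_for_2 = rule_weights(2)
total_per_score = {c: sum(rule_weights(c).values()) for c in range(5)}
```

The table is symmetric about a score of 2: scores of 1 and 3 mirror each other, as do scores of 0 and 4.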

The membership functions used were as follows.

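1 Gaussian curve membership function. This depends on the two parameters σ and c:

$$f(x; \sigma, c) = \exp\left(-\frac{(x - c)^2}{2\sigma^2}\right)$$

2 Sigmoid curve membership function. This depends on parameters α and c:

$$f(x; \alpha, c) = \frac{1}{1 + e^{-\alpha (x - c)}}$$

Depending on the sign of α, a sigmoid membership function is open right or left and thus appropriate to represent concepts such as ‘very positive’ or ‘very negative’.

3 Trapezoidal membership function. This depends on four parameters a, b, c and d:

$$f(x; a, b, c, d) = \max\left(\min\left(\frac{x - a}{b - a},\ 1,\ \frac{d - x}{d - c}\right),\ 0\right)$$

The parameters a and d locate the ‘feet’ of the trapezoid and b and c locate the ‘shoulders’.

4 Triangular membership function. This depends on three parameters a, b and c and is given by:

$$f(x; a, b, c) = \max\left(\min\left(\frac{x - a}{b - a},\ \frac{c - x}{c - b}\right),\ 0\right)$$

Parameters a and c locate the ‘feet’ of the triangle and b locates the peak.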
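The four membership function families described above can be sketched in their commonly used forms. The function names follow the MATLAB Fuzzy Logic Toolbox naming, but this is an independent Python sketch; the parameter values used in the study are those of Table 3 and are not reproduced here.

```python
import math


def gaussmf(x, sigma, c):
    """Gaussian curve centred at c with width sigma."""
    return math.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))


def sigmf(x, alpha, c):
    """Sigmoid crossing 0.5 at c; opens right for alpha > 0, left for alpha < 0."""
    return 1.0 / (1.0 + math.exp(-alpha * (x - c)))


def trapmf(x, a, b, c, d):
    """Trapezoid with feet a, d and shoulders b, c (assumes a < b <= c < d)."""
    return max(min((x - a) / (b - a), 1.0, (d - x) / (d - c)), 0.0)


def trimf(x, a, b, c):
    """Triangle with feet a, c and peak b (assumes a < b < c)."""
    return max(min((x - a) / (b - a), (c - x) / (c - b)), 0.0)
```

For example, `trimf(1, 0, 1, 2)` evaluates at the peak and returns 1.0, while any x outside the feet returns 0.0.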

The functions used are shown in Table 3.

Table 3.   Fuzzy membership function equation forms and coefficient values

The implication method used was the minimum method, which truncates the output fuzzy set. For aggregation, the maximum method was used and for defuzzification the centroid method, which assigns the centre of the area under the curve according to:

$$z^{*} = \frac{\int \mu(z)\, z\, dz}{\int \mu(z)\, dz}$$

where μ(z) is the aggregated output membership function.

The result of the fuzzification was a single number in the range [0,1] for each carbon source for each organism.
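The minimum implication, maximum aggregation and centroid defuzzification steps can be sketched numerically on a grid. Everything specific here is an assumption made for illustration: the triangular output sets, the grid, and the use of the rule weights as firing strengths. It will therefore not reproduce the study's fuzzified values.

```python
def trimf(x, a, b, c):
    """Triangular membership function with feet a, c and peak b (a < b < c)."""
    return max(min((x - a) / (b - a), (c - x) / (c - b)), 0.0)


def centroid(xs, mu):
    """Discrete centroid: sum(x * mu(x)) / sum(mu(x)) over a uniform grid."""
    total = sum(mu)
    if total == 0:
        raise ValueError("aggregated fuzzy set is empty")
    return sum(x * m for x, m in zip(xs, mu)) / total


def mamdani_output(firing, output_sets, xs):
    """Truncate each output set at its firing strength (min), merge (max), defuzzify."""
    aggregated = [
        max(min(strength, output_sets[label](x)) for label, strength in firing.items())
        for x in xs
    ]
    return centroid(xs, aggregated)


# Hypothetical output sets on [0, 1]; the study's actual shapes are in its Table 3.
output_sets = {
    "negative": lambda x: trimf(x, -0.5, 0.0, 0.5),
    "possible": lambda x: trimf(x, 0.0, 0.5, 1.0),
    "positive": lambda x: trimf(x, 0.5, 1.0, 1.5),
}
xs = [i / 200 for i in range(201)]  # uniform grid on [0, 1]

# Firing strengths set equal to the rule weights, purely for illustration.
symmetric = mamdani_output(
    {"negative": 0.2, "possible": 0.6, "positive": 0.2}, output_sets, xs
)
leaning_positive = mamdani_output({"possible": 0.2, "positive": 0.8}, output_sets, xs)
```

With these symmetric output sets, the balanced case defuzzifies to the midpoint of the range, and shifting weight towards ‘positive’ moves the centroid upward.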

### k-Nearest neighbour analysis

k-Nearest neighbour analysis was carried out on unweighted Euclidean distances of autoscaled binary, five-point and fuzzified data with the SCAN program. The value of k used in all cases was determined by cross-validation. Samples were initially assigned to 18 classes (11 Lactobacillus sp., two Lactococcus subsp., one for all three Pediococcus sp., one for Oenococcus, one for Leuconostoc oenos, one for Streptococcus thermophilus and one for all the unknowns).
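Autoscaling followed by nearest-neighbour assignment on unweighted Euclidean distance can be sketched as below. This is not the SCAN program's implementation: the data are hypothetical, autoscaling is assumed to mean centring each column and dividing by its sample standard deviation, and only the k = 1 case is shown.

```python
import math


def autoscale(rows):
    """Centre each column to mean 0 and scale it to unit standard deviation."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    # Sample standard deviation; constant columns get a divisor of 1.
    sds = [
        math.sqrt(sum((v - m) ** 2 for v in c) / (n - 1)) or 1.0
        for c, m in zip(cols, means)
    ]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def euclidean(a, b):
    """Unweighted Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbour(query, training, labels):
    """1-NN: return the label of the training row closest to the query."""
    best = min(range(len(training)), key=lambda i: euclidean(query, training[i]))
    return labels[best]


# Hypothetical five-point scores on three carbon sources.
scores = [
    [4, 0, 3],  # class "A"
    [4, 1, 3],  # class "A"
    [0, 4, 1],  # class "B"
    [1, 4, 0],  # class "B"
    [3, 0, 4],  # unlabelled isolate
]
scaled = autoscale(scores)
predicted = nearest_neighbour(scaled[4], scaled[:4], ["A", "A", "B", "B"])
```

Columns with no variation (such as the 12 carbon sources that no organism fermented) carry no information for the distance, which is why the sketch guards against a zero standard deviation.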

### k-Means clustering

Non-hierarchical clustering was carried out on binary, five-point and fuzzified data using MacQueen’s algorithm and Euclidean distances of autoscaled data in the SCAN program. The initial partition (starting number of clusters) was specified as 18.
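A simplified sketch of a MacQueen-style k-means pass is given below. It assumes the common description of the algorithm (seed with the first k points, update a centroid each time a point joins it, then reassign every point once); the SCAN program's exact variant and seeding are not described in the text, and the data are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def macqueen_kmeans(points, k):
    """Single-pass MacQueen-style k-means followed by one final reassignment."""
    # Seed the k centroids with the first k points.
    centroids = [list(p) for p in points[:k]]
    counts = [1] * k
    for p in points[k:]:
        j = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
        counts[j] += 1
        # Incremental mean update: move the centroid towards the new member.
        centroids[j] = [c + (x - c) / counts[j] for c, x in zip(centroids[j], p)]
    # Final pass: assign every point to its nearest (now fixed) centroid.
    labels = [min(range(k), key=lambda i: sq_dist(p, centroids[i])) for p in points]
    return labels, centroids


# Hypothetical two-dimensional data with two obvious groups.
points = [[0, 0], [10, 10], [0, 1], [10, 11], [1, 0], [11, 10]]
labels, centroids = macqueen_kmeans(points, k=2)
```

Each sample ends with membership in exactly one cluster, which is the hard partition that fuzzy clustering relaxes.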

### Fuzzy clustering

The fuzzified results were clustered via Bezdek’s FCM algorithm (Bezdek 1981) using several different values of the weighting exponent m (1·25, 1·3, 1·4 and 1·7) with the MATLAB® numeric computing environment and the MATLAB Fuzzy Logic Toolbox.
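The alternating structure of the FCM algorithm (update memberships from the current centres, then update centres as membership-weighted means) can be sketched as follows. This is a bare-bones illustration assuming the standard update equations, a fixed iteration count instead of a convergence test, and user-supplied starting centres; it is not the MATLAB toolbox routine, and the data are hypothetical.

```python
def fuzzy_c_means(points, centres, m, iterations=50):
    """Alternate the FCM membership and centre updates for a fixed number of passes.

    Returns the memberships computed in the final pass and the final centres.
    """
    c, dim = len(centres), len(points[0])
    centres = [list(v) for v in centres]
    power = 2.0 / (m - 1.0)
    u = []
    for _ in range(iterations):
        # Membership update from the distances to the current centres.
        u = []
        for p in points:
            d = [sum((x - v) ** 2 for x, v in zip(p, ctr)) ** 0.5 for ctr in centres]
            if 0.0 in d:
                hits = d.count(0.0)
                u.append([1.0 / hits if di == 0.0 else 0.0 for di in d])
            else:
                u.append([1.0 / sum((di / dj) ** power for dj in d) for di in d])
        # Centre update: mean of the points weighted by membership ** m.
        for i in range(c):
            w = [row[i] ** m for row in u]
            centres[i] = [
                sum(wk * p[j] for wk, p in zip(w, points)) / sum(w) for j in range(dim)
            ]
    return u, centres


# Hypothetical one-dimensional data with two groups, near 0.1 and 1.1.
points = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
memberships, centres = fuzzy_c_means(points, centres=[[0.0], [1.0]], m=1.25)
hard_labels = [row.index(max(row)) for row in memberships]
```

Rerunning with a larger m (for example 1·7) leaves the hard labels unchanged for data this well separated but lowers the maximum membership values.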

## RESULTS

Tables S1–S6 are supplementary material viewable only in the online edition. Of the 49 carbon sources used, 12 were not fermented by any of the bacteria in this study. Those carbon sources were erythritol, D-arabinose, L-xylose, adonitol, inositol, inulin, glycogen, xylitol, D-lyxose, D-fucose, L-fucose and 2-keto-gluconate. D-arabitol, L-arabitol, glycerol and rhamnose were fermented by only a few of the unknown isolates (RKs) and these appeared to be useful characters for their discrimination.

The repeatability of the API tests appeared to be good, with an overall agreement of 83% between replications.

Principal components analysis revealed that 14 components explained 90% of the variation in the data set (including both known and unknown strains). Several two-way combinations of the 14 components were plotted; Fig. 1 is a representation of the data space as seen along the plane formed by the first two components. A number of clusters of samples can be seen. Some isolated samples were detected; the two Lactobacillus rhamnosus samples coincided with each other but lay far outside the other samples and are off-scale in Fig. 1. The nearby samples in this two-dimensional space did not appear to be taxonomically similar. The additional plots revealed little additional information as to an inherent number of clusters in the data.
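The NIPALS procedure named in the methods extracts components one at a time by alternating between a score vector and a loading vector, then deflating the data matrix. The sketch below assumes the textbook form of the iteration and a column-centred (or autoscaled) input; it is not the SCAN implementation, and the tiny data set is hypothetical.

```python
def nipals_pca(X, n_components, tol=1e-10, max_iter=500):
    """Extract principal components one at a time by the NIPALS iteration.

    X must already be column-centred; it should have rank >= n_components.
    """
    X = [list(row) for row in X]  # work on a copy, which is deflated below
    n, p = len(X), len(X[0])
    scores, loadings = [], []
    for _ in range(n_components):
        # Start from the column with the largest sum of squares.
        col = max(range(p), key=lambda j: sum(row[j] ** 2 for row in X))
        t = [row[col] for row in X]
        for _ in range(max_iter):
            tt = sum(v * v for v in t)
            # Loading: regress the columns of X on the score vector, then normalise.
            load = [sum(X[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
            norm = sum(v * v for v in load) ** 0.5
            load = [v / norm for v in load]
            # Score: project the rows of X onto the loading.
            t_new = [sum(X[i][j] * load[j] for j in range(p)) for i in range(n)]
            change = sum((a - b) ** 2 for a, b in zip(t_new, t))
            converged = change < tol * sum(v * v for v in t_new)
            t = t_new
            if converged:
                break
        # Deflate: remove the fitted component before extracting the next one.
        for i in range(n):
            for j in range(p):
                X[i][j] -= t[i] * load[j]
        scores.append(t)
        loadings.append(load)
    return scores, loadings


# Hypothetical centred data lying exactly on the line y = x.
X = [[-1.0, -1.0], [0.0, 0.0], [1.0, 1.0]]
scores, loadings = nipals_pca(X, n_components=1)
explained = sum(t * t for t in scores[0]) / sum(v * v for row in X for v in row)
```

The fraction of variation explained by a component is the sum of its squared scores divided by the total sum of squares, which is the quantity accumulated to reach a figure such as 90%.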

Fuzzification is intended to represent a linguistic concept mathematically. For example, ‘approximately some number’ is typically represented by a triangular or Gaussian function centred about the number, while ‘between two numbers’ is represented by a trapezoid function. The effect of the fuzzification in this study was to transform each 0–4 integer scaled result into one of five fractional values (0·117, 0·342, 0·529, 0·632 and 0·780, respectively), which were not linearly spaced.

k-Nearest neighbour analysis was performed with the data in all three formats. In each case the specified initial assignments of the samples to classes was the same, and cross-validation was used to determine the optimal number for k. In all three cases the best k-value was determined to be 1. The results for one of these analyses, based on the five-point scale data, are shown in Table S1. The samples were assigned to a total of 16 clusters (no samples were assigned to clusters 15 or 17). Table S1 indicates the clusters to which particular samples were assigned and shows if the replicates of a known sample were placed in the same cluster (indicated by a 2 or, in a few cases, a 3). It is also possible to see whether the samples of a known species, or of a genus, were exclusive to a cluster.

In most cases the replicates of samples were assigned to the same clusters, and many of these were exclusive of known samples of other species or genera. There was some overlap between the Lact. casei and Lact. paracasei samples and between Lact. fermentum and Lact. plantarum samples. The majority of the Lactococcus lactis strains (including subsp. cremoris and subsp. lactis) were assigned to cluster 12 along with one of the Lact. brevis samples. The remaining five Lactococcus samples were assigned to clusters 11 and 13. The Oenococcus oeni samples and the Leuc. oenos samples (Leuc. oenos is now considered to be O. oeni; Dicks et al. 1995; Stiles and Holzapfel 1997) were all assigned to cluster 14, without any other samples. The three Pediococcus species were assigned to cluster 16, also without any other known samples. The two Strep. thermophilus samples were assigned to different clusters and both overlapped other genera.

k-Means clustering was carried out with a specification of 18 final clusters. The results obtained with the five-point-scaled data are shown in Table S2. The general pattern of results bears some similarity to the results obtained with KNN. The Pediococcus samples were again grouped together and did not share their cluster with any other known samples. Lactobacillus fermentum and Lact. plantarum again overlapped and, in this case, there was complete overlap between the Lact. casei and Lact. paracasei samples. The Oenococcus samples were all in cluster 1 along with the Lact. bulgaricus samples. The Lc. lactis subsp. cremoris and subsp. lactis samples were spread over six clusters, with some tendency to separate the two subspecies.

With FCM both the number of clusters to use and the value for the weighting exponent m must be specified. As with KMC, the number of clusters was set at 18. Results are expressed in terms of the membership value of a sample (between 0 and 1) in each cluster. The cluster in which a sample has the highest membership value can be considered to be its most likely assignment, although it actually has partial membership in additional clusters. Results were obtained for m-values of 1·25, 1·3, 1·4 and 1·7; these are shown in Tables S3–S6, respectively.
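Taking the cluster with the highest membership value as a sample's assignment is a one-line operation; an optional cut-off can be layered on top. The sketch below is illustrative only: the function name, the cut-off argument and the membership vectors are hypothetical.

```python
def assign_cluster(memberships, cutoff=None):
    """Return the index of the highest-membership cluster, or None if below cutoff."""
    best = max(range(len(memberships)), key=lambda i: memberships[i])
    if cutoff is not None and memberships[best] < cutoff:
        return None
    return best


# Hypothetical membership vectors for two samples over three clusters.
clear = [0.7, 0.2, 0.1]
diffuse = [0.4, 0.35, 0.25]

clear_label = assign_cluster(clear)
diffuse_label = assign_cluster(diffuse)
diffuse_with_cutoff = assign_cluster(diffuse, cutoff=0.5)
mean_membership = sum(diffuse) / len(diffuse)  # equals 1/c because memberships sum to 1
```

Because a sample's memberships sum to 1, its mean membership over c clusters is always 1/c, so a fixed cut-off becomes harder to reach as the number of clusters grows.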

The results obtained with m=1·25 are shown in Table S3. Many of the same patterns previously seen were repeated here. The Oenococcus samples were all clustered together, as were the Lact. casei and Lact. paracasei samples. There was again some overlap between Lact. fermentum and Lact. plantarum. Unlike the results with KNN and KMC, the Pediococcus samples were spread over two clusters, both of which were free of any other known samples. The Lactococcus samples were found in only three clusters (1, 9 and 11); one of these also contained Lact. helveticus, as was the case with KMC. Lactobacillus gasseri was placed in the same cluster as one of the Lc. lactis samples.

The m=1·3 FCM results are shown in Table S4 and were, in general, similar to those for m=1·25. The Lactococcus samples were spread over four clusters, one of which contained only subsp. lactis samples.

The results of FCM with m=1·4 (Table S5) were fairly similar to the m=1·3 results.

The m=1·7 results (Table S6) were the only FCM treatment that placed all three Pediococcus species in the same cluster, as did KNN and KMC. The Oenococcus samples were all grouped together, this time along with some Lact. plantarum samples. Lactobacillus casei and Lact. paracasei were once again in the same cluster. The Lactococcus species were mostly placed in cluster 15, with a few samples in four other clusters.

A summary of the performance of all the methods used, including the three data treatments employed with KNN and KMC, and the four m-values for FCM, is shown in Table 4. Some performance comparisons are shown in Figs 2 and 3.

Table 4.   Summary of classifications

The assignments of the unknown isolates by each method are summarized in Table 5. Where isolates were assigned to the same cluster as a reference strain, that is indicated. When isolates were placed in clusters to which no reference strain was assigned, the number of the cluster is indicated.

Table 5.   Summary of unknown classification results

## DISCUSSION

Most of the classification methods showed a consistent overlap between some of the Lact. fermentum and Lact. plantarum strains. These species are thought to be somewhat dissimilar, as the former is an obligate heterofermenter and the latter a facultative heterofermenter (Stiles and Holzapfel 1997).

In almost all classifications in this study, Lact. casei and Lact. paracasei were assigned to the same cluster. This is not surprising in view of the fact that both are facultative heterofermenters and considered to be particularly closely related (Stiles and Holzapfel 1997).

All three Pediococcus species were assigned to the same cluster with KNN, KMC and the loosest FCM method (with m=1·7). With the other three FCM variants, however, Pediococcus pentosaceus was always assigned to a different cluster from the other two species.

The two replicates of Strep. thermophilus were assigned to different classes with all the methods used. This presumably indicates that the results obtained with the replicates were fairly dissimilar in some respect. In most cases one or both of the replicates was assigned to a cluster with other genera. This outcome could be related to the difficulty in assigning this organism in a taxonomic structure (Stiles and Holzapfel 1997).

The three scoring approaches to the data (see Table 4) had relatively little effect on the outcome of KNN results. In all cases samples were assigned to 16 final clusters. There appeared to be some advantage to representing the data on a five-point rather than a binary scale (fewer replicates were assigned to the same cluster with the binary format data); this indicates that intermediate responses are indeed helpful in making classifications. There were no differences between five-point-scaled and fuzzified data, indicating that fuzzification is in no way beneficial for KNN clustering.

The results obtained with KMC (Table 4) were generally similar to those from KNN, except that 18 rather than 16 clusters resulted. For binary-scaled data, KMC was more effective than KNN at placing replicates in the same cluster, even though placing replicates together should be easier when there are fewer clusters (Fig. 2). In terms of placing replicates in the same cluster, KMC with five-point or fuzzified scaling was not much different from KMC with binary scaling. The KMC genus-level classification was crisper (genera were spread over fewer clusters) with five-point than with binary scaling and was most crisp using the fuzzified data.
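The two comparison criteria used above (how many replicate pairs end up in the same cluster, and how many clusters each genus is spread over) can be expressed as a short sketch. The sample names, cluster numbers and function names below are hypothetical and are not taken from Table 4; they only illustrate how such counts could be computed.

```python
from collections import defaultdict

def replicates_together(assignments, replicate_pairs):
    """Count replicate pairs whose two members share a cluster label."""
    return sum(assignments[a] == assignments[b] for a, b in replicate_pairs)

def genus_spread(assignments, genus_of):
    """Number of distinct clusters each genus occupies (lower means crisper)."""
    clusters = defaultdict(set)
    for sample, cluster in assignments.items():
        clusters[genus_of[sample]].add(cluster)
    return {genus: len(found) for genus, found in clusters.items()}

# Hypothetical toy assignments: sample name -> cluster number.
assignments = {
    "Ped_a_1": 3, "Ped_a_2": 3,   # replicates placed together
    "Ped_b_1": 3, "Ped_b_2": 7,   # replicates split across clusters
    "Lc_a_1": 1, "Lc_a_2": 1,     # replicates placed together
}
genus_of = {
    "Ped_a_1": "Pediococcus", "Ped_a_2": "Pediococcus",
    "Ped_b_1": "Pediococcus", "Ped_b_2": "Pediococcus",
    "Lc_a_1": "Lactococcus", "Lc_a_2": "Lactococcus",
}
replicate_pairs = [("Ped_a_1", "Ped_a_2"), ("Ped_b_1", "Ped_b_2"), ("Lc_a_1", "Lc_a_2")]

pairs_together = replicates_together(assignments, replicate_pairs)
spread = genus_spread(assignments, genus_of)
print(pairs_together, spread)
```

A method that scores higher on the first count and lower on the second would be judged better under the criteria discussed here.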

The FCM algorithm was applied only to the fuzzified data, but this was done using several different values of the exponent m, representing different degrees of fuzziness. The lowest value, m=1·25, is the closest to a hard partition (where m=1, which is equivalent to KMC) and this resulted in the highest membership values (see Fig. 3). Fuzzy c-means clustering resulted in slightly higher numbers of clusters containing replicates than did KNN or KMC. It also produced quite crisp classification of members of genera (Table 4). While an arbitrarily chosen membership value could be used as a requirement for classification, one should keep in mind that the number of clusters, c, also impacts these values. The average membership value tends toward 1/c. The fact that taking the maximum membership value rather than using some arbitrarily determined cut-off (e.g. a membership value of 0·5) produced results that were, in general, similar to those of other classification schemes indicates that this additional rigour is not beneficial.
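The effect of the exponent m on membership values can be illustrated with the standard fuzzy c-means membership formula (Bezdek 1981), u_ik = 1 / Σ_j (d_ik / d_ij)^(2/(m−1)). The following is a minimal sketch with NumPy, not the implementation used in this study; the cluster centres and the sample point are hypothetical toy values. Because the memberships of each sample sum to 1 over the c clusters, their mean is 1/c, and the maximum membership moves toward that value as m grows.

```python
import numpy as np

def fcm_memberships(points, centers, m):
    """Fuzzy c-means membership update for fixed cluster centres (Euclidean norm)."""
    # Distance from every point to every centre: shape (n_points, c)
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    d = np.maximum(d, 1e-12)  # avoid division by zero when a point sits on a centre
    # u_ik is proportional to d_ik^(-2/(m-1)), normalised over the c clusters
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

# Hypothetical toy data: three centres and one point nearest the first centre.
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
point = np.array([[1.0, 0.5]])

max_membership = {}
for m in (1.25, 1.3, 1.4, 1.7, 8.0):
    u = fcm_memberships(point, centers, m)[0]
    max_membership[m] = u.max()
    # Hardening by maximum membership: the sample goes to the argmax cluster.
    print(f"m={m}: max membership {u.max():.3f}, assigned cluster {u.argmax()}")
```

In this toy case the cluster chosen by maximum membership is the same for every m, while the membership value itself falls as m increases, which is why a fixed cut-off such as 0·5 behaves differently at different m and c.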

The assignments of the unknown isolates to clusters by the various methods are summarized in Table 5. The most interesting pattern to look for here is whether particular unknowns are consistently assigned to the same clusters as reference samples. Many of the assignments were similar with the various procedures. RK1, for example, was assigned together with Lact. casei and paracasei by KNN, KMC and FCM with m=1·7. It was assigned together with Lact. rhamnosus by all the FCM procedures. RK5 and RK15 were grouped with Lact. casei/paracasei by KMC and, along with many other unknowns, with Lact. brevis by KNN; these samples were not grouped with any of the knowns by any FCM variant. With the exception of KNN, most of the other situations in which unknowns were grouped together with knowns had similar patterns, which is encouraging. RK2 was grouped with Lact. brevis or fermentum/plantarum. RK3, RK9 and RK10 were grouped either with two or all three of the Pediococcus species. RK4, where it matched knowns, was grouped with Lact. rhamnosus. RK19 was classed as Lact. brevis by four of the methods. The other unknowns that were classified were largely inconsistent between methods. This may be because they are, in fact, different species that somewhat resemble some of the known samples.

Some of the unknown samples were not assigned to the same clusters as knowns (Table 5), but were consistently grouped together (although not necessarily in a cluster with the same number under the different procedures). Presumably these samples are closely related to the other samples in the cluster, but are not represented by any of the knowns. RK6–RK8, RK11–RK13 and RK16–RK18 were nearly always assigned to the same cluster.

Another point of interest in this study was the fact that data were assumed to form spherical clusters and, accordingly, all methods used employed the Euclidean norm. However, complete fuzzification set in at rather low values of m (by comparison, Bezdek (1981) suggested 2 while Foody (1996) used 8), which may suggest different cluster shapes. Several fuzzy algorithms that take cluster shape into consideration exist (Gustafson and Kessel 1979; Chiu 1994; Yager and Filev 1994). However, implementing them was beyond the scope of this work.
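To make the point about cluster shape concrete, the sketch below contrasts the squared Euclidean norm with a covariance-scaled norm of the kind used by Gustafson and Kessel (1979), where the norm matrix is A = det(F)^(1/n) F⁻¹ for a cluster covariance F (volume constraint taken as 1). This is an illustration under assumed toy values only; no such method was applied in this study, and the covariance matrix and points are hypothetical.

```python
import numpy as np

def squared_distance(x, center, norm_matrix=None):
    """Squared distance (x - v)^T A (x - v); A defaults to the identity (Euclidean)."""
    diff = x - center
    if norm_matrix is None:
        return float(diff @ diff)
    return float(diff @ norm_matrix @ diff)

def gk_norm_matrix(cov):
    """Gustafson-Kessel style norm matrix with unit volume: det(F)^(1/n) * F^-1."""
    n = cov.shape[0]
    return np.linalg.det(cov) ** (1.0 / n) * np.linalg.inv(cov)

# Hypothetical elongated cluster: variance 9 along the first axis, 1 along the second.
center = np.array([0.0, 0.0])
cov = np.array([[9.0, 0.0], [0.0, 1.0]])
A = gk_norm_matrix(cov)

along = np.array([3.0, 0.0])    # lies along the long axis of the cluster
across = np.array([0.0, 3.0])   # lies across the short axis

print(squared_distance(along, center), squared_distance(across, center))
print(squared_distance(along, center, A), squared_distance(across, center, A))
```

Under the Euclidean norm both points are equally far from the centre, whereas the covariance-scaled norm treats the point along the long axis as much closer, which is how shape-aware algorithms accommodate non-spherical clusters.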

Overall, of the approaches tested, FCM with m-values in the 1·3–1·4 range performed the best. Five-point rather than binary scaling improved both KNN and KMC performance. Fuzzification was a further benefit to KMC, but not KNN.

Fuzzy clustering performed better in classifying and identifying LAB than the other methods, apparently because it is better able to represent the nature of the response pattern. This will probably also be the case with other micro-organisms, especially those that exhibit variable response patterns.

## Acknowledgements

This material is based upon work supported by the Cooperative State Research, Education and Extension Service, US Department of Agriculture under Project NYG 623–496.

## SUPPLEMENTARY MATERIAL

Tables S1–S6 are available on the web at http://www.blackwell-science.com/products/journals/JAM/JAM1370/JAM1370sm.htm

Table S1. KNN results of 5-point scaled data

Table S2. KMC results of 5-point scaled data

Table S3. FCM results with m=1·25

Table S4. FCM results with m=1·3

Table S5. FCM results with m=1·4

Table S6. FCM results with m=1·7


## Ancillary

### Article Information

#### DOI

10.1046/j.1365-2672.2001.01370.x

### References

• Bezdek, J.C. (1981) Pattern Recognition with Fuzzy Objective Function Algorithms. New York: Plenum Press.
• Borch, E. & Molin, G. (1988) Numerical taxonomy of psychrotrophic lactic acid bacteria from prepacked meat and meat products. Antonie Van Leeuwenhoek 54, 301–323.
• Chiu, S.L. (1994) Fuzzy model identification based on cluster estimation. Journal of Intelligent and Fuzzy Systems 2, 267–278.
• Decallonne, J., Delmee, M., Wauthoz, P., El Lioui, M., Lambert, R. (1991) A rapid procedure for the identification of lactic acid bacteria based on the gas chromatographic analysis of the cellular fatty acids. Journal of Food Protection 54, 217–224.
• Dicks, L.M.T., Dellaglio, F., Collins, M.D. (1995) Proposal to reclassify Leuconostoc oenos as Oenococcus oeni (corrig.) gen. nov., comb. nov. International Journal of Systematic Bacteriology 45, 395–397.
• Döring, B., Ehrhardt, S., Lücke, F.K., Schillinger, U. (1988) Computer-assisted identification of lactic acid bacteria from meats. Systematic and Applied Microbiology 11, 67–74.
• Dykes, G.A., Britz, T.J., Von Holy, A. (1994) Numerical taxonomy and identification of lactic acid bacteria from spoiled, vacuum-packaged Vienna sausages. Journal of Applied Bacteriology 76, 246–252.
• Equiha, M. (1990) Fuzzy clustering of ecological data. Journal of Ecology 78, 519–534.
• Foody, G.M. (1996) Fuzzy modelling of vegetation from remotely sensed imagery. Ecological Modeling 85, 3–12.
• Friedrichs, M., Franzle, O., Salski, A. (1996) Fuzzy clustering of existing chemicals according to their ecotoxicological properties. Ecological Modeling 85, 27–40.
• Gatti, M., Fornasari, M.E., De Vecchi, P., Bonvini, B., Neviani, E. (1999) Protein pattern profile of the cell-wall surface in Lactobacillus helveticus strains. Scienza e Tecnica Lattiero Casearia 50, 431–441.
• Gustafson, D.E. & Kessel, W. (1979) Fuzzy clustering with a fuzzy covariance matrix. Proceedings of the IEEE-CDC 2, 761–766.
• Heyndrickx, M., Vauterin, L., Vandamme, P., Kersters, K., De Vos, P. (1996) Applicability of combined amplified ribosomal DNA restriction analysis (ARDRA) patterns in bacterial phylogeny and taxonomy. Journal of Microbiological Methods 26, 247–259. DOI: 10.1016/0167-7012(96)00916-5
• Jang, J.-S.R. & Gulley, N. (1995) Fuzzy Logic Toolbox for Use with MATLAB. Natick, MA: The Math Works.
• Klein, G., Pack, A., Bonaparte, C., Reuter, G., Holzapfel, W.H., Huis-in′-t-Veld, J.H.J., Persin, C., Kasper, H. (1998) Taxonomy and physiology of probiotic lactic acid bacteria. International Journal of Food Microbiology 41, 103–125. DOI: 10.1016/s0168-1605(98)00049-x
• Lee, C.Y., Fung, D.Y.C., Kastner, C.L. (1982) Computer-assisted identification of bacteria on hot-boned and conventionally processed beef. Journal of Food Science 47, 363–367.
• MacFie, H.J.H., Gutteridge, C.S., Norris, J.R. (1978) Use of canonical variates analysis in differentiation of bacteria by pyrolysis gas liquid chromatography. Journal of General Microbiology 104, 67–74.
• Mamdani, E.H. & Assilian, S. (1975) An experiment in linguistic synthesis with a fuzzy logic controller. International Journal of Man-Machine Studies 7, 1–13.