Comments: When more than one gel needs to be analyzed and compared, for example using the GelCompar computer program, the details of this protocol should be followed carefully or standardized in other ways. It is important to use at least three marker lanes in a 30-lane gel, one at each end and one in the middle. This allows for correction of possible "smiling" effects.

Experimental protocol

1. In a 500 ml bottle add 3.75 g agarose to 250 ml 0.5 x TAE and dissolve the agarose by microwaving.

2. Pour the agarose (50-70 °C) into the gel tray and insert the comb.

3. After the gel has completely hardened, carefully remove the comb and fill the wells with 0.5 x TAE.

4. Mix 6 µl of PCR sample with 1.2 µl loading buffer on a piece of Parafilm and load the gel; load 2 µl of 1 kb DNA size ladder, mixed with 4 µl ddH2O and 1.2 µl loading buffer, into the terminal wells and into a well in the middle.

5. Run the gel in the cold room for 18-19 hours at 70 V, constant voltage, 25-30 mA. This corresponds to 2 V/cm, measured over the distance between the electrodes.

6. Stain the gel for 30 min in an ethidium bromide solution of 0.6 µg/ml in 0.5 x TAE (60 µl of a 10 mg/ml stock solution in 1 liter 0.5 x TAE), and destain for 30 min in 0.5 x TAE.

Comment: Be careful and wear gloves at all times when you might touch the agarose gel. Ethidium bromide is a very powerful mutagen.

7. Visualize the bands on the gel under ultraviolet light and take a photograph (see Fig. 3A) or capture the image by a video camera and print (or store as a TIFF file).

Comment: Caution: ultraviolet light is dangerous, particularly to the eyes. To minimize exposure, wear a safety mask that efficiently blocks ultraviolet light.
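The quantities quoted in the protocol can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration, not part of the original protocol. It assumes that the ethidium bromide stock volume is 60 µl and that the water added to the size ladder is 4 µl, which are the readings under which the stated concentrations and loading volumes are self-consistent. All function and variable names are hypothetical.

```python
# Cross-check of the protocol quantities (illustrative sketch only).
# Assumptions: 60 ul of ethidium bromide stock and 4 ul of ddH2O for the ladder.

def percent_w_v(mass_g, volume_ml):
    """Concentration in % (w/v): grams of solute per 100 ml of solution."""
    return 100.0 * mass_g / volume_ml

def diluted_ug_per_ml(stock_mg_per_ml, stock_volume_ul, final_volume_ml):
    """Final concentration (ug/ml) after diluting a stock into a final volume."""
    mass_ug = stock_mg_per_ml * stock_volume_ul  # mg/ml equals ug/ul, so this is ug
    return mass_ug / final_volume_ml

def electrode_distance_cm(voltage_v, field_v_per_cm):
    """Electrode distance implied by a voltage and a field strength."""
    return voltage_v / field_v_per_cm

agarose_percent = percent_w_v(3.75, 250)         # 1.5 % (w/v) agarose
etbr_ug_per_ml = diluted_ug_per_ml(10, 60, 1000)  # 0.6 ug/ml staining solution
distance_cm = electrode_distance_cm(70, 2)        # 35 cm between the electrodes

sample_load_ul = 6 + 1.2      # PCR sample + loading buffer
marker_load_ul = 2 + 4 + 1.2  # ladder + ddH2O + loading buffer

print(agarose_percent, etbr_ug_per_ml, distance_cm)
print(sample_load_ul, marker_load_ul)
```

Under these assumptions the sample and marker lanes receive the same total volume, so all wells are loaded equally.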


When many or diverse rep-PCR fingerprints need to be compared, computer assistance becomes essential. When the quality of the raw data is high, computer-assisted analysis generally generates high-quality data as well. Therefore, care should be taken to produce high-quality primary data. Many hardware and software combinations are available for fragment (pattern) analysis. In our laboratory, two commercial software packages have been extensively applied to the analysis of rep-PCR fingerprints, namely the AMBIS system (Scanalytics, Waltham, MA, USA) and GelCompar (Applied Maths, Kortrijk, Belgium; Vauterin and Vauterin, 1992). Computer programs such as AMBIS and GelCompar normalize fingerprints according to intra-gel size standards (Versalovic et al. 1994; Rossbach et al. 1995; de Bruijn et al. 1996a; 1996b; Schneider and de Bruijn 1996). In this section, we will first discuss some of the general parameters of computer-assisted (phylogenetic) analysis and then focus on the use of the GelCompar system for the analysis of rep-PCR generated genomic fingerprints (Louws et al. 1996; Schneider and de Bruijn 1996; de Bruijn et al. 1996a; 1996b).

General parameters; band or curve based characterization of fingerprints.

Cluster analysis of a collection of genomic fingerprints obtained by rep-PCR can be carried out in different ways. The input of a clustering method is a proximity or resemblance matrix; the output is a dendrogram (Jardine and Sibson 1971) or a 3D presentation (PCA; Hope 1968; Cooley and Lohnes 1971). Proximities can be described by a broad array of coefficients, each comparing one or more features of the fingerprints and resulting in similarity or dissimilarity units. Fingerprints can in general be analysed using a band-based or a curve-based approach. Bands can be used to characterize a well-defined fingerprint of low complexity as an array of peak positions alone, or combined with the height or area of each peak. Using a band-based method, a collection of such fingerprints can be described as a matrix of binary variables: band present 1, band absent 0. Bands can be assigned by hand (Woods et al. 1992; Judd et al. 1993; Reboli et al. 1994; Versalovic and Lupski, 1995; Koeth et al. 1995) or by a computer program, according to preset band searching settings (see Fig. 4; Versalovic et al. 1994; de Bruijn et al. 1996a; Schneider and de Bruijn 1996).
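The band-based description can be illustrated with a minimal sketch. It is not the band-searching procedure of any particular program. The band positions (in bp), the list of band classes and the position tolerance are all hypothetical values chosen for the example.

```python
# Minimal sketch of band-based scoring: band present 1, band absent 0.
# Band positions, band classes and the tolerance are hypothetical values.

def band_matrix(lanes, band_classes, tolerance):
    """Return one 0/1 row per lane.

    An entry is 1 if the lane contains a band within `tolerance` of the
    band class position, and 0 otherwise.
    """
    matrix = []
    for bands in lanes:
        row = [1 if any(abs(b - c) <= tolerance for b in bands) else 0
               for c in band_classes]
        matrix.append(row)
    return matrix

lanes = [
    [300, 552, 1198],   # lane 1: estimated band sizes in bp
    [305, 1200, 2010],  # lane 2
    [550, 2000],        # lane 3
]
band_classes = [300, 550, 1200, 2000]

matrix = band_matrix(lanes, band_classes, tolerance=15)
for row in matrix:
    print(row)
# [1, 1, 1, 0]
# [1, 0, 1, 1]
# [0, 1, 0, 1]
```

The choice of tolerance decides whether two nearby bands count as the same band, which is one place where subjectivity enters band-based analysis.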

Manual as well as band-based computer-assisted analysis methods often require tedious and laborious band assignment or checking steps and can be subjective in nature (see Fig. 4). The information contained in fingerprints of high complexity, such as rep-PCR genomic fingerprint patterns, is captured not only in the number and position of peaks but also in the different ratios of peak heights and areas (Fig. 3, 4, 5, 6). Therefore, a binary system is not sufficient to describe these highly complex fingerprint patterns. Preferably, these fingerprint patterns are analyzed using a curve-based protocol. The full complexity of rep-PCR genomic fingerprints can only be characterized by the densitometric curves, described as an array of densitometric values (J.L.W. Rademaker, F.J. Louws, U. Rossbach and F.J. de Bruijn, unpublished results). The product-moment correlation coefficient (see below) allows for the direct comparison of these whole densitometric curves.

- Proximity coefficients.

The analysis of rep-PCR genomic fingerprints generally requires a simplification of the original data, which are then used to calculate a proximity matrix. This calculation can be based on either dissimilarity or similarity criteria (see Fig. 5). These (dis)similarities can be established using a wide array of coefficients.

The band-based comparison using the similarity coefficient defined by Jaccard (1908) is based solely on the presence of a band at a given position, scored as a binary variable. The coefficient derived by Dice (1945) also uses the band position, but gives more weight to matching bands. A more sophisticated "area-sensitive" similarity coefficient (GelCompar 4.0) takes into account the correspondence of bands, expressed as in the coefficient of Jaccard, as well as the differences in the relative areas under each of the corresponding bands. The product-moment or Pearson correlation coefficient is applied to the array of densitometric values formed by the fingerprint (see Fig. 4, 5A). The product-moment correlation coefficient is a more robust and objective coefficient, since whole curves are compared and subjective band-scoring is omitted. The product-moment correlation coefficient is independent of the relative concentrations of fingerprints and fairly insensitive to differences in background. More complex patterns, such as rep-PCR genomic fingerprints, benefit especially from these characteristics.
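The three coefficients can be sketched as follows, using their standard textbook definitions. This is an illustration under stated assumptions, not the GelCompar implementation. The binary band vectors and the densitometric values are hypothetical, and the area-sensitive coefficient is not shown because its exact formula is not given here.

```python
# Textbook definitions of three proximity coefficients (illustrative sketch).
from math import sqrt

def jaccard(a, b):
    """Shared bands divided by bands present in at least one fingerprint."""
    shared = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return shared / either if either else 0.0  # 0.0 is a convention for empty lanes

def dice(a, b):
    """Like Jaccard, but shared bands are counted twice."""
    shared = sum(1 for x, y in zip(a, b) if x and y)
    total = sum(a) + sum(b)
    return 2 * shared / total if total else 0.0  # same convention as above

def pearson(u, v):
    """Product-moment correlation of two densitometric curves."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((x - mean_u) * (y - mean_v) for x, y in zip(u, v))
    norm_u = sqrt(sum((x - mean_u) ** 2 for x in u))
    norm_v = sqrt(sum((y - mean_v) ** 2 for y in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("correlation is undefined for a flat curve")
    return cov / (norm_u * norm_v)

lane_a = [1, 1, 1, 0]  # hypothetical band scores, band present 1, absent 0
lane_b = [1, 0, 1, 1]
print(jaccard(lane_a, lane_b))  # 2 shared bands out of 4 scored: 0.5
print(dice(lane_a, lane_b))     # 2*2 / (3+3), about 0.67

curve = [0, 2, 8, 3, 1, 6, 2, 0]       # hypothetical densitometric values
brighter = [2 * x + 5 for x in curve]  # same pattern, doubled intensity plus background
print(pearson(curve, brighter))        # very close to 1
```

The last comparison shows why the product-moment coefficient is unaffected by overall concentration and by a uniform background: scaling a curve or adding a constant to it leaves the correlation unchanged. Non-uniform background would still lower it, which is consistent with "fairly insensitive" rather than fully insensitive.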

- Clustering methods.

Fig. 3 is an example of a gel with ERIC-PCR genomic fingerprints of 8 bacterial strains. In this small data set one can clearly discriminate the groups of fingerprints {1,2}, {4,5} and {7,8}, and two more individual fingerprints, in lanes 3 and 6, which share some bands. These groups are called clusters (see Fig. 6). Finding these groups is the aim of cluster analysis. The basic assumption is that subsets can be characterized by properties of coherence and isolation (Jardine and Sibson 1971). The goal is to form groups of highly similar fingerprints in such a way that the fingerprints in different groups are as dissimilar as possible. The example of the fingerprints in Fig. 3 and 6 is simple: there are few fingerprints and the differences are clear. When the number of fingerprints is higher, the complexity is higher, and the fingerprints are more similar, it is more difficult and tedious to assign groups, and mathematical algorithms become necessary to perform cluster analyses (Fig. 7, 8). The choice of a clustering method is not always obvious and depends on the nature of the original data and the purpose of the analysis. Cluster analysis, including PCA, is mostly used to describe, present and explain data. It is not a statistical test to prove or disprove a preconceived hypothesis (Jardine and Sibson 1971; Kendal 1975; Kaufman and Rousseeuw 1990). Applying more than one clustering method and comparing the resulting classifications can help in choosing the most appropriate representation of the data and can serve as confirmation.

Several algorithms for hierarchical or divisive clustering analyses leading to dendrograms are available. The unweighted pair-group method using arithmetic averages (UPGMA), described by Sneath and Sokal (1973), is frequently used. This method has also been applied to the analysis of rep-PCR genomic fingerprints (Versalovic et al. 1994; Versalovic and Lupski 1995; Koeth et al. 1995; Louws et al. 1995; van Belkum et al. 1996). Examples of this type of analysis are shown in Fig. 5B and C, and in Fig. 6 and 7, where the rep-PCR genomic fingerprint patterns of a variety of Xanthomonas strains are investigated (F.J. Louws, J.L.W. Rademaker, L. Vauterin, J. Swings and F.J. de Bruijn, unpublished results). The strains cluster according to DNA-homology groups, as previously assigned by Vauterin et al. (1995). The fingerprint patterns shown in Fig. 7 (as well as Fig. 6) represent computer-generated fingerprints, rearranged according to the "phylogenetic" tree. The Neighbour Joining approach (Saitou and Nei 1987) is a method that attempts to reflect evolutionary distances and can be used for reconstructing phylogenetic trees. Alternatively, the method of Ward (1963) can be used, which is intended for interval-scaled measurements and makes use of Euclidean distances (Kaufman and Rousseeuw 1990).

The agglomerative or partitioning principal component analysis (PCA) can also be applied (Versalovic et al. 1994; Vera Cruz et al. 1995, 1996). As a non-hierarchic clustering method, PCA is an interesting alternative to the hierarchical methods described above (UPGMA, Ward's and Neighbour Joining). For example, using GelCompar, PCA can be started directly from the densitometric curves, without applying any similarity coefficient, using the original arrays of densitometric values instead. A set of eigenvalues, derived from these curves, is the basis for calculating the three principal discriminating axes in a multi-dimensional space. Groups of entities (e.g. taxonomical units, strains) can be represented as clouds of dots in a spatial conformation (see Fig. 8). From a mathematical point of view, PCA is the most genuine grouping method and is an excellent method to discriminate between two to five groups. However, it is less suited for the discrimination of more than five groups, since the first three dimensions of a multidimensional system do not allow satisfactory representation of such complex structures.
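The UPGMA procedure mentioned above can be sketched in a few lines. This is a minimal illustration of the general algorithm, not the implementation used by GelCompar or AMBIS. It assumes that similarities have already been converted to distances (for example as 1 minus the similarity), and the lane labels and distance values are hypothetical.

```python
# Minimal UPGMA sketch on a precomputed distance matrix (hypothetical data).

def upgma(labels, dist):
    """Cluster by repeatedly merging the two closest clusters.

    `dist` is a symmetric square list of distances. Returns the tree as
    nested tuples and the list of merges (left, right, merge distance).
    """
    n = len(labels)
    clusters = {i: (labels[i], 1) for i in range(n)}  # id -> (subtree, size)
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    merges = []
    next_id = n
    while len(clusters) > 1:
        (i, j), d_ij = min(d.items(), key=lambda item: item[1])
        tree_i, size_i = clusters.pop(i)
        tree_j, size_j = clusters.pop(j)
        # Distance from the merged cluster to every remaining cluster is the
        # size-weighted mean, i.e. the average over all original pairs.
        new = {}
        for k in clusters:
            d_ik = d[(min(i, k), max(i, k))]
            d_jk = d[(min(j, k), max(j, k))]
            new[(k, next_id)] = (size_i * d_ik + size_j * d_jk) / (size_i + size_j)
        d = {pair: v for pair, v in d.items() if i not in pair and j not in pair}
        d.update(new)
        clusters[next_id] = ((tree_i, tree_j), size_i + size_j)
        merges.append((tree_i, tree_j, d_ij))
        next_id += 1
    tree = next(iter(clusters.values()))[0]
    return tree, merges

labels = ["A", "B", "C", "D"]
dist = [
    [0.0, 0.1, 0.6, 0.7],
    [0.1, 0.0, 0.5, 0.8],
    [0.6, 0.5, 0.0, 0.2],
    [0.7, 0.8, 0.2, 0.0],
]
tree, merges = upgma(labels, dist)
print(tree)  # (('A', 'B'), ('C', 'D'))
for left, right, height in merges:
    print(left, right, round(height, 3))
```

In this example A and B merge first, then C and D, and the two pairs join last at the average of the four between-pair distances. Ties between equal distances are broken by insertion order here, which is an arbitrary choice of this sketch.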

- Phylogenetic studies

Phylogeny is the evolutionary history of lineages (Hillis 1993). Phylogenetic trees are a specific type of relational graphic representation, reflecting the genealogical or evolutionary connections in a group of organisms. Phylogenetic trees are a specific kind of dendrogram, not only because of the cluster analysis methods employed, but especially because of the nature of the raw data and the collection of strains studied. The anomalous phylogenies based on bacterial catalase gene sequences, which show no relationship to phylogenies based on rDNA sequences (Mayfield and Duvall 1996), form an interesting illustration of this phenomenon. Information on phylogenetic relations in a collection of closely related (sub)species, pathovars or strains can be obtained by cluster analysis of rep-PCR genomic fingerprints (see Fig. 6 and 7). Between genera it is virtually impossible to obtain information on phylogenetic relations using rep-PCR. However in a
