* Comments: *When
more than one gel needs to be analyzed and compared, using for
example the GelCompar computer program, the details of this protocol
should be followed carefully or standardized in other ways. It
is important to use at least three marker lanes in a 30-lane gel,
one at each end and one in the middle. This allows for correction
of possible "smiling" effects.

**Experimental protocol**

1. In a 500 ml bottle add 3.75 g agarose to 250 ml 0.5 x TAE and dissolve the agarose by microwaving.

2. Pour the agarose (50-70 °C) into the gel tray and
insert the comb.

3. After the gel has completely hardened, carefully remove the comb and fill the wells with 0.5 x TAE.

4. Mix 6 µl of PCR sample with 1.2 µl loading
buffer on a piece of Parafilm and load the gel; load 2 µl
1 kb DNA size ladder, mixed with 4 µl
ddH₂O and 1.2 µl loading buffer, into the terminal wells and
the middle well.

5. Run the gel in the cold room for 18-19 hours at 70 Volt, constant voltage, 25-30 mA. This corresponds to 2 V/cm, measured as the distance between the electrodes.

6. Stain the gel for 30 min in an ethidium bromide
solution of 0.6 µg/ml in 0.5 x TAE (60 µl
of a 10 mg/ml stock solution in 1 liter 0.5 x TAE), and destain
for 30 min in 0.5 x TAE.

* Comment: *Be careful
and

7. Visualize the bands on the gel under ultraviolet
light and take a photograph (see Fig. 3A), or capture the image
with a video camera and print it (or store it as a TIFF file).

* Comment: *Caution:
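
The quantities quoted in steps 1, 5 and 6 can be cross-checked with a few lines of arithmetic. The sketch below takes the ethidium bromide stock volume as 60 µl and the working concentration in µg/ml (an assumption, chosen because these are the units under which the quoted figures agree). The electrode distance is derived here from the stated voltage and field strength and is not given in the protocol itself.

```python
# Cross-check of the figures quoted in the protocol steps.
# Variable names are illustrative; only the numbers come from the text.

# Step 1: agarose concentration of the gel
agarose_g = 3.75
buffer_ml = 250.0
gel_percent_wv = agarose_g / buffer_ml * 100.0   # grams per 100 ml

# Step 5: electrode distance implied by 70 V at 2 V/cm (derived, not stated)
voltage_v = 70.0
field_v_per_cm = 2.0
electrode_distance_cm = voltage_v / field_v_per_cm

# Step 6: ethidium bromide working concentration
# ASSUMPTION: 60 microlitres of stock; 10 mg/ml is the same as 10 ug/ul.
stock_ug_per_ul = 10.0
stock_volume_ul = 60.0
bath_volume_ml = 1000.0
etbr_ug_per_ml = stock_ug_per_ul * stock_volume_ul / bath_volume_ml

print(f"gel: {gel_percent_wv:.2f} % (w/v) agarose")
print(f"electrode distance: {electrode_distance_cm:.0f} cm")
print(f"ethidium bromide: {etbr_ug_per_ml:.2f} ug/ml")
```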

**COMPUTER-ASSISTED REP-PCR GENOMIC FINGERPRINT ANALYSIS**

When many or diverse rep-PCR fingerprints
need to be compared, computer assistance becomes essential. When
the quality of the raw data is high, computer-assisted analysis
generally generates high-quality data as well. Therefore, care
should be taken to produce high-quality primary data. Many hardware and
software combinations are available for fragment (pattern) analysis.
In our laboratory, two commercial software packages have been
extensively applied to the analysis of rep-PCR fingerprints, namely
the AMBIS system (Scanalytics, Waltham, MA, USA) and GelCompar
(Applied Maths, Kortrijk, Belgium; Vauterin and Vauterin,
1992). Computer programs such as AMBIS and GelCompar
normalize fingerprints according to intra-gel size standards (Versalovic
*et al.* 1994; Rossbach *et al.* 1995; de Bruijn *et
al.* 1996a; 1996b; Schneider and de Bruijn 1996). In this section,
we will first discuss some of the general parameters of
computer-assisted (phylogenetic) analysis and then focus on the use of the
GelCompar system for the analysis of rep-PCR generated genomic
fingerprints (Louws *et al.* 1996; Schneider and de Bruijn
1996; de Bruijn *et al.* 1996a; 1996b).
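
The idea of normalizing against intra-gel size standards can be illustrated with a minimal sketch. It assumes piecewise-linear interpolation between the marker bands observed on a gel and the positions of the same bands on a chosen reference; this is an illustrative choice, not a description of the algorithms used by AMBIS or GelCompar. The function name and the example positions are hypothetical.

```python
from bisect import bisect_right


def normalise_position(pos, observed_markers, reference_markers):
    """Map a migration distance on this gel onto the reference scale.

    observed_markers: marker band positions measured on this gel (increasing)
    reference_markers: positions of the same bands on the reference gel
    """
    if len(observed_markers) != len(reference_markers) or len(observed_markers) < 2:
        raise ValueError("need at least two matched marker bands")
    if any(b <= a for a, b in zip(observed_markers, observed_markers[1:])):
        raise ValueError("observed marker positions must be strictly increasing")
    # Find the marker interval containing pos; positions outside the marker
    # range are extrapolated from the nearest outer interval.
    k = bisect_right(observed_markers, pos) - 1
    k = max(0, min(k, len(observed_markers) - 2))
    x0, x1 = observed_markers[k], observed_markers[k + 1]
    y0, y1 = reference_markers[k], reference_markers[k + 1]
    return y0 + (pos - x0) * (y1 - y0) / (x1 - x0)


# Hypothetical marker positions in millimetres from the well.
observed = [10.0, 30.0, 60.0, 100.0]
reference = [12.0, 30.0, 58.0, 100.0]
print(normalise_position(20.0, observed, reference))
```

With a marker lane at each end and one in the middle, the same mapping could be computed per marker lane and blended across the gel width, which is one way to compensate for "smiling".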

**General parameters; band or curve based characterization of fingerprints.**

Cluster analysis of a collection of genomic fingerprints
obtained by rep-PCR can be carried out in different ways. The
input of a clustering method is a proximity or resemblance matrix;
the output is a dendrogram (Jardine and Sibson 1971) or a 3D presentation
(PCA; Hope 1968; Cooley and Lohnes 1971). Proximities can be described
by a broad array of coefficients that compare one or more
features of the fingerprints and yield similarity or dissimilarity
units. Fingerprints in general can be analysed in a band-based
or curve-based manner. Bands can be used to characterize a
well-defined fingerprint of low complexity as an array of peak positions
alone, or combined with the height or area of each peak. Using
a band-based method, a collection of these fingerprints can be
described as a matrix of binary variables (band present, 1; band
absent, 0). Bands can be assigned by hand (Woods *et al.* 1992;
Judd *et al.* 1993; Reboli *et al.* 1994; Versalovic
and Lupski, 1995; Koeth *et al.* 1995) or by a computer program,
according to preset band-searching settings
(see Fig. 4; Versalovic *et al.* 1994; de Bruijn *et al.*
1996a; Schneider and de Bruijn 1996).
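
The binary band matrix described above can be sketched as follows. The example assumes that bands from different lanes count as the same band when their positions differ by no more than a fixed tolerance; the function name, the tolerance and the band positions are hypothetical and do not reproduce the band-searching settings of any particular program.

```python
def band_matrix(lanes, tolerance):
    """Build a lanes x band-classes matrix of 1 (present) and 0 (absent).

    lanes: one list of band positions per lane
    tolerance: maximum gap between neighbouring positions in one band class
    """
    # Pool all bands, remembering which lane each one came from.
    tagged = sorted((pos, lane_idx)
                    for lane_idx, lane in enumerate(lanes)
                    for pos in lane)
    # Chain neighbouring positions into band classes.
    classes = []
    for pos, lane_idx in tagged:
        if classes and pos - classes[-1][-1][0] <= tolerance:
            classes[-1].append((pos, lane_idx))
        else:
            classes.append([(pos, lane_idx)])
    # Score every lane against every band class.
    matrix = [[0] * len(classes) for _ in lanes]
    for column, members in enumerate(classes):
        for _, lane_idx in members:
            matrix[lane_idx][column] = 1
    return matrix


# Hypothetical band positions for three lanes.
example_lanes = [[300, 520, 1100], [305, 1095, 2000], [520, 2010]]
matrix = band_matrix(example_lanes, tolerance=15)
for row in matrix:
    print(row)
```

The choice of tolerance is exactly the kind of setting that makes band assignment subjective: a slightly different value can merge or split band classes and change the matrix.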

Manual as well as band-based
computer-assisted analysis methods often require tedious and laborious
band assignment or checking steps and can be subjective in nature
(see Fig. 4). The information contained in fingerprints of high
complexity, such as rep-PCR genomic fingerprint patterns, is
captured not only in the number and position of peaks but also
in the ratios of peak heights and areas
(Fig. 3, 4, 5, 6). Therefore, a binary system is not sufficient
to describe these highly complex fingerprint patterns. Preferably,
these fingerprint patterns are analyzed using a curve-based protocol.
The full complexity of rep-PCR genomic fingerprints can only be
characterized by the densitometric curves, described as an array
of densitometric values (J.L.W. Rademaker, F.J. Louws, U. Rossbach
and F.J. de Bruijn, unpublished results). The product-moment correlation
coefficient (see below) allows for the direct comparison of these
whole densitometric curves.
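
As a minimal sketch of a curve-based comparison, the product-moment (Pearson) correlation can be computed directly on two arrays of densitometric values. The lane values below are invented for illustration; `lane_b` is `lane_a` multiplied by a constant and shifted by a constant, mimicking a heavier DNA load on a uniformly brighter background.

```python
import math


def pearson(x, y):
    """Product-moment correlation between two equally long curves."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("curves must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a flat curve")
    return sxy / math.sqrt(sxx * syy)


# Invented densitometric values along three lanes.
lane_a = [2, 3, 40, 5, 3, 25, 4, 2, 60, 3]
lane_b = [3 * v + 10 for v in lane_a]          # same pattern, scaled and offset
lane_c = [2, 30, 4, 3, 50, 3, 2, 45, 3, 2]     # peaks at different positions

print(round(pearson(lane_a, lane_b), 6))
print(round(pearson(lane_a, lane_c), 6))
```

Because the coefficient subtracts the mean and divides by the spread of each curve, a uniform change in intensity or background leaves it unchanged, whereas peaks at different positions lower it.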

**- Proximity coefficients.**

The analysis of rep-PCR genomic fingerprints generally requires a simplification of the original data, which can then be used to calculate a proximity matrix. This calculation can be based on either dissimilarity or similarity criteria (see Fig. 5). These (dis)similarities can be established using a wide array of coefficients.

The band-based comparison using the similarity coefficient
defined by Jaccard (1908) is based solely on the presence of a
band and its position as a binary variable.
The coefficient derived by Dice (1945) also uses the band position,
but adds more weight to matching bands. A more sophisticated "area-sensitive"
similarity coefficient (GelCompar 4.0) takes into account the
correspondence of bands, expressed as in the coefficient
of Jaccard, as well as the differences in the relative areas under
each of the corresponding bands. The product-moment or Pearson
correlation coefficient is applied to the array of densitometric
values formed by the fingerprint (see Fig. 4, 5A). The product-moment
correlation coefficient is a more robust and objective coefficient,
since whole curves are compared and subjective band-scoring is
omitted. The product-moment correlation coefficient is independent
of the relative concentrations of fingerprints and fairly insensitive
to differences in background. Patterns such as the more complex
rep-PCR genomic fingerprints benefit especially from these characteristics.
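
The two band-based coefficients can be written out for binary band vectors. With a the number of bands shared by two lanes and b and c the numbers of bands found in only one of them, Jaccard is a / (a + b + c) and Dice is 2a / (2a + b + c). The sketch below uses invented band vectors; turning similarity into dissimilarity as 1 minus the similarity is one common convention and is an assumption here.

```python
def jaccard(u, v):
    """Shared bands divided by bands present in at least one lane."""
    shared = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    if either == 0:
        raise ValueError("coefficient is undefined when neither lane has a band")
    return shared / either


def dice(u, v):
    """Twice the shared bands divided by the total band count of both lanes."""
    shared = sum(1 for a, b in zip(u, v) if a and b)
    total = sum(1 for a in u if a) + sum(1 for b in v if b)
    if total == 0:
        raise ValueError("coefficient is undefined when neither lane has a band")
    return 2 * shared / total


# Invented binary band vectors (1 = band present, 0 = band absent).
lane_1 = [1, 1, 1, 0]
lane_2 = [1, 0, 1, 1]
lane_3 = [0, 1, 0, 1]
lanes = [lane_1, lane_2, lane_3]

# Proximity matrices: similarity, and dissimilarity taken as 1 - similarity.
similarity = [[jaccard(u, v) for v in lanes] for u in lanes]
dissimilarity = [[1 - s for s in row] for row in similarity]

print(jaccard(lane_1, lane_2), round(dice(lane_1, lane_2), 3))
```

For any pair of lanes the Dice value is at least as large as the Jaccard value, which is the sense in which it gives more weight to matching bands.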

**- Clustering methods.**

Fig. 3 is an example of a gel with ERIC-PCR genomic
fingerprints of 8 bacterial strains. In this small data set one
can clearly discriminate the groups of fingerprints {1,2}, {4,5}
and {7,8}, and two more individual fingerprints in lanes 3 and 6,
which share some bands. These groups are called clusters (see
Fig. 6). Finding these groups is the aim of cluster analysis.
The basic assumption is that subsets can be characterized by the
properties of coherence and isolation (Jardine and Sibson 1971).
The goal is to form groups with highly similar fingerprints in
such a way that the fingerprints in different groups are as dissimilar
as possible. The example of the fingerprints
in Fig. 3 and 6 is simple: there are few fingerprints and the
differences are clear. When the number of fingerprints is higher,
the complexity is higher, and the fingerprints are more similar,
it becomes more difficult and tedious to assign groups, and mathematical
algorithms become necessary to perform cluster analyses (Fig.
7, 8). The choice of a clustering method is not always obvious
and depends on the nature of the original data and the purpose
of the analysis. Cluster analysis, including PCA, is mostly used
to describe, present and explain data. It is not a statistical
test to prove or disprove a preconceived hypothesis (Jardine and
Sibson 1971; Kendal 1975; Kaufman and Rousseeuw 1990). Applying
more than one clustering method and comparing the resulting
classifications can help in choosing the most appropriate
representation of the data and can serve as confirmation.

Several algorithms for hierarchical or divisive clustering
analyses leading to dendrograms are available. The unweighted
pair-group method using arithmetic averages (UPGMA), described
by Sneath and Sokal (1973), is frequently used. This method has
also been applied to the analysis of rep-PCR genomic fingerprints
(Versalovic *et al.* 1994; Versalovic and Lupski 1995; Koeth
*et al.* 1995; Louws *et al.* 1995; van Belkum *et
al.* 1996). Examples of this type of analysis are shown in
Fig. 5B and C, and in Fig. 6 and 7, where the rep-PCR genomic fingerprint
patterns of a variety of *Xanthomonas* strains are investigated
(F.J. Louws, J.L.W. Rademaker, L. Vauterin, J. Swings and F.J.
de Bruijn, unpublished results). The strains cluster according
to DNA-homology groups, as previously assigned by Vauterin *et
al.* (1995). The fingerprint patterns shown in Fig. 7 (as well
as Fig. 6) represent computer-generated fingerprints, rearranged
according to the "phylogenetic" tree. The Neighbour
Joining approach (Saitou and Nei 1987) is a method which attempts
to reflect evolutionary distances that can be used for reconstructing
phylogenetic trees. Alternatively, the method of Ward (1963) can
be used, which is intended for interval-scaled measurements and
makes use of Euclidean distances (Kaufman and Rousseeuw 1990).
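
To make the UPGMA criterion concrete, the following minimal sketch repeatedly joins the two clusters with the smallest average pairwise distance. It works on a dissimilarity matrix and returns the merge history rather than a drawn dendrogram. The strain names and distances are invented, and the straightforward search over all cluster pairs is chosen for readability, not speed.

```python
def average_distance(members_a, members_b, dist):
    """Mean of all pairwise distances between two clusters (UPGMA criterion)."""
    total = sum(dist[i][j] for i in members_a for j in members_b)
    return total / (len(members_a) * len(members_b))


def upgma(names, dist):
    """Return the merge history as (members_a, members_b, distance) tuples."""
    clusters = {i: [i] for i in range(len(names))}  # cluster id -> member indices
    next_id = len(names)
    merges = []
    while len(clusters) > 1:
        ids = sorted(clusters)
        pairs = [(a, b) for k, a in enumerate(ids) for b in ids[k + 1:]]
        # Join the two clusters with the smallest average distance.
        a, b = min(pairs, key=lambda p: average_distance(
            clusters[p[0]], clusters[p[1]], dist))
        height = average_distance(clusters[a], clusters[b], dist)
        merges.append(([names[i] for i in clusters[a]],
                       [names[i] for i in clusters[b]],
                       height))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges


# Invented dissimilarities between four strains.
names = ["A", "B", "C", "D"]
dist = [
    [0.0, 0.1, 0.6, 0.7],
    [0.1, 0.0, 0.5, 0.8],
    [0.6, 0.5, 0.0, 0.2],
    [0.7, 0.8, 0.2, 0.0],
]
merges = upgma(names, dist)
for left, right, height in merges:
    print(left, right, round(height, 3))
```

Each entry in the merge history corresponds to one node of the dendrogram, with the recorded distance giving the level at which the two branches join.
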

The agglomerative or partitioning principal component
analysis (PCA) can also be applied (Versalovic *et al.* 1994;
Vera Cruz *et al.* 1995, 1996). As a non-hierarchical clustering
method, PCA is an interesting alternative to the hierarchical
methods described above (UPGMA, Ward's and Neighbour Joining).
For example, using GelCompar, PCA can be started directly from
the densitometric curves, without applying any similarity coefficient,
using the original arrays of densitometric values instead.
A set of eigenvalues, derived from these curves, is the basis
for calculating the three principal discriminating axes in a multi-dimensional
space. Groups of entities (e.g. taxonomical units, strains) can
be represented as clouds of dots in a spatial conformation (see
Fig. 8). From a mathematical point of view, PCA is the most genuine
grouping method and is an excellent method to discriminate between
two to five groups. However, it is less suited for the discrimination
of more than five groups, since the first three dimensions of
a multidimensional system do not allow satisfactory representation
of such complex structures.
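
The sketch below shows how principal axes can be derived from arrays of densitometric values without a similarity coefficient. It assumes the usual formulation (eigenvectors of the covariance matrix of the mean-centred curves) and finds them by power iteration with deflation, using only the standard library; how GelCompar performs the calculation internally is not described here, and a numerical library would normally be used instead. The curves are invented.

```python
import math


def centre(rows):
    """Subtract the column mean from every densitometric value."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def covariance(rows):
    """Covariance matrix of already centred rows (positions x positions)."""
    n, p = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(p)]
            for i in range(p)]


def leading_eigenpair(matrix, iterations=500):
    """Largest eigenvalue and its eigenvector by power iteration."""
    p = len(matrix)
    start = [float(i + 1) for i in range(p)]      # arbitrary non-zero start
    scale = math.sqrt(sum(x * x for x in start))
    vec = [x / scale for x in start]
    for _ in range(iterations):
        nxt = [sum(m * v for m, v in zip(row, vec)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    value = sum(v * sum(m * w for m, w in zip(row, vec))
                for v, row in zip(vec, matrix))
    return value, vec


def pca(rows, n_components=3):
    """Return per-lane scores on the principal axes and the axis variances."""
    centred = centre(rows)
    cov = covariance(centred)
    p = len(cov)
    axes, variances = [], []
    for _ in range(n_components):
        value, vec = leading_eigenpair(cov)
        axes.append(vec)
        variances.append(value)
        # Deflate: remove the component just found before searching again.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(p)]
               for i in range(p)]
    scores = [[sum(c * a for c, a in zip(row, axis)) for axis in axes]
              for row in centred]
    return scores, variances


# Invented densitometric curves: lanes 0-1 and lanes 2-3 form two groups.
curves = [
    [0, 9, 1, 0, 8, 1],
    [1, 8, 0, 1, 9, 0],
    [9, 0, 1, 8, 0, 1],
    [8, 1, 0, 9, 1, 0],
]
scores, variances = pca(curves)
for lane_scores in scores:
    print([round(s, 2) for s in lane_scores])
```

Plotting the three scores of each lane as a point gives the kind of cloud-of-dots representation described above; the sign of each axis is arbitrary, so only the relative positions of the points are meaningful.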

**- Phylogenetic studies**

Phylogeny is the evolutionary history of lineages (Hillis 1993). Phylogenetic trees are a specific type of relational graphic representation, reflecting the genealogical or evolutionary connections in a group of organisms. Phylogenetic trees are a specific kind of dendrogram, not only because of the cluster analysis methods employed, but especially due to the nature of the raw data and the collection of strains studied. The anomalous phylogenies based on bacterial catalase gene sequences, which show no relationship to phylogenies based on rDNA sequences (Mayfield and Duvall 1996), form an interesting illustration of this phenomenon. Information on phylogenetic relations in a collection of closely related (sub)species, pathovars or strains can be obtained by cluster analysis of rep-PCR genomic fingerprints (see Fig. 6 and 7). Between genera it is virtually impossible to obtain information on phylogenetic relations using rep-PCR. However in a