# On Convergence Properties of the EM Algorithm for Gaussian Mixtures

Neural Computation, 8, 129-151, 1996.

Lei Xu
Department of Computer Science
The Chinese University of Hong Kong

Michael I. Jordan
Department of Brain and Cognitive Sciences
Massachusetts Institute of Technology

Abstract

We build up the mathematical connection between the "Expectation-Maximization" (EM) algorithm and gradient-based approaches for maximum likelihood learning of finite Gaussian mixtures. We show that the EM step in parameter space is obtained from the gradient via a projection matrix $P$, and we provide an explicit expression for the matrix. We then analyze the convergence of EM in terms of special properties of $P$ and provide new results analyzing the effect that $P$ has on the likelihood surface. Based on these mathematical results, we present a comparative discussion of the advantages and disadvantages of EM and other algorithms for the learning of Gaussian mixture models.

1 Introduction

The "Expectation-Maximization" (EM) algorithm is a general technique for maximum likelihood (ML) or maximum a posteriori (MAP) estimation. The recent emphasis in the neural network literature on probabilistic models has led to increased interest in EM as a possible alternative to gradient-based methods for optimization. EM has been used for variations on the traditional theme of Gaussian mixture modeling (Ghahramani & Jordan, 1994; Nowlan, 1991; Xu & Jordan, 1993a, b; Tresp, Ahmad & Neuneier, 1994; Xu, Jordan & Hinton, 1994) and has also been used for novel chain-structured and tree-structured architectures (Bengio & Frasconi, 1995; Jordan & Jacobs, 1994). The empirical results reported in these papers suggest that EM has considerable promise as an optimization method for such architectures. Moreover, new theoretical results have been obtained that link EM to other topics in learning theory (Amari, 1994; Jordan & Xu, 1993; Neal & Hinton, 1993; Xu & Jordan, 1993c; Yuille, Stolorz & Utans, 1994).

Despite these developments, there are grounds for caution about the promise of the EM algorithm. One reason for caution comes from consideration of theoretical convergence rates, which show that EM is a first order algorithm.^1 More precisely, there are two key results available in the statistical literature on the convergence of EM. First, it has been established that under mild conditions EM is guaranteed to converge toward a local maximum of the log likelihood $l$ (Boyles, 1983; Dempster, Laird & Rubin, 1977; Redner & Walker, 1984; Wu, 1983). (Indeed the convergence is monotonic: $l(\Theta^{(k+1)}) \ge l(\Theta^{(k)})$, where $\Theta^{(k)}$ is the value of the parameter vector at iteration $k$.) Second, considering EM as a mapping $\Theta^{(k+1)} = M(\Theta^{(k)})$ with fixed point $\Theta^* = M(\Theta^*)$, we have $\Theta^{(k+1)} - \Theta^* \approx \frac{\partial M(\Theta^*)}{\partial \Theta}(\Theta^{(k)} - \Theta^*)$ when $\Theta^{(k+1)}$ is near $\Theta^*$, and thus

$$\|\Theta^{(k+1)} - \Theta^*\| \le \left\|\frac{\partial M(\Theta^*)}{\partial \Theta}\right\| \cdot \|\Theta^{(k)} - \Theta^*\|,$$

with

$$\left\|\frac{\partial M(\Theta^*)}{\partial \Theta}\right\| \ne 0$$

almost surely. That is, EM is a first order algorithm.

The first-order convergence of EM has been cited in the statistical literature as a major drawback. Redner and Walker (1984), in a widely-cited article, argued that superlinear (quasi-Newton, method of scoring) and second-order (Newton) methods should generally be preferred to EM. They reported empirical results demonstrating the slow convergence of EM on a Gaussian mixture model problem for which the mixture components were not well separated. These results did not include tests of competing algorithms, however. Moreover, even though the convergence toward the "optimal" parameter values was slow in these experiments, the convergence in likelihood was rapid. Indeed, Redner and Walker acknowledge that their results show that "... even when the component populations in a mixture are poorly separated, the EM algorithm can be expected to produce in a very small number of iterations parameter values such that the mixture density determined by them reflects the sample data very well." In the context of the current literature on learning, in which the predictive aspect of data modeling is emphasized at the expense of the traditional Fisherian statistician's concern over the "true" values of parameters, such rapid convergence in likelihood is a major desideratum of a learning algorithm and undercuts the critique of EM as a "slow" algorithm.

In the current paper, we provide a comparative analysis of EM and other optimization methods. We emphasize the comparison between EM and other first-order methods (gradient ascent, conjugate gradient methods), because these have tended to be the methods of choice in the neural network literature. However, we also compare EM to superlinear and second-order methods. We argue that EM has a number of advantages, including its naturalness at handling the probabilistic constraints of mixture problems and its guarantees of convergence. We also provide new results

^1 An iterative algorithm is said to have a local convergence rate of order $q \ge 1$ if $\|\Theta^{(k+1)} - \Theta^*\| / \|\Theta^{(k)} - \Theta^*\|^q \le r + o(\|\Theta^{(k)} - \Theta^*\|)$ for $k$ sufficiently large.

suggesting that under appropriate conditions EM may in fact approximate a superlinear method; this would explain some of the promising empirical results that have been obtained (Jordan & Jacobs, 1994), and would further temper the critique of EM offered by Redner and Walker. The analysis in the current paper focuses on unsupervised learning; for related results in the supervised learning domain see Jordan and Xu (in press).

The remainder of the paper is organized as follows. We first briefly review the EM algorithm for Gaussian mixtures. The second section establishes a connection between EM and the gradient of the log likelihood. We then present a comparative discussion of the advantages and disadvantages of various optimization algorithms in the Gaussian mixture setting. We then present empirical results suggesting that EM regularizes the condition number of the effective Hessian. The fourth section presents a theoretical analysis of this empirical finding. The final section presents our conclusions.

2 The EM algorithm for Gaussian mixtures

We study the following probabilistic model:

$$P(x|\Theta) = \sum_{j=1}^{K} \alpha_j P(x|m_j, \Sigma_j), \qquad (1)$$

where

$$P(x|m_j, \Sigma_j) = \frac{1}{(2\pi)^{d/2} |\Sigma_j|^{1/2}}\, e^{-\frac{1}{2}(x - m_j)^T \Sigma_j^{-1} (x - m_j)},$$

and where $\alpha_j \ge 0$ and $\sum_{j=1}^{K} \alpha_j = 1$. The parameter vector $\Theta$ consists of the mixing proportions $\alpha_j$, the mean vectors $m_j$, and the covariance matrices $\Sigma_j$. Given $K$ and given $N$ independent, identically distributed samples $\{x^{(t)}\}_1^N$, we obtain the following log likelihood:^2

$$l(\Theta) = \log \prod_{t=1}^{N} P(x^{(t)}|\Theta) = \sum_{t=1}^{N} \log P(x^{(t)}|\Theta), \qquad (2)$$

which can be optimized via the following iterative algorithm (see, e.g., Dempster, Laird & Rubin, 1977):

$$\alpha_j^{(k+1)} = \frac{\sum_{t=1}^{N} h_j^{(k)}(t)}{N}$$

$$m_j^{(k+1)} = \frac{\sum_{t=1}^{N} h_j^{(k)}(t)\, x^{(t)}}{\sum_{t=1}^{N} h_j^{(k)}(t)} \qquad (3)$$

$$\Sigma_j^{(k+1)} = \frac{\sum_{t=1}^{N} h_j^{(k)}(t)\, [x^{(t)} - m_j^{(k)}][x^{(t)} - m_j^{(k)}]^T}{\sum_{t=1}^{N} h_j^{(k)}(t)}$$

where the posterior probabilities $h_j^{(k)}(t)$ are defined as follows:

$$h_j^{(k)}(t) = \frac{\alpha_j^{(k)} P(x^{(t)}|m_j^{(k)}, \Sigma_j^{(k)})}{\sum_{i=1}^{K} \alpha_i^{(k)} P(x^{(t)}|m_i^{(k)}, \Sigma_i^{(k)})}.$$

^2 Although we focus on maximum likelihood (ML) estimation in this paper, it is straightforward to apply our results to maximum a posteriori (MAP) estimation by multiplying the likelihood by a prior.
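The updates in Eq. (3) can be sketched in a few lines of NumPy. This is our own minimal illustration for univariate components, not code from the paper; variable names are our own, and the covariance (here variance) update uses the freshly updated mean, which is the exact M-step.

```python
import numpy as np

def em_step(x, alpha, m, var):
    """One EM iteration for a univariate K-component Gaussian mixture.

    x: (N,) data; alpha: (K,) mixing proportions; m: (K,) means;
    var: (K,) variances.  Returns updated (alpha, m, var).
    """
    # E-step: posterior probabilities h_j(t)
    dens = np.exp(-0.5 * (x[:, None] - m) ** 2 / var) / np.sqrt(2 * np.pi * var)
    h = alpha * dens
    h /= h.sum(axis=1, keepdims=True)          # (N, K)

    # M-step: Eq. (3)
    Nj = h.sum(axis=0)                          # effective counts per component
    alpha_new = Nj / len(x)
    m_new = (h * x[:, None]).sum(axis=0) / Nj
    var_new = (h * (x[:, None] - m_new) ** 2).sum(axis=0) / Nj
    return alpha_new, m_new, var_new

def log_lik(x, alpha, m, var):
    dens = np.exp(-0.5 * (x[:, None] - m) ** 2 / var) / np.sqrt(2 * np.pi * var)
    return np.log(dens @ alpha).sum()
```

Iterating `em_step` never decreases `log_lik`, in line with the monotonic convergence discussed below.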

3 Connection between EM and gradient ascent

In the following theorem we establish a relationship between the gradient of the log likelihood and the step in parameter space taken by the EM algorithm. In particular we show that the EM step can be obtained by premultiplying the gradient by a positive definite matrix. We provide an explicit expression for the matrix.

Theorem 1 At each iteration of the EM algorithm in Eq. (3), we have

$$A^{(k+1)} - A^{(k)} = P_A^{(k)} \left.\frac{\partial l}{\partial A}\right|_{A = A^{(k)}} \qquad (4)$$

$$m_j^{(k+1)} - m_j^{(k)} = P_{m_j}^{(k)} \left.\frac{\partial l}{\partial m_j}\right|_{m_j = m_j^{(k)}} \qquad (5)$$

$$\mathrm{vec}[\Sigma_j^{(k+1)}] - \mathrm{vec}[\Sigma_j^{(k)}] = P_{\Sigma_j}^{(k)} \left.\frac{\partial l}{\partial\, \mathrm{vec}[\Sigma_j]}\right|_{\Sigma_j = \Sigma_j^{(k)}} \qquad (6)$$

where

$$P_A^{(k)} = \frac{1}{N}\left\{\mathrm{diag}[\alpha_1^{(k)}, \ldots, \alpha_K^{(k)}] - A^{(k)} (A^{(k)})^T\right\} \qquad (7)$$

$$P_{m_j}^{(k)} = \frac{\Sigma_j^{(k)}}{\sum_{t=1}^{N} h_j^{(k)}(t)} \qquad (8)$$

$$P_{\Sigma_j}^{(k)} = \frac{2}{\sum_{t=1}^{N} h_j^{(k)}(t)}\, \Sigma_j^{(k)} \otimes \Sigma_j^{(k)} \qquad (9)$$

where $A$ denotes the vector of mixing proportions $[\alpha_1, \ldots, \alpha_K]^T$, $j$ indexes the mixture components ($j = 1, \ldots, K$), $k$ denotes the iteration number, "$\mathrm{vec}[B]$" denotes the vector obtained by stacking the column vectors of the matrix $B$, and "$\otimes$" denotes the Kronecker product. Moreover, given the constraints $\sum_{j=1}^{K} \alpha_j^{(k)} = 1$ and $\alpha_j^{(k)} \ge 0$, $P_A^{(k)}$ is a positive definite matrix, and the matrices $P_{m_j}^{(k)}$ and $P_{\Sigma_j}^{(k)}$ are positive definite with probability one for $N$ sufficiently large.

Proof. (1) We begin by considering the EM update for the mixing proportions $\alpha_i$. From Eqs. (1) and (2), we have

$$\left.\frac{\partial l}{\partial A}\right|_{A = A^{(k)}} = \sum_{t=1}^{N} \frac{[P(x^{(t)}; \theta_1^{(k)}), \ldots, P(x^{(t)}; \theta_K^{(k)})]^T}{\sum_{i=1}^{K} \alpha_i^{(k)} P(x^{(t)}; \theta_i^{(k)})}.$$

Premultiplying by $P_A^{(k)}$, we obtain

$$P_A^{(k)} \left.\frac{\partial l}{\partial A}\right|_{A = A^{(k)}} = \frac{1}{N} \sum_{t=1}^{N} \frac{[\alpha_1^{(k)} P(x^{(t)}; \theta_1^{(k)}), \ldots, \alpha_K^{(k)} P(x^{(t)}; \theta_K^{(k)})]^T - A^{(k)} \sum_{i=1}^{K} \alpha_i^{(k)} P(x^{(t)}; \theta_i^{(k)})}{\sum_{i=1}^{K} \alpha_i^{(k)} P(x^{(t)}; \theta_i^{(k)})}$$

$$= \frac{1}{N} \sum_{t=1}^{N} [h_1^{(k)}(t), \ldots, h_K^{(k)}(t)]^T - A^{(k)}.$$

The update formula for $A$ in Eq. (3) can be rewritten as

$$A^{(k+1)} = A^{(k)} + \left\{\frac{1}{N}\sum_{t=1}^{N} [h_1^{(k)}(t), \ldots, h_K^{(k)}(t)]^T - A^{(k)}\right\}.$$

Combining the last two equations establishes the update rule for $A$ (Eq. 4). Furthermore, for an arbitrary vector $u$, we have $N u^T P_A^{(k)} u = u^T \mathrm{diag}[\alpha_1^{(k)}, \ldots, \alpha_K^{(k)}]\, u - (u^T A^{(k)})^2$. By Jensen's inequality we have

$$u^T \mathrm{diag}[\alpha_1^{(k)}, \ldots, \alpha_K^{(k)}]\, u = \sum_{j=1}^{K} \alpha_j^{(k)} u_j^2 > \left(\sum_{j=1}^{K} \alpha_j^{(k)} u_j\right)^2 = (u^T A^{(k)})^2.$$

Thus, $u^T P_A^{(k)} u > 0$ and $P_A^{(k)}$ is positive definite given the constraints $\sum_{j=1}^{K} \alpha_j^{(k)} = 1$ and $\alpha_j^{(k)} \ge 0$ for all $j$.

(2) We now consider the EM update for the means $m_j$. It follows from Eqs. (1) and (2) that

$$\left.\frac{\partial l}{\partial m_j}\right|_{m_j = m_j^{(k)}} = \sum_{t=1}^{N} h_j^{(k)}(t)\, (\Sigma_j^{(k)})^{-1} [x^{(t)} - m_j^{(k)}].$$

Premultiplying by $P_{m_j}^{(k)}$ yields

$$P_{m_j}^{(k)} \left.\frac{\partial l}{\partial m_j}\right|_{m_j = m_j^{(k)}} = \frac{\sum_{t=1}^{N} h_j^{(k)}(t)\, x^{(t)}}{\sum_{t=1}^{N} h_j^{(k)}(t)} - m_j^{(k)} = m_j^{(k+1)} - m_j^{(k)}.$$

From Eq. (3), we have $\sum_{t=1}^{N} h_j^{(k)}(t) > 0$; moreover, $\Sigma_j^{(k)}$ is positive definite with probability one assuming that $N$ is large enough such that the matrix is of full rank. Thus, it follows from Eq. (8) that $P_{m_j}^{(k)}$ is positive definite with probability one.

(3) Finally, we prove the third part of the theorem. It follows from Eqs. (1) and (2) that

$$\left.\frac{\partial l}{\partial \Sigma_j}\right|_{\Sigma_j = \Sigma_j^{(k)}} = -\frac{1}{2} \sum_{t=1}^{N} h_j^{(k)}(t)\, (\Sigma_j^{(k)})^{-1} \left\{\Sigma_j^{(k)} - [x^{(t)} - m_j^{(k)}][x^{(t)} - m_j^{(k)}]^T\right\} (\Sigma_j^{(k)})^{-1}.$$

With this in mind, we rewrite the EM update formula for $\Sigma_j^{(k)}$ as

$$\Sigma_j^{(k+1)} = \Sigma_j^{(k)} + \frac{1}{\sum_{t=1}^{N} h_j^{(k)}(t)} \sum_{t=1}^{N} h_j^{(k)}(t) \left\{[x^{(t)} - m_j^{(k)}][x^{(t)} - m_j^{(k)}]^T - \Sigma_j^{(k)}\right\}$$

$$= \Sigma_j^{(k)} + \frac{2}{\sum_{t=1}^{N} h_j^{(k)}(t)}\, \Sigma_j^{(k)} V_{\Sigma_j} \Sigma_j^{(k)},$$

where

$$V_{\Sigma_j} = -\frac{1}{2} \sum_{t=1}^{N} h_j^{(k)}(t)\, (\Sigma_j^{(k)})^{-1} \left\{\Sigma_j^{(k)} - [x^{(t)} - m_j^{(k)}][x^{(t)} - m_j^{(k)}]^T\right\} (\Sigma_j^{(k)})^{-1} = \left.\frac{\partial l}{\partial \Sigma_j}\right|_{\Sigma_j = \Sigma_j^{(k)}}.$$

That is, we have

$$\Sigma_j^{(k+1)} = \Sigma_j^{(k)} + \frac{2}{\sum_{t=1}^{N} h_j^{(k)}(t)}\, \Sigma_j^{(k)} \left.\frac{\partial l}{\partial \Sigma_j}\right|_{\Sigma_j = \Sigma_j^{(k)}} \Sigma_j^{(k)}.$$

Utilizing the identity $\mathrm{vec}[ABC] = (C^T \otimes A)\,\mathrm{vec}[B]$, we obtain

$$\mathrm{vec}[\Sigma_j^{(k+1)}] = \mathrm{vec}[\Sigma_j^{(k)}] + \frac{2}{\sum_{t=1}^{N} h_j^{(k)}(t)} (\Sigma_j^{(k)} \otimes \Sigma_j^{(k)}) \left.\frac{\partial l}{\partial\, \mathrm{vec}[\Sigma_j]}\right|_{\Sigma_j = \Sigma_j^{(k)}}.$$

Thus $P_{\Sigma_j}^{(k)} = \frac{2}{\sum_{t=1}^{N} h_j^{(k)}(t)} (\Sigma_j^{(k)} \otimes \Sigma_j^{(k)})$. Moreover, for an arbitrary matrix $U$, we have

$$\mathrm{vec}[U]^T (\Sigma_j^{(k)} \otimes \Sigma_j^{(k)})\,\mathrm{vec}[U] = \mathrm{tr}(U^T \Sigma_j^{(k)} U \Sigma_j^{(k)}) = \mathrm{tr}(W^T W) \ge 0, \qquad W = (\Sigma_j^{(k)})^{1/2}\, U\, (\Sigma_j^{(k)})^{1/2},$$

where equality holds only when $W = 0$, i.e., only when $U = 0$, since $\Sigma_j^{(k)}$ is positive definite with probability one when $N$ is sufficiently large. Thus it follows from Eq. (9) and $\sum_{t=1}^{N} h_j^{(k)}(t) > 0$ that $P_{\Sigma_j}^{(k)}$ is positive definite with probability one.

Using the notation $\Theta = [m_1^T, \ldots, m_K^T, \mathrm{vec}[\Sigma_1]^T, \ldots, \mathrm{vec}[\Sigma_K]^T, A^T]^T$ and $P(\Theta) = \mathrm{diag}[P_{m_1}, \ldots, P_{m_K}, P_{\Sigma_1}, \ldots, P_{\Sigma_K}, P_A]$, we can combine the three updates in Theorem 1 into a single equation:

$$\Theta^{(k+1)} = \Theta^{(k)} + P(\Theta^{(k)}) \left.\frac{\partial l}{\partial \Theta}\right|_{\Theta = \Theta^{(k)}}. \qquad (10)$$

Under the conditions of Theorem 1, $P(\Theta^{(k)})$ is a positive definite matrix with probability one. Recalling that for a positive definite matrix $B$ we have $\frac{\partial l}{\partial \Theta}^T B \frac{\partial l}{\partial \Theta} > 0$, we have the following corollary:

Corollary 1 For each iteration of the EM algorithm given by Eq. (3), the search direction $\Theta^{(k+1)} - \Theta^{(k)}$ has a positive projection on the gradient of $l$.
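Eq. (4), with $P_A^{(k)}$ as in Eq. (7), is easy to check numerically. The sketch below is our own illustration (univariate components with fixed means and variances, not data from the paper): it compares one EM update of the mixing proportions against $A^{(k)} + P_A^{(k)}\, \partial l / \partial A$.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 600), rng.normal(2, 1, 400)])
m, var = np.array([-2.0, 2.0]), np.array([1.0, 1.0])
alpha = np.array([0.3, 0.7])                     # current A^{(k)}

dens = np.exp(-0.5 * (x[:, None] - m) ** 2 / var) / np.sqrt(2 * np.pi * var)

# EM update: alpha_j <- (1/N) sum_t h_j(t)
h = alpha * dens
h /= h.sum(axis=1, keepdims=True)
alpha_em = h.mean(axis=0)

# Gradient of l with respect to A at A^{(k)}
grad = (dens / (dens @ alpha)[:, None]).sum(axis=0)

# Projection matrix P_A = (1/N) {diag[alpha] - A A^T}  (Eq. 7)
P_A = (np.diag(alpha) - np.outer(alpha, alpha)) / len(x)

alpha_proj = alpha + P_A @ grad
# alpha_em and alpha_proj coincide, as Theorem 1 (Eq. 4) states
```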

That is, the EM algorithm can be viewed as a variable metric gradient ascent algorithm for which the projection matrix $P(\Theta^{(k)})$ changes at each iteration as a function of the current parameter value $\Theta^{(k)}$.

Our results extend earlier results due to Baum and Sell (1968), who studied recursive equations of the following form:

$$x^{(k+1)} = T(x^{(k)}), \qquad T(x^{(k)}) = [T(x^{(k)})_1, \ldots, T(x^{(k)})_K], \qquad T(x^{(k)})_i = \frac{x_i^{(k)}\, \partial J / \partial x_i^{(k)}}{\sum_{i=1}^{K} x_i^{(k)}\, \partial J / \partial x_i^{(k)}},$$

where $x_i^{(k)} \ge 0$, $\sum_{i=1}^{K} x_i^{(k)} = 1$, and where $J$ is a polynomial in $x^{(k)}$ having positive coefficients. They showed that the search direction of this recursive formula, i.e., $T(x^{(k)}) - x^{(k)}$, has a positive projection on the gradient of $J$ with respect to $x^{(k)}$ (see also Levinson, Rabiner & Sondhi, 1983). It can be shown that Baum and Sell's recursive formula implies the EM update formula for $A$ in a Gaussian mixture. Thus, the first statement in Theorem 1 is a special case of Baum and Sell's earlier work. However, Baum and Sell's theorem is an existence theorem and does not provide an explicit expression for the matrix $P_A$ that transforms the gradient direction into the EM direction. Our theorem provides such an explicit form for $P_A$. Moreover, we generalize Baum and Sell's results to handle the updates for $m_j$ and $\Sigma_j$, and we provide explicit expressions for the positive definite transformation matrices $P_{m_j}$ and $P_{\Sigma_j}$ as well.

It is also worthwhile to compare the EM algorithm to other gradient-based optimization methods. Newton's method is obtained by premultiplying the gradient by the inverse of the Hessian of the log likelihood:

$$\Theta^{(k+1)} = \Theta^{(k)} - H(\Theta^{(k)})^{-1} \left.\frac{\partial l}{\partial \Theta}\right|_{\Theta = \Theta^{(k)}}. \qquad (11)$$

Newton's method is the method of choice when it can be applied, but the algorithm is often difficult to use in practice. In particular, the algorithm can diverge when the Hessian becomes nearly singular; moreover, the computational costs of computing the inverse Hessian at each step can be considerable. An alternative is to approximate the inverse by a recursively updated matrix $B^{(k+1)} = B^{(k)} + \Delta B^{(k)}$. Such a modification is called a quasi-Newton method. Conventional quasi-Newton methods are unconstrained optimization methods, however, and must be modified in order to be used in the mixture setting (where there are probabilistic constraints on the parameters). In addition, quasi-Newton methods generally require that a one-dimensional search be performed at each iteration in order to guarantee convergence.

The EM algorithm can be viewed as a special form of quasi-Newton method in which the projection matrix $P(\Theta^{(k)})$ in Eq. (10) plays the role of $B^{(k)}$. As we discuss in the remainder of the paper, this particular matrix has a number of favorable properties that make EM particularly attractive for optimization in the mixture setting.
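Baum and Sell's growth transformation can be demonstrated directly. The polynomial $J$ and the starting point below are our own example (any polynomial with positive coefficients would do); the sketch checks that the step stays on the simplex and has a positive projection on the gradient.

```python
import numpy as np

# Baum-Sell growth transformation on the simplex for a polynomial J
# with positive coefficients.  J and x are our own illustrative choices.
def J(x):
    return 2 * x[0] ** 2 * x[1] + x[0] * x[1] ** 2 + 3 * x[1] ** 2

def grad_J(x):
    return np.array([4 * x[0] * x[1] + x[1] ** 2,
                     2 * x[0] ** 2 + 2 * x[0] * x[1] + 6 * x[1]])

def baum_sell(x):
    # T(x)_i = x_i dJ/dx_i / sum_i x_i dJ/dx_i
    g = grad_J(x)
    return x * g / (x * g).sum()

x = np.array([0.6, 0.4])
x_new = baum_sell(x)
# x_new remains on the simplex, and (x_new - x) has a positive
# projection on grad J, as Baum and Sell (1968) proved
```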

4 Constrained optimization and general convergence

An important property of the matrix $P$ is that the EM step in parameter space automatically satisfies the probabilistic constraints of the mixture model in Eq. (1). The domain of $\Theta$ contains two regions that embody the probabilistic constraints: $D_1 = \{\Theta : \sum_{j=1}^{K} \alpha_j^{(k)} = 1\}$ and $D_2 = \{\Theta : \alpha_j^{(k)} \ge 0,\ \Sigma_j^{(k)} \text{ positive definite}\}$. For the EM algorithm the update for the mixing proportions $\alpha_j$ can be rewritten as follows:

$$\alpha_j^{(k+1)} = \frac{1}{N} \sum_{t=1}^{N} \frac{\alpha_j^{(k)} P(x^{(t)}|m_j^{(k)}, \Sigma_j^{(k)})}{\sum_{i=1}^{K} \alpha_i^{(k)} P(x^{(t)}|m_i^{(k)}, \Sigma_i^{(k)})}.$$

It is obvious that the iteration stays within $D_1$. Similarly, the update for $\Sigma_j$ can be rewritten as:

$$\Sigma_j^{(k+1)} = \frac{1}{\sum_{t=1}^{N} h_j^{(k)}(t)} \sum_{t=1}^{N} \frac{\alpha_j^{(k)} P(x^{(t)}|m_j^{(k)}, \Sigma_j^{(k)})}{\sum_{i=1}^{K} \alpha_i^{(k)} P(x^{(t)}|m_i^{(k)}, \Sigma_i^{(k)})}\, [x^{(t)} - m_j^{(k)}][x^{(t)} - m_j^{(k)}]^T,$$

which stays within $D_2$ for $N$ sufficiently large.

Whereas EM automatically satisfies the probabilistic constraints of a mixture model, other optimization techniques generally require modification to satisfy the constraints. One approach is to modify each iterative step to keep the parameters within the constrained domain. A number of such techniques have been developed, including feasible direction methods, active sets, gradient projection, reduced-gradient methods, and linearly-constrained quasi-Newton methods. These constrained methods all incur extra computational costs to check and maintain the constraints, and, moreover, the theoretical convergence rates for such constrained algorithms need not be the same as those for the corresponding unconstrained algorithms. A second approach is to transform the constrained optimization problem into an unconstrained problem before using the unconstrained method. This can be accomplished via penalty and barrier functions, Lagrangian terms, or re-parameterization. Once again, the extra algorithmic machinery renders simple comparisons based on unconstrained convergence rates problematic. Moreover, it is not easy to meet the constraints on the covariance matrices in the mixture using such techniques.

A second appealing property of $P(\Theta^{(k)})$ is that each iteration of EM is guaranteed to increase the likelihood (i.e., $l(\Theta^{(k+1)}) \ge l(\Theta^{(k)})$). This monotonic convergence of the likelihood is achieved without step-size parameters or line searches. Other gradient-based optimization techniques, including gradient ascent, quasi-Newton, and Newton's method, do not provide such a simple theoretical guarantee, even assuming that the constrained problem has been transformed into an unconstrained one. For gradient ascent, the step size $\eta$ must be chosen to ensure that $\|\Theta^{(k+1)} - \Theta^*\| / \|\Theta^{(k)} - \Theta^*\| \approx \|I + \eta H(\Theta^{(k)})\| < 1$. This requires a one-dimensional line search or an optimization of $\eta$ at each iteration, and the extra computation can slow down the convergence. An alternative is to fix $\eta$ to a very small value, which generally makes $\|I + \eta H(\Theta^{(k)})\|$ close to one and results in slow convergence. For Newton's method, the iterative process is usually required to be near a solution; otherwise the Hessian may be indefinite and the iteration may not converge. Levenberg-Marquardt methods handle the indefinite Hessian matrix problem; however, a one-dimensional optimization or other form of search is required for a suitable scalar to be added to the diagonal elements of the Hessian. Fisher scoring methods can also handle the indefinite Hessian matrix problem, but for non-quadratic nonlinear optimization Fisher scoring requires a stepsize $\eta$ that obeys $\|I + \eta B H(\Theta^{(k)})\| < 1$, where $B$ is the Fisher information matrix. Thus, problems similar to those of gradient ascent arise here as well. Finally, for quasi-Newton methods or conjugate gradient methods, a one-dimensional line search is required at each iteration. In summary, all of these gradient-based methods incur extra computational costs at each iteration, rendering simple comparisons based on local convergence rates unreliable.

For large scale problems, algorithms that change the parameters immediately after each data point ("on-line algorithms") are often significantly faster in practice than batch algorithms. The popularity of gradient descent algorithms for neural networks is due in part to the ease of obtaining on-line variants of gradient descent. It is worth noting that on-line variants of the EM algorithm can be derived (Neal & Hinton, 1993; Titterington, 1984), and this is a further factor that weighs in favor of EM as compared to conjugate gradient and Newton methods.
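The step-size tradeoff for gradient ascent can be made concrete on a toy concave quadratic (our own illustration, not an experiment from the paper): the iteration contracts by a factor governed by the condition number of the Hessian, and diverges if $\eta$ is too large.

```python
import numpy as np

# Toy concave objective l(theta) = -0.5 theta^T Q theta, so H = -Q.
Q = np.diag([10.0, 1.0])      # lambda_M[-H] = 10, lambda_m[-H] = 1, kappa = 10
eta = 1.0 / 10.0              # step of the form 1 / lambda_M[-H]

theta = np.array([1.0, 1.0])
ratios = []
for _ in range(30):
    theta_new = theta + eta * (-Q @ theta)    # gradient ascent step
    ratios.append(np.linalg.norm(theta_new) / np.linalg.norm(theta))
    theta = theta_new

# The contraction factor approaches 1 - 1/kappa = 0.9; a step
# eta > 2 / lambda_M[-H] = 0.2 would instead diverge.
```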

5 Convergence rate comparisons

In this section, we provide a comparative theoretical discussion of the convergence rates of constrained gradient ascent and EM. For gradient ascent a local convergence result can be obtained by Taylor expanding the log likelihood around the maximum likelihood estimate $\Theta^*$. For sufficiently large $k$ we have:

$$\|\Theta^{(k+1)} - \Theta^*\| \le \|I + \eta H(\Theta^*)\| \cdot \|\Theta^{(k)} - \Theta^*\| \qquad (12)$$

and

$$\|I + \eta H(\Theta^*)\| = r, \qquad (13)$$

where $H$ is the Hessian of $l$, $\eta$ is the step size, and $r = \max\{|1 - \eta \lambda_M[-H(\Theta^*)]|,\ |1 - \eta \lambda_m[-H(\Theta^*)]|\}$, where $\lambda_M[A]$ and $\lambda_m[A]$ denote the largest and smallest eigenvalues of $A$, respectively. Smaller values of $r$ correspond to faster convergence rates. To guarantee convergence, we require $r < 1$, or $0 < \eta < 2/\lambda_M[-H(\Theta^*)]$. The minimum possible value of $r$ is obtained when $\eta = 1/\lambda_M[-H(\Theta^*)]$, with

$$r_{\min} = 1 - \lambda_m[-H(\Theta^*)]/\lambda_M[-H(\Theta^*)] = 1 - \kappa[-H(\Theta^*)]^{-1},$$

where $\kappa[H] = \lambda_M[H]/\lambda_m[H]$ is the condition number of $H$. Larger values of the condition number correspond to slower convergence. When $\kappa[-H(\Theta^*)] = 1$ we have $r_{\min} = 0$, which corresponds to a

superlinear rate of convergence. Indeed, Newton's method can be viewed as a method for obtaining a more desirable condition number: the inverse Hessian $H^{-1}$ balances the Hessian $H$ such that the resulting condition number is one. Effectively, Newton's method can be regarded as gradient ascent on a new function with an effective Hessian that is the identity matrix: $H_{\mathrm{eff}} = H^{-1} H = I$. In practice, however, $\kappa[H]$ is usually quite large. The larger $\kappa[H]$ is, the more difficult it is to compute $H^{-1}$ accurately. Hence it is difficult to balance the Hessian as desired. In addition, as we mentioned in the previous section, the Hessian varies from point to point in the parameter space, and at each iteration we need to recompute the inverse Hessian. Quasi-Newton methods approximate $H(\Theta^{(k)})^{-1}$ by a positive definite matrix $B^{(k)}$ that is easy to compute.

The discussion thus far has treated unconstrained optimization. In order to compare gradient ascent with the EM algorithm on the constrained mixture estimation problem, we consider a gradient projection method:

$$\Theta^{(k+1)} = \Theta^{(k)} + \eta\, \Pi_k \left.\frac{\partial l}{\partial \Theta}\right|_{\Theta = \Theta^{(k)}}, \qquad (14)$$

where $\Pi_k$ is the projection matrix that projects the gradient into $D_1$. This gradient projection iteration will remain in $D_1$ as long as the initial parameter vector is in $D_1$. To keep the iteration within $D_2$, we choose an initial $\Theta^{(0)} \in D_2$ and keep $\eta$ sufficiently small at each iteration.

Suppose that $E = [e_1, \ldots, e_m]$ is a set of independent unit basis vectors that span the space $D_1$. In this basis, $\Theta^{(k)}$ and $\Pi_k \frac{\partial l}{\partial \Theta}$ become $\Theta_c^{(k)} = E^T \Theta^{(k)}$ and $\frac{\partial l}{\partial \Theta_c} = E^T \frac{\partial l}{\partial \Theta}$, respectively, with $\|\Theta_c^{(k)} - \Theta_c^*\| = \|\Theta^{(k)} - \Theta^*\|$. In this representation the projective gradient algorithm of Eq. (14) becomes simple gradient ascent: $\Theta_c^{(k+1)} = \Theta_c^{(k)} + \eta \frac{\partial l}{\partial \Theta_c}$. Moreover, Eq. (12) becomes $\|\Theta_c^{(k+1)} - \Theta_c^*\| \le \|E^T (I + \eta H(\Theta^*))\| \cdot \|\Theta^{(k)} - \Theta^*\|$. As a result, the convergence rate is bounded by

$$r_c = \|E^T (I + \eta H(\Theta^*))\| = \sqrt{\lambda_M[E^T (I + \eta H(\Theta^*)) (I + \eta H(\Theta^*))^T E]} = \sqrt{\lambda_M[E^T (I + 2\eta H(\Theta^*) + \eta^2 H^2(\Theta^*)) E]}.$$

Since $H(\Theta^*)$ is negative definite, we obtain

$$r_c \le \sqrt{1 + \eta^2 \lambda_M^2[-H_c] - 2\eta \lambda_m[-H_c]}. \qquad (15)$$

In this equation $H_c = E^T H(\Theta^*) E$ is the Hessian of $l$ restricted to $D_1$. We see from this derivation that the convergence speed depends on $\kappa[-H_c] = \lambda_M[-H_c]/\lambda_m[-H_c]$. When $\kappa[-H_c] = 1$, we have $\sqrt{1 + \eta^2 \lambda_M^2[-H_c] - 2\eta \lambda_m[-H_c]} = 1 - \eta \lambda_M[-H_c]$, which in principle can be made to equal zero if $\eta$ is selected appropriately. In this case, a superlinear rate is obtained. Generally, however, $\kappa[-H_c] \ne 1$, with smaller values of $\kappa[-H_c]$ corresponding to faster convergence.

We now turn to an analysis of the EM algorithm. As we have seen, EM keeps the parameter vector within $D_1$ automatically. Thus, in the new basis the connection between EM and gradient

ascent (cf. Eq. (10)) becomes

$$\Theta_c^{(k+1)} = \Theta_c^{(k)} + E^T P(\Theta^{(k)}) \left.\frac{\partial l}{\partial \Theta}\right|_{\Theta = \Theta^{(k)}},$$

and we have

$$\|\Theta_c^{(k+1)} - \Theta_c^*\| \le \|E^T (I + P H(\Theta^*))\| \cdot \|\Theta^{(k)} - \Theta^*\|,$$

with

$$r_c = \|E^T (I + P H(\Theta^*))\| = \sqrt{\lambda_M[E^T (I + P H(\Theta^*)) (I + P H(\Theta^*))^T E]}. \qquad (16)$$

The latter equation can be further manipulated to yield:

$$r_c \le \sqrt{1 + \lambda_M^2[-E^T P H E] - 2 \lambda_m[-E^T P H E]}.$$

Thus we see that the convergence speed of EM depends on $\kappa[-E^T P H E] = \lambda_M[-E^T P H E]/\lambda_m[-E^T P H E]$. When $\kappa[-E^T P H E] = 1$ and $\lambda_M[-E^T P H E] = 1$, we have $\sqrt{1 + \lambda_M^2[-E^T P H E] - 2 \lambda_m[-E^T P H E]} = 1 - \lambda_M[-E^T P H E] = 0$. In this case, a superlinear rate is obtained. We discuss the possibility of obtaining superlinear convergence with EM in more detail below.

These results show that the convergence of gradient ascent and EM both depends on the shape of the log likelihood as measured by the condition number. When $\kappa[H]$ is near one, the configuration is quite regular, and the update direction points directly to the solution, yielding fast convergence. When $\kappa[H]$ is very large, the $l$ surface has an elongated shape, and the search along the update direction is a zigzag path, making convergence very slow. The key idea of Newton and quasi-Newton methods is to reshape the surface. The nearer it is to a ball shape (Newton's method achieves this shape in the ideal case), the better the convergence. Quasi-Newton methods aim to achieve an effective Hessian whose condition number is as close as possible to one. Interestingly, the results that we now present suggest that the projection matrix $P$ for the EM algorithm also serves to effectively reshape the likelihood, yielding an effective condition number that tends to one. We first present empirical results that support this suggestion and then present a theoretical analysis.

We sampled 1000 points from a simple finite mixture model given by

$$p(x) = \alpha_1 p_1(x) + \alpha_2 p_2(x),$$

where

$$p_i(x) = \frac{1}{\sqrt{2\pi \sigma_i^2}} \exp\left\{-\frac{1}{2\sigma_i^2}(x - m_i)^2\right\}.$$

The parameter values were as follows: $\alpha_1 = 0.7170$, $\alpha_2 = 0.2830$, $m_1 = -2$, $m_2 = 2$, $\sigma_1^2 = 1$, $\sigma_2^2 = 1$. We ran both the EM algorithm and gradient ascent on the data. At each step of the simulation, we calculated the condition number of the Hessian ($\kappa[H(\Theta^{(k)})]$), the condition number determining the rate of convergence of the gradient algorithm ($\kappa[E^T H(\Theta^{(k)}) E]$), and the condition number determining the rate of convergence of EM ($\kappa[E^T P(\Theta^{(k)}) H(\Theta^{(k)}) E]$). We also calculated the largest eigenvalues of the matrices $H(\Theta^{(k)})$, $E^T H(\Theta^{(k)}) E$, and $E^T P(\Theta^{(k)}) H(\Theta^{(k)}) E$. The results are shown in Fig. 1.

As can be seen in Fig. 1(a), the condition numbers change rapidly in the vicinity of the 25th iteration and the corresponding Hessian matrices become indefinite. Afterward, the Hessians quickly become definite and the condition numbers converge.^3 As shown in Fig. 1(b), the condition numbers converge toward the values $\kappa[H(\Theta^{(k)})] = 47.5$, $\kappa[E^T H(\Theta^{(k)}) E] = 33.5$, and $\kappa[E^T P(\Theta^{(k)}) H(\Theta^{(k)}) E] = 3.6$. That is, the matrix $P$ has greatly reduced the condition number, by factors of 9 and 15. This significantly improves the shape of $l$ and speeds up the convergence.

We ran a second experiment in which the means of the component Gaussians were $m_1 = -1$ and $m_2 = 1$. The results are similar to those shown in Fig. 1. Since the distance between the two distributions is reduced by half, the shape of $l$ becomes more irregular. The condition number $\kappa[H(\Theta^{(k)})]$ increases to 352, $\kappa[E^T H(\Theta^{(k)}) E]$ increases to 216, and $\kappa[E^T P(\Theta^{(k)}) H(\Theta^{(k)}) E]$ increases to 61. We see once again a significant improvement in the case of EM, by factors of 4 and 6. Fig. 3 shows that the matrix $P$ has also reduced the largest eigenvalues of the Hessian from between 2000 and 3000 to around 1. This demonstrates clearly the stable convergence that is obtained via EM, without a line search or the need for external selection of a learning stepsize.

In the remainder of the paper we provide some theoretical analyses that attempt to shed some light on these empirical results. To illustrate the issues involved, consider a degenerate mixture problem in which the mixture has a single component. (In this case $\alpha_1 = 1$.) Let us furthermore assume that the covariance matrix $\Sigma$ is fixed (i.e., only the mean vector $m$ is to be estimated). The Hessian with respect to the mean $m$ is $H = -N \Sigma^{-1}$ and the EM projection matrix $P$ is $\Sigma/N$. For gradient ascent, we have $\kappa[E^T H E] = \kappa[\Sigma^{-1}]$, which is larger than one whenever $\Sigma \ne cI$. EM, on the other hand, achieves a condition number of one exactly ($\kappa[-E^T P H E] = \kappa[-P H] = \kappa[I] = 1$ and $\lambda_M[-E^T P H E] = 1$). Thus, EM and Newton's method are the same for this simple quadratic problem. For general non-quadratic optimization problems, Newton's method retains the quadratic assumption, yielding fast convergence but possible divergence. EM is a more conservative algorithm that retains the convergence guarantee but also maintains quasi-Newton behavior. We now analyze this behavior in more detail. We consider the special case of estimating the means in a Gaussian mixture when the Gaussians are well separated.
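The single-component, fixed-covariance case is simple enough to verify in a few lines. The covariance below is our own example of a non-spherical $\Sigma$.

```python
import numpy as np

# Degenerate single-component case: H = -N Sigma^{-1}, P = Sigma / N.
N = 500
Sigma = np.array([[4.0, 1.0], [1.0, 1.0]])   # our example, Sigma != cI

H = -N * np.linalg.inv(Sigma)
P = Sigma / N

def cond(M):
    ev = np.linalg.eigvalsh(M)
    return ev.max() / ev.min()

kappa_grad = cond(np.linalg.inv(Sigma))      # condition number faced by gradient ascent
kappa_em = cond(-P @ H)                      # EM: -PH = I, condition number one
```

Here `kappa_grad` is strictly greater than one while `kappa_em` is one, matching the claim that EM and Newton's method coincide on this quadratic problem.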

Theorem 2 Consider the EM algorithm in Eq. (3), where the parameters $\alpha_j$ and $\Sigma_j$ are assumed to be known. Assume that the $K$ Gaussian distributions are well separated, such that for sufficiently large $k$ the posterior probabilities $h_j^{(k)}(t)$ are nearly zero or one. For such $k$, the condition number associated with EM is always smaller than the condition number associated with gradient ascent. That is:

$$\kappa[E^T P(\Theta^{(k)}) H(\Theta^{(k)}) E] < \kappa[E^T H(\Theta^{(k)}) E].$$

Furthermore, $\lambda_M[-E^T P(\Theta^{(k)}) H(\Theta^{(k)}) E]$ approaches one as $k$ goes to infinity.

^3 Interestingly, the EM algorithm converges soon afterward as well, showing that for this problem EM spends little time in the region of parameter space in which a local analysis is valid.

Proof. The Hessian is

$$H = \begin{bmatrix} H_{11} & H_{12} & \cdots & H_{1K} \\ H_{21} & H_{22} & \cdots & H_{2K} \\ \vdots & \vdots & \ddots & \vdots \\ H_{K1} & H_{K2} & \cdots & H_{KK} \end{bmatrix}, \qquad (17)$$

where

$$H_{ij} = \frac{\partial^2 l}{\partial m_i \partial m_j^T} = -\delta_{ij}\, (\Sigma_j^{(k)})^{-1} \sum_{t=1}^{N} h_j^{(k)}(t) + (\Sigma_i^{(k)})^{-1} \left[\sum_{t=1}^{N} \gamma_{ij}(x^{(t)})\, (x^{(t)} - m_i)(x^{(t)} - m_j)^T\right] (\Sigma_j^{(k)})^{-1}, \qquad (18)$$

with $\gamma_{ij}(x^{(t)}) = (\delta_{ij} - h_i^{(k)}(t))\, h_j^{(k)}(t)$. The projection matrix $P$ is

$$P^{(k)} = \mathrm{diag}[P_{11}^{(k)}, \ldots, P_{KK}^{(k)}],$$

where

$$P_{jj}^{(k)} = \frac{\Sigma_j^{(k)}}{\sum_{t=1}^{N} h_j^{(k)}(t)}.$$

Given that $\sum_{t=1}^{N} h_j^{(k)}(t)(1 - h_j^{(k)}(t))$ is negligible for sufficiently large $k$, the second term in Eq. (18) can be neglected, yielding $H_{ii} = -(\Sigma_i^{(k)})^{-1} \sum_{t=1}^{N} h_i^{(k)}(t)$ and $H = \mathrm{diag}[H_{11}, \ldots, H_{KK}]$. This implies that $P H = -I$, and thus $\kappa[-P H] = 1$, whereas in general $\kappa[-H] \ne 1$. □
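Theorem 2 can be probed numerically. The sketch below is our own illustration: a univariate mixture with widely separated components, known mixing proportions and variances, with the exact Hessian of $l$ with respect to the means obtained by finite differences of the gradient.

```python
import numpy as np

# Well-separated univariate components; only the means are free,
# as in Theorem 2.  Data and parameter values are our own example.
rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(-10, 1, 500), rng.normal(10, 1, 500)])
alpha = np.array([0.5, 0.5])
var = np.array([1.0, 1.0])
m = np.array([-10.0, 10.0])

def posteriors(m):
    dens = np.exp(-0.5 * (x[:, None] - m) ** 2 / var) / np.sqrt(2 * np.pi * var)
    h = alpha * dens
    return h / h.sum(axis=1, keepdims=True)

def grad_l(m):
    # dl/dm_j = sum_t h_j(t) (x_t - m_j) / var_j
    return (posteriors(m) * (x[:, None] - m) / var).sum(axis=0)

# Hessian of l with respect to the means, by central finite differences
eps = 1e-5
H = np.column_stack([(grad_l(m + eps * e) - grad_l(m - eps * e)) / (2 * eps)
                     for e in np.eye(2)])

P = np.diag(var / posteriors(m).sum(axis=0))   # P_jj = var_j / sum_t h_j(t)
# With nearly hard posteriors, P H is close to -I, so kappa[-PH] is near one
```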

This theorem, although restrictive in its assumptions, gives some indication as to why the projection matrix in the EM algorithm appears to condition the Hessian, yielding improved convergence. In fact, we conjecture that the theorem can be extended to apply more widely, in particular to the case of the full EM update in which the mixing proportions and covariances are estimated, and also, within limits, to cases in which the means are not well separated. To obtain an initial indication as to possible conditions that can be usefully imposed on the separation of the mixture components, we have studied the case in which the second term in Eq. (18) is neglected only for $H_{ii}$ and is retained for the $H_{ij}$ components, where $j \ne i$.

Consider, for example, the case of a univariate mixture having two mixture components. For fixed mixing proportions and fixed variances, the Hessian matrix (Eq. 17) becomes:

$$H = \begin{bmatrix} h_{11} & h_{12} \\ h_{21} & h_{22} \end{bmatrix},$$

and the projection matrix becomes:

$$P = \begin{bmatrix} -h_{11}^{-1} & 0 \\ 0 & -h_{22}^{-1} \end{bmatrix},$$

where

$$h_{ii} = -\frac{1}{\sigma_i^{2(k)}} \sum_{t=1}^{N} h_i^{(k)}(t), \quad i = 1, 2,$$

and

$$h_{ij} = \frac{1}{\sigma_i^{2(k)} \sigma_j^{2(k)}} \sum_{t=1}^{N} (1 - h_i^{(k)}(t))\, h_j^{(k)}(t)\, (x^{(t)} - m_j)(x^{(t)} - m_i), \quad i \ne j.$$

If $H$ is negative definite (i.e., $h_{11} < 0$, $h_{22} < 0$, and $h_{11} h_{22} - h_{12} h_{21} > 0$), then we can show that the conclusions of Theorem 2 remain true, even for Gaussians that are not necessarily well separated. The proof is achieved via the following lemma:

Lemma 1 Consider the positive definite matrix

$$\Lambda = \begin{bmatrix} \lambda_{11} & \lambda_{12} \\ \lambda_{21} & \lambda_{22} \end{bmatrix}.$$

For the diagonal matrix $B = \mathrm{diag}[\lambda_{11}^{-1}, \lambda_{22}^{-1}]$, we have $\kappa[B\Lambda] < \kappa[\Lambda]$.

Proof. The eigenvalues of $\Lambda$ are the roots of $(\lambda_{11} - \lambda)(\lambda_{22} - \lambda) - \lambda_{21}\lambda_{12} = 0$, which gives

$$\lambda_M = \frac{\lambda_{11} + \lambda_{22} + \sqrt{(\lambda_{11} + \lambda_{22})^2 - 4(\lambda_{11}\lambda_{22} - \lambda_{21}\lambda_{12})}}{2}$$

and

$$\lambda_m = \frac{\lambda_{11} + \lambda_{22} - \sqrt{(\lambda_{11} + \lambda_{22})^2 - 4(\lambda_{11}\lambda_{22} - \lambda_{21}\lambda_{12})}}{2}.$$

The condition number $\kappa[\Lambda]$ can be written as $\kappa[\Lambda] = (1 + s)/(1 - s) = f(s)$, where $s$ is defined as follows:

$$s = \sqrt{1 - \frac{4(\lambda_{11}\lambda_{22} - \lambda_{21}\lambda_{12})}{(\lambda_{11} + \lambda_{22})^2}}.$$

Furthermore, the eigenvalues of $B\Lambda$ are the roots of $(1 - \lambda)(1 - \lambda) - (\lambda_{21}\lambda_{12})/(\lambda_{11}\lambda_{22}) = 0$, which gives $\lambda_M = 1 + \sqrt{(\lambda_{21}\lambda_{12})/(\lambda_{11}\lambda_{22})}$ and $\lambda_m = 1 - \sqrt{(\lambda_{21}\lambda_{12})/(\lambda_{11}\lambda_{22})}$. Thus, defining $r = \sqrt{(\lambda_{21}\lambda_{12})/(\lambda_{11}\lambda_{22})}$, we have $\kappa[B\Lambda] = (1 + r)/(1 - r) = f(r)$.

We now examine the quotient $s/r$:

$$\frac{s}{r} = \frac{1}{r}\sqrt{1 - \frac{4(1 - r^2)}{(\lambda_{11} + \lambda_{22})^2/(\lambda_{11}\lambda_{22})}}.$$

Given that $(\lambda_{11} + \lambda_{22})^2/(\lambda_{11}\lambda_{22}) \ge 4$, we have $s/r \ge \frac{1}{r}\sqrt{1 - (1 - r^2)} = 1$. That is, $s \ge r$, with strict inequality when $\lambda_{11} \ne \lambda_{22}$. Since $f(x) = (1 + x)/(1 - x)$ is a monotonically increasing function for $0 < x < 1$, we have $f(s) \ge f(r)$. Therefore, $\kappa[B\Lambda] \le \kappa[\Lambda]$, with strict inequality whenever $\lambda_{11} \ne \lambda_{22}$. □

We think that it should be possible to generalize this lemma beyond the univariate, two-component case, thereby weakening the conditions on separability in Theorem 2 in a more general setting.
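Lemma 1 is also easy to stress-test on random matrices. The check below is our own, using random 2×2 positive definite matrices; it confirms that diagonal rescaling by $B$ never worsens the condition number.

```python
import numpy as np

# Lemma 1 on random 2x2 positive definite matrices (our own check).
def cond(M):
    ev = np.abs(np.linalg.eigvals(M))
    return ev.max() / ev.min()

rng = np.random.default_rng(4)
for _ in range(100):
    a = rng.normal(size=(2, 2))
    Lam = a @ a.T + 0.1 * np.eye(2)           # random positive definite Lambda
    B = np.diag(1.0 / np.diag(Lam))           # B = diag[lam_11^{-1}, lam_22^{-1}]
    assert cond(B @ Lam) <= cond(Lam) + 1e-9  # kappa[B Lam] <= kappa[Lam]
```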

6 Conclusions

In this paper we have provided a comparative analysis of algorithms for the learning of Gaussian mixtures. We have focused on the EM algorithm and have forged a link between EM and gradient methods via the projection matrix P. We have also analyzed the convergence of EM in terms of properties of the matrix P and the effect that P has on the likelihood surface.

EM has a number of properties that make it a particularly attractive algorithm for mixture models. It enjoys automatic satisfaction of probabilistic constraints, monotonic convergence without the need to set a learning rate, and low computational overhead. Although EM has the reputation of being a slow algorithm, we feel that in the mixture setting the slowness of EM has been overstated. Although EM can indeed converge slowly for problems in which the mixture components are not well separated, the Hessian is poorly conditioned for such problems and thus other gradient-based algorithms (including Newton's method) are also likely to perform poorly. Moreover, if one's concern is convergence in likelihood, then EM generally performs well even for these ill-conditioned problems. Indeed the algorithm provides a certain amount of safety in such cases, despite the poor conditioning. It is also important to emphasize that the case of poorly separated mixture components can be viewed as a problem in model selection (too many mixture components are being included in the model), and should be handled by regularization techniques.

The fact that EM is a first-order algorithm certainly implies that EM is no panacea, but does not imply that EM has no advantages over gradient ascent or superlinear methods. First, it is important to appreciate that convergence rate results are generally obtained for unconstrained optimization, and are not necessarily indicative of performance on constrained optimization problems. Also, as we have demonstrated, there are conditions under which the condition number of the effective Hessian of the EM algorithm tends toward one, showing that EM can approximate a superlinear method. Finally, in cases of a poorly conditioned Hessian, superlinear convergence is not necessarily a virtue. In such cases many optimization schemes, including EM, essentially revert to gradient ascent.
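These properties (automatic satisfaction of the probabilistic constraints and monotone convergence in likelihood without a learning rate) can be illustrated with a minimal EM fit of a univariate two-component Gaussian mixture. This is a sketch only; the data, initialization, and iteration count are arbitrary choices, not the experimental setup of the paper.

```python
import numpy as np

def em_gmm_1d(x, n_iter=50, seed=0):
    """Fit a two-component univariate Gaussian mixture by EM.

    Returns the log-likelihood trace (nondecreasing under EM), the mixing
    proportions, means, and variances. No learning rate is involved, and
    the mixing proportions remain in the simplex automatically.
    """
    rng = np.random.default_rng(seed)
    pi = np.array([0.5, 0.5])                    # mixing proportions
    mu = rng.choice(x, size=2, replace=False)    # initial means drawn from data
    var = np.array([x.var(), x.var()])           # broad initial variances
    ll_trace = []
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) \
               / np.sqrt(2.0 * np.pi * var)
        joint = pi * dens
        total = joint.sum(axis=1, keepdims=True)
        ll_trace.append(np.log(total).sum())
        h = joint / total
        # M-step: closed-form reestimates
        nk = h.sum(axis=0)
        pi = nk / len(x)
        mu = (h * x[:, None]).sum(axis=0) / nk
        var = (h * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return np.array(ll_trace), pi, mu, var

# Sample from a well-separated two-component mixture and fit it.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])
ll, pi, mu, var = em_gmm_1d(x)
assert np.all(np.diff(ll) >= -1e-8)   # monotone convergence in likelihood
assert abs(pi.sum() - 1.0) < 1e-6     # probabilistic constraints hold
```

Note that no step size appears anywhere: each M-step is the exact maximizer of the expected complete-data log-likelihood, which is what yields the monotonicity checked by the first assertion.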

We feel that EM will continue to play an important role in the development of learning systems that emphasize the predictive aspect of data modeling. EM has indeed played a critical role in the development of hidden Markov models (HMM's), an important example of predictive data modeling.^4 EM generally converges rapidly in this setting. Similarly, in the case of hierarchical mixtures of experts the empirical results on convergence in likelihood have been quite promising (Jordan & Jacobs, 1994; Waterhouse & Robinson, 1994). Finally, EM can play an important conceptual role as an organizing principle in the design of learning algorithms. Its role in this case is to focus attention on the "missing variables" in the problem. This clarifies the structure of the algorithm and invites comparisons with statistical physics, where missing variables often provide a powerful analytic tool.

Acknowledgement

This project was supported in part by the HK RGC Earmarked Grant CUHK250/94E, by a grant from the McDonnell-Pew Foundation, by a grant from ATR Human Information Processing Research Laboratories, by a grant from Siemens Corporation, by grant IRI-9013991 from the National Science Foundation, and by grant N00014-90-J-1942 from the Office of Naval Research. Michael I. Jordan is an NSF Presidential Young Investigator.

References

Amari, S. (in press), Information geometry of the EM and em algorithms for neural networks, Neural Networks.

Baum, L.E., and Sell, G.R. (1968), Growth transformations for functions on manifolds, Pac. J. Math., 27, 211-227.

Bengio, Y., and Frasconi, P. (1995), An input-output HMM architecture, Advances in Neural Information Processing Systems 6, eds., Tesauro, G., Touretzky, D.S., and Alspector, J., San Mateo, CA: Morgan Kaufmann.

Boyles, R.A. (1983), On the convergence of the EM algorithm, J. of Royal Statistical Society, B45, No. 1, 47-50.

Dempster, A.P., Laird, N.M., and Rubin, D.B. (1977), Maximum likelihood from incomplete data via the EM algorithm, J. of Royal Statistical Society, B39, 1-38.

Ghahramani, Z., and Jordan, M.I. (1994), Function approximation via density estimation using the EM approach, Advances in Neural Information Processing Systems 6, eds., Cowan, J.D., Tesauro, G., and Alspector, J., San Mateo, CA: Morgan Kaufmann, 120-127.

^4 In most applications of HMM's, the "parameter estimation" process is employed solely to yield models with high likelihood; the parameters are not generally endowed with a particular meaning.


Jordan, M.I., and Jacobs, R.A. (1994), Hierarchical mixtures of experts and the EM algorithm, Neural Computation, 6, 181-214.

Jordan, M.I., and Xu, L. (in press), Convergence results for the EM approach to mixtures-of-experts architectures, Neural Networks.

Levinson, S.E., Rabiner, L.R., and Sondhi, M.M. (1983), An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition, The Bell System Technical Journal, 62, 1035-1072.

Neal, R.M., and Hinton, G.E. (1993), A new view of the EM algorithm that justifies incremental and other variants, University of Toronto, Department of Computer Science preprint.

Nowlan, S.J. (1991), Soft competitive adaptation: Neural network learning algorithms based on fitting statistical mixtures, Tech. Rep. CMU-CS-91-126, CMU, Pittsburgh, PA.

Redner, R.A., and Walker, H.F. (1984), Mixture densities, maximum likelihood, and the EM algorithm, SIAM Review, 26, 195-239.

Titterington, D.M. (1984), Recursive parameter estimation using incomplete data, J. of Royal Statistical Society, B46, 257-267.

Tresp, V., Ahmad, S., and Neuneier, R. (1994), Training neural networks with deficient data, Advances in Neural Information Processing Systems 6, eds., Cowan, J.D., Tesauro, G., and Alspector, J., San Mateo, CA: Morgan Kaufmann, 128-135.

Waterhouse, S.R., and Robinson, A.J. (1994), Classification using hierarchical mixtures of experts, IEEE Workshop on Neural Networks for Signal Processing.

Wu, C.F.J. (1983), On the convergence properties of the EM algorithm, The Annals of Statistics, 11, 95-103.

Xu, L., and Jordan, M.I. (1993a), Unsupervised learning by EM algorithm based on finite mixture of Gaussians, Proc. of WCNN'93, Portland, OR, Vol. II, 431-434.

Xu, L., and Jordan, M.I. (1993b), EM learning on a generalized finite mixture model for combining multiple classifiers, Proc. of WCNN'93, Portland, OR, Vol. IV, 227-230.

Xu, L., and Jordan, M.I. (1993c), Theoretical and experimental studies of the EM algorithm for unsupervised learning based on finite Gaussian mixtures, MIT Computational Cognitive Science Technical Report 9302, Dept. of Brain and Cognitive Sciences, MIT, Cambridge, MA.

Xu, L., Jordan, M.I., and Hinton, G.E. (1994), A modified gating network for the mixtures of experts architecture, Proc. of WCNN'94, San Diego, Vol. 2, 405-410.

Yuille, A.L., Stolorz, P., and Utans, J. (1994), Statistical physics, mixtures of distributions and the EM algorithm, Neural Computation, 6, 334-340.


[Figure 1 graphic, two panels: the condition number (with sign in panel (a); on a log scale in panel (b)) plotted against the learning steps, for the original Hessian (solid), the constrained Hessian (dash-dot), and the EM-equivalent Hessian (dashed).]
Figure 1: Experimental results for the estimation of the parameters of a two-component Gaussian mixture. (a) The condition numbers as a function of the iteration number. (b) A zoomed version of (a) after discarding the first 25 iterations. The terminology 'original, constrained, and EM-equivalent Hessians' refers to the matrices H, E^T H E, and E^T P H E, respectively.

[Figure 2 graphic, two panels: the condition number (with sign in panel (a); on a log scale in panel (b)) plotted against the learning steps, for the original Hessian (solid), the constrained Hessian (dash-dot), and the EM-equivalent Hessian (dashed).]
Figure 2: Experimental results for the estimation of the parameters of a two-component Gaussian mixture (cf. Fig. 1). The separation of the Gaussians is half the separation in Fig. 1.

[Figure 3 graphic, two panels: the maximum eigenvalue on a log scale plotted against the learning steps, for the original Hessian, the constrained Hessian, and the EM-equivalent Hessian.]
Figure 3: The largest eigenvalues of the matrices H, E^T H E, and E^T P H E plotted as a function of the number of iterations. The plot in (a) is for the experiment in Fig. 1; (b) is for the experiment reported in Fig. 2.
