# A closed-form solution to non-rigid shape and motion recovery

A Closed-Form Solution to Non-Rigid Shape and Motion Recovery

Jing Xiao, Jin-xiang Chai, Takeo Kanade

The Robotics Institute Carnegie Mellon University 5000 Forbes Avenue Pittsburgh, PA 15213 {jxiao, jchai, tk}@cs.cmu.edu

Abstract. Recovery of three dimensional (3D) shape and motion of non-static scenes from a monocular video sequence is important for applications like robot navigation and human computer interaction. If every point in the scene randomly moves, it is impossible to recover the non-rigid shapes. In practice, many non-rigid objects, e.g. the human face under various expressions, deform with certain structures. Their shapes can be regarded as a weighted combination of certain shape bases. Shape and motion recovery under such situations has attracted much interest. Previous work on this problem [6, 4, 18] utilized only orthonormality constraints on the camera rotations (rotation constraints). This paper proves that using only the rotation constraints results in ambiguous and invalid solutions. The ambiguity arises from the fact that the shape bases are not unique. An arbitrary linear transformation of the bases produces another set of eligible bases. To eliminate the ambiguity, we propose a set of novel constraints, basis constraints, which uniquely determine the shape bases. We prove that, under the weak-perspective projection model, enforcing both the basis and the rotation constraints leads to a closed-form solution to the problem of non-rigid shape and motion recovery. The accuracy and robustness of our closed-form solution is evaluated quantitatively on synthetic data and qualitatively on real video sequences.

1

Introduction

Many years of work in structure from motion have led to signi?cant successes in recovery of 3D shapes and motion estimates from 2D monocular videos. Many reliable methods have been proposed for reconstruction of static scenes [17, 25, 19, 11, 13]. However, most biological objects and natural scenes vary their shapes: expressive faces, people walking beside buildings, etc. Recovering the structure and motion of these non-rigid objects is a challenging task. The e?ects of rigid motion, i.e. 3D rotation and translation, and non-rigid shape deformation, e.g. stretching, are coupled together in the image measurements. While it is impossible to reconstruct the shape if the scene deforms arbitrarily, in practice, many non-rigid objects, e.g. the human face under various expressions, deform with a class of structures. One class of solutions model non-rigid object shapes as weighted combinations of certain shape bases that are pre-learned by o?-line training [2, 3, 5, 9]. For instance, the geometry of a face is represented as a weighted combination of shape bases that correspond to various facial deformations. Then the recovery of shape and motion is simply a model ?tting problem. However, in many applications, e.g. reconstruction of a scene consisting of a moving car and a static building, the shape bases of the dynamic structure are di?cult to obtain before reconstruction. Several approaches have been proposed to solve the problem without a prior model [6, 18, 4]. Instead, they treat the model, i.e. shape bases, as part of the unknowns to be computed.

2

Jing Xiao et al.

They try to recover not only the non-rigid shape and motion, but also the shape model. This class of approaches so far has utilized only the orthonormality constraints on camera rotations (rotation constraints) to solve the problem. However, as shown in this paper, enforcing only the rotation constraints leads to ambiguous and invalid solutions. These approaches thus cannot guarantee the desired solution. They have to either require a priori knowledge on shape and motion, e.g. constant speed [10], or need non-linear optimization that involves large number of variables and hence requires a good initial estimate [18, 4]. Intuitively, the above ambiguity arises from the non-uniqueness of the shape bases: an arbitrary linear transformation of a set of shape bases yields a new set of eligible bases. Once the bases are determined uniquely, the ambiguity is eliminated. Therefore, instead of imposing only the rotation constraints, we identify and introduce another set of constraints on the shape bases (basis constraints), which implicitly determine the bases uniquely. This paper proves that, under the weak-perspective projection model, when both the basis and rotation constraints are imposed, a linear closed-form solution to the problem of non-rigid shape and motion recovery is achieved. Accordingly we develop a factorization method that applies both the metric constraints to compute the closed-form solution for the non-rigid shape, motion, and shape bases.

2

Previous Work

Recovering 3D object structure and motion from 2D image sequences has a rich history. Various approaches have been proposed for di?erent applications. The discussion in this section will focus on the factorization techniques, which are most closely related to our work. The factorization method was originally proposed by Tomasi and Kanade [17]. First it applies the rank constraint to factorize a set of feature locations tracked across the entire sequence. Then it uses the orthonormality constraints on the rotation matrices to recover the scene structure and camera rotations in one step. This approach works under the orthographic projection model. Poelman and Kanade [14] extended it to work under the weak perspective and para-perspective projection models. Triggs [19] generalized the factorization method to the recovery of scene geometry and camera motion under the perspective projection model. These methods work for static scenes. The factorization technique has been extended to the non-rigid cases [8, 10, 22, 12, 23, 20, 24, 16, 21]. Costeira and Kanade [8] proposed a method to reconstruct the structure of multiple independently moving objects under the weak-perspective camera model. This method ?rst separates the objects according to their respective factorization characteristics and then individually recovers their shapes using the original factorization technique. The similar reconstruction problem was also studied in [12, 24, 16]. Wolf and Shashua [22] derived a geometrical constraint, called the segmentation matrix, to reconstruct a scene containing two independently moving objects from two perspective views. Vidal and his colleagues [20, 21] generalized this approach for reconstructing dynamic scenes containing multiple independently moving objects. In cases where the dynamic scenes consist of both static objects and objects moving straight along respective directions, Han and Kanade [10] proposed a weak-perspective reconstruction method that achieves a closed-form solution, under the assumption of constant moving velocities. A more generalized solution to reconstructing the shapes that deform at constant velocity is presented in [23]. Bregler and his colleagues [6] for the ?rst time incorporated the basis representation of non-rigid shapes into the reconstruction process. By analyzing the low rank of the image measurements, they proposed a factorization-based method that enforces the orthonormality constraints on camera rotations to reconstruct the non-rigid shape and motion. Torresani and his colleagues [18] extended the method in [6] to a trilinear optimization approach. At each

A Closed-Form Solution to Non-Rigid Shape and Motion Recovery

3

step, two of the three types of unknowns, bases, coe?cients, and rotations, are ?xed and the remaining one is updated. The method in [6] is used to initialize the optimization process. Brand [4] proposed a similar non-linear optimization method that uses an extension of the method in [6] for initialization. All the three methods enforce the rotation constraints alone and thus cannot guarantee the optimal solution. Note that because both of the non-linear optimization methods involve a large number of variables, e.g. the number of coe?cients equals the product of the number of the images and the number of the shape bases, their performances greatly rely on the quality of the initial estimates.

3

Problem Statement

T

Given 2D locations of P feature points across F frames, {(u, v )f p |f = 1, ..., F, p = 1, ..., P }, our goal is to recover the motion of the non-rigid object relative to the camera, including rotations {Rf |f = 1, ..., F } and translations {tf |f = 1, ..., F }, and its 3D deforming shapes T {(x, y, z )f p |f = 1, ..., F, p = 1, ..., P }. Throughout this paper, we assume:

– – – the deforming shapes can be represented as weighted combinations of shape bases; the 3D structure and the camera motion are non-degenerate; the camera projection model is the weak-perspective projection model.

We follow the representation of [3, 6]. The non-rigid shapes are represented as weighted combinations of K shape bases {Bi , i = 1, ..., K }. The bases are 3 × P matrices controlling the deformation of P points. Then the 3D coordinate of the point p at the frame f is

K Xf p = (x, y, z )T f p = Σi=1 cf i bip

f = 1, ..., F, p = 1, ..., P

(1)

where bip is the pth column of Bi and cif is its combination coe?cient at the frame f . The image coordinate of Xf p under the weak perspective projection model is

xf p = (u, v )T f p = sf (Rf · Xf p + tf ) (2)

where Rf stands for the ?rst two rows of the fth camera rotation and tf = [tf x tf y ]T is its translation relative to the world origin. sf is the scalar of the weak perspective projection. Replacing Xf p using Eq. (1) and absorbing sf into cf i and tf , we have

xf p = cf 1 Rf ... cf K Rf · b 1p ... bKp + tf (3)

Suppose the image coordinates of all P feature points across F frames are known. We form a 2F ×P measurement matrix W by stacking all image coordinates. Then W = M B +T [11...1], where M is a 2F × 3K scaled rotation matrix, B is a 3K × P bases matrix, and T is a 2F × 1 translation vector, ? ? ? ?

M =?

?

c11 R1 . . . cF 1 RF

... c1K R1 b11 ? ? . . . . . , B = ? ? . . . . ... cF K RF bK 1

... b1P . . ? , T = tT ... tT . 1 F . . . ? ... bKP

T

(4)

As in [10, 6], since all points on the shape share the same translation, we position the world origin at the scene center and compute the translation vector by averaging the image

4

Jing Xiao et al.

projections of all points. This step does not introduce ambiguities, provided that the correspondences of all the points across all the images are given, i.e. there are no missing data. ? = M B. We then subtract it from W and obtain the registered measurement matrix W Under the non-degenerate cases, the 2F × 3K scaled rotation matrix M and the 3K × P shape bases matrix B are both of full rank, respectively min{2F, 3K } and min{3K, P }. Their ? , is of rank min{3K, 2F, P }. In practice, the frame number F and point number P product, W are usually much larger than the basis number K such that 2F > 3K and P > 3K . Thus the ?) ? is 3K and K is determined by K = rank(W ? into the product rank of W . We then factorize W 3 ? ? of a 2F × 3K matrix M and a 3K × P matrix B , using Singular Value Decomposition (SVD). This decomposition is only determined up to a non-singular 3K × 3K linear transformation. The true scaled rotation matrix M and bases matrix B are of the form,

? · G, M =M ? B = G?1 · B (5)

where the non-singular 3K × 3K matrix G is called the corrective transformation matrix. Once G is determined, M and B are obtained and thus the rotations, shape bases, and combination coe?cients are recovered. All the procedures above, except obtaining G, are standard and well-understood [3, 6]. The problem of nonrigid shape and motion recovery is now reduced to: given the measurement matrix W , how can we compute the corrective transformation matrix G?

4

Metric Constraints

G is made up of K triple-columns, denoted as gk , k = 1, . . . , K . Each of them is a 3K × 3 matrix. They are independent on each other because G is non-singular. According to Eq. (4,5), gk satis?es,

? gk = M c1k R1 ... cF k RF (6)

Then,

T T c2 c1k c2k R1 R2 1k R1 R1 T 2 T c2k R2 R2 ? c1k c2k R2 R1 =? . . ? . . . . T T c1k cF k RF R1 c2k cF k RF R2

?

T ?T ? gk gk M M

T . . . c1k cF k R1 RF T . . . c2k cF k R2 RF ? ? . .. ? . . . 2 T . . . cF k RF RF

?

(7)

We denote gk gk T by Qk , a 3K × 3K symmetric matrix. Once Qk is determined, gk can be computed using SVD. To compute Qk , two types of metric constraints are available and should be imposed: rotation constraints and basis constraints. While using only the rotation constraints [6, 4] leads to ambiguous and invalid solutions, enforcing both sets of constraints results in a linear closed-form solution for Qk . 4.1 Rotation Constraints

The orthonormality constraints on the rotation matrices are one of the most powerful metric constraints and they have been used in reconstructing the shape and motion for static objects [17, 14], multiple moving objects [8, 10], and non-rigid deforming objects [6, 18, 4]. According to Eq. (7), we have,

A Closed-Form Solution to Non-Rigid Shape and Motion Recovery

5

T T ? 2i?1:2i Qk M ?2 M j ?1:2j = cik cjk Ri Rj ,

i, j = 1, ...F

(8)

? 2i?1:2i represents the ith bi-row of M ? . Due to orthonormality of the rotation matrices, where M we have,

T 2 ? 2i?1:2i Qk M ?2 M i?1:2i = cik I2×2 ,

i = 1, ..., F

(9)

where I2×2 is a 2 × 2 identity matrix. The two diagonal elements of Eq. (9) yield one linear constraints on Qk , since cik is unknown. The two o?-diagonal constraints are identical, because Qk is symmetric. For all F frames, we obtain 2F linear constraints as follows,

T ? 2i?1 Qk M ?2 ?T ? M i?1 ? M2i Qk M2i = 0, T ?2 ? 2i?1 Qk M M i = 0,

i = 1, ..., F i = 1, ..., F

(10) (11)

We consider the entries of Qk as variables and neglect the nonlinear constraints implicitly T , where gk has only 9K free entries. Since Qk is embedded in the formation of Qk = gk gk 2 (9K +3K ) independent variables. It appears that, when enough images symmetric, it contains 2

2

+3K ) are given, i.e. 2F ≥ (9K 2 , the rotation constraints in Eq. (10,11) should be su?cient to determine Qk via the linear least-square method. However, it is not true in general. We will show that most of the rotation constraints are redundant and they are inherently insu?cient to resolve Qk .

4.2

Why Are Rotation Constraints Not Su?cient?

Under speci?c assumptions like static scene or constant speed of deformation, the orthonormality constraints are su?cient to reconstruct the 3D shapes and camera rotations [17, 10]. In general cases, however, no matter how many images or feature points are given, the solution of the rotation constraints in Eq. (10,11) is inherently ambiguous. De?nition 1. A 3K × 3K symmetric matrix Y is called a block-skew-symmetric matrix, if all the diagonal 3 × 3 blocks are zero matrices and each o?-diagonal 3 × 3 block is a skew symmetric matrix.

0 yij 1 yij 2 ?yij 1 0 yij 3 ?yij 2 ?yij 3 0

Yij =

T T = ?Yij = Yji , i=j

(12) (13)

Yii = 03×3 ,

i, j = 1, ..., K

Each o?-diagonal block consists of 3 independent elements. Because Y is skew-symmetric and ?1) ?1) independent o?-diagonal blocks, it includes 3K (K independent elements. has K (K 2 2 De?nition 2. A 3K × 3K symmetric matrix Z is called a block-scaled-identity matrix, if each 3 × 3 block is a scaled identity matrix, i.e. Zij = λij I3×3 , where λij is the only variable. Because Z is symmetric, the total number of variables in Z equals the number of independent +1) blocks, K (K . 2

6

Jing Xiao et al.

Theorem 1. The general solution of the rotation constraints in Eq. (10,11) is GHk GT , where G is the desired corrective transformation matrix, and Hk is the summation of an arbitrary block-skew-symmetric matrix and an arbitrary block-scaled-identity matrix, given a minimum 2 2 +K +K of K 2 images containing di?erent (non-proportional) shapes and another K 2 images with non-degenerate (non-planar) rotations. ? k as the general solution of Eq. (10,11). Since G is a non-singular Proof. Let us denote Q ? k G?T . We then prove ? square matrix, Qk can be represented as GHk GT , where Hk = G?1 Q that Hk must be the summation of a block-skew-symmetric matrix and a block-scaled-identity matrix. According to Eq. (5,9),

? ?T ? c2 ik I2×2 = M2i?1:2i Qk M2i?1:2i T ?2 ? 2i?1:2i GHk GT M =M i?1:2i

T = M2i?1:2i Hk M2 i?1:2i ,

i = 1, ..., F

(14)

Hk consists of K 2 3 × 3 blocks, denoted as Hkmn , m,n=1,. . .,K . Combining Eq. (4) and (14), we have,

K 2 K T T 2 Ri Σm =1 (cim Hkmm + Σn=m+1 cim cin (Hkmn + Hkmn ))Ri = cik I2×2 ,

i = 1, ..., F

(15)

K 2 K T Denote the 3 × 3 symmetric matrix Σm =1 (cim Hkmm + Σn=m+1 cim cin (Hkmn + Hkmn )) by T ? = c2 Γik . Then Eq. (15) becomes Ri Γik Ri ik I2×2 . Let Γik be its homogeneous solution, i.e. T ? Ri Γik Ri = 02×2 . So we are looking for a symmetric matrix Γ? ik in the null space of the map T . Because the two rows of the 2 × 3 matrix Ri are orthonormal, we have, X →Ri XRi

T T Γ? ik = ri3 δik + δik ri3

(16)

where ri3 is a unitary 1 × 3 vector that are orthogonal to both rows of Ri . δik is an arbitrary T has 3 degrees of freedom. 1 × 3 vector, indicating the null space of the map X →Ri XRi 2 T 2 Γik = cik I3×3 is a particular solution of Ri Γik Ri = cik I2×2 . Thus the general solution of Eq. (15) is,

K 2 K T 2 ? Σm =1 (cim Hkmm + Σn=m+1 cim cin (Hkmn + Hkmn )) = Γik = cik I3×3 + αik Γik ,

i = 1, ..., F

(17)

where αik is an arbitrary scalar. 2 +K Let us denote N = K 2 . Each image yields a set of N coe?cients (c2 i1 , ci1 ci2 , . . ., ci1 ciK , 2 2 ci2 , ci2 ci3 , . . ., ciK ?1 ciK , ciK ) in Eq. (17). We call them the quadratic-form coe?cients. The quadratic-form coe?cients associated with images with non-proportional shapes are generally independent. In our test, the samples of randomly generated N di?erent (non-proportional) shapes always yield independent N sets of quadratic-form coe?cients. Thus given N images with di?erent (non-proportional) shapes, the associated N quadratic-form coe?cient sets span the space of the quadratic-form coe?cient in Eq. (17). The linear combinations of these N sets compose the quadratic-form coe?cients associated with any other images. Since Hkmm and Hkmn are common in all images, for any additional image j = 1, . . . , F , Γjk is described as follows,

A Closed-Form Solution to Non-Rigid Shape and Motion Recovery

7

=?

c2 jk I3×3

N Γjk = Σl =1 λlk Γlk N 2 ? + αjk Γ? jk = Σl=1 λlk (clk I3×3 + αlk Γlk ) N ? ? =? αjk Γjk = Σl=1 λlk (αlk Γlk )

(18)

? where λlk is the combination weights. We substitute Γ? jk and Γlk by Eq. (16) and absorb αjk and αlk into δjk and δlk respectively due to their arbitrary values. Eq. (18) is then rewritten as,

T T N T T rj 3 δjk + δjk rj 3 ? Σl=1 λlk (rl3 δlk + δlk rl3 ) = 0

(19)

Due to symmetry, Eq. (19) yields 6 linear constraints on δlk , l = 1, . . . , N and δjk that in total include 3N + 3 variables. These constraints depend on the rotations in the images. Given x additional images with non-degenerate (non-planar) rotations, we obtain 6x di?erent linear constraints and 3N + 3x free variables. When 6x ≥ (3N + 3x), i.e. x ≥ N , these constraints lead to a unique solution, δlk = δjk = 0, l = 1, . . . , N, j = 1, . . . , x. Therefore, given a minimum of N images with non-degenerate rotations together with the N images containing di?erent shapes, Eq. (17) can be rewritten as follows,

K 2 K T 2 Γik = Σm =1 (cim Hkmm + Σn=m+1 cim cin (Hkmn + Hkmn )) = cik I3×3 , i = 1, . . . , 2N

(20)

Eq. (20) completely depend on the coe?cients. According to Eq. (18, 20), for any image N 2 2 j = 2N + 1, . . . , F , Γjk = ΣlN =1 λlk Γlk = Σl=1 λlk clk I3×3 = cjk I3×3 . Therefore Eq. (20) is T ) by Θkmn . Because the right side of Eq. (20) satis?ed for all F images. Denote (Hkmn + Hkmn o is a scaled identity matrix, for each o?-diagonal element ho kmm and θkmn , we achieve the following linear equation,

K 2 o K o Σm =1 (cim hkmm + Σn=m+1 cim cin θkmn )) = 0, i = 1, . . . , F

(21)

Using the N given images containing di?erent shapes, Eq. (21) leads to a non-singular linear o equation set on ho kmm and θkmn . The right sides of the equations are zeros. Thus the solution is a zero vector, i.e. the o?-diagonal elements of Hkmm and Θkmn are all zeros. Similarly, we can derive the constraint as Eq. (21) on the di?erence between any two diagonal elements. Therefore the di?erence is zero, i.e. the diagonal elements are all equivalent. We thus have,

Hkmm = λkmm I3×3 , m = 1, ..., K Hkmn +

T Hkmn

(22) (23)

= λkmn I3×3 , m = 1, ..., K, n = m + 1, ..., K

where λkmm and λkmn are arbitrary scalars. According to Eq. (22), the diagonal block Hkmm λkmn T is a scaled identity matrix. Due to Eq. (23), Hkmn ? λkmn 2 I3×3 = ?(Hkmn ? 2 I3×3 ) , i.e. λkmn Hkmn ? 2 I3×3 is skew-symmetric. Thus the o?-diagonal block Hkmn equals the summation λkmn of a scaled identity block, λkmn 2 I3×3 , and a skew-symmetric block, Hkmn ? 2 I3×3 . Since λkmm and λkmn are arbitrary, the entire matrix Hk is the summation of an arbitrary blockskew-symmetric matrix and an arbitrary block-scaled-identity matrix, given a minimum of N images with non-degenerate rotations together with N images containing di?erent shapes, in total 2N = K 2 + K images.

8

Jing Xiao et al.

Let Yk denote the block-skew-symmetric matrix and Zk denote the block-scaled-identity ?1) +1) and K (K independent elematrix in Hk . Since Yk and Zk respectively contain 3K (K 2 2 2 ments, Hk include 2K ? K free elements, i.e. the solution of the rotation constraints has 2K 2 ? K degrees of freedom. In rigid cases, i.e. K = 1, the solution is unique, as suggested in [17]. For non-rigid objects, i.e. K > 1, the rotation constraints result in an ambiguous solution space. This space contains invalid solutions. Speci?cally, because the desired Qk = gk gk T is positive semi-de?nite, the solution GHk GT is not valid when Hk is not positive semi-de?nite. The solution space includes many instances that refer to non-positive-semi-de?nite Hk . For example, when the block-scaled-identity matrix Zk is zero, Hk equals the block-skew-symmetric matrix Yk , which is not positive semi-de?nite. 4.3 Basis Constraints

The only di?erence between non-rigid and rigid situations is that the non-rigid shape is a weighted combination of certain shape bases. The rotation constraints are su?cient for recovering the rigid shapes, but they cannot determine a unique set of shape bases in the non-rigid cases. Instead any non-singular linear transformation applied on the bases leads to another set of eligible bases. Intuitively, the non-uniqueness of the bases results in the ambiguity of the solution by enforcing the rotation constraints alone. We thus introduce the basis constraints that determine a unique basis set and resolve the ambiguity. Because the deformable shapes lie in a K -basis linear space, any K independent shapes in the space form an eligible bases set. We measure the independence of the K shapes using the condition number of the matrix that stacks the 3D coordinates of the shapes like the measurement matrix W . The condition number is in?nitely big, if any of the K shapes is dependent on the other shapes, i.e. it can be described as a linear combination of the other shapes. If the K shapes are orthonormal, i.e. completely independent and equally dominant, the condition number achieves its minimum, 1. Between 1 and in?nity, a smaller condition number means that the shapes are more independent and more equally dominant. ? are, The registered image measurements of these K shapes in W ? ? ? ?? ? ?W ? ? 2 ? . ? . .

?K W ?1 W

? ? ? ? ?=? ?

R1 0 . . . 0

0 R2 . . . 0

... ... . . . ...

0 S1 0 ? ? S2 ? ?? . ? . ?? . ? . . . SK RK

(24)

? i and Si , i = 1, . . . , K are respectively the image measurements and the 3D coorwhere W dinates of the K shapes. On the right side of Eq. (24), since the matrix composed of the rotations is orthonormal, the projecting process does not change the singular values of the 3D coordinate matrix. Therefore, the condition number of the measurement matrix has the same power as that of the 3D coordinate matrix to represent the independence of the K shapes. Accordingly we compute the condition number of the measurement matrix in each group of K images and identify a group of K images with the smallest condition number. The 3D shapes in these K images are speci?ed as the shape bases. Note that so far we have not recovered the bases explicitly, but decided in which frames they are located. This information implicitly determines a unique set of bases. We denote the selected frames as the ?rst K images in the sequence. The corresponding coe?cients are,

cii = 1, i = 1, ..., K cij = 0, i, j = 1, ..., K, i = j (25)

A Closed-Form Solution to Non-Rigid Shape and Motion Recovery

9

Let ? denote the set, {(i, j )|i = 1, ..., K ; i = k ; j = 1, ..., F }. According to Eq. (8,25), we have,

T ? 2i?1 Qk M ?2 M j ?1 = T ?2 ? 2 i Qk M M j = T ?2 ? 2i?1 Qk M M j = 0, T ? 2j ? 1 = 0 , ? 2 i Qk M M

1, i = j = k 0, (i, j ) ∈ ? 1, i = j = k 0, (i, j ) ∈ ? (i, j ) ∈ ? or i = j = k (i, j ) ∈ ? or i = j = k

(26) (27) (28) (29)

These 4F (K ? 1) linear constraints are called the basis constraints.

5

A Closed-Form Solution

Due to Theorem 1, enforcing the rotation constraints on Qk leads to the ambiguous solution GHk GT . Hk consists of K 2 3 × 3 blocks, Hkmn , m,n=1,. . .,K . Hkmn contains four independent entries as follows,

Hkmn = h1 h2 h3 ?h2 h1 h4 ?h3 ?h4 h1 (30)

Lemma 1 Hkmn is a zero matrix if,

T Ri Hkmn Rj = 02×2

(31)

where Ri and Rj are non-degenerate 2 × 3 rotation matrices. Proof. First we prove that the rank of Hkmn is at most 2. Due to the orthonormality of rotation matrices, from Eq. (31), we have,

T T T T Hkmn = ri 3 δik + δjk rj 3 = ri3 δjk

δik rj 3

(32)

where ri3 and rj 3 respectively are the cross products of the two rows of Ri and those of Rj . δik and δjk are arbitrary 1 × 3 vectors. Because both matrices on the right side of Eq. (32) are at most of rank 2, the rank of Hkmn is at most 2. 4 Next, we prove h1 = 0. Since Hkmn is 3 × 3 matrix of rank 2, its determinant, h1 ( i=1 hi 2 ), equals 0. Therefore h1 = 0, i.e. Hkmn is a skew-symmetric matrix. Finally we prove h2 = h3 = h4 = 0. Denote the rows of Ri and Rj as ri1 , ri2 , rj 1 , and rj 2 respectively. Since h1 = 0, we can rewrite Eq. (31) as follows,

ri1 · (h × rj 1 ) ri1 · (h × rj 2 ) ri2 · (h × rj 1 ) ri2 · (h × rj 2 ) = 02×2 (33)

where h = (?h4 h3 ? h2 ). Eq. (33) means that the vector h is located in the intersection of the four planes determined by (ri1 , rj 1 ), (ri1 , rj 2 ), (ri2 , rj 1 ), and (ri2 , rj 2 ). Since Ri and Rj are non-degenerate rotation matrices, ri1 , ri2 , rj 1 , and rj 2 do not lie in the same plane, hence the four planes intersect at the origin, i.e. h = (?h4 h3 ? h2 ) = 01×3 . Therefore Hkmn is a zero matrix.

10

Jing Xiao et al.

Using Lemma 1, we derive the following theorem, Theorem 2. Enforcing both the basis constraints and the rotation constraints results in a 2 +K closed-form solution of Qk , given a minimum of K 2 images containing di?erent (non2 K +K images with non-degenerate (non-planar) proportional) shapes together with another 2 rotations.

+K Proof. According to Theorem 1, given a minimum of K 2 images containing di?erent shapes K 2 +K images with non-degenerate rotations, using the rotation constraints we and another 2 achieve the ambiguous solution of Qk = GHk GT , where G is the desired corrective transformation matrix and Hk is the summation of an arbitrary block-skew-symmetric matrix and an arbitrary block-scaled-identity matrix. Due to the basis constraints, replacing Qk in Eq. (26?29) with GHk GT ,

2

T T ? 2k?1:2k GHk GT M ?2 M (34) k?1:2k = M2k?1:2k Hk M2k?1:2k = I2×2 T T T ? ? M2i?1:2i GHk G M2j ?1:2j = M2i?1:2i Hk M2j ?1:2j = 02×2 , i = 1, ..., K, j = 1, ..., F, i = k (35)

From Eq. (4), we have,

T K K T M2i?1:2i Hk M2 j ?1:2j = Σm=1 Σn=1 cim cjn Ri Hkmn Rj , i, j = 1, ..., F

(36)

where Hkmn is the 3 × 3 block of Hk . According to Eq. (25),

T T M2i?1:2i Hk M2 j ?1:2j = Ri Hkij Rj , i, j = 1, ..., K

(37)

Combining Eq. (34,35) and (37), we have,

Rk Hkkk Rk T = I2×2 Ri Hkij Rj T = 02×2 , i, j = 1, ..., K, i = k (38) (39)

According to Eq. (22), the kth diagonal block of Hk , Hkkk , equals λkkk I3×3 . Thus by Eq. (38), 2 +K λkkk = 1 and Hkkk = I3×3 . We denote K of the K 2 given images with non-degenerate rotations as the ?rst K images in the sequence. Due to Lemma 1 and Eq. (39), Hkij , i, j = 1, ..., K, i = k , are zero matrices. Since Hk is symmetric, the other blocks in Hk , Hkij , i = k, j = 1, ..., K, j = k , are also zero matrices. Thus,

GHk GT = (g1 . . . gK )Hk (g1 . . . gK )T = (0 . . . gk . . . 0)(g1 . . . gK )T

T = gk gk = Qk

(40)

i.e. a closed-form solution of the desired Qk has been achieved.

A Closed-Form Solution to Non-Rigid Shape and Motion Recovery

11

T According to Theorem 2, we compute Qk = gk gk , k = 1, ..., K , by solving the linear equations, Eq. (10?11,26?29), via the least square methods. We then recover gk by decomposing Qk via SVD. The decomposition of Qk is up to an arbitrary 3 × 3 orthonormal transformation Φk , since (gk Φk )(gk Φk )T also equals Qk . This ambiguity arises from the fact that g1 , . . ., gK are estimated independently under di?erent coordinate systems. To resolve the ambiguity, we need to transform g1 , . . ., gK to be under a common reference coordinate system. Due to Eq. (6), M2i?1:2i gk = cik Ri , i = 1, . . . , F . Because the rotation matrix Ri is orM2i?1:2i gk thonormal, i.e. Ri = 1, we have Ri = ± M . The sign of Ri determines which 2i?1:2i gk orientations are in front of the camera. It can be either positive or negative, determined by the reference coordinate system. Since g1 , . . ., gK are estimated independently, they lead to respective rotation sets, each two of which are di?erent up to a 3 × 3 orthonormal transformation. We choose one set of the rotations to specify the reference coordinate system. Then the signs of the other sets of rotations are determined in such a way that these rotations are consistent with the corresponding references. Finally the orthogonal Procrustes method [15] is applied to compute the orthonormal transformations from the rotation sets to the reference. One of the reviewers pointed out that, computing such transformation Φk is equivalent to computing a homography at in?nity. For details, please refer to [11].

4 3 2 1 0 ?1 ?2 ?3 ?3 ?2 ?1 0 1 2 3 4 10 4 7 3 5 1 2 6 8 9

9 6 5 4 3 2 1 0 0 1 2 1 2 3 3 4 4 5 5 6 6 7 10 5 4 1 3 2 6 8 6 5 4 3 2 0 1 0 0 1 2 1 2 3 3 4 4 5 5 6 6 7 10 5 4 1 3 2 6 8 0 9

(a)

9 6 5 4 3 2 1 0 0 1 2 1 2 3 3 4 4 5 5 6 6 7 10 5 4 1 3 2 6 8 6 5 4 3 2 0 1 0 0 7 10 5 4 1 3 9

(b)

9 6 5 2 6 8 4 3 2 0 1 1 2 1 2 3 3 4 4 5 5 6 6 0 0 7 10 5 4 1 3

(c)

2 6 8 0 1 2

1

2

3 3 4 4 5 5 6 6

(d)

(e)

(f )

Fig. 1. A static cube and 3 points moving along straight lines. (a) Input image. (b) Ground truth 3D shape and camera trajectory. (c) Reconstruction by the closed-form solution. (d) Reconstruction by the method in [6]. (e) Reconstruction by the method in [4] after 4000 iterations. (f) Reconstruction by the tri-linear method [18] after 4000 iterations.

The transformed g1 , . . ., gK form the desired corrective transformation G. The coe?cients are then computed by Eq. (6), and the shape bases are recovered by Eq. (5). Their combinations reconstruct the non-rigid 3D shapes. Note that one could derive the similar basis and rotation constraints directly on Q = GGT , T , where gk is where G is the entire corrective transformation matrix, instead of Qk = gk gk

12

Jing Xiao et al.

the kth triple-column of G. It is easy to show that enforcing these constraints also results in a linear closed-form solution of Q. However, the decomposition of Q to recover G is up to a 3K × 3K orthonormal ambiguity transformation. Eliminating this ambiguity alone is as hard as the original problem of determining G.

6

Performance Evaluation

The performance of the closed-form solution was evaluated in a number of experiments. 6.1 Comparison with Three Previous Methods

100 90 Relative Reconstruction Error (%) 80 70 60 50 40 30 20 10 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 Frame Number Relative Reconstruction Error (%)

100

Relative errors on motion Relative errors on shape Relative errors on image measurement

90 80 70 60 50 40 30 20 10 0 1 2

Relative errors on motion Relative errors on shape Relative errors on image measurement

3

4

5

6

7 8 9 10 11 12 13 14 15 16 Frame Number

(a)

100 90 Relative Reconstruction Error (%) 80 70 60 50 40 30 20 10 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 Frame Number Relative Reconstruction Error (%) 100

(b)

90 80 70 60 50 40 30 20 10 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 Frame Number

Relative errors on motion Relative errors on shape Relative errors on image measurement

Relative errors on motion Relative errors on shape Relative errors on image measurement

(c)

(d)

Fig. 2. The relative errors on reconstruction of a static cube and 3 points moving along straight lines. (a) By the closed-form solution. (b) By the method in [6]. (c) By the method in [4] after 4000 iterations. (d) By the trilinear method [18] after 4000 iterations. The scaling of the error axis is [0%, 100%]. Note that our method achieved zero reconstruction errors.

We ?rst compared the solution with three related methods [6, 4, 18] in a simple noiseless setting. Fig.1 shows a scene consisting of a static cube and 3 moving points. The measurement included 10 points: 7 visible vertices of the cube and 3 moving points. The 3 points moved along the three axes simultaneously at varying speed. This setting involved K = 2 shape bases,

A Closed-Form Solution to Non-Rigid Shape and Motion Recovery

13

one for the static cube and another for the linear motions. While the points were moving, the camera was rotating around the scene. A sequence of 16 frames were captured. One of them is shown in Fig.1.(a). Fig.1.(b) demonstrates the ground truth shape in this frame and the ground truth camera trajectory from the ?rst frame till this frame. The three orthogonal green bars show the present camera pose and the red bars display the camera poses in the previous frames. Fig.1.(c) to (f) show the structures and camera trajectories reconstructed using the closed-form solution, the method in [6], the method in [4], and the tri-linear method [18], respectively. While the closed-form solution achieved the exact reconstruction with zero error, all the three previous methods resulted in apparent errors, even for such a simple noiseless setting. Fig.2 demonstrates the reconstruction errors of the four methods on camera rotations, shapes, and image measurements. The error was computed as the percentage relative to the ?T ruth . Note that because the space of rotations is a manground truth, Reconstruction T ruth ifold, a better error measurement for rotations is the Riemannian distance, d(Ri , Rj ) =

j acos( ). However it is measured in degrees. For consistency, we used the relative 2 percentage for all the three reconstruction errors.

trace(Ri RT )

6.2

Quantitative Evaluation on Synthetic Data

Our approach was then quantitatively evaluated on the synthetic data. We evaluated the accuracy and robustness on three factors: deformation strength, number of shape bases, and noise level. The deformation strength shows how close to rigid the shape is. It is represented max( Bi , Bj ) by the mean power ratio between each two bases, i.e. meani,j min( Bi , Bj ) . Larger ratio means weaker deformation, i.e. the shape is closer to rigid. The number of shape bases represents the ?exibility of the shape. A bigger basis number means that the shape is more ?exible. Assuming a Gaussian white noise, we represent the noise strength level by the ratio between the Frobenius norm of the noise and the measurement, i.e. noise . In general, when ? W noise exists, a weaker deformation leads to better performance, because some deformation mode is more dominant and the noise relative to the dominant basis is weaker; a bigger basis number results in poorer performance, because the noise relative to each individual basis is stronger. Fig.3.(a) and (b) show the performance of our algorithm under various deformation strength and noise levels on a two bases setting. The power ratios were respectively 20 , 21 , ..., and 28 . Four levels of Gaussian white noise were imposed. Their strength levels were 0%, 5%, 10%, and 20% respectively. We tested a number of trials on each setting and computed the average reconstruction errors on the rotations and 3D shapes. The errors were measured by the relative percentage as in Section 6.1. Fig.3.(c) and (d) show the performance of our method under di?erent numbers of shape bases and noise levels. The basis number was 2, 3, ... , and 10 respectively. The bases had equal powers and thus none of them was dominant. The same noise as in the last experiment was imposed. In both experiments, when the noise level was 0%, the closed-form solution recovered the exact rotations and shapes with zero error. When there was noise, it achieved reasonable accuracy, e.g. the maximum reconstruction error was less than 15% when the noise level was 20%. As we expected, under the same noise level, the performance was better when the power ratio was larger and poorer when the basis number was bigger. Note that in all the experiments, the condition number of the linear system consisting of both basis constraints and rotation constraints had order of magnitude O(10) to O(102 ), even if the basis number was big and the deformation was strong. It suggests that our closed-form solution is numerically stable.

14

Jing Xiao et al.

Relative reconstruction errors on rotation (%)

18 16 14 12 10 8 6 4 2 0 0

||noise|| = 0%*||W|| ||noise|| = 5%*||W|| ||noise|| = 10%*||W|| ||noise|| = 20%*||W||

Relative reconstruction errors on shape (%)

20

20 18 16 14 12 10 8 6 4 2 0 0 1 2 3 4 5 6 7 8 9 10 Log of the power ratios between the two bases (log(ratio))

||noise|| = 0%*||W|| ||noise|| = 5%*||W|| ||noise|| = 10%*||W|| ||noise|| = 20%*||W||

1 2 3 4 5 6 7 8 9 10 Log of the power ratios between the two bases (log(ratio))

(a)

Relative reconstruction errors on rotations (%) 18 16 14 12 10 8 6 4 2 0 0 1 2 3 4 5 6 7 Number of shape bases 8 9 10

(b)

Relative reconstruction errors on shape (%) 20 18 16 14 12 10 8 6 4 2 0 0 1 2 3 4 5 6 7 Number of shape bases 8 9 10

20

||noise|| = 0%*||W|| ||noise|| = 5%*||W|| ||noise|| = 10%*||W|| ||noise|| = 20%*||W||

||noise|| = 0%*||W|| ||noise|| = 5%*||W|| ||noise|| = 10%*||W|| ||noise|| = 20%*||W||

(c)

(d)

Fig. 3. (a)&(b) Reconstruction errors on rotations and shapes under di?erent levels of noise and deformation strength. (c)&(d) Reconstruction errors on rotations and shapes under di?erent levels of noise and various basis numbers. Each curve respectively refers to a noise level. The scaling of the error axis is [0%, 20%].

A Closed-Form Solution to Non-Rigid Shape and Motion Recovery

15

6.3

Qualitative Evaluation on Real Video Sequences

(a)

(b)

(c)

(d)

(e)

(f )

Fig. 4. Reconstruction of three moving objects in the static background. (a)&(d) Two input images with marked features. (b)&(e) Reconstruction by the closed-form solution. The yellow lines show the recovered trajectories from the beginning of the sequence until the present frames. (c)&(f ) Reconstruction by the method in [4]. The yellow-circled area shows that the plane, which should be on top of the slope, was mistakenly located underneath the slope.

Finally we examined our approach qualitatively on a number of real video sequences. One example is shown in Fig.4. The sequence was taken of an indoor scene by a handhold camera. Three objects, a car, a plane, and a toy person, moved along ?xed directions and at varying speeds. The rest of the scene was static. The car and the person moved on the ?oor and the plane moved along a slope. The scene structure was composed of two bases, one for the static objects and another for the linear motions. 32 feature points tracked across 18 images were used for reconstruction. Two of the them are shown in Fig.4.(a) and (d). ? was estimated in such a way that 99% of the energy of W ? could remain The rank of W after the factorization using the rank constraint. The number of bases was thus determined by ?) (W . Then the camera rotations and dynamic scene structures were reconstructed K = rank 3 using the proposed method. With the recovered shapes, we could view the scene appearance from any novel directions. An example is shown in Fig.4.(b) and (e). The wireframes show the scene shapes and the yellow lines show the trajectories of the moving objects from the beginning of the sequence until the present frames. The reconstruction was consistent with our observation, e.g. the plane moved linearly on top of the slope. Fig.4.(c) and (f) show the reconstruction using the method in [4]. The recovered shapes of the boxes were distorted and the plane was incorrectly located underneath the slope, as shown in the yellow circles. Note that occlusion was not taken into account when rendering these images. Thus in the regions that should be occluded, e.g. the area behind the slope, the stretched texture of the occluding objects appeared.

16

Jing Xiao et al.

Human faces are highly non-rigid objects and 3D face shapes can be regarded as weighted combinations of certain shape bases that refer to various facial expressions. Thus our approach is capable of reconstructing the deformable 3D face shapes from the 2D image sequence. One example is shown in Fig.5. The sequence consisted of 236 images that contained facial expressions like eye blinking and mouth opening. 60 feature points were tracked using an e?cient 2D Active Appearance Model (AAM) method [1]. Fig.5.(a) and (d) display two of the input images with marked feature points. After reconstructing the shapes and poses, we could view the 3D face appearances in any novel poses. Two examples are shown respectively in Fig.5.(b) and (e). Their corresponding 3D shape wireframes, as shown in Fig.5.(c) and (f), exhibit the recovered facial deformations such as mouth opening and eye closure. Note that the feature correspondences in these experiments were noisy, especially for those features on the sides of the face. The reconstruction performance of our approach demonstrates its robustness to the image noise.

(a)

(b)

(c)

(d)

(e)

(f )

Fig. 5. Reconstruction of shapes of human faces carrying expressions. (a)&(d) Input images. (b)&(e) Reconstructed 3D face appearances in novel poses. (c)&(f ) Shape wireframes demonstrating the recovered facial deformations such as mouth opening and eye closure.

7

Conclusion and Discussion

This paper proposes a linear closed-form solution to the problem of non-rigid shape and motion recovery from a single-camera video. In particular, we have proven that enforcing only the rotation constraints results in ambiguous and invalid solutions. We thus introduce the basis constraints to resolve this ambiguity. We have also proven that imposing both the linear constraints leads to a unique reconstruction of the non-rigid shape and motion. The performance of our algorithm is demonstrated by experiments on both simulated data and real video data. Our algorithm has also been successfully applied to separate the local deformations from the global rotations and translations in the 3D motion capture data [7].

A Closed-Form Solution to Non-Rigid Shape and Motion Recovery

17

Currently our approach does not consider the degenerate deformations. A shape basis is degenerate, if its rank is either 1 or 2, i.e. it limits the shape deformation within a 2D plane. Such planar deformations occur in structures like dynamic scenes or expressive human faces. For example, when a scene consists of several buildings and one car moving straight along a street, the shape basis referring to the rank-1 car translation is degenerate. It is conceivable that, in such degenerate cases, the basis constraints cannot completely resolve the ambiguity of the rotation constraints. We are now exploring how to extend the current method to reconstructing the shapes involving degenerate deformations. Another limitation of our approach is that we assume the weak perspective projection model. It would be interesting to see how the proposed solution could be extended to the full perspective projection model.

Acknowledgments

We would like to thank Simon Baker, Iain Matthews, and Mei Han for providing the image data and feature correspondence used in Section 6.3, thank the ECCV and IJCV reviewers for their valuable comments, and thank Jessica Hodgins for proofreading the paper. Jinxiang Chai was supported by the NSF through EIA0196217 and IIS0205224. Jing Xiao and Takeo Kanade were partly supported by grant R01 MH51435 from the National Institute of Mental Health.

References

1. S. Baker, I. Matthews, “ Equivalence and e?ciency of image alignment algorithms,” Proc. Int. Conf. Computer Vision and Pattern Recognition, 2001. 2. B. Bascle, A. Blake,“ Separability of pose and expression in facial tracing and animation,”Proc. Int. Conf. Computer Vision, pp. 323-328, 1998. 3. V. Blanz, T. Vetter, “ A morphable model for the synthesis of 3D faces,” Proc. SIGGRAPH’99, pp. 187-194, 1999. 4. M. Brand, “ Morphable 3D models from video,” Proc. Int. Conf. Computer Vision and Pattern Recognition, 2001. 5. M. Brand, R. Bhotika, “ Flexible ?ow for 3D nonrigid tracking and shape recovery,”Proc. Int. Conf. Computer Vision and Pattern Recognition, 2001. 6. C. Bregler, A. Hertzmann, H. Biermann, “ Recovering non-rigid 3D shape from image streams,” Proc. Int. Conf. Computer Vision and Pattern Recognition, 2000. 7. J. Chai, J. Xiao, J. Hodgins, “ Vision-based control of 3D facial animation,” Eurographics/ACM Symposium on Computer Animation, 2003. 8. J. Costeira, T. Kanade, “ A multibody factorization method for independently moving-objects,” Int. Journal of Computer Vision, 29(3):159-179, 1998. 9. S.B. Gokturk, J.Y Bouguet, R. Grzeszczuk, “ A data driven model for monocular face tracking,” Proc. Int. Conf. Computer Vision, 2001. 10. M. Han, T. Kanade, “ Reconstruction of a scene with multiple linearly moving objects,” Proc. Int. Conf. Computer Vision and Pattern Recognition, 2000. 11. R.I. Hartley, A. Zisserman, “ Multiple View Geometry in Computer Vision,” Cambridge University Press, 2000. 12. K. Kanatani, “ Motion Segmentation by Subspace Separation and Model Selection,” Proc. Int. Conf. Computer Vision, 2001. 13. S. Mahamud, M. Hebert, “ Iterative Projective Reconstruction from Multiple Views,” Proc. Int. Conf. Computer Vision and Pattern Recognition, 2000. 14. C. Poelman, T. Kanade, “ A paraperspective factorization method for shape and motion recovery,” IEEE Trans. Pattern Analysis and Machine Intelligence, 19(3):206-218, 1997. 15. P. H. Sch¨ onemann, “ A generalized solution of the orthogonal procrustes problem,” Psychometrika, 31(1):1-10, 1966.

18

Jing Xiao et al.

16. Y. Sugaya, K. Kanatani, “ Geometric Structure of Degeneracy for Multi-body Motion Segmentation,” ECCV Workshop on Statistical Methods in Video Processing, 2004. 17. C. Tomasi, T. Kanade, “ Shape and motion from image streams under orthography: A factorization method,” Int. Journal of Computer Vision, 9(2):137-154, 1992. 18. L. Torresani, D. Yang, G. Alexander, C. Bregler, “ Tracking and modeling non-rigid objects with rank constraints,” Proc. Int. Conf. Computer Vision and Pattern Recognition, 2001. 19. B. Triggs, “ Factorization methods for projective structure and motion,” Proc. Int. Conf. Computer Vision and Pattern Recognition,1996. 20. R. Vidal, S. Soatto, Y. Ma, S. Sastry, “ Segmentation of dynamic scenes from the multibody fundamental matrix,” ECCV Workshop on Vision and Modeling of Dynamic Scenes, 2002. 21. R. Vidal, R. Hartley, “ Motion Segmentation with Missing Data using PowerFactorization and GPCA,” Proc. Int. Conf. Computer Vision and Pattern Recognition, 2004. 22. L. Wolf, A. Shashua, “ Two-body segmentation from two perspective views,” Proc. Int. Conf. Computer Vision and Pattern Recognition, 2001. 23. L. Wolf, A. Shashua, “ On projection matrices P k → P 2 , k = 3, . . . , 6, and their applications in computer vision,” Int. Journal of Computer Vision, 48(1):53-67, 2002. 24. L. Zelnik-Manor, M. Irani, “ Degeneracies, Dependencies, and their Implications in Multi-body and Multi-Sequence Factorizations,” Proc. Int. Conf. Computer Vision and Pattern Recognition, 2003. 25. Z. Zhang, R. Deriche, O. Faugeras, Q. Luong, “ A robust technique for matching two uncalibrated images through the recovery of the unknown epipolar geometry,” Arti?cial Intelligence, 78(1/2):87119, 1995.