# Recovering non-rigid 3D shape from image streams


Christoph Bregler, Aaron Hertzmann, Henning Biermann
Media Research Laboratory, Department of Computer Science, New York University
719 Broadway, 12th Floor, New York, NY 10003
{bregler,hertzman,biermann}@mrl.nyu.edu
NYU Computer Science Technical Report, June 16, 1999

Abstract

This paper addresses the problem of recovering 3D non-rigid shape models from image sequences. For example, given a video recording of a talking person, we would like to estimate a 3D model of the lips and the full head and its internal modes of variation. Many solutions that recover 3D shape from 2D image sequences have been proposed; these so-called structure-from-motion techniques usually assume that the 3D object is rigid. For example, Tomasi and Kanade's factorization technique is based on a rigid shape matrix, which produces a tracking matrix of rank 3 under orthographic projection. We propose a novel technique based on a non-rigid model, where the 3D shape in each frame is a linear combination of a set of basis shapes. Under this model, the tracking matrix is of higher rank, and can be factored in a three-step process to yield pose, configuration and shape. We demonstrate this simple but effective algorithm on video sequences of speaking people. We were able to recover 3D non-rigid facial models with high accuracy.

1 Introduction

This paper demonstrates a new technique for recovering 3D non-rigid shape models from 2D image sequences recorded with a single camera. For example, this technique can be applied to video recordings of a talking person. It extracts a 3D model of the human face, including all facial expressions and lip movements. Previous work has treated the two problems of recovering 3D shapes from 2D image sequences and of discovering a parameterization of non-rigid shape deformations separately. Most techniques that address the structure-from-motion problem are limited to rigid objects. For example, Tomasi and Kanade’s factorization technique [8] recovers a shape matrix from image sequences. Under orthographic projection, it can be shown that the 2D tracking data matrix has rank 3 and can be factored into 3D pose and 3D shape with

Author's current address: Gates 138, 353 Serra Mall, Stanford University, Stanford, CA 94305-9010, bregler@stanford.edu

the use of the singular value decomposition (SVD). Unfortunately these techniques cannot be applied to non-rigid deforming objects, since they are based on the rigidity assumption. Most techniques that learn models of shape variation do so on the 2D appearance, and do not recover 3D structure. Popular methods are based on Principal Components Analysis (PCA). If the object deforms with K linear degrees of freedom, the covariance matrix of the shape measurements has rank K. The principal modes of variation can be recovered with the use of SVD. We show how 3D non-rigid shape models can be recovered under scaled orthographic projection. The 3D shape in each frame is a linear combination of a set of K basis shapes. Under this model, the 2D tracking matrix is of rank 3K and can be factored into 3D pose, object configuration and 3D basis shapes with the use of SVD. We demonstrate the effectiveness of this technique on several data sets, including challenging recordings of human faces during speech and with varying facial expressions. Section 2 summarizes related approaches, Section 3 describes our algorithm, and Section 4 discusses our experiments.
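The rank argument above can be checked numerically: shapes generated from K linear modes yield a measurement covariance matrix of rank K. A minimal NumPy sketch with synthetic data (variable names are our own, not from the original text):

```python
import numpy as np

rng = np.random.default_rng(0)
K, P, N = 3, 40, 200              # linear modes, landmark points, observations

# each observed 2D shape (flattened to 2P values) is a linear
# combination of K fixed basis shapes with random weights
basis = rng.standard_normal((K, 2 * P))
weights = rng.standard_normal((N, K))
shapes = weights @ basis

# the covariance of the shape measurements has rank K
cov = np.cov(shapes, rowvar=False)
print(np.linalg.matrix_rank(cov))  # -> 3
```

The principal modes themselves would then be the K leading eigenvectors of this covariance matrix.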

2 Previous Work

Many methods have been proposed to solve the structure-from-motion problem. One of the most influential of these was proposed by Tomasi and Kanade [8], who demonstrated the factorization method for rigid objects and orthographic projections. Many extensions have been proposed, such as the multi-body factorization method of Costeira and Kanade [4] that relaxes the rigidity constraint. In this method, K independently moving objects are allowed, which results in a tracking matrix of rank 3K; a permutation algorithm identifies the submatrix corresponding to each object. More recently, Bascle and Blake [1] proposed a solution for factoring facial expressions and pose during tracking. Although it exploits the bilinearity of 3D pose and non-rigid object configuration, it requires a set of basis images selected before factorization is performed. The discovery of these basis images is not part of their algorithm. Various authors have demonstrated estimation of non-rigid appearance in 2D using Principal Components Analysis [9, 6, 3]. Basu [2] demonstrates how the parameters of an initial lip model can be iteratively fitted to a video sequence. [7, 5] propose methods for recovering 3D facial models and their expressions from multiple images. These methods require key-frame images to be hand-selected, and the 3D reconstruction requires either user interaction or the placement of fiducials on the subject's face. None of these techniques is able to estimate non-rigid 3D shape models from single-view 2D video streams without any initialization. In the next section, we demonstrate how this task can be solved by multiple factorization steps.

3 Factorization Algorithm

We describe the shape of the non-rigid object as a key-frame basis set $S_1, S_2, \ldots, S_K$. Each key-frame $S_i$ is a $3 \times P$ matrix describing $P$ points. The shape of a specific configuration is a linear combination of this basis set:

$$S = \sum_{i=1}^{K} l_i \cdot S_i, \qquad S, S_i \in \mathbb{R}^{3 \times P}, \; l_i \in \mathbb{R} \tag{1}$$

Under a scaled orthographic projection, the $P$ points of a configuration $S$ are projected into 2D image points $(u_j, v_j)$:

$$\begin{bmatrix} u_1 & \cdots & u_P \\ v_1 & \cdots & v_P \end{bmatrix} = R \cdot \left( \sum_{i=1}^{K} l_i \cdot S_i \right) + T, \qquad R = \begin{bmatrix} r_1 & r_2 & r_3 \\ r_4 & r_5 & r_6 \end{bmatrix} \tag{2}$$

$R$ contains the first 2 rows of the full $3 \times 3$ camera rotation matrix, and $T$ is the camera translation. The scale of the projection is coded in the configuration weights $l_1, \ldots, l_K$. As in Tomasi–Kanade, we eliminate $T$ by subtracting the mean of all 2D points, and henceforth can assume that $S$ is centered at the origin. We can rewrite the linear combination in (2) as a matrix–matrix multiplication:

$$\begin{bmatrix} u_1 & \cdots & u_P \\ v_1 & \cdots & v_P \end{bmatrix} = \begin{bmatrix} l_1 R & l_2 R & \cdots & l_K R \end{bmatrix} \cdot \begin{bmatrix} S_1 \\ S_2 \\ \vdots \\ S_K \end{bmatrix} \tag{3}$$

We add a temporal index to each point, and denote the tracked points in frame $t$ as $(u_{t,j}, v_{t,j})$. We assume we have point tracking data over $N$ frames and code them in the tracking matrix $W$:

$$W = \begin{bmatrix} u_{1,1} & \cdots & u_{1,P} \\ v_{1,1} & \cdots & v_{1,P} \\ \vdots & & \vdots \\ u_{N,1} & \cdots & u_{N,P} \\ v_{N,1} & \cdots & v_{N,P} \end{bmatrix} \tag{4}$$

Using (4) we can write:

$$W = \begin{bmatrix} l_{1,1} R_1 & \cdots & l_{1,K} R_1 \\ \vdots & & \vdots \\ l_{N,1} R_N & \cdots & l_{N,K} R_N \end{bmatrix} \cdot \begin{bmatrix} S_1 \\ \vdots \\ S_K \end{bmatrix} \tag{5}$$
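To make the structure of the tracking matrix concrete, the following NumPy sketch builds $W$ from random basis shapes, configuration weights, and camera rotations following equations (1)–(5), and confirms the rank-$3K$ property. The synthetic setup and all variable names are ours, not part of the original method:

```python
import numpy as np

rng = np.random.default_rng(1)
K, P, N = 2, 30, 50                         # basis shapes, points, frames

def random_rotation_block(rng):
    """First two rows of a random 3D rotation (scaled orthographic camera)."""
    q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    return q[:2]

S = rng.standard_normal((K, 3, P))          # key-frame basis shapes S_i
W = np.zeros((2 * N, P))                    # tracking matrix, eq. (4)
for t in range(N):
    R = random_rotation_block(rng)          # 2x3 pose for frame t
    l = rng.standard_normal(K)              # configuration weights l_{t,i}
    shape_t = np.tensordot(l, S, axes=1)    # 3xP: sum_i l_i S_i, eq. (1)
    W[2 * t:2 * t + 2] = R @ shape_t        # projection, eq. (2)

print(np.linalg.matrix_rank(W))             # -> 6, i.e. 3K
```

With translation already subtracted (the points are centered), the tracking matrix is generically of rank exactly $3K$.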

3.1 Basis Shape Factorization

Equation (5) shows that the tracking matrix $W$ has rank $3K$ and can be factored into two matrices: a $2N \times 3K$ matrix $\tilde{Q}$ that contains for each time frame $t$ the pose $R_t$ and the configuration weights $l_{t,1}, \ldots, l_{t,K}$, and a $3K \times P$ matrix $\tilde{B}$ that codes the $K$ key-frame basis shapes $S_i$. The factorization can be done using singular value decomposition (SVD), considering only the first $3K$ singular vectors and singular values (the first $3K$ columns of $U$, $D$, $V$):

first SVD: $\quad W = U D V^{\top}, \qquad \tilde{Q} = U' \sqrt{D'}, \quad \tilde{B} = \sqrt{D'} \, V'^{\top},$

where $U'$, $D'$, $V'$ denote the restrictions to the first $3K$ columns and singular values.
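This first SVD step can be sketched in NumPy as follows; here a synthetic rank-$3K$ matrix stands in for real tracking data, and the square-root split of the singular values follows the usual Tomasi–Kanade convention (names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
K, P, N = 2, 30, 50

# synthetic rank-3K tracking matrix, standing in for real tracking data
W = rng.standard_normal((2 * N, 3 * K)) @ rng.standard_normal((3 * K, P))

# first SVD: keep only the 3K largest singular values / vectors
U, s, Vt = np.linalg.svd(W, full_matrices=False)
Q = U[:, :3 * K] * np.sqrt(s[:3 * K])           # 2N x 3K: pose + weights
B = np.sqrt(s[:3 * K])[:, None] * Vt[:3 * K]    # 3K x P: basis shapes

print(np.allclose(Q @ B, W))                    # -> True (W has rank 3K here)
```

With noisy real tracking data the truncation acts as a least-squares rank-$3K$ approximation of $W$.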

3.2 Factoring Pose from Configuration

In the second step, we extract the camera rotations $R_t$ and the shape basis weights $l_{t,1}, \ldots, l_{t,K}$ from the matrix $\tilde{Q}$ obtained in the first factorization step. Although the block of $\tilde{Q}$ belonging to one frame is a $2 \times 3K$ matrix, it only contains $K + 6$ free variables. Consider the two rows of $\tilde{Q}$ that correspond to one single time frame $t$, namely row $2t - 1$ and row $2t$ (for convenience we drop the time index $t$):

$$q_t = \begin{bmatrix} l_1 R & l_2 R & \cdots & l_K R \end{bmatrix}$$

We can reorder the elements of $q_t$ into a new matrix $\bar{Q}_t$:

$$\bar{Q}_t = \begin{bmatrix} l_1 r_1 & l_1 r_2 & \cdots & l_1 r_6 \\ \vdots & & & \vdots \\ l_K r_1 & l_K r_2 & \cdots & l_K r_6 \end{bmatrix} = \mathbf{l} \cdot \mathbf{r}^{\top}, \qquad \mathbf{l} = (l_1, \ldots, l_K)^{\top}, \; \mathbf{r} = (r_1, \ldots, r_6)^{\top} \tag{6}$$

which shows that $\bar{Q}_t$ is of rank 1 and can be factored into the pose $\mathbf{r}$ and the configuration weights $\mathbf{l}$ by SVD. We successively apply the reordering and factorization to all $N$ time blocks of $\tilde{Q}$.

3.3 Adjusting Pose and Shape

In the final step, we need to enforce the orthonormality of the rotation matrices. As in [8], a linear transformation $G$ is found by solving a least squares problem¹. The transformation $G$ maps all $\tilde{R}_t$ into orthonormal $R_t = \tilde{R}_t G$. The inverse transformation must be applied to the key-frame basis shapes to keep the factorization consistent: $S_i = G^{-1} \cdot \tilde{S}_i$. We are now done. Given 2D tracking data $W$, we can estimate a non-rigid 3D shape matrix with $K$ degrees of freedom, and the corresponding camera rotations $R_t$ and configuration weights $l_{t,i}$ for each time frame.

¹ The least squares problem enforces the orthonormality of all rotations: $(\tilde{R}_t G)(\tilde{R}_t G)^{\top} = I$ for every frame $t$.
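The reordering and rank-1 factorization of a single frame's $2 \times 3K$ block can be illustrated as follows (a NumPy sketch on synthetic data; all names are ours). The joint sign/scale ambiguity left by the rank-1 SVD is exactly what the orthonormality adjustment of Section 3.3 resolves:

```python
import numpy as np

rng = np.random.default_rng(2)
K = 4

# one frame's 2x3K block q_t = [l_1 R | ... | l_K R], with synthetic R, l
R = np.linalg.qr(rng.standard_normal((3, 3)))[0][:2]   # 2x3 rotation block
l = rng.standard_normal(K)                             # configuration weights
q_t = np.hstack([li * R for li in l])                  # 2 x 3K

# reorder: row i of Q_bar holds l_i * (r1 ... r6), so Q_bar = l r^T, rank 1
Q_bar = np.stack([q_t[:, 3 * i:3 * i + 3].ravel() for i in range(K)])
print(np.linalg.matrix_rank(Q_bar))                    # -> 1

# rank-1 factorization by SVD; l_hat and r_hat are recovered only up to a
# joint sign and scale, which the orthonormality step (Sec. 3.3) fixes
U, s, Vt = np.linalg.svd(Q_bar)
l_hat, r_hat = U[:, 0] * s[0], Vt[0]
print(np.allclose(np.outer(l_hat, r_hat), Q_bar))      # -> True
```

Reshaping `r_hat` back to $2 \times 3$ recovers the rotation block up to this ambiguity.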

4 Experiments

This work is motivated by our efforts in image-based facial animation. In order to test these methods, we collected video of people speaking sentences with various facial expressions. The video recordings contain rigid head motions and non-rigid lip, eye, and other facial motions. We tracked important facial features with an appearance-based 2D tracking technique². Figures 1 and 7 show example tracking results for video-1 and video-2. For facial animation, we want explicit control over the rigid head pose and the implicit facial variations. In the following, we show how we were able to extract a 3D non-rigid face model parameterized by these degrees of freedom.

We applied our method to two different video sequences. The first is a public broadcast originally recorded on film in the early 1960s (video-1). The second video was recorded in our lab (video-2). Both recordings are challenging for 3D reconstruction, since they contain very few out-of-plane head motions.

In a first experiment, we computed the reconstruction error as a function of the number of degrees of freedom K for video-1. We factorized the tracking data, and computed the back-projection of the estimated model, configuration, and pose into the image. Figure 2 shows the SSD error between the back-projected points and the image measurements. For sufficiently large K the error vanishes; we fix K at this value for the remainder of the paper. Figures 3 and 4 show example frames of video-1 and the reconstructed 3D shape rotated by the corresponding pose. To illustrate the 3D data better, we fit a shaded smooth surface to the 3D shape points.

We also investigated the discovered modes of variation. We computed the mean and standard deviations of the configuration weights in video-1. Figures 5 and 6 show the variation along the second and third modes. Mode 1 covers scale change, mode 2 covers some aspect of mouth opening, and mode 3 covers eye opening. The remaining modes pick up more subtle and less intuitive variations.

Figure 8 shows the reconstruction results for video-2. The results on these video databases are very encouraging. Given the limited range of out-of-plane face orientations, the 3D detail that we could recover from the lip shape is quite surprising. We plan to record a "ground-truth" video of non-rigid objects, which will allow us to quantify the exact reconstruction error in 3D.
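The back-projection error experiment can be mimicked on synthetic data: truncate the tracking matrix to rank $3K$, back-project, and measure the residual as $K$ grows. This is an illustrative sketch, not the paper's actual experimental code; the matrix and its true rank are fabricated for the demonstration:

```python
import numpy as np

rng = np.random.default_rng(3)
P, N, K_true = 30, 50, 4

# synthetic tracking matrix of true rank 3*K_true, standing in for real data
W = rng.standard_normal((2 * N, 3 * K_true)) @ \
    rng.standard_normal((3 * K_true, P))

def ssd_error(W, K):
    """Mean squared residual of the rank-3K back-projection of W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    W_hat = (U[:, :3 * K] * s[:3 * K]) @ Vt[:3 * K]
    return float(np.mean((W - W_hat) ** 2))

for K in range(1, 6):
    print(K, ssd_error(W, K))   # the error vanishes once K reaches K_true
```

On real tracking data the curve flattens at a noise floor rather than reaching zero, which is how a suitable K can be chosen.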

5 Discussion

We have presented a simple but effective new technique for recovering 3D non-rigid shape models from 2D image streams. It is a three-step procedure using multiple factorizations. We were able to recover 3D models from video recordings of talking people. Although these are very encouraging results, we plan to evaluate this technique and its limitations on a larger data set of man-made articulated objects. This problem will be somewhat easier than the face database, but it will give us ground-truth values for performance evaluations. Reconstructing non-rigid models from single-view video recordings has many potential applications. In addition, we intend to apply this technique to our image-based facial animation system and to a model-based tracking system.

² We used a learned PCA-based tracker similar to [6].


Figure 1: Example images from video-1 with overlayed tracking points. We track the eyebrows, upper and lower eyelids, nose points, the outer and inner boundary of the lips, and the chin contour.


Figure 2: Average pixel SSD error of the back-projected face model for different degrees of freedom K.

Figure 3: 3D reconstructed shape and pose for the first frame of Figure 1.


Figure 4: 3D reconstructed shape and pose for last frame of Figure 1

Figure 5: Variation along mode 2 of the nonrigid face model. The mouth deforms.


Figure 6: Variation along mode 3 of the nonrigid face model. The eyes close.

Figure 7: Example images from video-2 with overlayed tracking points.

Figure 8: Front and side view of the reconstructions from video-2.

Acknowledgments

We would like to thank Ken Perlin, Denis Zorin, and Davi Geiger for fruitful discussions and for supporting this research; Clilly Castiglia and Steve Cooney for helping with the data collection; and New York University, the California State MICRO program, and Interval Research for partial funding.

References

[1] B. Bascle and A. Blake. Separability of pose and expression in facial tracking and animation. In Proc. Int. Conf. Computer Vision, 1998.

[2] S. Basu. A three-dimensional model of human lip motion. EECS Master's Thesis, MIT Media Lab Report 417, 1997.

[3] A. Blake, M. Isard, and D. Reynard. Learning to track the visual motion of contours. J. Artificial Intelligence, 1995.

[4] J. Costeira and T. Kanade. A multi-body factorization method for motion analysis. Int. J. of Computer Vision, pages 159–180, Sep 1998.

[5] Brian Guenter, Cindy Grimm, Daniel Wood, Henrique Malvar, and Frédéric Pighin. Making faces. In Michael Cohen, editor, SIGGRAPH 98 Conference Proceedings, Annual Conference Series, pages 55–66. ACM SIGGRAPH, Addison Wesley, July 1998. ISBN 0-89791-999-8.

[6] A. Lanitis, C.J. Taylor, T.F. Cootes, and T. Ahmed. Automatic interpretation of human faces and hand gestures using flexible models. In International Workshop on Automatic Face- and Gesture-Recognition, 1995.

[7] Frédéric Pighin, Jamie Hecker, Dani Lischinski, Richard Szeliski, and David H. Salesin. Synthesizing realistic facial expressions from photographs. In Michael Cohen, editor, SIGGRAPH 98 Conference Proceedings, Annual Conference Series, pages 75–84. ACM SIGGRAPH, Addison Wesley, July 1998. ISBN 0-89791-999-8.

[8] C. Tomasi and T. Kanade. Shape and motion from image streams under orthography: a factorization method. Int. J. of Computer Vision, 9(2):137–154, 1992.

[9] M. Turk and A. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71–86, 1991.
