Visual Tracking for Seamless 3D Interactions in Augmented Reality
Fraunhofer Institute for Applied Information Technology, Collaborative Virtual and Augmented Environments,
Schloss Birlinghoven, 53754 Sankt Augustin, Germany
chunrong.email@example.com
Abstract. This paper presents a computer-vision-based approach for creating 3D tangible interfaces, which can facilitate real-time and flexible interactions with the augmented virtual world. This approach uses real-world objects and free-hand gestures as interaction handles. The identity of these objects and gestures, as well as their 3D pose in the physical world, can be tracked in real time. Once the objects and gestures are perceived and localized, the corresponding virtual objects can be manipulated dynamically by human operators who are operating on those real objects. Since the tracking algorithm is robust against background clutter and adaptable to illumination changes, it performs well in real-world scenarios, where both objects and cameras move rapidly in unconstrained environments.
Augmented Reality (AR) deals mainly with the visual enhancement of the physical world. The interactive aspect of AR requires tangible interfaces that can invoke dynamic actions and changes in the augmented 3D space. On the one hand, the concept of tangible interfaces makes it possible to develop interactive AR applications. On the other hand, reliable systems that can retrieve the identity and location of real-world objects have to be developed. Successful AR interaction depends, among other things, on the robust processing and tracking of real-world objects. According to , many AR systems will not be able to run without accurate registration of the real world.
Various means can be employed for the tracking of real-world objects, including mechanical, electromagnetic, acoustic, inertial, optical and image-based devices. We favor the image-based tracking method because it is noninvasive and can be applied in both static and dynamic situations. Unlike other approaches, image-based visual tracking is a closed-loop approach that tackles the registration and interaction problems simultaneously. Images can provide visual feedback on the registration performance so that an AR user can know how closely the real and virtual objects match each other. With this visual feedback, interactions with the virtual world can take place more naturally and efficiently.

The author thanks the whole CVAE group as well as colleagues outside Fraunhofer for their kind support and discussions.
G. Bebis et al. (Eds.): ISVC 2005, LNCS 3804, pp. 321-328, 2005. © Springer-Verlag Berlin Heidelberg 2005
322 C. Yuan
One popular approach to the visual tracking problem is using marker objects. In , 2D ARToolkit markers are used to render virtual objects onto them. A cube with different colors on each side of its surface has been used in , where the cube is localized in an image by the CSC color segmentation algorithm. In , the 3D pose of a dotted pattern is recovered using a pair of stereo cameras. Because these marker objects are designed only for tracking, they are not suitable for interaction purposes.
A few other works suggest using hand gestures as tangible interfaces. In , a pointing posture is detected based on human body segmentation by combining background subtraction and region categorization. Another example is the augmented desk interface . Here the arms of a user are segmented from the infrared input image using a simple threshold operation. After that, fingertips are searched for within fixed-size regions using a template matching algorithm. Gestures are then recognized based on multiple fingertip trajectories.
In this paper, we present a new approach which is capable of real-time tracking of the physical world as well as the creation of natural and easy-to-use interfaces. By relating real-world objects to their counterparts in the augmented virtual world one by one, a set of interaction units can be constructed so that the virtual world can be manipulated seamlessly by AR users operating on those real objects.
The proposed tracking approach contributes to the state of the art in several aspects. First, both real-world objects and free-hand gestures are tracked simultaneously to satisfy different interaction purposes. Unlike the markers used in the references, the objects we have designed are much smaller, which makes them much easier to grasp. Second, our tracking system can support multiple users who can interact with the AR world either individually or cooperatively. Last but not least, the tracking cameras in our system are allowed to move freely in unconstrained environments, while most tracking systems can only handle static camera(s).
The remainder of this paper is organized as follows. Sect. 2 gives an overview of the tracking system. Sect. 3 presents the visual tracking algorithm. Interaction mechanisms based on the results of visual tracking are shown in Sect. 4. System performance is evaluated and discussed in Sect. 5, followed by a summary in Sect. 6.
2 System Overview
The tracking system is designed to be used in a multi-user AR environment, where several users need to interact collaboratively with the virtual world rendered on top of a round table (see Fig. 1(a)). For different purposes, different kinds of interaction mechanisms are needed. Hence we use various 2D/3D objects as well as hand gestures as input devices. The scene captured by the tracking system is very dynamic, as both foreground and background objects are changing constantly and unexpectedly.
The users can sit or stand, and can move around the table to examine the virtual world from different viewpoints. So that the system keeps tracking the hand gestures while the users are moving freely, cameras are mounted on the head-mounted displays (HMD). As a result, both the objects and the cameras are moving all the time. To enable dynamic interactions with the target objects in the virtual world, the 3D pose parameters of the objects and gestures should be estimated precisely and in real time.
Fig. 1. (a). Multiple AR users interact with the augmented virtual world. (b). Objects and gestures used in the tracking system. (c). Offline color calibration. (d). Illustration of recognition and tracking results. (e). Manipulation of the virtual buildings. (f). Creation of new 3D models.
The central task of the vision-based 3D interface is the identification and tracking of multiple colored objects appearing in the camera view. As shown in Fig. 1(b), the objects comprise six 2D place holder objects (PHOs), two 3D pointers, and a set of gestures. PHOs are 2D colored objects with a 3DOF (degrees of freedom) pose. They are called place holders because they are used mainly to be related to their virtual counterparts. The pose of the pointers is 6DOF. They are pointing devices that can be used to point at virtual objects in 3D.
There are altogether six kinds of gestures used in the system, with the hand showing zero (a fist gesture) to five fingers. The gesture with one finger is a dynamic pointing
gesture whose 6DOF pose can be tracked in the same way as that of the 3D pointers. The other five gestures are also tracked continuously. But unlike the pointing gesture, these gestures are tracked only in 3DOF, as they are generally used as visual commands to trigger certain operations in the virtual world. Some HCI applications do not require the pose of a gesture to be known . However, the pose parameters of even a static gesture are indispensable for 3D interactions in location-critical applications.
The tracking system uses a static camera (Elmo CC491 camera unit with lipstick-size microhead QP49H) hanging over the round table to recognize the PHOs. Each AR user wears a pair of head-mounted cameras (HMC), which are installed horizontally on the left and right sides of the HMD. Each HMC is made of a pair of stereo cameras (JAI CVM 2250 microhead camera) for 3D pose estimation. Pointers can be tracked by all the users' HMCs. Gestures made by an AR user are tracked only by the HMC on that user's own head. To increase tracking speed, the right image of a stereo pair is processed only if pointers or gestures have been recognized in the left image.
3 Visual Object Tracking
Visual tracking for AR involves several steps such as object detection, object identification and object pose estimation. In the whole system, tracking is done using colors. First, colored regions are detected. Then the shapes of the colored regions are analyzed to identify the objects and gestures. After an object or a gesture is identified, its 2D/3D pose is estimated. Though we do use inter-frame information to guide tracking, it is not necessary to use a general-purpose tracking algorithm such as the condensation or mean-shift algorithm, as the scene is very dynamic (both cameras and objects move irregularly).
3.1 Color Segmentation
Color regions are segmented by pixel-wise classification of the input images. For each of the colors used in the tracking system, a Gaussian model is built to approximate its distribution in the normalized red-green color space ($r = \frac{R}{R+G+B}$, $g = \frac{G}{R+G+B}$, where $R$, $G$ and $B$ denote the raw channel values of a pixel). Since color is very sensitive to changes in lighting conditions, adaptable color models are built in an offline color calibration process before the tracking system works online. The calibration is done interactively by putting objects in different locations. The adaptability of the color model can be visualized after calibration. To test the calibration result, the user just clicks on a color region and sees whether it can be segmented properly, as illustrated in Fig. 1(c), where the segmentation result of the rightmost circle on the top right PHO is shown.
After each of the used colors has been calibrated, the color models are completely built and can be made available for use in online tracking. Once new images are grabbed, the pixels that have similar statistics to those in the models are identified. Regions of different colors can now be established.
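
The calibration and classification steps of this section can be illustrated with a minimal sketch. It assumes one single Gaussian per color and a fixed Mahalanobis distance threshold; the function names (`to_rg`, `fit_color_model`, `segment`), the threshold value and the covariance regularization are hypothetical choices for illustration and are not taken from the system described here.

```python
import numpy as np

def to_rg(pixels):
    """Map RGB values of shape (..., 3) to normalized (r, g) of shape (..., 2)."""
    rgb = np.asarray(pixels, dtype=np.float64)
    total = rgb.sum(axis=-1, keepdims=True)
    total[total == 0] = 1.0  # black pixels: avoid division by zero
    return rgb[..., :2] / total

def fit_color_model(samples):
    """Offline step: fit a 2D Gaussian (mean, covariance) to samples of one color."""
    rg = to_rg(samples).reshape(-1, 2)
    mean = rg.mean(axis=0)
    # Small diagonal term (an assumption) keeps the covariance invertible.
    cov = np.cov(rg, rowvar=False) + 1e-6 * np.eye(2)
    return mean, cov

def segment(image, model, max_dist=3.0):
    """Online step: mask of pixels within max_dist Mahalanobis units of the model."""
    mean, cov = model
    diff = to_rg(image) - mean
    d2 = np.einsum("...i,ij,...j->...", diff, np.linalg.inv(cov), diff)
    return d2 <= max_dist ** 2
```

Because the model lives in the normalized space, a bright and a dark pixel of the same chromaticity receive the same classification, which is the reason for choosing this color space when lighting varies.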
3.2 Object Recognition and Tracking
Recognition of the PHOs is done as follows. All the PHOs have the same background color. Once each region having this background color has been identified, the two colored regions within each can be localized and the geometric center of each circle can be computed. The lines connecting the two center points indicate the orientations of the PHOs (see Fig. 1(d)). Based on the identified colors, the identity of the PHOs can be determined. Using a calibrated camera, the 3DOF pose of the PHOs is easily calculated.
The recognition is quite robust. As can be seen from Fig. 1(d), all six PHOs have been recognized despite occlusions. Nor does the presence of similarly colored objects in the image affect the recognition results.
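
As an illustration of the PHO pose computation, the following sketch derives a 3DOF pose in image coordinates from the masks of the two colored circles. Taking the midpoint of the two centers as the position is an assumption made here, the function names are hypothetical, and the mapping from image coordinates to table coordinates via the calibrated camera is omitted.

```python
import numpy as np

def region_centroid(mask):
    """Geometric center (x, y) of the True pixels of a boolean mask."""
    ys, xs = np.nonzero(mask)
    return np.array([xs.mean(), ys.mean()])

def pho_pose_2d(mask_a, mask_b):
    """Return (x, y, theta) of a PHO in image coordinates.

    (x, y) is the midpoint of the two circle centers (an assumption);
    theta is the angle of the line from circle A to circle B, measured
    with the image y axis pointing down.
    """
    ca, cb = region_centroid(mask_a), region_centroid(mask_b)
    mid = (ca + cb) / 2.0
    dx, dy = cb - ca
    return float(mid[0]), float(mid[1]), float(np.arctan2(dy, dx))
```

Since the two circles of a PHO have different colors, the direction from A to B is unambiguous, so the orientation is defined over the full 360 degrees.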
Recognition of the 3D pointers applies the same principle, i.e. by trying to locate the pointer's two colored regions. Shown in Fig. 1(d) on the right is the recognized pointer, whose 2D location and orientation have been marked with a line. The 3D pose estimation of the pointers is not as straightforward as that of the PHOs and will be explained in Sect. 3.3.
Gestures are identified through the analysis of skin-colored regions. The center of the palm is located by fitting a circle with maximal radius within the boundary of the segmented skin region. From here, fingers are searched for using circles with increasing radii. A region is identified as a finger only if it crosses different circles with a substantial number of pixels. Based on the number of fingers found, gestures can be differentiated. To suppress false alarms due to unintended hand movement of the user, only a gesture that has been recognized in three consecutive frames is accepted as a gesture output. If the gesture is not a pointing gesture, then only the location of the hand (the center of the palm) is computed. In the case of a pointing gesture, its 3D pose is calculated in a similar way to that of the 3D pointers, i.e. by using the approach shown in Sect. 3.3.
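
The three-consecutive-frame rule can be sketched as a small temporal filter. The class name `GestureFilter` and its interface are hypothetical; gesture labels are assumed to be the number of detected fingers (0 to 5), with `None` standing for "no gesture found in this frame".

```python
class GestureFilter:
    """Accept a gesture only after it was seen in `required` consecutive frames."""

    def __init__(self, required=3):
        self.required = required
        self.candidate = None  # label currently being counted
        self.count = 0         # consecutive frames showing that label

    def update(self, label):
        """Feed one per-frame label; return the accepted label or None."""
        if label is not None and label == self.candidate:
            self.count += 1
        else:
            # A different label (or a missed detection) restarts the count.
            self.candidate = label
            self.count = 1 if label is not None else 0
        # Compare the result against None, since the fist gesture has label 0.
        return label if self.count >= self.required else None
```

Feeding the per-frame labels 2, 2, 2, 5, 5 yields an accepted gesture only on the third frame; the switch to 5 is not reported until it has persisted for three frames as well.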
For illustration purposes, the recognition result of a pointing gesture is shown in Fig. 1(d). The white point at the center of the hand shows the location of the palm. There are two big circles shown around the hand in Fig. 1(d). The interior one crosses the middle of the finger with an arc of maximal length. The exterior one crosses the finger with an arc whose distance to the fingertip is about one-sixth of the length of the finger. The center of this arc is regarded as the location of the fingertip. The line on the finger shows its orientation.
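
How a circle around the palm center can reveal fingers may be sketched as follows: the skin mask is sampled along the circle and the separate runs of skin pixels are counted. This is a simplified illustration under stated assumptions, not the procedure used in the system: it tests a single radius, applies no minimum run length, and does not exclude the wrist or forearm, which would also produce a run. The function name and the sampling density are hypothetical.

```python
import numpy as np

def count_crossings(mask, center, radius, samples=360):
    """Count separate runs of True pixels met along a circle in a boolean mask.

    center is (x, y). Samples that fall outside the image count as background.
    A circle lying entirely inside the region has no run start and returns 0.
    """
    angles = np.linspace(0.0, 2.0 * np.pi, samples, endpoint=False)
    xs = np.rint(center[0] + radius * np.cos(angles)).astype(int)
    ys = np.rint(center[1] + radius * np.sin(angles)).astype(int)
    h, w = mask.shape
    inside = (xs >= 0) & (xs < w) & (ys >= 0) & (ys < h)
    on = np.zeros(samples, dtype=bool)
    on[inside] = mask[ys[inside], xs[inside]]
    # A run starts where a sample is on and its circular predecessor is off.
    return int(np.sum(on & ~np.roll(on, 1)))
```

Repeating the count for several increasing radii and keeping only regions that appear on more than one circle would come closer to the criterion described in the text.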
3.3 Multi-view 3D Pose Estimation
Since the HMC is not fixed in space, a multi-view 3D pose estimation approach is adopted. The calculation of the 3D pose of pointers and gestures is based on the triangulation of the points observed by both the HMC and the static overhead camera.
In principle there is no need to differentiate between a pointer and a pointing gesture, as in both cases the 6DOF pose has to be computed. For this reason we outline the algorithm by taking the pointer as an example.
In an offline process, the intrinsic camera parameters of the HMC and the transform matrix between the left and the right camera coordinate systems of the HMC are calibrated beforehand. Due to the constant movement of the HMC, it is necessary to estimate the extrinsic parameters of both the left and right cameras, i.e., to know the rotation and translation parameters of the HMC relative to the world coordinate system. The idea is to use the PHOs to establish the transform matrices of the HMC.
Using the measurements of the static overhead camera, the 3D world coordinates of the colored circles on the PHOs can be computed. As long as two PHOs are in the field of view of the overhead camera, we can obtain at least four points whose 3D positions are known. If these PHOs are also in the field of view of one of the HMC's cameras (i.e. the left or the right camera of the HMC), the 2D image coordinates of these four points in the HMC image are also computable. Let us suppose the left HMC camera sees the PHOs. Now its extrinsic camera parameters can be solved using a least-squares optimization algorithm. Based on the known transform between the two cameras of the HMC, the extrinsic parameters of the right camera can also be determined.
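
The estimation of the extrinsic parameters can be illustrated with a minimal sketch. The text does not specify which least-squares algorithm is used; the sketch below swaps in a linear homography-based solution (direct linear transform solved by SVD) and assumes that all circle centers lie on the table plane z = 0 of the world coordinate system. The function names, the plane assumption and the 4x4 transform convention (world to camera) are illustrative choices.

```python
import numpy as np

def planar_pose(world_xy, pixels, K):
    """Estimate R, t (world -> camera) from >= 4 points on the plane z = 0."""
    world_xy = np.asarray(world_xy, dtype=float)
    pixels = np.asarray(pixels, dtype=float)
    # Remove the intrinsics K to obtain normalized image coordinates.
    ones = np.ones((len(pixels), 1))
    norm = (np.linalg.inv(K) @ np.hstack([pixels, ones]).T).T
    norm = norm[:, :2] / norm[:, 2:3]
    # Two linear equations per correspondence: A h = 0.
    rows = []
    for (X, Y), (u, v) in zip(world_xy, norm):
        rows.append([X, Y, 1, 0, 0, 0, -u * X, -u * Y, -u])
        rows.append([0, 0, 0, X, Y, 1, -v * X, -v * Y, -v])
    # Least-squares solution: singular vector of the smallest singular value.
    H = np.linalg.svd(np.asarray(rows))[2][-1].reshape(3, 3)
    # H equals [r1 r2 t] up to scale; fix the scale, then the sign so that
    # the observed points lie in front of the camera.
    H = H / np.linalg.norm(H[:, 0])
    depth = H[2, 0] * world_xy[:, 0] + H[2, 1] * world_xy[:, 1] + H[2, 2]
    if depth.mean() < 0:
        H = -H
    r1, r2, t = H[:, 0], H[:, 1], H[:, 2]
    # Project [r1 r2 r1 x r2] onto the nearest rotation matrix.
    U, _, Vt = np.linalg.svd(np.column_stack([r1, r2, np.cross(r1, r2)]))
    return U @ Vt, t

def to_homogeneous(R, t):
    """Pack R and t into a 4x4 rigid transform."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def right_from_left(T_right_left, T_left_world):
    """Chain the calibrated left-to-right transform with the left extrinsics."""
    return T_right_left @ T_left_world
```

With noisy detections, the linear estimate would typically serve as the starting point of an iterative refinement that minimizes the reprojection error; that step is not shown.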
Since we have six PHOs, we can always have more than four points with known 3D-2D correspondences. This can guarantee a robust calculation of the projection matrices of the HMC even if some of the PHOs are moving or occluded. Once the location and orientation of the camera pair are known, a stereo-based reconstruction algorithm is applied to recover the 3D location of the pointer. If a pointer can be observed by more than...