Facial motion capture is the process of electronically converting the movements of a person's face into a digital database using cameras or laser scanners. This database may then be used to produce computer graphics (CG) and computer animation for movies, games, or real-time avatars. Because the motion of CG characters is derived from the movements of real people, it results in more realistic and nuanced computer character animation than if the animation were created manually.
A facial motion capture database describes the coordinates or relative positions of reference points on the actor's face. The capture may be in two dimensions, in which case the capture process is sometimes called "expression tracking", or in three dimensions. Two-dimensional capture can be achieved using a single camera and capture software. This produces less sophisticated tracking, and is unable to fully capture three-dimensional motions such as head rotation. Three-dimensional capture is accomplished using multi-camera rigs or laser marker systems. Such systems are typically far more expensive, complicated, and time-consuming to use. Two predominant technologies exist: marker and markerless tracking systems.
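As a rough illustration of what such a database might hold, the sketch below stores one frame as named reference points and shows why a single-camera 2D view cannot fully separate head rotation from expression. The point names, coordinates, units, and the orthographic projection are all assumptions made for the example, not a description of any particular capture system.

```python
import math
from dataclasses import dataclass
from typing import Dict, Tuple

Point3 = Tuple[float, float, float]

@dataclass
class Frame:
    """One captured frame: a time stamp plus named reference points (x, y, z), in mm."""
    time_s: float
    points: Dict[str, Point3]

def relative_to(frame: Frame, anchor: str) -> Dict[str, Point3]:
    """Express every point relative to a chosen anchor point."""
    ax, ay, az = frame.points[anchor]
    return {n: (x - ax, y - ay, z - az) for n, (x, y, z) in frame.points.items()}

def yaw(frame: Frame, degrees: float) -> Frame:
    """Rotate the whole head about the vertical (y) axis."""
    c, s = math.cos(math.radians(degrees)), math.sin(math.radians(degrees))
    rotated = {n: (c * x + s * z, y, -s * x + c * z)
               for n, (x, y, z) in frame.points.items()}
    return Frame(frame.time_s, rotated)

def mouth_width_2d(frame: Frame) -> float:
    """Width seen by a single orthographic camera: depth (z) is simply dropped."""
    left = frame.points["left_mouth_corner"]
    right = frame.points["right_mouth_corner"]
    return right[0] - left[0]

def mouth_width_3d(frame: Frame) -> float:
    """True distance between the mouth corners, using all three coordinates."""
    return math.dist(frame.points["left_mouth_corner"],
                     frame.points["right_mouth_corner"])

# Hypothetical neutral face (values invented for illustration).
neutral = Frame(0.0, {
    "nose_tip": (0.0, 0.0, 30.0),
    "left_mouth_corner": (-25.0, -40.0, 0.0),
    "right_mouth_corner": (25.0, -40.0, 0.0),
})
turned = yaw(neutral, 30.0)  # same expression, head turned by 30 degrees

print(relative_to(neutral, "nose_tip")["left_mouth_corner"])
print(mouth_width_2d(neutral), mouth_width_2d(turned))  # 2D width shrinks
print(mouth_width_3d(neutral), mouth_width_3d(turned))  # 3D width is unchanged
```

In the 2D view the turned head looks like a narrower mouth, which is indistinguishable from a genuine change of expression unless depth is also captured.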
Facial motion capture is related to body motion capture, but is more challenging because of the higher resolution required to detect and track the subtle expressions produced by small movements of the eyes and lips. These movements are often less than a few millimeters, requiring even greater resolution and fidelity, and different filtering techniques, than those usually used in full body capture. The additional constraints of the face also allow more opportunities for using models and rules.
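One way to see why filtering needs care at this scale is to smooth a very small, brief movement with windows of different lengths. The sketch below uses a plain centred moving average and an invented lip-corner signal; real systems may use quite different filters, so this only illustrates the trade-off between noise suppression and loss of subtle motion.

```python
def moving_average(samples, window):
    """Centred moving average; near the edges it averages the neighbours that exist."""
    half = window // 2
    smoothed = []
    for i in range(len(samples)):
        lo, hi = max(0, i - half), min(len(samples), i + half + 1)
        smoothed.append(sum(samples[lo:hi]) / (hi - lo))
    return smoothed

# Hypothetical lip-corner displacement in millimetres: a single-frame 0.8 mm twitch.
signal = [0.0] * 10 + [0.8] + [0.0] * 10

light = moving_average(signal, 3)   # short window keeps more of the twitch
heavy = moving_average(signal, 9)   # long window flattens it much further

print(max(signal), max(light), max(heavy))
```

The longer the window, the smaller the surviving peak, so a smoothing setting that is harmless for large limb movements can erase a sub-millimetre facial movement almost entirely.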
Facial expression capture is similar to facial motion capture. It is a process of using visual or mechanical means to manipulate computer-generated characters with input from human faces, or to recognize emotions from a user.
History
One of the first papers discussing performance-driven animation was published by Lance Williams in 1990. There, he describes 'a means of acquiring the expressions of real faces, and applying them to computer-generated faces'.
[Performance-Driven Facial Animation, Lance Williams, Computer Graphics, Volume 24, Number 4, August 1990]
Technologies
Marker-based
Traditional marker-based systems apply up to 350 markers to the actor's face and track the marker movement with high-resolution cameras. This has been used on movies such as ''The Polar Express'' and ''Beowulf'' to allow an actor such as Tom Hanks to drive the facial expressions of several different characters. Unfortunately this is relatively cumbersome and makes the actor's expressions overly driven once the smoothing and filtering have taken place. Next-generation systems such as CaptiveMotion utilize offshoots of the traditional marker-based system with higher levels of detail.
Active LED marker technology is currently being used to drive facial animation in real time to provide user feedback.
Markerless
Markerless technologies use the features of the face, such as nostrils, the corners of the lips and eyes, and wrinkles, and then track them. This technology is discussed and demonstrated at CMU, IBM, University of Manchester (where much of this started with Tim Cootes, Gareth Edwards and Chris Taylor) and other locations, using active appearance models, principal component analysis, eigen tracking, deformable surface models and other techniques to track the desired facial features from frame to frame. This technology is much less cumbersome, and allows greater expression for the actor.
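To give a feel for how principal component analysis can be applied to facial feature points, the sketch below finds the dominant mode of variation in a handful of invented mouth-corner shapes using power iteration. This covers only a linear shape model; it is not a full active appearance model, and the training data, the "smile" parameter, and all function names are assumptions for illustration.

```python
import math

def mean_shape(shapes):
    """Average each coordinate over all training shapes."""
    return [sum(s[i] for s in shapes) / len(shapes) for i in range(len(shapes[0]))]

def covariance(shapes, mean):
    """Sample covariance matrix of the flattened shape vectors."""
    d = len(mean)
    centred = [[s[i] - mean[i] for i in range(d)] for s in shapes]
    return [[sum(c[i] * c[j] for c in centred) / (len(shapes) - 1)
             for j in range(d)] for i in range(d)]

def first_component(cov, iterations=50):
    """Power iteration: multiply by the matrix and renormalise until the
    vector settles on the dominant eigenvector (first principal component)."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def coefficient(shape, mean, component):
    """Project a shape onto the component: one number summarising the expression."""
    return sum((shape[i] - mean[i]) * component[i] for i in range(len(mean)))

# Invented training set: each shape is (x_left, y_left, x_right, y_right) for the
# mouth corners; a hypothetical "smile" amount t moves the corners outward and up.
shapes = [[-25.0 - 2 * t, -40.0 + t, 25.0 + 2 * t, -40.0 + t]
          for t in (-2.0, -1.0, 0.0, 1.0, 2.0)]

mean = mean_shape(shapes)
component = first_component(covariance(shapes, mean))
new_shape = [-28.0, -38.5, 28.0, -38.5]

print(component)
print(coefficient(new_shape, mean, component))
```

A tracker built on such a model would search each frame for the coefficients that best explain the image, rather than tracking every point independently.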
These vision-based approaches also have the ability to track pupil movement, eyelids, and occlusion of the teeth by the lips and tongue, which are obvious problems in most computer-animated features. Typical limitations of vision-based approaches are resolution and frame rate, both of which are becoming less of an issue as high-speed, high-resolution CMOS cameras become available from multiple sources.
The technology for markerless face tracking is related to that in a facial recognition system, since a facial recognition system can potentially be applied sequentially to each frame of video, resulting in face tracking. For example, the Neven Vision system (formerly Eyematics, since acquired by Google) allowed real-time 2D face tracking with no person-specific training; their system was also amongst the best-performing facial recognition systems in the U.S. Government's 2002 Face Recognition Vendor Test (FRVT). On the other hand, some recognition systems do not explicitly track expressions or even fail on non-neutral expressions, and so are not suitable for tracking. Conversely, systems such as deformable surface models pool temporal information to disambiguate and obtain more robust results, and thus could not be applied from a single photograph.
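The idea of tracking by applying a single-image method to each frame in turn can be sketched as follows. Here an exhaustive template match (sum of squared differences) stands in for the recognition system's localisation step; the tiny synthetic frames, the patch, and the function names are assumptions for illustration, and a real detector would be far more elaborate.

```python
def ssd(image, template, top, left):
    """Sum of squared differences between the template and one image window."""
    return sum((image[top + r][left + c] - template[r][c]) ** 2
               for r in range(len(template))
               for c in range(len(template[0])))

def detect(image, template):
    """Single-image step: return (row, col) of the best-matching window.
    It uses no information from any other frame."""
    th, tw = len(template), len(template[0])
    candidates = [(ssd(image, template, r, c), r, c)
                  for r in range(len(image) - th + 1)
                  for c in range(len(image[0]) - tw + 1)]
    _, row, col = min(candidates)
    return row, col

def make_frame(height, width, top, left, patch):
    """Synthetic grayscale frame: black background with the patch pasted in."""
    image = [[0] * width for _ in range(height)]
    for r, row in enumerate(patch):
        for c, value in enumerate(row):
            image[top + r][left + c] = value
    return image

patch = [[9, 1], [1, 9]]                       # stand-in for a facial feature
positions = [(1, 1), (1, 2), (2, 3), (3, 3)]   # where the feature really is
frames = [make_frame(6, 6, top, left, patch) for top, left in positions]

# Tracking = running the single-image detector on every frame in sequence.
track = [detect(frame, patch) for frame in frames]
print(track)
```

Because each frame is handled independently, this approach also works on a single photograph; methods that pool information across frames trade that property for robustness.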
Markerless face tracking has progressed to commercial systems such as Image Metrics, which has been applied in movies such as ''The Matrix'' sequels and ''The Curious Case of Benjamin Button''. The latter used the Mova system to capture a deformable facial model, which was then animated with a combination of manual and vision tracking. ''Avatar'' was another prominent performance capture movie; however, it used painted markers rather than being markerless. Dynamixyz is another commercial system currently in use.
Markerless systems can be classified according to several distinguishing criteria:
* 2D versus 3D tracking
* whether person-specific training or other human assistance is required
* real-time performance (which is only possible if no training or supervision is required)
* whether they need an additional source of information, such as projected patterns or the invisible paint used in the Mova system.
To date, no system is ideal with respect to all these criteria. For example, the Neven Vision
system was fully automatic and required no hidden patterns or per-person training, but was 2D.
The Face/Off system is 3D, automatic, and real-time but requires projected patterns.
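The criteria above can be written down as a small record type, which makes the trade-offs between systems easy to compare. The field names are invented for this sketch, and "automatic" is read here as "no person-specific training or human assistance"; the two entries encode only what the text states about each system.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MarkerlessSystem:
    name: str
    dimensions: int                # 2 for 2D tracking, 3 for 3D tracking
    needs_person_training: bool    # person-specific training or human assistance
    real_time: bool
    needs_extra_source: bool       # e.g. projected patterns or invisible paint

    def shortcomings(self):
        """List the criteria on which this system falls short of the ideal."""
        issues = []
        if self.dimensions < 3:
            issues.append("2D only")
        if self.needs_person_training:
            issues.append("needs training or assistance")
        if not self.real_time:
            issues.append("not real-time")
        if self.needs_extra_source:
            issues.append("needs an additional information source")
        return issues

systems = [
    MarkerlessSystem("Neven Vision", 2, False, True, False),
    MarkerlessSystem("Face/Off", 3, False, True, True),
]

for system in systems:
    print(system.name, system.shortcomings())
```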
Facial expression capture
Technology
Digital video-based methods are becoming increasingly preferred, as mechanical systems tend to be cumbersome and difficult to use.
Using digital cameras, the input user's expressions are processed to provide the head pose, which allows the software to then find the eyes, nose and mouth. The face is initially calibrated using a neutral expression. Then, depending on the architecture, the eyebrows, eyelids, cheeks, and mouth can be processed as differences from the neutral expression. This is done, for instance, by looking for the edges of the lips and recognizing them as a unique object. Often contrast-enhancing makeup or markers are worn, or some other method is used to make the processing faster. Like voice recognition, the best techniques are only good 90 percent of the time, requiring a great deal of tweaking by hand, or tolerance for errors.
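The calibration step can be pictured as averaging a few neutral frames into a baseline and then reporting each feature as an offset from it. The feature names, pixel values, the upward-pointing y axis, and the threshold below are all assumptions chosen for the sketch.

```python
def calibrate(neutral_frames):
    """Average several neutral-expression frames into one baseline per feature."""
    count = len(neutral_frames)
    return {name: (sum(f[name][0] for f in neutral_frames) / count,
                   sum(f[name][1] for f in neutral_frames) / count)
            for name in neutral_frames[0]}

def expression_deltas(frame, baseline):
    """Describe a frame as per-feature differences from the neutral baseline."""
    return {name: (frame[name][0] - bx, frame[name][1] - by)
            for name, (bx, by) in baseline.items()}

# Hypothetical 2D feature positions in pixels, with y pointing up.
neutral_frames = [
    {"left_brow": (40.0, 100.0), "mouth_left": (45.0, 30.0)},
    {"left_brow": (40.0, 102.0), "mouth_left": (45.0, 30.0)},
]
baseline = calibrate(neutral_frames)

current = {"left_brow": (40.0, 107.0), "mouth_left": (45.0, 30.0)}
deltas = expression_deltas(current, baseline)

BROW_RAISE_THRESHOLD = 4.0  # pixels; an arbitrary value for illustration
brow_raised = deltas["left_brow"][1] > BROW_RAISE_THRESHOLD

print(deltas, brow_raised)
```

Working with differences rather than absolute positions means the same thresholds can be reused across users whose resting faces differ.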
Since computer-generated characters don't actually have muscles, different techniques are used to achieve the same results. Some animators create bones or objects that are controlled by the capture software, and move them accordingly, which, when the character is rigged correctly, gives a good approximation. Since faces are very elastic, this technique is often mixed with others, adjusting the weights differently for the skin elasticity and other factors depending on the desired expressions.
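A heavily simplified picture of bones driving skin with per-vertex weights is sketched below. It handles translations only (no rotations), and the bone name, vertices, and weights are invented; production rigs combine this kind of weighting with other deformers.

```python
def deform(vertices, weights, bone_offsets):
    """Move each skin vertex by the weighted sum of the bone translations.

    vertices:     list of (x, y, z) rest positions
    weights:      one dict per vertex mapping bone name -> influence in [0, 1]
    bone_offsets: dict mapping bone name -> (dx, dy, dz) from the capture data
    """
    deformed = []
    for (x, y, z), influences in zip(vertices, weights):
        dx = sum(w * bone_offsets[b][0] for b, w in influences.items())
        dy = sum(w * bone_offsets[b][1] for b, w in influences.items())
        dz = sum(w * bone_offsets[b][2] for b, w in influences.items())
        deformed.append((x + dx, y + dy, z + dz))
    return deformed

# Hypothetical rig: the chin follows the jaw fully, the cheek only partly,
# and the forehead not at all.
vertices = [(0.0, -60.0, 10.0), (30.0, -20.0, 5.0), (0.0, 60.0, 8.0)]
weights = [{"jaw": 1.0}, {"jaw": 0.25}, {"jaw": 0.0}]

# Captured motion: the jaw opens, moving 10 units downward.
posed = deform(vertices, weights, {"jaw": (0.0, -10.0, 0.0)})
print(posed)
```

Tuning the weights per region is one way to approximate how differently elastic areas of skin respond to the same underlying movement.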
Usage
Several commercial companies are developing products that have been used, but are rather expensive.
It is expected that this will become a major input device for computer games once the software is available in an affordable format, but the hardware and software do not yet exist, despite the research for the last 15 years producing results that are almost usable.
Communication with real-time avatars
The first application to gain wide adoption was communication: initially in video telephony and multimedia messaging, and later in 3D with mixed reality headsets.
With the advance of machine learning, compute power and advanced sensors, especially on mobile phones, facial motion capture technology became widely available. Two notable examples are Snapchat's lens feature and Apple's Memoji, which can be used to record messages with avatars or live via the FaceTime app. With these applications (and many others), most modern mobile phones are capable of performing real-time facial motion capture.
More recently, real-time facial motion capture combined with realistic 3D avatars was introduced to enable immersive communication in mixed reality (MR) and virtual reality (VR).
Meta demonstrated its Codec Avatars by recording a podcast with two remote participants who communicated via its MR headset, the Meta Quest Pro.
Apple's MR headset, the Apple Vision Pro, also supports real-time facial motion capture that can be used with applications such as FaceTime.
Real-time communication applications prioritize low latency to facilitate natural conversation and ease of use, aiming to make the technology accessible to a broad audience. These considerations may limit the achievable accuracy of the motion capture.
See also
* Eye tracking
* Computer facial animation
* Deepfake
* Facial recognition system
* Facial Action Coding System
* Uncanny valley
References
External links
Carnegie Mellon University, Delft University of Technology, Sheffield and Otago