
Here Come the Virtual Humans

How close are we to silicon-based actors on our screens? We asked USC assistant professor Dr. Hao Li, who's working on that very task.

Though Hollywood would prefer otherwise, actors do age. Performances have been preserved on celluloid and through digital means, and in the future, perhaps we'll interact with them inside immersive environments. But we can't keep them alive forever. Or can we?

Dr. Hao Li, a pioneer in virtual human development, thinks we can. His work is used by Apple, which acquired Faceshift, his Kinect-based facial performance-capture tool, and by Oculus/Facebook, which created a prototype of the first facial performance-sensing VR headset to enable social interaction in cyberspace.

Dr. Li earned his doctorate in computer science at ETH Zurich, and became a research lead at Industrial Light & Magic (ILM) before joining the University of Southern California (USC). Today, he's an assistant professor of computer science, director of the Vision and Graphics Lab, and CEO of an AR startup, Pinscreen.

PCMag called him at his lab in Playa Vista, California, not far from the Google campus, to learn about his virtual humans and see how he brings them to life.

Dr. Li, give us some background. When did you first get interested in computer graphics, tech, and creating virtual humans?
When I was a kid, I had a Commodore 64 and learned BASIC so I could put pixels on the screen. When PCs became available, I started to play with more sophisticated graphics programming and used professional 3D modeling software to create my own CG renderings and animations. Those were the early days of visual effects in the '90s; I remember how people were blown away by the effects that companies such as Industrial Light & Magic, where I later worked, produced for movies like Terminator 2, Jurassic Park, and so on.

Can you explain why your research on the geometric capture of human performances moves us beyond today's clunkier methods, such as marker-based motion capture?
When I started my PhD at ETH Zurich, real-time 3D sensors, similar to Microsoft's Kinect but not yet commercialized, had just been invented and were only available in research labs. But, I thought, if a sensor can see the world in 3D and digitize content directly, wouldn't that change the way we create animations? [Back then] one of the hardest things to create was a realistic animation of a digital human, since it involved complex motion-capture devices and artists. While depth sensors are fairly accurate in measuring the world, they only capture part of the information, and the raw data is unstructured. Our research consisted of recovering dense surface motions from this data, so that a dynamic 3D model of a real object could be generated, similar to one modeled and animated by an artist.
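To make the "find correspondences, then align" idea concrete, here is a toy Python/NumPy sketch. It is an illustration only, not Dr. Li's actual pipeline: it registers a template point set to a synthetic "scan" with a few rigid ICP-style iterations, whereas real performance-capture research recovers dense, non-rigid surface motion from the sensor data.

```python
# Toy sketch only: rigid ICP-style registration of a template point set to an
# unstructured "scan." Real performance-capture systems recover dense,
# non-rigid surface motion; this merely illustrates the
# find-correspondences-then-align loop that such methods build on.
import numpy as np

def nearest_neighbors(src, dst):
    """Index of the closest point in dst for every point in src."""
    d2 = ((src[:, None, :] - dst[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

def rigid_align(src, dst):
    """Best-fit rotation R and translation t mapping src onto dst (Kabsch)."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:          # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, mu_d - R @ mu_s

# Synthetic data: the "scan" is the template, rotated, shifted, and noised.
rng = np.random.default_rng(0)
template = rng.normal(size=(500, 3))
c, s = np.cos(0.2), np.sin(0.2)
Rz = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
scan = (template @ Rz.T + np.array([0.05, -0.02, 0.03])
        + 0.01 * rng.normal(size=template.shape))

aligned = template.copy()
for _ in range(10):                   # a few ICP iterations
    idx = nearest_neighbors(aligned, scan)
    R, t = rigid_align(aligned, scan[idx])
    aligned = aligned @ R.T + t

idx = nearest_neighbors(aligned, scan)
print("mean residual after alignment:",
      np.linalg.norm(aligned - scan[idx], axis=1).mean())
```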

And you were able to do this, even with human faces?
Right. Since human faces are among the most complex things to animate, we applied our algorithm to the performance capture of human faces, making it possible to track complex facial expressions in real time.

Without the use of markers, as with conventional motion capture, or Light Stage capture, which we saw digitize Robin Wright in The Congress?
Exactly. Soon we won't need the Light Stage to do this at all. Plus, since our early work with depth sensors showed how compellingly faces can be animated without complex mocap systems, solutions that require only plain RGB cameras were introduced shortly after, and those technologies have since been widely adopted in the consumer space: mobile phone apps, face-recognition systems, VR headsets, and so on.

Which takes us out of your lab and into your company.
Yes, as part of [my company] Pinscreen, we are taking this a step further and targeting the digitization of realistic 3D avatars from a single photograph, which will allow anyone to create their own digital self in seconds and incorporate it into any game or VR application.

In 2014, you spent three months in New Zealand at Peter Jackson's Weta Digital. What was a highlight of your time there?
We had the opportunity to deploy our hair-digitization technologies for the creation of CG characters at scale as part of the research group, but one of the highlights was developing a novel facial performance-capture pipeline for the photorealistic digital reenactment of [the late actor] Paul Walker in the movie Furious 7. We developed some cutting-edge computer-vision techniques to ensure that realistic movements were mapped from his double onto his digital face.

Aside from your time with Weta Digital, how many top actors have you digitized to date?
At USC ICT, almost every month, we scan various celebrities for all the top blockbuster movies using our Light Stage capture technology, originally invented by Dr. Paul Debevec. We are developing new deep-learning-based techniques to increase the speed and fidelity of acquisition, and we work closely with all major VFX studios to adopt our tech into their pipelines. Recently we've worked on Ready Player One and Blade Runner 2049 [the full list is here].

Tell us about working with Apple, which used your depth sensor-driven facial animation for the iPhone X Animoji and forthcoming Memoji in iOS 12.
Obviously, I can't say too much about it, but my PhD colleagues and I developed Faceshift, the first end-to-end facial-animation solution to use a Kinect sensor to produce high-quality animations, built on state-of-the-art techniques that we published at top computer graphics conferences.

How did the Faceshift system work?
It was easy to use, real-time, and markerless, so it did not require trained artists to operate. Apple acquired PrimeSense, the depth-sensor manufacturer behind Microsoft's Kinect sensor, and subsequently the startup Faceshift. A variant of the original technology was then introduced as part of the iPhone X, where users can bring their Animoji to life using their facial movements.
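For readers curious what "driving an avatar with facial movements" looks like under the hood, here is a minimal, hypothetical blendshape sketch in Python. The mesh, shape names, and weights are invented for illustration and do not reflect Faceshift's or Apple's actual APIs; they only show the general blendshape technique that markerless facial animation commonly builds on.

```python
# Hypothetical sketch of blendshape-driven facial animation, the general
# technique behind markerless face-capture systems of this kind. All names,
# weights, and mesh data below are made up for illustration; not a real API.
import numpy as np

rng = np.random.default_rng(1)
n_verts = 468                             # arbitrary vertex count for a toy face mesh
neutral = rng.normal(size=(n_verts, 3))   # neutral (resting) face vertices

# Each blendshape stores per-vertex offsets from the neutral pose.
blendshapes = {
    "jaw_open":    0.1 * rng.normal(size=(n_verts, 3)),
    "smile_left":  0.1 * rng.normal(size=(n_verts, 3)),
    "smile_right": 0.1 * rng.normal(size=(n_verts, 3)),
    "brow_raise":  0.1 * rng.normal(size=(n_verts, 3)),
}

def pose_face(weights):
    """Blend the neutral mesh with weighted expression offsets.

    `weights` maps blendshape names to values in [0, 1], the kind of
    per-frame output a face tracker produces from depth or RGB input.
    """
    face = neutral.copy()
    for name, w in weights.items():
        face += w * blendshapes[name]
    return face

# Example frame: the tracker reports a half-open jaw and a broad smile.
frame_weights = {"jaw_open": 0.5, "smile_left": 0.8, "smile_right": 0.8}
posed = pose_face(frame_weights)
print(posed.shape)                        # (468, 3): deformed vertices for this frame
```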

Many social networks are bullish on real-time immersive 3D interaction. How close are we to seeing this become a reality?
Nowadays, when we want to have a meeting remotely, video conferencing is the only option, but it can't replace an in-person meeting. We use subtle gestures and micro facial expressions to communicate, solve problems, negotiate, and to convey emotions.

So much of human-human communication is non-verbal.
Exactly. So communication has to be in 3D, especially when multiple parties are involved and when spatial, collaborative tasks need to be performed. Such interaction is hard to imagine without some form of immersive display technology. While hardware adoption still has some way to go (VR/AR displays need to become more ergonomic and achieve better quality), 3D content creation is what we're working on right now; compelling 3D content is very time-consuming and costly to produce.

Can you give us an idea of a typical 3D content creation team size?
It typically involves a game studio and an army of artists, but we are working on computer vision and AI-driven technologies that can enable automated, scalable content creation accessible to the masses. For instance, our technology at Pinscreen can create complete CG characters from a single photograph in a matter of seconds, something that would take a production studio months to generate from scratch.

What's the future for this type of endeavor?
In the next few years, we'll reach photorealistic output that is fully computer-generated and indistinguishable from reality. We'll be able to teleport ourselves into VR and seamlessly interact with each other remotely; mobile devices with holographic displays, such as the RED Hydrogen One/LEIA, will allow us to see each other in 3D; and an explosion of AR/VR/gaming/fashion content will be available with personalized avatars.

Aside from Hollywood and Silicon Valley, your dynamic shape reconstruction research is widely used in both defense and biomedical fields today. You were recently awarded the Office of Naval Research (ONR) Young Investigator Award. Is there anything you can say about this work?
Our research in digitizing humans is sponsored by various DoD agencies such as the Office of Naval Research, the Army Research Office, SRC, and DARPA. Our goal is to develop new artificial intelligence techniques that automate the generation of photorealistic digital humans from minimal input, so that non-experts can create 3D content at scale, and 3D avatars can become accessible to consumers for immersive communication purposes.

Finally, how many instances of "Hao Li" (virtual edition) are out there in the world? And do you interact with your (younger, digital) selves often?
[Laughs] I'm creating a few of myself every day. Mostly the present one. But I'm also exploring daily how I look in different bodies and clothing, and then teleporting myself into new virtual places.

For PCMag readers attending SIGGRAPH this year (Aug. 12-16) in Vancouver, Canada: Dr. Hao Li's work has been selected for Real-Time Live.
