Xueting Li
Senior Research Scientist, NVIDIA

I am a computer vision researcher in the LPR team of NVIDIA Research. I received my Ph.D. from the University of California, Merced, where I was advised by Ming-Hsuan Yang. Before joining UC Merced, I obtained my M.S. degree from Tsinghua University (THU) in 2016 and my B.S. degree from Beijing University of Posts and Telecommunications (BUPT) in 2013. I am interested in all 3D vision-related topics.


Education
  • University of California, Merced
    PhD in EECS
    Sep. 2017 - Dec. 2021
  • Tsinghua University
    M.S. in Software Engineering
    Sep. 2013 - Jun. 2016
  • Beijing University of Posts and Telecommunications
    B.S. in Computer Science
    Sep. 2009 - Jun. 2013
Honors & Awards
  • NVIDIA PhD Fellowship
    2019
  • Google Graduate Fellowship
    2020
News
2024
Pyramid Diffusion is accepted to ECCV 2024.
Jul 20
Dream-in-4d and GAvatar are accepted to CVPR 2024.
Jan 30
Selected Publications
Pyramid Diffusion for Fine 3D Large Scene Generation

Yuheng Liu, Xinke Li, Xueting Li, Lu Qi, Chongshou Li, Ming-Hsuan Yang

European Conference on Computer Vision (ECCV) 2024 Oral

Diffusion models have shown remarkable results in generating 2D images and small-scale 3D objects. However, their application to the synthesis of large-scale 3D scenes has rarely been explored. This is mainly due to the inherent complexity and large size of 3D scene data, particularly for outdoor scenes, and the limited availability of comprehensive real-world datasets, which makes training a stable scene diffusion model challenging. In this work, we explore how to effectively generate large-scale 3D scenes using a coarse-to-fine paradigm. We introduce the Pyramid Discrete Diffusion model (PDD), a framework that employs scale-varied diffusion models to progressively generate high-quality outdoor scenes. Experiments demonstrate that PDD can generate 3D scenes both unconditionally and conditionally. We further showcase the data compatibility enabled by its multi-scale architecture: a PDD model trained on one dataset can be easily fine-tuned on another dataset.
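
The coarse-to-fine pipeline can be pictured as a short cascade: each scale starts from noise, is conditioned on an upsampled version of the previous (coarser) scene, and is denoised by its own scale-specific model. Below is a minimal, hypothetical Python sketch of that loop; the denoiser is a stub, and all names, resolutions, and the nearest-neighbour upsampling are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def upsample(scene, factor=2):
    """Nearest-neighbour upsampling of a semantic voxel grid (illustrative choice)."""
    return scene.repeat(factor, axis=0).repeat(factor, axis=1).repeat(factor, axis=2)

def denoise_at_scale(noisy, condition, steps):
    """Stub for one scale-specific discrete diffusion model.
    A real model would iteratively denoise categorical voxel labels over `steps`
    steps, optionally conditioned on the upsampled coarser scene."""
    return noisy if condition is None else condition  # placeholder only

def generate_scene(scales=(16, 32, 64), num_classes=20, steps=50):
    """Generate a semantic voxel scene coarse-to-fine across the pyramid."""
    scene = None
    for res in scales:
        noisy = np.random.randint(0, num_classes, size=(res, res, res))
        cond = None if scene is None else upsample(scene, factor=res // scene.shape[0])
        scene = denoise_at_scale(noisy, cond, steps)
    return scene

print(generate_scene().shape)  # (64, 64, 64) semantic voxel grid
```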

GAvatar: Animatable 3D Gaussian Avatars with Implicit Mesh Learning

Ye Yuan*, Xueting Li*, Yangyi Huang, Shalini De Mello, Koki Nagano, Jan Kautz, Umar Iqbal (* equal contribution)

Conference on Computer Vision and Pattern Recognition (CVPR) 2024 Highlight

Gaussian splatting has emerged as a powerful 3D representation that harnesses the advantages of both explicit (mesh) and implicit (NeRF) 3D representations. In this paper, we seek to leverage Gaussian splatting to generate realistic animatable avatars from textual descriptions, addressing the limitations (e.g., efficiency and flexibility) imposed by mesh- or NeRF-based representations. However, a naive application of Gaussian splatting cannot generate high-quality animatable avatars and suffers from learning instability; it also cannot capture fine avatar geometries and often leads to degenerate body parts. To tackle these problems, we first propose a primitive-based 3D Gaussian representation where Gaussians are defined inside pose-driven primitives to facilitate animations. Second, to stabilize and amortize the learning of millions of Gaussians, we propose to use implicit neural fields to predict the Gaussian attributes (e.g., colors). Finally, to capture fine avatar geometries and extract detailed meshes, we propose a novel SDF-based implicit mesh learning approach for 3D Gaussians that regularizes the underlying geometries and extracts highly detailed textured meshes. Our proposed method, GAvatar, enables the large-scale generation of diverse animatable avatars using only text prompts. GAvatar significantly surpasses existing methods in terms of both appearance and geometry quality and achieves extremely fast rendering (100 fps) at 1K resolution.
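
As a rough illustration of the second idea (amortizing per-Gaussian parameters with an implicit neural field), the PyTorch sketch below maps a Gaussian's primitive index and local coordinate to its attributes, so millions of Gaussians share one compact network. The layer sizes, the 2D local parameterization, and the attribute layout are assumptions for illustration and are not taken from the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianAttributeField(nn.Module):
    """Maps a Gaussian's (primitive id, local uv coordinate) to its attributes."""
    def __init__(self, num_primitives=64, feat_dim=32, hidden=128):
        super().__init__()
        self.primitive_embed = nn.Embedding(num_primitives, feat_dim)
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3 + 3 + 4 + 3 + 1),  # offset, scale, rotation (quat), rgb, opacity
        )

    def forward(self, prim_id, uv):
        h = torch.cat([self.primitive_embed(prim_id), uv], dim=-1)
        offset, scale, rot, rgb, opacity = self.mlp(h).split([3, 3, 4, 3, 1], dim=-1)
        return offset, scale.exp(), F.normalize(rot, dim=-1), rgb.sigmoid(), opacity.sigmoid()

# attributes for 10,000 Gaussians attached to the first pose-driven primitive
field = GaussianAttributeField()
prim_id = torch.zeros(10_000, dtype=torch.long)
uv = torch.rand(10_000, 2)
offset, scale, rot, rgb, opacity = field(prim_id, uv)
```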

Physics-based Indirect Illumination for Inverse Rendering

Youming Deng, Xueting Li, Sifei Liu, Ming-Hsuan Yang

International Conference on 3D Vision (3DV) 2024

We present a physics-based inverse rendering method that learns the illumination, geometry, and materials of a scene from posed multi-view RGB images. To model the illumination of a scene, existing inverse rendering works either completely ignore indirect illumination or model it with coarse approximations, leading to sub-optimal illumination, geometry, and material prediction. In this work, we propose a physics-based illumination model that first locates surface points through an efficient refined sphere tracing algorithm, then explicitly traces the incoming indirect lights at each surface point based on reflection. Each identified indirect light is then estimated by an efficient neural network. Moreover, inspired by differentiable irradiance in computer graphics, we utilize Leibniz's integral rule to resolve the non-differentiability caused by boundary lights in the proposed illumination model. As a result, the proposed differentiable illumination model can be learned end-to-end together with geometry and material estimation. As a by-product, our physics-based inverse rendering model also facilitates flexible and realistic material editing as well as relighting. Extensive experiments on synthetic and real-world datasets demonstrate that the proposed method performs favorably against existing inverse rendering methods on novel view synthesis and inverse rendering.
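
The surface-location step mentioned above can be illustrated with plain sphere tracing against a signed distance function. The sketch below traces rays against an analytic unit-sphere SDF; it is only meant to show the mechanics of marching by the SDF value, and it does not reproduce the paper's refined sphere tracing, neural SDF, or indirect-light estimation.

```python
import torch

def sdf_sphere(p, radius=1.0):
    """Signed distance to a unit sphere at the origin; stands in for a learned SDF."""
    return p.norm(dim=-1, keepdim=True) - radius

def sphere_trace(origins, dirs, sdf, max_steps=64, eps=1e-4, far=10.0):
    """March each ray forward by the current SDF value until a surface (or the far bound) is reached."""
    t = torch.zeros(origins.shape[0], 1)
    for _ in range(max_steps):
        d = sdf(origins + t * dirs)
        if bool((d.abs() < eps).all()) or bool((t > far).all()):
            break
        t = t + d
    points = origins + t * dirs
    hit = (sdf(points).abs() < eps) & (t <= far)
    return points, hit

origins = torch.tensor([[0.0, 0.0, -3.0]])   # ray origin on the camera side
dirs = torch.tensor([[0.0, 0.0, 1.0]])       # unit ray direction
points, hit = sphere_trace(origins, dirs, sdf_sphere)
print(points, hit)  # surface point near (0, 0, -1)
```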

Generalizable one-shot 3D neural head avatar

Xueting Li, Shalini De Mello, Sifei Liu, Koki Nagano, Umar Iqbal, Jan Kautz

Neural Information Processing Systems (NeurIPS) 2023

We present a method that reconstructs and animates a 3D head avatar from a single-view portrait image. Existing methods either involve time-consuming optimization for a specific person with multiple images, or they struggle to synthesize intricate appearance details beyond the facial region. To address these limitations, we propose a framework that not only generalizes to unseen identities based on a single-view image without requiring person-specific optimization, but also captures characteristic details within and beyond the face area (e.g., hairstyle, accessories, etc.). At the core of our method are three branches that produce three tri-planes representing the coarse 3D geometry and detailed appearance of a source image, as well as the expression of a target image. By applying volumetric rendering to the combination of the three tri-planes followed by a super-resolution module, our method yields a high-fidelity image of the desired identity, expression, and pose. Once trained, our model enables efficient 3D head avatar reconstruction and animation via a single forward pass through a network. Experiments show that the proposed approach generalizes well to unseen validation datasets, surpassing SOTA baseline methods by a large margin on head avatar reconstruction and animation.
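
The tri-plane machinery referenced in the abstract can be sketched compactly: features are sampled from three axis-aligned planes at each query point and aggregated before volume rendering. In the hypothetical PyTorch snippet below, the three tri-planes are combined by simple addition as one possible choice, the rendering and super-resolution stages are omitted, and channel counts and resolutions are made up.

```python
import torch
import torch.nn.functional as F

def sample_triplane(planes, xyz):
    """planes: (3, C, H, W) feature maps for the xy, xz, and yz planes; xyz in [-1, 1]^3."""
    coords = torch.stack([xyz[:, [0, 1]], xyz[:, [0, 2]], xyz[:, [1, 2]]], dim=0)  # (3, N, 2)
    grid = coords.unsqueeze(1)                                # (3, 1, N, 2) for grid_sample
    feats = F.grid_sample(planes, grid, align_corners=True)   # (3, C, 1, N)
    return feats.squeeze(2).sum(dim=0).t()                    # (N, C), aggregated over the 3 planes

# coarse-geometry, appearance, and expression tri-planes, combined here by addition
geometry, appearance, expression = (torch.randn(3, 32, 64, 64) for _ in range(3))
xyz = torch.rand(1024, 3) * 2 - 1          # query points along camera rays
features = sample_triplane(geometry + appearance + expression, xyz)
print(features.shape)                       # torch.Size([1024, 32]), fed to volume rendering
```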

Self-supervised Single-view 3D Reconstruction

Xueting Li, Sifei Liu, Kihwan Kim, Shalini De Mello, Varun Jampani, Ming-Hsuan Yang, Jan Kautz

European Conference on Computer Vision (ECCV) 2020

We learn a self-supervised, single-view 3D reconstruction model that predicts the 3D mesh shape, texture, and camera pose of a target object, trained with only a collection of 2D images and silhouettes. The proposed method does not require 3D supervision, manually annotated keypoints, multi-view images of an object, or a prior 3D template. The key insight of our work is that objects can be represented as a collection of deformable parts, and each part is semantically coherent across different instances of the same category (e.g., wings on birds and wheels on cars). Therefore, by leveraging part segmentations learned in a self-supervised manner from a large collection of category-specific images, we can effectively enforce semantic consistency between the reconstructed meshes and the original images. This significantly reduces ambiguities during the joint prediction of shape, camera pose, and texture of an object. To the best of our knowledge, we are the first to attempt to solve the single-view reconstruction problem without a category-specific template mesh or semantic keypoints. Thus, our model can easily generalize to various object categories without such labels, e.g., horses, penguins, etc. Through a variety of experiments on several categories of deformable and rigid objects, we demonstrate that our unsupervised method performs comparably to, if not better than, existing category-specific reconstruction methods learned with supervision.
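
The semantic-consistency idea can be boiled down to a loss that compares a part-probability map rendered from the reconstructed mesh with a 2D part segmentation of the input image. The toy PyTorch sketch below uses random placeholders for both maps; the differentiable renderer and the self-supervised part segmentation network are assumed and not shown.

```python
import torch

def semantic_consistency_loss(rendered_parts, image_parts, eps=1e-8):
    """Per-pixel cross-entropy between two distributions over K semantic parts.
    Both inputs have shape (B, K, H, W) and sum to 1 along the K dimension."""
    return -(image_parts * (rendered_parts + eps).log()).sum(dim=1).mean()

B, K, H, W = 2, 4, 64, 64
render_logits = torch.randn(B, K, H, W, requires_grad=True)   # stand-in for the differentiable renderer
rendered = torch.softmax(render_logits, dim=1)
predicted = torch.softmax(torch.randn(B, K, H, W), dim=1)      # stand-in for the 2D part segmentation
loss = semantic_consistency_loss(rendered, predicted)
loss.backward()   # gradients would flow back to shape, texture, and camera in the full model
print(loss.item())
```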

Putting Humans in a Scene: Learning Affordance in 3D Indoor Environments

Xueting Li, Sifei Liu, Kihwan Kim, Xiaolong Wang, Ming-Hsuan Yang, Jan Kautz

Conference on Computer Vision and Pattern Recognition (CVPR) 2019

Affordance modeling plays an important role in visual understanding. In this paper, we aim to predict affordances of 3D indoor scenes, specifically what human poses are afforded by a given indoor environment, such as sitting on a chair or standing on the floor. In order to predict valid affordances and learn possible 3D human poses in indoor scenes, we need to understand the semantic and geometric structure of a scene as well as its potential interactions with a human. To learn such a model, a large-scale dataset of 3D indoor affordances is required. In this work, we build a fully automatic 3D pose synthesizer that fuses semantic knowledge from a large number of 2D poses extracted from TV shows as well as 3D geometric knowledge from voxel representations of indoor scenes. With the data created by the synthesizer, we introduce a 3D pose generative model to predict semantically plausible and physically feasible human poses within a given scene (provided as a single RGB, RGB-D, or depth image). We demonstrate that our human affordance prediction method consistently outperforms existing state-of-the-art methods.
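
As a schematic of the generative side, the snippet below decodes 3D joint locations from a scene feature and a latent code, so that sampling several latents yields several pose hypotheses for the same scene. The scene encoder, the 21-joint pose format, and all layer sizes are hypothetical choices for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class PoseDecoder(nn.Module):
    """Decodes a 3D human pose (joint positions) from a scene feature and a latent code."""
    def __init__(self, scene_dim=256, latent_dim=32, num_joints=21):
        super().__init__()
        self.num_joints = num_joints
        self.net = nn.Sequential(
            nn.Linear(scene_dim + latent_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, num_joints * 3),
        )

    def forward(self, scene_feat, z):
        pose = self.net(torch.cat([scene_feat, z], dim=-1))
        return pose.view(-1, self.num_joints, 3)   # 3D joint locations in scene coordinates

decoder = PoseDecoder()
scene_feat = torch.randn(1, 256)   # would come from an RGB / RGB-D / depth scene encoder
poses = torch.stack([decoder(scene_feat, torch.randn(1, 32)) for _ in range(5)])
print(poses.shape)  # 5 sampled pose hypotheses for the same scene
```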
