Sirui Xie

Ph.D. in Computer Science
Research Scientist, Google DeepMind
Email: srxie [at] ucla [dot] edu

Google Scholar | LinkedIn | Twitter | GitHub

I am a Research Scientist at Google DeepMind. I received my Ph.D. in Computer Science from UCLA, advised by Prof. Ying Nian Wu, Prof. Demetri Terzopoulos, and Prof. Song-Chun Zhu. Previously, I conducted research at Meta FAIR, Amazon AWS AI, and SenseTime Research. I obtained my Bachelor's degree from The Hong Kong University of Science and Technology (HKUST).

I am broadly interested in fundamental problems in Machine Learning and Artificial Intelligence, including Generative Modeling, Sequential Decision-Making, and Representation Learning. My Ph.D. thesis centers on the statistical and representational structures of latent-variable top-down models, as well as the associated inference and learning algorithms across various data modalities.

News

Publications / Preprints (Selected | All)

* indicates equal contribution.
 
thesis
Abstractions, the latent variables underlying our observations, are fundamental to human intelligence. Despite successes in modeling data distributions, Generative AI (GenAI) systems still lack robust principles for unsupervised learning of latent abstractions. This thesis investigates generative modeling of these latent variables to address GenAI systems' bottlenecks in alignment, efficiency, and consistency in representation, inference, and decision-making.
emd
To minimize the cost of sampling diffusion models, we propose EM Distillation (EMD), a Maximum Likelihood method that distills diffusion models to 1-step generators. EMD is inspired by Expectation-Maximization, where generators are updated using samples from the joint distribution of the diffusion teacher and inferred generator latents. We develop a reparametrized sampling scheme and a noise cancellation technique to stabilize the distillation process. EMD interpolates between mode-seeking and mode-covering KL, excelling in image generation tasks.
lpt
Decision-making via sequence modeling can be viewed as return-conditioned autoregressive behavior cloning. Because such models are unaware of their own future behaviors, they were thought to be susceptible to drifting errors. Decision Transformer alleviates this issue by additionally predicting the return-to-go labels. We propose an unsupervised solution, where a latent variable is first inferred from a target return and then guides the policy throughout the episode, functioning as a plan. Our model discovers improved decisions from suboptimal trajectories.
nmdp
Aoyang Qin*, Feng Gao, Qing Li, Song-Chun Zhu, Sirui Xie*
When imitating non-Markovian decisions, behavior cloning may be a preferable option over inverse reinforcement learning, which relies on a Markovian Bellman operator. We introduce a maximum likelihood estimation (MLE) framework that extends behavior cloning to state-only sequences. To gain insight into the acquired decision-making mechanism, we also derive the particular structure of its value function, establishing connections with (non-Markovian) soft Q-learning and soft policy iteration.
pictionary
Humans have a long history of communicating concepts with drawings. We model and simulate a transition from sequential sketch-drawing to a pictographic sign system via two neural agents playing a visual communication game. The evolved sketches show intrinsic structures, including iconicity, symbolicity, and semanticity. Co-adapted agents also show familiarity with conventions as they switch between abstract and iconic drawings to communicate seen and unseen concepts.
coat
Statistical independence and permutation invariance are two parallel assumptions for inducing object-centric representations, but they fail to account for the fact that certain spaces can only accommodate one object. We consider compositionality as the consistency of representational transformations when the same set of objects is added to different scenes. We design a geometric equivariance test and show that existing models seem to lack an understanding of the absence and the unique identity of an object.
drc
Disentangling a static visual scene into distinct representations of foreground and background is challenging due to the lack of independence and symmetry between these two components. Inspired by the Julesz ensemble, we propose a latent energy-based generative model, where a pixel reassignment in the background generator equalizes different texture instances. This model effectively captures the regularities in background regions, resolving spurious correlations in the representations. The learned disentanglement generalizes to images from previously unseen classes.
snas
Previous modeling of Neural Architecture Search (NAS) as a Markov Decision Process ignores its deterministic state transitions and fully delayed rewards. Such over-modeling may incur exponentially delayed convergence. We reformulate NAS as a stochastic optimization over a differentiable Markov Chain. SNAS learns operation parameters and architecture distributions in the same round of gradient updates.

Education

  • 2019.09 - 2024.09, University of California, Los Angeles
    PhD in Computer Science
  • 2012.09 - 2016.06, The Hong Kong University of Science and Technology
    BEng in Computer Science, First-Class Honors

Selected Awards

  • Graduate Research Assistantship, UCLA, 2019 - 2024
  • Full Scholarship, HKUST, 2012 - 2016

Professional Service

  • Conference Reviewer: NeurIPS, ICML, ICLR, AISTATS, AAAI, IJCAI, CVPR, ICCV, ECCV, ICRA
  • Journal Reviewer: IEEE T-PAMI, IEEE RA-L