News
- July 2024: My work on driving scene generation with diffusion models, "WoVoGen: World Volume-aware Diffusion for Controllable Multi-camera Driving Scene Generation", is accepted by ECCV 2024.
- July 2024: My work on Vision Transformers, "Vision Transformers: From Semantic Segmentation to Dense Prediction", appears in IJCV (July 2024).
- April 2024: I have received PhD offers from CMU, UC Berkeley, EPFL, and others in the Fall 2024 PhD application cycle.
- March 2024: My work on efficient Transformer theory, "Softmax-free Linear Transformers", appears in IJCV (March 2024).
- Feb 2024: My citations exceed 3,000 on Google Scholar!
- July 2023: My work on HD maps for autonomous driving, "Translating Images to Road Network: A Non-Autoregressive Sequence-to-Sequence Approach", is accepted as an ICCV 2023 Oral.
- June 2023: My work on 3D temporal detection, "SUIT: Learning Significance-guided Information for 3D Temporal Detection", is accepted by IROS 2023.
- March 2023: My work on generative perception models, "Generative Semantic Segmentation", is accepted by CVPR 2023.
- Jan 2023: My work on mobile Transformers, "SeaFormer: Squeeze-enhanced Axial Transformer for Mobile Semantic Segmentation", is accepted by ICLR 2023.
- Jan 2023: The 1st Workshop on End-to-End Autonomous Driving: Perception, Prediction, Planning and Simulation (E2EAD, CVPR 2023) is now open for submission on CMT.
- Dec 2022: Committee member of the E2EAD workshop at CVPR 2023.
|
Research
My research interests lie in computer vision. I have worked on semantic segmentation, Transformers, efficient Transformers, vision-based 3D detection, and bird's-eye-view road segmentation.
|
|
WoVoGen: World Volume-aware Diffusion for Controllable Multi-camera Driving Scene Generation
Jiachen Lu, Ze Huang, Zeyu Yang, Jiahui Zhang, Li Zhang
ECCV 2024
Paper
Generating multi-camera street-view videos is critical for augmenting autonomous driving datasets, addressing the urgent demand for extensive and varied data. Due to the limitations in diversity and challenges in handling lighting conditions, traditional rendering-based methods are increasingly being supplanted by diffusion-based methods. However, a significant challenge in diffusion-based methods is ensuring that the generated sensor data preserve both intra-world consistency and inter-sensor coherence. To address these challenges, we incorporate an additional explicit world volume and propose the World Volume-aware Multi-camera Driving Scene Generator (WoVoGen). ......
|
|
Vision Transformers: From Semantic Segmentation to Dense Prediction
Li Zhang, Jiachen Lu, Sixiao Zheng, Xinxuan Zhao, Xiatian Zhu, Yanwei Fu, Tao Xiang, Jianfeng Feng, Philip HS Torr
IJCV (July 2024)
Paper/Code
... However, the basic ViT architecture falls short in broader dense prediction applications, such as object detection and instance segmentation, due to its lack of a pyramidal structure, high computational demand, and insufficient local context. To tackle general dense visual prediction tasks in a cost-effective manner, we further formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global attention across windows in a pyramidal architecture. Extensive experiments show that our methods achieve appealing performance on a variety of dense prediction tasks (e.g., object detection, instance segmentation, and semantic segmentation) as well as image classification.
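For intuition, a minimal PyTorch sketch of the local/global attention split described above is given below; the window size, pooling, and head settings are illustrative assumptions rather than the exact HLG block.

```python
import torch
import torch.nn as nn

class LocalGlobalAttention(nn.Module):
    """Illustrative local-within-window + global-across-window attention (not the exact HLG block)."""
    def __init__(self, dim: int = 64, window: int = 7):
        super().__init__()
        self.window = window
        self.local_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) feature map; H and W are assumed divisible by the window size
        B, H, W, C = x.shape
        w = self.window
        # Local attention: tokens attend only within their own (w x w) window
        win = x.view(B, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
        win = win.reshape(B * (H // w) * (W // w), w * w, C)
        local, _ = self.local_attn(win, win, win)
        local = local.view(B, H // w, W // w, w, w, C).permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
        # Global attention: every token attends to one mean-pooled summary token per window
        pooled = x.view(B, H // w, w, W // w, w, C).mean(dim=(2, 4)).reshape(B, -1, C)
        tokens = local.reshape(B, H * W, C)
        global_out, _ = self.global_attn(tokens, pooled, pooled)
        return (tokens + global_out).view(B, H, W, C)

# e.g. LocalGlobalAttention()(torch.randn(2, 14, 14, 64)).shape == (2, 14, 14, 64)
```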
|
|
Softmax-Free Linear Transformers
Jiachen Lu, Junge Zhang, Xiatian Zhu, Jianfeng Feng, Tao Xiang, Li Zhang
IJCV (March 2024)
Paper/Code
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks. The self-attention mechanism underpinning the strength of ViTs has a quadratic complexity in both computation and memory usage. This motivates the development of self-attention approximations with linear complexity. However, an in-depth analysis in this work reveals that existing methods are either theoretically flawed or empirically ineffective for visual recognition. We identify that their limitations are rooted in the inheritance of softmax-based self-attention during approximations, that is, normalizing the scaled dot-product between token feature vectors using the softmax function. ......
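As background for why the quadratic cost arises and how a kernel-based reordering removes it, here is a generic linear-attention sketch in PyTorch; it illustrates the reassociation trick only and is not the softmax-free formulation proposed in this paper.

```python
import torch

def softmax_attention(q, k, v):
    # q, k, v: (B, N, d). Forming the explicit N x N attention matrix makes this
    # O(N^2) in both time and memory.
    attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v

def linear_attention(q, k, v, eps=1e-6):
    # Generic kernelized attention: replace softmax with a positive feature map phi
    # and reassociate the product as phi(Q) (phi(K)^T V), giving O(N d^2) cost.
    phi = lambda x: torch.nn.functional.elu(x) + 1.0
    q, k = phi(q), phi(k)
    kv = k.transpose(-2, -1) @ v                                   # (B, d, d)
    denom = q @ k.sum(dim=1, keepdim=True).transpose(-2, -1)       # (B, N, 1)
    return (q @ kv) / (denom + eps)

q = k = v = torch.randn(2, 1024, 64)
print(softmax_attention(q, k, v).shape, linear_attention(q, k, v).shape)
```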
|
|
Self-organizing Agents in Open-ended Environments
Jiaqi Chen, Yuxian Jiang, Jiachen Lu, Li Zhang
ICLR 2024 Workshop on Large Language Model (LLM) Agents
Paper/Code
Leveraging large language models (LLMs), autonomous agents have significantly improved, gaining the ability to handle a variety of tasks. In open-ended settings, optimizing collaboration for efficiency and effectiveness demands flexible adjustments. Despite this, current research mainly emphasizes fixed, task-oriented workflows and overlooks agent-centric organizational structures. Drawing inspiration from human organizational behavior, we introduce a self-organizing agent system (S-Agents) with a "tree of agents" structure for dynamic workflow, an "hourglass agent architecture" for balancing information priorities, and a "non-obstructive collaboration" method to allow asynchronous task execution among agents. ......
|
|
Enhancing High-Resolution 3D Generation through Pixel-wise Gradient Clipping
Zijie Pan, Jiachen Lu, Xiatian Zhu, Li Zhang
ICLR 2024
Paper
High-resolution 3D object generation remains a challenging task primarily due to the limited availability of comprehensive annotated training data. Recent advancements have aimed to overcome this constraint by harnessing image generative models, pretrained on extensive curated web datasets, using knowledge transfer techniques like Score Distillation Sampling (SDS). Efficiently addressing the requirements of high-resolution rendering often necessitates the adoption of latent representation-based models, such as the Latent Diffusion Model (LDM). In this framework, a significant challenge arises: To compute gradients for individual image pixels, it is necessary to backpropagate gradients from the designated latent space through the frozen components of the image model, such as the VAE encoder used within LDM. However, this gradient propagation pathway has never been optimized, remaining uncontrolled during training. We find that the unregulated gradients adversely affect the 3D model's capacity to acquire texture-related information from the image generative model, leading to poor quality appearance synthesis. ......
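As a rough illustration of regulating the gradients that flow back through the frozen encoder to individual pixels, here is a PyTorch sketch using a backward hook; the per-pixel norm-clipping rule and the threshold value are assumptions for illustration, not necessarily the exact operator in the paper.

```python
import torch

def clip_pixel_gradients(image: torch.Tensor, max_norm: float = 0.1) -> torch.Tensor:
    """Register a backward hook so the gradient arriving at each pixel is norm-clipped."""
    def hook(grad: torch.Tensor) -> torch.Tensor:
        # grad: (B, C, H, W); per-pixel gradient norm taken over the channel dimension
        norm = grad.norm(dim=1, keepdim=True).clamp(min=1e-12)
        return grad * (max_norm / norm).clamp(max=1.0)
    image.register_hook(hook)
    return image

# Hypothetical usage: the rendered image from the 3D model, hooked before it would
# enter the frozen encoder; the squared-mean loss is only a stand-in for the SDS objective.
rendered = torch.rand(1, 3, 512, 512, requires_grad=True)
clip_pixel_gradients(rendered, max_norm=0.05)
loss = (rendered ** 2).mean()
loss.backward()
print(rendered.grad.norm(dim=1).max())  # per-pixel gradient norms are now bounded by max_norm
```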
|
|
Translating Images to Road Network: A Non-Autoregressive Sequence-to-Sequence Approach
Jiachen Lu, Renyuan Peng, Xinyue Cai, Hang Xu, Hongyang Li, Feng Wen, Wei Zhang, Li Zhang
ICCV 2023 [Oral]
Paper/Code
The extraction of road networks is essential for the generation of high-definition maps since it enables the precise localization of road landmarks and their interconnections. However, generating road networks poses a significant challenge due to the conflicting underlying combination of Euclidean (e.g., road landmark locations) and non-Euclidean (e.g., road topological connectivity) structures. Existing methods struggle to merge the two data domains effectively, and few of them address this properly. Instead, our work establishes a unified representation of both data domains by projecting both Euclidean and non-Euclidean data into an integer series called RoadNet Sequence. ......
|
|
SUIT: Learning Significance-guided Information for 3D Temporal Detection
Zheyuan Zhou, Jiachen Lu, Yihan Zeng, Hang Xu, Li Zhang
IROS 2023 [Oral]
3D object detection from LiDAR point clouds is of critical importance for autonomous driving and robotics. While sequential point clouds have the potential to enhance 3D perception through temporal information, utilizing these temporal features effectively and efficiently remains a challenging problem. Based on the observation that the foreground information is sparsely distributed in LiDAR scenes, we believe sufficient knowledge can be provided by a sparse format rather than dense maps. To this end, we propose to learn Significance-gUided Information for 3D Temporal detection (SUIT), which simplifies temporal information as sparse features for information fusion across frames. ......
|
|
Generative Semantic Segmentation
Jiaqi Chen, Jiachen Lu, Xiatian Zhu, Li Zhang
CVPR 2023
Paper/Code
We present Generative Semantic Segmentation (GSS), a generative learning approach for semantic segmentation. Uniquely, we cast semantic segmentation as an image-conditioned mask generation problem. This is achieved by replacing the conventional per-pixel discriminative learning with a latent prior learning process. Specifically, we model the variational posterior distribution of latent variables given the segmentation mask. To that end, the segmentation mask is expressed with a special type of image (dubbed maskige). This posterior distribution allows us to generate segmentation masks unconditionally. To achieve semantic segmentation on a given image, we further introduce a conditioning network. It is optimized by minimizing the divergence between the posterior distribution of maskige (i.e., segmentation masks) and the latent prior distribution of input training images. Extensive experiments on standard benchmarks show that our GSS can perform competitively with prior art alternatives in the standard semantic segmentation setting, whilst achieving a new state of the art in the more challenging cross-domain setting.
|
|
SeaFormer: Squeeze-enhanced Axial Transformer for Mobile Semantic Segmentation
Qiang Wan, Zilong Huang, Jiachen Lu, Gang Yu, Li Zhang
ICLR 2023
Paper/Code
Since the introduction of Vision Transformers, the landscape of many computer vision tasks (e.g., semantic segmentation), which had been overwhelmingly dominated by CNNs, has recently undergone a significant revolution. However, the computational cost and memory requirements render these methods unsuitable for mobile devices, especially for the high-resolution per-pixel semantic segmentation task. In this paper, we introduce a new method, the squeeze-enhanced Axial Transformer (SeaFormer), for mobile semantic segmentation. Specifically, we design a generic attention block characterized by the formulation of squeeze Axial and spatial enhancement. It can be further used to create a family of backbone architectures with superior cost-effectiveness. ......
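For background, a plain axial-attention sketch (rows then columns) is shown below; the squeeze and enhancement components that give SeaFormer its efficiency are not reproduced here, so treat it only as the generic idea the block builds on.

```python
import torch
import torch.nn as nn

class AxialAttention(nn.Module):
    """Plain axial attention: attend along rows, then along columns."""
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.row_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C); two axis-wise passes cost O(HW(H + W)) instead of O((HW)^2)
        B, H, W, C = x.shape
        rows = x.reshape(B * H, W, C)
        rows, _ = self.row_attn(rows, rows, rows)
        x = rows.reshape(B, H, W, C)
        cols = x.permute(0, 2, 1, 3).reshape(B * W, H, C)
        cols, _ = self.col_attn(cols, cols, cols)
        return cols.reshape(B, W, H, C).permute(0, 2, 1, 3)

# e.g. AxialAttention()(torch.randn(2, 32, 32, 64)).shape == (2, 32, 32, 64)
```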
|
|
Learning Ego 3D Representation as Ray Tracing
Jiachen Lu, Zheyuan Zhou, Xiatian Zhu, Hang Xu, Li Zhang
17th European Conference on Computer Vision (ECCV 2022)
Paper/Code
A self-driving perception model aims to extract 3D semantic representations from multiple cameras collectively into the bird's-eye-view (BEV) coordinate frame of the ego car in order to ground the downstream planner. Existing perception methods often rely on error-prone depth estimation of the whole scene or learning sparse virtual 3D representations without the target geometry structure, both of which remain limited in performance and/or capability. In this paper, we present a novel end-to-end architecture for ego 3D representation learning from an arbitrary number of unconstrained camera views. Inspired by the ray tracing principle, we design a polarized grid of "imaginary eyes" as the learnable ego 3D representation and formulate the learning process with the adaptive attention mechanism in conjunction with the 3D-to-2D projection. ......
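A sketch of the generic 3D-to-2D projection-and-sampling step that BEV models of this kind rely on is shown below; the polarized grid and adaptive attention of the paper are not reproduced, and the camera conventions (intrinsics expressed at feature-map resolution, camera-to-ego extrinsics) are assumptions.

```python
import torch
import torch.nn.functional as F

def sample_bev_features(feats, grid_xyz, intrinsics, cam_to_ego):
    """feats: (C, Hf, Wf) feature map of one camera; grid_xyz: (N, 3) ego-frame grid points;
    intrinsics: (3, 3), assumed to be expressed at the feature-map resolution;
    cam_to_ego: (4, 4) camera-to-ego rigid transform."""
    # Move grid points from the ego frame into the camera frame
    ego_to_cam = torch.linalg.inv(cam_to_ego)
    pts = torch.cat([grid_xyz, torch.ones(len(grid_xyz), 1)], dim=1) @ ego_to_cam.T
    depth = pts[:, 2:3].clamp(min=1e-5)
    # Pinhole projection to (u, v) pixel coordinates
    uv = (pts[:, :3] / depth) @ intrinsics.T
    # Normalise to [-1, 1] and bilinearly sample the feature map
    Hf, Wf = feats.shape[1:]
    norm = torch.stack([uv[:, 0] / (Wf - 1), uv[:, 1] / (Hf - 1)], dim=-1) * 2 - 1
    sampled = F.grid_sample(feats[None], norm[None, None], align_corners=True)
    # Zero out points behind the camera or projecting outside the image
    valid = (pts[:, 2] > 0) & (norm.abs() <= 1).all(dim=-1)
    return sampled[0, :, 0].T * valid[:, None]   # (N, C) per-grid-point features
```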
|
|
SOFT: Softmax-free Transformer with Linear Complexity
Jiachen Lu, Jinghan Yao, Junge Zhang, Xiatian Zhu, Hang Xu, Weiguo Gao, Chunjing Xu, Tao Xiang, Li Zhang
NeurIPS 2021 [Spotlight]
Paper/Code
Vision transformers (ViTs) have pushed the state-of-the-art for various visual recognition tasks by patch-wise image tokenization followed by self-attention. However, the employment of self-attention modules results in a quadratic complexity in both computation and memory usage. Various attempts at approximating the self-attention computation with linear complexity have been made in natural language processing. However, an in-depth analysis in this work shows that they are either theoretically flawed or empirically ineffective for visual recognition. We further identify that their limitations are rooted in keeping the softmax self-attention during approximations. Specifically, conventional self-attention is computed by normalizing the scaled dot-product between token feature vectors. Keeping this softmax operation challenges any subsequent linearization efforts. ......
|
|
Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers
Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip HS Torr, Li Zhang
CVPR 2021 [Cited by 3000+]
Paper/Code
Most recent semantic segmentation methods adopt a fully-convolutional network (FCN) with an encoder-decoder architecture. The encoder progressively reduces the spatial resolution and learns more abstract/semantic visual concepts with larger receptive fields. Since context modeling is critical for segmentation, the latest efforts have been focused on increasing the receptive field, through either dilated/atrous convolutions or inserting attention modules. However, the encoder-decoder based FCN architecture remains unchanged. In this paper, we aim to provide an alternative perspective by treating semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer (i.e., without convolution and resolution reduction) to encode an image as a sequence of patches. With the global context modeled in every layer of the transformer, this encoder can be combined with a simple decoder to provide a powerful segmentation model, termed SEgmentation TRansformer (SETR). Extensive experiments show that SETR achieves a new state of the art on ADE20K (50.28% mIoU), Pascal Context (55.83% mIoU) and competitive results on Cityscapes. Particularly, we achieve the first position in the highly competitive ADE20K test server leaderboard on the day of submission.
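A compact stand-in for this sequence-to-sequence view is sketched below: flatten the image into patch tokens, encode them with a convolution-free transformer, and map tokens back to per-pixel classes. The depths, dimensions, and the omission of positional embeddings are simplifications for illustration, not the SETR configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySETR(nn.Module):
    """Toy patch-sequence segmenter in the spirit of SETR (not the paper's configuration)."""
    def __init__(self, num_classes=19, patch=16, dim=256, depth=4):
        super().__init__()
        self.patch = patch
        self.embed = nn.Linear(3 * patch * patch, dim)     # linear patch embedding
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.classify = nn.Linear(dim, num_classes)        # "simple decoder": per-token classes

    def forward(self, img):
        B, C, H, W = img.shape
        p = self.patch
        # Image -> sequence of flattened patches; no resolution reduction inside the encoder
        # (positional embeddings omitted here for brevity)
        tokens = img.unfold(2, p, p).unfold(3, p, p)                        # (B, C, H/p, W/p, p, p)
        tokens = tokens.permute(0, 2, 3, 1, 4, 5).reshape(B, (H // p) * (W // p), -1)
        tokens = self.encoder(self.embed(tokens))
        logits = self.classify(tokens).transpose(1, 2).reshape(B, -1, H // p, W // p)
        return F.interpolate(logits, size=(H, W), mode="bilinear", align_corners=False)

# e.g. TinySETR()(torch.randn(1, 3, 224, 224)).shape == (1, 19, 224, 224)
```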
|
Awards and Honors
- 2024: Shanghai Outstanding Master's Graduate (上海市优秀硕士毕业生).
- 2023: 2022-2023 China National Scholarship (国家奖学金).
- 2022: 2021-2022 China National Scholarship (国家奖学金).
- 2021: Shanghai Outstanding Graduate (上海市优秀毕业生).
- 2020: 2019-2020 John Wu & Jane Sun Sunshine Scholarship of SJTU.
- 2019: 3rd Prize of Formula Student Autonomous China.
- 2019: 2018-2019 John Wu & Jane Sun Sunshine Scholarship of SJTU.
- 2019: 2018-2019 China National Scholarship (国家奖学金).
- 2018: 2017-2018 Yuliming Scholarship of SJTU.
|
Languages
- Strong reading, writing, speaking, and listening competencies in Mandarin Chinese and English.
- Learning French, Japanese, and German.
|
|