Junwen Chen

Ph.D. Student at The University of Electro-Communications · Tokyo, Japan

Hello, I am Junwen Chen, from Sichuan, China. I am currently pursuing my Ph.D. at The University of Electro-Communications as a member of the Yanai Lab. My research focuses mainly on Human-Object Interaction Detection, Multimodal Large Language Models (MLLMs), and AI-Generated Content (AIGC).

Contact Me

Email: ohhthxplz@gmail.com
Location: Tokyo, Japan
GitHub (Personal): https://github.com/CUQS
GitHub (Research): https://github.com/cjw2021
Google Scholar: Scholar profile link

About

My research centers on deep learning for computer vision. Throughout my master's and doctoral studies, I have focused on improving the accuracy and generalization of Human-Object Interaction (HOI) detection methods. Recently, I have been exploring how to integrate Multimodal Large Language Models (MLLMs) and AI-Generated Content (AIGC) into my research.

Research Interests

Machine Learning, Deep Learning
Computer Vision: Object Detection, Image Segmentation, Visual Question Answering, Video Action Recognition
AIGC: Text-to-Image Generation, Multi-layer Image Generation, Image Editing

Education

The University of Electro-Communications · Ph.D.

2023/10 — Present

Major: Informatics

Research Theme: Improving the Efficiency and Generality of Human-Object Interaction Detection Methods

Deep Learning
Computer Vision
Human-Object Interaction Detection
Transformer
VLM
MLLM

Major Achievements:

1. Chen, Junwen, Peilin Xiong, and Keiji Yanai. "HOI-R1: Exploring the Potential of Multimodal Large Language Models for Human-Object Interaction Detection." International Conference on Pattern Recognition (ICPR). [PDF] [Code]
2. Chen, Junwen, and Keiji Yanai. "Bridging Detection Architectures with Foundation Models: A Unified Framework for Human-Object Interaction Detection." IEEE Access, doi: 10.1109/ACCESS.2026.3659132. 2025. [PDF]
3. Chen, Junwen, Yingcheng Wang, and Keiji Yanai. "Focusing on what to Decode and what to Train: SOV Decoding with Specific Target Guided DeNoising and Vision Language Advisor." 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). IEEE, 2025. [PDF] [Code]

The University of Electro-Communications · M.S.

2021/10 — 2023/09

Major: Informatics

Degree: Master of Informatics

Research Theme: Improvement of Human-Object Interaction Detection Methods and Their Application to Dietary Analysis

Deep Learning
Computer Vision
Human-Object Interaction Detection
Transformer

Major Achievements:

1. Chen, Junwen, and Keiji Yanai. "QAHOI: Query-based anchors for human-object interaction detection." 2023 18th International Conference on Machine Vision and Applications (MVA). IEEE, 2023. [PDF] [Code]
2. Chen, Junwen, and Keiji Yanai. "Parallel Queries for Human-Object Interaction Detection." Proceedings of the 4th ACM International Conference on Multimedia in Asia. 2022. [PDF]
3. Wang, Yingcheng, Junwen Chen, and Keiji Yanai. "HowToEat: Exploring Human Object Interaction and Eating Action in Eating Scenarios." Proceedings of the 8th International Workshop on Multimedia Assisted Dietary Management. 2023. [PDF]

North China University of Technology · B.S.

2016/09 — 2020/07

Major: Automation

Degree: Bachelor of Electrical and Control Engineering

Research Theme: Intelligent Driving Scene Segmentation with Deep Detection Model and Graph Convolutional Network

Deep Learning
Computer Vision
Instance Segmentation
Graph Neural Networks

Major Achievement:

1. Chen, J., Lu, Y., Chen, Y., Zhao, D., & Pang, Z. (2020, November). ContourRend: A segmentation method for improving contours by rendering. In International Symposium on Neural Networks (pp. 251-260). Cham: Springer International Publishing. [PDF] [Code]

Internships

Microsoft Research Asia · Full-time Research Intern

2024.10 — 2025.04 · Beijing

Researched layout-based and multi-layer image generation, and worked on a knowledge graph-based image generation benchmark.

Institute of Automation Chinese Academy of Sciences · Full-time Research Intern

2019.09 — 2020.06 · Beijing

Researched improvements to interactive instance segmentation methods and developed applications for the Huawei Atlas 200 DK AI Kit.

Awards

Best Paper Award at the 18th International Conference on Machine Vision and Applications (MVA 2023)

2023/06/25

This award has been given since 2011 to the authors of the most excellent paper from the viewpoint of machine vision applications.

[Official Site]

Student Encouragement Award, 86th National Convention of the Information Processing Society of Japan (IPSJ)

2024/03/15

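Presentation Title: Development of an Automatic Food Identification System for Refrigerator Contents Using Image Recognition Technology (画像認識技術を活用した冷蔵庫内食材自動判別システムの開発)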

I supported the student on this project as a teaching assistant.

[Award Info] [Official Site] [Paper]

Research

HOI-R1: Exploring the Potential of Multimodal Large Language Models for Human-Object Interaction Detection

2025/10

International Conference on Pattern Recognition (ICPR)

Authors: Junwen Chen, Peilin Xiong, Keiji Yanai

We propose HOI-R1, the first reinforcement learning–based MLLM framework for Human-Object Interaction Detection (HOID) without relying on external detection modules.
HOI-R1 introduces a text-based reasoning process and HOID-specific reward functions to enable pure language-driven interaction detection.
Experiments on HICO-DET demonstrate that HOI-R1 achieves twice the baseline accuracy and exhibits strong generalization capability.

Bridging Detection Architectures with Foundation Models: A Unified Framework for Human-Object Interaction Detection

2026/01

IEEE Access, doi: 10.1109/ACCESS.2026.3659132

Authors: Junwen Chen and Keiji Yanai

Hybrid-SOV aligns HOID with the latest object-detection paradigms by incorporating an efficient hybrid encoder and a query-selection mechanism that directly constructs HOI queries from visual features, eliminating predefined embeddings and enabling more interpretable decoding.
When coupled with DINO-v3 foundation features, Hybrid-SOV achieves state-of-the-art accuracy with superior inference efficiency.

Hybrid Layout Control for Diffusion Transformer: Fewer Annotations, Superior Aesthetics

2025/10

Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV 2025)

Authors: Keming Wu, Junwen Chen, Zhanhao Liang, Yinuo Wang, Ji Li, Chao Zhang, Bin Wang, Yuhui Yuan

We propose a hybrid layout framework for diffusion transformers that reduces dependence on large-scale semantic layout annotations while preserving spatial accuracy in text-to-image generation.
Our method divides layout control into two stages—anonymous layout generation and semantic refinement—followed by a quality-tuning phase to enhance visual aesthetics with minimal extra computation.
Experiments show that our approach surpasses SiamLayout in layout adherence and visual quality, achieving over 10× higher data efficiency using limited semantic layout data.

Japanese Kuzushiji Font Generation Employing Differentiable Renderer

2025/09

International Conference on Document Analysis and Recognition (ICDAR 2025)

Authors: Honghui Yuan, Junwen Chen, and Keiji Yanai

We focus on generating ancient handwritten fonts, particularly Kuzushiji, to support the preservation of Japanese historical documents.
We propose a few-shot, training-free method using vector images to convert modern fonts into Kuzushiji style.
Our experiments show that the proposed method effectively produces high-quality Kuzushiji fonts, outperforming previous approaches.

PosBridge: Multi-View Positional Embedding Transplant for Identity-Aware Image Editing

2025/08

The 36th British Machine Vision Conference (BMVC 2025)

Authors: Peilin Xiong, Junwen Chen, and Keiji Yanai

We address localized subject-driven image editing by proposing PosBridge, a training-free and scalable framework for inserting custom objects into target scenes.
Our method leverages positional embedding transplant and a Corner Centered Layout strategy to guide diffusion models in maintaining structural and appearance consistency.
Experimental results show that PosBridge outperforms existing methods in structure preservation, visual fidelity, and computational efficiency.

MMMG: A Massive, Multidisciplinary, Multi-Tier Generation Benchmark for Text-to-Image Reasoning

2025/06

The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025)

Authors: Yuxuan Luo, Yuhui Yuan, Junwen Chen, Haonan Cai, Ziyi Yue, Yuwei Yang, Fatima Zohra Daha, Ji Li, Zhouhui Lian

We introduce knowledge image generation as a new task and present MMMG, a large benchmark of 4,456 expert-validated knowledge image–prompt pairs across multiple disciplines and levels.
MMMG enables systematic evaluation using a unified Knowledge Graph (KG) representation and a new MMMG-Score that measures factual fidelity and visual clarity.
Experiments on 16 leading text-to-image models reveal major reasoning limitations, while our proposed FLUX-Reason baseline demonstrates promising performance for future research.

PrismLayers: Open Data for High-Quality Multi-Layer Transparent Image Generative Models

2025/05

arXiv preprint arXiv:2505.22523

Authors: Junwen Chen, Heyang Jiang, Yanbin Wang, Keming Wu, Ji Li, Chao Zhang, Keiji Yanai, Dong Chen, Yuhui Yuan

We introduce PrismLayers and PrismLayersPro, the first large-scale, high-fidelity datasets of multi-layer transparent images with accurate alpha mattes for text-to-image generation.
We propose a training-free synthesis pipeline and a strong multi-layer generation model, ART+, built upon LayerFLUX and MultiLayerFLUX for high-quality layered image composition.
Experiments and user studies show that ART+ surpasses the original ART and matches the visual quality of FLUX.1-[dev], establishing a foundation for future research in editable multi-layer image generation.

Focusing on what to decode and what to train: SOV Decoding with Specific Target Guided DeNoising and Vision Language Advisor

2024/12

2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2025)

Authors: Junwen Chen, Yingcheng Wang, Keiji Yanai

We propose SOV-STG-VLA, a transformer-based framework for Human-Object Interaction Detection (HOID) that disentangles object detection and verb recognition.
Our method introduces Subject-Object-Verb (SOV) decoding, Specific Target Guided (STG) denoising, and a Vision-Language Advisor (VLA) to enhance representation learning and training efficiency.
Experiments show that SOV-STG-VLA achieves state-of-the-art performance with six times faster convergence than recent transformer-based methods.

RecipeSD: Injecting Recipe into Food Image Synthesis with Stable Diffusion

2024/10

Proceedings of the 2nd International Workshop on Multimedia Content Generation and Evaluation: New Methods and Practice (McGE '24)

Authors: Jing Yang, Junwen Chen, Keiji Yanai

We propose RecipeSD, a Stable Diffusion–based method that incorporates recipe text information to enhance food image synthesis.
Using a pretrained recipe encoder and the Image-like Recipe Transformation (IRT) with our CookNet model, we effectively inject detailed recipe semantics into the diffusion process.
Experiments show that RecipeSD generates high-quality, recipe-aligned food images, outperforming existing cross-modal synthesis methods.

Improving Cross-Modal Recipe Embeddings with Cross Decoder

2024/06

Proceedings of the 5th ACM Workshop on Intelligent Cross-Data Analysis and Retrieval

Authors: Jing Yang, Junwen Chen, Keiji Yanai

We propose a Cross-Modal Embedding Fusing Decoder (Cross Decoder) to enhance cross-modal recipe retrieval.
By integrating the Cross Decoder into a GAN–transformer framework and using dynamic margin loss, we improve embedding reliability and retrieval performance.
Experiments on the Recipe1M dataset show that our method outperforms state-of-the-art approaches in both retrieval accuracy and image generation quality.

Contextual Associated Triplet Queries for Panoptic Scene Graph Generation

2023/12

Proceedings of the 5th ACM International Conference on Multimedia in Asia (MMAsia '23)

Authors: Jingbin Xu, Junwen Chen, Keiji Yanai

We address the limitations of existing Panoptic Scene Graph (PSG) methods by proposing a new one-stage framework called Contextual Associated Triplet Queries (CATQ).
CATQ decodes subject, object, and relation features through separate branches, guided by instance information and enhanced with a Triplet Context Fusion Block.
Experiments demonstrate that CATQ significantly outperforms state-of-the-art methods, achieving 34.8 Recall@20 and 20.9 mRecall@20 with only half the training time.

HowToEat: Exploring Human Object Interaction and Eating Action in Eating Scenarios

2023/10

Proceedings of the 8th International Workshop on Multimedia Assisted Dietary Management (MADiMa '23)

Authors: Yingcheng Wang, Junwen Chen, Keiji Yanai

We focus on eating and diet multimedia analysis, emphasizing the need to detect eating activities in videos and images.
To address the lack of eating-specific data, we introduce HowToEat, a large-scale dataset with 66 days of videos and 95k annotated images across 12 eating scenarios.
Based on this dataset, we develop an eating analysis system that simultaneously detects hand-object interactions and eating actions using a single model.

QAHOI: Query-based anchors for human-object interaction detection

2023/07

2023 18th International Conference on Machine Vision and Applications (MVA)

Authors: Junwen Chen and Keiji Yanai

We address the Human-Object Interaction (HOI) detection task by proposing QAHOI, a transformer-based method using query-based anchors for end-to-end HOI prediction.
QAHOI employs a multi-scale architecture to capture spatial variations in object size and location, improving interaction recognition accuracy.
Experiments on the HICO-DET benchmark show that QAHOI with a transformer backbone significantly outperforms state-of-the-art methods.

Transformer-Based Cross-Modal Recipe Embeddings with Large Batch Training

2023/01

International Conference on Multimedia Modeling

Authors: Jing Yang, Junwen Chen, Keiji Yanai

We propose TNLBT (Transformer-based Network for Large Batch Training), a simple yet effective framework for cross-modal recipe retrieval and image generation.
TNLBT employs Transformer-based encoders for both image and text embedding, combines self-supervised and contrastive losses, and leverages large-batch training to enhance cross-modal learning.
Experiments on Recipe1M demonstrate that TNLBT significantly outperforms state-of-the-art methods, and we confirm that large-batch training improves recipe embedding learning.

Parallel Queries for Human-Object Interaction Detection

2022/12

Proceedings of the 4th ACM International Conference on Multimedia in Asia (MMAsia '22)

Authors: Junwen Chen, Keiji Yanai

We propose Parallel Query Network (PQNet) for Human-Object Interaction (HOI) detection, which separates human and object localization into parallel decoding branches.
PQNet employs two transformer decoders for subject and object embeddings and introduces a verb decoder that fuses their representations through attention mechanisms.
Experiments show that PQNet outperforms previous methods while requiring only half the training epochs.

ContourRend: A Segmentation Method for Improving Contours by Rendering

2020/10

International Symposium on Neural Networks

Authors: Junwen Chen, Yi Lu, Yaran Chen, Dongbin Zhao, Zhonghua Pang

We propose ContourRend, a segmentation method designed to refine object contours and achieve clearer, more complete segmentation results.
ContourRend integrates a contour renderer with a GCN-based segmentation model, focusing on high-resolution prediction around contour pixels.
Experiments on the Cityscapes dataset show that ContourRend achieves 72.41% mIoU, outperforming the baseline Polygon-GCN by 1.22%.

Participation in Domestic Conferences in Japan

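Meeting on Image Recognition and Understanding (MIRU 2025)

Kyoto

Peilin Xiong, Junwen Chen, Honghui Yuan, and Keiji Yanai: "Controlling Unseen Compositions in Diffusion Models by Swapping Positional Embeddings"

To generate unseen object compositions with diffusion models, this work proposes dynamically swapping positional embeddings, which produces novel visual compositions (e.g., a human head on a dog's body) without retraining the model.

Oral presentation by Peilin Xiong

Honghui Yuan, Junwen Chen, and Keiji Yanai: "Few-shot Font Generation for Japanese Kuzushiji with Differentiable Renderer"

Generates Kuzushiji characters with DiffVG, building on Word-as-Image.

Oral presentation by Honghui Yuan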

Meeting on Image Recognition and Understanding (MIRU 2024)

Kumamoto

Jing Yang, Junwen Chen, Jingbin Xu, and Keiji Yanai: "RecipeSD: Injecting Recipe Embedding into Food Image Synthesis using Stable Diffusion"

Generates food images from recipes with Stable Diffusion using CookNet, which takes cross-modal recipe embeddings as the ControlNet input.

Poster

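Meeting on Image Recognition and Understanding (MIRU 2023)

Hamamatsu

Junwen Chen, Yingcheng Wang, and Keiji Yanai: "HOI Detection by Separating Human, Object, and Action Decoders" (人物・物体・動作デコーダの分離によるHOI検出)

This work proposes a new one-stage framework consisting of a human decoder, an object decoder, and an action decoder. On HICO-DET, the method achieves higher accuracy than state-of-the-art methods with one third of the training epochs.

Oral presentation by Junwen Chen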

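Meeting on Image Recognition and Understanding (MIRU 2022)

Himeji

Junwen Chen and Keiji Yanai: "Human-Object Interaction Detection Using Multi-Scale Anchors" (マルチスケールのアンカーを用いた人間と物体のインタラクション検出)

Proposes QAHOI, which extracts multi-scale features using a hierarchical backbone and a Deformable Transformer Encoder.

Poster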

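IEICE Technical Committee on Pattern Recognition and Media Understanding (PRMU)

2022

Junwen Chen and Keiji Yanai: "Human-Object Interaction Detection Using Query-Based Anchors" (クエリベースのアンカーを用いた人間と物体のインタラクション検出)

Proposes a one-stage method that adopts a Transformer-based multi-scale architecture and uses query-based anchors to predict all elements of an HOI instance.

Online presentation by Junwen Chen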

Skills

Programming

Python
C++
JavaScript
HTML/CSS
Kotlin

Frameworks/Tools

PyTorch
TensorFlow
Docker
Linux

Languages

Chinese (Native)
English (TOEFL 84, TOEIC 845)
Japanese (N1)

Personal Projects

Japanese Words Learning App

A mobile application, written in Kotlin, that helps users learn Japanese words through interactive quizzes and flashcards.

Code

Japan Ski Resort Info

A website providing information about ski resorts in Japan, including location, facilities, and user reviews.

Website

Qwen-ASR-LLM-TTS_MCP-Chat-AI-Assistant

An easy-to-deploy real-time multilingual AI assistant integrating Qwen3-ASR for speech recognition, Ollama + Qwen3-30B-A3B for LLM, Qwen3-TTS / Kokoro for speech synthesis, and MCP for smart device control.

Code

Hobbies

Tennis, Photography, Traveling