🎓Paper2Poster
Towards Multimodal Poster Automation from Scientific Papers

Wei Pang, Kevin Qinghong Lin, Xiangru Jian, Xi He, Philip Torr
University of Waterloo National University of Singapore University of Oxford
† Equal Contribution ✉ Corresponding Authors

TL;DR

We address two questions: how to create a poster from a paper, and how to evaluate the resulting poster.


Can AI assistants create a well-designed Poster given a Paper?

Inputs: Paper (a PDF)

Outputs: Poster (designed by the author)

How do GPT-4o and open-source multi-agents behave?

Poster generated by GPT-4o-image

Poster generated by GPT-4o-HTML

Poster generated by PPTAgent

Poster generated by 🦉OWL

GPT-4o-image produces visually acceptable layouts at first glance, but zooming into regions reveals impaired text rendering, leading to poor readability of fine-grained details. GPT-4o-HTML and OWL generate blog-like, text-dense posters that suffer from low visual readability. PPTAgent struggles with layout control, often resulting in missing panels.

Poster Generated by PosterAgent


In contrast, PosterAgent generates structurally coherent and readable posters while using significantly fewer words.

What are the Challenges?

  • Long-Context Long-Horizon Task: Scientific papers span multiple pages and thousands of words. Summarizing key insights while preserving coherence demands hierarchical understanding and selective abstraction. The complexity further necessitates long-horizon reasoning and multiple iterative interactions, making the task especially challenging.
  • Interleaved Multimodal Inputs: Papers integrate numerous figures, tables, and charts, each semantically linked to the surrounding text. Successful poster generation demands the ability to extract, interpret, and align these multimodal elements in a contextually appropriate manner.
  • Layout-aware Multimodal Outputs: Unlike tasks focused solely on text (e.g., blog writing) or vision, poster generation requires producing interleaved text–image outputs within a constrained spatial layout. This necessitates joint reasoning over language, visual content, and layout to prevent overflow, imbalance, and logical misalignment; a minimal overflow check is sketched after this list.
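
To make the overflow constraint concrete, here is a minimal Python sketch of the kind of feasibility check a layout-aware generator needs. The Panel fields, the estimate_text_height heuristic, and its constants are illustrative assumptions, not the Paper2Poster implementation.

from dataclasses import dataclass

@dataclass
class Panel:
    x: float       # left edge, inches
    y: float       # top edge, inches
    width: float   # inches
    height: float  # inches

def estimate_text_height(text: str, width_in: float,
                         font_pt: float = 24.0,
                         chars_per_inch: float = 4.0) -> float:
    """Crude pre-render estimate of text height (hypothetical constants)."""
    chars_per_line = max(1, int(width_in * chars_per_inch))
    n_lines = -(-len(text) // chars_per_line)  # ceiling division
    line_height_in = font_pt * 1.2 / 72.0      # 1.2 line spacing, 72 pt per inch
    return n_lines * line_height_in

def overflows(panel: Panel, text: str) -> bool:
    """True if the text is expected to spill past the panel's bottom edge."""
    return estimate_text_height(text, panel.width) > panel.height

panel = Panel(x=0.5, y=0.5, width=4.0, height=3.0)
print(overflows(panel, "word " * 400))  # True -> trim text or shrink the font

A real renderer would measure text with actual font metrics; the point is only that overflow can be predicted and negotiated jointly with content selection, rather than discovered after the poster is assembled.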


How to create a poster 👉 PosterAgent

A top-down, visual-in-the-loop, efficient multi-agent pipeline.
(a) The Parser distills the paper into a structured asset library; (b) the Planner aligns text–visual pairs into a binary-tree layout that preserves reading order and spatial balance; and (c) the Painter-Commenter loop refines each panel by executing rendering code and using VLM feedback to eliminate overflow and ensure alignment.
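
To illustrate the control flow of the Painter-Commenter loop, here is a minimal Python sketch. The helper names (paint_fn, render_fn, critique_fn) and the feedback schema are hypothetical placeholders, not the released PosterAgent API.

MAX_ROUNDS = 3

def refine_panel(panel_plan, paint_fn, render_fn, critique_fn):
    """Visual-in-the-loop refinement of a single panel (sketch).

    paint_fn(plan, feedback) -> str    LLM drafts or revises rendering code
    render_fn(code)          -> image  executes the code, returns the panel image
    critique_fn(image, plan) -> dict   VLM feedback, e.g. {"ok": bool, "issues": [...]}
    """
    feedback = None
    for _ in range(MAX_ROUNDS):
        code = paint_fn(panel_plan, feedback)      # Painter: write or revise code
        image = render_fn(code)                    # execute to render the panel
        feedback = critique_fn(image, panel_plan)  # Commenter: inspect the result
        if feedback["ok"]:                         # no overflow, alignment looks right
            return code, image
    return code, image                             # best effort after MAX_ROUNDS

Capping the rounds keeps the loop cheap; a panel that still fails after MAX_ROUNDS is kept in its best-effort state rather than stalling the pipeline.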

How to evaluate a poster 👉 PaperQuiz

A good poster should convey the paper's core content visually.
Left: We automatically generate multiple-choice questions from each paper using an LLM (o3), forming our PaperQuiz evaluation. Right: In PaperQuiz, we simulate multiple readers by letting VLMs representing different expertise levels (e.g., student, professor) read each generated poster and answer the quiz. The poster that achieves the highest average score is considered the most effective at conveying the paper's content.
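
As a concrete illustration, here is a minimal Python sketch of how PaperQuiz-style scoring could be wired up. The READERS personas, the quiz item schema, and ask_vlm are assumptions for illustration, not the benchmark's actual interface.

READERS = ["undergraduate student", "PhD student", "professor"]  # illustrative personas

def paper_quiz_score(poster_image, quiz, ask_vlm) -> float:
    """Average quiz accuracy across simulated readers (sketch).

    quiz: list of {"question": str, "choices": [str, ...], "answer": int}
    ask_vlm(image, persona, question, choices) -> int, the chosen option index
    """
    total = correct = 0
    for persona in READERS:
        for item in quiz:
            pred = ask_vlm(poster_image, persona,
                           item["question"], item["choices"])
            correct += int(pred == item["answer"])
            total += 1
    return correct / total  # higher = the poster conveys more of the paper

Averaging over personas rewards posters that stay legible to non-experts while still carrying the details an expert reader would look for.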

Abstract

Academic poster generation is a crucial yet challenging task in scientific communication, requiring the compression of long-context interleaved documents into a single, visually coherent page. To address this challenge, we introduce the first benchmark and metric suite for poster generation, which pairs recent conference papers with author-designed posters and evaluates outputs on (i) Visual Quality—semantic alignment with human posters, (ii) Textual Coherence—language fluency, (iii) Holistic Assessment—six fine-grained aesthetic and informational criteria scored by a VLM-as-judge, and notably (iv) PaperQuiz—the poster's ability to convey core paper content as measured by VLMs answering generated quizzes. Building on this benchmark, we propose PosterAgent, a top-down, visual-in-the-loop multi-agent pipeline: the (a) Parser distills the paper into a structured asset library; the (b) Planner aligns text–visual pairs into a binary-tree layout that preserves reading order and spatial balance; and the (c) Painter-Commenter loop refines each panel by executing rendering code and using VLM feedback to eliminate overflow and ensure alignment. In our comprehensive evaluation, we find that GPT-4o outputs—though visually appealing at first glance—often exhibit noisy text and poor PaperQuiz scores, and we find that reader engagement is the primary aesthetic bottleneck, as human-designed posters rely largely on visual semantics to convey meaning. Our fully open-source Paper2Poster pipeline outperforms GPT-4o-based systems across nearly all metrics while consuming 87% fewer tokens. These findings chart clear directions for the next generation of fully automated poster-generation models.

Data Statistics

(a) Word cloud illustrating the diversity of research topics. (b) Textual-token and figure-count statistics for input papers vs. posters provided by authors.

Main Results on Existing Solutions

Detailed evaluation of Paper2Poster.

PaperQuiz evaluation on Paper2Poster.

Efficiency and cost analysis, demonstrating PosterAgent's strong efficiency and low API cost.

More Examples

Case-1: Conformal Semantic Keypoint Detection with Statistical Guarantees.

Case-2: Neural Tangent Kernels for Axis-Aligned Tree Ensembles.

Case-3: Truly Scale-Equivariant Deep Nets with Fourier Layers.

BibTeX


@misc{pang2025paper2postermultimodalposterautomation,
      title={Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers}, 
      author={Wei Pang and Kevin Qinghong Lin and Xiangru Jian and Xi He and Philip Torr},
      year={2025},
      eprint={2505.21497},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.21497}, 
}