Janus-Pro is an advanced Multimodal Large Language Model (MLLM) that unifies visual understanding and image generation in a single framework. Developed by the Chinese AI company DeepSeek, it both interprets and generates images, surpassing prominent models such as OpenAI's DALL-E 3 and Stable Diffusion on text-to-image benchmarks.
Janus-Pro is the improved successor to the original Janus model, introducing optimized training strategies, an expanded dataset, and increased model scale. These enhancements have elevated its performance on tasks such as generating images from textual descriptions and analyzing visual data. On the GenEval and DPG-Bench benchmarks, Janus-Pro outperforms both Stable Diffusion 3 Medium (open source) and DALL-E 3 (commercial).
The model is now publicly available on Hugging Face, with the code released under the MIT License and the model weights under the DeepSeek Model License.
Janus-Pro introduces a novel architecture that decouples visual encoding into separate pathways for understanding and generation, while a single transformer backbone processes both. Built on the foundations of DeepSeek-LLM-1.5b-base and DeepSeek-LLM-7b-base, it incorporates the SigLIP-L vision encoder for understanding, which supports 384 x 384 image inputs. For image generation tasks, it uses a discrete tokenizer with a downsampling rate of 16.
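As a quick sanity check on those numbers, a 384 x 384 input with a downsampling rate of 16 yields a 24 x 24 grid of discrete codes, i.e. 576 tokens per generated image. The snippet below is purely illustrative arithmetic based on the figures above; exact token counts depend on the released model configuration:

```python
# Back-of-the-envelope token math for Janus-Pro's generation pathway
# (illustrative only; exact counts depend on the released configs).

IMAGE_SIZE = 384   # input resolution (384 x 384), per the model card
DOWNSAMPLE = 16    # downsampling rate of the generation tokenizer

grid_side = IMAGE_SIZE // DOWNSAMPLE      # 384 / 16 = 24 codes per side
tokens_per_image = grid_side * grid_side  # 24 * 24 = 576 tokens per image

print(f"{grid_side} x {grid_side} grid -> {tokens_per_image} image tokens")
```

This is why unified models like Janus-Pro can treat an image as just another sequence of a few hundred tokens, interleaved with text inside the same transformer.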
To learn more about its implementation, visit the official GitHub repository (https://github.com/deepseek-ai/Janus).
Janus-Pro demonstrates superior performance across multiple benchmarks, setting a new standard for unified multimodal systems. On GenEval, Janus-Pro-7B scores 0.80 overall, ahead of DALL-E 3 (0.67) and Stable Diffusion 3 Medium (0.74), and on DPG-Bench it reaches 84.19, the best result among the compared methods.
These achievements highlight Janus-Pro's strength in both multimodal understanding and text-to-image generation.
Getting started with Janus-Pro is straightforward. Access the framework via the official GitHub repository, or explore the models (Janus-Pro-1B and Janus-Pro-7B) directly on Hugging Face for implementation details and code, as in the sketch below.
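For a concrete starting point, here is a minimal image-understanding sketch adapted from the usage examples in the repository's README. It assumes the `janus` package is installed from that repo, a CUDA GPU, and a local `example.jpg` (a hypothetical file for illustration); names such as `VLChatProcessor`, `load_pil_images`, and `prepare_inputs_embeds` come from that codebase, so check the repo for the current API before relying on them:

```python
import torch
from transformers import AutoModelForCausalLM
from janus.models import VLChatProcessor   # from the deepseek-ai/Janus repo
from janus.utils.io import load_pil_images

model_path = "deepseek-ai/Janus-Pro-7B"    # the 1B variant also works

# The processor bundles the tokenizer with the image preprocessing pipeline.
processor = VLChatProcessor.from_pretrained(model_path)
tokenizer = processor.tokenizer

model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
model = model.to(torch.bfloat16).cuda().eval()

# Chat format used by the repo's examples: one image plus a text question.
conversation = [
    {
        "role": "<|User|>",
        "content": "<image_placeholder>\nDescribe this image.",
        "images": ["example.jpg"],         # hypothetical local file
    },
    {"role": "<|Assistant|>", "content": ""},
]

# Load the referenced images and batch everything into model inputs.
pil_images = load_pil_images(conversation)
inputs = processor(
    conversations=conversation, images=pil_images, force_batchify=True
).to(model.device)

# Fuse image and text embeddings, then decode the answer autoregressively.
inputs_embeds = model.prepare_inputs_embeds(**inputs)
outputs = model.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
)
print(tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True))
```

Text-to-image generation follows a similar pattern in the repository's examples, with the model emitting discrete image tokens that the tokenizer's decoder turns back into pixels.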
We are excited to see what DeepSeek will bring to the table next, as their innovations continue to push the boundaries of AI and redefine what's possible in the world of multimodal neural networks. What's even more remarkable is that the openly released weights and permissively licensed code make this level of capability affordable for everyone, bringing advanced AI to a wider audience than ever before.