Text-to-image models offer a new level of creative flexibility by allowing
users to guide the image generation process through natural language. However, using these models to consistently portray the same subject across
diverse prompts remains challenging. Existing approaches either fine-tune the
model to teach it new words that describe specific user-provided subjects, or
add image conditioning to it. These methods require lengthy per-subject optimization or large-scale pre-training. Moreover, they struggle to
align generated images with text prompts and face difficulties in portraying
multiple subjects. Here, we present ConsiStory, a training-free approach that
enables consistent subject generation by sharing the internal activations of
the pretrained model. We introduce a subject-driven shared attention block
and correspondence-based feature injection to promote subject consistency
between images. Additionally, we develop strategies to encourage layout
diversity while maintaining subject consistency. We compare ConsiStory to a
range of baselines and demonstrate state-of-the-art performance on subject
consistency and text alignment, without requiring a single optimization step.
Finally, ConsiStory can naturally extend to multi-subject scenarios, and even
enable training-free personalization for common objects.
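
To make the shared-attention idea concrete, the sketch below is our own minimal illustration, not the paper's implementation: each image in a batch runs self-attention over its own tokens plus the subject-region tokens of the other images, which ties the subject's appearance together across the batch. The function name, the single-head formulation, and the precomputed boolean subject masks (e.g., obtained by thresholding cross-attention maps) are all simplifying assumptions.

```python
import torch

def shared_self_attention(q, k, v, subject_mask):
    """Self-attention extended across a batch of images.

    q, k, v:      (B, N, D) per-image query/key/value projections
    subject_mask: (B, N)    bool, True where a token lies on the subject
    Each image attends to all of its own tokens, plus the *subject*
    tokens of every other image in the batch.
    """
    B, N, D = q.shape
    k_all = k.reshape(B * N, D)  # keys of every image, flattened
    v_all = v.reshape(B * N, D)

    # Visibility mask: own tokens are always visible; other images'
    # tokens are visible only inside their subject masks.
    vis = subject_mask.reshape(1, B * N).expand(B, -1).clone()  # (B, B*N)
    for i in range(B):
        vis[i, i * N:(i + 1) * N] = True  # full access to own image

    attn = torch.einsum('bnd,md->bnm', q, k_all) / D ** 0.5  # (B, N, B*N)
    attn = attn.masked_fill(~vis.unsqueeze(1), float('-inf'))
    attn = attn.softmax(dim=-1)
    return torch.einsum('bnm,md->bnd', attn, v_all)           # (B, N, D)
```

In an actual pipeline, a block like this would stand in for the self-attention call inside the denoiser at selected layers and timesteps; calling it on random tensors of shape (B, N, D) with a boolean mask is enough to sanity-check the shapes.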
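Correspondence-based feature injection can be pictured in a similarly simplified way: each subject patch in a generated image is blended toward its best-matching subject patch in a reference image from the same batch. The sketch below uses cosine similarity over patch features and a fixed blend weight; the feature choice, similarity metric, blend weight, and all names here are illustrative assumptions rather than the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def inject_corresponding_features(feat_tgt, feat_ref,
                                  mask_tgt, mask_ref, alpha=0.8):
    """Blend target subject patches toward matching reference patches.

    feat_tgt, feat_ref: (N, D) patch features of target / reference image
    mask_tgt, mask_ref: (N,)   bool subject masks for each image
    alpha:              blend weight for the injected reference feature
    """
    out = feat_tgt.clone()
    tgt_idx = mask_tgt.nonzero(as_tuple=True)[0]
    ref_idx = mask_ref.nonzero(as_tuple=True)[0]
    if len(tgt_idx) == 0 or len(ref_idx) == 0:
        return out  # nothing to match

    # Cosine similarity between target and reference subject patches,
    # then a nearest-neighbor match per target patch.
    t = F.normalize(feat_tgt[tgt_idx], dim=-1)  # (Nt, D)
    r = F.normalize(feat_ref[ref_idx], dim=-1)  # (Nr, D)
    nn = (t @ r.T).argmax(dim=-1)               # (Nt,) best reference match

    out[tgt_idx] = alpha * feat_ref[ref_idx][nn] \
        + (1 - alpha) * feat_tgt[tgt_idx]
    return out
```

Restricting the matching to subject regions is what lets a mechanism like this enforce consistency on the subject while leaving the background, and hence the layout, free to vary between images.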