Understanding long text is in great demand in practice, yet it remains beyond the reach of most language-image pre-training (LIP) models. In this work, we empirically confirm that the key reason for this issue is that training images are usually paired with short captions, which leave certain tokens easily overshadowed by salient tokens. To address this problem, our initial attempt is to relabel the data with long captions; however, directly learning from them may degrade performance on short text understanding (e.g., in the image classification task). After incorporating corner tokens to aggregate diverse textual information, we help the model recover its original level of short text understanding while greatly enhancing its capability of long text understanding. We further investigate whether the model can continuously benefit from longer captions and observe a clear trade-off between performance and efficiency. Finally, we validate the effectiveness of our approach using a self-constructed large-scale dataset consisting of 100M long-caption-oriented text-image pairs.
Our dataset contains long captions for 100M images, where each image is paired with three long captions, each generated by a different multi-modal captioning model. Example captions are shown below.
Caption from InstructBLIP:
In the image, there are several objectively existing and visible objects with attributes. These include: 1. A cat: The main subject of the image is a gray and white cat, which is being fed by a person's finger. 2. The person's hand: The person's hand is visible in the image, holding the finger that is feeding the cat. 3. The cat's mouth: The cat's mouth is open as it receives the food from the person's finger. 4. The cat's teeth: The cat's teeth are visible as it opens its mouth to receive the food from the person's finger. 5. The person's fingertip: The person's fingertip is touching the cat's mouth, demonstrating the act of feeding the cat. 6. The cat's fur: The cat's fur is visible in the image, contributing to the overall visual representation of the scene.

Caption from InstructBLIP:
In the image, there are several visible objects that can be identified based on their attributes. These objects include: 1. People: A group of people is present on the beach, engaging in various activities such as swimming, sunbathing, or simply enjoying the scenery. 2. Sand: The beach is covered with sand, providing a suitable environment for relaxation and recreational activities. 3. Water: There is a body of water near the beach, which could be the ocean, a lagoon, or another type of body of water. 4. Island: An island is visible in the distance, creating a picturesque backdrop for the beach scene. 5. Cliffs: Rocky cliffs can be seen in the background, adding to the natural beauty and diversity of the landscape. These objectively existing and visible objects contribute to the overall aesthetic and atmosphere of the scene.

Caption from InstructBLIP:
The image features a white and pink double-decker bus traveling down a city street. The bus has two levels, making it a double-decker vehicle. It is parked at the side of the road, likely waiting for passengers to board or disembark. In the background, there are several other vehicles, including cars, trucks, and motorcycles, adding to the bustling atmosphere of the urban setting.

Existing short-text-image retrieval tasks primarily rely on short textual input. We collected long text-image pairs from the DCI, IIW, and ShareGPT4V datasets to construct a long-text-image retrieval evaluation task.
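As an illustration of how such a long-text-image retrieval benchmark can be scored, below is a minimal sketch that computes recall@1 in both directions from CLIP-style embeddings. It assumes pre-computed, unit-normalized image and text features whose rows are paired; the function name and the random stand-in features are our own illustration, not the official evaluation code.

# Minimal sketch of long-text-image retrieval evaluation (recall@1), assuming
# pre-computed, L2-normalized image and text embeddings from a CLIP-style model.
import torch

@torch.no_grad()
def retrieval_recall_at_1(image_feats: torch.Tensor, text_feats: torch.Tensor) -> dict:
    """image_feats, text_feats: (N, D) unit-normalized embeddings, row i paired with row i."""
    sim = text_feats @ image_feats.t()                                       # (N, N) cosine similarities
    t2i = (sim.argmax(dim=1) == torch.arange(sim.size(0))).float().mean()    # text -> image
    i2t = (sim.argmax(dim=0) == torch.arange(sim.size(0))).float().mean()    # image -> text
    return {"t2i_R@1": t2i.item(), "i2t_R@1": i2t.item()}

# Example with random embeddings standing in for encoder outputs:
feats_i = torch.nn.functional.normalize(torch.randn(100, 512), dim=-1)
feats_t = torch.nn.functional.normalize(torch.randn(100, 512), dim=-1)
print(retrieval_recall_at_1(feats_i, feats_t))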
The impacts of long vs. short captions on language-image pre-training. Training with short text-image pairs leaves certain tokens (e.g., the garden token) easily overshadowed by salient tokens (e.g., the castle token). Long-caption-image pairs help bring the overshadowed tokens back into the light.
As the length of the text (i.e., the number of sub-captions) increases, the performance of the pre-trained model on long-text-image retrieval consistently improves and then stabilizes. However, performance on short-text-image retrieval and classification degrades when the model is trained with longer text.
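To make the notion of "number of sub-captions" concrete, the following is a hedged sketch of how a training caption of controllable length could be assembled from the first k sub-captions of a long caption. The naive sentence split and the function name are illustrative assumptions, not the paper's actual pipeline.

# Illustrative sketch (not the paper's exact pipeline): build a training caption
# from the first k sub-captions of a long caption, so that k controls text length.
def build_training_text(long_caption: str, k: int, delimiter: str = ". ") -> str:
    # Split the long caption into sentence-level sub-captions; a real pipeline
    # would likely use a proper sentence splitter instead of this naive split.
    sub_captions = [s.strip() for s in long_caption.split(".") if s.strip()]
    return delimiter.join(sub_captions[:k]) + "."

caption = ("A gray and white cat is being fed by a person's finger. "
           "The cat's mouth is open. The person's fingertip touches the cat's mouth.")
for k in (1, 2, 3):
    print(k, "->", build_training_text(caption, k))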
To find a solution that balances both long and short texts, we add extra text tokens, termed corner tokens ([Cor 1], [Cor 2], ...), to the text encoder to aggregate diverse text features.
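A minimal sketch of the corner-token idea is given below, assuming a transformer-based text encoder: a few learnable embeddings are appended to the word embeddings so they can aggregate diverse text features through self-attention. The module structure, pooling choice, and hyper-parameters here are our own simplifications (positional embeddings are omitted for brevity), not the official implementation.

# Sketch of corner tokens appended to the text token sequence.
import torch
import torch.nn as nn

class TextEncoderWithCornerTokens(nn.Module):
    def __init__(self, vocab_size=49408, dim=512, num_corner_tokens=4, num_layers=6, num_heads=8):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, dim)
        # Learnable corner tokens [Cor 1], [Cor 2], ... shared across the batch.
        self.corner_tokens = nn.Parameter(torch.randn(num_corner_tokens, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        b = token_ids.size(0)
        x = self.token_embed(token_ids)                               # (B, L, D)
        corners = self.corner_tokens.unsqueeze(0).expand(b, -1, -1)   # (B, C, D)
        x = torch.cat([x, corners], dim=1)                            # append corner tokens
        h = self.encoder(x)
        # Pool the corner-token outputs into a single text feature (one simple choice).
        return h[:, -self.corner_tokens.size(0):].mean(dim=1)         # (B, D)

# Usage: ids = torch.randint(0, 49408, (2, 77)); feat = TextEncoderWithCornerTokens()(ids)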
@article{LoTLIP2024,
title = {LoTLIP: Improving Language-Image Pre-training for Long Text Understanding},
author = {Wu, Wei and Zheng, Kecheng and Ma, Shuailei and Lu, Fan and Guo, Yuxin and Zhang, Yifei and Chen, Wei and Guo, Qingpei and Shen, Yujun and Zha, Zheng-Jun},
journal = {arXiv},
year = {2024}
}