Natural language descriptions of images and their regions

The goal of this project is to develop a model or framework that generates natural language descriptions of images and their regions. A quick glance at an image is sufficient for a human to point out and describe an immense amount of detail about the visual scene. However, this remarkable ability has proven elusive for our visual recognition models. We are working to develop a methodology for automated image description and captioning that combines linguistic and visual information, with the aim of better image understanding and visual recognition.


Imagine this kind of technology being refined and extended to surveillance systems: the system generates a description such as “A car running over a human” and then sends a notification so that humans can review the footage.

Everyone has large photo collections these days. How can you intelligently find all pictures in which your dog appears? How can you find all pictures in which you are frowning? Can we make cars smart, e.g., can the car drive you to school while you finish your last homework? How can a home robot understand its environment, e.g., switch on a TV when told to and serve you dinner? If you take a few pictures of your living room, can you reconstruct it in 3D (which allows you to render it from any new viewpoint and thus create a “virtual tour” of your room)? Can you reconstruct it from one image alone? How can you efficiently browse your home movie collection, e.g., find all shots in which Tom Cruise is chasing a bad guy?


  • Deep Visual-Semantic Alignments for Generating Image Descriptions (link)
  • Generation and Comprehension of Unambiguous Object Descriptions (link, slides)
  • Improving Image-Sentence Embeddings Using Large Weakly Annotated Photo Collections (link)
  • Automatic Description Generation from Images: A Survey of Models, Datasets, and Evaluation Measures (link)
  • ECCV2016 2nd Workshop on Storytelling with Images and Videos (VisStory)
  • EACL 2014 Tutorial: Describing Images in Sentences; April 27, EACL 2014, Gothenburg, Sweden
  • International Journal of Computer Vision; Volume 123, Issue 1, May 2017; Special Issue: Combined Image and Language Understanding
    • VQA: Visual Question Answering
    • Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
    • Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models
  • Visual Genome: Dataset


