Our goal is to develop a model that generates natural language descriptions of images and their regions. A quick glance at an image is sufficient for a human to point out and describe an immense amount of detail about the visual scene. This remarkable ability, however, has proven elusive for visual recognition models. We are developing methodology for automated image description and captioning that combines linguistic and visual information, with the aim of better image understanding and visual recognition.
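To make the idea of combining visual and linguistic information concrete, here is a minimal toy sketch. All names, the 3-dimensional "features", and the tiny vocabulary are hypothetical illustrations, not our actual method: real captioning systems use a learned image encoder (e.g. a CNN) and a trained language decoder rather than hand-set embeddings.

```python
# Toy sketch of encoder-decoder captioning (everything here is hypothetical):
# an "encoder" maps an image to a feature vector, and a greedy "decoder"
# repeatedly emits the vocabulary word whose embedding best matches the
# remaining visual evidence.

def dot(u, v):
    # Inner product of two equal-length vectors.
    return sum(a * b for a, b in zip(u, v))

def greedy_caption(image_features, vocab, max_words=3):
    """Greedily pick words whose embeddings align with the image features."""
    remaining = list(image_features)
    caption = []
    for _ in range(max_words):
        # Score every vocabulary word against the current visual evidence.
        word, emb = max(vocab.items(), key=lambda kv: dot(remaining, kv[1]))
        if dot(remaining, emb) <= 0:
            break  # no word explains the remaining evidence
        caption.append(word)
        # Subtract the explained component so a word is not repeated.
        remaining = [r - e for r, e in zip(remaining, emb)]
    return " ".join(caption)

# Hypothetical 3-dim "image features": [vehicle-ness, person-ness, motion].
vocab = {
    "car":     [1.0, 0.0, 0.2],
    "person":  [0.0, 1.0, 0.1],
    "running": [0.1, 0.2, 1.0],
}

print(greedy_caption([0.9, 0.8, 1.0], vocab))
```

Note that this toy produces an unordered bag of words rather than a grammatical sentence; supplying word order and grammar is exactly the role the linguistic (language-model) side plays in a full captioning system.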
Imagine this kind of technology refined and extended to surveillance systems: a generated description such as "a car running over a human" could automatically trigger a notification for human review of the footage.