|Title||Vision-Language Integration in AI: a reality check|
|Publication Type||Conference Papers|
|Year of Publication||2004|
|Authors||Pastra, K, Wilks, Y|
|Conference Name||Proceedings of the 16th European Conference on Artificial Intelligence|
Multimodal human-to-human interaction requires integration of the contents/meaning of the modalities involved. Artificial Intelligence (AI) multimodal prototypes attempt to go beyond technical integration of modalities to this kind of meaning integration, which allows for coherent, natural, “intelligent” communication with humans. Although it brings many multimedia-related AI research fields together, integration, and vision-language integration in particular, remains in the background. In this paper, we attempt to fill this lacuna by shedding light on how, why and to what extent vision-language content integration takes place within AI. We present a taxonomy of vision-language integration prototypes that resulted from an extensive survey of such prototypes across a wide range of AI research areas and that uses a prototype’s integration purpose as the guiding criterion for classification. We look at the integration resources and mechanisms used in such prototypes and correlate them with theories of integration that emerge indirectly from computational models of the mind. We argue that state-of-the-art vision-language prototypes fail to address core integration challenges automatically, because they rely on human intervention at stages of the integration procedure that are tightly coupled with inherent characteristics of the integrated media. Finally, we present VLEMA, a prototype that attempts to perform vision-language integration with minimal human intervention in these core integration stages.