In the context of designing a video chatbot that accepts images as prompts, how far this feature can be supported depends on the use case. For the scenario you mentioned, where a user sends a picture of a person and asks the chatbot for the timestamp of that person's appearance in the video, this is feasible if we index the faces of the people appearing at each timestamp.
Then, at query time, when a user asks when a person appears in the video, we can search this index and return the corresponding timestamps. This can be achieved even without an LLM.
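A minimal sketch of this index-and-lookup step, assuming face embeddings have already been extracted offline by some face-recognition model (the `FACE_INDEX` entries, the vector dimensions, and the similarity threshold below are all illustrative assumptions, not part of any specific system):

```python
import math

# Hypothetical face index built offline: one embedding per detected face,
# paired with the timestamp (in seconds) of the frame it appeared in.
# Real embeddings would come from a face-recognition model; these short
# vectors are stand-ins for illustration.
FACE_INDEX = [
    ([0.90, 0.10, 0.00], 12.5),   # person A at 12.5 s
    ([0.10, 0.90, 0.20], 47.0),   # person B at 47.0 s
    ([0.88, 0.12, 0.05], 93.0),   # person A again at 93.0 s
]

def _cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def find_appearances(query_embedding, index=FACE_INDEX, threshold=0.9):
    """Return sorted timestamps whose stored face embedding matches the query."""
    return sorted(ts for emb, ts in index
                  if _cosine(query_embedding, emb) >= threshold)

# Embedding of the user's uploaded photo (assumed here to be person A):
print(find_appearances([0.91, 0.09, 0.02]))  # [12.5, 93.0]
```

In a real system the linear scan would be replaced by a vector store or approximate-nearest-neighbour index, but the shape of the lookup is the same: embed the query photo, match it against the indexed faces, and return the timestamps of the hits.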
However, if the requirement extends beyond this, for instance an image together with a question that refers to the video's broader context, so that we need to relate the image to what happens in the video, this is not fully achievable with current technology. In cases where the transcript alone is sufficient to answer the question, it is possible.
But in general, a model that fully comprehends the content of videos would be required to answer all types of questions.