Quick answer
Recent progress in video-text retrieval has been driven largely by advancements in model architectures and training strategies. However, the representation learning capabilities of video-text retrieval models remain constrained by low-quality and limited training data annotations.