CLASS-DEPENDENT AND CROSS-MODAL MEMORY NETWORK CONSIDERING SENTIMENTAL FEATURES FOR VIDEO-BASED CAPTIONING