TGIF is simply the GIF dataset; see https://github.com/fanchenyou/HME-VideoQA/tree/master/gif-qa for how to obtain the feats, Vocabulary and dataset files.

No module named ‘colorlog’
pip install colorlog

No module named ‘block’
pip install block.bootstrap.pytorch

ordinal not in range(128)
Spent ages fiddling with UTF encodings to no avail; after reverting all the changes it suddenly ran fine.

AttributeError: Can’t get attribute ‘_init_fn’ on <module ‘main’ (built-in)>
Seems to be some multiprocessing issue.
Exhausting. Giving up for now.
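A common cause of this error is that the DataLoader's worker_init_fn (here `_init_fn`) lives in a scope that cannot be pickled when worker processes are spawned, e.g. it is defined inside a function, or the script has no `if __name__ == '__main__'` guard on Windows. A minimal sketch of the usual workaround, assuming `_init_fn` is just a worker seeding function (the quick-and-dirty alternative is num_workers=0):

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

# Define the worker init function at module top level so spawned workers can pickle it.
# (Name and body are illustrative; the repo's _init_fn may differ.)
def _init_fn(worker_id):
    np.random.seed((torch.initial_seed() + worker_id) % 2**32)

if __name__ == '__main__':   # required with the spawn start method (Windows)
    dataset = TensorDataset(torch.randn(256, 4))   # stand-in dataset
    loader = DataLoader(dataset, batch_size=128, num_workers=1,
                        worker_init_fn=_init_fn)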
——————————————————————————
Even though it won't run yet, might as well study the code anyway.
——————————————————————————
It runs now, just a bit slow. Time to dig into how it works.

+--------------+------------------+
|  Parameter   |      Value       |
+==============+==================+
| Ablation     | none             | one of ['none', 'gcn', 'global', 'local', 'only_local']
+--------------+------------------+
| Batch size   | 128              |
+--------------+------------------+
| Birnn        | 0                | whether to use a bidirectional RNN
+--------------+------------------+
| Change lr    | none             | whether to change the learning rate; the default none means leave it unchanged
+--------------+------------------+
| Checkpoint   | Count_4.092.pth  | pretrained weights stored under saved_models\MMModel
+--------------+------------------+
| Cycle beta   | 0.010            |
+--------------+------------------+
| Dropout      | 0.300            |
+--------------+------------------+
| Fusion type  | coattn           | one of ['none', 'coattn', 'single_visual', 'single_semantic', 'coconcat', 'cosiamese']
+--------------+------------------+
| Gcn layers   | 2                | +1 is added, so the default number of GCN layers is 3
+--------------+------------------+
| Hidden size  | 512              |
+--------------+------------------+
| Lr           | 0.000            |
+--------------+------------------+
| Lr list      | [10, 20, 30, 40] |
+--------------+------------------+
| Max epoch    | 100              |
+--------------+------------------+
| Max n videos | 100000           |
+--------------+------------------+
| Model        | 7                |
+--------------+------------------+
| Momentum     | 0.900            |
+--------------+------------------+
| Num workers  | 1                |
+--------------+------------------+
| Prefetch     | none             | one of [none, nvidia, background]; selects the corresponding nvidia_prefetcher / BackgroundGenerator class
+--------------+------------------+
| Q max length | 35               |
+--------------+------------------+
| Rnn layers   | 1                |
+--------------+------------------+
| Save         | False            | whether to save the model; passing --save sets it to True
+--------------+------------------+
| Save adj     | False            | whether to save the adjacency matrices; passing --save_adj sets it to True
+--------------+------------------+
| Save path    | ./saved_models/  |
+--------------+------------------+
| Server       | 1080ti           | one of ['780', '1080ti', '1080']
+--------------+------------------+
| Task         | Count            | one of [Count, Action, FrameQA, Trans]
+--------------+------------------+
| Test         | False            | False means training; passing --test switches to test mode
+--------------+------------------+
| Tf layers    | 1                |
+--------------+------------------+
| Two loss     | 0                |
+--------------+------------------+
| V max length | 80               |
+--------------+------------------+
| Val ratio    | 0.100            |
+--------------+------------------+
| Weight decay | 0                |
+--------------+------------------+
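These values all come from command-line options. A minimal argparse sketch of how a few of them are presumably declared (the exact flag names are inferred from the table and should be treated as assumptions):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--task', default='Count',
                    choices=['Count', 'Action', 'FrameQA', 'Trans'])
parser.add_argument('--ablation', default='none',
                    choices=['none', 'gcn', 'global', 'local', 'only_local'])
parser.add_argument('--fusion_type', default='coattn',
                    choices=['none', 'coattn', 'single_visual', 'single_semantic',
                             'coconcat', 'cosiamese'])
parser.add_argument('--batch_size', type=int, default=128)
parser.add_argument('--save_path', default='./saved_models/')
parser.add_argument('--save', action='store_true')   # --save      -> Save = True
parser.add_argument('--test', action='store_true')   # --test      -> test mode
args = parser.parse_args()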

Supplement

data_path       | '/home/jp/data/tgif-qa/data' | only set like this when server is '780'; what about the default 1080ti? It is still needed below
feat_dir        | data_path + 'feats'
vc_dir          | data_path + 'Vocabulary'
df_dir          | data_path + 'dataset'
model_name      | 'Count'                      | i.e. the task
pin_memory      | False
dataset         | 'tgif_qa'
log             | './logs'
val_epoch_step  | 1
two_loss        | False                        | set to True if the two_loss above is > 0, otherwise False
birnn           | False                        | same rule as two_loss
save_model_path | save_path + 'MMModel/'
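A sketch of how these derived fields are presumably filled in after parsing (attribute names and the path handling are assumptions based on the listing above):

import os

# hypothetical post-processing of the parsed args
if args.server == '780':
    args.data_path = '/home/jp/data/tgif-qa/data'
# (the log does not show where data_path comes from on the default 1080ti,
#  which is exactly the open question above)

args.feat_dir = os.path.join(args.data_path, 'feats')
args.vc_dir = os.path.join(args.data_path, 'Vocabulary')
args.df_dir = os.path.join(args.data_path, 'dataset')
args.model_name = args.task            # 'Count'
args.two_loss = args.two_loss > 0      # int flag -> bool
args.birnn = args.birnn > 0            # same rule
args.save_model_path = args.save_path + 'MMModel/'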

Two TGIFQA instances are created via data_utils' dataset module: full_dataset (length 26839) and test_dataset (length 3554); the only difference is that the former uses dataset_name='train' and the latter 'test'.
torch.utils.data.random_split then splits off training and validation sets (24156 train, 2683 val), and torch.utils.data builds three DataLoader instances: train_dataloader, val_dataloader and test_dataloader.
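A minimal sketch of that split and loader construction (the TGIFQA import path and constructor arguments beyond dataset_name are assumptions):

from torch.utils.data import DataLoader, random_split
from data_utils import dataset as tgif_dataset   # repo module (import path assumed)

full_dataset = tgif_dataset.TGIFQA(dataset_name='train')  # other ctor args omitted; len 26839
test_dataset = tgif_dataset.TGIFQA(dataset_name='test')   # len 3554

val_size = int(len(full_dataset) * 0.1)                   # val_ratio = 0.1 -> 2683
train_size = len(full_dataset) - val_size                 # 24156
train_dataset, val_dataset = random_split(full_dataset, [train_size, val_size])

train_dataloader = DataLoader(train_dataset, batch_size=128, shuffle=True,  num_workers=1)
val_dataloader   = DataLoader(val_dataset,   batch_size=128, shuffle=False, num_workers=1)
test_dataloader  = DataLoader(test_dataset,  batch_size=128, shuffle=False, num_workers=1)

Note that random_split returns Subset wrappers, which is why the code below reaches the underlying TGIFQA object through train_dataset.dataset (e.g. train_dataset.dataset.word_matrix, train_dataset.dataset.n_words).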
Supplement

resnet_input_size   | 2048
c3d_input_size      | 4096
text_embed_size     | 300                     | train_dataset.dataset.GLOVE_EMBEDDING_SIZE
answer_vocab_size   | None
word_matrix         | (2423, 300) ndarray     | train_dataset.dataset.word_matrix
voc_len             | 2423

VOCABULARY_SIZE = train_dataset.dataset.n_words=2423

Since the current task is 'Count', an nn.MSELoss() (mean squared error) criterion is created, and best_val_acc is initialized to -100.
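A sketch of the per-task criterion choice, assuming the parsed options live in an args namespace; only the Count/MSE branch is confirmed by the log, the cross-entropy branch for the classification tasks is an assumption:

import torch.nn as nn

if args.task == 'Count':
    criterion = nn.MSELoss()            # regression: predict the repetition count
else:                                   # Action / Trans / FrameQA (assumed classification)
    criterion = nn.CrossEntropyLoss()

best_val_acc = -100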

The training model is created via LSTMCrossCycleGCNDropout.
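A hypothetical instantiation using the values collected above; the constructor's parameter names are assumptions, only the values come from the log:

model = LSTMCrossCycleGCNDropout(
    vocab_size=2423,                    # VOCABULARY_SIZE
    text_embed_size=300,                # GloVe embedding size
    resnet_input_size=2048,
    c3d_input_size=4096,
    hidden_size=512,
    dropout=0.3,
    gcn_layers=2 + 1,                   # "Gcn layers" + 1 = 3
    fusion_type='coattn',
    ablation='none',
    word_matrix=train_dataset.dataset.word_matrix,   # (2423, 300) GloVe weights
)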

for ii, data in enumerate(train_dataloader):

(It wasn't obvious at first which thing inside train_dataloader actually holds the data, or how ii and data get pulled out of it; see the unpacking sketch below.)
data is a list of 6 tensors: [128, 80, 2048] float32, [128, 80, 4096] float32, [128] int64, [128, 35] int64, [128] int64, [128] float32.
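enumerate just numbers the batches: ii is the batch index and data is whatever the DataLoader's collate function stacks together from the dataset's __getitem__ items. Judging by the shapes and the forward_count inputs listed further down, the six tensors are unpacked roughly like this (the ordering is inferred, so treat it as an assumption):

for ii, data in enumerate(train_dataloader):
    # default_collate stacks the per-sample items along a new batch dimension
    resnet_inputs, c3d_inputs, video_length, \
        sentence_inputs, question_length, answers = data
    # resnet_inputs:   [128, 80, 2048] float32
    # c3d_inputs:      [128, 80, 4096] float32
    # video_length:    [128]           int64
    # sentence_inputs: [128, 35]       int64
    # question_length: [128]           int64
    # answers:         [128]           float32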

Since change_lr is none, the following optimizer is created:

Adam (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    eps: 1e-08
    lr: 0.0001
    weight_decay: 0
)
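A sketch of how this is presumably set up, with a guess at what a non-none change_lr might do; treating lr_list = [10, 20, 30, 40] as MultiStepLR milestones is an assumption:

import torch.optim as optim

optimizer = optim.Adam(model.parameters(), lr=1e-4, weight_decay=0)

scheduler = None
if args.change_lr != 'none':
    # assumption: lr_list would act as epoch milestones for learning-rate decay
    scheduler = optim.lr_scheduler.MultiStepLR(optimizer,
                                               milestones=args.lr_list, gamma=0.1)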
LSTMCrossCycleGCNDropout(
Reads batches from train_dataloader
#sentence_inputs(batch_size, sentence_len, 1)
#video_inputs(batch_size, frame_num, video_feature)
Since the task is 'Count', forward_count is executed first.
Inputs:
resnet_inputs   [128, 80, 2048]
c3d_inputs      [128, 80, 4096]
video_length    128
sentence_inputs [128, 35]
question_length 128
answers         [128]
# create all_adj [128,115,115]; model_block produces out, adj

### Question encoding
# inputs: sentence_inputs [128, 35], question_length [128]
(sentence_encoder): SentenceEncoderRNN(
  (embedding): Embedding(2423, 300, padding_idx=0)                        # -> embedded [128,35,300]
  (dropout): Dropout(p=0.3, inplace=False)
  (upcompress_embedding): Linear(in_features=300, out_features=512, bias=False)
  (relu)                                                                   # [128,35,300] x [300,512] -> [128,35,512]
  if variable_lengths: nn.utils.rnn.pack_padded_sequence
    # inputs:  embedded [128,35,512], input_lengths [128]
    # output:  a PackedSequence of 4 tensors ([1269,512], 15, 128, 128):
    #          15 time steps with batch sizes 128,128,128,128,128,128,128,128,107,60,35,20,16,6,1
  (rnn): GRU(512, 512, batch_first=True, dropout=0.3)
    # input:   embedded (PackedSequence)
    # outputs: output (PackedSequence with the same sizes), hidden [1, 128, 512]
  if variable_lengths: nn.utils.rnn.pad_packed_sequence
    # input: output; output: output [128, 15, 512]
  # ------------------------------------------------------
  if self.n_layers > 1 and self.bidirectional:
    (compress_output):       Linear(in_features=1024, out_features=512, bias=False) (relu) (dropout 0.3) -> q_output
    (compress_hn_layers_bi): Linear(in_features=1024, out_features=512, bias=False) (relu) (dropout 0.3) -> s_hidden
  elif self.n_layers > 1:
    (compress_hn_layers):    Linear(in_features=512,  out_features=512, bias=False) (relu) (dropout 0.3) -> s_hidden
  elif self.bidirectional:
    (compress_output):       Linear(in_features=1024, out_features=512, bias=False) (relu) (dropout 0.3) -> q_output
    (compress_hn_bi):        Linear(in_features=1024, out_features=512, bias=False) (relu) (dropout 0.3) -> s_hidden
  # ------------------------------------------------------
)
# outputs:
#   q_output [128, 15, 512]  (the output above)
#   s_hidden [1, 128, 512]   (the hidden above), squeezed to s_last_hidden [128, 512]

### Video encoding
(compress_c3d): WeightDropLinear(in_features=4096, out_features=2048, bias=False)
  # c3d_inputs [128,80,4096] x [4096,2048] -> [128,80,2048]
(relu)
(video_fusion): WeightDropLinear(in_features=4096, out_features=2048, bias=False)
  # the compressed c3d features are concatenated with resnet_inputs:
  # [128,80,4096] x [4096,2048] -> video_inputs [128,80,2048]
(relu)
(video_encoder): VideoEncoderRNN(
  # inputs: video_inputs [128,80,2048], video_length [128]
  (project): Linear(in_features=2048, out_features=512, bias=False)
    # [128,80,2048] x [2048,512] -> embedded [128,80,512]
  (relu) (dropout 0.3)
  if variable_lengths: nn.utils.rnn.pack_padded_sequence
    # inputs:  embedded [128,80,512], input_lengths [128]
    # output:  a PackedSequence of 4 tensors ([5311,512], 80, 128, 128) with 80 time steps
  (rnn): GRU(512, 512, batch_first=True, dropout=0.3)
    # outputs: output (PackedSequence), hidden [1, 128, 512]
  if variable_lengths: nn.utils.rnn.pad_packed_sequence -> output [128, 80, 512]
  # same compress_output / compress_hn branching as in the sentence encoder
)
# outputs:
#   v_output [128, 80, 512]
#   v_hidden [1, 128, 512], squeezed to v_last_hidden [128, 512]

if self.ablation != 'local':
  ### Video-question fusion
  if self.tf_layers != 0:
    (q_input_ln): LayerNorm((512,), eps=1e-05, elementwise_affine=False)
    (v_input_ln): LayerNorm((512,), eps=1e-05, elementwise_affine=False)
  ### Self-attention
  if 'self' in self.fusion_type:
    (q_selfattn): SelfAttention(
      (padding_mask_k)   # (bs, q_len, v_len)
      (padding_mask_q)   # (bs, v_len, q_len)
      (encoder_layers): ModuleList(
        SelfAttentionLayer(
          if attn_mask is None or softmax_mask is None: (padding_mask_k) (padding_mask_q)
          # three projections
          (linear_k): WeightDropLinear(in_features=512, out_features=512, bias=False)
          (linear_q): WeightDropLinear(in_features=512, out_features=512, bias=False)
          (linear_v): WeightDropLinear(in_features=512, out_features=512, bias=False)
          (softmax): Softmax(dim=-1)
          (linear_final): WeightDropLinear(in_features=512, out_features=512, bias=False)
          (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=False))))
    (v_selfattn): SelfAttention( ... same structure as q_selfattn ... )
  if 'coattn' in self.fusion_type:
    (co_attn): CoAttention(
      # inputs: layer-normed q_output, v_output
      (padding_mask_k)  # fake_q [128,15,512]   x v_output.T [128,512,80] -> attn_mask     [128,15,80] bool
      (padding_mask_q)  # q_output [128,15,512] x fake_k.T   [128,512,80] -> softmax_mask  [128,15,80] bool
      (padding_mask_k)  # fake_q [128,80,512]   x q_output.T [128,512,15] -> attn_mask_    [128,80,15] bool
      (padding_mask_q)  # v_output [128,80,512] x fake_k.T   [128,512,15] -> softmax_mask_ [128,80,15] bool
      (encoder_layers): ModuleList(
        CoAttentionLayer(
          # inputs: q_output, v_output, attn_mask, softmax_mask, attn_mask_, softmax_mask_
          # four projections
          (linear_question):   WeightDropLinear(in_features=512, out_features=512, bias=False)
          (linear_video):      WeightDropLinear(in_features=512, out_features=512, bias=False)
          (linear_v_question): WeightDropLinear(in_features=512, out_features=512, bias=False)
          (linear_v_video):    WeightDropLinear(in_features=512, out_features=512, bias=False)
          # giving question_q [128,15,512], video_k [128,80,512], question [128,15,512], video [128,80,512]
          # scale = 512^(-1/2)
          # question_q x video_k.T -> attention_qv [128,15,80]
          # attention_qv * scale, then masked_fill(attn_mask, -np.inf)
          (softmax): Softmax(dim=-1)
          # attention_qv then masked_fill(softmax_mask, 0)
          # video_k x question_q.T -> attention_vq [128,80,15]
          # attention_vq * scale, then masked_fill(attn_mask_, -np.inf)
          (softmax): Softmax(dim=-1)
          # attention_vq then masked_fill(softmax_mask_, 0)
          # attention_qv x v_output -> output_qv [128,15,512]
          (linear_final_qv): WeightDropLinear(in_features=512, out_features=512, bias=False)
          (layer_norm_qv): LayerNorm((512,), eps=1e-05, elementwise_affine=False)
          # LayerNorm over output_qv + q_output
          # attention_vq x q_output -> output_vq [128,80,512]
          (linear_final_vq): WeightDropLinear(in_features=512, out_features=512, bias=False)
          (layer_norm_vq): LayerNorm((512,), eps=1e-05, elementwise_affine=False)
          # LayerNorm over output_vq + v_output
        )))
    # outputs: q_output [128,15,512], v_output [128,80,512]
  ### GCN
  (adj_learner): AdjLearner(
    # q_output and v_output are concatenated into graph_nodes [128,95,512]
    (edge_layer_1): Linear(in_features=512, out_features=512, bias=False) (relu)
    (edge_layer_2): Linear(in_features=512, out_features=512, bias=False) (relu)
    # [128,95,512] x [128,512,95] -> adj [128,95,95]
  )
  # q_output and v_output are also concatenated into q_v_inputs [128,95,512]
  (gcn): GCN(
    # inputs: q_v_inputs, adj; output: q_v_output [128,95,512]
    (layers): ModuleList(
      (0): GraphConvolution(
        (weight): Linear(in_features=512, out_features=512, bias=False)
        (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=False)
        (relu) (dropout 0.3))
      (1): GraphConvolution( ... same as (0) ... )
      (2): GraphConvolution( ... same as (0) ... )))
  ### Attention pooling
  (gcn_atten_pool): Sequential(
    (0): Linear(in_features=512, out_features=256, bias=True)   # [128,95,512] x [512,256] -> [128,95,256]
    (1): Tanh()
    (2): Linear(in_features=256, out_features=1, bias=True)     # [128,95,256] x [256,1]   -> [128,95,1]
    (3): Softmax(dim=-1))
  # q_v_output [128,95,512] * local_attn [128,95,1] -> [128,95,512], summed to local_out [128,512]

if self.ablation != 'global':
  ### Global fusion
  (global_fusion): Block(                                        # from the block package
    # inputs: s_last_hidden [128,512], v_last_hidden [128,512]
    (linear0): Linear(in_features=512, out_features=1600, bias=True)   # [128,512] x [512,1600] -> [128,1600]
    (linear1): Linear(in_features=512, out_features=1600, bias=True)
    if self.dropout_input > 0: (dropout) (dropout)
    (get_chunks)
      # inputs: x0, self.sizes_list = [80]*20
      # output: x0_chunks, a list of 20 [128,80] tensors obtained with narrow
    (get_chunks)
      # x1_chunks: a list of 20 [128,80] tensors
    # the 20 tensors of x0_chunks and x1_chunks are traversed in parallel
    (merge_linears0): ModuleList(   # 20 layers
      (0): Linear(in_features=80, out_features=1200, bias=True))   # [128,80] x [80,1200] -> [128,1200]
    (merge_linears1): ModuleList(   # 20 layers
      (0): Linear(in_features=80, out_features=1200, bias=True))
      # [128,1200] * [128,1200] -> [128,1200] (elementwise product)
      # reshaped to [128,15,80], summed to z [128,80]
      # z = relu(z)^(1/2) - relu(-z)^(1/2)  (signed square root)
    (normalize)
    # the 20 chunks are concatenated into [128,1600]
    (linear_out): Linear(in_features=1600, out_features=512, bias=True)  # [128,1600] x [1600,512] -> global_out [128,512]
  )

if self.ablation != 'local':
  (fusion): Block(
    # inputs: global_out [128,512], local_out [128,512]
    (linear0): Linear(in_features=512, out_features=1600, bias=True)
    (linear1): Linear(in_features=512, out_features=1600, bias=True)
    (merge_linears0): ModuleList(   # 20 layers
      (0): Linear(in_features=80, out_features=1200, bias=True))
    (merge_linears1): ModuleList(   # 20 layers
      (0): Linear(in_features=80, out_features=1200, bias=True))
    (linear_out): Linear(in_features=1600, out_features=1, bias=True)   # [128,1600] x [1600,1] -> out [128,1]
  )
# outputs: out [128,1], adj [128, 95, 95]
# adj is written into the top-left 95x95 block of all_adj
# out is clamped to the range 1..10 (which seems to yield all 1s?)
)
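The graph part above is compact enough to sketch on its own. This is a minimal re-implementation of the adjacency learner, the GraphConvolution stack and the attention pooling as described (a sketch for understanding the shapes, not the repo's exact code; in particular the pooling softmax here is taken over the 95 nodes):

import torch
import torch.nn as nn

class AdjLearner(nn.Module):
    """Builds a soft adjacency matrix from the concatenated question/video nodes."""
    def __init__(self, hidden=512):
        super().__init__()
        self.edge_layer_1 = nn.Linear(hidden, hidden, bias=False)
        self.edge_layer_2 = nn.Linear(hidden, hidden, bias=False)
        self.relu = nn.ReLU()

    def forward(self, graph_nodes):                       # [B, 95, 512]
        h = self.relu(self.edge_layer_1(graph_nodes))
        h = self.relu(self.edge_layer_2(h))
        return torch.bmm(h, h.transpose(1, 2))            # adj [B, 95, 95]

class GraphConvolution(nn.Module):
    def __init__(self, hidden=512, dropout=0.3):
        super().__init__()
        self.weight = nn.Linear(hidden, hidden, bias=False)
        self.layer_norm = nn.LayerNorm(hidden, elementwise_affine=False)
        self.act = nn.Sequential(nn.ReLU(), nn.Dropout(dropout))

    def forward(self, x, adj):                            # x [B, 95, 512]
        return self.act(self.layer_norm(torch.bmm(adj, self.weight(x))))

# attention pooling over the 95 graph nodes -> one 512-d local_out vector
gcn_atten_pool = nn.Sequential(
    nn.Linear(512, 256), nn.Tanh(), nn.Linear(256, 1), nn.Softmax(dim=1))

q_v_inputs = torch.randn(128, 95, 512)                    # concat of q_output and v_output
adj = AdjLearner()(q_v_inputs)                            # [128, 95, 95]
x = q_v_inputs
for layer in [GraphConvolution() for _ in range(3)]:      # 3 GCN layers
    x = layer(x, adj)                                     # q_v_output [128, 95, 512]
local_attn = gcn_atten_pool(x)                            # [128, 95, 1]
local_out = (x * local_attn).sum(dim=1)                   # [128, 512]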
Outputs:
out, predictions, answers, all_adj
The MSE loss is computed between the prediction out and the label answers.
One backward pass is done per batch, 188 batches in total.
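A sketch of one training iteration for the Count task as described above; the exact call signature of forward_count is an assumption, the argument order follows the input list quoted earlier:

model.train()
for ii, data in enumerate(train_dataloader):              # 188 batches per epoch
    resnet_inputs, c3d_inputs, video_length, \
        sentence_inputs, question_length, answers = data
    out, predictions, answers, all_adj = model.forward_count(
        resnet_inputs, c3d_inputs, video_length,
        sentence_inputs, question_length, answers)
    loss = criterion(out.squeeze(-1), answers.float())    # MSE between [128] and [128]
    optimizer.zero_grad()
    loss.backward()                                        # one backward pass per batch
    optimizer.step()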

Each epoch an accuracy is computed, e.g. 18.935%, together with the mean training loss for the epoch, e.g. 5.758.
Then a mysterious "true loss" is also computed: the MSE is evaluated once more over all predictions against all labels, e.g. 5.802.

Validation and testing later work the same way: both the mean of the per-batch losses and the final accuracy are computed, plus this "true loss". (Why compute a loss that is never backpropagated? Can it serve as a performance metric?)
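It can. The "true loss" is simply the MSE evaluated once over all predictions and labels gathered across the epoch, rather than the mean of per-batch losses, and without a backward pass it is purely a metric. The two numbers need not match: the mean of batch means equals the dataset-level MSE only when every batch has the same size and the same prediction tensors are used (e.g. clamping out to 1..10 before storing it would already change the second number). A tiny sketch of the averaging difference:

import torch

criterion = torch.nn.MSELoss()

# two batches of different sizes over the same label value
pred_a, ans_a = torch.tensor([1., 2., 3., 4.]), torch.tensor([2., 2., 2., 2.])
pred_b, ans_b = torch.tensor([10.]),            torch.tensor([2.])

mean_of_batch_losses = (criterion(pred_a, ans_a) + criterion(pred_b, ans_b)) / 2
true_loss = criterion(torch.cat([pred_a, pred_b]), torch.cat([ans_a, ans_b]))

print(mean_of_batch_losses.item())   # 32.75  (each batch weighted equally)
print(true_loss.item())              # 14.0   (each sample weighted equally)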
