Liu, Hongyuan, Sreepathi Pai, and Adwait Jog. “Why GPUs are
slow at executing NFAs and how to make them faster.” Proceedings of
the Twenty-Fifth International Conference on Architectural Support for
Programming Languages and Operating Systems. 2020.

Summary

This paper introduces a new dynamic scheme that effectively balances compute utilization against memory usage when executing NFAs on GPUs. Specifically, the authors identify two performance bottlenecks in the NFA matching process: excessive data movement and poor compute utilization. To tackle these problems, they demonstrate three proposals: 1) using on-chip resources where possible, 2) converting memory accesses into computation, and 3) mapping only active states to threads. Overall, the study achieves better performance than the previous state-of-the-art GPU implementations of NFAs across a wide range of emerging applications.

In general, this paper tackles a challenging domain-specific problem in the GPU area. I hold a positive view of the sophisticated scheme and well-designed experiments, because the methodology exploits the characteristics of both NFAs and GPUs, and the experiments give sufficient evidence to support the methods. Moreover, to the best of my knowledge, no prior work on NFA processing has considered the data movement and utilization problems in conjunction. However, it should be noted that there are some minor flaws in the choice of comparison methods, and the organization of the paper is not satisfactory.

In the following sections, I analyze the paper in detail in terms of writing, method design, and experiments.

Strengths

The paper has several strengths. First of all, unlike most papers, the title directly asks two questions, which gives readers a preview of the article's content.

From a high-level perspective, the new data structure proposed for storing NFA patterns is sophisticated and exploits the characteristics of GPU execution, since it is challenging for GPUs to obtain enough threads when states are assigned to them. Each node uses a 256-bit array for its match set, packs 4 outgoing edges into a 64-bit integer, and keeps an 8-bit array of attributes (3 bits record the start, accept, and always-active properties; the other 2 bits are used for compression). The authors also examine the behavior of states to determine which have high activation frequency and which have low activation frequency. For example, one scheme uses the 1 KB prefix of the 1 MB input as the profiling input: if a state's activation frequency exceeds a threshold on the profiling input, it is treated as a hot state for the entire execution.

In addition, the new data structure saves a great deal of redundant space, which may be useful for future GPU optimization. In this structure, each node consumes 41 bytes, i.e., 41N bytes in total, compared with 4096N bytes for the alphabet-oriented transition table. The scheme therefore uses only about 1% of the space of the traditional table, which lets the execution better exploit the GPU's on-chip resources for the topology and match sets of NFAs.

Regarding the proposed match-set compression, it is intuitively effective at reducing the number of checks against the array of trigger symbols. Specifically, when a match set has a special structure, such as a continuous run of 1-bits or a continuous run of 0-bits, it is marked by its first and last elements; a thread examining such a match set can then check only that range instead of all the bits. Building on the activity analysis, high-frequency states are mapped one-to-one onto threads, while low-frequency states are stored in a list, and a thread takes responsibility for one or more elements of the list depending on the available computational resources. Moreover, from beginning to end, the article illustrates this complicated process with a simple but comprehensive NFA example containing only 4 states, which makes the whole story easy to understand and analyze.

Next, one of the biggest advantages of this paper is that the experiments are detailed and well designed. On the one hand, the evaluation is complete and standardized: it covers the characteristics of the evaluated NFA applications, throughput improvements, absolute throughput with the proposed schemes, the effect on data movement reduction, and performance sensitivity to the Volta GPU architecture, and the experimental data supports a convincing analysis. On the other hand, in the appendix the authors provide an artifact containing the source code, datasets, workflow, and dependencies. This further supports the reproducibility of the experiments and will be convenient for future researchers.

Considering the performance sensitivity to the Volta GPU architecture, the proposed schemes (HotStart-MaC and HotStart) show more than a 15× speedup over iNFAnt [1], indicating their effectiveness on newer GPU architectures and a great improvement over other methods.

Last but not least, another strength is that the proposed method requires no additional hardware (i.e., it is hardware-free) to improve the performance of NFA-based applications, which greatly reduces the cost of deployment and maintenance. Advanced users can easily apply the scheme, using the provided artifact, to optimize a specific program.

Weaknesses

Regarding the weaknesses, the organization of the article must be mentioned first. The paper contains the following sections: introduction, background, problem/previous efforts, addressing the data movement problem via match set analysis, addressing the utilization problem via activity analysis, evaluation methodology, experimental results, related work, and conclusions. There is some redundancy between these chapters, which may confuse readers. The background, problem/previous efforts, and related work sections could be merged into one section providing the preliminaries for the proposed methods. Moreover, the evaluation methodology and experimental results could form a single independent chapter rather than being split into separate sections.

Although the experiments are well designed, the comparison algorithms in Section 6 are old. For example, iNFAnt [1] and NFA-CG [2] were proposed almost ten years ago, which makes the contributions less convincing. The paper should therefore include more recent baselines, not necessarily restricted to NFA applications, to demonstrate the advancement of the proposed GPU schemes.

Also, in the experimental part, I find that the effect on data movement reduction is not improved much, even though the utilization optimization reduces the number of thread blocks that access the transition table and the input streams. It can be observed that the four proposed schemes, HotStart (Section 5), HotStart-MaC (Section 5), NT (Section 4.2), and NT-MaC (Section 4.3), reduce gld_transactions by 98.9%, 99.3%, 95.9%, and 96.1%, respectively, while NFA-CG already reduces them by 88.2%. One possible reason is that existing methods have already pushed data movement reduction close to its limit, so further large gains are hard to achieve. It can thus be concluded that data movement reduction is a necessary optimization for NFA execution, and many researchers may consider whether there are further directions for optimization [3] beyond simply reducing data movement.

Furthermore, for a domain-specific paper, the related work (Section 8) only covers work on reducing data movement and improving utilization, the two techniques used in the proposed method. It would be better if the related work introduced more up-to-date methods or GPU accelerators, so that readers could gain a fuller understanding of the bottlenecks in improving the throughput of NFA matching on GPUs.

Others

As noted above, the proposed scheme is hardware-free, but considering throughput again, the performance could likely be improved further with hardware/software co-design optimizations that close the remaining gap between hardware and software.

Conclusion

Overall, the work has more strengths than weaknesses. Its strengths include the sophisticated data structure and the detailed experiments, while its flaws lie in the organization of the article and the out-of-date comparison methods. In summary, this paper gives a novel, software-only way to optimize NFA execution on GPUs and can guide future GPGPU work on data movement reduction and structure compression.
