PROBLEM: I frequently need to see which "patterns" repeat most often within the last day of specific logs. For example, take this small subset of Tomcat logs:

GET /app1/public/pkg_e/v3/555413242345562/account/stats 401 954 5

GET /app1/public/pkg_e/v3/555412562561928/account/stats 200 954 97

GET /app1/secure/pkg_e/v3/555416251626403/ex/items/ 200 517 18

GET /app1/secure/pkg_e/v3/555412564516032/ex/cycle/items 200 32839 50

DELETE /app1/internal/pkg_e/v3/accounts/555411543532089/devices/bbbbbbbb-cccc-2000-dddd-43a8eabcdaa0 404 - 1

GET /app1/secure/pkg_e/v3/555412465246556/sessions 200 947 40

GET /app1/public/pkg_e/v3/555416264256223/account/stats 401 954 4

GET /app2/provisioning/v3/555412562561928/devices 200 1643 65

...

To find the most frequently used URLs (along with the method and return code), I run:

[root@srv112:~]$ N=6;cat test|awk '{print $1" "$2" ("$3")"}'\
|sed 's/[0-9a-f-]\+ (/%GUID% (/;s/\/[0-9]\{4,\}\//\/%USERNAME%\//'\
|sort|uniq -c|sort -rn|head -$N

4 GET /app1/public/pkg_e/v3/%USERNAME%/account/stats (401)

2 GET /app1/secure/pkg_e/v3/%USERNAME%/devices (200)

2 GET /app1/public/pkg_e/v3/%USERNAME%/account/stats (200)

2 DELETE /app1/internal/pkg_e/v3/accounts/%USERNAME%/devices/%GUID% (404)

1 POST /app2/servlet/handler (200)

1 POST /app1/servlet/handler (200)

To find the most frequent usernames in the same file, I run:

[root@srv112:~]$ N=4;cat test|grep -Po '(?<=\/)[0-9]{4,}(?=\/)'\
|sort|uniq -c|sort -rn|head -$N

9 555412562561928

2 555411543532089

1 555417257243373

1 555416264256223

The above works fine on small data sets, but for larger inputs the performance (complexity) of sort|uniq -c|sort -rn|head -$N is unbearable (I'm talking about ~100 servers, ~250 log files per server and ~1 million lines per log file).

ATTEMPT TO SOLVE: The |sort|uniq -c part can easily be replaced with an awk one-liner, turning it into:

|awk '{S[$0]+=1}END{for(i in S)print S[i]"\t"i}'|sort -rn|head -$N

but I failed to find a standard, simple and memory-efficient implementation of the "quickselect algorithm" (discussed here) to optimize the |sort -rn|head -$N part. I was looking for GNU binaries, RPMs, awk one-liners or some easily compilable ANSI C code that I could carry and spread across datacenters, to turn:

3 tasty oranges

225 magic balls

17 happy dolls

15 misty clouds

93 juicy melons

55 rusty ideas

...

into (given N=3):

225 magic balls

93 juicy melons

55 rusty ideas

I could probably grab sample Java code and port it to the above stdin format (by the way, I was surprised by the lack of .quickselect(...) in core Java), but the need to deploy a Java runtime everywhere isn't appealing. I could maybe grab a sample (array-based) C snippet too, adapt it to the above stdin format, and then spend a while testing and fixing leaks and so on. Or even implement it from scratch in awk. BUT(!) this simple need is likely faced by more than 1% of people on a regular basis, so shouldn't there be a standard (pre-tested) implementation of it out there? Here's hoping... maybe I'm just using the wrong keywords to look it up...

OTHER OBSTACLES: I also faced a couple of issues when working around this for large data sets:

log files are located on NFS-mounted volumes of ~100 servers - so it made sense to parallelize and split the work into smaller chunks

the above awk '{S[$0]+=1}... requires memory - I'm seeing it die whenever it eats up 16GB (despite having 48GB of free RAM and plenty of swap... maybe some linux limit I overlooked)

My current solution, which is still unreliable and suboptimal (work in progress), looks like:

find /logs/mount/srv*/tomcat/2013-09-24/ -type f -name "*_22:*"|\
# TODO: reorder 'find' output to round-robin through srv1 srv2 ...
# to help 'parallel' work with multiple servers at once
parallel -P20 $"zgrep -Po '[my pattern-grep regexp]' {}\
|awk '{S[\$0]+=1}
END{for(i in S)if(S[i]>4)print \"count: \"S[i]\"\\n\"i}'"|\
# I throw away patterns met less than 5 times per log file
# in hope those won't pop on top of result list anyway - bogus
# but helps to address 16GB-mem problem for 'awk' below
awk '{if("count:"==$1){C=$2}else{S[$0]+=C}}
END{for(i in S)if(S[i]>99)print S[i]"\t"i}'|\
# I also skip all patterns which are met less than 100 times
# the hope that these won't be on top of the list is quite reliable
sort -rn|head -$N
# above line is the inefficient one I strive to address

1 solution

I'm not sure whether writing your own little tool is acceptable to you, but you can easily write a small tool that replaces the |sort|uniq -c|sort -rn|head -$N part with |sort|quickselect $N. The benefit of the tool is that it reads the output of the first sort only once, line by line, without keeping much data in memory. In fact, it only needs memory to hold the current and previous lines plus the top $N lines, which are printed at the end.

Here's the source quickselect.cpp:

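#include <cassert>
#include <cstdlib>
#include <iostream>
#include <map>
#include <string>

// Entries are ordered by descending count, so the smallest one is always last.
typedef std::multimap< std::size_t, std::string, std::greater< std::size_t > > winner_t;
winner_t winner;
std::size_t max;

// Record a (count, line) pair; once more than max entries are held, drop the smallest.
void insert( std::size_t count, const std::string& line )
{
    winner.insert( winner_t::value_type( count, line ) );
    if( winner.size() > max )
        winner.erase( --winner.end() );
}

int main( int argc, char** argv )
{
    assert( argc == 2 );
    max = std::atol( argv[1] );
    assert( max > 0 );
    std::string current, last;
    std::size_t count = 0;
    // The input must be sorted so that identical lines are adjacent.
    while( std::getline( std::cin, current ) ) {
        if( current != last ) {
            if( count ) insert( count, last );  // flush the finished run, if any
            count = 1;
            last = current;
        }
        else ++count;
    }
    // Flush the final run; a failed getline may leave 'current' empty, so use 'last'.
    if( count ) insert( count, last );
    for( winner_t::iterator it = winner.begin(); it != winner.end(); ++it )
        std::cout << it->first << " " << it->second << std::endl;
}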

to be compiled with:

g++ -O3 quickselect.cpp -o quickselect

Yes, I realize you were asking for an out-of-the-box solution, but I don't know of anything that would be equally efficient. And the above is so simple that there is hardly any room for error (provided you don't mess up the single numeric command-line parameter :)
