PROBLEM: I frequently need to see the most frequently repeated "patterns" within the last day of specific logs. For example, for a small subset of Tomcat logs:


GET /app1/public/pkg_e/v3/555413242345562/account/stats 401 954 5
GET /app1/public/pkg_e/v3/555412562561928/account/stats 200 954 97
GET /app1/secure/pkg_e/v3/555416251626403/ex/items/ 200 517 18
GET /app1/secure/pkg_e/v3/555412564516032/ex/cycle/items 200 32839 50
DELETE /app1/internal/pkg_e/v3/accounts/555411543532089/devices/bbbbbbbb-cccc-2000-dddd-43a8eabcdaa0 404 - 1
GET /app1/secure/pkg_e/v3/555412465246556/sessions 200 947 40
GET /app1/public/pkg_e/v3/555416264256223/account/stats 401 954 4
GET /app2/provisioning/v3/555412562561928/devices 200 1643 65


If I want to find the most frequently used URLs (along with method and return code), I do:

[root@srv112:~]$ N=6;cat test|awk '{print $1" "$2" ("$3")"}'\
|sed 's/[0-9a-f-]\+ (/%GUID% (/;s/\/[0-9]\{4,\}\//\/%USERNAME%\//'\
|sort|uniq -c|sort -rn|head -$N
4 GET /app1/public/pkg_e/v3/%USERNAME%/account/stats (401)
2 GET /app1/secure/pkg_e/v3/%USERNAME%/devices (200)
2 GET /app1/public/pkg_e/v3/%USERNAME%/account/stats (200)
2 DELETE /app1/internal/pkg_e/v3/accounts/%USERNAME%/devices/%GUID% (404)
1 POST /app2/servlet/handler (200)
1 POST /app1/servlet/handler (200)

If I want to find the most frequent username in the same file, I do:

[root@srv112:~]$ N=4;cat test|grep -Po '(?<=\/)[0-9]{4,}(?=\/)'\
|sort|uniq -c|sort -rn|head -$N
9 555412562561928
2 555411543532089
1 555417257243373
1 555416264256223

The above works fine on small data sets, but for larger inputs the performance (complexity) of sort|uniq -c|sort -rn|head -$N is unbearable (we are talking about ~100 servers, ~250 log files per server, and ~1 million lines per log file).

ATTEMPT TO SOLVE: The |sort|uniq -c part can easily be replaced with an awk one-liner, turning it into:

|awk '{S[$0]+=1}END{for(i in S)print S[i]"\t"i}'|sort -rn|head -$N

but I failed to find a standard, simple and memory-efficient implementation of the "Quick select algorithm" (discussed here) to optimize the |sort -rn|head -$N part. I was looking for GNU binaries, RPMs, awk one-liners or some easily compilable ANSI C code that I could carry and spread across datacenters, to turn:

3 tasty oranges
225 magic balls
17 happy dolls
15 misty clouds
93 juicy melons
55 rusty ideas

into (given N=3):

225 magic balls
93 juicy melons
55 rusty ideas

I could probably grab sample Java code and port it to the stdin format above (by the way, I was surprised by the lack of .quickselect(...) in core Java), but the need to deploy a Java runtime everywhere isn't appealing. I could maybe grab a sample (array-based) C snippet too, adapt it to the stdin format above, and then spend a while testing it and fixing leaks, etc. Or even implement it from scratch in awk. BUT(!) this simple need is likely faced by more than 1% of people on a regular basis, so shouldn't there be a standard (pre-tested) implementation of it out there? Maybe I'm just using the wrong keywords to look it up...

OTHER OBSTACLES: I also faced a couple of issues that I had to work around for large data sets:


Log files are located on NFS-mounted volumes of ~100 servers, so it made sense to parallelize and split the work into smaller chunks.

The above awk '{S[$0]+=1}... requires memory: I'm seeing it die whenever it eats up 16GB (despite having 48GB of free RAM and plenty of swap; maybe some Linux limit I overlooked).

My current solution, which is still unreliable and suboptimal (work in progress), looks like:


find /logs/mount/srv*/tomcat/2013-09-24/ -type f -name "*_22:*"|\
# TODO: reorder 'find' output to round-robin through srv1 srv2 ...
#       to help 'parallel' work with multiple servers at once
parallel -P20 $"zgrep -Po '[my pattern-grep regexp]' {}\
|awk '{S[\$0]+=1}
END{for(i in S)if(S[i]>4)print \"count: \"S[i]\"\\n\"i}'"|\
# I throw away patterns met less than 5 times per log file
# in hope those won't pop on top of result list anyway - bogus
# but helps to address 16GB-mem problem for 'awk' below
awk '{if("count:"==$1){C=$2}else{S[$0]+=C}}
END{for(i in S)if(S[i]>99)print S[i]"\t"i}'|\
# I also skip all patterns which are met less than 100 times
# the hope that these won't be on top of the list is quite reliable
sort -rn|head -$N
# above line is the inefficient one I strive to address
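The TODO about reordering the find output could be approached with a small awk filter. The sketch below is only an illustration: the function name round_robin_by_server is made up, and it assumes the server directory is the 4th "/"-separated component of each path (as in /logs/mount/srvNN/...), which can be overridden with an argument. It buffers the whole path list in memory, which should be harmless for a few tens of thousands of file names.

```shell
# Hypothetical helper: interleave paths so that consecutive lines
# belong to different servers (srv1, srv2, ..., srv1, srv2, ...).
# $1 (optional): index of the "/"-separated field holding the server name.
round_robin_by_server() {
    awk -F/ -v field="${1:-4}" '
    {
        srv = $field
        if (!(srv in n)) order[++servers] = srv   # remember first-seen order
        path[srv, ++n[srv]] = $0                  # k-th path of this server
        if (n[srv] > max) max = n[srv]
    }
    END {
        for (i = 1; i <= max; i++)                # i-th file of every server in turn
            for (s = 1; s <= servers; s++)
                if ((order[s], i) in path) print path[order[s], i]
    }'
}
```

It would sit between find and parallel, e.g. find ... | round_robin_by_server | parallel ...; whether this actually spreads the NFS load better would still need to be measured.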

1 solution



I'm not sure whether writing your own little tool is acceptable to you, but you can easily write a small tool to replace the |sort|uniq -c|sort -rn|head -$N part with |sort|quickselect $N. The benefit of the tool is that it reads the output from the first sort only once, line by line, without keeping much data in memory. In fact, it only needs memory to hold the current line and the top $N lines, which are then printed.

Here's the source quickselect.cpp:







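#include <cassert>
#include <cstdlib>
#include <functional>
#include <iostream>
#include <map>
#include <string>

// Keyed by count in descending order, so the smallest entry is always last.
typedef std::multimap< std::size_t, std::string, std::greater< std::size_t > > winner_t;
winner_t winner;
std::size_t max;

// Add a (count, line) pair and drop the smallest entry once more than max are held.
void insert( std::size_t count, const std::string& line )
{
    winner.insert( winner_t::value_type( count, line ) );
    if( winner.size() > max )
        winner.erase( --winner.end() );
}

int main( int argc, char** argv )
{
    assert( argc == 2 );
    max = std::atol( argv[1] );
    assert( max > 0 );
    std::string current, last;
    std::size_t count = 0;
    // Input must be sorted so that equal lines are adjacent.
    while( std::getline( std::cin, current ) ) {
        if( current != last ) {
            if( count ) insert( count, last );  // flush the previous run, if any
            count = 1;
            last = current;
        }
        else ++count;
    }
    if( count ) insert( count, last );  // flush the final run
    for( winner_t::iterator it = winner.begin(); it != winner.end(); ++it )
        std::cout << it->first << " " << it->second << std::endl;
}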

to be compiled with:


g++ -O3 quickselect.cpp -o quickselect

Yes, I realize you were asking for an out-of-the-box solution, but I don't know of anything that would be equally efficient. And the above is so simple that there is hardly any margin for error (provided you don't mess up the single numeric command-line parameter :)
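For hosts where compiling C++ is inconvenient, the same single-pass idea can be sketched in awk. This is an illustrative sketch rather than a tested replacement: the name topn_sorted is hypothetical, the input is assumed to be already sorted so that equal lines are adjacent, and the table of N winners is scanned linearly, which is reasonable only for small N.

```shell
# Hypothetical helper. Usage: ... | sort | topn_sorted N
# Counts runs of identical adjacent lines and keeps only the N largest runs.
topn_sorted() {
    awk -v n="$1" '
    # Offer a finished run (text, cnt) to the table of at most n winners.
    function offer(text, cnt,    i, m) {
        if (size < n) {                       # table not full yet: append
            size++; c[size] = cnt; t[size] = text
            return
        }
        m = 1                                 # locate the smallest winner
        for (i = 2; i <= size; i++) if (c[i] < c[m]) m = i
        if (cnt > c[m]) { c[m] = cnt; t[m] = text }   # replace it if beaten
    }
    NR == 1                 { last = $0; run = 1; next }
    ($0 "") == (last "")    { run++; next }   # force string comparison
                            { offer(last, run); last = $0; run = 1 }
    END {
        if (NR > 0) offer(last, run)          # flush the final run
        for (i = 1; i <= size; i++) print c[i] "\t" t[i]
    }' | sort -rn                             # orders at most n lines
}
```

With this, the pipeline would read ... | sort | topn_sorted "$N"; the trailing sort -rn only ever sees N lines, so its cost does not depend on the input size.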


