Replacing an SQL query with unix sort, uniq and awk

匿名 (未验证) 提交于 2019-12-03 10:09:14

问题:

We currently have some data on an HDFS cluster on which we generate reports using Hive. The infrastructure is in the process of being decommissioned and we are left with the task of coming up with an alternative of generating the report on the data (which we imported as tab separated files into our new environment)

Assuming we have a table with the following fields.

  • Query
  • IPAddress
  • LocationCode

Our original SQL query we used to run on Hive was (well not exactly.. but something similar)

select  COUNT(DISTINCT Query, IPAddress) as c1, LocationCode as c2,  Query as c3 from table group by Query, LocationCode 

I was wondering if someone could provide me with an the most efficient script using standard unix/linux tools such as sort, uniq and awk which can act as a replacement for the above query.

Assume the input to the script would be a directory of text files. the dir would contain about 2000 files. Each file would contain arbitrary number of tab separated records of the form :

Query <TAB> LocationCode <TAB> IPAddress <NEWLINE> 

回答1:

Once you have a sorted file containing all the unique

Query <TAB> LocationCode <TAB> IPAddress <NEWLINE> 

you could:

awk -F '\t' 'NR == 1 {q=$1; l=$2; count=0} q == $1 && l == $2{count++} q != $1 || l != $2{printf "%s\t%s\t%d\n", q, l, count; q=$1; l=$2; count=1} END{printf "%s\t%s\t%d\n", q, l, count}' sorted_uniq_file 

To get this sorted_uniq_file the naive way can be:

sort -u dir/* > sorted_uniq_file 

But this can be very long and memory consuming.

A faster option (and less memory consuming) could be to eliminate duplicate as soon as possible, sorting first and merging later. This needs a temporary space for the sorted file, let use a directory named sorted:

mkdir sorted; for f in dir/*; do    sort -u $f > sorted/$f done sort -mu sorted/* > sorted_uniq_file rm -rf sorted 

If the solution above hit some shell or sort limit (expansion of dir/*, or of sorted/*, or number of parameters of sort):

mkdir sorted; ls dir | while read f; do    sort -u dir/$f > sorted/$f done while [ `ls sorted | wc -l` -gt 1 ]; do   mkdir sorted_tmp   ls sorted | while read f1; do     if read f2; then       sort -mu sorted/$f1 sorted/$f2 > sorted_tmp/$f1     else       mv sorted/$f1 sorted_tmp     fi   done   rm -rf sorted   mv sorted_tmp sorted done mv sorted/* sorted_uniq_file rm -rf sorted 

The solution above can be optimized to merge more that 2 files at the same time.



回答2:

Not a direct answer to your original question (which you already got), but if you have a bunch of flat file data that you want to query in different ways you might consider using NoSQL:

http://www.strozzi.it/cgi-bin/CSA/tw7/I/en_US/nosql/Home%20Page

This NoSQL project is totally different animal from (and predates by many years) what have more recently come to be known as "NoSQL databases". Instead, this NoSQL ties together Unix tools, with Awk as the centerpiece, to simplify their use in accessing and maintaining a database of formatted text files. Makes it easy to do a lot of slick stuff, e.g., table joins.



标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!