Running Awk in parallel to process 256M records
Hm. I'm fully aware that I'm currently turning into a bearded DBA, and I may just be misreading the article or not understanding it fully.
But one thing left me somewhat confused:
> Fortunately, I had access to a large-memory (24 T) SGI system with 512-core Intel Xeon (2.5GHz) CPUs. All the IO is memory (/dev/shm) bound ie. the data is read from and written to /dev/shm.
> The total data size is 329GB.
At first glance, that's an awful lot of hardware for a... decently sized, but not awfully large, dataset. We're dealing with datasets of that size on 32G or 64G of RAM, just a wee bit less.
The article presents a lot more AWK knowledge than I have, and I'm impressed by that.
But I'd probably put all of that into a postgres instance, compute indexes, and rely on automated query optimization and parallelization from there. Maybe tinker with PG-Strom to offload huge index operations to a GPU. A lot of the scripting shown would be handled by postgres itself, the parallelization happens automatically based on the indexes, and the string serialization overhead goes away.
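For a sense of what I mean (database, table, and column names here are made up; the real schema would mirror the actual TSV files), the whole pipeline is roughly:

psql papersdb <<'SQL'
-- hypothetical schema for one of the input files
CREATE TABLE papers (id text, title text, year int);
\copy papers FROM 'papers.tsv'
CREATE INDEX ON papers (id);
-- the planner parallelizes scans and aggregates on its own,
-- given max_parallel_workers_per_gather > 0
SELECT year, count(*) FROM papers GROUP BY year;
SQL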
I do agree with the underlying sentiment of "We don't need hadoop". I'm impressed that AWK goes so far. I'd still recommend postgres in this case as a first solution. Maybe I just work with too many silly people at the moment.
I'm sure that with tools like MPI-Bash [1] and, more generally, libcircle [2], many embarrassingly parallel problems can easily be tackled with standard *nix tools.
[1] https://github.com/lanl/MPI-Bash [2] http://hpc.github.io/libcircle/
It's odd that TFA has this focus on performance but doesn't mention which awk implementation was used; at least I haven't found any mention of it. There are 3-4 implementations in mainstream use: nawk (the one true awk, an ancient version of which is installed on Mac OS by default), mawk (installed on e.g. Ubuntu by default), gawk (on RH by default, last I checked), and busybox awk. Tip: mawk is much faster than the others, and to get performance out of gawk you should use LANG=C (which also avoids the crashes with complex regexps in Unicode locales seen in some versions of gawk 3 and 4).
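In practice the difference is just in how you invoke it, e.g. (the script and data file names are placeholders):

mawk -f script.awk data.tsv
LC_ALL=C LANG=C gawk -f script.awk data.tsv   # C locale: faster, and sidesteps the Unicode-regex crashes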
This somewhat reminds me of Taco Bell programming: http://widgetsandshit.com/teddziuba/2010/10/taco-bell-progra...
I've always wanted to build a parallel awk. And call it pawk. And have an O'Reilly book about it. With a chicken on the cover. pawk, pawk, pawk! This is a true story, sadly.
> !($1 in a) && FILENAME ~ /aminer/ { print }
This uses a regular expression. As a regex is not actually needed in this case, you might be able to get better performance with something like this:
!($1 in a) && index(FILENAME, "aminer") != 0 { print }
I like the idea of using AWK for this, but you could give kdb/q a try. 250M rows is nothing for kdb, and it seems you can afford the license.
gawk has been my go-to text processing program for many years. I have written a number of multi-thousand-line programs in it. I always use the lint option; it catches many of my errors. One of these programs had to read a 300,000,000-byte file into a single string so it could be searched. The file was in 32-byte lines. At first I read each line in and appended it to the result string, but that took way too long, since the result string was reallocated each time it was appended to. So instead I read in about 1000 of the 32-byte lines at a time, appending them to a local string. This 32000-byte string was then appended to the result, so that append was only done about 10000 times. Worked fine.
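A rough sketch of that buffering pattern, in case it helps anyone (the file name and the exact flush threshold are just illustrative):

gawk '
{
  buf = buf $0               # append each 32-byte line to a small local buffer
  if (++n == 1000) {         # every 1000 lines, flush the buffer into the big string
    big = big buf
    buf = ""
    n = 0
  }
}
END {
  big = big buf              # flush whatever is left over
  print length(big)          # sanity check; the actual searching would happen here
}' records.txt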
Spending minutes on these tasks on hardware like this is pretty silly. awk is fine if these are just one-off scripts where development time is the priority; otherwise you're wasting tons of compute time.
Querying things like these on such a small dataset should take seconds, not minutes.
Definitely an under-appreciated tool. Very useful for one-off tasks that would take a fair bit longer to code in something like python.
I am often impressed by the things that can be done with these old-school UNIX tools. I'm trying to learn a few of them, and the most difficult part is the very implicit syntax constructions. How is the naive observer to know that in bash `$(something)` is command substitution, but in a Makefile `$(something)` is just a normal variable (small example below)? With `awk`, `sed` and friends it gets even worse, of course.
Is the proper answer 'just learn it'? Are these tools one of those things (like musical instruments or painting) where the initial learning phase is tedious and frustrating, but the potential is basically limitless?
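To illustrate the `$(something)` point:

echo "today is $(date)"        # bash: $(...) is command substitution
# whereas in a Makefile, $(...) just expands a make variable or function:
#   SRCS = $(wildcard *.c)
#   all: ; @echo $(SRCS)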
First, I think it is great that you found a tool that suits your needs. A few weeks ago I was mangling some data too (just about 17 million records) and would like to contribute my experience.
My tools of choice were awk, R, and Go (in that order). Sometimes I could calculate something within a few seconds with awk. But for various calculations, R proved to be a lot faster. At some point I reached a problem where the simple R implementation I borrowed from Stack Overflow (which was supposed to be much faster than the other posted solutions) did not meet my expectations, and I spent 4 hours writing an implementation in Go that was an order of magnitude faster (I think it was about 20 minutes vs. 20 seconds).
So my advice is to broaden your toolset. When you reach the point where a single execution of your awk program takes 48 minutes, it might be worth considering another tool. However, that doesn't mean awk isn't a good tool; I still use it for simple things, as writing 2 lines of awk is much faster than writing 30 lines of Go for the same task.
https://mobile.twitter.com/awkdb was a joke account made in frustration by a coworker trying to operate a Hadoop cluster almost a decade ago. Maybe it's time to hand over the account...
Your system had 512-core Xeons? Did you mean you had 5 12-core Xeons? Or 512 cores total?
And here I am, working on a big distributed system that has to handle 200k records a day (and barely manages even that). Sigh.
Turning JSON data into tabular data using jq was pretty neat. So many JSON APIs in use today, yet there's still a need for CSV and Excel docs.
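For anyone curious, something like this goes a long way (the field names here are invented, adjust to your JSON):

jq -r '.[] | [.id, .name, .year] | @tsv' input.json > output.tsv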
Amazing work. Keep it up.
You couldn't have used FS="\036" or "\r"?
Why not just use data.table?
The solution would be much less error prone and most likely much quicker as well.
Slightly off topic, but as a Swift (https://swift.org/) developer, the usage of Swift/T in this project really confused me. Is Swift/T in any way related to Apple's Swift language?
The naming conflict makes googling the differences fairly challenging.
Try mawk if you can. I find that it's even faster.
GNU Parallel + AWK = even less code to write.
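For instance, something along these lines (the chunk layout, script name, and job count are made up):

ls chunks/*.tsv | parallel -j16 'awk -f count.awk {} > results/{/.}.out'
# {} is the input file, {/.} is its basename without the extension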
Wow - great insight.