Simple data filtering with bash

For data-analysis tasks, many software engineers use high-level programming languages such as Java, C#, or Python.

But sometimes a script is more suitable, for example when we don’t want to install any interpreters or VMs. Bash can be used for that too.

Imagine the following task. We have some text files with data like in the previous article. Each record has a phone number, a date, a city and some amount. We want to get a list of unique phone-city pairs, that is, we should keep only unique entries. For example, such a list can be used in some smoke tests. Also, these text files can be zipped twice, so the structure is zip->zip->csv, and we have to unzip them as well.

What should we do? I suggest the following steps:

  • Take a list of files and unzip them into a temporary directory;
  • Pass the unzipped files to AWK and get unique pairs of the 1st and 3rd columns;
  • Write these pairs into the output file.

Taking a list of files.

Ok, how can we get a list of files in a directory and iterate over them?
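A straightforward first attempt is a for loop over a wildcard pattern. The sketch below is only an illustration: the function name list_archives and the .zip extension are assumptions, not something taken from the task description.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every .zip file found directly in a directory.
list_archives() {
    local dir="$1"
    local f
    for f in "$dir"/*.zip; do
        # If nothing matches, the pattern stays unexpanded, so skip it.
        [ -e "$f" ] || continue
        echo "$f"
    done
}
```

It could be called as list_archives some/input/dir, where the path is whatever directory holds the archives.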

Well, ok. But what if each file name is about 10-20 characters long and there are about 50K files in the directory? Bash will give you this amazing message: too many arguments (the exact text is usually «Argument list too long»). Wow, why? When you put an asterisk in a command, you get one very long string with all the file names, because bash makes the substitution immediately before the command is executed, and that string can exceed the system limit on the length of the argument list.

But we have a solution: the find command.

Well, iterating over the archive names in the current directory will look like this:
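A minimal sketch of such a loop, under a few assumptions: the archives end in .zip, the inner archives are extracted into the same temporary directory, and find supports -maxdepth and -print0 (GNU and BSD find do). The function name unpack_all and its two parameters are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: unpack zip -> zip -> csv from in_dir into tmp_dir.
unpack_all() {
    local in_dir="$1" tmp_dir="$2"
    local archive
    mkdir -p "$tmp_dir"

    # find streams the names one by one, so no long argument list is built.
    find "$in_dir" -maxdepth 1 -type f -name '*.zip' -print0 |
    while IFS= read -r -d '' archive; do
        unzip -qq -o "$archive" -d "$tmp_dir"
    done

    # Second level: unpack the inner archives and remove them afterwards.
    find "$tmp_dir" -type f -name '*.zip' -print0 |
    while IFS= read -r -d '' archive; do
        unzip -qq -o "$archive" -d "$tmp_dir"
        rm -f -- "$archive"
    done
}
```

Reading null-delimited names with read -d '' keeps file names with spaces intact, and -o lets unzip overwrite files without prompting.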

The -qq option tells unzip to be «quiet» (fewer log lines in the output).

Getting unique pairs.

Once we have extracted all the CSV files and removed the unnecessary zip archives, we can run over all the files and get unique pairs. Here we can use a few command-line utilities: awk, sed, sort, uniq.
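A pipeline of this kind might look like the sketch below. It assumes the columns are comma-separated in the order phone, date, city, amount, and it uses only find, awk, sort and uniq. The function name unique_pairs and its parameters are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write unique "phone|city" pairs from all CSV files
# under tmp_dir into out_file.
# Assumption: comma-separated columns phone,date,city,amount.
unique_pairs() {
    local tmp_dir="$1" out_file="$2"
    # awk prints columns 1 and 3 joined by a pipe; sort groups equal
    # lines together so that uniq can drop the duplicates.
    find "$tmp_dir" -type f -name '*.csv' \
        -exec awk -F',' '{ print $1 "|" $3 }' {} + |
        sort |
        uniq > "$out_file"
}
```

uniq only removes adjacent duplicates, which is why sort has to come first.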

With this sequence of commands (it is called a pipeline) we do several things:

  • find all unpacked CSV files in the tmp directory;
  • pass them to AWK, which prints only the 1st and 3rd columns separated by a pipe;
  • retrieve unique entries from the whole dataset by using sort and uniq.

Finally, let’s print the number of unique pairs, remove the tmp dir and return 0 as the result of the script’s work.
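These closing steps can be sketched as follows. The function name finish, its parameters and the wording of the message are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report the number of unique pairs and remove the
# temporary directory.
finish() {
    local out_file="$1" tmp_dir="$2"
    local count
    # One pair per line, so the line count is the number of pairs.
    count="$(wc -l < "$out_file")"
    # $(( )) strips the padding that some wc implementations add.
    echo "Unique pairs: $((count))"
    rm -rf -- "$tmp_dir"
}
# At the end of a script this call would be followed by: exit 0
```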

The script can be started easily by passing the relative path to the input directory as a parameter.

After the script starts, it prints a lot of log messages about unzipping, and at the end we get a result file.

All in all, the script looks like this:
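What follows is a sketch that puts the steps described above together, not a definitive implementation. It assumes comma-separated columns (phone, date, city, amount), archives named *.zip, and a find that supports -maxdepth and -print0. The default output name unique_pairs.txt and the optional second parameter are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch of the whole script. Assumptions: comma-separated columns
# (phone,date,city,amount), archives named *.zip, GNU or BSD find.

main() {
    local in_dir="$1"
    local out_file="${2:-unique_pairs.txt}"   # hypothetical default name
    local tmp_dir archive

    tmp_dir="$(mktemp -d)" || return 1

    # 1. Unzip the outer archives, then the inner ones, into the temp dir.
    find "$in_dir" -maxdepth 1 -type f -name '*.zip' -print0 |
    while IFS= read -r -d '' archive; do
        unzip -qq -o "$archive" -d "$tmp_dir"
    done
    find "$tmp_dir" -type f -name '*.zip' -print0 |
    while IFS= read -r -d '' archive; do
        unzip -qq -o "$archive" -d "$tmp_dir"
        rm -f -- "$archive"
    done

    # 2. Keep only columns 1 and 3, joined by a pipe, and drop duplicates.
    find "$tmp_dir" -type f -name '*.csv' \
        -exec awk -F',' '{ print $1 "|" $3 }' {} + |
        sort | uniq > "$out_file"

    # 3. Report the number of pairs and clean up.
    echo "Unique pairs: $(($(wc -l < "$out_file")))"
    rm -rf -- "$tmp_dir"
    return 0
}

# Run only when an input directory is passed on the command line.
if [ "$#" -ge 1 ]; then
    main "$@"
    exit $?
fi
```

Assuming the sketch is saved as unique_pairs.sh (a hypothetical name), it could be started with bash unique_pairs.sh data/archives, where data/archives stands for the relative path to the input directory. The result is then written to unique_pairs.txt in the current directory.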