Simple data filtering with bash

For data-analysis tasks software engineers usually reach for high-level programming languages: Java, C#, Python, etc.

But sometimes scripts are more suitable, for example when we don’t want to install any interpreters or VMs. Bash works well for that.

Imagine the following task. We have some text files with data like in the previous article: a phone number, a date, a city and some amount. We want to get a list of unique phone-city pairs, i.e. keep only the unique entries. For example, it can be used in some smoke tests. Also, these text files can be zipped twice, so the structure is zip->zip->csv, and we have to unzip them first.
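
For a quick self-check it helps to fabricate one such double-zipped file. The names, the date format and the values below are made up; only the zip->zip->csv structure matters:

printf '(0101)136 2170 | 2015-02-11 | Linares | 431.00\n' > data_aaaaa.csv
zip -q data_aaaaa_inner.zip data_aaaaa.csv
zip -q data_aaaaa.zip data_aaaaa_inner.zip
rm data_aaaaa.csv data_aaaaa_inner.zip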

What should we do? I suggest the following steps:

  • take the list of files and unzip them into a temporary dir;
  • feed the extracted files to AWK and get the unique pairs of the 1st and 3rd columns;
  • write these pairs into an output file.

Taking a list of files.

Ok, how can we get a list of files in a directory and iterate over them?

$ for zip in *.zip; do
> echo $zip
> done
xxx1.zip
xxx2.zip
xxx3.zip

Well, ok. But what if each file name is 10-20 characters long and there are about 50K files in the directory? The for loop itself survives, because the glob is expanded inside the shell. But as soon as you hand the expanded list to an external command, e.g. ls *.zip or unzip *.zip, bash greets you with this amazing message: Argument list too long. Why? When you put an asterisk, bash substitutes it with one very long list of all matching file names right before executing the command, and the kernel refuses to start a process whose argument list exceeds the ARG_MAX limit.
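
This is easy to reproduce in a scratch directory (the directory and file names below are made up):

$ mkdir argmax-demo && cd argmax-demo
$ seq -f 'data_%05g.zip' 200000 | xargs touch    # xargs batches the touch calls, so this succeeds
$ ls *.zip
bash: /bin/ls: Argument list too long
$ getconf ARG_MAX    # the per-exec limit, in bytes, that the expanded glob just exceeded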

But we have a solution: the find command. It prints matching names as a stream instead of building one giant argument list.

With it, iterating over the archive names in the current directory looks like this:

cd DIR
for zip in $(find . -name '*.zip'); do
    echo "found zip file [outer]: $zip"
    unzip -qq "$zip" -d tmp
    cd tmp
    zipTempName=$(find . -name '*.zip')
    echo "found zip file [inner]: $zipTempName"
    unzip -qq "$zipTempName"
    rm *.zip
    cd ..
done

The -qq option tells unzip to be really quiet (fewer log lines in the output).
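
One caveat: the for loop above splits find’s output on whitespace, so it breaks on file names that contain spaces. If that can happen, a while read loop over find -print0 output is safer. A minimal sketch of the outer level only (assuming a find that supports -print0 and -maxdepth, as the GNU and BSD versions do):

find . -maxdepth 1 -name '*.zip' -print0 |
while IFS= read -r -d '' zip; do
    echo "found zip file [outer]: $zip"
    unzip -qq "$zip" -d tmp
done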

Getting unique pairs.

When we have extracted all the CSV files and removed the now-unnecessary zip archives, we can run over all of them and collect the unique pairs. A few standard commands are enough here: awk, sort and uniq.

find "$1"/tmp -name '*.csv' -exec awk -F '|' '{ print $1 "|" $3 }' {} \; | sort | uniq > pairs.txt

Using this sequence of commands (it is called a pipeline) we do several things:

  • find all unpacked CSV files in the tmp directory;
  • pass them to AWK, which prints only the 1st and 3rd columns separated by a pipe (see the example after this list);
  • keep only the unique entries of the whole dataset using sort and uniq.
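
To see the AWK stage in isolation, here is how it transforms one made-up line in the same pipe-delimited layout (phone | date | city | amount; the values are invented):

$ echo '(0101)136 2170 | 2015-02-11 | Linares | 431.00' | awk -F '|' '{ print $1 "|" $3 }'
(0101)136 2170 | Linares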

After all that, let’s print the number of unique pairs, remove the tmp directory and return 0 as the script’s exit code.

echo "$(wc -l pairs.txt | awk '{printf $1}') unique pairs are found"
rm -r "$1"/tmp
exit 0
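
A side note on the awk '{printf $1}' part: wc -l pairs.txt prints the count followed by the file name, so we have to cut the name off. Reading the file through stdin avoids that entirely:

# wc prints a bare number when it reads stdin; the unquoted substitution
# also strips the leading spaces BSD wc pads it with
echo $(wc -l < pairs.txt) "unique pairs are found"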

The script can be started by passing the relative path to the input directory as a parameter:

./data-filter.sh input_directory
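
If the file is not executable yet, run chmod +x data-filter.sh first, or start it via bash data-filter.sh input_directory.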

After starting the script, we see a lot of log messages about unzipping and get the result file:

...
found zip file [inner]: ./data_aafyu.zip
found zip file [outer]: ./data_aadaq.zip
found zip file [inner]: ./data_aadaq.zip
found zip file [outer]: ./data_aafzn.zip
found zip file [inner]: ./data_aafzn.zip
200 unique pairs are found
Air-Yuri:scriptsfun b00blik$ head -n 10 pairs.txt
(010030)21955 | Westkerke 
(0101)136 2170 | Linares 
(0101)278 8400 | Mainz 
(0101)539 1114 | Kendal 
(0101)615 3484 | Robechies 
(0101)698 4870 | Palakkad 
(0101)851 4693 | Swansea 
(010120)51434 | Cervino 
(010312)12213 | Harrison Hot Springs 
(010340)44253 | Senftenberg

Putting it all together, the script looks like this:

#!/bin/bash

inputDir=$1
uniquePairsFile="pairs.txt"

cd "$inputDir"
for zip in $(find . -name '*.zip'); do
    echo "found zip file [outer]: $zip"
    unzip -qq "$zip" -d tmp
    cd tmp
    zipTempName=$(find . -name '*.zip')
    echo "found zip file [inner]: $zipTempName"
    unzip -qq "$zipTempName"
    rm *.zip
    cd ..
done
cd ..

find "$inputDir"/tmp -name '*.csv' -exec awk -F '|' '{ print $1 "|" $3 }' {} \; | sort | uniq > "$uniquePairsFile"
echo "$(wc -l "$uniquePairsFile" | awk '{printf $1}') unique pairs are found"

rm -r "$inputDir"/tmp

exit 0