For data-analysis tasks, many software engineers reach for high-level programming languages: Java, C#, Python, etc.
But sometimes a script is more suitable, for instance when we don’t want to install any interpreters or VMs. Bash works well for that.
Imagine the following task. We have some text files with data like in the previous article: each line contains a phone number, a date, a city and an amount. We want to get a list of unique phone–city pairs, keeping only the unique entries. For example, this can be used in smoke tests. The text files are also zipped twice, so the structure is zip -> zip -> csv, and we have to unzip them first.
What should we do? I suggest the following steps:
- Take a list of files, unzip these files into temporary dir;
- Put these files into AWK and get unique pairs of 1st and 3rd columns;
- Write these pairs into output file.
Taking a list of files.
OK, how can we get a list of files in a directory and iterate over them?
$ for zip in *.zip; do
> echo $zip
> done
xxx1.zip
xxx2.zip
xxx3.zip
Well, OK. But what if file names are about 10–20 characters long and there are about 50K files in the directory? At some point the shell gives you this amazing message: Argument list too long. Wow, why? When you write an asterisk, bash expands it into one very long string with all the file names right before the command runs, and the kernel limits the total size of arguments that can be passed to a command.
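The limit comes from the kernel rather than from bash itself. A quick way to inspect it (assuming a POSIX system with getconf available):

```shell
# ARG_MAX is the kernel's cap on the combined size of the argument list
# and environment passed to an executed program; a glob that expands
# past this limit makes the command fail with "Argument list too long".
getconf ARG_MAX
```

On a typical modern Linux system this prints a value in the megabyte range, but it varies by platform.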
But we have a solution: the find command.
Iterating over archive names in the current directory will look like this:
cd DIR
for zip in `find . -name '*.zip'`; do
    echo "found zip file [outter]:" $zip
    unzip -qq $zip -d tmp
    cd tmp
    zipTempName=`find . -name '*.zip'`
    echo "found zip file [inner]:" $zipTempName
    unzip -qq $zipTempName
    rm *.zip
    cd ../
done
The -qq option tells unzip to be «quiet» (fewer log lines in the output).
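One caveat worth noting: command substitution around find splits its output on whitespace, so file names containing spaces would break the loop above. A more robust sketch of the same iteration (same directory layout assumed) feeds find’s output to a while read loop:

```shell
# Iterate over zip files safely, even if names contain spaces.
# -print0 and read -d '' use the NUL byte as the separator
# (both GNU and BSD find support -print0; read -d '' is a bash feature).
find . -name '*.zip' -print0 | while IFS= read -r -d '' zip; do
    echo "found zip file: $zip"
done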
Getting unique pairs.
Once we have extracted all the CSV files and removed the unnecessary zip archives, we can run over all the files and collect the unique pairs. Here we can use a few standard commands: awk, sort and uniq.
find $1/tmp -name '*.csv' -exec awk -F \| '{ print $1 "|" $3 }' {} \; | sort | uniq > pairs.tmp
Using this sequence of commands (called a pipeline) we do several things:
- finding all unpacked CSV files in the tmp directory;
- passing them to AWK, which prints only the 1st and 3rd columns, separated by a pipe;
- retrieving the unique entries from the whole dataset with sort and uniq.
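The dedup step can be tried on a tiny sample (sample.csv below is made up for illustration; sort -u is a standard shorthand for sort | uniq):

```shell
# Build a small pipe-delimited sample in a tmp directory.
mkdir -p tmp
printf '111|2020-01-01|Mainz|5\n111|2020-01-02|Mainz|7\n222|2020-01-01|Kendal|3\n' > tmp/sample.csv

# Same pipeline as in the article, with sort -u replacing sort | uniq.
find tmp -name '*.csv' -exec awk -F'|' '{ print $1 "|" $3 }' {} + | sort -u
# 111|Mainz
# 222|Kendal
```

Note that the first phone number appears twice in the input but only once in the output.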
Finally, let’s print the number of unique pairs, remove the tmp directory and return 0 as the script’s exit code.
echo `wc -l $uniquePairsFile | awk '{printf $1}'` "unique pairs are found"
rm -r $1/tmp
exit 0
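The awk '{printf $1}' trick strips the file name that wc prints after the count. A small demo (pairs.tmp here is a throwaway file, not the script’s output), including a simpler stdin-based alternative:

```shell
# wc -l FILE prints the count followed by the file name,
# e.g. "3 pairs.tmp" (exact spacing varies between implementations).
printf 'a\nb\nc\n' > pairs.tmp
wc -l pairs.tmp

# Reading from stdin makes wc print only the number,
# so the awk post-processing is not needed.
wc -l < pairs.tmp

rm pairs.tmp
```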
The script can be started easily by passing a relative path to the input directory as a parameter:
./data-filter.sh input_directory
After starting the script, we will see a lot of log messages about unzipping, and we will get the result file:
...
found zip file [inner]: ./data_aafyu.zip
found zip file [outter]: ./data_aadaq.zip
found zip file [inner]: ./data_aadaq.zip
found zip file [outter]: ./data_aafzn.zip
found zip file [inner]: ./data_aafzn.zip
200 unique pairs are found
Air-Yuri:scriptsfun b00blik$ head -n 10 pairs.txt
(010030)21955 | Westkerke
(0101)136 2170 | Linares
(0101)278 8400 | Mainz
(0101)539 1114 | Kendal
(0101)615 3484 | Robechies
(0101)698 4870 | Palakkad
(0101)851 4693 | Swansea
(010120)51434 | Cervino
(010312)12213 | Harrison Hot Springs
(010340)44253 | Senftenberg
Putting it all together, the script looks like this:
#!/bin/bash
inputDir=$1
uniquePairsFile="pairs.txt"
cd $1
for zip in `find . -name '*.zip'`; do
    echo "found zip file [outter]:" $zip
    unzip -qq $zip -d tmp
    cd tmp
    zipTempName=`find . -name '*.zip'`
    echo "found zip file [inner]:" $zipTempName
    unzip -qq $zipTempName
    rm *.zip
    cd ../
done
cd ../
find $1/tmp -name '*.csv' -exec awk -F \| '{ print $1 "|" $3 }' {} \; | sort | uniq > $uniquePairsFile
echo `wc -l $uniquePairsFile | awk '{printf $1}'` "unique pairs are found"
rm -r $1/tmp
exit 0