{"id":160,"date":"2019-09-10T21:36:24","date_gmt":"2019-09-10T18:36:24","guid":{"rendered":"http:\/\/b00blik.ru\/tech\/?p=160"},"modified":"2022-08-01T23:36:52","modified_gmt":"2022-08-01T20:36:52","slug":"simple-data-filtering-with-bash","status":"publish","type":"post","link":"https:\/\/b00blik.ru\/tech\/?p=160","title":{"rendered":"Simple data filtering with bash"},"content":{"rendered":"\n\n\n<p>For data-analysis tasks, software engineers usually use high-level programming languages: Java, C#, Python, etc.<\/p>\n<p>But sometimes a script is more suitable, for example when we don\u2019t want to install any interpreters or VMs. <strong>bash<\/strong> can be used for that.<\/p>\n<p><!--more--><\/p>\n<p>Imagine the following task. We have some text files with data like those in the previous article. Each record contains a <i>phone number, date, city and some amount<\/i>. We want to get a list of unique <i>phone-city<\/i> pairs, keeping only one copy of each entry. Such a list can be used, for example, in smoke tests. The text files can also be zipped twice, giving the structure zip-&gt;zip-&gt;csv, so we have to unzip them as well.<\/p>\n<p>What should we do? I suggest the following steps:<\/p>\n<ul>\n<li>Take a list of files and unzip them into a temporary directory;\n<ul>\n<li>Pass these files to AWK and get unique pairs of the 1st and 3rd columns;<\/li>\n<li>Write these pairs into the output file.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p><span style=\"text-decoration: underline;\">Taking a list of files.<\/span><\/p>\n<p>OK, how can we get a list of files in a directory and iterate over them?<\/p>\n<pre class=\"lang:sh decode:true \">$ for zip in *.zip; do\n&gt; echo $zip\n&gt; done\nxxx1.zip\nxxx2.zip\nxxx3.zip<\/pre>\n<p>Well, OK. But what if each file name is about 10-20 characters long and there are about 50K files in the directory? Passing such a glob to an external command (for example <i>ls *.zip<\/i>) can fail with this amazing message: <i>Argument list too long<\/i>. Wow, why? 
When you use an asterisk, bash replaces it with one very long string containing all matching file names. The substitution happens immediately before the command is executed, so the resulting argument list can exceed the system limit.<\/p>\n<p>But we have a solution: the <i>find<\/i> command.<\/p>\n<p>Iterating over the archive names in the current directory will look like this:<\/p>\n<pre class=\"lang:sh decode:true \" title=\"iterating over zips in dir\">cd DIR\nfor zip in `find . -name '*.zip'`; do\n    echo \"found zip file [outter]:\" $zip\n    unzip -qq $zip -d tmp\n    cd tmp\n    zipTempName=`find . -name '*.zip'`\n    echo \"found zip file [inner]:\" $zipTempName\n    unzip -qq $zipTempName\n    rm *.zip\n    cd ..\/\ndone<\/pre>\n<p>The <em>-qq<\/em> option tells <em>unzip<\/em> to be \u00abquiet\u00bb (fewer log lines in the output).<\/p>\n<p><span style=\"text-decoration: underline;\">Getting unique pairs.<\/span><\/p>\n<p>Once all CSV files are extracted and the unnecessary zip archives are removed, we can go over all files and get unique pairs. 
Here we can use a few standard command-line tools, such as <em>awk, sed, sort, uniq<\/em><\/p>\n<pre class=\"lang:sh decode:true\">find $1\/tmp -name '*.csv' -exec awk -F \\| '{ print $1 \"|\" $3 }' {} \\; | sort | uniq &gt; $uniquePairsFile<\/pre>\n<p>Using this sequence of commands (it is called a pipeline), we do several things:<\/p>\n<ul>\n<li>find all unpacked CSV files in the tmp directory;<\/li>\n<li>pass them to AWK, which prints only the 1st and 3rd columns separated by a pipe character;<\/li>\n<li>retrieve the unique entries from the whole dataset by using sort and uniq.<\/li>\n<\/ul>\n<p>Finally, let\u2019s print the number of unique pairs, remove the tmp directory and return 0 as the result of the script\u2019s work.<\/p>\n<pre class=\"lang:sh decode:true \">echo `wc -l $uniquePairsFile | awk '{printf $1}'` \"unique pairs are found\"\nrm -r $1\/tmp\nexit 0<\/pre>\n<p>The script can be started by passing the relative path to the input directory as a parameter:<\/p>\n<pre class=\"lang:sh decode:true\">.\/data-filter.sh input_directory<\/pre>\n<p>After starting the script, we get a lot of log messages about unzipping, followed by the result file:<\/p>\n<pre class=\"lang:sh decode:true\">...\nfound zip file [inner]: .\/data_aafyu.zip\nfound zip file [outter]: .\/data_aadaq.zip\nfound zip file [inner]: .\/data_aadaq.zip\nfound zip file [outter]: .\/data_aafzn.zip\nfound zip file [inner]: .\/data_aafzn.zip\n200 unique pairs are found\nAir-Yuri:scriptsfun b00blik$ head -n 10 pairs.txt\n(010030)21955 | Westkerke \n(0101)136 2170 | Linares \n(0101)278 8400 | Mainz \n(0101)539 1114 | Kendal \n(0101)615 3484 | Robechies \n(0101)698 4870 | Palakkad \n(0101)851 4693 | Swansea \n(010120)51434 | Cervino \n(010312)12213 | Harrison Hot Springs \n(010340)44253 | Senftenberg<\/pre>\n<p>The complete script looks like this:<\/p>\n<pre class=\"lang:sh decode:true \" title=\"script at all\">#!\/bin\/bash\n\ninputDir=$1\nuniquePairsFile=\"pairs.txt\"\n\ncd $1\nfor zip in `find . 
-name '*.zip'`; do\n    echo \"found zip file [outter]:\" $zip\n    unzip -qq $zip -d tmp\n    cd tmp\n    zipTempName=`find . -name '*.zip'`\n    echo \"found zip file [inner]:\" $zipTempName\n    unzip -qq $zipTempName\n    rm *.zip\n    cd ..\/\ndone\ncd ..\/\n\nfind $1\/tmp -name '*.csv' -exec awk -F \\| '{ print $1 \"|\" $3 }' {} \\; | sort | uniq  &gt; $uniquePairsFile\necho `wc -l $uniquePairsFile | awk '{printf $1}'` \"unique pairs are found\"\n\nrm -r $1\/tmp\n\nexit 0\n<\/pre>","protected":false},"excerpt":{"rendered":"<p>For data-analysis tasks many software engineers usually use high-level programming languages: Java, C#, Python, etc. But sometimes it\u2019s more suitable to use scripts if we don\u2019t want to install any interpreters or VMs. So, bash is also used for that.<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","enabled":false},"version":2}},"categories":[1],"tags":[30,31,29,33,32],"class_list":["post-160","post","type-post","status-publish","format-standard","hentry","category-1","tag-bash","tag-csv","tag-data","tag-scripting","tag-shell","entry"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","views":{"total":51,"cached_at":""},"jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p6oGDv-2A","_links":{"self":[{"href":"https:\/\/b00blik.ru\/tech\/index.php?rest_route=\/wp\/v2\/posts\/160","targetHints":{"allow":["GET"]}}],"collection":[{"href
":"https:\/\/b00blik.ru\/tech\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/b00blik.ru\/tech\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/b00blik.ru\/tech\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/b00blik.ru\/tech\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=160"}],"version-history":[{"count":10,"href":"https:\/\/b00blik.ru\/tech\/index.php?rest_route=\/wp\/v2\/posts\/160\/revisions"}],"predecessor-version":[{"id":564,"href":"https:\/\/b00blik.ru\/tech\/index.php?rest_route=\/wp\/v2\/posts\/160\/revisions\/564"}],"wp:attachment":[{"href":"https:\/\/b00blik.ru\/tech\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=160"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/b00blik.ru\/tech\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=160"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/b00blik.ru\/tech\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=160"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}