You often see a directory listing like this and wonder about the total size of all these files:
We can use curl to get the directory listing in HTML form:
% curl -s https://dumps.wikimedia.org/wikidatawiki/entities/20201123/
<html>
<head><title>Index of /wikidatawiki/entities/20201123/</title></head>
<body bgcolor="white">
<h1>Index of /wikidatawiki/entities/20201123/</h1><hr><pre><a href="../">../</a>
<a href="wikidata-20201123-all-BETA.nt.bz2">wikidata-20201123-all-BETA.nt.bz2</a> 27-Nov-2020 00:06 123082501476
<a href="wikidata-20201123-all-BETA.nt.gz">wikidata-20201123-all-BETA.nt.gz</a> 26-Nov-2020 09:26 156598500230
<a href="wikidata-20201123-all-BETA.ttl.bz2">wikidata-20201123-all-BETA.ttl.bz2</a> 26-Nov-2020 13:23 77995024032
<a href="wikidata-20201123-all-BETA.ttl.gz">wikidata-20201123-all-BETA.ttl.gz</a> 26-Nov-2020 05:37 93438511093
<a href="wikidata-20201123-all.json.bz2">wikidata-20201123-all.json.bz2</a> 25-Nov-2020 19:13 60463213830
<a href="wikidata-20201123-all.json.gz">wikidata-20201123-all.json.gz</a> 25-Nov-2020 13:23 90588037193
<a href="wikidata-20201123-md5sums.txt">wikidata-20201123-md5sums.txt</a> 27-Nov-2020 01:25 401
<a href="wikidata-20201123-sha1sums.txt">wikidata-20201123-sha1sums.txt</a> 27-Nov-2020 02:06 449
</pre><hr></body>
</html>
... then we can isolate the 5th column, which holds the file size in bytes:
% curl -s https://dumps.wikimedia.org/wikidatawiki/entities/20201123/ | awk -F ' ' '{print $5}'
123082501476
156598500230
77995024032
93438511093
60463213830
90588037193
401
449
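To see why the size ends up in field 5, here is a small sketch that prints every field awk finds in one row of the listing. The helper name show_fields is made up for this illustration, and the row is copied from the listing above, so no network access is needed.

```shell
# Print each whitespace-separated field of every input line together with
# its index, i.e. the number awk would use for it ($1, $2, ...).
# show_fields is a hypothetical helper name, used only in this sketch.
show_fields() {
    awk '{ for (i = 1; i <= NF; i++) printf "%d: %s\n", i, $i }'
}

# One row copied from the directory listing above.
show_fields <<'EOF'
<a href="wikidata-20201123-md5sums.txt">wikidata-20201123-md5sums.txt</a> 27-Nov-2020 01:25 401
EOF
# prints:
# 1: <a
# 2: href="wikidata-20201123-md5sums.txt">wikidata-20201123-md5sums.txt</a>
# 3: 27-Nov-2020
# 4: 01:25
# 5: 401
```

The space inside the anchor tag splits it into two fields, the date and time take the next two, and the size lands in the 5th.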
... then we can sum up all numbers in the 5th column:
% curl -s https://dumps.wikimedia.org/wikidatawiki/entities/20201123/ | awk -F ' ' '{print $5}' | awk '{ sum += $1 } END { print sum }'
602165788704
... and convert it to human-readable form:
% curl -s https://dumps.wikimedia.org/wikidatawiki/entities/20201123/ | awk -F ' ' '{print $5}' | awk '{ sum += $1 } END { print sum }' | numfmt --to=iec
561G
That's it.
Date: Tue, 5 Jan 2021 13:50:35 +0000
From: Raf Czlonka <rczlonka(a)gmail.com>
To: blog(a)yurichev.com
Subject: awk(1) in "[Unix] Gauging a size of a directory available via Apache webserver index"

Hi Denis,

Just a quick note regarding awk(1) usage in your latest blog post[0]:

- "-F ' '" is redundant as whitespace is the default field separator
- in 2nd and 3rd example, there's no need to run awk(1) twice:

    awk -F ' ' '{print $5}' | awk '{ sum += $1 } END { print sum }'

  can be reduced to:

    awk '{ sum += $5 } END { print sum }'

[0] https://yurichev.com/news/20210105_apache_index/

Regards,
Raf
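Raf's single-awk version can be sketched as a small shell function. The name sum_sizes is made up for this example, and the sample rows are copied from the listing so that the sketch runs without network access. One deviation from Raf's one-liner, which is our own assumption: the total is printed with printf "%.0f" instead of print, as a precaution in case an awk implementation prints large sums in exponent notation.

```shell
# Sum the 5th whitespace-separated field of every line on stdin.
# Lines with fewer than five fields add 0, because awk treats a
# missing field as zero in a numeric context.
# sum_sizes is a hypothetical name; printf "%.0f" replaces the plain
# "print" of the original one-liner (our assumption, see the text above).
sum_sizes() {
    awk '{ sum += $5 } END { printf "%.0f\n", sum }'
}

# Rows copied from the listing above; the <html> line has no 5th field.
sum_sizes <<'EOF'
<html>
<a href="wikidata-20201123-all.json.bz2">wikidata-20201123-all.json.bz2</a> 25-Nov-2020 19:13 60463213830
<a href="wikidata-20201123-all.json.gz">wikidata-20201123-all.json.gz</a> 25-Nov-2020 13:23 90588037193
<a href="wikidata-20201123-md5sums.txt">wikidata-20201123-md5sums.txt</a> 27-Nov-2020 01:25 401
EOF
# prints: 151051251424
```

With the function defined, the live pipeline would shrink to: curl -s https://dumps.wikimedia.org/wikidatawiki/entities/20201123/ | sum_sizes | numfmt --to=iec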