[Unix] Gauging the size of a directory available via an Apache webserver index

You often see a webserver directory index like this one and wonder about the total size of all the files in it.

We can use curl to get the directory listing in HTML form:

 % curl -s https://dumps.wikimedia.org/wikidatawiki/entities/20201123/

<html>
<head><title>Index of /wikidatawiki/entities/20201123/</title></head>
<body bgcolor="white">
<h1>Index of /wikidatawiki/entities/20201123/</h1><hr><pre><a href="../">../</a>
<a href="wikidata-20201123-all-BETA.nt.bz2">wikidata-20201123-all-BETA.nt.bz2</a>       27-Nov-2020 00:06 123082501476
<a href="wikidata-20201123-all-BETA.nt.gz">wikidata-20201123-all-BETA.nt.gz</a>         26-Nov-2020 09:26 156598500230
<a href="wikidata-20201123-all-BETA.ttl.bz2">wikidata-20201123-all-BETA.ttl.bz2</a>     26-Nov-2020 13:23 77995024032
<a href="wikidata-20201123-all-BETA.ttl.gz">wikidata-20201123-all-BETA.ttl.gz</a>       26-Nov-2020 05:37 93438511093
<a href="wikidata-20201123-all.json.bz2">wikidata-20201123-all.json.bz2</a>             25-Nov-2020 19:13 60463213830
<a href="wikidata-20201123-all.json.gz">wikidata-20201123-all.json.gz</a>               25-Nov-2020 13:23 90588037193
<a href="wikidata-20201123-md5sums.txt">wikidata-20201123-md5sums.txt</a>               27-Nov-2020 01:25 401
<a href="wikidata-20201123-sha1sums.txt">wikidata-20201123-sha1sums.txt</a>             27-Nov-2020 02:06 449
</pre><hr></body>
</html>

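... then we can isolate the 5th whitespace-separated column, which holds the file size (lines with fewer than five fields come out blank):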

 % curl -s https://dumps.wikimedia.org/wikidatawiki/entities/20201123/ | awk -F ' ' '{print $5}'




123082501476
156598500230
77995024032
93438511093
60463213830
90588037193
401
449

... then we can sum up all numbers in the 5th column:

 % curl -s https://dumps.wikimedia.org/wikidatawiki/entities/20201123/ | awk -F ' ' '{print $5}' | awk '{ sum += $1 } END { print sum }'
602165788704
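
A slightly more defensive variant of the summing step, as a minimal sketch rather than a definitive version. The function name sum_sizes is made up for illustration. It sums only lines whose 5th field is purely numeric, so HTML lines that happen to have five fields are ignored. It also prints the total with printf "%.0f", on the assumption that some awk implementations (older mawk builds, for instance) may print large totals in scientific notation with a plain print:

```shell
# sum_sizes: hypothetical helper, reads an index page on stdin
# and prints the total of the 5th column in bytes.
sum_sizes() {
    # Only count lines whose 5th field consists of digits alone.
    # printf "%.0f" forces a plain integer, whatever awk's default output format is.
    awk '$5 ~ /^[0-9]+$/ { sum += $5 } END { printf "%.0f\n", sum }'
}

# Usage with the same URL as above (needs network access):
# curl -s https://dumps.wikimedia.org/wikidatawiki/entities/20201123/ | sum_sizes
```

Since it reads from stdin, it works just as well on a saved copy of the index page.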

... and convert it to human-readable form:

 % curl -s https://dumps.wikimedia.org/wikidatawiki/entities/20201123/ | awk -F ' ' '{print $5}' | awk '{ sum += $1 } END { print sum }' | numfmt --to=iec
561G
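
numfmt comes with GNU coreutils, so it may be missing on other systems. The conversion itself is just repeated division by 1024, which can be sketched in awk alone. The function name human is hypothetical, and the one-decimal output format is my choice here, not something numfmt does by default:

```shell
# human: hypothetical helper, converts a byte count on stdin
# to a power-of-1024 unit with one decimal place.
human() {
    awk '{
        n = $1
        split("B K M G T P", unit, " ")   # unit[1] = "B" ... unit[6] = "P"
        i = 1
        # Divide by 1024 until the value fits the current unit.
        while (n >= 1024 && i < 6) { n /= 1024; i++ }
        printf "%.1f%s\n", n, unit[i]
    }'
}

echo 602165788704 | human
```

For the total above this prints 560.8G. numfmt showed 561G for the same number, because it displays no decimals here and rounds away from zero.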

That's it.

Comments

Date: Tue, 5 Jan 2021 13:50:35 +0000
From: Raf Czlonka <rczlonka(a)gmail.com>
To: blog(a)yurichev.com
Subject: awk(1) in "[Unix] Gauging a size of a directory available via Apache webserver index"

Hi Denis,

Just a quick note regarding awk(1) usage in your latest blog post[0]:

- "-F ' '" is redundant as whitespace is the default field separator

- in 2nd and 3rd example, there's no need to run awk(1) twice:

        awk -F ' ' '{print $5}' | awk '{ sum += $1 } END { print sum }'

  can be reduced to:

        awk '{ sum += $5 } END { print sum }'

[0] https://yurichev.com/news/20210105_apache_index/

Regards,

Raf

