[Unix] Using the 'uniq -c' command to get some statistics, part II

Previously: 1, 2.

File extension statistics

 % find . -print | awk -F . '{print $NF}' | sort | uniq -c | sort -n

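( awk -F . '{print $NF}' keeps only the text after the last dot, i.e. the file extension. For a name without an extension it prints whatever follows the last dot in the path, e.g. '/README' for './README'. )

For example, for the libssh-0.9.4 project: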

...
      1 yml
      2 md
      2 png
      2 sh
      6 in
      8 svg
      9 pub
     12 dox
     14 symbols
     14 txt
     27 cmake
     65 h
    174 c
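
The one-liner above also counts directories and files that have no extension at all. A minimal sketch of a stricter variant, assuming that only regular files whose names contain a dot are of interest (ext_stats is a hypothetical helper name, not part of any tool):

```shell
# Count file extensions under a directory (default: current directory).
# Assumption: "extension" means the text after the last dot of the file name.
ext_stats() {
    find "${1:-.}" -type f -name '*.*' |
        sed 's|.*/||' |            # strip the directory part, keep the base name
        awk -F . '{print $NF}' |   # keep the text after the last dot
        sort | uniq -c | sort -n
}
```

Called as 'ext_stats .' it should give the same kind of table. Note that a dotfile such as .gitignore is counted as extension 'gitignore' under this definition.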

File type statistics

This works for the current directory only. For example, in libssh-0.9.4/tests:

 % file -b * | sort | uniq -c | sort -n
      1 Python script, ASCII text executable
      4 ASCII text
      9 directory
     19 C source, ASCII text

I'm using the -b option of the file command here:

     -b, --brief
             Do not prepend filenames to output lines (brief mode).

Now for all files in the tree:

 % find . -type f -exec file -b {} \; | sort | uniq -c | sort -n

      1 ASCII text, with no line terminators
      1 C++ source, ASCII text
      1 C source, ASCII text, with very long lines
      1 data
      1 empty
      1 HTML document, ASCII text
      1 HTML document, UTF-8 Unicode text
      1 JSON data
      1 OpenSSH DSA public key
      1 OpenSSH ED25519 public key
      1 OpenSSH private key
      1 OpenSSH RSA1 private key, version 1.1
      1 PEM DSA private key
      1 Python script, ASCII text executable
      1 ReStructuredText file, UTF-8 Unicode text
      1 UTF-8 Unicode text
      2 ASCII text, with very long lines
      2 OpenSSH ECDSA public key
      2 PEM EC private key
      2 PNG image data, 25 x 25, 8-bit/color RGBA, non-interlaced
      3 OpenSSH RSA public key
      3 ReStructuredText file, ASCII text
      4 PEM RSA private key
      8 SVG Scalable Vector Graphics image
     80 ASCII text
    245 C source, ASCII text
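
Starting one file process per file gets slow on large trees. A possible variant, assuming a find that supports the POSIX '-exec ... {} +' form and a file command that supports --mime-type (type_stats is a hypothetical helper name). It passes many paths to each file invocation and groups by MIME type, which gives coarser buckets than the full descriptions above:

```shell
# Count MIME types of all regular files under a directory (default: current).
type_stats() {
    # '{} +' batches many paths into a single invocation of file.
    find "${1:-.}" -type f -exec file -b --mime-type {} + |
        sort | uniq -c | sort -n
}
```

Dropping --mime-type should bring back the detailed descriptions while keeping the batching.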

Most active days of a project's development

For libssh-0.9.4 again:

 % find . -type f -exec stat {} \;

  File: ./src/dh_key.c
  Size: 10023           Blocks: 24         IO Block: 4096   regular file
Device: fd00h/64768d    Inode: 31855060    Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1000/       i)   Gid: ( 1000/       i)
Access: 2021-12-07 12:31:34.074679564 +0200
Modify: 2020-04-06 12:36:35.000000000 +0300
Change: 2020-12-18 20:19:06.156283089 +0200
 Birth: -
  File: ./src/knownhosts.c
  Size: 36939           Blocks: 80         IO Block: 4096   regular file
Device: fd00h/64768d    Inode: 31855074    Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1000/       i)   Gid: ( 1000/       i)
Access: 2021-12-07 12:31:34.090679832 +0200
Modify: 2020-01-27 17:45:32.000000000 +0200
Change: 2020-12-18 20:19:06.188283745 +0200
 Birth: -

Let's narrow this down to the 'Modify' field only:

 % find . -type f -exec stat {} \; | grep Modify

Modify: 2020-01-27 17:45:32.000000000 +0200
Modify: 2020-03-27 14:13:36.000000000 +0200
Modify: 2020-01-27 17:45:32.000000000 +0200
Modify: 2020-01-27 17:45:32.000000000 +0200
Modify: 2020-01-27 17:45:32.000000000 +0200

Pull only the date from these lines:

 % find . -type f -exec stat {} \; | grep Modify | cut -d ' ' -f 2

2020-01-27
2020-01-27
2020-01-27
2020-01-27
2020-04-09
2020-04-09

Collect statistics. Which days are the most active:

 % find . -type f -exec stat {} \; | grep Modify | cut -d ' ' -f 2 | sort | uniq -c | sort -n
      1 2018-12-07
      2 2019-01-31
      2 2020-03-30
      3 2009-06-15
      3 2013-02-07
      7 2020-03-27
      8 2016-02-15
      8 2018-09-10
      9 2020-04-06
     22 2018-10-19
     68 2020-04-09
    234 2020-01-27

Narrow it down to C files only, using -name (or the case-insensitive -iname):

 % find . -name '*.c' -type f -exec stat {} \; | grep Modify | cut -d ' ' -f 2 | sort | uniq -c | sort -n
      1 2009-06-15
      1 2013-02-07
      1 2018-09-10
      1 2018-12-07
      4 2018-10-19
      5 2016-02-15
      6 2020-03-27
      8 2020-04-06
     34 2020-04-09
    113 2020-01-27
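
The stat | grep | cut chain can be shortened. A sketch, assuming GNU find (its -printf action is not in POSIX); mtime_days is a hypothetical helper name, and its optional second argument is a -name pattern:

```shell
# Count files per modification day under a directory (default: current).
mtime_days() {
    # %TY-%Tm-%Td prints the modification date as YYYY-MM-DD (local time).
    find "${1:-.}" -type f -name "${2:-*}" -printf '%TY-%Tm-%Td\n' |
        sort | uniq -c | sort -n
}
```

For instance, mtime_days . '*.c' restricts the count to C files.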

As seen at reddit.


(Updated in Dec-2023.)

Also, let's look at the timezones of the commit authors of a popular GitHub project, such as LLVM.

 % git remote -v
origin  https://github.com/llvm/llvm-project.git (fetch)
origin  https://github.com/llvm/llvm-project.git (push)

 % git log | grep "^Date:" | cut -d ' ' -f 9 | sort | uniq -c | sort -n
      2 +0430
      9 +0330
     10 +1200
     36 +1300
     43 -1000
    129 -0300
    129 +1100
    137 +1000
    163 +0400
    213 +0500
   1355 +0700
   1574 +0530
   1672 +0900
   1701 -0600
   5247 +0300
   5534 +0800
   9600 -0500
  12015 -0400
  14380 +0200
  19276 -0800
  23018 +0100
  35887 -0700
 352210 +0000

UTC (+0000) is to be ignored: probably some bots. The next most popular are -0700 and -0800, the west coast of the U.S. +0100 and +0200 are (western) Europe. -0400 and -0500 are the east coast of the U.S. +0800 is China. +0300 is eastern Europe and Russia. Neat! Leaving UTC aside, the largest share of LLVM commits (about 42%) comes from the west coast of the U.S.
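
Cutting the ninth space-separated field depends on the exact layout of the 'Date:' line. A sketch that asks git for the offset directly, assuming a git recent enough to expand %z in --date=format: to the commit's own offset (tz_stats is a hypothetical helper name):

```shell
# Count commits per UTC offset in a repository (default: current directory).
tz_stats() {
    # %ad is the author date; --date=format:%z reduces it to the UTC offset.
    git -C "${1:-.}" log --format='%ad' --date=format:'%z' |
        sort | uniq -c | sort -n
}
```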

Now let's determine the most active hours in LLVM development.

Dates (only) for all commits:

 % git log --date=format:'%Y-%m-%d %H:%M:%S %z' --pretty=format:'%ad'
2023-12-19 11:46:30 -0700
2023-12-19 10:44:18 -0800
2023-12-19 11:39:00 -0700
2023-12-19 19:32:17 +0100
...

Ouch, these dates carry each committer's own timezone. Adding '--date=local' converts them all to the local timezone of the machine. (It replaces the custom format, so the output switches to git's default layout.)

 % git log --pretty=format:'%ad' --all --date=local
Tue Dec 19 19:46:30 2023
Tue Dec 19 19:44:18 2023
Tue Dec 19 19:39:00 2023
Tue Dec 19 19:32:17 2023
Tue Dec 19 19:26:23 2023
...

I ran this on a server in the CET time zone (Germany), so all dates are converted to CET.

Filter only HH:MM:SS:

 % git log --pretty=format:'%ad' --all --date=local | cut -d ' ' -f 4
19:46:30
19:44:18
19:39:00
19:32:17
19:26:23
19:11:42
...

Filter only hour and show statistics:

 % git log --pretty=format:'%ad' --all --date=local | cut -d ' ' -f 4 | cut -d ':' -f 1 | sort | uniq -c
  30984 00
  28691 01
  24164 02
  17517 03
  12288 04
  11004 05
  12132 06
  12993 07
  13695 08
  13022 09
  13935 10
  14335 11
  13449 12
  12357 13
  15000 14
  17468 15
  19965 16
  22351 17
  27073 18
  31797 19
  33739 20
  27708 21
  33076 22
  33678 23

As you can see, the most active hours are between 19:00 and 01:00 CET: night in Europe and/or daytime in the U.S.
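
Git can also do the timezone conversion and the hour extraction itself. A sketch, assuming a git that supports the '-local' suffix of --date=format (hour_stats is a hypothetical helper name; the timezone is passed through the TZ environment variable and defaults to UTC here):

```shell
# Count commits per hour of day, with all dates converted to one timezone.
# Usage: hour_stats [timezone] [repository]
hour_stats() {
    # format-local:%H prints only the hour, converted to the timezone in TZ.
    TZ="${1:-UTC}" git -C "${2:-.}" log --all --format='%ad' --date=format-local:'%H' |
        sort | uniq -c
}
```

Something like hour_stats Europe/Berlin should give a table comparable to the one above, without the two cut steps.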

