I knew about dbacl, a simple Bayes classifier that can classify small texts. See the short tutorial. It was mainly used for filtering spam, but it can also categorize your email, like Gmail does ('Social', 'Promotions', etc).
But I wanted to classify my book library, which is in PDF/DJVU/EPUB/MOBI form but is also available as plain text; see my blog post about it.
How would I classify it? Maybe I could use some books as references and compare other books with 'referenced books'?
But I came up with a more hackish idea -- I can use Wikipedia as a source of categorized texts. Why not?
Let's get the latest Wikipedia dump.
I'm passing all the XML dumps through my tiny Python script. It only sorts Wikipedia articles by category. So if an article X has '[[Category:Y]]' and '[[Category:Z]]' at the bottom, my script appends its text to two files: Y and Z. At the end, file Y contains the text of all articles in that category.
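A minimal sketch of that sorting step, not the actual script. It assumes each article's wikitext is already extracted from the XML as a string; the function names, the regex, and the choice to turn spaces into underscores (to match names like 'Articles_containing_proofs') are all assumptions.

```python
import re
from pathlib import Path

# Matches [[Category:Name]] and [[Category:Name|sort key]].
CATEGORY_RE = re.compile(r"\[\[Category:([^\]|]+)(?:\|[^\]]*)?\]\]")


def categories_of(wikitext):
    """Return the category names tagged in one article, in order, without duplicates."""
    seen = []
    for name in CATEGORY_RE.findall(wikitext):
        name = name.strip().replace(" ", "_")
        if name and name not in seen:
            seen.append(name)
    return seen


def append_to_categories(wikitext, out_dir):
    """Append the article text to one file per category under out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    names = categories_of(wikitext)
    for name in names:
        # Category names may contain '/', which cannot appear in a file name.
        safe = name.replace("/", "_")
        with open(out_dir / safe, "a", encoding="utf-8") as f:
            f.write(wikitext)
            f.write("\n")
    return names
```

Calling `append_to_categories(article_text, "categories")` once per article is enough; the files grow as the dump is streamed.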
The resulting text files are huge. If you want to repeat my steps, you can get them: (beware - 10GB).
Then I extracted all the features from each category using dbacl.
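A rough sketch of that learning step, assuming the basic `dbacl -l <category> <file>` form from the dbacl tutorial and one text file per category. The directory layout and helper names are hypothetical, and any extra tokenizer or model options are left out.

```python
import subprocess
from pathlib import Path


def learn_command(text_file, model_dir):
    """Build the argv that learns one category from one text file."""
    text_file = Path(text_file)
    model = Path(model_dir) / text_file.name
    # "dbacl -l <category> <file>" learns from <file> and writes the category file.
    return ["dbacl", "-l", str(model), str(text_file)]


def learn_all(text_dir, model_dir, run=subprocess.run):
    """Learn one category per file in text_dir; `run` can be swapped for a dry run."""
    Path(model_dir).mkdir(parents=True, exist_ok=True)
    files = sorted(p for p in Path(text_dir).iterdir() if p.is_file())
    commands = [learn_command(p, model_dir) for p in files]
    for argv in commands:
        run(argv, check=True)
    return commands
```

Passing `run=lambda argv, check: print(argv)` shows the commands without executing anything.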
Again, if you want all the files processed by dbacl, here they are: (1GB).
Now the problem. dbacl from the default Ubuntu 20.* package supports only 256 categories, but I wanted many more. Sadly, dbacl is not actively maintained, and it wouldn't compile from source: I got an error like this.
So it's time to hack it a little. In src/config.h I removed "#define HAVE_MBRTOWC 1". I don't need Unicode anyway. With a bit of tinkering, I managed to raise the maximum number of categories to 10000: patch. But no more -- otherwise some global arrays would exceed the compiler's limits... So I removed many smaller Wikipedia categories and kept only the ~10k with the largest texts. (In other words, only the ~10k most popular categories are left.)
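Picking the ~10k categories with the largest texts could be done along these lines. This is only a sketch: it assumes one text file per category in a single directory and uses file size as the measure of "largest"; the function name is made up.

```python
from pathlib import Path


def largest_categories(text_dir, limit=10000):
    """Return the `limit` largest files in text_dir, biggest first."""
    files = [p for p in Path(text_dir).iterdir() if p.is_file()]
    # Sort by size descending; break ties by name so the result is stable.
    files.sort(key=lambda p: (-p.stat().st_size, p.name))
    return files[:limit]
```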
Patched and compiled version of dbacl for Linux.
OK, now I can pass all my library through. This script prepares a command line for dbacl by enumerating all categories.
A command line can be huge, up to 0.5MB (all ~10k categories are listed here)! Example.
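To illustrate how such a command line gets built, here is a sketch assuming the `dbacl -c <category> ... -v <file>` form from the tutorial, where `-v` prints the best-matching category. The helper names are hypothetical and this is not the script linked above.

```python
def classify_command(category_files, book_text):
    """Build a dbacl argv with one -c option per category."""
    argv = ["dbacl", "-v"]
    for category in category_files:
        argv += ["-c", str(category)]
    argv.append(str(book_text))
    return argv


def command_size(argv):
    """Approximate size in bytes: each argument plus its terminating NUL."""
    return sum(len(arg.encode("utf-8")) + 1 for arg in argv)
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems with odd category names.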
This script runs dbacl with this huge command line for a single book.
This script runs it for all books in my collection, sorting them into categories.
Unfortunately, dbacl is painfully slow (keep in mind the ~10k categories). I waited almost a week to get my library processed. 4 CPU cores were busy the whole time, and the 4 dbacl instances used ~10GB of RAM.
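A batch run with four dbacl instances in parallel might look like the sketch below. It assumes `dbacl -v` prints the winning category on stdout; the function names and the thread pool are my own choices, not the original scripts.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def classify_book(book, category_files):
    """Run dbacl on one book and return the category name it prints."""
    argv = ["dbacl", "-v"]
    for category in category_files:
        argv += ["-c", str(category)]
    argv.append(str(book))
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout.strip()


def classify_library(books, category_files, workers=4, classify=classify_book):
    """Classify all books with `workers` dbacl processes running at once."""
    books = list(books)
    # Threads are enough here: each one just waits on an external dbacl process.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        labels = pool.map(lambda book: classify(book, category_files), books)
        return dict(zip(books, labels))
```

The `classify` parameter exists only so the batch logic can be tried with a stand-in function before committing to a week-long run.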
But the results are fun!
The whole system is far from perfect, but it performs surprisingly well! Math books went to 'Articles_containing_proofs', many programming books went to 'Articles_with_example_C_code', 'Articles_with_example_code', etc.
One fun thing is that many fiction books went to a 'Wikipedia_Reference_Desk_archive' category. What is this? Ah, these are Wikipedia pages with many chats and dialogues: "The Wikipedia reference desk works like a library reference desk. Ask a question here and Wikipedia volunteers will try to answer it."
This category should probably be removed.
Then I processed many books from the Gutenberg library. Far from perfect, but not bad either!
In my opinion, though it's very slow, it's practically usable.
Further work: a manually prepared set of categories would probably work much better. Also, the script could print several categories for each text instead of just one.
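Printing several categories per text could be sketched as follows. The big assumption is the input format: a single line of alternating category names and scores, as `dbacl -n` prints in the tutorial, where a lower score means a closer match. The function name is hypothetical.

```python
def top_categories(score_line, k=3):
    """Return the k (name, score) pairs with the lowest scores."""
    tokens = score_line.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected alternating name/score tokens")
    pairs = [(tokens[i], float(tokens[i + 1])) for i in range(0, len(tokens), 2)]
    # Assumption: lower score means a better match.
    pairs.sort(key=lambda pair: pair[1])
    return pairs[:k]
```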
P.S. Also fun story: "Can a Bayesian spam filter play chess?"