Bayesian Filtering of RSS Feeds with POPFile
For a while I've been wanting to try using POPFile to filter my feeds. I've already been using POPFile for my spam filtering and getting 99.99% accuracy. POPFile ranked high on my list of solutions because it handles an arbitrary number of buckets out of the box. Most people only know Bayesian text analysis as a spam filtering technique, but it's perfectly capable of arbitrary classifications.
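To illustrate the "arbitrary number of buckets" point, here is a minimal sketch of a multi-bucket naive Bayes text classifier. It is not POPFile's code; the class name, the tokenizer, the add-one smoothing, and the three sample buckets are all assumptions made for the example.

```python
import math
import re
from collections import Counter, defaultdict


class BucketClassifier:
    """Toy naive Bayes classifier with any number of buckets (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # bucket -> word -> count
        self.doc_counts = Counter()              # bucket -> number of training texts

    @staticmethod
    def tokenize(text):
        # Lowercase and split on anything that is not a letter, digit or apostrophe.
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, bucket, text):
        self.doc_counts[bucket] += 1
        self.word_counts[bucket].update(self.tokenize(text))

    def classify(self, text):
        vocab = set()
        for counts in self.word_counts.values():
            vocab.update(counts)
        total_docs = sum(self.doc_counts.values())
        words = self.tokenize(text)

        best_bucket, best_score = None, -math.inf
        for bucket, n_docs in self.doc_counts.items():
            counts = self.word_counts[bucket]
            total_words = sum(counts.values())
            # Log prior plus log likelihood of each word, with add-one smoothing.
            score = math.log(n_docs / total_docs)
            for word in words:
                score += math.log((counts[word] + 1) / (total_words + len(vocab)))
            if score > best_score:
                best_bucket, best_score = bucket, score
        return best_bucket


clf = BucketClassifier()
clf.train("interesting", "python bayesian classifier release")
clf.train("uninteresting", "celebrity gossip fashion sale")
clf.train("later", "long essay history")  # a third bucket costs nothing extra

print(clf.classify("new bayesian classifier"))  # interesting
```

Nothing in the scoring loop cares whether there are two buckets or twenty; spam/not-spam is just the two-bucket special case.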
That said, I'm only using 2 buckets: interesting and uninteresting. Applying POPFile to RSS feed reading was much simpler for me than it would be for many others. As you may already know, I read all of my feeds as emails delivered to a dedicated IMAP account via Outlook and Thunderbird. As such, by the time I read feed items, they *are* emails and all email tools can be applied, including POPFile.
I decided to try this out when I saw a new IMAP module in POPFile. Normally POPFile works on POP3 email accounts, but this opens other avenues.
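For anyone who doesn't already get their feeds as email, the conversion step can be sketched roughly like this. This is not the tool I use; it's a minimal sketch using only the Python standard library, and the sample feed, the addresses, and the function name `items_to_messages` are made up for illustration.

```python
import xml.etree.ElementTree as ET
from email.message import EmailMessage
from email.utils import formataddr

# Hypothetical RSS 2.0 document standing in for a real feed.
SAMPLE_RSS = """<rss version="2.0"><channel><title>Example Feed</title>
<item><title>First post</title><link>http://example.com/1</link>
<description>Hello world</description></item>
<item><title>Second post</title><link>http://example.com/2</link>
<description>More text</description></item>
</channel></rss>"""


def items_to_messages(rss_text, to_addr="feeds@example.com"):
    """Turn each <item> of an RSS 2.0 document into an email message."""
    channel = ET.fromstring(rss_text).find("channel")
    feed_title = channel.findtext("title", default="(untitled feed)")
    messages = []
    for item in channel.findall("item"):
        msg = EmailMessage()
        # The feed title becomes the sender's display name (placeholder address).
        msg["From"] = formataddr((feed_title, "rss@example.com"))
        msg["To"] = to_addr
        msg["Subject"] = item.findtext("title", default="(no title)")
        body = item.findtext("description", default="")
        link = item.findtext("link", default="")
        msg.set_content(f"{body}\n\n{link}\n")
        messages.append(msg)
    return messages


messages = items_to_messages(SAMPLE_RSS)
for m in messages:
    print(m["Subject"])
```

Once items look like this, delivering them to an IMAP folder and letting any mail filter loose on them is an ordinary email problem.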
I'm also intrigued by the possibility of using the XMLRPC module to do direct RSS to POPFile communication and classification.
Anyway, after about 24 hours, it has processed 1200 RSS feed items with 70% accuracy. I hand-checked all of the classifications and retrained it on the 30% it got "wrong".
If the accuracy gets closer to 99% (which it has for my spam in the email arena), this will dramatically cut the effort needed to keep up with this many feeds, capacity I'll surely use to just track more stuff.
The current ratio suggests that about 33% of the items that come through are worth at least a quick read. That's my criterion for "interesting". If I would have opened the item and read the first couple of sentences, I consider it "interesting".
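The hand-checking step boils down to comparing the filter's guess with my own call on each item and keeping the misses for retraining. A minimal sketch of that bookkeeping, with a hypothetical `review` function and ten made-up items in place of the real 1200:

```python
def review(predicted, actual):
    """Return (accuracy, misses), where misses lists (index, correct_bucket)."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    misses = [(i, a) for i, (p, a) in enumerate(zip(predicted, actual)) if p != a]
    correct = len(predicted) - len(misses)
    return correct / len(predicted), misses


# Made-up labels: "i" = interesting, "u" = uninteresting.
predicted = ["i", "u", "u", "i", "u", "u", "i", "u", "i", "u"]
actual    = ["i", "u", "i", "i", "u", "u", "u", "u", "u", "u"]

accuracy, misses = review(predicted, actual)
print(f"accuracy: {accuracy:.0%}, items to retrain: {len(misses)}")
```

At the numbers above, 30% of 1200 items means about 360 corrections fed back in during the first day.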
One more idea I've got is to put up a public feed database and run all of those items through my personal filter. Other people could add feeds to the pile and it would spit out a digital version of my opinion. Also possible would be a Bayesian Digg.com where the stories are picked from the giant feed database based on thumbs up/thumbs down voting on stories.
There's some real potential in neural nets, statistical analysis and the like for filtering and selecting content, and I can't wait to see what comes of it.
You can see the resulting selected items, filtered by POPFile and then by me (to see anything I kept for one reason or another) at:


December 17th, 2006 at 4:50 pm
Sounds very cool, the idea of filling a weblog with the content of the feeds categorized by popfile. Any efforts done into that direction? … especially with XML-RPC?
July 25th, 2008 at 10:12 pm
This can be done easily with the site http://www.filteredrss.com