Bayesian Filtering of RSS Feeds with POPFile

May 09 2006

For a while I've been wanting to try using POPFile to filter my feeds. I've already been using POPFile to handle my spam filtering, and it's getting 99.99% accuracy. POPFile ranked high on my list of solutions because it handles an arbitrary number of buckets out of the box. Most people only know Bayesian text analysis as a spam filtering technique, but it's perfectly capable of arbitrary classification.
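To make the "arbitrary number of buckets" point concrete, here is a minimal naive Bayes sketch in Python. It is not POPFile's code or its exact algorithm, just a bare-bones illustration of the general technique; the class name, the tokenizer and the add-one smoothing are assumptions chosen for brevity.

```python
import math
import re
from collections import Counter, defaultdict


class NaiveBayes:
    """Tiny multinomial naive Bayes text classifier (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # bucket -> word -> count
        self.doc_counts = Counter()              # bucket -> training documents seen
        self.vocab = set()                       # every word seen in training

    @staticmethod
    def tokenize(text):
        # Crude tokenizer: lowercase runs of letters, digits and apostrophes.
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, bucket, text):
        words = self.tokenize(text)
        self.word_counts[bucket].update(words)
        self.doc_counts[bucket] += 1
        self.vocab.update(words)

    def classify(self, text):
        """Return the most likely bucket, or None if nothing was trained."""
        words = self.tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_bucket, best_score = None, float("-inf")
        for bucket in self.doc_counts:
            counts = self.word_counts[bucket]
            total_words = sum(counts.values())
            # Log prior plus log likelihood of each word, with add-one smoothing.
            score = math.log(self.doc_counts[bucket] / total_docs)
            for word in words:
                score += math.log((counts[word] + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_bucket, best_score = bucket, score
        return best_bucket


# Made-up training text; nothing limits this to two buckets.
nb = NaiveBayes()
nb.train("interesting", "python bayesian filter classifier released")
nb.train("interesting", "new rss parsing library for python")
nb.train("uninteresting", "celebrity gossip and fashion news")
nb.train("uninteresting", "sports scores and celebrity interviews")
print(nb.classify("bayesian rss filter"))
```

Adding a third or tenth bucket is just another `train` call with a new bucket name, which is the property that makes this approach usable for more than spam.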

That said, I'm only using 2 buckets: interesting and uninteresting. Applying POPFile to RSS feed reading was much simpler for me than it would be for many others. As you may already know, I read all of my feeds in Outlook and Thunderbird as emails delivered to a dedicated IMAP account. As such, by the time I read feed items, they *are* emails and all email tools can be applied, including POPFile.
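For readers who don't already get feeds as email, the conversion amounts to something like the sketch below. It uses only Python's standard email library; the function name, the placeholder addresses and the X-Feed-* headers are made up for illustration and aren't taken from any particular RSS-to-email tool.

```python
from email.message import EmailMessage


def feed_item_to_email(feed_title, item_title, link, summary):
    """Wrap one feed item in an email message (hypothetical helper)."""
    msg = EmailMessage()
    msg["From"] = "feeds@example.invalid"    # placeholder address
    msg["To"] = "reader@example.invalid"     # placeholder address
    msg["Subject"] = item_title
    msg["X-Feed-Title"] = feed_title         # made-up header names
    msg["X-Feed-Link"] = link
    msg.set_content(f"{summary}\n\n{link}\n")
    return msg


msg = feed_item_to_email(
    "Example Feed", "Hello", "http://example.invalid/1", "A summary."
)
print(msg["Subject"])
```

Once an item looks like this, anything that classifies mail by subject and body text can treat it like any other message.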

I decided to try this out when I saw the new IMAP module in POPFile. Normally POPFile works on POP3 email accounts, but this opens other avenues.

I'm also intrigued by the possibility of using the XMLRPC module to do direct RSS to POPFile communication and classification.

Anyway, after approximately 24 hours, it's processed 1200 RSS feed items with 70% accuracy. I hand-checked all of the classifications and re-trained it on the 30% it got "wrong".
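The hand check boils down to comparing each predicted bucket with the corrected one. Here is a small sketch of that bookkeeping in Python; the `review` function is hypothetical and the per-bucket breakdown is invented so that it adds up to 1200 items with 30% wrong, since only those totals are known.

```python
def review(results):
    """results: list of (predicted_bucket, correct_bucket) pairs from a hand check."""
    wrong = [(p, c) for p, c in results if p != c]
    accuracy = (len(results) - len(wrong)) / len(results) if results else 0.0
    return accuracy, wrong


# Invented breakdown: 840 correct and 360 corrected, 1200 items in total.
results = (
    [("interesting", "interesting")] * 300
    + [("uninteresting", "uninteresting")] * 540
    + [("interesting", "uninteresting")] * 260
    + [("uninteresting", "interesting")] * 100
)
accuracy, to_retrain = review(results)
print(f"{accuracy:.0%} correct, {len(to_retrain)} items to retrain")
# prints "70% correct, 360 items to retrain"
```

The `to_retrain` list holds the items that would be fed back to the classifier under their corrected bucket.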

If the accuracy gets closer to 99% (which it has for my spam in the email arena), it will dramatically cut the effort needed to keep up with this many feeds, freeing up capacity that I'll surely just use to track more stuff.

The current ratio suggests that about 33% of the items that come through are worth at least a quick read. That's my criterion for "interesting": if I would have opened the item and read the first couple of sentences, I consider it "interesting".

One more idea I've got is to put up a public feed database and run all of those items through my personal filter. Other people could add feeds to the pile and it would spit out a digital version of my opinion. Also possible would be a Bayesian Digg.com where the stories are picked from the giant feed database based on thumbs up/thumbs down voting on stories.

There's some real potential in neural nets, statistical analysis and related techniques for filtering and selecting content, and I can't wait to see what comes of it.

You can see the resulting selected items, filtered by POPFile and then by me (to see anything I kept for one reason or another) at:

http://www.wynia.org/saved_items.html

 

2 Responses to “Bayesian Filtering of RSS Feeds with POPFile”

  1. Pepino Says:

    Sounds very cool, the idea of filling a weblog with the content of the feeds categorized by popfile. Any efforts done into that direction? … especially with XML-RPC?

  2. web man Says:

    This can be done easily with the site http://www.filteredrss.com

© 2003-2008 J Wynia. All original content is licensed under the terms of the Creative Commons Attribution license unless otherwise noted. Content from other sources is licensed under its original terms.