I did a bit more with the example script yesterday. I wanted to see how my sampler script would handle a real challenge and to work through some of the scaling issues that would need to be dealt with. So, I fed it a big one: Scoble's 860-feed OPML file.
If you've looked at Google's daily API limit of 1,000 queries or Yahoo's 5,000 and thought "that's a lot", this little experiment will quickly dump ice water over your head. If you actually hit either API for each item in each feed in Scoble's file, you'd use up two days' worth of Yahoo's goodwill and ten days' worth of Google's.
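For concreteness, here's the quota math as a quick Python sketch. The 11,000-item figure is roughly what grabbing every item actually yielded (described further down); the daily limits are the ones above.

```python
# Back-of-the-envelope math for hitting an API once per item in every feed.
TOTAL_ITEMS = 11_000        # roughly what every item in Scoble's 860 feeds added up to
YAHOO_DAILY_LIMIT = 5_000   # Yahoo's daily query limit
GOOGLE_DAILY_LIMIT = 1_000  # Google's daily query limit

def days_of_goodwill(items, daily_limit):
    """How many days of API quota one query-per-item would burn."""
    return items / daily_limit

yahoo_days = days_of_goodwill(TOTAL_ITEMS, YAHOO_DAILY_LIMIT)    # ~2.2 days
google_days = days_of_goodwill(TOTAL_ITEMS, GOOGLE_DAILY_LIMIT)  # 11 days
print(yahoo_days, google_days)
```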
Clearly, some sort of limiting was in order; more on that in a minute. Long term, though, I don't intend to use this clumsy hit-Yahoo-for-every-item-blindly setup. A real long-term system would work in stages:
- Harvesting feeds. Based on the database of feeds from Syndic8 and from uploaded OPML files, fetch all of the content in each feed into a database. No modifications, just plain storage. This would, however, include spidering for new feeds and adding them to future harvests.
- Analyzing feeds. First and foremost, any analysis that can be done without rate-limited web-service APIs should be prioritized. Look for links between posts that are right there in the stored data. Build a web of relationships between the feeds and the posts. Do outright summarization and keyword analysis. This would NOT be a single pass.
For instance, after one pass of outright analysis, you could invoke outside keyword derivation and then use the results to train a Bayesian or other statistical model that future new posts could be fed through for "free" analysis.
- Creating views. This is the part that users would interact with. Customized HTML or PDF reports, customized RSS feeds, OPML extracts, etc. would all sit here.
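The three stages above can be sketched roughly like this, with in-memory toy data standing in for the real database and real feed fetching. Every name here is hypothetical; this is a shape-of-the-pipeline sketch, not the actual script.

```python
# Toy stand-in for harvested feed data: a couple of feeds, each with items
# that record which URLs the post links out to.
RAW_FEEDS = {
    "http://example.com/a.xml": [
        {"title": "Post A1", "link": "http://a/1", "links_to": ["http://b/1"]},
    ],
    "http://example.com/b.xml": [
        {"title": "Post B1", "link": "http://b/1", "links_to": []},
    ],
}

def harvest(feeds):
    """Stage 1: store every item verbatim -- no analysis yet."""
    return {url: list(items) for url, items in feeds.items()}

def analyze(store):
    """Stage 2: build relationships using only the stored data --
    here, incoming-link counts per post, with no outside API involved."""
    counts = {}
    for items in store.values():
        for item in items:
            for target in item["links_to"]:
                counts[target] = counts.get(target, 0) + 1
    return counts

def view(store, counts):
    """Stage 3: one possible view -- the most-linked post from each feed."""
    return {
        url: max(items, key=lambda i: counts.get(i["link"], 0))["title"]
        for url, items in store.items() if items
    }

store = harvest(RAW_FEEDS)
counts = analyze(store)
print(view(store, counts))
```

The point of the separation is that stage 1 runs blindly and cheaply, stage 2 can make as many passes as it likes over data it already has, and stage 3 is just queries against the results.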
The thing to note about the view layer is that everyone is going to have different expectations of it. The discussion of the OPML sampler experiment proved that theory to me. Lots of people wanted "real-time" views to keep them on the bleeding edge. Others saw it as a "best of" selected by the internet at large (that's how I saw it). Some wanted it to be yet another "Top 100" list. ALL of these are valid, but none of them is inherent to the way it works.
In essence, it's nothing more than one view: the most highly linked post from each feed in the Web 2.0 workgroup. The same view has now been applied to Scoble's feeds. This is not the same view as "the most highly ranked items in the OPML list": some posts in Scoble's feeds are selected with no incoming links at all, while others have tens of thousands. Those low-ranking items would never show up in the second kind of list, but here they appear as representatives of their respective feeds.
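The distinction between the two views can be made concrete with some made-up link counts (the feed and post names are hypothetical):

```python
# Hypothetical incoming-link counts, keyed by (feed, post).
link_counts = {
    ("feedA", "postA1"): 0,       # feedA's best post has zero incoming links...
    ("feedA", "postA2"): 0,
    ("feedB", "postB1"): 40_000,  # ...while feedB's posts have tens of thousands
    ("feedB", "postB2"): 35_000,
}

def most_linked_per_feed(counts):
    """One representative per feed, even if its link count is zero."""
    best = {}
    for (feed, post), n in counts.items():
        if feed not in best or n > counts[(feed, best[feed])]:
            best[feed] = post
    return best

def global_top(counts, n=2):
    """The n most-linked posts across the whole OPML list."""
    ranked = sorted(counts, key=counts.get, reverse=True)
    return ranked[:n]

print(most_linked_per_feed(link_counts))
# feedA is still represented, despite zero links
print(global_top(link_counts))
# the global list is all feedB -- feedA doesn't appear at all
```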
While the views are arguably the most interesting part (and what sparks my interest, and probably most of yours), the first two layers are where the real work needs to happen. I'm going to tackle the harvesting layer when I next put in time on this. A big pile of data is what it's going to take to make this really powerful, and, once set up, the harvester can chug away day in and day out.
I'm probably going to use a bunch of the code in Feed on Feeds, though I'm not 100% thrilled with its database structure. But it already has autodiscovery of feeds, a subscription engine, OPML export, etc. built into its libraries, making it a pretty tantalizing set of code to base this on.
At any rate, the current experiment was to make sure that, rather than fetching every item in every feed, I limit things in some way to keep the total under Yahoo's 5,000-query limit. So I ran it with the actual Yahoo portion shut off, grabbing every item. The net result was nearly 11,000 items. Only when I pulled it down to the most recent 5 items in each feed did it drop below 5,000. So I ran the final report against that subset.
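That limiting step amounts to trimming each feed to its N most recent items and walking N down until the total fits the budget. A minimal sketch, with simplified stand-in data (real items would come from the harvest database, not a literal):

```python
from datetime import date

# Stand-in data: each feed maps to (title, published-date) pairs.
feeds = {
    "feed1": [("old post", date(2005, 1, 1)), ("new post", date(2005, 3, 1)),
              ("newer post", date(2005, 3, 2))],
    "feed2": [("another", date(2005, 2, 15))],
}

def trim(feeds, per_feed):
    """Keep only the per_feed most recent items in each feed."""
    return {
        name: sorted(items, key=lambda i: i[1], reverse=True)[:per_feed]
        for name, items in feeds.items()
    }

def fit_under_budget(feeds, budget=5_000):
    """Walk the per-feed limit downward until the total item count fits."""
    per_feed = max(len(items) for items in feeds.values())
    while per_feed > 1 and sum(len(i) for i in trim(feeds, per_feed).values()) > budget:
        per_feed -= 1
    return per_feed, trim(feeds, per_feed)

limit, trimmed = fit_under_budget(feeds, budget=3)
print(limit, trimmed)
```

With the real numbers, this walk bottoms out at 5 items per feed to get under the 5,000 mark.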
The output of my first pass is available in HTML format, though the file is HUGE. I'm also dubious of the link counts and suspect something got messed up somewhere along the line. Unfortunately, given the API limits, it's going to take several days to make enough passes at it to be sure of what's going on.