Trees are Bad. Views Are Good. Leaving Hierarchical Storage Behind

Nov 10, 2005

[Editor: This is another older rough draft that I'm cleaning out. Enjoy]
My reading of Freakonomics reawakened a desire to see better data analysis of what's going on in computing. Levitt and Dubner's excitement over finding that the Chicago schools had mountains of data to dig through seemed to me to be the same glint in the eye that a gold miner has in the movies when they find just the right spot to start prospecting. They KNOW there's gold in "them thar hills" and it's just a matter of digging it out.

All of the economists and statisticians I've read recently point to the same thing: given enough data, the "gold" can be found.

Given the explosion in the amount of text, audio and video content available online, the filters that help people sort through the piles and piles of junk are going to be of critical importance as we go forward. Anyone who remembers being able not only to read every site Yahoo listed, but also to keep up with new ones as they were added, can attest to what happens as new content is added ever faster. The point of no return for human data processing is reached fairly quickly and there's just no way to keep up. It's not like you could just read for an extra hour per day and make a dent. There is so much new content being created that no human or reasonably sized group of organized humans can keep up with it.

Yet, we keep trying to use our old models to tame the data mountain. Just look at most of the new sites for keeping track of blogs and podcasts. Almost without exception, their primary organizational method is by category. Each entry "belongs" in one or a few categories. When site owners complain that the categories don't adequately describe their content, subcategories are created and sites are allowed to be listed in multiple categories. Yet, the people creating content complain that even those categories aren't "right" and they feel that their content really doesn't fit any of the available descriptions.

However, the problem isn't the taxonomy. It's that hierarchical trees are an inherently unscalable method of presenting large amounts of information. Books, for a LONG time before internet publishing showed up, typically presented non-fiction content as a tree (table of contents), the content itself, and an index. Research studies have shown that, other than a quick glance through to see if the book is topically what they expected, people completely ignore the table of contents when looking for information and jump right to the index. This is because a tree presents one and only one organization of the information. Well-written indexes present multiple views.

Indexes are only the tip of the "virtual view" iceberg, though. No matter how well written, an index still has to pick and choose terms and how they'll be shown. In a cookbook for instance, cornbread will often be listed under "Corn, bread" (next to "Corn, on the cob") and also under "Bread, corn" near "Bread, banana". However, if space were unlimited, you might also see "Chili, accompanying side dishes, cornbread", "Southern cooking, cornbread", etc. In other words, the number of paths that lead to cornbread grows pretty quickly and depends on context.
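
The cornbread example can be sketched as an inverted index: each recipe lists every heading it could be filed under, and the index is just that mapping turned around. This is a minimal Python sketch, not how any real cookbook is produced. The recipes, the headings, and the helpers `build_index` and `paths_to` are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical recipes, each with every heading it could plausibly be filed under.
RECIPES = {
    "cornbread": ["Corn, bread", "Bread, corn", "Southern cooking", "Chili, side dishes"],
    "banana bread": ["Bread, banana", "Banana, bread"],
    "corn on the cob": ["Corn, on the cob", "Southern cooking"],
}


def build_index(recipes):
    """Invert recipe -> headings into heading -> sorted list of recipe names."""
    index = defaultdict(list)
    for name, headings in recipes.items():
        for heading in headings:
            index[heading].append(name)
    # Sort headings alphabetically, the way a printed index would be.
    return {heading: sorted(names) for heading, names in sorted(index.items())}


def paths_to(index, recipe):
    """Return every heading in the index that leads to the given recipe."""
    return [heading for heading, names in index.items() if recipe in names]


index = build_index(RECIPES)
for heading in paths_to(index, "cornbread"):
    print(heading)
```

On paper, every extra heading costs a line of index. Here it costs one more string in a list, so the number of paths to cornbread can grow with context instead of being rationed by page count.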

In paper books, this always has to be balanced by the size of the book in relation to the index. After all, you don't want 200 pages of recipes to require 900 pages of index. People just won't buy that book.

The problem that doesn't go away is how indexes are typically written. The best ones (the ones you return to on your bookshelf over and over again) are actually written by a dying breed of book geek/writer who does nothing but write indexes. The really good ones seem to have insights into the content that are just uncanny. However, even the best of them can only write so fast. As a result, hiring a good indexer isn't cheap.

Contrast that with a table of contents. Merely aggregate the headlines, indicate parentage between the elements and stick it in the front of the book. Basically, trees are easy to build, have a nice, ordered appearance and actually work at small scales. When you're trying to organize a 10-minute speech, a tree representation can help you get organized easily. When you're trying to organize the textual content of 100,000 speeches given over 200 years, it doesn't help.
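
To show just how mechanical a table of contents is, here is a rough Python sketch that builds one from nothing but a flat list of headlines and their nesting levels. The headings and the function names `build_toc` and `render` are hypothetical, invented for this example.

```python
def build_toc(headings):
    """Turn a flat list of (level, title) pairs into a nested tree.

    Level 1 is a top-level heading. Each heading is attached to the
    nearest earlier heading with a smaller level, so it gets one parent.
    """
    root = {"title": None, "children": []}
    stack = [(0, root)]  # (level, node) pairs for the currently open branch
    for level, title in headings:
        node = {"title": title, "children": []}
        # Close any branches that are as deep as or deeper than this heading.
        while stack[-1][0] >= level:
            stack.pop()
        stack[-1][1]["children"].append(node)
        stack.append((level, node))
    return root["children"]


def render(nodes, depth=0):
    """Flatten the tree back into indented lines for printing."""
    lines = []
    for node in nodes:
        lines.append("  " * depth + node["title"])
        lines.extend(render(node["children"], depth + 1))
    return lines


# Hypothetical cookbook headings.
HEADINGS = [
    (1, "Breads"),
    (2, "Cornbread"),
    (2, "Banana bread"),
    (1, "Sides"),
    (2, "Corn on the cob"),
]

toc = build_toc(HEADINGS)
print("\n".join(render(toc)))
```

No insight into the content is needed, which is why it's cheap. It is also why cornbread can only ever live under "Breads" here: the structure has no way to say it also belongs with the sides.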

When this problem moved to the internet in the late 1990s, the old paradigms followed, with a few exceptions. Sites like Yahoo and its open source clone, DMOZ.org, have become some of the biggest publicly available tree structures. One only has to look at both of those sites to see that attempts to classify that much data with a tree structure pretty much fail to scale. Yahoo has tried to solve the scaling problem by putting up a $200+ barrier to entry, essentially freezing the directory to its contents from a few years ago. DMOZ is non-profit and tries to solve it with community editing. However, take a look at just how many categories either don't have an editor or have one while 2/3 of the links in the category are dead.

The trend has continued with the constant reliance on "Top XX" lists as the filter for large pools of content. This falls into the "headism" trap that Chris Anderson talks about as part of the long tail discussions. It's a natural thing to do. We want to seek out the best. Unfortunately, when we see the list of the best and it doesn't reflect our internal list, we're disappointed. While we should know better from experience, we still somehow hope that *this* list will accurately reflect our tastes a little better.

I'm not saying that either community or commerce is inherently flawed as a solution. Rather, tree structures just don't scale to include the literally millions of sites created on an ongoing basis. Of course, I'm not anywhere near the first to figure this out or even point it out. Search engines like Google, and AltaVista before it, came about to help make sense of the big pile of data that the internet has become. Google has practically made the search box the interface to the web.

Yet, search still isn't making it work 100%. Seeing that maybe technology has failed, lots of folks have turned to pure community to help sort out the content. Sites like delicious and Blinklist have their users entering keywords for URLs, etc. And, while I think metadata (I guess we're calling it "tagging" among the cool kids these days) is definitely part of the eventual solution, I think making people do the bulk of the work is a mistake.

What I think people are really after is a combination of things, and no one solution will take care of them all. They want an ongoing stream of content that they find interesting and that adapts to their changing interests. They want to be able to find information and answers to specific questions easily. And, they want to participate in communities.

The Wikipedia project serves as another example where, in the first steps, trees may look very promising. You start out making categories like "Arts and Humanities", "Science and Technology", etc. Then, you put up an article on using fractals as art. Where does it go? Fortunately, Wikipedia understands that (whether consciously or not) and uses search as its primary mechanism with a keyword-like index of topics. That's because, when you try to organize *that much* information, the tree structure just won't scale.

However, if you take the existing Wikipedia data pile, the search mechanism, the keywords, etc., you could very easily build individual trees for specific purposes that would be exceptionally useful.
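
As a rough sketch of what "individual trees for specific purposes" could look like, the Python below builds a small two-level tree on demand from a flat pile of tagged articles. Nothing here reflects how Wikipedia actually stores or exposes its data. The article titles, the tags, and the `view` function are assumptions made up for the example.

```python
# Hypothetical flat pile of articles, each with a set of keywords (tags).
ARTICLES = {
    "Fractal art": {"art", "mathematics"},
    "Mandelbrot set": {"mathematics"},
    "Impressionism": {"art", "history"},
    "Chaos theory": {"mathematics", "physics"},
}


def view(articles, root_tag, group_by):
    """Build a two-level tree: root_tag -> each tag in group_by -> titles.

    Only articles carrying root_tag are considered. Articles that match
    none of the group_by tags are left out of this particular view.
    """
    selected = {title: tags for title, tags in articles.items() if root_tag in tags}
    branches = {}
    for tag in group_by:
        titles = sorted(title for title, tags in selected.items() if tag in tags)
        if titles:
            branches[tag] = titles
    return {root_tag: branches}


# Two different trees built from the same pile, for two different readers.
math_view = view(ARTICLES, "mathematics", ["art", "physics"])
art_view = view(ARTICLES, "art", ["mathematics", "history"])
print(math_view)
print(art_view)
```

The fractal art article shows up in both trees without anyone having to decide where it "belongs". The tree becomes a throwaway view over the pile rather than the way the pile is stored.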

 

Comments on this post


One Response to “Trees are Bad. Views Are Good. Leaving Hierarchical Storage Behind”

  1. James Says:

    Interesting thoughts. One of the chief difficulties I have found is manageability. Essentially you are leaving items to be self-categorizing. You do have some control over what the criteria for categorization are; however, it is not "type safe" the way the tree model is. (For instance, I can imagine lots of scenarios where we have data that is simply lost, because it doesn't define itself well enough.) It seems that the key balance here is between manageability and scalability. Cheers, keep writing.
    James


© 2003-2008 J Wynia. All original content is licensed under the terms of the Creative Commons Attribution license unless otherwise noted. Content from other sources is licensed under its original terms.