So, building upon my last post, let me tell you about the second of the three talks that so impressed me last Saturday. Andrew Tomkins, whom I had the pleasure of working with while at IBM’s Almaden Research Center, spoke after Alon Halevy. He gave us some interesting stats on information (text) creation on the web, and quickly concluded that a reasonable fraction of this information can be indexed for about 12.5K dollars, and hence is within the grasp of any small company. But, interestingly, cost is only a small part of the picture. The know-how on indexing is quite rare. Why? Because before you can get to the interesting aspects of information retrieval, all the crap on the web has to be filtered out. And that is an unbearably high burden on folks trying to do research in this area. So universities have intellect to contribute, but no clean data to learn from. In some ways, it is yet another manifestation of the dialog that I talked about, and that James Governor talked about too.
Andrew suggested that there might be a compromise: perhaps query logs can be shipped to the academics so that they can then contribute. But who wants another AOL fiasco? Aha, but can’t we obfuscate, encrypt and then ship? Andrew showed in a paper at WWW2007 how any anonymization technique is highly vulnerable on the web, either to techniques that are well known (such as those described very well in Simon Singh’s “The Code Book”) or to people doing vanity queries, which allow one to guess certain words.
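To make the flavor of that weakness concrete, here is a toy sketch of my own (not the method from Andrew’s paper) of the classic frequency-analysis idea that Singh’s book describes. It assumes a query log was “anonymized” by replacing every word with a salted hash token, and that the attacker holds some public text with similar word frequencies. The function names, the salt, and the tiny logs are all made up for illustration.

```python
import hashlib
from collections import Counter


def anonymize(queries, salt="secret"):
    """Replace every word with a salted hash token (same word -> same token)."""
    def token(word):
        return hashlib.sha256((salt + word).encode()).hexdigest()[:8]
    return [[token(w) for w in q.split()] for q in queries]


def rank_by_frequency(items):
    """Return the distinct items ordered from most to least frequent."""
    return [item for item, _ in Counter(items).most_common()]


def guess_mapping(anon_queries, reference_queries):
    """Pair the k-th most frequent token with the k-th most frequent public word."""
    anon_ranked = rank_by_frequency(t for q in anon_queries for t in q)
    ref_ranked = rank_by_frequency(w for q in reference_queries for w in q.split())
    return dict(zip(anon_ranked, ref_ranked))


# Hypothetical private log that a search company might ship after hashing.
private_log = [
    "weather",
    "weather boston",
    "weather boston hotels",
    "weather boston hotels cheap",
]

# Hypothetical public data with a similar (not identical) word distribution.
public_log = [
    "weather",
    "weather",
    "weather boston",
    "boston weather hotels",
    "weather boston hotels cheap",
]

anon_log = anonymize(private_log)
guess = guess_mapping(anon_log, public_log)

# The attacker never sees the salt, yet can map tokens back to words.
decoded = [[guess[t] for t in q] for q in anon_log]
print(decoded)
```

Real logs are far messier than this, and rank matching alone only works reliably for the most frequent words, but that is where vanity queries and other side knowledge come in: each word you can pin down helps you guess its neighbors.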
So the problem of research that is based on real data still remains. And consequently, the network effects of data will accrue to the few who control it. Let us see if over time the logjam can be broken.
My third installment, on work by Shiv Vaithyanathan, will come soon.