Internet Buzz with Richard Wiggins
Volume 1, Number 28 | July 24, 1998
Alexa and Netscape Smart Browsing: An Interview with Brewster Kahle
How does Alexa deal with the scale of the Web?
We see 40 million distinct content areas within a few months -- places worth referring to separately based on traffic patterns and link structures. Not every home page gets included in this list. We see no limit to Alexa's ability to scale while still offering users meaningful information on the relatively low-volume sites they visit.
We can scale because the Web made everyone into a publisher, which forces all the users of the Net to become editors. We want to exploit the collective wisdom of the users of the Net, so we can observe as people vote with mouse clicks.
Brewster, I recall a story you once told about a demo you were giving at the Exploratorium in San Francisco. You related how a small boy came up and said, "I want to ask the Internet a question," which you thought was at once naïve and also very profound. How close are we to being able to "ask the Internet a question"?
We're still trying to solve that problem -- helping people find the answer to hard questions. The idea of understanding natural language is a mirage.
Helping a user find the right 10 pages out of 300 million using the right 3 keywords is a mind-boggling notion. Search engines want you to ask questions that you know how to formulate well using keywords.
We don't take keywords for a question; we take the sites you visit -- your URLs are your search terms. We're trying to be your smart assistant.
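To make the idea of "your URLs are your search terms" concrete, here is a minimal sketch of one way related sites could be suggested from browsing sessions, by counting how often two sites turn up in the same session. It is an illustration only, not a description of how Alexa actually works; the session data, the function names, and the choice of simple co-visit counts are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from itertools import combinations
from urllib.parse import urlsplit


def site_of(url: str) -> str:
    """Reduce a URL to its lowercase host name (a stand-in for a 'site')."""
    return (urlsplit(url).hostname or "").lower()


def build_covisits(sessions):
    """Count how often two sites appear in the same browsing session."""
    covisits = defaultdict(Counter)
    for urls in sessions:
        # Each site counts once per session, however many pages were viewed.
        sites = sorted({site_of(u) for u in urls} - {""})
        for a, b in combinations(sites, 2):
            covisits[a][b] += 1
            covisits[b][a] += 1
    return covisits


def related_sites(covisits, url, top_n=3):
    """Rank sites by how often they were visited alongside the URL's site."""
    counts = covisits[site_of(url)]
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [site for site, _ in ranked[:top_n]]


# Hypothetical sessions, invented for illustration.
SESSIONS = [
    ["http://news.example/a", "http://weather.example/", "http://sports.example/x"],
    ["http://news.example/b", "http://weather.example/today"],
    ["http://news.example/c", "http://sports.example/y", "http://weather.example/"],
    ["http://recipes.example/", "http://weather.example/"],
]

COVISITS = build_covisits(SESSIONS)
print(related_sites(COVISITS, "http://news.example/front"))
# prints ['weather.example', 'sports.example']
```

The user never types a keyword here: the page being viewed is the query, and other people's clicks supply the ranking.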
The questions that take 20 minutes to answer are still a problem. I did the WAIS thing and am glad to see that others are trying to do it, but we're still a long way away. When we set out to do WAIS, we tried to help answer the hard questions -- should I go to grad school, what about the problems with my boss? If we can have machines help with those, then we will have created something truly great.
Where'd you get the name Alexa?
Alexa is short for "Library of Alexandria" -- that was the last time someone tried to collect it all. The Internet Archive is the physical collection of the Web -- we periodically take a snapshot of the entire Web. Alexa is the library the public sees -- the catalog that makes it useful. The corporate theme of Alexa is to donate a copy of everything it collects for long-term care and feeding in the Archive. We think it's too important for the early days of our digital history to be owned by a single person or corporation.
So Alexa is a form of digital library?
We're trying to build something as complex as the Library of Congress, but useful to each person on the Net. We see 20 million Web content areas, which is as many items as there are books in the Library of Congress. So we're now at the complexity of the biggest library created by humankind. That level of complexity is not usually exposed to tens of millions of people! We're trying to do something gigantic here, and Alexa is an attempt to help people manage that level of complexity.
The Web doesn't yet hold the amount of data in all the books in the Library of Congress, but it does offer the same level of complexity. Mike Lesk, who works in digital libraries for the National Science Foundation and is a really wise person, notes that for every person entering a physical library today, twenty people use a search engine.
We don't have all the text of the Library of Congress, but we do have the same level of complexity. People are using the Web to answer questions that they used to use the library for. Already the impact of the Web on the info economy is on the order of all libraries in the world. We count 10 to 15 million authors on the Web -- and those aren't just couch potatoes. We want to make the diversity of the Web manageable. We don't want people to retreat to 10 familiar brand names.
It seems the Archive could be important for a lot of reasons, especially now that we have so many cases of the news media making mistakes in reporting. When those mistakes are made on Web sites, the Archive can be a neutral third party place to find what the Dallas Morning News said in a particular on-line edition.
So far the Archive is not our most important feature in terms of usage levels, but we do see it playing a very important role. A professor was examining CNN's reporting of the Tailwind story, and discovered that CNN took it down from their site. I'm not sure if we have their original story with the flawed reporting or not. Time magazine went and annotated their original, correcting the errors.
The professor's view was that we should require people to keep their stuff up as they originally posted it. I'm afraid this view is counter to reality; anyone in .com is going to behave in their own interests, and not leave damaging information online.
But there is a role for other institutions to build archives. This is why we have libraries. The Internet Archive is the only place to find the 1996 campaign Web sites. In the year 2000, when the Web is the newspaper of record, an archive will be essential.
What sites do people vote against?
We haven't really looked at that question for individual sites very much. Usually, if they vote, it's positive -- 6 to 1 they vote "for" a site. I noticed the other day that www.onetravel.com seems to get negative votes; I'm not sure why.
I worry about search engines that make ranking and presentation decisions based on advertising arrangements, not what's most relevant to the user's search. What is your policy on editorial versus advertising copy?
Under Alexa, if it's graphical, it's paid for -- otherwise it's our best guess as to what you're looking for -- it's like the Yellow Pages -- you get a listing for free, but you pay for a banner. You can pay for a banner, but not for position.
The Netscape version has no banner ads. Delineation of ads on the Netscape site hasn't been made clear. Netscape is free to change position based on their relationships; Alexa can make changes based on editorial decisions. Editorial integrity is an important issue, and becomes more vital if the links are all generally related -- why did one competitor rank more highly than another? This is something we need to work on.
So tell me, Brewster, when I requested this interview, did you read my past columns, or did you use Alexa to find out about my column and WebReference?
I did both. I read some of your past columns. I really enjoyed the column about Abilene; I'm a big fan of Internet2. I also liked your interviews with Vint Cerf and Hal Varian.
Alexa told me that WebReference is a popular site, and the Related Links list included your home page, so I clicked on it. I have to tell you that the Rich Wiggins home page is the most boring personal page I've ever seen!
What challenges face Alexa and your Netscape plugin?
Managing this thing is non-trivial. It's hard to tease out a separate logical site on a multi-site host. We have to deal with different IPs for the same site, multiple sites on one IP address, etc. It's hard to work out the layout of the Web; the Net is just not very well organized.
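The host-versus-address problem described above can be illustrated with a minimal sketch. It assumes a list of (URL, IP address) observations, which is hypothetical data invented for the example, and builds two indexes to show why an IP address alone is a poor key for a "site". The function and variable names are assumptions, and nothing here is drawn from Alexa's actual software.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical (URL, IP address) observations; the addresses come from
# ranges reserved for documentation.
OBSERVATIONS = [
    ("http://www.bigsite.example/index.html", "192.0.2.10"),
    ("http://www.bigsite.example/news.html", "192.0.2.11"),
    ("http://alice.example/", "198.51.100.5"),
    ("http://bob.example/", "198.51.100.5"),
]


def index_hosts(observations):
    """Build host -> set of IPs and IP -> set of hosts from the observations."""
    ips_by_host = defaultdict(set)
    hosts_by_ip = defaultdict(set)
    for url, ip in observations:
        host = (urlsplit(url).hostname or "").lower()
        if host:
            ips_by_host[host].add(ip)
            hosts_by_ip[ip].add(host)
    return ips_by_host, hosts_by_ip


IPS_BY_HOST, HOSTS_BY_IP = index_hosts(OBSERVATIONS)

# One site served from several addresses.
multi_ip_hosts = sorted(h for h, ips in IPS_BY_HOST.items() if len(ips) > 1)
# Several sites sharing one address.
shared_ips = sorted(ip for ip, hosts in HOSTS_BY_IP.items() if len(hosts) > 1)

print(multi_ip_hosts)  # prints ['www.bigsite.example']
print(shared_ips)      # prints ['198.51.100.5']
```

Keying on the host name handles both of these cases, but it does not address the harder one mentioned above: several logical sites living under a single host name, which would need path or link-structure information to separate.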
One of the wonderful things about this flat world that Tim [Berners-Lee, the inventor of the Web] gave us is this: the URL is king. XML might give us one URL but multiple pieces of content under that URL. We need repeatable, dependable URLs -- we need people to stand behind their URLs!
We tag each page with a date because otherwise a new edition can cause problems.
Recently we filled our Archive database -- the piece that ties the Archive to the toolbar was full. This is a database of 12 terabytes, which presents some interesting management problems. Microsoft says they have the largest database on Earth with their TerraServer, but I question that claim.
We need to be sure that the laws don't make institutions such as the Internet Archive illegal. Libraries were explicitly made possible. With the rise of Hollywood, archiving for non-print materials became more problematic. There is no archive of television -- there are spotty things, such as the Museum of Broadcasting. Now that television is going away, people understand that our cultural artifacts are in electronic (and increasingly digital) form. So I'm very worried about pending copyright legislation.
There's a tradeoff as old as Plato or Jefferson -- society versus individual -- a constant tug -- but we have to have a concept of civic spaces on the Web. Tim Berners-Lee is the statesman, our best advocate of this perspective. If we don't succeed, we recede back into proprietary protocols -- it took a decade to get out of the tyranny of SNA and DECnet into TCP/IP. We can't go back into a proprietary zone.
Finally, we have to make sure we have business models that work. Where we screwed up in building the Internet was failing to put in a business model early enough. I was at Thinking Machines, Tim Berners-Lee was at CERN, Gopher guys were at a university. We've come such a long way in a short time: we've built a large part of Memex -- or Xanadu. (I hope Ted Nelson gets some joy out of all this.) But we didn't know how to build businesses or a business model. We've all learned a lot.