Looking for Metadata in All the Wrong Places: Why a controlled vocabulary or thesaurus is in your future | 2
Why Andy Is Destined to Fail, or, at least, Stumble Around a Bit
There are CVs and thesauri that have already been created from which you can learn and borrow (with appropriate permissions, of course). Here are just a few good starting points for finding CVs, thesauri, and other classification schemes:
But why am I so pessimistic about Andy's chances? Because it'll be nearly impossible to find just the right CV or thesaurus that will meet all his needs.
The problem is two-fold:
First, there are a zillion subject domains out there. If you're looking to create the world's leading e-commerce site for fly-fishing enthusiasts, it's unlikely that you'll find a decent thesaurus or CV on the topic, if you find one at all. So you'll have to create one of your own. And that's not an easy task. After all, if it were, there'd be a lot more thesauri and CVs available. My company offers a full-day seminar on the topic (http://argus-acia.com/acia_event/seminar_roadshow.html), which covers only the tip of the iceberg.
Second, just because a CV or thesaurus on your topic exists, it isn't necessarily appropriate for your needs. Your users, content, and context will invariably be different from those for which the CV or thesaurus was originally created.
Let's say that Andy actually finds a thesaurus of Internet and technology-related terms. But as he takes a closer look, he might find that the terms are very technical (e.g., "input device"), when his site's users might be more likely to use laypersons' terms (e.g., "keyboard"). He might learn that the thesaurus' terms were built for a very small collection of documents, and are therefore not sufficiently numerous or specific for his large body of content. Or he might discover that the thesaurus was designed for a slightly different context, such as to support automatically expanding users' search queries, which might make it harder for Andy to use the thesaurus to build his site's browsing taxonomy.
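To make the "automatically expanding users' search queries" context concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the three-entry THESAURUS, its "variants" and "broader" fields, and the function names are hypothetical, not taken from any real thesaurus or search product.

```python
# Hypothetical mini-thesaurus: each preferred term lists its variant
# (non-preferred) terms and, optionally, one broader term.
THESAURUS = {
    "keyboard": {"variants": ["keypad", "keyboards"], "broader": "input device"},
    "mouse": {"variants": ["mice", "trackball"], "broader": "input device"},
    "input device": {"variants": ["input hardware"], "broader": None},
}


def build_variant_index(thesaurus):
    """Map every term (preferred or variant) to its preferred term."""
    index = {}
    for preferred, entry in thesaurus.items():
        index[preferred] = preferred
        for variant in entry["variants"]:
            index[variant] = preferred
    return index


def expand_query(query, thesaurus, include_broader=False):
    """Return the set of terms to search for, given one user query."""
    index = build_variant_index(thesaurus)
    normalized = query.strip().lower()
    preferred = index.get(normalized)
    if preferred is None:
        # Term is not in the thesaurus: search for it unchanged.
        return {normalized}
    entry = thesaurus[preferred]
    terms = {preferred, *entry["variants"]}
    if include_broader and entry["broader"]:
        terms.add(entry["broader"])
    return terms


if __name__ == "__main__":
    print(expand_query("Mice", THESAURUS))
    print(expand_query("keyboard", THESAURUS, include_broader=True))
```

Notice that a structure like this is tuned for matching words, not for browsing: it says nothing about how terms should be grouped or labeled on a page, which is why a thesaurus built for query expansion may not translate directly into a browsing taxonomy.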
What Andy Can Do Now
So, there are no silver bullets. But there is hope! Andy has some options:
- Borrow an existing CV or thesaurus. This is a decent option, if there is an appropriate one out there, and if permissions don't get in the way. As Andy customizes a CV or thesaurus, it's not clear when it will become his own as opposed to remaining the property of the original author. This is a sticky intellectual property issue to be aware and beware of.
- Create a CV or thesaurus from scratch. This is a large and expensive task, and he'd better plan on a few months' work. And remember, it's really, really hard to get it "right" all at once: because users, content and context change all the time, CVs and thesauri should as well.
- Grow a CV or thesaurus over time. This might be the most practical approach. Andy should take a little time and learn about what's involved in developing a CV or thesaurus, and then begin an iterative process of starting with something small and basic, expanding and improving it over time. Better to crawl before walking than never to walk at all.
Whatever approach Andy takes, it's crucial to explore the three issues of:
- Users - Who will use it?
- Content - What will it be used to describe?
- Context - Where and how will it be used?
He'll need to analyze the language users employ when they search or browse; one good technique is to cluster search log results. He'll want to do some similar analysis of WebReference's content; a content inventory is a good place to start. Examining any existing indexing would also be a good idea. And he'd better make sure he understands his context; in this case, how much time and budget are available, and how the workflow of applying terms and maintaining the CV or thesaurus over time will work.
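As a rough illustration of clustering search log results, the Python sketch below groups raw queries that differ only in case, punctuation, word order, or a trailing "s". It assumes the log has already been reduced to a list of query strings; the grouping rule and the names normalize and cluster_queries are hypothetical, and a real analysis would likely need proper stemming and human review of the clusters.

```python
import re
from collections import Counter, defaultdict


def normalize(query):
    """Lowercase, strip punctuation, crudely drop a trailing 's', sort words."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    # Crude plural stripping (an assumption, not real stemming).
    stems = [w[:-1] if len(w) > 3 and w.endswith("s") else w for w in words]
    return " ".join(sorted(stems))


def cluster_queries(queries):
    """Group queries by normalized form, largest cluster first.

    Returns a list of (normalized_key, Counter of raw query strings).
    """
    clusters = defaultdict(Counter)
    for query in queries:
        clusters[normalize(query)][query.strip()] += 1
    return sorted(clusters.items(), key=lambda item: -sum(item[1].values()))


if __name__ == "__main__":
    # Hypothetical sample of logged queries.
    log = ["style sheets", "Style Sheet", "sheets, style", "javascript", "JavaScript "]
    for key, variants in cluster_queries(log):
        print(key, dict(variants))
```

Each cluster shows the different ways users phrase the same idea, which makes it a natural source of candidate preferred terms and variant terms for the CV or thesaurus.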
A great big challenge? Sure. But using CVs and thesauri is a wonderful way to improve the way your site works for users and maintainers alike. Not to mention a great big opportunity, at least for us consultants....
Unfortunately there isn't much out there for the layperson who wants to learn about CV and thesaurus design. If you are willing to scratch the surface, you can't beat "Thesaurus Construction and Use: A Practical Manual" (3rd Edition) by Jean Aitchison, Alan Gilchrist, and David Bawden, as a general introduction. And though it has nothing to do with creating CVs and thesauri, Bill Bryson's highly entertaining "The Mother Tongue: English & How It Got That Way" is great food for thinking about language and meaning; it's wonderful prep for delving into the guts of thesauri and CVs.
You'll find more reading materials available from our seminar's reading list (http://argus-acia.com/seminars/seminar_resources.html).
# # #
About the author: Louis Rosenfeld is president of Argus Associates (http://argus-inc.com), a consulting firm that specializes in providing information architecture services to Fortune 500 clients. He is the author of "Information Architecture for the World Wide Web" and can be reached at: (firstname.lastname@example.org).
I read with interest your comments in the WEBREFERENCE UPDATE NEWSLETTER of February 1 about the use of thesauri in coding Web pages. As someone who has been in the information business for 30 years, I found it interesting to see how technology really doesn't change the need for good information practices.
Over the years, as the cost of indexing with a controlled vocabulary increased, managers in the information business looked to computers and improved search engines to rid them of the need for a controlled vocabulary. Unfortunately, the vast amounts of information now available have overwhelmed search engines' abilities. Although, as you stated, there will never be one vocabulary that we all use, just getting web sites to consider controlled vocabulary use will make them think about how their information is to be found. On our web site, http://www.ntis.gov, we use both controlled vocabulary and subject category coding not only on our 500,000 database records but also on our individually coded web pages.
Organizations like ours, which are in the business of abstracting and indexing large collections (more than 2,000,000 records of government publications), have used thesauri for years. A free reference catalog that NTIS offers for its own database refers to a number of scientific and technical thesauri: http://www.ntis.gov/pdf/dbguid.pdf.
As I mentioned, we also use controlled subject categories to assist in the classification of web or database records. Subject categories are simply a broader approach to a controlled vocabulary. Whereas controlled vocabulary terms can be difficult to enter consistently, a broader approach using subject categories makes it easier to get consistency in subject assignment if more than one person is involved in the indexing process.
Enjoyed your thoughts
Ed Lehmann ELehmann@ntis.gov 010202
Created: February 1, 2001
Revised: February 2, 2001