May 05, 2004

Lucene and documentation

I have been wrestling with ht://dig for the last few days. Couldn't get it working. That's another story. But I thought it might be good to try Lucene, since I heard that it was a search engine, and it's an Apache project. So, I went off to the Lucene web site, and tried to figure out what would be involved in installing it. Is it a thingy that I can put directly on my Apache web server? Do I need Tomcat, or one of those inscrutable Jakarta thingies? Is it stand-alone in some sense?

The "getting started guide" says that you have to have Tomcat or some other "container", whatever a container is. It also talks about such things as Template Web Applications, and has the following delightful sentence:

One would hopefully use MVC architecture such as provided by Jakarta Struts and taglibs, or better yet XML with stylesheets.

Now, I'm perfectly confident that that makes perfect sense to a significant number of people, but I'm finding this to be utterly inscrutable. It looks like I'm going to have to go get Tomcat running, and possibly learn about MVC, Struts, and taglibs, whatever those are, before I can even start to install Lucene. And forget using it as an example in my Apache class.

So, then I went over to the wiki. A number of Apache projects have moved to Wikis for their documentation, because, we are repeatedly told, it is better in every way, allows people to edit the documentation easily, and contribute without barriers, and so, obviously, will produce better documentation.

I searched for "Installation" in the wiki. Apart from the irony of a web site about a search engine using a different search engine for indexing, I found the results intriguing. All the pages that were returned by this search were about installing something called MoinMoin, which is apparently the Wiki software itself. None of the search results had anything to do with Lucene. I don't know if this means that the documentation is terrible, or that the search feature of the Wiki is terrible, but neither one of these was particularly comforting.

A couple months ago, someone on a mailing list I'm on criticized Apache (The ASF in general) for the terrible documentation that is the hallmark of our software. I took great offense, and went to some length to defend the Apache Web Server documentation, which is, in my opinion, among the best of any free software product, and better than most commercial products.

I might be compelled to retract my statements. The Apache Web Server does indeed have great documentation, but each time I look at some of the other products, I'm appalled by how little information is conveyed by the docs. For many of the projects, it's almost impossible to even find out what the product does, let alone how to install or use it. The documentation, when there is any, is directed to the developers, not to the users. The developers don't need to be told basic things like what the product does, where to get the dozens of prerequisites, and what all the jargon means. But I don't really think that documentation *should* be directed at the developers. The folks that want to use the product are utterly in the dark, and, as often as not, throw their hands up in disgust and go look for something else. Something that's not as good, but which is easier to figure out.

People don't like feeling stupid. Documentation that makes people feel stupid leads to people choosing a different product. Documentation that uses words like "simple", "basic", "easy" and so forth makes people feel stupid when it's not easy for them. Documentation that assumes vast bodies of existing knowledge without even a nod towards somewhere to go to learn more about it, or even suggesting that you might need some existing knowledge, makes people feel even more stupid.

I'm reminded of the Geronimo demo at ApacheCon, where there wasn't even a mention of what they were demonstrating - we were all supposed to know. This made me feel uninformed, and, yes, stupid. This is how the Lucene docs make me feel. There's an underlying feeling that I'm missing most of what is being said because of a great gap in my knowledge, but there's no suggestion as to where I can go to fill that gap.

For the record, Zope makes me feel the same way.

Posted by rbowen at May 5, 2004 04:04 PM

In general, I would agree. I do, however, think that there is a call for docs aimed at developers rather than end users...particularly early in the life of a software project. I do, of course, agree that projects need good docs for users, particularly as they get to "release quality" (whatever that may mean). Ultimately, I think most software projects need docs for both developers and users.

Oh, and I do think a partial retraction of your defense of ASF docs is in order. Yes, httpd's docs are good (great, in fact), but as you're seeing, many of the docs for other projects are sad, sad, sad. I share the experience of looking through many Apache projects and not even being able to figure out what the project does. "Annoying" doesn't describe it.


Posted by: Jeff McAdams on May 5, 2004 07:49 PM

It varies by project. I too found Lucene a bit difficult to follow. There were some articles scattered around that helped clear things up, but you certainly shouldn't be required to read external articles to understand any particular project. I believe there is a book coming out at some stage.

Posted by: Glen Stampoultzis on May 6, 2004 01:06 AM

Lucene is, at its core, a Java library (API) for indexing and search that doesn't have any other dependencies at all. There are extensions that may require containers, but it all depends on what you want to do with it. In any case, the community mailing list is very active and helpful.


Posted by: Scott Ganyo on May 6, 2004 09:36 AM

Well, I was hoping to set up a simple search engine on a simple (all static) existing web site. That was all. I will get on the mailing list and see how newbie questions are received. I really don't mean to gripe, and would be glad to contribute to a documentation effort in any way that I can.

Posted by: DrBacchus on May 6, 2004 11:28 AM

What is wrong with reading Lucene articles?
(lame comment entry - doesn't allow links)

Posted by: Otis on May 6, 2004 12:15 PM

Here's another aspect of Open Source that annoys the non-initiate. When they complain about how hard it is to find useful information for beginners, they get told that they are not working hard enough, and that if it's hard, it's because they are lazy (or stupid) and not because the docs are terrible.

Posted by: Someone on May 6, 2004 12:32 PM

Your criticism is like complaining about a garage that sells car engines when you were really looking for a hire car. There's a market and need for both, with typically different clients.

Lucene only sets out to be a library - not a server or end-to-end search solution. Having limited boundaries for open-source projects is desirable and makes it a lot easier to combine different libraries into custom solutions.

If you're looking for a server you can run "out of the box" - commercial solutions such as SearchBlox package the Lucene library with document parsers, admin consoles and documentation.

Posted by: Mark Harwood on May 6, 2004 06:47 PM

Actually, no, it's more like a garage that sells engines but doesn't have a clear sign that tells me what it sells. Thanks for clarifying what Lucene is, and what it is not.

I have this same complaint with a significant number of the newer Apache projects - I have absolutely no idea what they are, and there's no clear statement anywhere as to what they are. And so I waste a great deal of time trying to get things to work which end up not being what I thought they were.

I do indeed appreciate you stating what it is. I would appreciate even more a clear statement on the website saying what it is.

A while back I thought it might be a good idea to go through all the Apache projects and try to state what each one is. Unfortunately, I never found the tuits. I'll put that back on my ToDo list.

Posted by: DrBacchus on May 6, 2004 07:50 PM

First sentence on Lucene homepage.
"Jakarta Lucene is a high-performance, full-featured text search engine library written entirely in Java."

I don't know what else you expect it to be. I think it's very clear. It's a LIBRARY, not a complete software. Although it comes with some command line interfaces, it is intended to be a LIBRARY.

Posted by: holyman on May 7, 2004 01:57 AM


I'm going to make the (probably incorrect) assumption that you're asking a real question, that you genuinely want to know why people don't understand what it says, and that the Lucene people care about making it understandable.

What else would I expect? I would expect that sensible people would not develop a library in a vacuum. That sensible people would develop an implementation, however small and limited, which would exercise the API of a library that they were building. And that, at the very least, the web site would point to such implementations.

I would further expect that a good percentage of visitors to the site would overlook the word "library" in that sentence, and see "high-performance, full-featured text search engine." A large part of writing good documentation is to understand how beginners see your sentences, and how they will understand them.

What I see very consistently is that people inside the community know what it does, and those outside it don't. That's the mark of documentation that is written with a very limited audience in mind. Yes, developer-centric documentation is important for small, immature projects, but as those projects mature, it's important to aim it more towards people stumbling on the project for the first time.

Of course, if the question was intended merely to be argumentative and belligerent, you probably didn't read this far at all. But, I read your message loud and clear: negative feedback not welcomed, go elsewhere for your needs. This is a shame, coming from a project within the ASF, but I suppose the character of the ASF is inexorably changing, and I just need to get used to it.

Posted by: drbacchus on May 7, 2004 06:12 AM

funny coincidence: an old friend asked me yesterday about lucene (since he knows I do java and do java @ the ASF, he assumed I'd at least know something...which I really don't). This guy was also looking for a ht://dig replacement, which lucene is not.

There's several projects with this kind of problem. It all has to do with audience...this friend of mine was not the target audience for lucene, nor are you. Many "middleware" or backend projects have this kind of problem. Geronimo is indeed another good example. It isn't directed towards "new" users at all right now. Rather, the initial target audience is people who have several years of experience with server-side java programming.

You see this a lot, not just in open source, and we all experience it from time to time: you're interested in something, but there's "prerequisite reading" before you'll be able to grok all of it.

You can't go to and learn what a database is. It's assumed you know what a database is, and why you need one, before you ever visit the site. The microsoft windows homepage doesn't even hint at the fact that windows is an operating system. The page, to a casual non-windows-aware person, makes windows sound like some kind of security software (!).

The obvious thing to do is for the authors of documentation and software to describe their "target audience", and "prerequisite reading".

The HTTPD documentation is great if you're the target audience (namely, someone who wants to set up a webserver). Not if you want to learn what a webserver is, let alone what a server is (isn't a server a hardware device that makes a lot of noise?). Not if you want to learn how to write a webserver.

HTTPD omits the "target audience" and "prerequisite reading" materials, as do windows, oracle and countless other projects.

We could all learn a little from how the OSAF is managing chandler publicity.

Posted by: Leo Simons on May 7, 2004 07:05 AM

Rich - I've been keenly following this entry and the comments since you posted it. As a huge fan of Lucene, an author on a couple of articles on Lucene, a frequent speaker on the topic of Lucene, and finally a co-author of the upcoming Lucene in Action book (Manning) I want to see Lucene shown in the best possible light. Your comment is taken to heart and there is no doubt that all of us developing open source software could do a better job at documenting and promoting our projects. Lucene is no exception, and its web site could definitely use some clarifications on what it really is.

I often "spelunk" into various open source Java projects myself and find the documentation and explanations unclear. I'm so immersed in the Java world, though, that it feels natural to me to simply skip the documentation and go right to some example usage and then further into the source code itself when I need more clues.

While I do concur that Lucene's web site could use more (and I will do what I can to enhance it in the next couple of months), like another poster said about the first sentence, I think it provides a pretty clear and succinct description of Lucene. For a web crawler built on top of Lucene, give Nutch a try. It has recently been modified to facilitate intranet indexing/searching more easily.

If you have any questions on Lucene, I'd be more than happy to hear from you via e-mail.


Posted by: Erik Hatcher on May 9, 2004 09:16 AM