Nikita the Spider: a bulk validation and link checking tool

As a site grows to a large number of pages with content created by several (or numerous) people, it often becomes more difficult and time-consuming to make sure that every t is crossed and every i is dotted. Links break. Our often less-than-perfect content management systems let validation errors creep in. And so on.

So when I found out about Nikita the Spider, a new validation and link checking service, I thought I’d give it a try on this site. The results show that despite my efforts, this site contains a bunch of validation errors, most of them caused by unescaped ampersands and non-SGML characters in comments. Ouch. The report also pointed out many URLs on this site are (too) long.

Yeah I know, I should make shorter URLs. Yes, I need to adjust my Movable Type templates to make sure any ampersands in comments get escaped properly. And I know that I should be using UTF-8, not ISO-8859-1.
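
To illustrate the ampersand problem, here is a minimal Python sketch of escaping bare ampersands in comment text without double-escaping references that are already valid. It is not the Movable Type fix itself, and the name `escape_bare_ampersands` is made up for this example.

```python
import re

# Matches an ampersand that does NOT already start a character reference
# such as &amp;, &#38; or &#x26;.
BARE_AMPERSAND = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)


def escape_bare_ampersands(text: str) -> str:
    """Replace unescaped ampersands with &amp;, leaving existing references alone."""
    return BARE_AMPERSAND.sub("&amp;", text)


if __name__ == "__main__":
    print(escape_bare_ampersands("Tom & Jerry"))
    print(escape_bare_ampersands("page.html?a=1&b=2"))
    print(escape_bare_ampersands("already &amp; escaped"))
```

The negative lookahead keeps the function safe to run more than once on the same text, which matters if a template might process a comment repeatedly.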

The point is that Nikita the Spider is a very useful tool worth giving a spin. Try it, and you are likely to find out things you did not know about your site. (via The Elementary Group Standards)

Posted on October 18, 2006 in Quicklinks, Web Standards

Comments

  1. Sorry Roger but I don’t like this one. Seems a bit too complex. I have to supply a ‘Seed URL’, a domain, specify “WWW” domain equivalence and an email address? I consider myself to be a tech/web-savvy person but I don’t know what half of that means, and I’m not willing enough to find out.

    I’d rather put my URL into the [WDG HTML validator](http://www.htmlhelp.com/tools/validator/) and check ‘Validate entire site’; simple enough to understand for a web standards noob and it does the job. It’s limited to 100 pages but it won’t cost you any money, unlike what Nikita the Spider will eventually do.

    Speaking of complex, what’s with this Markdown syntax?:)

  2. The report also pointed out many URLs on this site are (too) long.

    I’ve never really understood what the issue is with long URLs. If they’re meaningful and structured what’s the problem? And what’s “too long” anyway - are there any technical limits you’re breaking on this site, or is it an arbitrary measure?

  3. Thanks for this Roger.

    I don’t post often because your articles are always very comprehensive, and other people comment better.

    But I am in the mood to express my appreciation of your efforts.

  4. October 20, 2006 by Adrian Bengtson

    Seems like your post generated some traffic. From the Nikita site:

    October 19th - Thanks to a nice mention on 456 Berea Street, Nikita received twice as many crawl requests in the past twenty-four hours as in the past five months, and she was already averaging 1000 pages validated daily. She now has a week or two of work backlogged, so I’ve disabled submitting new crawl requests. Sorry about that!

    :)

  5. October 21, 2006 by Roger Johansson (Author comment)

    Rosano: To each his own I guess :-). About Markdown - it’s there to keep comments reasonably valid. The alternative would be to completely disable markup in comments.

    Dan: HTTP doesn’t specify a maximum length for URLs, but there are limits to the length applications will handle. For example, many email clients will put linebreaks in URLs longer than 72 characters and fail to reassemble them when the user clicks on them. I believe Outlook does this. Nikita checks for URLs longer than 72 characters.

    Kev: Thanks!

    Adrian: :-D

  6. To Rosano: Hi, I’m the developer of Nikita the Spider, and I thank you for visiting the site. Sorry you found it confusing. Making a simple interface to a complex tool isn’t easy and obviously I haven’t entirely succeeded. ;) That’s part of the reason that Nikita is still in alpha test.

    The “WWW equivalence” option that you mention confuses a lot of people and will go away as soon as I can figure out how to replace its function in code. The email address is optional and I only ask for it so I can notify you when your crawl is done or if something goes wrong. In fact, most users can just type in a seed URL (e.g. their home page) and click “Start”. I have to fill out more fields to make this blog posting than to start a crawl with Nikita so I’m not sure if I can simplify Nikita much more but I’ll try.

    To Dan Champion: Roger hit the nail right on the head — long URLs (long = more than 72 characters) break in many email programs. (I’m not sure whether the sending program, receiving program or some combination is responsible so I’m reluctant to point fingers.)

    I think long URLs reduce the number of visitors to one’s site because visitors who want to send an email to a friend or colleague that says, “Hey, look at this cool Web page I found” either have to use a service like tinyurl.com (which is very useful but comes with its own set of inconveniences) or take the chance that the recipient will soon be emailing them back because the link didn’t work. Either way, it’s a hassle. I know I sometimes can’t be bothered with it.

    OK, back to work I go…

  7. Per an email 11/12/2006, Nikita is available to the public again. There are some new features and changes based on the feedback from the visitors from the posting on this blog.
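
The 72-character URL check discussed in the comments above is easy to approximate. The Python sketch below is only an assumption about how such a check could look, not Nikita's actual code; `find_long_urls` and `MAX_URL_LENGTH` are hypothetical names, and the URLs are placeholders.

```python
MAX_URL_LENGTH = 72  # threshold mentioned in the comments above


def find_long_urls(urls, limit=MAX_URL_LENGTH):
    """Return the URLs that are longer than `limit` characters."""
    return [url for url in urls if len(url) > limit]


if __name__ == "__main__":
    # Placeholder URLs for illustration only.
    candidates = [
        "http://example.com/short",
        "http://example.com/" + "a-very-long-and-descriptive-path-segment/" * 3,
    ]
    for url in find_long_urls(candidates):
        print(f"{len(url)} characters: {url}")
```

A URL of exactly 72 characters passes; only longer ones are reported, matching the "longer than 72 characters" wording in the comments.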
