The question has come up more than a few times: what is up with all the dead links?
Well, I went so far as to add an FAQ entry, but let me go into more detail on what is up, and more importantly, what is being done.
The data in the database is about 10 years old. I actually think of that as a bit of an achievement on my part (especially after all of the problems I’ve had over the years), but that is also a problem. Sites move, projects die, owners move on, things change. Because diysearch does not spider the web, the only way to keep links current is to put trust in each individual link owner to keep the data current. I know this hasn’t been terribly easy in the past, but that has all changed now. Keeping your data current is as easy as logging in.
So, that’s now, what about then? The old data is going to be pruned. I’ve written a script that will validate each and every URL in the database (nearly 20,000 as of this writing). The validation is two-tiered. The first tier just checks that the URL is a valid URL (following the proper syntax). The second tier is a bit more complicated: the indexing engine tries to connect to the target URL and then interrogates the HTTP return code. If it’s a good return code, the URL is marked “approved” in the database and is included in the index. If the HTTP return code is a bad one (e.g. 404 or 501), the URL is flagged as “not approved.”
The URL is not deleted from the database. It is simply flagged and is not included in the index. The owner can come back at any time and make corrections; when the index job runs again, assuming the owner fixed the mistake, the URL will be marked as “approved.”
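For the curious, the two-tier check described above might look roughly like this in Python. This is only a sketch, not the actual diysearch script: the function names are made up for illustration, and treating any 2xx or 3xx code as “good” is an assumption (the only codes named above are 404 and 501, both bad).

```python
from urllib.parse import urlsplit
import urllib.request
import urllib.error


def is_well_formed(url):
    """Tier 1: syntax check only, no network access."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return False
    # Assumption: only http/https URLs with a host part count as valid.
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def fetch_status(url, timeout=10):
    """Tier 2 helper: return the HTTP status code, or None if we cannot connect."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error code such as 404 or 501.
        return err.code
    except (urllib.error.URLError, OSError, ValueError):
        # DNS failure, refused connection, timeout, and so on.
        return None


def validate(url, fetch=fetch_status):
    """Return "approved" or "not approved"; the URL itself is never deleted."""
    if not is_well_formed(url):
        return "not approved"
    status = fetch(url)
    # Assumption: 2xx and 3xx are "good"; everything else is "bad".
    if status is not None and 200 <= status < 400:
        return "approved"
    return "not approved"
```

The `fetch` argument is there so the status lookup can be swapped out, for example to try the logic without touching the network. Because a failed URL is only flagged, running `validate` again after the owner fixes the link is enough to flip it back to “approved.”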
Now, I mentioned spidering. For those who don’t know, “spidering” refers to implementing a “bot” or a “spider” that automatically “crawls” through a web site, following links and collecting data on what it finds. I generally do not like this method of cataloging. It is fraught with problems, as you can imagine. It works for Google, because their mission is quite different from diysearch’s.
Having said that, I am working on a plan to implement a limited form of spidering for diysearch subscribers (I’ll be getting into that later), where a diysearch spider will crawl only within a single domain (provided by the subscriber) so as to automate the addition of other links and resources within the subscriber’s site. Yes, I know what a pain it can be to enter a URL for each issue of your zine if you have broken up your site by issue. Well, if you become a subscriber, the spider could do that for you.
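To give a rough idea of what “crawl only within a single domain” could mean in practice, here is a small Python sketch. It is purely illustrative, since the real plan is still being worked out: the names `LinkCollector` and `same_domain_links` are hypothetical, and it only shows the filtering step for one page’s HTML, not a full crawler.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_domain_links(page_url, html, allowed_host):
    """Return absolute links found in html that stay on allowed_host."""
    collector = LinkCollector()
    collector.feed(html)
    found = []
    for href in collector.hrefs:
        # Resolve relative links like "issue2.html" against the page URL.
        absolute = urljoin(page_url, href)
        parts = urlsplit(absolute)
        on_site = parts.scheme in ("http", "https") and parts.hostname == allowed_host
        if on_site and absolute not in found:
            found.append(absolute)
    return found
```

Given a zine index page with links to each issue plus a few links to other sites, only the issue pages on the subscriber’s own host would come back, and those are the ones a spider like this would add.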
I’ll be talking more about subscribing later. These ideas are still being hashed out.
But as for the dead links, something is being done right now. So hang in there. The site will have a much more relevant set of data in the coming week or so.
This entry was posted on Monday, May 22nd, 2006, at 12:05 pm, and was filed in site news.