November 13, 2006

Search Engine-Friendly URLs

Search Engine-Friendly URLs

By Chris Beasley
August 10th 2001

On today’s Internet, database driven or dynamic sites are very popular. Unfortunately the easiest way to pass information between your pages is with a query string. In case you don’t know what a query string is, it's a string of information tacked onto the end of a URL after a question mark.

So, what’s the problem with that? Well, most search engines (with a few exceptions - namely Google) will not index any pages that have a question mark or other character (like an ampersand or equals sign) in the URL. So all of those popular dynamic sites out there aren’t being indexed - and what good is a site if no one can find it?

The solution? Search engine friendly URLs. There are a few popular ways to pass information to your pages without the use of a query string, so that search engines will still index those individual pages. I'll cover 3 of these techniques in this article. All 3 work in PHP [1] with Apache [2] on Linux [3] (and while they may work in other scenarios, I cannot confirm that they do).

Method 1: PATH_INFO

Implementation:

If you look above this article on the address bar, you’ll see a URL like this: http://www.webmasterbase.com/article.php/999/12. SitePoint actually uses the PATH_INFO method to create their dynamic pages.

Apache has a "look back" feature that scans backwards down the URL if it doesn’t find what it's looking for. In this case there is no directory or file called "12", so it looks for "999". But it find that there's not a directory or file called "999" either, so Apache continues to look down the URL and sees "article.php". This file does exist, so Apache calls up that script. Apache also has a global variable called $PATH_INFO that is created on every HTTP request. What this variable contains is the script that's being called, and everything to the right of that information in the URL. So in the example we've been using, $PATH_INFO will contain article.php/999/12.

So, you wonder, how do I query my database using article.php/999/12? First you have to split this into variables you can use. And you can do that using PHP’s explode function:

$var_array = explode("/",$PATH_INFO);

Once you do that, you’ll have the following information:

$var_array[0] = "article.php"

$var_array[1] = 999

$var_array[2] = 12

So you can rename $var_array[1] as $article and $var_array[2] as $page_num and query your database.

Drawback:

There was previously one major drawback to this method. Google, and perhaps other search engines, would not index pages set up in this manner, as they interpreted the URL as being malformed. I contacted a Software Developer at Google and made them aware of the problem and I am happy to announce that it is now fixed.

There is the potential that other search engines may ignore pages set up in this manner. While I don't know of any, I can't be certain that none do. If you do decide to use this method, be sure to monitor your server logs for spiders to ensure that your site is being indexed as it should.

Method 2: .htaccess Error Pages

Implementation:

The second method involves using the .htaccess file. If you're new to it, .htaccess is a file used to administer Apache access options for whichever directory you place it in. The server administrator has a better method of doing this using his or her configuration files, but since most of us don't own our own server, we don't have control over what the server administrator does. Now, the server admin can configure what users can do with their .htaccess file so this approach may not work on your particular server, however in most cases it will. If it doesn't, you should contact your server administrator.

This method takes advantage of .htaccess’ ability to do error handling. In the .htaccess file in whichever directory you wish to apply this method to, simply insert the following line:

ErrorDocument 404 /processor.php

Now make a script called processor.php and put it in that same directory, and you're done! Lets say you have the following URL: http://www.domain.com/directory/999/12/. And again in this example "999" and "12" do not exist, however, as you don't specify a script anywhere in the directory path, Apache will create a 404 error. Instead of sending a generic 404 header back to the browser, Apache sees the ErrorDocument command in the .htaccess file and calls up processor.php.

Now, in the first example we used the $PATH_INFO variable, but that won’t work this time. Instead we need to use the $REQUEST_URI variable, which contains everything in the URL after the domain. So in this case, it contains: /directory/999/12/.

The first thing you need to do in processor.php is send a new HTTP header. Remember, Apache thought this was a 404 error, so it wants to tell the browser that it couldn’t find a page.

So, put the following line in your processor.php:

header("HTTP/1.1 200 OK");

At this time I need to point out an important fact. In the first example you could specify what script processed your URL. In this example all URLs must be processed by the same script, processor.php, which makes things a little different. Instead of creating different URLs based on what you want to do, such as article.php/999/12 or printarticle.php/999/12 you only have 1 script that must do both.

So you must decide what to do based on the information processor.php receives - more specifically, by counting how many parameters are passed. For instance on my site [4], I use this method to generate my pages: I know that if there's just one parameter, such as http://www.online-literature.com/shakespeare/, that I need to load an author information page; if there are 2 parameters, such as http://www.online-literature.com/shakespeare/hamlet/, I know that I need to load a book information page; and finally if there are 3 parameters, such as http://www.online-literature.com/shakespeare/hamlet/3/, I know I need to load a chapter viewing page. Alternatively, you can simply use the first parameter to indicate the type of page to display, and then process the remaining parameters based on that.

There are 2 ways you can accomplish this task of counting parameters. First you need to use PHP’s explode function to divide up the $REQUEST_URI variable. So if $REQUEST_URI = /shakespeare/hamlet/3/:

$var_array = explode("/",$REQUEST_URI);

Now note that, because of the positioning of the /’ there are actually 5 elements in this array [5]. The first element, element 0, is blank, because it contains the information before the first /. The fifth element, element 4, is also blank, because it contains the information after the last /.

So now we need to count the elements in our $var_array. PHP has two functions that let us do this. We can use the sizeof() function as in this example:

$num = sizeof($var_array); // 5

You’ll notice that the sizeof() function counts every item in the array regardless of whether it's empty. The other function is count(), which is an alias for the sizeof() function.

Some search engines, like AOL, will automatically remove the trailing / from your URL, and this can cause problems if you’re using these functions to count your array. For instance http://www.online-literature.com/shakespeare/hamlet/ becomes http://www.online-literature.com/shakespeare/hamlet, and as there are 3 total elements in that array our processor.php would load an author page instead of a book page.

The solution is to create a function that will count only the elements in an array that actually hold data. This will allow you to leave off the ending / or allow any links from AOL'’s search engine to point to the proper place. An example of such a function is:

function count_all($arg)
{
// skip if argument is empty
if ($arg) {
// not an array, return 1 (base case)
if(!is_array($arg))
return 1;
// else call recursively for all elements $arg
foreach($arg as $key => $val)
$count += count_all($val);
return $count;
}
}

To get your count, access the function like this:

$num = count_all($url_array);

Once you know how many parameters you need, you can define them like this:

$author=$var_array[1];
$book=$var_array[2];
$chapter=$var_array[3];

Then you can use includes to call up the appropriate script, which will query your database and set up your page. Also if you get a result you’re not expecting, you can simply create your own error page for display to the browser.

Drawback:

The drawback of this method is that every page that's hit is seen by Apache as an error. Thus every hit creates another entry in your server error logs, which effectively destroys their usefulness. So if you use this method you sacrifice your error logs.

Method 3: The ForceType Directive

Implementation:

You'll recall that the thing that trips up Google, and maybe even other search engines, when using the PATH_INFO method is the period in the middle of the URL. So what if there was a way to use that method without the period? Guess what? There is! Its achieved using Apache'’s ForceType directive.

The ForceType directive allows you to override any default MIME types you have set up. Usually it may be used to parse an HTML [6] page as PHP or something similar, but in this case we will use it to parse a file with no extension as PHP.

So instead of using article.php, as we did in method 1, rename that file to just "article". You will then be able to access it like this: http://www.domain.com/article/999/12/, utilizing Apache's look back feature and PATH_INFO variable as described in method 1. But now, Apache doesn’t know to that "article" needs to be parsed as php. To tell it that, you must add the following to your .htaccess file.


ForceType application/x-httpd-php

This is known as a "container". Instead of applying directives to all files, Apache allows you to limit them by filename, location, or directory. You need to create a container as above and place the directives inside it. In this case we use a file container, we identify “article” as the file we're concerned with, and then we list the directives we want applied to this file before closing off the container.

By placing the directive inside the container, we tell Apache to parse "article" as a PHP script even though it has no file extension. This allows us to get rid of the period in the URL that causes the problems, and yet still use the PATH_INFO method to manage our site.

Drawback:

The only drawback to this method as compared with method 2 is that your URLs will be slightly longer. For instance, if I were to use this method on my site, I'd have to use URLs like this: http://www.online-literature.com/ol/homer/odyssey/ instead of http://www.online-literature.com/homer/odyssey/. However if you had a site like SitePoint and used this method it wouldn't be such a problem, as the URL (http://www.SitePoint.com/article/755/12/) would make more sense.

Conclusion

I have outlined 3 methods of making search engine friendly URLs - along with their drawbacks. Obviously, you should evaluate these drawbacks before deciding which method to implement. And if you have any questions about the implementation of these techniques, they are oft-discussed topics on the SitePoint Forums [7] so just stop in and make a post.

November 3, 2006

Use Wget to find site broken links

wget --recursive -nd -nv --delete-after --domains=domain.com http://domain.com/ | tee wget.out 2>&1
Got it from: http://www.shallowsky.com/blog/tech/web/webstats.html