What Is A Robots.txt File?
Ok, so you want to know what a robots.txt file is and how to use it? I am sure by now many of you have heard about robots.txt files, but what exactly is one, what does it do, and, most importantly, how will it benefit you? The answers to these questions and more are in this article.
The Simple Explanation
A robots.txt is a very small file that resides on your server and gives specific instructions to webcrawlers, such as the famous Googlebot, about which directories and files they are allowed to crawl and index. It is important to use a robots.txt file because it puts you (the publisher/webmaster) in control of where compliant webcrawlers are allowed to visit on your website. A more detailed explanation can be found on Wikipedia, at www.robotstxt.org, and on Matt Cutts’s blog.
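To make the idea concrete, here is a minimal sketch of how a crawler might read such instructions. It uses Python's standard urllib.robotparser module. The rules, the /private/ directory and the example.com address are made up for illustration, and real crawlers such as Googlebot use their own parsers, so treat this as an approximation of their behavior.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only: every crawler ("*") is asked
# to stay out of a made-up /private/ directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A URL inside the disallowed directory is off limits.
print(parser.can_fetch("Googlebot", "https://example.com/private/notes.html"))  # False

# Anything not matched by a Disallow line stays crawlable.
print(parser.can_fetch("Googlebot", "https://example.com/about/"))  # True
```

Each Disallow line is a path prefix, so a single line covers a whole directory and everything beneath it.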
How Is It Useful?
Using a robots.txt file is useful (especially for WordPress users) because it gives you the ability to say where webcrawlers are allowed or NOT allowed to visit. By drawing this map for webcrawlers you do two things:
- You help the webcrawler by not allowing it to index crazy and stupid things such as files in your wp-includes directory.
- You help your important content get indexed the way you want it to and prevent duplicate content from getting indexed, for example by disallowing your archive, category or tag sections.
Every blog is unique, and every author, yourself included, places a different importance on the various sections within a blog. It is up to you to decide which sections of your blog, and of your server, you want to grant or deny webcrawlers access to. I have some blogs where I disallow all access except for the homepage and the individual post pages. On other blogs, I don’t care and put total faith in webcrawlers to crawl everything and to index and rank things according to their importance.
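As a rough sketch of what such a map could look like for a WordPress blog, the rules below keep crawlers out of wp-admin and wp-includes and out of the category and tag listings. The /category/ and /tag/ prefixes assume the default WordPress URL bases, and the sample post path is hypothetical, so adjust both to your own permalink setup. Date-based archives are left out on purpose, since on many blogs their URLs share a prefix with the posts themselves. The check again uses Python's standard urllib.robotparser, which only approximates how a real crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Example rules, not a recommendation for every blog. The section paths
# assume WordPress's default "category" and "tag" URL bases.
WORDPRESS_RULES = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /category/
Disallow: /tag/
"""


def blocked_paths(robots_txt, paths, agent="Googlebot"):
    """Return the paths that the given rules ask the crawler not to visit."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [path for path in paths if not parser.can_fetch(agent, path)]


# Hypothetical URLs on an imaginary blog.
sample_paths = [
    "/",                                  # homepage
    "/what-is-a-robots-txt-file/",        # an individual post
    "/wp-includes/js/jquery/jquery.js",   # WordPress internals
    "/category/seo/",                     # category listing
    "/tag/wordpress/",                    # tag listing
]

print(blocked_paths(WORDPRESS_RULES, sample_paths))
# ['/wp-includes/js/jquery/jquery.js', '/category/seo/', '/tag/wordpress/']
```

The homepage and the individual post stay crawlable, while the internals and the listing pages that repeat post content are excluded.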
How Do You Implement It?
If you are a WordPress blogger, your blog already has a robots.txt built in. By default, the file is set to allow access to all directories on your server. Overriding the default is easy. Simply create a text file using NotePad, WordPad, TextPage, etc., name it robots.txt, and upload it to the root directory of your site. From there, visit this page to learn about how you can quickly start adding instructions to it.
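If you would rather script this step than use a text editor, a minimal sketch follows. The write_robots_txt helper is a hypothetical name of my own, and the default rules shown are one common way to write "allow everything" (an empty Disallow line), not necessarily what WordPress itself generates. The demo writes into a temporary folder; on a real site the file has to end up in the root directory so that it is served at /robots.txt.

```python
import tempfile
from pathlib import Path

# An empty Disallow line means nothing is off limits to any crawler.
ALLOW_ALL = "User-agent: *\nDisallow:\n"


def write_robots_txt(site_root: Path, rules: str = ALLOW_ALL) -> Path:
    """Write the rules to <site_root>/robots.txt and return the file's path.

    Hypothetical helper for illustration; site_root stands in for the
    folder that your web server serves as the top level of your site.
    """
    target = site_root / "robots.txt"
    target.write_text(rules, encoding="utf-8")
    return target


if __name__ == "__main__":
    # Use a throwaway folder so the demo never touches a real site.
    with tempfile.TemporaryDirectory() as tmp:
        path = write_robots_txt(Path(tmp))
        print(path.name)
        print(path.read_text(encoding="utf-8"))
```

The file name must be exactly robots.txt, in lowercase, because that is the address crawlers request.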
What Sections Will You Allow or Disallow?
Now that you know about the robots.txt file, which sections of your WordPress blog are you going to allow or disallow? If you already use a robots.txt file, I want to invite you to share your input by dropping a comment. I’d like to know what has been most effective for your blog, and I look forward to reading your comments. If you have any questions, feel free to ask; I’ll be standing by. I am sure that many of my readers will be happy to pitch in and help address your questions too.



So far I haven’t used robots.txt on any of my blogs. I know that I have to create one to prevent indexing of the wp* directories, but for the content itself, I have so far relied on using excerpts for tag, category and archives to prevent the same content from showing up in too many places.
If I then switch to using excerpts on the front page as well, I might not need to spend too much time on a robots.txt?
These are my thoughts on robots.txt at the moment. (Being a newbie in blogging, they are of course subject to change, if somebody makes a convincing argument for changing it).
Frank H M
10 Jan 08 at 5:26 am
You can view my Robots.txt file here.
It is actually a pretty standard one, though I disallowed a few files that are unique to my site that the search engines didn’t need to see like redirected affiliate links and such.
One of my New Year’s resolutions was to go through and update it, which hasn’t happened yet.
Kyle Eslick
10 Jan 08 at 8:52 am
Just keep in mind that anyone can read your robots.txt file. If you are disallowing directories, that entices others to attempt to access them to figure out what you don’t want to share. =) It’s a common practice, which is why I recommend that, if you use subdirectories to do development work, you have a separate testing server.
Jason L
10 Jan 08 at 3:40 pm
I’m glad you mentioned it because I need to update mine. I use it to disallow archives, trackback links, images, wp-admin etc. etc.
nofollow is somewhat effective, but not all SE’s acknowledge it.
@ Frank – It’s not just about duplicate content. It’s also to consolidate good pages in order to get the spiders deeper into your blog and to direct the flow of authority to those money pages.
Josh Spaulding
10 Jan 08 at 3:48 pm
I’ve done a fair bit of work on my robots file. I spent a good few months trying to sort it out: I was trying to exclude stuff, but because of the order I placed it in the file, it wasn’t being picked up. Once I sorted that out, though, it’s worked quite well so far.
My ultimate aim is to try to keep content shown on the individual pages only, and to index other pages of interest.
I’ve had a vbulletin forum for years now that my friends and I have mainly used, however I’ve now decided to drop the forum at my next renewal time in March since it doesn’t get enough use for the money it costs per year – with that in mind I’ve now added the forum pages to the robots file in the hope that once I do remove it, Google doesn’t penalise the rest of the site when the hundreds of forum pages suddenly disappear from the site!
Overall, in the last few months I’ve gone from around 1900 indexed pages (with lots of blog and forum duplications) down to around 1000 pages – this should continue to drop now all forum pages are being removed.
How big a difference it makes, I can’t say for sure, but I’ve now got to the point where keeping my Google indexed pages is a project and a challenge in itself!
Zath
12 Jan 08 at 12:13 pm
Thanks for the comments and feedback so far, everyone; this is great stuff. Kyle, thanks for showing us your file. Jason L… you have a very valid point.
@ Zath,
Can you contact me by email and let me know where this forum is located? I’d like to check it out and see if I can help you out a little bit before you shut it down.
Garry Conn
12 Jan 08 at 9:06 pm
[...] Conn offers his thoughts on having a Robots.txt. It took me awhile to get the hang of them, but I’ve enjoyed a lot of success with a [...]
Technology Talk - 01/13/2008
13 Jan 08 at 4:01 am
[...] does not suck. Also Garry is another blogger who provides workable, informative posts such as this guide to the robots.txt file which I really need to absorb and apply myself. Finally, Garry invites his readers to ask him [...]
Caroline’s Favourite Blogs & Links #9 | Caroline Middlebrook
4 Feb 08 at 9:52 am