Robots.txt, An Online Marketer’s Friend Or Foe?

robots.txt is possibly the most misunderstood file that a website can contain.

Many people think that by using a robots.txt file on their website they are protecting pages and folders from thieves and hackers. In fact, it can do totally the opposite! robots.txt is publicly readable, so listing sensitive pages in it opens up a security hole: it hands hackers and thieves a map of exactly the parts of your website that you don’t want them to find.

What is robots.txt?

robots.txt is a file that you create and upload to your website’s root directory. Search engine spiders use it to determine which parts of your website they should index and which folders or pages you, the website owner, don’t want listed in search engine indexes.

Why would you not want pages indexed?

There are many reasons why you might not want search engines to index pages on your website, such as private membership pages or exclusive training pages.

If you are an Internet Marketer selling your own ebook or other digital product, you wouldn’t want your thank you pages indexed either!

And this is where the misunderstanding comes to the fore, and robots.txt becomes your foe.

Many online marketers who provide ebooks or other digital products for instant download will list their download thank you pages in the robots.txt file because they obviously don’t want those pages indexed in search engines.

By using robots.txt this way though, you will be opening up your product to anyone who has a slight bit of knowledge about how the file works.

robots.txt is easily readable by any human who opens a browser and types in http://www.yourdomain.com/robots.txt. If you have listed your thank you pages, all they have to do is go to those URLs and take your product(s)!
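To see how little effort this takes, here is a minimal Python sketch that lists every path a robots.txt file asks crawlers to avoid. The sample file contents and the function name `listed_paths` are made up for illustration; a real visitor would simply read the file in a browser.

```python
def listed_paths(robots_txt):
    """Return every non-empty Disallow path found in robots.txt text."""
    paths = []
    for line in robots_txt.splitlines():
        # Drop anything after "#" (a comment) and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


# Hypothetical robots.txt that tries to "hide" download pages.
sample = """\
User-agent: *
Disallow: /thank-you/ebook.html
Disallow: /members/
"""

print(listed_paths(sample))
```

Every path the owner wanted hidden comes straight back out of the file, which is exactly the problem.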

It’s that easy!

And I’m living proof that this works, as this is exactly what happened to me. I had listed my thank you pages in robots.txt and thought that they were safe from hackers and thieves, then one day I was checking my web site stats and BAM, someone had been to every single thank you page, and taken everything.

The moral is, don’t list any URL in robots.txt that you don’t want humans to have free access to. Use robots.txt with great caution and secure your thank you pages using dedicated software.

Tim Spencer is the owner of http://www.create-an-income.com. Discover how easy it is to protect your thank you pages with expiring links.

Importance of the Robots.txt File

Despite the importance of the robots.txt file in getting your website indexed with the major search engines, many webmasters don’t have one on their site. What is the robots.txt file, you ask? If you don’t know, you are far from alone. The robots.txt file is a simple text file (no HTML) that is placed in your website’s root directory in order to tell the search engines which pages to index and which to skip.

When a search engine sends its webcrawler to your site, one of the first things the webcrawler will do is look in the root directory for the robots.txt file. A correctly formatted robots.txt file will consist of one or more records, each providing instructions for a particular search-bot. A record will generally consist of two components. The first is the User-agent line, where the name of the search-bot is listed. The second consists of one or more “Disallow” lines. These lines tell the webcrawler which files or folders should not be indexed (e.g., a cgi-bin folder).

If you currently have a website and do not have a robots.txt file, you can create one easily. As mentioned earlier, the files are plain text, so just open up Notepad and save the file as robots.txt. Most webmasters can use one record that will apply to all of the search engine crawlers. Once you have opened Notepad, enter the following:

User-agent: *
Disallow:

The “*” applies this rule to all bots. In this example, there is nothing listed in the disallow line. This tells the robot to index the entire site. You can also enter a folder path here such as “/private” if there is a folder that shouldn’t be indexed. This can be very useful if you are still testing a portion of your website or if a section is still under construction.
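If you want to check how a well-behaved robot would read these rules, Python’s standard `urllib.robotparser` module can evaluate them. This is only a sketch: the helper name `can_crawl` and the page paths are illustrative, and real crawlers may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt, path, agent="*"):
    """Return True if the rules in robots_txt let `agent` fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# An empty Disallow line blocks nothing.
allow_all = "User-agent: *\nDisallow:\n"
# A folder path blocks everything that starts with that path.
block_private = "User-agent: *\nDisallow: /private\n"

print(can_crawl(allow_all, "/private/page.html"))      # True
print(can_crawl(block_private, "/private/page.html"))  # False
print(can_crawl(block_private, "/index.html"))         # True
```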

Now that you know what should go into your robots.txt file, here are several common mistakes people make when creating these files. Keep notes and comments out of the rules themselves: the standard permits comments that start with “#”, but stray text can cause confusion for the webcrawler. Also, the format should always be the user-agent on the first line, followed by the disallow(s). Do not reverse the order. Another common mistake involves using the incorrect case. If the disallowed folder is /private, make sure your robots.txt file does not list the folder as /Private. It seems like a very minor issue, but paths are case-sensitive, so the rule will not match. Finally, the original standard has no Allow command. Major crawlers such as Googlebot now accept one, but to stay compatible with every robot, tell the webcrawler what not to look at rather than what to look at.
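The case mistake described above is easy to demonstrate. The sketch below uses Python’s standard `urllib.robotparser`; the helper name `is_blocked` and the file paths are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(robots_txt, path):
    """Return True if the rules in robots_txt block `path` for all robots."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch("*", path)


# The real folder on the server is /private (lowercase).
wrong_case = "User-agent: *\nDisallow: /Private\n"
right_case = "User-agent: *\nDisallow: /private\n"

print(is_blocked(wrong_case, "/private/report.html"))  # False: rule never matches
print(is_blocked(right_case, "/private/report.html"))  # True
```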

If you are still curious about the robots.txt file, you can find many more complex examples online. Just try one of your favorite websites and look for their robots.txt file. For example, you can go to http://www.cnn.com/robots.txt. If you need help creating a robots.txt file for your site, there are plenty of places online that will create the file for you for free. One example is http://www.seochat.com/seo-tools/robots-generator/. Despite its apparent simplicity, this file can make or break your site’s chances with the search engines. Make sure you have your robots.txt file in place and correctly formatted today.

Justin Scarborough is founder of the Affiliate Marketing Linx internet marketing directory. His goal with this website is to create a very selective, human-edited directory that will help others find quality links and information relating to affiliate and internet marketing.

Learn to Protect Your Site by Communicating in the Language of Robots.txt

If you are a website owner, you know why it pays to speak the language of robots. No, we are not talking about physical robots, but rather the language that web robots understand. Anyone who is familiar with the famous Google robot, Googlebot, knows how important it can be to understand the language of robots to help protect your website. Not everyone, though, is as savvy in the art of speaking robot.

It can be intimidating for some website owners to think they have to learn to use the language effectively, but there are tools available to help the less robot-savvy communicators. Most of us have probably asked Googlebot to stay out of sections and parts of our websites that we don’t want invaded. Those who are familiar with the robots.txt language can simply fire off a file to him, and he will always deliver what we need. But if you are unsure of your abilities in the art of speaking robot, there is something that can help you.

There is a new Webmaster tool available that acts as a translator for robots.txt files. It helps you build the file to use, and all you have to do is enter the areas you do not want robots to crawl through. You can also make it very specific, blocking only certain types of robots from certain types of files. After you use the generator tool, you can take the file for a test drive by using the analysis tool. Once you have seen that your test file is ready to go, you can simply save the new file in the root directory of your website and sit back.
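The kind of file such a generator produces can be sketched in a few lines of Python. This is not the Webmaster tool itself, only an illustration; the function name `build_robots_txt` and the example rules are made up.

```python
def build_robots_txt(rules):
    """Build robots.txt text from a dict of user-agent -> paths to disallow."""
    records = []
    for agent, paths in rules.items():
        lines = ["User-agent: " + agent]
        # An empty Disallow value means nothing is blocked for this agent.
        lines += ["Disallow: " + path for path in paths] or ["Disallow:"]
        records.append("\n".join(lines))
    # Records are separated from each other by a blank line.
    return "\n\n".join(records) + "\n"


# Block all robots from one folder and one image robot from everything.
print(build_robots_txt({"*": ["/e-books/"], "Googlebot-Image": ["/"]}))
```

Save the printed text as robots.txt in the root directory of the site, then test it before relying on it.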

When creating and using the robots files, you should consider the following two tips:
1.    Robots.txt files are not supported by all search engines. Googlebot and some other robots can understand the files, but other robots may not be able to understand the generated files.
2.    Keep in mind that a robots.txt file is only a method of asking that your site be protected from robots crawling. You simply generate the file, and robots that are not as scrupulous as others can choose to ignore it and get in. Make sure you use password protection for any files you really need blocked.

This can be a great tool for those who are not as confident in their robot language skills, and it can help keep well-behaved robots away from the files on your website that you want left alone. It can substantially help you in your quest to protect your website and the files within by generating the file in the correct format for the robot. As always, there are options out there if you need further guidance: you can check out the help center for Webmaster tools or seek answers from a help group of Webmasters.

Sarah Folgea from Aceinternetmarketing.ie specializes in writing articles relating to the online Business Industry and importance of Robots.txt. Visit her website at www.aceinternetmarketing.ie

How To Keep Robots Out Of Your Web Site

THE ROBOTS.TXT FILE


You know that search engines have been created to help people find information quickly on the Internet, and the search engines acquire much of their information through robots (also known as spiders or crawlers), that look for web pages for them.


The spider or crawler robots explore the web looking for and recording all kinds of information. They usually start with URLs submitted by users, or with links they find on web sites, in sitemap files, or at the top level of a site.


Once the robot accesses the home page, it recursively accesses all pages linked from that page. But the robot can also check out all the pages that it can find on a particular server.


After the robot finds a web page, it indexes the title, the keywords, the text, etc. But sometimes you might want to prevent search engines from indexing some of your web pages, such as news postings and specially marked web pages (for example, affiliates’ pages). Whether individual robots comply with these conventions is purely voluntary.


ROBOTS EXCLUSION PROTOCOL


So if you want robots to keep out of some of your web pages, you can ask them to ignore the pages that you don’t want indexed. To do that, place a robots.txt file in the root directory of your web site.


For example, if you have a directory called e-books and you want to ask robots to keep out of it, your robots.txt file should read:


User-agent: *
Disallow: /e-books/


When you don’t have enough control over your server to set up a robots.txt file, you can try adding a META tag to the head section of any HTML document.


For example, a tag like the following tells robots not to index and not to follow links on a particular page:


<meta name="ROBOTS" content="NOINDEX, NOFOLLOW">
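To confirm that a page really carries this tag, you can scan its HTML. The following is a rough sketch using Python’s standard `html.parser` module; the class name `RobotsMetaParser`, the helper `robots_directives`, and the sample page are all made up for illustration.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # The parser lowercases tag and attribute names, but not values.
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            for item in content.split(","):
                if item.strip():
                    self.directives.add(item.strip().lower())


def robots_directives(html):
    """Return the set of robots directives declared in the HTML text."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><meta name="ROBOTS" content="NOINDEX, NOFOLLOW"></head></html>'
print(sorted(robots_directives(page)))  # ['nofollow', 'noindex']
```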


Support for the META tag among robots is not as widespread as support for the Robots Exclusion Protocol, but most of the major web indexes currently support it.


NEWS POSTINGS


If you want to keep the search engines out of your news postings, you can add an “X-no-archive” line to your postings’ headers:


X-no-archive: yes


Although most common news clients allow you to add an X-no-archive line to the headers of your news postings, some of them don’t permit you to do so.
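If your news client does not offer the option but you can build the posting yourself, the header is just one more line among the others. A minimal sketch with Python’s standard `email` package follows; the address, newsgroup, and subject are placeholders, and actually sending the posting is left out.

```python
from email.message import EmailMessage

# Build a posting and ask archives not to keep a copy of it.
msg = EmailMessage()
msg["From"] = "poster@example.com"   # placeholder address
msg["Newsgroups"] = "misc.test"      # placeholder newsgroup
msg["Subject"] = "Test posting"
msg["X-No-Archive"] = "yes"
msg.set_content("This posting asks archives not to keep a copy.")

print(msg.as_string())
```

As with robots.txt, the header is only a request, and an archive is free to ignore it.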


The problem is that most search engines assume that all information they find is public unless marked otherwise.


So be careful: although the robot and archive exclusion standards may help keep your material out of major search engines, there are others that respect no such rules.


If you’re highly concerned about the privacy of your e-mail and Usenet postings, you must use some anonymous remailers and PGP. You can read about it here:
www dot well dot com/user/abacard/remail.html
www dot io dot com/~combs/htmls/crypto.html
world dot std dot com/~franl/pgp/


Even if you are not particularly concerned about privacy, remember that anything you write will be indexed and archived somewhere for eternity, so use the robots.txt file as much as you need it.


Written by Dr. Roberto A. Bonomi

Dr. Roberto Bonomi is a successful e-book writer that shares his home business experience at: http://www.easy-home-business.com If you already have, or are looking for an Internet Home Business, you can’t miss the free knowledge that you’ll receive at his site, and you can post free your own articles at http://articles.drbonomi.com

Using Robots.txt to Control Search Engines

Robots.txt is a text file you put on your site to tell search robots which pages you would like them not to visit. Robots.txt implements the Robots Exclusion Protocol, which allows you, as a web manager, to define what parts of your site are off-limits to search engine crawlers. For example, web managers can disallow access to cgi, private, and temporary directories and other areas with pages they do not want accessed or indexed.

The robots.txt file is made up of two parts, the User-agent and the Disallow. The User-agent specifies which robots a record applies to, and the Disallow specifies which directories those robots should not crawl. The robots.txt file is a gentleman’s agreement: compliance is voluntary, so some crawlers may ignore it. Also, because the file controls crawling rather than indexing, a search engine such as Google may still list a disallowed URL if other pages link to it.

The structure of a robots.txt is pretty simple. This example allows all robots to visit all files:

User-agent: *
Disallow:

Example of a recommended robots.txt file blocking crawling of the scripts and images directories:

User-agent: *
Disallow: /scripts/
Disallow: /images/

If you have a particular robot in mind, such as the Google image search robot, which collects images on your site for the Google Image search engine, you may include lines like the following: 

User-agent: Googlebot-Image
Disallow: /

This means that the Google image search robot should not try to access any file in the root directory or any of its subdirectories.
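A record aimed at one robot leaves the others unaffected, which you can verify with Python’s standard `urllib.robotparser`. This is a sketch only: the image path is hypothetical, and the module’s user-agent matching is a simplification of what real crawlers do.

```python
from urllib.robotparser import RobotFileParser

# The User-agent and Disallow lines of a record must sit on adjacent lines.
rules = "User-agent: Googlebot-Image\nDisallow: /\n"

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The image robot is shut out of the whole site...
print(parser.can_fetch("Googlebot-Image", "/images/logo.png"))  # False
# ...while a robot with no matching record may crawl freely.
print(parser.can_fetch("Googlebot", "/images/logo.png"))        # True
```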

You can create the robots.txt file manually, using any text editor. It should be an ASCII-encoded text file, not an HTML file and the filename should be lowercase. Include the robots.txt file in your server’s root directory. This is standard web management practice. It must be in the main directory because otherwise user agents (search engines) will not be able to find it – they do not search the whole site for a file named robots.txt. Instead, they look first in the main directory and if they don’t find it there, they simply assume that this site does not have a robots.txt file and therefore they index everything they find along the way.
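Because the file always lives in the main directory, a robot can work out its address from any page URL on the site. A small sketch using Python’s standard `urllib.parse` is shown here; the function name `robots_url` and the example address are illustrative.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt address for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host; replace the path and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.example.com/shop/items/page.html?id=3"))
# http://www.example.com/robots.txt
```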

 

All search engines, or at least all the important ones, now look for a robots.txt file as soon as their spiders reach your web site. So, even if you currently do not need to exclude the spiders from any part of your site, having a robots.txt file is still a good idea; it can act as a sort of invitation into your site.

 

Stanic Vojin is a full time internet marketer and the owner of PromoteClick.com
