  1. vmlinuz's Avatar
    vmlinuz is offline Private Member
    Join Date
    March 2009
    Location
    Transylvania
    Posts
    198
    Thanks
    20
    Thanked 30 Times in 24 Posts

    Google ignores robots.txt?

    I have a domain, wettpedia.net, which has exactly the same content as our main site, sportfogadas.org. It will be translated at some point, which is why it's uploaded and visible online.
    I didn't want Google to index it, so I created a robots.txt disallowing everything right after uploading.

    My question is, how did Google ignore it? Webmaster Tools shows everything is blocked correctly, but even so, a site: xxwww.wettpedia.net search shows indexed pages. It kinda sucks, as it is duplicate content.

    Now I'm thinking about blocking Googlebot completely, based on its user-agent string, returning a 404 error for each page it tries to index, but it will probably take some time until the pages disappear from the index.
    Last edited by vmlinuz; 27 May 2009 at 3:48 pm.

  2. Chips's Avatar
    Chips is offline Private Member
    Join Date
    October 2007
    Location
    God's Country
    Posts
    3,374
    Thanks
    911
    Thanked 1,075 Times in 807 Posts

    That is strange. I am in the process of redesigning one of my sites and am doing it all in one folder off the root. I blocked robots from that folder and none of the new pages are in the index. Also, when I begin a new review, I usually add that page to robots.txt as a Disallow and it never gets crawled.

    Did you have a misspelling in robots.txt?

    I am sure that Disallow: /xx.xxx works: I once forgot to delete a page from the list and also included it in the XML sitemap, and that generated an error saying the page was blocked.

    I list a page as "Disallow: /newpagetitle.html" or a folder as "Disallow: /folder".

    Maybe you can do a page removal request in Webmaster Tools to get them out of the index sooner.
    --
    "If you shoot for the stars and hit the moon, it's OK. But you've got to shoot for something. A lot of people don't even shoot." - Confucius

  3. vmlinuz

    Here's the content of robots.txt:

    User-agent: *
    Disallow: /

    If I do a removal request, I think it could be harder to get it back into the index later.
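
For anyone who wants to double-check rules like the ones above without waiting for a crawler, here is a minimal sketch using Python's standard urllib.robotparser module. The helper name is_allowed and the sample URLs are assumptions made up for illustration. It only tells you whether a compliant crawler may fetch a URL; it says nothing about whether a URL already appears in a search index.

```python
from urllib.robotparser import RobotFileParser

# The exact rules quoted in the post above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url.

    is_allowed is a hypothetical helper name, not part of any library.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("Googlebot", "SomeOtherCrawler"):
        allowed = is_allowed(ROBOTS_TXT, agent, "http://www.wettpedia.net/")
        print(agent, "allowed" if allowed else "blocked")
```

With a wildcard group and "Disallow: /", every agent should come back as blocked, which matches what Webmaster Tools reported.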

  4. Chips

    Yes, I think you are correct, it would be hard to get "re-indexed". I thought the pages you were working on were going to be moved at a later date. My bad, sorry for the incorrect advice (oops). The robots.txt is on the money too. Very odd that it was ignored by the spiders.

  5. vmlinuz

    I guess I'll have to "ban" Googlebot in some other way, by showing a 404 page instead of the real content until the site is finished.

  6. vmlinuz

    This should do it:
    Code:
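    // eregi() is deprecated since PHP 5.3 and removed in PHP 7;
    // stripos() does the same case-insensitive substring check.
    $userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    if (stripos($userAgent, 'bot') !== false) {
    	header('HTTP/1.1 403 Forbidden');
    	die("This page shouldn't be indexed yet!");
    }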
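
For readers not on PHP, the same idea can be sketched as a tiny Python WSGI app. This is only an illustration under assumptions: the names is_bot and app are made up here, and the "real content" response is a placeholder. Like the snippet above, it matches any user agent containing "bot" (case-insensitive), so it catches more than Googlebot, and it cannot catch a crawler that sends a different user-agent string.

```python
def is_bot(user_agent):
    """Case-insensitive check for the substring "bot" in the User-Agent.

    Hypothetical helper; a missing header is treated as "not a bot".
    """
    return "bot" in (user_agent or "").lower()


def app(environ, start_response):
    """Minimal WSGI app: 403 for anything that looks like a bot, 200 otherwise."""
    if is_bot(environ.get("HTTP_USER_AGENT")):
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"This page shouldn't be indexed yet!"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Real content goes here."]  # placeholder for the actual page


if __name__ == "__main__":
    # Serve locally for a quick manual test (port 8000 is an arbitrary choice).
    from wsgiref.simple_server import make_server

    make_server("127.0.0.1", 8000, app).serve_forever()
```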

  7. pgaming is offline Public Member
    Join Date
    July 2005
    Posts
    2,834
    Thanks
    403
    Thanked 193 Times in 153 Posts

    You should be able to block Googlebot via robots.txt by using the following syntax:

    User-agent: Googlebot
    Disallow: /
    This is intended for Google only.
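
As a rough check of how a Googlebot-only group behaves, here is a small sketch with Python's standard urllib.robotparser. The helper can_crawl, the agent name ExampleCrawler, and the example.com URL are assumptions for illustration only. Under these rules a compliant parser should block Googlebot and leave other agents unrestricted, since no wildcard group is present.

```python
from urllib.robotparser import RobotFileParser

# The Googlebot-only rules from the post above.
GOOGLEBOT_ONLY = """\
User-agent: Googlebot
Disallow: /
"""


def can_crawl(robots_txt, user_agent, url):
    """Return True if robots_txt lets user_agent fetch url (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Googlebot matches the named group and is disallowed everywhere.
googlebot_ok = can_crawl(GOOGLEBOT_ONLY, "Googlebot", "http://example.com/page.html")
# No group applies to this made-up agent, so nothing is disallowed for it.
other_ok = can_crawl(GOOGLEBOT_ONLY, "ExampleCrawler", "http://example.com/page.html")

if __name__ == "__main__":
    print("Googlebot allowed:", googlebot_ok)
    print("ExampleCrawler allowed:", other_ok)
```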

    greek39

  8. vmlinuz

    I know that, but using * as the User-agent should block it, too.
    As I said earlier, it shows as blocked in Webmaster Tools, but it still has indexed pages. That's why from now on all bots get a 403 Forbidden message instead of the content.

  9. zuhrunezz is offline New Member
    Join Date
    July 2009
    Posts
    4
    Thanks
    0
    Thanked 0 Times in 0 Posts

    correct robots.txt

    The robots.txt should look like this:
    User-Agent: *
    Allow: /

    Sitemap: http://yourdomainname/sitemap.xml

  10. vmlinuz

    Originally posted by zuhrunezz:
    The robots.txt should look like this:
    User-Agent: *
    Allow: /

    Sitemap: http://yourdomainname/sitemap.xml
    Yes, if you want your pages included. I DON'T want that yet.

  11. zuhrunezz

    Then the easiest way is NOT to have a robots.txt.

  12. vmlinuz


  13. zuhrunezz

    Then use it like this:

    User-agent: *
    Disallow: /path_1/
    Disallow: /example/
    Disallow: /~joe/
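
To see what path-scoped rules like these do in practice, here is a short sketch with Python's standard urllib.robotparser. The helper blocked_paths and the example.com host are assumptions for illustration. The point is that only URLs under the listed prefixes are blocked, so this form would not keep a whole site out of a crawl.

```python
from urllib.robotparser import RobotFileParser

# The path-scoped rules from the post above.
PATH_RULES = """\
User-agent: *
Disallow: /path_1/
Disallow: /example/
Disallow: /~joe/
"""


def blocked_paths(robots_txt, user_agent, paths, host="http://example.com"):
    """Return the subset of paths that user_agent may NOT fetch.

    blocked_paths and the example.com host are hypothetical, for illustration.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [p for p in paths if not parser.can_fetch(user_agent, host + p)]


if __name__ == "__main__":
    candidates = ["/index.html", "/path_1/a.html", "/example/b.html", "/other/c.html"]
    print(blocked_paths(PATH_RULES, "Googlebot", candidates))
```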

  14. bbonline's Avatar
    bbonline is offline Public Member
    Join Date
    January 2009
    Posts
    339
    Blog Entries
    5
    Thanks
    23
    Thanked 44 Times in 41 Posts


  15. The Following User Says Thank You to bbonline For This Useful Post:

    vmlinuz (9 July 2009)

  16. seoinxs is offline New Member
    Join Date
    August 2009
    Posts
    6
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Re:

    I have seen Google ignore robots.txt exclusions on more than one occasion when the target is heavily linked.

    But I think the code vmlinuz used is a better way to deal with Google ignoring robots.txt.
