ConfigServer Security & Firewall, or csf for short, is a popular firewall solution for cPanel servers. Combined with a good set of mod_security rules, it does a great job.
To prevent csf from temporarily or permanently blocking the IPs of good bots, edit the file /etc/csf/csf.rignore. Its default header explains how entries are matched:
####################### ##########################
# Copyright 2006-2017, Way to the Web Limited
# URL: http://www.configserver.com
# Email: sales@waytotheweb.com
####################### ##########################
# The following is a list of domains and partial domain that lfd process
# tracking will ignore based on reverse and forward DNS lookups. An example of
# its use is to prevent web crawlers from being blocked by lfd, e.g.
# .googlebot.com and .crawl.yahoo.net
#
# You must use either a Fully Qualified Domain Name (FQDN) or a unique ending
# subset of the domain name which must begin with a dot (wildcards are NOT
# otherwise permitted)
#
# For example, the following are all valid entries:
# www.configserver.com
# .configserver.com
# .configserver.co.uk
# .googlebot.com
# .crawl.yahoo.net
# .search.msn.com
#
# The following are NOT valid entries:
# *.configserver.com
# *google.com
# google.com (unless the lookup is EXACTLY google.com with no subdomain)
#
# When a candidate IP address is inspected a reverse DNS lookup is performed on
# the IP address. A forward DNS lookup is then performed on the result from the
# reverse DNS lookup. The IP address will only be ignored if:
#
# 1. The results of the final lookup matches the original IP address
# AND
# 2a. The results of the rDNS lookup matches the FQDN
# OR
# 2b. The results of the rDNS lookup matches the partial subset of the domain
#
# Note: If the DNS lookups are too slow or do not return the expected results
# the IP address will be counted towards the blocking trigger as normal
#
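The matching rules in this header can be sketched in a few lines of Python. This is only an illustration of the logic the header describes, not lfd's actual code: the function names matches_entry and should_ignore are hypothetical, and the resolvers are passed in as parameters so the check can be tried without live DNS.

```python
import socket


def matches_entry(hostname, entry):
    """Return True if hostname matches one csf.rignore-style entry."""
    hostname = hostname.lower().rstrip(".")
    entry = entry.strip().lower()
    if entry.startswith("."):
        # Partial domain: the hostname must end with the dotted suffix.
        return hostname.endswith(entry)
    # FQDN: exact match only.
    return hostname == entry


def _reverse(ip):
    # Reverse DNS lookup: IP address -> primary hostname.
    return socket.gethostbyaddr(ip)[0]


def _forward(hostname):
    # Forward DNS lookup: hostname -> set of IP addresses.
    return {info[4][0] for info in socket.getaddrinfo(hostname, None)}


def should_ignore(ip, entries, reverse=_reverse, forward=_forward):
    """Apply conditions 1 and 2a/2b from the header to one IP address."""
    try:
        hostname = reverse(ip)
        addresses = forward(hostname)
    except OSError:
        # Failed lookups mean the IP is treated as a normal candidate.
        return False
    if ip not in addresses:
        # Condition 1 failed: the forward lookup does not confirm the IP.
        return False
    return any(matches_entry(hostname, entry) for entry in entries)
```

Because an entry such as .googlebot.com is compared with its leading dot, a hostname like evilgooglebot.com does not match it, and a host whose forward lookup points somewhere else is rejected even if its reverse DNS name looks right.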
Add the following lines to the /etc/csf/csf.rignore file:
.googlebot.com
.crawl.yahoo.net
.search.msn.com
.google.com
.yandex.ru
.yandex.net
.yandex.com
.crawl.baidu.com
.crawl.baidu.jp
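If you manage several servers, the entries can be appended with a small script instead of by hand. The following is a minimal sketch, assuming Python 3 is available on the server; the helper name add_missing_entries is hypothetical, and it only appends entries that are not already present, so it is safe to run more than once.

```python
from pathlib import Path

# The partial domains listed above.
GOOD_BOT_DOMAINS = [
    ".googlebot.com",
    ".crawl.yahoo.net",
    ".search.msn.com",
    ".google.com",
    ".yandex.ru",
    ".yandex.net",
    ".yandex.com",
    ".crawl.baidu.com",
    ".crawl.baidu.jp",
]


def add_missing_entries(path, domains):
    """Append domains that are not yet in the file; return what was added."""
    path = Path(path)
    text = path.read_text() if path.exists() else ""
    # Collect existing entries, skipping blank lines and comments.
    existing = {
        line.strip()
        for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    }
    missing = [domain for domain in domains if domain not in existing]
    if missing:
        with path.open("a") as handle:
            # Make sure the first new entry starts on its own line.
            if text and not text.endswith("\n"):
                handle.write("\n")
            for domain in missing:
                handle.write(domain + "\n")
    return missing


# Example call (run as root):
# add_missing_entries("/etc/csf/csf.rignore", GOOD_BOT_DOMAINS)
```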
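After saving the file, restart csf and lfd (for example with csf -ra) so the new entries are loaded.
csf blocks an IP once the host has been blocked a number of times by mod_security rules. So we must go to the root of the problem: we will create mod_security rules that allow good bots.
To do this, edit the mod_security .conf files. The rules match on REMOTE_HOST, so Apache has to resolve client IPs to hostnames, which is what HostnameLookups On enables. It performs only a reverse lookup; Apache also offers HostnameLookups Double, which confirms the result with a forward lookup, as the verification guides under Resources recommend. If you are using cPanel EasyApache 4, add the following lines to the file /etc/apache2/conf.d/modsec/modsec2.user.conf: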
HostnameLookups On
SecRule REMOTE_HOST "@endsWith .googlebot.com" "allow,log,id:5000001,msg:'googlebot'"
SecRule REMOTE_HOST "@endsWith .google.com" "allow,log,id:5000002,msg:'googlebot'"
SecRule REMOTE_HOST "@endsWith .search.msn.com" "allow,log,id:5000003,msg:'msn bot'"
SecRule REMOTE_HOST "@endsWith .crawl.yahoo.net" "allow,log,id:5000004,msg:'yahoo bot'"
SecRule REMOTE_HOST "@endsWith .yandex.ru" "allow,log,id:5000005,msg:'yandex bot'"
SecRule REMOTE_HOST "@endsWith .yandex.net" "allow,log,id:5000006,msg:'yandex bot'"
SecRule REMOTE_HOST "@endsWith .yandex.com" "allow,log,id:5000007,msg:'yandex bot'"
SecRule REMOTE_HOST "@endsWith .crawl.baidu.com" "allow,log,id:5000008,msg:'baidu bot'"
SecRule REMOTE_HOST "@endsWith .crawl.baidu.jp" "allow,log,id:5000009,msg:'baidu bot'"
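Since every rule above follows the same pattern, the list can also be generated rather than typed. This is a rough sketch, assuming you keep the bot domains in one place; build_rules and the BOTS list are hypothetical names, and the rule text simply reproduces the format used above with consecutive IDs starting at 5000001.

```python
# (hostname suffix, log message) pairs taken from the rules above.
BOTS = [
    (".googlebot.com", "googlebot"),
    (".google.com", "googlebot"),
    (".search.msn.com", "msn bot"),
    (".crawl.yahoo.net", "yahoo bot"),
    (".yandex.ru", "yandex bot"),
    (".yandex.net", "yandex bot"),
    (".yandex.com", "yandex bot"),
    (".crawl.baidu.com", "baidu bot"),
    (".crawl.baidu.jp", "baidu bot"),
]

RULE_TEMPLATE = (
    'SecRule REMOTE_HOST "@endsWith {suffix}" '
    '"allow,log,id:{rule_id},msg:\'{label}\'"'
)


def build_rules(bots, first_id=5000001):
    """Return one SecRule line per bot, with consecutive rule IDs."""
    return [
        RULE_TEMPLATE.format(suffix=suffix, rule_id=first_id + offset, label=label)
        for offset, (suffix, label) in enumerate(bots)
    ]


# Print the block so it can be pasted into modsec2.user.conf.
print("HostnameLookups On")
print("\n".join(build_rules(BOTS)))
```

Keeping the IDs consecutive also keeps the grep pattern used further down valid, since it relies on all IDs sharing the 500000 prefix.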
After adding the rules to modsec2.user.conf, restart the Apache web server. After some time, matching requests will show up in the server logs. To see them, go to WHM -> Security Center -> ModSecurity™ Tools -> Hits List, or search the error log from the command line:
root@web [/]# grep "500000" /usr/local/apache/logs/error_log | tail -30
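To see which bots are actually being allowed, the same log lines can be tallied per rule ID. This is a minimal sketch, assuming the error log entries carry the rule ID in the usual ModSecurity 2 form [id "5000001"]; the function name count_bot_hits is hypothetical.

```python
import re
from collections import Counter

# ModSecurity 2 writes the rule ID into the log line as [id "NNN"].
RULE_ID = re.compile(r'\[id "(\d+)"\]')


def count_bot_hits(lines, id_prefix="500000"):
    """Count log lines per rule ID, keeping only IDs with the given prefix."""
    counts = Counter()
    for line in lines:
        for rule_id in RULE_ID.findall(line):
            if rule_id.startswith(id_prefix):
                counts[rule_id] += 1
    return counts


# Example use against the log file from the grep command above:
# with open("/usr/local/apache/logs/error_log", errors="replace") as log:
#     for rule_id, hits in sorted(count_bot_hits(log).items()):
#         print(rule_id, hits)
```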
Resources:
https://webmasters.googleblog.com/2006/09/how-to-verify-googlebot.html
https://yandex.com/support/webmaster/robot-workings/check-yandex-robots.xml
https://www.bing.com/webmaster/help/how-to-verify-bingbot-3905dc26
https://github.com/SpiderLabs/ModSecurity/wiki/
Hello, thanks for your tutorial. I tried it but ran into a problem: I have been seeing 404 response codes since making the change. My email is [removed]; I would love to chat about this and find a way forward.
Hello! These rules should not cause any 404 errors, so I suspect something else is causing them.
The best option is to ask your host. We can't tell exactly what is happening without server access.
Thanks a lot for this. I was suddenly having issues with Google crawling my site, and this helped fix it.