As I am just an English major with above-average IT skills and not an IT professional, it's getting a bit too technical for me.
And I'm just an IT major with above-average English skills... so we're even!
But if I understand you correctly, the error_logs you mention would be an artifact of bots crawling (running?) too quickly?
They're hitting my forum's database too frequently (by requesting pages too often), causing DB errors. The error message gets output to the browser (the bot, in this case), which in turn causes PHP to complain that headers have already been sent. So I happen to see the bots' activity as error_log entries (files). You might see their effects in some different way, such as LP telling you that your CPU usage is too high; an ecommerce site might show it differently than a forum or blog does.
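To make it concrete, here's a minimal sketch of the failure mode (not my forum's actual code; $db and $sql are placeholders):

    <?php
    // Under bot load the query fails, and the error text is echoed
    // straight to the client (the bot), starting the response body.
    $result = mysqli_query($db, $sql)
        or print('DB error: ' . mysqli_error($db));
    // By the time a later header() call runs, output has already
    // started, so PHP warns that headers have already been sent.
    header('Location: /index.php');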
With that in mind, what do you think of this more generalized approach: (1) Direct all bots to wait a certain number of days between crawls. (2) Assume any that disregard the directive (excepting Googlebot) are poorly configured or malicious, and (3) ban those in robots.txt.
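For (1), something like this is what I have in mind, if I understand Crawl-delay right (it asks for a pause in seconds between requests, not days between crawls, and it's non-standard: Googlebot ignores it, though many other crawlers honor it):

    # robots.txt - ask all bots to pause 30 seconds between requests
    User-agent: *
    Crawl-delay: 30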
That might work; there's no harm in giving it a try. Note that a number of the bots I've banned are "business intelligence" services that collect data on sites to sell for competitive analysis, rather than for search engine indexing (like Google). You might want to check what business a given bot's home site is in before banning it ("Disallow") in robots.txt. You probably don't want to block search engine bots unless they're simply causing you too much pain, but business intelligence / competitive analysis bots aren't doing you much good, so no harm in hamstringing them.
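For (3), banning a specific bot outright looks like this; AhrefsBot and SemrushBot are just two well-known examples of the competitive-analysis breed, so substitute whatever shows up in your own logs:

    # robots.txt - tell specific "business intelligence" crawlers
    # to stay out entirely
    User-agent: AhrefsBot
    Disallow: /

    User-agent: SemrushBot
    Disallow: /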
It occurs to me that some bots may be willing to observe the robots.txt file but cannot access it if banned in .htaccess.
Absolutely. If you feel you need to ban (block) a bot in .htaccess, you can still make an exception for /robots.txt, so the bot can read it (if it cares to) and see that you don't like it and don't want to be Best Friends Forever.
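Assuming Apache with mod_rewrite, the exception can look something like this sketch ("BadBot" is a placeholder for whatever identifies the bot in your logs):

    # Deny the bot everywhere EXCEPT /robots.txt, so it can still
    # read the rules telling it to go away.
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} BadBot [NC]
    RewriteCond %{REQUEST_URI} !^/robots\.txt$
    RewriteRule ^ - [F]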
I also noted your response to another user regarding LP's allowing hackers to access files directly, bypassing the .htaccess file altogether simply by specifying a direct path. I see this happening in my logs, where "file not found" errors are reported for IPs already banned in the root .htaccess. It appears that even fake paths would not spare the resources wasted by such attacks.
My understanding of .htaccess processing is that the server is supposed to start by processing the root (/) .htaccess, then the next one down the chain of directories, and so on until the last one in the target directory (where the .php file is running). However, I've seen behavior on my server (/.htaccess apparently being skipped) suggesting that for some reason it's configured to jump directly to the target directory's .htaccess, skipping anything earlier. I can't find any documentation on whether this is permitted or encouraged, but it certainly breaks a lot of sites. If someone gives a real path (but a fake or missing file), that could well explain why your /.htaccess IP blocks are being bypassed.
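If that's what's happening, one defensive workaround is to repeat the critical IP blocks in the target directory's own .htaccess, so a direct path still hits them. A sketch in Apache 2.4 syntax (2.2 would use Order/Deny instead; 203.0.113.45 is a documentation-range placeholder):

    # Re-assert the ban locally, in case /.htaccess is skipped when
    # a direct path into this directory is requested.
    <RequireAll>
        Require all granted
        Require not ip 203.0.113.45
    </RequireAll>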
It is perplexing to me that LP takes a "blame the victim" approach and doesn't provide adequate support in this area, considering most if not all of their users will be affected at some point. Do they have so many new customers that they don't have to worry about losing the old ones?
Same here. All I can figure is that most of their support staff is in India now, just reading from canned scripts and lacking the ability to adequately support their users. They certainly don't have so many new customers (who require much more hand-holding) that they can afford to drive away the old ones!