How to block semalt.com referrer traffic using .htaccess
We saw this referral in our stats a few months back, and I was naive enough to sign up for the service to check it out... it was total rubbish, and I deleted my account immediately.
Clicky recently tweeted about banning them from its stats: https://twitter.com/clicky/status/445704464750501890
Would someone mind explaining what semalt actually does?
As usual I cannot penetrate the marketing-ese explanation provided by their homepage.
The normal way to exclude a crawler is to use a robots.txt file: http://www.robotstxt.org/robotstxt.html
Any ethical crawler will respect exclusions defined in that file. The article doesn't mention trying this method before jumping to mod_rewrite, though.
So -- does Semalt's crawler look for and follow robots.txt? As long as it does, they're doing what they should be doing to let site owners opt out of crawling.
EDIT: Found a page on their web site where you can enter a domain to opt it out of crawling: http://semalt.com/project_crawler.php
No instructions for how to opt out via robots.txt, though. That's a big omission. Anyone who's going to do mass crawling needs to support robots.txt.
Logs are going to be full of junk from bots. Unless it's actually breaking something, IMHO it's better to just accept that and move on.
(If you're using Apache and do want to start fighting bots, I'd suggest taking a look at mod_security. Very powerful, but beware that the default rules can be touchy.)
I wonder why we still don't have a simple, standard way of dealing with abusive HTTP traffic, like RBLs for mail.
Way too often I have seen systems get overrun with rogue traffic, usually spam bots and vulnerability scanners, and lots of small sites get into serious trouble because of it.
I have been plagued by semalt for the past 2 months. This is working so far:
SetEnvIfNoCase Referer crawler.semalt.com spammer=yes
SetEnvIfNoCase Referer semalt.com spammer=yes
Order allow,deny
Allow from all
Deny from env=spammer
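For anyone on Apache 2.4, where Order/Allow/Deny only work through mod_access_compat, a rough equivalent might look like the following. This is an untested sketch, not something from the original post. It assumes mod_setenvif and mod_authz_core are loaded and that .htaccess overrides permit these directives (AllowOverride FileInfo for SetEnvIfNoCase, AuthConfig for Require). It reuses the spammer variable from the rules above and escapes the dot so the pattern matches a literal semalt.com, which also covers crawler.semalt.com.

```apache
# Flag any request whose Referer header contains "semalt.com" (case-insensitive).
# The pattern is an unanchored regex, so subdomains such as crawler.semalt.com match too.
SetEnvIfNoCase Referer "semalt\.com" spammer=yes

# Apache 2.4 syntax: allow everyone except requests flagged above.
<RequireAll>
    Require all granted
    Require not env spammer
</RequireAll>
```

Flagged requests should get a 403, same as with the Deny rule. Keep in mind this only filters on the Referer header, which is trivially changed by whoever sends the request.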
I've noticed semalt.com in my Piwik analytics a few times now. Going to try this rule out in my web root and see if they ever show up again.
I've been getting a bunch of hits from this and couldn't figure out why! Glad someone did some more research on it.
Anybody know how to do this in nginx?
interesting new way of spamming