Hide Email Address from Bots - Keep mailto:

走了就别回头了 2020-12-12 12:32

tl;dr

Hide email address from bots without using scripts and maintain mailto: functionality. Method must also support screen-readers.


10 Answers
  • 2020-12-12 12:53

    People who write scrapers want to make their scrapers as efficient as possible. Therefore, they won't download styles, scripts, and other external resources. There's no method that I know of to set a mailto link using CSS. In addition, you specifically said you didn't want to set the link using Javascript.

    If you think about what other types of resources there are, there's also external documents (i.e. HTML documents using iframes). Almost no scrapers would bother downloading the contents of iframes. Therefore, you can simply do:

    index.html:

    <iframe src="frame.html" style="height: 1em; width: 100%; border: 0;"></iframe>
    

    frame.html:

    My email is <a href="mailto:me@example.com" target="_top">me@example.com</a>
    

    To human users, the iframe looks just like normal text. Iframes are inline and transparent by default, so we just need to set the border and dimensions. You can't make the size of the iframe match its content's size without using JavaScript, so the best we can do is give it predefined dimensions.

  • 2020-12-12 12:53

    First, I don't think doing anything with CSS will work. Most bots (Google's crawler being a notable exception) simply ignore all styling on websites, so any solution has to work with JS or server-side.

    A server-side solution could be an <a> that opens a new tab, which simply redirects to the desired mailto: URI.

    That's all my ideas for now. Hope it helps.

  • 2020-12-12 12:56

    Defeating email bots is a tough one. You may want to check out the Email Address Harvesting countermeasures section on Wikipedia.

    My back-story is that I've written a search bot. It crawled 105,000+ URLs during its initial run many years ago. What I learned from doing that is that web-crawling bots see literally EVERYTHING that appears as text on a web page. Bots read everything except images.

    Spam can't be easily stopped via code for these reasons:

    1. CSS & JS are irrelevant when using a mailto: link. Bots specifically scan HTML pages for the "mailto:" keyword. Everything from that colon to the next single quote or double quote (whichever comes first) is treated as an email address. HTML-entity email addresses - like the example above - can be quickly translated back using a reverse-ASCII method/function: a string starting with &#121;&#111;&#117;&#114;... quickly decodes to "yourname@domain.com". (My search bot threw away hrefs with mailto: email addresses, as I wanted URLs for web pages and not email addresses.)

    2. If a page crashes a bot, the bot's author will tune the bot to handle that page so it won't crash there again in the future, making the bot smarter over time.

    3. Bot authors can write bots which generate all known variations of email addresses, without crawling pages and without any starter email addresses. While it may not be feasible, it's not inconceivable with today's high-core-count CPUs (hyper-threaded and running at 4+ GHz), plus the availability of distributed cloud-based computing and even supercomputers. It's conceivable that someone could now create a bot farm to spam everyone without knowing anyone's email address. Twenty years ago, that would have been incomprehensible.

    4. Free email providers have had a history of selling their free user accounts to their advertisers. In the past, simply signing up for a free email account automatically gave advertisers a green light to start delivering spam to that email address, without that address ever being used online. I've seen that happen multiple times, with famous company names. (I won't mention any names.)

    5. The mailto: keyword is defined in an IETF RFC, and browsers are built to automatically launch the default email client from links containing it. JavaScript has to be used to interrupt that application-launching process when it happens.
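    The entity decoding described in point 1 really is trivial for a bot. A sketch of that "reverse ASCII" step (in JavaScript, purely as an illustration):

```javascript
// A bot can undo numeric HTML entities (&#NNN;) with a one-line replace,
// which is why entity-encoding an address is not real protection.
function decodeEntities(s) {
  return s.replace(/&#(\d+);/g, (_, code) => String.fromCharCode(Number(code)));
}

console.log(decodeEntities("&#121;&#111;&#117;&#114;&#110;&#97;&#109;&#101;")); // "yourname"
```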

    I don't think it's possible to stop 100% of spam while using traditional email servers, without using filters on the email server and possibly using images.

    There is one alternative... You can also build a chat-like email client, which runs internally on a website. It would be like Facebook's chat client. It's "kind of like email", but not really email. It's simply 1-to-1 instant messaging with an archiving feature... that auto-loads upon login. Since it has document attachment + link features, it works kind of like email... but without the spam. As long as you don't build an externally accessible API, then it's a closed system where people can't send spam into it.

    If you're planning to stick with strictly traditional email, then your best bet may be to run something like Apache's SpamAssassin on a company's email server.

    You can also try combining multiple strategies you've listed above, to make it harder for email harvesters to glean email addresses from your web pages. That won't stop 100% of the spam, 100% of the time, but it can still allow 100% of screen readers to work for blind visitors.

    You've created a really good starting look at what's wrong with traditional email! Kudos to you for that!

    A good screen reader is JAWS from Freedom Scientific. I've used it to listen to how my web pages are read by blind users. (If you hear a single male voice reading both actions [like clicking on a link] and text, try changing one voice to female so that one voice reads actions and the other reads text. That makes it easier to hear how the page is read by the visually impaired.)

    Good luck with your Email Address Harvesting countermeasure endeavours!

  • 2020-12-12 13:04

    PHP solution

    function printEmail($email){
        $email = '<a href="mailto:'.$email.'">'.$email.'</a>';
        $a = str_split($email);
        return "<script>document.write('".implode("'+'",$a)."');</script>";
    }
    

    Use

    echo printEmail('test@gmail.com');
    

    Result

    <script>document.write('<'+'a'+' '+'h'+'r'+'e'+'f'+'='+'"'+'m'+'a'+'i'+'l'+'t'+'o'+':'+'t'+'e'+'s'+'t'+'@'+'g'+'m'+'a'+'i'+'l'+'.'+'c'+'o'+'m'+'"'+'>'+'t'+'e'+'s'+'t'+'@'+'g'+'m'+'a'+'i'+'l'+'.'+'c'+'o'+'m'+'<'+'/'+'a'+'>');</script>
    

    P.S. Requirement: user must have JavaScript enabled
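    A purely client-side variant of the same split/join idea is also possible (a sketch; the address and helper names are illustrative, not from the answer). Storing the address as character codes keeps the literal string out of the served source:

```javascript
// Store the address as character codes so the literal "user@domain"
// string never appears anywhere in the served HTML/JS source.
function decodeCodes(codes) {
  return String.fromCharCode(...codes);
}

// Build the mailto: link in the browser (runs only where `document` exists).
function renderEmail(codes) {
  const email = decodeCodes(codes);
  const a = document.createElement("a");
  a.href = "mailto:" + email;
  a.textContent = email;
  return a;
}

// [116, 101, 115, 116, ...] decodes to "test@gmail.com":
// document.body.appendChild(renderEmail([116,101,115,116,64,103,109,97,105,108,46,99,111,109]));
```

    Like the PHP version, this still requires JavaScript, so it trades bot resistance for a hard dependency on scripting.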

  • 2020-12-12 13:06

    Have you considered using google's recaptcha mailhide? https://www.google.com/recaptcha/admin#mailhide

    The idea is that when a user clicks the checkbox (see nocaptcha below), the full e-mail address is displayed.

    While reCAPTCHA has traditionally been hard not only for screen readers but for humans as well, Google's nocaptcha reCAPTCHA (which you can read about here, as it relates to accessibility tests) appears to show promise with regard to screen readers, since it renders as a traditional checkbox from their point of view.

    Example #1 - Not secure but for easy illustration of the idea

    Here is some code as an example without using mailhide but implementing something using recaptcha yourself: https://jsfiddle.net/43fad8pf/36/

    <div class="container">
        <div id="recaptcha"></div>
    </div>
    <div id="email">
        Verify captcha to get e-mail
    </div>
    
    function createRecaptcha() {
        grecaptcha.render("recaptcha", {sitekey: "6LcgSAMTAAAAACc2C7rc6HB9ZmEX4SyB0bbAJvTG", theme: "light", callback: showEmail});
    }
    createRecaptcha();
    
    function showEmail() {
        // ideally you would do server side verification of the captcha and then the server would return the e-mail
      document.getElementById("email").innerHTML = "email@something.com";
    }
    

    Note: In my example I have the e-mail in a javascript function. Ideally you would have the recaptcha validated on the server end, and return the e-mail, otherwise the bot can simply get it in the code.

    Example #2 - Server side validation and returning of e-mail

    If we use an example more like this, we get additional security: https://designracy.com/recaptcha-using-ajax-php-and-jquery/

    function showEmail() {
        /* Check if the captcha is complete */
        if ($("#g-recaptcha-response").val()) {
            $.ajax({
                type: "POST",
                url: "verify.php", // The file we're making the request to
                dataType: "html",
                async: true,
                data: {
                    captchaResponse: $("#g-recaptcha-response").val() // The generated response from the widget, sent as a POST parameter
                },
                success: function (data) {
                    alert("Everything looks ok. Here is where we would take 'data', which contains the e-mail, and put it somewhere in the document.");
                },
                error: function (XMLHttpRequest, textStatus, errorThrown) {
                    alert("You're a bot");
                }
            });
        } else {
            alert("Please fill the captcha!");
        }
    }
    

    Where verify.php is:

    $captcha = filter_input(INPUT_POST, 'captchaResponse'); // get the captchaResponse parameter sent from our ajax

    /* Check if captcha is filled */
    if (!$captcha) {
        http_response_code(401); // Return an error code if there is no captcha
        exit;
    }
    $response = file_get_contents("https://www.google.com/recaptcha/api/siteverify?secret=YOUR-SECRET-KEY-HERE&response=" . $captcha);
    $result = json_decode($response);
    if ($result->success == false) {
        echo 'SPAM';
        http_response_code(401); // It's SPAM! Return some kind of error
    } else {
        // Everything is ok; this should be output as JSON or something better, but this is an example
        echo 'email@something.com';
    }
  • 2020-12-12 13:07

    The issue with your request is specifically the "Supporting screen-readers", as by definition screen readers are a "bot" of some sort. If a screen-reader needs to be able to interpret the email address, then a page-crawler would be able to interpret it as well.

    Also, the point of the mailto: scheme is to be the standard way to put email addresses on the web. Asking if there is a second way to do that is essentially asking if there is a second standard.

    Doing it through scripts still has the same issue: once the page is loaded, the script has run and the email address is rendered in the DOM (unless you populate the address on click or something). Either way, screen readers will still have trouble, since the address isn't already loaded when they read the page.

    Honestly, just get an email service with a half decent spam filter and specify a default subject line that is easy for you to sort in your inbox.

    <a href="mailto:no-one@no-where.com?subject=Something to filter on">Email me</a>
    

    What you're asking for is if the standard has two ways to do something, one for bots and the other for non-bots. The answer is it doesn't, and you have to just fight the bots as best you can.
