Weapon
Throughout my career I have watched the battle between website/web app owners and the authors of bots, scrapers and crawlers. I used to think this battle couldn’t be won. But about six months ago I joined it, and I think I now have an [almost] deadly weapon.
Selenium Webdriver is my choice.
You have probably heard of it or used it before. It’s the most popular tool for functional tests (also known as end-to-end tests), and services like saucelabs.com make these tests very easy to implement and run.
But Selenium Webdriver is not only a testing tool - it’s a browser automation tool. The modern implementation with the Google Chrome (actually Chromium) driver is very powerful: it talks to Chrome over the browser’s own remote debugging (DevTools) protocol. You have access to everything - JavaScript, the DOM, even secure cookies! That’s why it’s almost impossible to detect a scraper written with Selenium Webdriver and Google Chrome: you just tell the browser what to do, and it behaves as if a real person were sitting in front of it and clicking buttons.
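To make the "access to everything" claim concrete, here is a minimal sketch of a session that touches the DOM, JavaScript and cookies through the Selenium Java bindings. The object name `SessionPeek` and its two methods are made up for illustration, and the sketch assumes chromedriver is already installed and discoverable.

```scala
import org.openqa.selenium.{By, Cookie}
import org.openqa.selenium.chrome.ChromeDriver
import scala.collection.JavaConverters._ // scala.jdk.CollectionConverters._ on Scala 2.13+

// Hypothetical helper, not part of Selenium.
object SessionPeek {
  // Names of the cookies that carry the Secure flag.
  def secureCookieNames(cookies: Set[Cookie]): Set[String] =
    cookies.filter(_.isSecure).map(_.getName)

  // Opens `url` in Chrome and prints a few things the driver can see.
  def inspect(url: String): Unit = {
    // Assumption: chromedriver can be found, e.g. via the webdriver.chrome.driver property.
    val driver = new ChromeDriver()
    try {
      driver.get(url)
      val title = driver.getTitle
      val linkCount = driver.findElements(By.tagName("a")).size             // DOM
      val userAgent = driver.executeScript("return navigator.userAgent")    // JavaScript
      val secure = secureCookieNames(driver.manage().getCookies.asScala.toSet) // cookies
      println(s"title=$title links=$linkCount userAgent=$userAgent secureCookies=$secure")
    } finally driver.quit() // always close the browser, even on failure
  }
}
```

Calling `SessionPeek.inspect("https://example.com/")` should open a Chrome window, print one line and close the browser again.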
Preparations for the battle
Xvfb
So, you’ve written a sequence of steps for scraping some website. Awesome! But what comes next? Of course you can just run it manually on your computer, but what if you need to build a service, or even a platform, on top of it? Yes, it’s possible!
Xvfb is a virtual display server implementing the X11 protocol. Selenium Webdriver needs a display to work, and it works nicely with Xvfb. Here are the steps to run all of this on your server:
- install the Google Chrome application
- install Xvfb
- download the Google Chrome driver and set the “webdriver.chrome.driver” system property to the path of this file
- create an Xvfb initialization script (example for Ubuntu)
- run Xvfb
- set the DISPLAY variable, e.g. “export DISPLAY=:99”, where 99 is the number of your virtual display; I believe any number works, as long as it is not already used by another X server and matches the one Xvfb was started with
- now you can run your application! Everything should just work, including screenshots (useful for debugging).
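The last three steps can also be scripted from Scala itself. The sketch below uses only the standard library. `XvfbLauncher`, the screen geometry, the one-second startup pause and the idea that the scraper is packaged as a jar are all assumptions made for illustration, not something this post prescribes.

```scala
import scala.sys.process.{Process, ProcessLogger}

// Hypothetical launcher: starts Xvfb, runs the scraper against it, stops Xvfb.
object XvfbLauncher {
  // Xvfb command line for a given display number (geometry is an arbitrary choice).
  def xvfbCommand(display: Int, geometry: String = "1280x1024x24"): Seq[String] =
    Seq("Xvfb", s":$display", "-screen", "0", geometry)

  // Command that starts the scraper JVM with the chromedriver system property set.
  def scraperCommand(driverPath: String, jarPath: String): Seq[String] =
    Seq("java", s"-Dwebdriver.chrome.driver=$driverPath", "-jar", jarPath)

  // Returns the scraper's exit code.
  def run(display: Int, driverPath: String, jarPath: String): Int = {
    // Start Xvfb in the background and discard its output.
    val xvfb = Process(xvfbCommand(display)).run(ProcessLogger(_ => ()))
    try {
      Thread.sleep(1000) // crude wait for the display to come up
      // Equivalent of "export DISPLAY=:99" for the child process only.
      Process(scraperCommand(driverPath, jarPath), None, "DISPLAY" -> s":$display").!
    } finally xvfb.destroy()
  }
}
```

For example, `XvfbLauncher.run(99, "/path/to/chromedriver", "scraper.jar")` would run a hypothetical `scraper.jar` on display 99. A proper init script is still the better choice for a long-running service; this is only handy for one-off jobs.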
File download
There is one problem that Selenium Webdriver can’t solve. Usually, when you click a download button, an OS modal window appears. Unfortunately, the driver can’t handle OS windows. But there is a solution: write your own file downloader and pass session information to it, such as cookies and other headers. Example with Apache HttpClient and Scala:
```scala
import java.io.File
import java.net.URL

import scala.concurrent.{ExecutionContext, Future, blocking}

import org.apache.commons.io.FileUtils
import org.apache.http.client.config.{CookieSpecs, RequestConfig}
import org.apache.http.client.methods.HttpGet
import org.apache.http.client.protocol.HttpClientContext
import org.apache.http.impl.client.{BasicCookieStore, HttpClientBuilder, LaxRedirectStrategy}
import org.apache.http.impl.cookie.BasicClientCookie
import org.apache.http.protocol.BasicHttpContext
import org.openqa.selenium.Cookie
import org.slf4j.LoggerFactory

object FileDownloader {
  // Assumption: SLF4J is on the classpath; the snippet does not show where `log` comes from.
  private val log = LoggerFactory.getLogger(getClass)

  val defaultUserAgent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36"
  val timeout = 10 // seconds

  // Downloads `url` to `pathToSave` off the calling thread, reusing the browser session's cookies.
  def download(url: String, pathToSave: String, cookies: Set[Cookie], userAgent: Option[String])
              (implicit ec: ExecutionContext): Future[String] = Future { blocking {
    val fileUrl = new URL(url)

    val downloadedFile = new File(pathToSave)
    if (!downloadedFile.canWrite) downloadedFile.setWritable(true)

    val config = RequestConfig.custom()
      .setConnectTimeout(timeout * 1000)
      .setConnectionRequestTimeout(timeout * 1000)
      .setSocketTimeout(timeout * 1000)
      .setCookieSpec(CookieSpecs.BROWSER_COMPATIBILITY) // HttpClient 4.3 constant, deprecated from 4.4 on
      .build()

    val client = HttpClientBuilder.create()
      .setDefaultRequestConfig(config)
      .setRedirectStrategy(new LaxRedirectStrategy())
      .setUserAgent(userAgent getOrElse defaultUserAgent)
      .build()

    try {
      // Hand the Selenium cookies to HttpClient through the request context.
      val localContext = new BasicHttpContext()
      localContext.setAttribute(HttpClientContext.COOKIE_STORE, mimicCookieState(cookies))

      val response = client.execute(new HttpGet(fileUrl.toURI), localContext)
      try {
        log.info(s"HTTP GET request status: ${response.getStatusLine.getStatusCode}, Downloading file: ${downloadedFile.getName}")
        FileUtils.copyInputStreamToFile(response.getEntity.getContent, downloadedFile)
      } finally response.close() // releases the content stream and the connection

      downloadedFile.getCanonicalPath
    } finally client.close()
  }}

  // Copies Selenium cookies into an HttpClient cookie store.
  private def mimicCookieState(seleniumCookieSet: Set[Cookie]): BasicCookieStore = {
    val mimicWebDriverCookieStore = new BasicCookieStore()

    for (seleniumCookie <- seleniumCookieSet) {
      val duplicateCookie = new BasicClientCookie(seleniumCookie.getName, seleniumCookie.getValue)
      duplicateCookie.setDomain(seleniumCookie.getDomain)
      duplicateCookie.setSecure(seleniumCookie.isSecure)
      duplicateCookie.setExpiryDate(seleniumCookie.getExpiry)
      duplicateCookie.setPath(seleniumCookie.getPath)
      mimicWebDriverCookieStore.addCookie(duplicateCookie)
    }

    mimicWebDriverCookieStore
  }
}
```
It takes the URL to download, the path where the file should be saved, a set of cookies and an optional user agent header. Of course, you can pass in and add more headers if you need them.
And it’s very easy to get the current cookies:
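```scala
import scala.collection.JavaConverters._ // scala.jdk.CollectionConverters._ on Scala 2.13+

// getCookies returns a java.util.Set, so convert it before calling toSet
driver.manage().getCookies.asScala.toSet
```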
and the user agent:
```scala
driver.executeScript("return navigator.userAgent") match {
  case userAgent: String => Some(userAgent)
  case _ => None
}
```
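If you want to pass more headers along, or would rather avoid the HttpClient dependency for a quick job, the same idea can be sketched with the JDK’s own `HttpURLConnection`. This is a different technique from the downloader above, not a drop-in replacement: `SessionHeaders` and its methods are hypothetical names, cookies are passed as plain name/value pairs, and no domain or path scoping is applied, so the caller has to pass only the cookies that belong to the target site.

```scala
import java.net.{HttpURLConnection, URL}
import java.nio.file.{Files, Paths, StandardCopyOption}

// Hypothetical helper built on the JDK only.
object SessionHeaders {
  // Serialises name/value pairs into the value of a Cookie request header.
  def cookieHeader(cookies: Seq[(String, String)]): String =
    cookies.map { case (name, value) => s"$name=$value" }.mkString("; ")

  // Collects the headers that make the request look like the browser session.
  def build(cookies: Seq[(String, String)],
            userAgent: Option[String],
            extra: Map[String, String] = Map.empty): Map[String, String] = {
    val cookiePart =
      if (cookies.isEmpty) Map.empty[String, String]
      else Map("Cookie" -> cookieHeader(cookies))
    cookiePart ++ userAgent.map("User-Agent" -> _).toList ++ extra
  }

  // Downloads `url` to `pathToSave` with the given headers; returns the bytes written.
  def download(url: String, pathToSave: String, headers: Map[String, String]): Long = {
    val connection = new URL(url).openConnection().asInstanceOf[HttpURLConnection]
    headers.foreach { case (name, value) => connection.setRequestProperty(name, value) }
    val in = connection.getInputStream
    try Files.copy(in, Paths.get(pathToSave), StandardCopyOption.REPLACE_EXISTING)
    finally { in.close(); connection.disconnect() }
  }
}
```

The cookie pairs can be produced from the Selenium cookie set by mapping each cookie to `getName -> getValue`.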
Defence
Earlier I said that it’s almost impossible to detect Selenium Webdriver and Google Chrome when they are used together. Actually, I see a few ways to protect yourself:
- CAPTCHA. But there are a lot of tools that can help with recognition, so it isn’t really serious protection.
- Build your website/web app with Flash. It’s ugly and nobody uses it except for promo sites, but it should work. I’m sure it’s possible to find a way to interact with Flash as well (with JavaScript calls or other tools), but it won’t be the browser’s native way of doing it, so you can probably detect it.
- Heuristic methods. For example, the Google AdWords/AdSense system is able to detect bots by tracking mouse movements, scrolling, timings, etc. I believe it’s a very complicated and very expensive technology, but it exists.
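To give a feel for the timing part of the heuristic approach, here is a toy sketch. It is in no way how AdWords/AdSense works, which I don’t know. It only flags a session whose events arrive at suspiciously regular intervals. `TimingHeuristic`, the use of the coefficient of variation and the 0.1 threshold are all assumptions chosen for illustration.

```scala
// Toy example: real detection systems combine many more signals.
object TimingHeuristic {
  // Coefficient of variation (stddev / mean) of the gaps between consecutive events.
  // Returns None when there are fewer than two gaps to compare.
  def gapVariation(timestampsMs: Seq[Long]): Option[Double] = {
    val gaps = timestampsMs.sorted
      .sliding(2)
      .collect { case Seq(a, b) => (b - a).toDouble }
      .toVector
    if (gaps.size < 2) None
    else {
      val mean = gaps.sum / gaps.size
      if (mean == 0.0) Some(0.0)
      else {
        val variance = gaps.map(g => math.pow(g - mean, 2)).sum / gaps.size
        Some(math.sqrt(variance) / mean)
      }
    }
  }

  // Humans are irregular; a near-constant rhythm is treated as suspicious.
  def looksAutomated(timestampsMs: Seq[Long], minVariation: Double = 0.1): Boolean =
    gapVariation(timestampsMs).exists(_ < minVariation)
}
```

A scraper can defeat this particular check by adding random pauses between its actions, which is one reason the real systems are so much more complicated.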
Summary
As you can see, Selenium Webdriver is a very powerful tool, not only for testing but for browser automation in general. If you need to integrate with a web app that doesn’t have a public API, Selenium Webdriver can be the way to go. But with great power comes great responsibility…