Throughout my career I've watched the battle between website/web app owners and the people who write bots, scrapers, and crawlers. I used to think this battle couldn't be won. But about six months ago I joined it, and I think I now have an [almost] deadly weapon.
Selenium Webdriver is my choice.
You have probably heard of it or used it before. It's the most popular tool for functional tests (also known as end-to-end tests), and projects like saucelabs.com make these tests very easy to implement and run.
Preparations for the battle
So, you've written a sequence of steps for scraping some website. Awesome! But what should the next step be? Of course, you can just run it manually on your computer, but what if you need to build some sort of service, or even a platform, on top of it? Yes, it's possible!
Xvfb is a virtual display server implementing the X11 protocol. Selenium Webdriver needs a display to work, and it works nicely with Xvfb. Here are the steps you need to follow if you want to run all this stuff on your server:
- install the Google Chrome application
- install Xvfb
- download the Google Chrome driver and set the path to this file in the “webdriver.chrome.driver” system property
- create an Xvfb initialization script for your distribution (e.g. Ubuntu)
- run Xvfb
- set the DISPLAY variable, e.g. “export DISPLAY=:99”, where 99 is the number of your virtual display (I believe it can be an arbitrary number)
- now you can run your application! Everything should just work, including screenshots (useful for debugging).
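The steps above can be sketched as a shell session. This is only a sketch under my assumptions (Ubuntu with Google's apt repository already configured, display number 99, chromedriver unpacked to /opt — all of these are illustrative, not from the original):

```shell
# Install Google Chrome and Xvfb (assumes Google's apt repo is set up;
# package names may differ on other distributions)
sudo apt-get install -y google-chrome-stable xvfb

# The chromedriver path below is an example; pass it to the JVM with
#   -Dwebdriver.chrome.driver=/opt/chromedriver
# or set it in code via System.setProperty("webdriver.chrome.driver", ...)

# Start virtual display number 99 and point X clients at it
Xvfb :99 -screen 0 1280x1024x24 &
export DISPLAY=:99

# Now run your scraper; it will render into the virtual display
```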
There is one problem that Selenium Webdriver can't solve. Usually, when you click a download button, you see an OS-level save dialog. Unfortunately, the driver can't handle OS windows. But there is a solution for this problem: create your own file downloader and pass the session information to it, such as cookies and other headers. Example with Apache HttpClient and Scala:
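The original listing was lost in extraction; below is a sketch of such a downloader. To keep it dependency-free I've used the JDK's built-in `java.net.http.HttpClient` rather than Apache HttpClient, and the names (`FileDownloader`, `download`, `cookieHeader`) are my own, not the original's:

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import java.nio.file.Path

object FileDownloader {
  private lazy val client = HttpClient.newBuilder()
    .followRedirects(HttpClient.Redirect.NORMAL)
    .build()

  // Joins cookies into the single "name=value; name=value" header
  // value the server expects.
  def cookieHeader(cookies: Map[String, String]): String =
    cookies.map { case (name, value) => s"$name=$value" }.mkString("; ")

  /** Downloads `url` to `savePath`, replaying the browser's cookies and
    * (optionally) its User-Agent so the server sees the same session. */
  def download(url: String,
               savePath: Path,
               cookies: Map[String, String],
               userAgent: Option[String] = None): Path = {
    val builder = HttpRequest.newBuilder(URI.create(url))
      .header("Cookie", cookieHeader(cookies))
    userAgent.foreach(ua => builder.header("User-Agent", ua))
    val response = client.send(builder.build(),
      HttpResponse.BodyHandlers.ofFile(savePath))
    response.body() // the path the file was written to
  }
}
```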
It takes a URL to download, a path where the file should be saved, a set of cookies, and an optional User-Agent header. Of course, you can pass and add more headers if you need to.
And it’s very easy to get current cookies:
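The original snippet was lost; a minimal sketch of reading them from the driver (the helper name `currentCookies` is mine) could look like this:

```scala
import scala.jdk.CollectionConverters._
import org.openqa.selenium.WebDriver

// Converts the driver's current cookies into a simple name -> value map,
// ready to hand to a custom downloader.
def currentCookies(driver: WebDriver): Map[String, String] =
  driver.manage().getCookies.asScala
    .map(c => c.getName -> c.getValue)
    .toMap
```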
and user agent:
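The WebDriver API itself has no User-Agent getter, so the usual trick (and, I assume, what the lost snippet did) is to ask the browser via JavaScript:

```scala
import org.openqa.selenium.{JavascriptExecutor, WebDriver}

// Chrome knows its own User-Agent; fetch it through executeScript.
def currentUserAgent(driver: WebDriver): String =
  driver.asInstanceOf[JavascriptExecutor]
    .executeScript("return navigator.userAgent;")
    .asInstanceOf[String]
```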
Earlier I said that it's almost impossible to detect Selenium Webdriver and Google Chrome when they are used together. That said, I see a few ways to protect yourself:
- CAPTCHA. But there are a lot of tools that can help with recognition, so it isn't really serious protection.
- Heuristic methods. For example, the Google AdWords/AdSense system is able to detect bots by tracking mouse moves, scrolls, timings, etc. I believe it's a very complicated and expensive technology, but it exists.
As you can see, Selenium Webdriver is a very powerful tool, not only for testing but for browser automation in general. If you need to integrate with a web app that doesn't have a public API, Selenium Webdriver can be the way to go. But with great power comes great responsibility…