Ever since I first met Scala, I have thought that the language was designed for concurrent environments. I wasn’t familiar with actors for a long time, but a few months ago I was given a task: write a website scraper. It’s a typical story and there are plenty of nice existing solutions, but I felt it was the right time to try Akka in action. This article is the result of that work. I don’t want to describe the basics, so you should already be familiar with the main Akka concepts: actors and messages.
Here is the GitHub repo: https://github.com/sap1ens/scraper.
The Task
So, my task was to find certain ads on the ***** website, extract data from them, and save all the results to an XLS file. I also had a list of cities in the US and Canada (472 in total). The work can be done in a few steps:
- Generate a list of URLs for all cities.
- The results for a city can be paginated, so we should fetch every page.
- Then we should extract the URLs of the ads and parse the data.
- Once that’s done (all ads have been fetched and parsed), we need to combine the results (for grouping, sorting, etc.) and save them to the file.
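The first step can be sketched in a few lines of plain Scala. This is only an illustration: the host `example.org`, the URL layout, and the `City` fields are all assumptions, since the real site is not named here.

```scala
import java.net.URLEncoder
import java.nio.charset.StandardCharsets

object UrlGeneration {
  // Hypothetical shape of one configured city; the field names are assumptions.
  final case class City(slug: String, country: String)

  // Placeholder host: the real site is not named in the article.
  val baseUrl = "https://example.org"

  // Build the first results-page URL for one city and one search query.
  // URLEncoder encodes spaces as '+', which is fine for a query string.
  def listUrl(city: City, query: String): String = {
    val q = URLEncoder.encode(query, StandardCharsets.UTF_8.name())
    s"$baseUrl/${city.country}/${city.slug}/search?query=$q"
  }

  // One starting URL per city.
  def listUrls(cities: Seq[City], query: String): Seq[String] =
    cities.map(listUrl(_, query))
}
```

Each generated URL is only the first page for its city; further pages are discovered while parsing.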
Start with Activator
Typesafe Activator is a beautiful tool for starting Typesafe stack projects. It’s very easy to install.
So, I started the project with Activator (I just chose the Scala + Akka template) and got a ready-to-go application with sbt, Scala, Akka, and the ability to run the application with another beautiful tool, Typesafe Console.
Keep your stuff in Config
I also put all my configuration into a Typesafe Config file (I chose the JSON format). It holds the list of cities and countries, the search query, and some settings related to saving.
ActorSystem
Scraper.scala
Here I read some config data and created the ActorSystem and the root actor, named CollectorService. The app starts by sending the StartScraper message to the root actor.
Actors
CollectorService.scala
CollectorService is the root actor; it coordinates all the work.
It contains a variable (or a value, to be correct) named lists, which holds a reference to the next level of the hierarchy: the ListParser actors. The ListParser constructor accepts one argument, the CollectorService. It also uses routing to balance requests between 5 actors, based on how many messages are waiting in each actor’s mailbox (SmallestMailboxRouter).
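To make the routing idea concrete, here is a simplified model of smallest-mailbox selection in plain Scala. It is not Akka’s implementation and does not use the Akka API; `Routee`, `pick` and `route` are hypothetical names used only to show the principle of sending each message to the least busy worker.

```scala
object SmallestMailboxModel {
  // A routee is modelled as a name plus the number of messages waiting in its mailbox.
  final case class Routee(name: String, queued: Int)

  // Choose the routee with the fewest queued messages, if there is any routee at all.
  def pick(routees: Seq[Routee]): Option[Routee] =
    if (routees.isEmpty) None else Some(routees.minBy(_.queued))

  // Deliver one message: the chosen routee's queue grows by one.
  def route(routees: Vector[Routee]): Vector[Routee] =
    pick(routees) match {
      case Some(target) =>
        val i = routees.indexOf(target)
        routees.updated(i, target.copy(queued = target.queued + 1))
      case None => routees
    }
}
```

With five routees, a slow page fetch on one of them makes its queue grow, so new requests drift towards the others.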
CollectorService also contains two variables: pageResults and listUrls. The first stores the final ad results; the second holds the URLs that still have to be fetched.
CollectorService starts with the StartScraper message: it sends every generated URL to itself in an AddListUrl message. You may think that an actor sending messages to itself is a strange practice, but I don’t agree :) It’s a good pattern for decoupling and reusing logic, as you’ll see.
The next step is to start fetching these URLs. CollectorService delegates this job to the ListParser actor, and here you can see the second pattern: every actor should have its own task.
So, the AddListUrl message adds the URL to the listUrls variable and sends the StartListParser message to the ListParser actor.
The RemoveListUrl message removes the specified URL from listUrls; if listUrls is then empty, we consider the job done and the results should be persisted (so the actor sends SaveResults to itself).
The PagesResults message adds extracted page data to the pageResults variable. We’ll use it later, when saving.
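The bookkeeping described above can be modelled without Akka as a pure state transition. This is a sketch under assumptions: the message names come from the text, but their fields, the `State` class and the `step` function are hypothetical, and messages the actor would send are represented as plain `Effect` values.

```scala
object CollectorModel {
  // Hypothetical message shapes; the article names the messages but not their fields.
  sealed trait Msg
  final case class AddListUrl(url: String) extends Msg
  final case class RemoveListUrl(url: String) extends Msg
  final case class PagesResults(results: List[String]) extends Msg

  // Messages the actor would send in response, modelled as plain values.
  sealed trait Effect
  final case class StartListParser(url: String) extends Effect
  case object SaveResults extends Effect

  // Immutable snapshot of the two variables described in the text.
  final case class State(listUrls: List[String] = Nil, pageResults: List[String] = Nil)

  // Handle one message: return the new state and the messages to send.
  def step(state: State, msg: Msg): (State, List[Effect]) = msg match {
    case AddListUrl(url) =>
      (state.copy(listUrls = url :: state.listUrls), List(StartListParser(url)))
    case RemoveListUrl(url) =>
      val remaining = state.listUrls.filterNot(_ == url)
      // Nothing left to fetch means the job is done and the results can be saved.
      (state.copy(listUrls = remaining), if (remaining.isEmpty) List(SaveResults) else Nil)
    case PagesResults(results) =>
      (state.copy(pageResults = state.pageResults ++ results), Nil)
  }
}
```

In a real actor, `step` would be the body of `receive`, with each effect turned into a message send.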
ListParser.scala
As you can see, ListParser also contains a few variables: pages, pageUrls and pageResults. pages is similar to lists in CollectorService: it holds the next-level actors, PageParser, behind the same router, SmallestMailboxRouter.
pageUrls and pageResults keep intermediate results. They are pretty similar to listUrls and pageResults in CollectorService, except that they are maps whose key is the listUrl.
ListParser starts with the StartListParser message. It passes the URL to the parseAdvertisementList method (which I want to skip; you can find it in the GitHub repo, though). As a result, it receives a Future with the ListResult case class. This class contains a list of page-level URLs and, optionally, a URL to the next page of results for this city.
The result of that Future is then sent back to the actor itself. From there it can go one of two ways.
- If there is a next page in the message, it goes to this case:
case ListResult(listUrl, urls, Some(nextPage))
It sends the AddListUrl message to the CollectorService, and sends the ListResult message without the next page to the actor itself.
- If there is no next page (or the message was re-sent by the first case), it goes to the other case:
case ListResult(listUrl, urls, None)
The second ListResult case checks the list of URLs. If it’s empty, RemoveListUrl is sent to the CollectorService. If it’s not empty, the actor sends itself an AddPageUrl message for every URL.
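The two cases can be sketched as a pure function that returns the messages to send. The `ListResult` shape follows the pattern shown above. Everything else is an assumption: the fields of the other messages, the `Resend` wrapper (standing for "send this ListResult to myself again") and the `handle` function are hypothetical.

```scala
object ListResultRouting {
  // Shape taken from the pattern in the text: list URL, ad-page URLs, optional next page.
  final case class ListResult(listUrl: String, urls: List[String], nextPage: Option[String])

  // Outgoing messages; the comments say where the actor would send each one.
  sealed trait Outgoing
  final case class AddListUrl(url: String) extends Outgoing                  // to CollectorService
  final case class RemoveListUrl(url: String) extends Outgoing               // to CollectorService
  final case class AddPageUrl(listUrl: String, url: String) extends Outgoing // to the actor itself
  final case class Resend(result: ListResult) extends Outgoing               // to the actor itself

  def handle(result: ListResult): List[Outgoing] = result match {
    // A next page exists: register it as a new list URL, then process this page's URLs.
    case ListResult(listUrl, urls, Some(nextPage)) =>
      List(AddListUrl(nextPage), Resend(ListResult(listUrl, urls, None)))
    // No next page and no ads: this list URL is finished.
    case ListResult(listUrl, Nil, None) =>
      List(RemoveListUrl(listUrl))
    // No next page, some ads: queue every ad page.
    case ListResult(listUrl, urls, None) =>
      urls.map(AddPageUrl(listUrl, _))
  }
}
```

Stripping the next page before re-sending is what keeps the first case from matching twice for the same result.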
In the AddPageUrl case, the actor saves the specified URL to pageUrls and sends the StartPageParser message to the PageParser, the next actor in the hierarchy.
In the RemovePageUrl case, it removes the specified URL from pageUrls; if no URLs are left, it sends PagesResults as well as RemoveListUrl to the CollectorService.
ListParser also contains the SavePageResult case, which just saves the received data to pageResults.
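Here is a minimal model of that per-list bookkeeping, again without Akka. The two maps are keyed by the list URL, as described. The value types, the function names and the choice to drop a list’s entries once it is finished are assumptions made for this sketch.

```scala
object ListParserModel {
  // Both maps are keyed by the list URL.
  final case class State(
    pageUrls: Map[String, List[String]] = Map.empty,
    pageResults: Map[String, List[String]] = Map.empty)

  // Messages that would go to CollectorService once a list is finished.
  sealed trait Effect
  final case class PagesResults(results: List[String]) extends Effect
  final case class RemoveListUrl(listUrl: String) extends Effect

  // AddPageUrl: remember that this ad page is still being processed.
  def addPageUrl(s: State, listUrl: String, url: String): State =
    s.copy(pageUrls = s.pageUrls.updated(listUrl, url :: s.pageUrls.getOrElse(listUrl, Nil)))

  // SavePageResult: store the extracted data for this list.
  def savePageResult(s: State, listUrl: String, result: String): State =
    s.copy(pageResults =
      s.pageResults.updated(listUrl, result :: s.pageResults.getOrElse(listUrl, Nil)))

  // RemovePageUrl: drop the page; if it was the last one, report and clean up.
  def removePageUrl(s: State, listUrl: String, url: String): (State, List[Effect]) = {
    val remaining = s.pageUrls.getOrElse(listUrl, Nil).filterNot(_ == url)
    if (remaining.nonEmpty)
      (s.copy(pageUrls = s.pageUrls.updated(listUrl, remaining)), Nil)
    else {
      // Results were prepended, so reverse them to restore arrival order.
      val results = s.pageResults.getOrElse(listUrl, Nil).reverse
      (State(s.pageUrls - listUrl, s.pageResults - listUrl),
        List(PagesResults(results), RemoveListUrl(listUrl)))
    }
  }
}
```

Keying by list URL lets one ListParser work on several lists at once without mixing their results.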
PageParser.scala
The PageParser actor is pretty straightforward. It uses the parseAdvertisement method to get a Future with the extracted data and then sends the SavePageResult and RemovePageUrl messages to its parent actor (ListParser).
Important things
- Error handling is very important. That’s why every Future has an onFailure block, which sends clean-up messages to the parent actors.
- You may find ListParser and PageParser similar: they both have the same two kinds of inner variables and the same actions (add an item to the processing queue, remove an item from the processing queue, save results). It means we can extend the actor hierarchy multiple times, but it’s good practice to have a different actor for every level of the hierarchy, because then we can set up different supervisors. So, it’s worth thinking about how to reuse this behaviour.
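The clean-up idea from the first point can be sketched with plain Scala Futures. Note that `onFailure` was deprecated in Scala 2.12 and removed in 2.13, so this sketch uses `map` and `recover` instead. The message fields and the `messagesFor` helper are hypothetical. The point is only that a failed parse still produces RemovePageUrl, so the parent never waits forever for a page.

```scala
import scala.concurrent.{ExecutionContext, Future}
import scala.util.control.NonFatal

object FailureCleanup {
  // Hypothetical message shapes; the article gives only the message names.
  sealed trait Msg
  final case class SavePageResult(listUrl: String, data: String) extends Msg
  final case class RemovePageUrl(listUrl: String, url: String) extends Msg

  // Turn the outcome of a parse into the messages for the parent actor.
  def messagesFor(listUrl: String, url: String, parsed: Future[String])
                 (implicit ec: ExecutionContext): Future[List[Msg]] =
    parsed
      // Success: save the data, then mark the page as done.
      .map(data => List[Msg](SavePageResult(listUrl, data), RemovePageUrl(listUrl, url)))
      // Failure: nothing to save, but still mark the page as done.
      .recover { case NonFatal(_) => List[Msg](RemovePageUrl(listUrl, url)) }
}
```

Inside an actor, the resulting messages would be sent to the parent once the Future completes.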
Summary
I like the results: it takes about 4 minutes on my MBP to find, fetch, parse and save about 10k ads.
Bonus: Immutable data structures
Maybe you didn’t notice, but all the data structures held inside the actors are immutable. That’s not a requirement: an actor processes only one message from its mailbox at a time, so it won’t corrupt any mutable data. I used immutable data structures as an exercise, and because it’s good practice in Scala. That’s why the project has implicits for working with List and Map.
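As an illustration of what such helpers might look like, here is a small sketch using implicit classes. The names `without`, `addTo` and `removeFrom` are hypothetical and are not the ones used in the repo.

```scala
object ImmutableOps {
  // Adds a "remove this element" helper to immutable lists.
  implicit class ListOps[A](xs: List[A]) {
    def without(x: A): List[A] = xs.filterNot(_ == x)
  }

  // Helpers for a Map whose values are lists, such as URLs grouped by list URL.
  implicit class ListMapOps[K, V](m: Map[K, List[V]]) {
    // Prepend a value under the key, creating the entry if needed.
    def addTo(k: K, v: V): Map[K, List[V]] =
      m.updated(k, v :: m.getOrElse(k, Nil))

    // Remove a value; drop the key entirely when its list becomes empty.
    def removeFrom(k: K, v: V): Map[K, List[V]] = {
      val remaining = m.getOrElse(k, Nil).filterNot(_ == v)
      if (remaining.isEmpty) m - k else m.updated(k, remaining)
    }
  }
}
```

Every call returns a new collection, so inside an actor the result is assigned back to the variable, for example `pageUrls = pageUrls.removeFrom(listUrl, url)`.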
Further Reading
I can recommend the excellent book Akka Concurrency by Derek Wyatt if you want to keep learning about Akka.