Vessel is a web crawling framework for automated crawling and data mining, powered by the Ferrum web driver. Both tools are written in pure Ruby and are open source.
A crawling framework, or web crawler, collects information from web pages, for example to load it into a database later or to analyze the structure of a site. Similar tools are used by search engines, SEO services, scoring systems, and other programs that rely on data from open sources.
How it works
Let’s look at how the framework works using an example:
- To install the framework, simply add the gem “vessel” to your Gemfile.
- Create a file for the crawler, spider.rb.
- In that file, define the Spider class, derived from Vessel::Cargo.
- Next, set the data collection parameters and define the parsing callback methods. If you do not define the parsing method, Vessel::Cargo raises NotImplementedError by default.
As a result, we get a short spider class: the crawl settings come first, followed by the parsing callback.
First, Vessel starts the Ferrum driver, which visits the one or more pages specified in start_urls. Once a page and all of its data have loaded, parsing begins.
To do the work faster, Vessel spreads tasks across multiple threads. By default it uses one thread per core; to change this, add threads max: n to the class definition.
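The one-thread-per-core idea can be illustrated with plain Ruby. The sketch below is not Vessel's internal code: run_in_pool, the page paths, and the stand-in jobs are hypothetical names made up for this example, and max_threads only plays the role that threads max: n plays in a Vessel class.

```ruby
require "etc"

# Hypothetical helper: runs every job on a fixed pool of worker threads
# and returns the results in completion order.
def run_in_pool(jobs, max_threads: Etc.nprocessors)
  queue = Queue.new
  jobs.each { |job| queue << job }
  queue.close # lets workers stop once the queue is drained

  results = Queue.new
  workers = Array.new(max_threads) do
    Thread.new do
      # Queue#pop returns nil when a closed queue is empty
      while (job = queue.pop)
        results << job.call
      end
    end
  end
  workers.each(&:join)

  Array.new(results.size) { results.pop }
end

# Stand-in jobs; a real crawler would fetch and parse a page here.
pages = ["/page/1", "/page/2", "/page/3", "/page/4"]
jobs = pages.map { |path| -> { "parsed #{path}" } }

parsed = run_in_pool(jobs, max_threads: 2)
puts parsed.sort
```

Results arrive in completion order, so the example sorts them before printing.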
To start the crawler, use bundle exec ruby spider.rb.
Advantages of crawling over scraping
Web scraping and web crawling are both powerful and useful tools. However, crawling frameworks offer many more possibilities. The advantages of Vessel:
- The framework collects data not from a single page but from the entire site at once. You can also specify several web pages at once in start_urls.
- You have complete control over what data Vessel collects throughout the process.
- You receive the information immediately in a convenient format, such as CSV or JSON, which greatly simplifies its further use.
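As an illustration of the last point, collected items can be written out with Ruby's standard json and csv libraries. This is a sketch under assumptions: the items array and its author and text fields are made-up sample records standing in for whatever your parsing callback yields, and nothing here is part of Vessel itself.

```ruby
require "csv"
require "json"

# Made-up sample records, shaped like hashes a parsing callback might yield.
items = [
  { author: "Author A", text: "First quote" },
  { author: "Author B", text: "Second quote" }
]

# JSON Lines: one JSON object per line.
json_lines = items.map { |item| JSON.generate(item) }.join("\n")

# CSV: a header row taken from the first record's keys, then one row per record.
csv_text = CSV.generate do |csv|
  csv << items.first.keys
  items.each { |item| csv << item.values }
end

puts json_lines
puts csv_text
```

The CSV part assumes every record has the same keys in the same order; records with varying fields would need an explicit column list.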
By integrating Vessel into your project, you can create your own little Google: quickly and easily collect, extract, organize, and index data. The options for further use are endless.
If you want to use web crawler capabilities in your project, write to us. Evrone developers will show you how fast and functional this solution is.