Web scraping, also known as web data extraction or web harvesting, is the process of gathering information from across the web. Today, data is oxygen for startups and freelancers looking to launch a business or project in any domain. Suppose you need to get the price of a product on an eCommerce website. That is easy to find, but now imagine running through this process for thousands of products across many eCommerce websites. Doing it manually is not an option at all.
To build a fully featured web scraper, you need to solve a set of problems:
- how to extract data (retrieve the raw content from the website)
- how to parse data (keep only the information that was asked for)
- how to process or store the parsed data
HTML parsing: Cheerio and JSDOM
The retrieved website content is usually the HTML of the entire page, while web scraping typically aims to obtain specific pieces of information from that content, such as a product title, price, or image URL.
At the beginning of this article, we applied a regular expression to extract the title from the content of example.com. This method works for strictly structured data such as telephone numbers, email addresses, etc. For common cases, it is unnecessarily complicated and fragile.
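For illustration, such a title extraction with a plain regular expression might look like this (the sample HTML standing in for the content of example.com is an assumption for the demo):

```javascript
// Sample HTML standing in for the content of example.com.
const html =
  '<html><head><title>Example Domain</title></head><body></body></html>';

// A hand-written pattern for the <title> element.
const match = html.match(/<title[^>]*>([^<]*)<\/title>/i);
const title = match ? match[1] : null;

console.log(title); // → Example Domain
```

Even this small pattern already has edge cases (attributes, whitespace, nesting), which is why dedicated parsers are preferable.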
The libraries listed below help build a well-structured, maintainable, and readable codebase without resorting to RegExp.
Making requests: HTTP clients
The HTTP client is a tool for communicating with servers over the HTTP protocol. To phrase it another way, it is a module or library that sends requests to servers and receives their responses.
An HTTP client usually covers only the data-extraction step: it issues a request to a web server, and the response contains the desired HTML. HTTP clients are also frequently used under the hood of more complex data extraction tools.
The Node.js ecosystem offers various options: Axios, SuperAgent, Got, and Node Fetch, but we'll review only the two most popular (based on GitHub star count).