Web Scraping
- jaskaranchaniana
- Apr 5, 2020
- 13 min read
Authors: Jaskaran Chaniana, Devesh Bhatia, Anjali Patel
What is Web Scraping?
Web scraping is a data harvesting technique used to extract large amounts of data from a website. This data can consist of images, text, descriptions, reviews, prices, or any other desired content that the web page in question may contain. Web scraping is also referred to as screen scraping, web data extraction, web harvesting, etc. The process begins by sending a request to a web page to load it, then extracting data based on HTML or XML tags. After extraction, the information is saved into a local database on the computer or exported to a file (e.g., a spreadsheet) or an API, so that the collected content can be retrieved later for purposes such as analysis.
What is Web Scraping used for?
The main purpose of web scraping is to minimize the time it would otherwise take to collect this information manually with basic copy-and-paste. Users can collect information from websites for several different applications, some of which include:
Contact scraping: Gathering names, phone numbers, email addresses, or company URLs to be used for marketing research or personal use.
Price comparison: Comparing the prices of products across retail stores to find the best deals.
Scraping stocks: Keeping track of the stock market by choosing a specific set of stocks and extracting needed information from each stock.
Sports Stats and Data: Scraping sports stats for betting or fantasy leagues - sorting players by certain stats to find hidden gems.
Weather data monitoring: Collecting weather data to analyze weather patterns which can also be used to predict upcoming weather changes.
Research and Development: Gathering data from multiple sources on the Internet for academic, marketing, or scientific research. Companies such as Google use web crawling and scraping to build the search indexes that underpin businesses worth hundreds of billions of dollars.
The possible uses of web scraping are virtually endless. While some companies may use this data for business profit, others may focus on constructing strategic techniques to counter their competitors. It is up to the individual or organization using the web scraper to determine how this data can serve causes that are valuable to them.
Is Web Scraping legal?
Web scraping is legal and ubiquitous, albeit in a grey area. Users and businesses may use scraping for personal or professional purposes. However, some actions taken after web scraping is performed can still be illegal.
A web crawler may scrape websites for content such as videos or images, for example, but its operator is not allowed to re-post that media and claim it as their own. Copyright prevents users from gaining ownership of the content, but it does not prevent them from scraping it.
Web crawlers do not have the freedom to scrape data from sites requiring authentication. These websites require users to agree to their ‘Terms and Conditions’, most of which state that the automated collection of data is forbidden. Facebook, for example, requires user authentication, and because of this, web crawlers are not permitted to download user data behind the login gate. Publicly accessible websites such as YouTube do not require visitors to log in or accept terms before viewing content, which leaves their public pages reachable by web crawlers.
Furthermore, some web scraping applications pose the threat of sending large amounts of unwanted traffic to the web server hosting the pages in question. Copious amounts of requests can indirectly cause issues for the underlying system if it is unable to handle the traffic, effectively leading to an unintentional denial-of-service. Even if the system continues to function, the surge of incoming traffic can raise alarms that prompt the web server administrators to open an investigation.
It is not uncommon for web applications to prompt visitors to agree to their Terms and Conditions before proceeding. These terms often contain a clause that legally protects the site against web scraping. If a user agrees to terms stating that data cannot legally be extracted from the site via a web scraper, they may face legal consequences for doing so. This cuts both ways: a user who never agrees to the Terms and Conditions is not legally bound by that clause.
There are also legality concerns around how the extracted or scraped data is used. For example, data pertaining to specific users of an application cannot be shared without their consent. Individuals or organizations that harvest user data, such as email addresses, to sell can face serious charges in court. This protection extends to preventing scraped personal information from being used to send targeted advertisements and spam.
Web Scraping Mitigation
Like anything else in the cybersecurity world, one cannot completely prevent web scraping. However, there are some measures that can be implemented to keep a web application relatively safe from scraping.
Monitor Website
Rate Limiting
Rate limiting is a strategy that restricts the number of requests a website will process from a single client in a given time window. Web scrapers deployed against a target website typically scrape slowly anyway, to avoid monopolizing its bandwidth; explicit rate limiting enforces that cap on every client and can slow web scraping tools down dramatically.
In addition to blocking by IP address, it is also important to consider other factors when rate limiting. Other indicators can include, but are not restricted to:
HTTP headers
JavaScript-related information that can help identify malicious users:
screen size/resolution
timezone
fonts
Areas of the screen being clicked
How fast forms are being filled on a website
To elaborate, if an application starts receiving multiple requests from a single IP address, the indicators above can be used to further analyze that client and identify it as a scraper. For example, using JavaScript, one can determine whether the reported screen size is identical across all requests. The same signals can also reveal similar requests arriving from many different IP addresses, a scraper tactic known as distributed scraping.
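As a concrete illustration, the following is a minimal sketch of a per-client sliding-window rate limiter, assuming a single-process application; the client key here is an IP address, but it could equally be a username or a fingerprint built from the JavaScript signals above, and the window and limit values are arbitrary placeholders.

```python
# Minimal per-client sliding-window rate limiter (single-process sketch).
# WINDOW_SECONDS and MAX_REQUESTS are illustrative, not recommendations.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60   # length of the sliding window
MAX_REQUESTS = 100    # requests allowed per client per window

# Maps a client key (e.g., IP address or username) to its recent request times.
_request_log = defaultdict(deque)

def allow_request(client_key: str) -> bool:
    """Return True if this client is still under the rate limit."""
    now = time.monotonic()
    times = _request_log[client_key]
    # Drop timestamps that have fallen out of the sliding window.
    while times and now - times[0] > WINDOW_SECONDS:
        times.popleft()
    if len(times) >= MAX_REQUESTS:
        return False  # over the limit: reject (e.g., respond with HTTP 429)
    times.append(now)
    return True

# Example: decide whether to serve a request from 203.0.113.7.
if allow_request("203.0.113.7"):
    print("serve the request")
else:
    print("reject with 429 Too Many Requests")
```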
Detect Unusual Traffic
It is important to monitor the web application for unusual activity, such as too many requests from the same source IP address or unusual search patterns. Have access lists and detection systems in place so that any malicious requests the website detects can be flagged and blocked.
Require Registration & Login
Another method that can help reduce the severity of web scraping is to require registration and login before interacting with key features. Once account registration is required, an administrator can easily detect and monitor possible scrapers. This also helps with rate limiting, since clients can be managed by username instead of IP address. However, an adversary can still abuse this by writing a script that creates multiple accounts. To mitigate that, make a valid e-mail address mandatory and put an e-mail verification process in place so that disposable-address services such as 10 Minute Mail are rendered useless.
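As one building block, the sketch below rejects registrations from known disposable-mail domains and generates a verification token to be e-mailed before an account is activated. The domain list and helper names are illustrative assumptions rather than a complete solution.

```python
# Sketch: screen registrations against disposable e-mail providers and
# require token-based verification. The domain list is a tiny sample; a
# real deployment would use a maintained blocklist.
import secrets

DISPOSABLE_DOMAINS = {"10minutemail.com", "mailinator.com", "guerrillamail.com"}

def is_disposable(email: str) -> bool:
    domain = email.rsplit("@", 1)[-1].lower()
    return domain in DISPOSABLE_DOMAINS

def make_verification_token() -> str:
    # Token to e-mail to the user; the account stays inactive until the
    # user follows a link containing this token.
    return secrets.token_urlsafe(32)

email = "scraper@10minutemail.com"
if is_disposable(email):
    print("Rejected: disposable e-mail address")
else:
    print("Send verification link with token:", make_verification_token())
```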
Prevent Access From Cloud Hosting and VPN Services
On occasion, an adversary may choose to run their scraper from cloud hosting services, VPSes, VPN/proxy servers, or other similar web hosting services. It can be helpful to limit the number of requests coming from such services. Along the same lines as cloud hosting, one can limit IP addresses that appear to originate from VPN service providers, since an attacker may use a proxy server's IP to mask the true origin of their requests. Lastly, it is very important for an administrator to understand the possible consequences of this safeguard: because VPNs and proxy servers are in common legitimate use, it can also block access for legitimate users.
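A minimal sketch of such screening, assuming a list of known data-center CIDR ranges is available (the ranges below are reserved documentation networks used purely as placeholders; real deployments use published cloud-provider ranges or an IP-reputation feed):

```python
# Sketch: flag requests originating from known data-center / VPN IP ranges
# using only the standard library.
import ipaddress

DATACENTER_NETWORKS = [
    ipaddress.ip_network("198.51.100.0/24"),  # placeholder "hosting provider"
    ipaddress.ip_network("203.0.113.0/24"),   # placeholder "VPN provider"
]

def is_datacenter_ip(addr: str) -> bool:
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in DATACENTER_NETWORKS)

# Such clients might be rate-limited more aggressively rather than blocked
# outright, since legitimate VPN users will also match.
print(is_datacenter_ip("203.0.113.77"))  # True
print(is_datacenter_ip("8.8.8.8"))       # False
```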
Implement CAPTCHA
A great way to mitigate web scrapers is to use CAPTCHA. This is an effective approach for distinguishing between scrapers and real users, as it eliminates the need to block access outright. Even so, there are some things to keep in mind. First, a key step for all administrators is to exclude the CAPTCHA's solution from the HTML markup. Relatedly, it is important to use a proven, well-maintained CAPTCHA service such as Google's reCAPTCHA. This makes integration easier, since it requires little additional work and never stores CAPTCHA solutions anywhere in the HTML markup.
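For reference, reCAPTCHA verification happens server-side: the browser submits a token with the form, and the server checks it against Google's documented siteverify endpoint. A minimal sketch, with the secret key and token values as placeholders:

```python
# Sketch: server-side reCAPTCHA verification. The token normally arrives
# from the browser in the g-recaptcha-response form field.
import requests

VERIFY_URL = "https://www.google.com/recaptcha/api/siteverify"

def captcha_passed(secret_key: str, client_token: str) -> bool:
    resp = requests.post(
        VERIFY_URL,
        data={"secret": secret_key, "response": client_token},
        timeout=10,
    )
    return resp.json().get("success", False)

if captcha_passed("YOUR_SECRET_KEY", "token-from-the-form"):
    print("Human: serve the page")
else:
    print("Possible bot: deny or re-challenge")
```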
Replace Text Content With Images
Lastly, an effective approach for hindering web scraping is to render important text on your website, such as contact information (emails, phone numbers, location, etc.), as images. This prevents scrapers looking for text from gathering any information they could use to their advantage. However, it is important for organizations to understand that this approach comes with downsides: reduced compatibility with search engines, degraded performance, and legality issues in some countries, states, and provinces.
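A minimal sketch of this technique using the Pillow imaging library (canvas size, font, and filename are arbitrary choices for illustration):

```python
# Sketch: render contact text as an image with Pillow (pip install Pillow),
# so the e-mail address never appears in the HTML source.
from PIL import Image, ImageDraw

def text_to_image(text: str, path: str) -> None:
    img = Image.new("RGB", (320, 40), "white")   # blank white canvas
    draw = ImageDraw.Draw(img)
    draw.text((10, 12), text, fill="black")      # default bitmap font
    img.save(path)

# The page then embeds <img src="contact.png"> instead of the raw text.
text_to_image("contact@example.com", "contact.png")
```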
How Web Scraping Works
Web scraping is the process of extracting data from websites using a scraping application of some sort. Oftentimes, scraping applications allow you to target and extract specific data from a website. The web scraper retrieves the data from a user-specified URL in HTML format in order to parse and isolate specific data for the user. Many readily available web scrapers also offer functionality for saving extracted data to a file or database so that it may be used later.
As will be seen later in the Proof-of-Concept section, many web scrapers extract specific data by searching for the HTML tag associated with a particular kind of data and extracting the entire line. Then, depending on the scraping application being used, it will store either the contents within the HTML tags or the entire line. To put this into perspective, an HTML tag such as <a href=...> often contains hyperlinks used to navigate to other parts of the site or redirect to other sites. Depending on the scraping application, the user can search for all occurrences of this tag to map out the links a particular web application exposes. Another application of web scraping would be to scrape an entire blog for users and their email addresses. Imagine having to manually copy and paste the username and associated email of each user on a blog post: this could take hours, if not longer!
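To make the tag-based approach concrete, the short snippet below uses BeautifulSoup (the same library the Proof-of-Concept relies on) to pull every hyperlink out of a small hard-coded HTML document; the snippet itself is invented for illustration.

```python
# Minimal illustration of tag-based extraction with BeautifulSoup on a
# static HTML snippet (no network access needed).
from bs4 import BeautifulSoup

html = """
<html><body>
  <a href="/about">About</a>
  <a href="https://example.com/blog">Blog</a>
</body></html>
"""

soup = BeautifulSoup(html, "lxml")
for link in soup.find_all("a"):          # every <a ...> tag in the document
    print(link.get("href"), "->", link.get_text(strip=True))
# Output:
# /about -> About
# https://example.com/blog -> Blog
```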
Web Scraping Tools
There are hundreds, if not more, of web scraping applications and scripts readily available. Most often, these web scrapers charge a fee, whether as a one-time purchase, a monthly subscription, or content-based pricing. A key limitation of generic web scraping software is that it can be difficult to set up or use, owing to the steep learning curve of each application. It is also important to consider that different tools rarely function the same way, so switching between them means starting over.
ScrapeHero
ScrapeHero is a web scraping service often used for large-scale data collection and widely known for its processing power and efficiency. It is commonly used for research data collection, real estate and housing data, and stock market and financial data. ScrapeHero's large infrastructure can support scraping approximately 3,000 pages per second. However, this poses the threat of placing a heavy traffic load on the target website's servers and networks. Depending on the purpose the application is used for, this can spell trouble for smaller businesses unable to handle the incoming traffic, and can effectively land the user in trouble due to the noise being created.
Mozenda
Mozenda is a cloud-based web scraping service consisting of a web console and the Agent Builder, a Windows application for creating customized projects, which together let users run their own agents. Extracted data may be exported or published directly to a cloud storage provider, with the extraction itself taking place on servers in Mozenda's data centers. Through geolocation, Mozenda allows users to protect their IP addresses from being banned. It also supports document extraction, image extraction, multi-threaded extraction, and smart data aggregation. Owing to its many features, Mozenda is an expensive scraping tool, with pricing starting at $99 per 5,000 pages, and it requires a Windows PC to actively run.
Import.Io
Import.Io is a web scraping service that can be used on most operating systems. With its user-friendly design, it serves as an easy-to-use platform, offering a clean interface, a simple dashboard, and screen capture. To use it, a user simply clicks to select data on a website, and the extracted data is then stored on a cloud service for several days. The service costs about $299 per month for 5,000 URL queries or $4,999 per year for 500,000 URL queries.
Octoparse
Octoparse is a web scraping service whose free tier supports scraping an unlimited number of pages. It provides tools such as regular expressions and XPath to allow precise extraction of data, and offers subscription plans ranging from small teams up to enterprise-level consumers. Since websites are written by people, their markup often contains mistakes, which can cause normal extraction to miss irregular data; according to Octoparse, XPath can resolve roughly 80% of such missing-data problems. Octoparse also offers built-in templates (Amazon, Yelp, TripAdvisor) for beginners, off-the-shelf guidelines and YouTube tutorials, and free unlimited crawls. Its many features are smooth and easy to operate.
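To illustrate the kind of precise, XPath-based selection such tools expose, the small Python sketch below uses the lxml library (also used by the Proof-of-Concept later); the HTML snippet and class names are invented for the example.

```python
# Small illustration of XPath-based extraction with lxml.
from lxml import html

page = html.fromstring("""
<div class="product">
  <span class="name">Widget</span>
  <span class="price">$9.99</span>
</div>
""")

# Select the text of every element whose class attribute is "price".
print(page.xpath('//span[@class="price"]/text()'))  # ['$9.99']
```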
Custom Web Scrapers
Manually creating a web scraper offers benefits such as customizability and scalability. Another benefit of coding one's own web scraper is avoiding developer dependence: if the developers of a tool stop supporting it, one would likely end up sifting through many other web scrapers for a suitable replacement. This can prove to be an annoyance, especially considering that an off-the-shelf scraper's functionality is limited to its developer's ideas. Finally, a custom-built web scraping application costs nothing but time!
The following script (see Proof-of-Concept) was created on Linux using Python 3, BeautifulSoup4 (BS4), lxml, and the Python requests library. Given Python's large number of web scraping frameworks that excel at parsing and processing data, it is a high-performance, easy-to-use option for building a custom web scraping application. Python also has an extensive collection of libraries that can be leveraged depending on the functionality needed, and it remains relatively easy to read and write, with simple syntax and powerful functions.
BS4 is particularly interesting for building a scraper due to its wide compatibility with various parsers and its support for both HTML and XML. BS4 can use lxml, a parser known for its speed, to parse HTML and XML with ease. To send requests to the target website, the requests library generates HTTP requests in a user-friendly fashion. These libraries were installed with the package-management system pip, since it is simple to use on the command line.
Prerequisites
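The original post showed the prerequisite setup as a screenshot. Based on the libraries named above, the installation would amount to something like the following pip command (package names as published on PyPI):

```
pip3 install beautifulsoup4 lxml requests
```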

Proof-of-Concept
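The original post presented the script as a screenshot. The following is a reconstruction based on the description in the next paragraph; exact variable names are assumptions, but the behaviour (prompting for a URL, issuing a GET request, printing the title tag, and looping over hyperlinks) follows that description.

```python
# Reconstruction of the proof-of-concept scraper described below.
import requests
from bs4 import BeautifulSoup

# Prompt for the target site and build the full URL for the request.
site = input("Enter the URL to scrape (e.g., google.ca): ")
url = "https://" + site

# Fetch the page with an HTTP GET request.
response = requests.get(url)

# Parse the returned HTML with BS4, using lxml as the parser.
soup = BeautifulSoup(response.text, "lxml")

# Extract and print the page's <title> tag.
print(soup.title)

# Iterate over every <a> tag and print its hyperlink target.
for link in soup.find_all("a"):
    print(link.get("href"))
```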

As can be seen above, the code for a basic Python web scraper is easy for the average user to comprehend, and the aforementioned libraries add the advantage of a low learning curve for first-time scrapers. The code works by prompting the user for the URL to be scraped. It then prepends the https:// scheme to craft the full URL required for the request and saves it to a variable. Next, the requests library issues an HTTP GET request and saves the output of the entire page to a variable. The print() function is used in conjunction with the BS4 library to select a specific tag to output on the command line; in this case, the title HTML tag is extracted from the page's output. To extract the hyperlinks, a for loop iterates through and outputs all of the hyperlinks included in the page output of https://google.ca.
Extracted Output
As can be seen below, the custom-built script from above was tested on the Google landing page with the goal of extracting specific data. Both screenshots are from the same iteration of the script on https://google.ca, but were separated for readability.
Custom Script Output 1/2

Custom Script Output 2/2

The custom script has minimal limitations and does not impose a heavy traffic load on the target web page. With further development, it could be extended to recursively crawl a target website if necessary. The <title> and <a> tags were used for demonstration purposes, but they can be replaced with the tags for whatever data the user requires. Instead of hard-coded tags, the script could also prompt the user for which HTML tags and data to include in the output. The functionality is virtually limitless.
Protection Techniques
Web scraping famously sits in a grey area of the legality spectrum, so a responsible user should take precautions to protect themselves when scraping. When sending an HTTP request to a target web application, a few things can be done to protect the user.
User-Agent Strings
Every HTTP request carries a user-agent string that indicates to the web application which browser and version originated the request, along with the operating system. A custom web scraping application can be configured to pick randomly from a list of pre-coded, real user-agent strings when crafting each request, so that the user's own user-agent string is never exposed.
The following is an example of a real user-agent string, in this case Googlebot's, that a scraper could present (note that “User-agent: Googlebot” is the robots.txt form, not the HTTP header value):
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
HTTP Referrer
Additionally, some web applications will block requests that do not contain an expected HTTP referrer. The referrer is carried in the Referer header field of an HTTP request (the one-r spelling is a historical quirk of the standard) and identifies the web page that linked to the resource being requested. This effectively indicates the origin of a request to the server of the web page in question.
For example, the following header could be used to set a referrer:
“Referer”: “https://www.google.com/”
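Putting the two protections together, a scraper built on the requests library can rotate among pre-coded real user-agent strings and set the Referer header explicitly. A minimal sketch (the listed user-agent values are examples of real browser strings):

```python
# Sketch: send each request with a randomly chosen real user-agent string
# and an explicit Referer header.
import random
import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:74.0) Gecko/20100101 Firefox/74.0",
]

headers = {
    "User-Agent": random.choice(USER_AGENTS),  # hide the scraper's own UA
    "Referer": "https://www.google.com/",      # note the single-r spelling
}

response = requests.get("https://example.com", headers=headers)
print(response.status_code)
```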
CAPTCHA Solving Service
One of the most widely known and effective ways for web applications to prevent access from scrapers and crawlers is to prompt for a CAPTCHA. CAPTCHA stands for “Completely Automated Public Turing test to tell Computers and Humans Apart” and is an effective means to stop automated data extraction and access from bots.
For clarity, the image below indicates an example CAPTCHA prompt:

These tests can be addressed by using a CAPTCHA solving service such as captcha-solver or python-anticaptcha. Both of these Python libraries are readily available for installation via pip. Functionality and methods vary depending on the CAPTCHA solver used, but with the abundance of solvers available, experimenting will help find the best fit for the scenario. Setup instructions are available on the Python Package Index, https://pypi.org.
References
“Best Data Scraping Tools for 2020 (Top 10 Reviews),” Octoparse. [Online]. Available: https://www.octoparse.com/blog/best-data-scraping-tools-for-2019-top-10-reviews. [Accessed: 06-Apr-2020].
“Making web data extraction easy and accessible for everyone,” Web Scraper. [Online]. Available: https://webscraper.io/. [Accessed: 06-Apr-2020].
“6 Web Scraping Tools for Extracting Data,” Codecondo. [Online]. Available: https://codecondo.com/web-scraping-tools-extracting-data/. [Accessed: 06-Apr-2020].
“What is Web Scraping and How Does Web Crawling Work?,” Scrapinghub. [Online]. Available: https://scrapinghub.com/what-is-web-scraping/. [Accessed: 06-Apr-2020].
“Pip (package manager),” Wikipedia, 02-Apr-2020. [Online]. Available: https://en.wikipedia.org/wiki/Pip_(package_manager). [Accessed: 06-Apr-2020].
“XML and HTML with Python,” lxml. [Online]. Available: https://lxml.de/. [Accessed: 06-Apr-2020].
Lxml, “lxml/lxml,” GitHub, 21-Mar-2020. [Online]. Available: https://github.com/lxml/lxml/. [Accessed: 06-Apr-2020].
“Scalable do-it-yourself scraping - How to build and run scrapers on a large scale,” ScrapeHero, 08-Jul-2019. [Online]. Available: https://www.scrapehero.com/scalable-do-it-yourself-scraping-how-to-build-and-run-scrapers-on-a-large-scale/. [Accessed: 06-Apr-2020].
D. Ni, “5 Tips For Web Scraping Without Getting Blocked or Blacklisted,” Scraper Api, 16-Jan-2020. [Online]. Available: https://www.scraperapi.com/blog/5-tips-for-web-scraping. [Accessed: 06-Apr-2020].
“What are the methods used against web scraping?,” Scraping, 19-Feb-2020. [Online]. Available: https://www.scraping-bot.io/anti-scraping-methods/. [Accessed: 06-Apr-2020].
JonasCz, “JonasCz/How-To-Prevent-Scraping,” GitHub, 29-Sep-2019. [Online]. Available: https://github.com/JonasCz/How-To-Prevent-Scraping. [Accessed: 06-Apr-2020].
SysNucleus, “WebHarvy Web Scraper,” Web Scraping Explained. [Online]. Available: https://www.webharvy.com/articles/what-is-web-scraping.html. [Accessed: 06-Apr-2020].
M. Perez, “What is Web Scraping and What is it Used For?: ParseHub,” ParseHub Blog, 02-Oct-2019. [Online]. Available: https://www.parsehub.com/blog/what-is-web-scraping/. [Accessed: 06-Apr-2020].
“The Impact Of Web Scraping,” Radware Bot Manager, 07-Nov-2019. [Online]. Available: https://www.shieldsquare.com/the-impact-of-web-scraping/. [Accessed: 06-Apr-2020].
Benoit Bernard, “Web Scraping and Crawling Are Perfectly Legal, Right?,” Benoit Bernard, 24-Apr-2017. [Online]. Available: https://benbernardblog.com/web-scraping-and-crawling-are-perfectly-legal-right/. [Accessed: 06-Apr-2020].
A. Joy, “Web Scraping using Python [Step by Step Tutorial],” Pythonista Planet, 07-Mar-2020. [Online]. Available: https://pythonistaplanet.com/web-scraping-using-python/. [Accessed: 06-Apr-2020].