
Web scraping

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites.[1] Web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.

For broader coverage of this topic, see Data scraping.

Scraping a web page involves fetching it and extracting data from it. Fetching is the downloading of a page (which a browser does when a user views a page). Therefore, web crawling is a main component of web scraping, to fetch pages for later processing. Once fetched, extraction can take place. The content of a page may be parsed, searched and reformatted, and its data copied into a spreadsheet or loaded into a database. Web scrapers typically take something out of a page to make use of it for another purpose somewhere else. An example would be finding and copying names and telephone numbers, companies and their URLs, or e-mail addresses to a list (contact scraping).
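
As a minimal sketch of this fetch-then-extract flow, the following Python snippet uses only the standard library; the URL is a placeholder, and the regular expression is a deliberately simple illustration of contact scraping:

    import re
    import urllib.request

    # Hypothetical target page; substitute a page you are permitted to scrape.
    URL = "https://example.com/contact"

    # Fetch: download the raw HTML, much as a browser would.
    with urllib.request.urlopen(URL) as response:
        html = response.read().decode("utf-8", errors="replace")

    # Extract: pull out e-mail-like strings and de-duplicate them.
    emails = sorted(set(re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", html)))

    for address in emails:
        print(address)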


As well as contact scraping, web scraping is used as a component of applications used for web indexing, web mining and data mining, online price change monitoring and price comparison, product review scraping (to watch the competition), gathering real estate listings, weather data monitoring, website change detection, research, tracking online presence and reputation, web mashup, and web data integration.


Web pages are built using text-based mark-up languages (HTML and XHTML) and frequently contain a wealth of useful data in text form. However, most web pages are designed for human end-users and not for ease of automated use. As a result, specialized tools and software have been developed to facilitate the scraping of web pages. Web scraping applications include market research, price comparison, content monitoring, and more. Businesses rely on web scraping services to efficiently gather and utilize this data.
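
Because the useful data is embedded in markup, a typical scraper parses the HTML structure rather than treating the page as plain text. A small sketch using Python's built-in html.parser, with a hard-coded snippet standing in for a fetched page:

    from html.parser import HTMLParser

    class LinkExtractor(HTMLParser):
        """Collect the href attribute of every <a> tag encountered."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    # A tiny hard-coded document standing in for a downloaded page.
    sample = '<html><body><a href="https://example.com/a">A</a> <a href="/b">B</a></body></html>'

    parser = LinkExtractor()
    parser.feed(sample)
    print(parser.links)  # ['https://example.com/a', '/b']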


Newer forms of web scraping involve monitoring data feeds from web servers. For example, JSON is commonly used as a transport mechanism between the client and the web server.
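
As a hedged sketch, assuming a JSON endpoint at a placeholder URL with a made-up schema, such a feed can be polled and decoded with the Python standard library:

    import json
    import urllib.request

    # Hypothetical endpoint; real services document their own URLs and response formats.
    FEED_URL = "https://example.com/api/prices.json"

    with urllib.request.urlopen(FEED_URL) as response:
        payload = json.load(response)  # decode the JSON body into Python objects

    # The keys below are illustrative only; adjust them to the actual feed schema.
    for item in payload.get("items", []):
        print(item.get("name"), item.get("price"))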


There are methods that some websites use to prevent web scraping, such as detecting and disallowing bots from crawling (viewing) their pages. In response, there are web scraping systems that rely on techniques such as DOM parsing, computer vision, and natural language processing to simulate human browsing, enabling web page content to be gathered for offline parsing.
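
One common DOM-parsing approach drives a real browser so that pages, including JavaScript-generated content, are rendered as a human visitor would see them. The sketch below assumes the third-party Selenium package and a matching Chrome driver are installed; the URL and CSS selector are placeholders:

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # render pages without opening a window

    driver = webdriver.Chrome(options=options)
    try:
        driver.get("https://example.com/listings")  # placeholder URL
        # Read text out of the rendered DOM for offline processing.
        for element in driver.find_elements(By.CSS_SELECTOR, ".listing .price"):
            print(element.text)
    finally:
        driver.quit()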

After the birth of the World Wide Web in 1989, the first web robot,[2] World Wide Web Wanderer, was created in June 1993; it was intended only to measure the size of the web.

In December 1993, the first crawler-based web search engine, JumpStation, was launched. As there were few websites available on the web at that time, search engines then relied on human administrators to collect and format links. In comparison, JumpStation was the first WWW search engine to rely on a web robot.

In 2000, the first Web API and API crawler were created. An API (Application Programming Interface) is an interface that makes it much easier to develop a program by providing the building blocks. In 2000, Salesforce and eBay launched their own APIs, with which programmers could access and download some of the data available to the public. Since then, many websites offer web APIs for people to access their public data.

The administrator of a website can use various measures to stop or slow a bot. Some techniques include:

Blocking an IP address either manually or based on criteria such as geolocation and DNSRBL. This will also block all browsing from that address.

Disabling any web service API that the website's system might expose.

Bots sometimes declare who they are (using user agent strings) and can be blocked on that basis using robots.txt; 'googlebot' is an example. Other bots make no distinction between themselves and a human using a browser.

Bots can be blocked by monitoring excess traffic.

Bots can sometimes be blocked with tools that verify a real person is accessing the site, such as a CAPTCHA. Bots are sometimes coded to explicitly break specific CAPTCHA patterns, or may employ third-party services that use human labor to read and respond to CAPTCHA challenges in real time. Challenges can be triggered because the bot is: 1) making too many requests in a short time, 2) using low-quality proxies, or 3) not covering the web scraper's fingerprint properly.[31]

Commercial anti-bot services: Companies offer anti-bot and anti-scraping services for websites. A few web application firewalls have limited bot detection capabilities as well. However, many such solutions are not very effective.[32]

Locating bots with a honeypot or other method to identify the IP addresses of automated crawlers.

Obfuscation using CSS sprites to display such data as telephone numbers or email addresses, at the cost of accessibility to screen reader users.

Because bots rely on consistency in the front-end code of a target website, adding small variations to the HTML/CSS surrounding important data and navigation elements requires more human involvement in the initial setup of a bot and, if done effectively, may render the target website too difficult to scrape due to the diminished ability to automate the scraping process.

Websites can declare whether crawling is allowed in the robots.txt file and can allow partial access, limit the crawl rate, specify the optimal time to crawl, and more; an illustrative example appears below.

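An illustrative robots.txt follows; the paths are placeholders, and Crawl-delay is a non-standard extension honored only by some crawlers:

    # Rules for a specific crawler, identified by its user agent string
    User-agent: googlebot
    Disallow: /private/

    # Rules for every other crawler
    User-agent: *
    Disallow: /search
    # Ask compliant crawlers to wait 10 seconds between requests (non-standard)
    Crawl-delay: 10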