Web scraping is the name given to the process of extracting structured data from third-party websites. In other words, it’s a way to capture specific information from one or more websites without also copying unwanted or unrelated information. It’s a common practice that has a lot of potential applications and a murky legal profile.
What to know about web scraping
Web scraping is usually an automated process, but it doesn’t have to be; data can be scraped from websites manually, by humans, though that’s slow and inefficient. More commonly, scraping is performed by software designed specifically for this purpose, generally built from two main components. A crawler is a program that browses the internet and indexes the content of interest, and it passes this information on to the scraper.
The scraper is designed to locate the relevant structured information using markers called data locators. These locators indicate the presence of the data, which the scraper then extracts and stores offline in a spreadsheet or database for processing or analysis.
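The crawler half of this design can be sketched in a few lines. The example below is a minimal, self-contained illustration: instead of fetching pages over HTTP, it reads from a hypothetical in-memory dictionary of pages (the URL and HTML are invented for the demo), and it uses Python's standard-library HTML parser to index the links a scraper would then visit.

```python
from html.parser import HTMLParser

# Hypothetical in-memory "web": page URLs mapped to their HTML.
# A real crawler would fetch these over HTTP instead.
PAGES = {
    "https://shop.example/index": (
        '<a href="/item/1">Widget</a><a href="/item/2">Gadget</a>'
    ),
}

class LinkCrawler(HTMLParser):
    """Indexes the links found on a page so a scraper can visit them later."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Record the destination of every <a href="..."> link on the page.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

def crawl(url):
    """Return the list of links indexed on the given page."""
    crawler = LinkCrawler()
    crawler.feed(PAGES[url])
    return crawler.links

print(crawl("https://shop.example/index"))  # ['/item/1', '/item/2']
```

In a real system, the crawler would queue each discovered link, fetch it, and hand the page content to the scraper for extraction.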
One simple example of web scraping: Consider a website that aggregates pricing information for retail products so shoppers can see which retailers have the best prices. A scraper can be programmed to index the product pages at every major retailer, with the scraper then visiting each page and using data locators to zero in on just the price field while ignoring all the other data on the page – product description, reviews, and so on. The scraper can be run daily to update the aggregator with the latest pricing information from around the web.
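The locator-driven extraction described above can be sketched as follows. This is a minimal illustration, not a production scraper: the product HTML and its class names (`price`, `description`, `reviews`) are assumptions invented for the demo, with the `price` class acting as the data locator.

```python
from html.parser import HTMLParser

# Hypothetical product page; the class names are assumptions for this sketch.
PRODUCT_HTML = """
<div class="description">A rugged stainless-steel widget.</div>
<span class="price">$19.99</span>
<div class="reviews">4.5 stars (212 reviews)</div>
"""

class PriceScraper(HTMLParser):
    """Uses class="price" as the data locator; all other fields are ignored."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.price = None

    def handle_starttag(self, tag, attrs):
        # The data locator: an element carrying the "price" class.
        if ("class", "price") in attrs:
            self._in_price = True

    def handle_data(self, data):
        # Capture only the text inside the located element.
        if self._in_price:
            self.price = data.strip()
            self._in_price = False

scraper = PriceScraper()
scraper.feed(PRODUCT_HTML)
print(scraper.price)  # $19.99
```

Run daily against each retailer's product pages, the extracted prices could then be written to a spreadsheet or database, as the article describes.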
How web scraping is used
Because there is an enormous variety of data online, there is a wide variety of applications for web scraping. Here are some of the most common uses:
Price intelligence: As in the example above, many web scrapers are designed to monitor prices on retail sites. Retailers might use this to track prices at competitor sites, or the data might be used for competitive analysis, trend monitoring, or sold as a service to other businesses.
Real estate: Similarly, web scrapers commonly target real estate sites to monitor rental and sale prices, appraise property values in a given region, and conduct market analysis.
Lead generation: Marketers commonly use web scraping to generate leads by scraping structured data from websites like LinkedIn.
Sentiment analysis: Brands also use web scraping to understand how their products and services are being talked about online, collecting posts that mention their name from social media sites like Facebook and Twitter.
The legality of web scraping
There’s no easy answer to the question of web scraping’s legality. The technology has faced legal challenges dating back to 2000, when online auction site eBay sought an injunction (which the court granted) against a site called Bidder’s Edge for scraping its auction data.