Exploring Web Mining

Web Scrapers

A web scraper, as described by Bonifacio, Barchyn, Hugenholtz, and Kienzle (2015), accesses the target website or page over the Hypertext Transfer Protocol (HTTP) and then searches it for the desired content based on the parameters provided by the user or crawler. The scraper then downloads and parses the results, transforming them into a structure suitable for analysis, also according to pre-established parameters. Since a user is interested only in useful content, irrelevant content needs to be filtered out. Bhardwaj and Mangat (2014) refer to irrelevant content as “noisy” data, which includes advertisements, copyright and privacy statements, logos, and navigational panels, along with headers, footers, and search boxes, while Matsudaira (2014) refers to it as the page chrome.
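The fetch-filter-parse cycle described above can be illustrated with a minimal Python sketch. The requests and Beautiful Soup libraries are assumed, and the target URL and the set of elements treated as noise are illustrative placeholders rather than part of any cited scraper.

    # Minimal sketch of the fetch-filter-parse cycle; URL and "noisy"
    # element list are illustrative assumptions.
    import requests
    from bs4 import BeautifulSoup

    url = "https://example.com/"              # placeholder target page
    response = requests.get(url, timeout=10)  # access the page over HTTP
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")

    # Strip elements commonly treated as noise (page chrome).
    for noisy in soup.find_all(["script", "style", "nav", "header", "footer", "aside"]):
        noisy.decompose()

    # Keep only the content of interest, e.g., paragraph text.
    paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
    print(paragraphs)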

While filtering irrelevant data is a challenge, the main initial challenge, according to Kokkoras, Ntonas, and Bassiliades (2013), was the unstructured (or at best semi-structured) nature of web content: for the content to be usable in an application or inserted into a database, it needed to be transformed into a structured dataset. Early scrapers assumed a linear relationship between HTML elements and were designed to detect HTML tags. As the web evolved, browser makers increasingly conformed to the World Wide Web Consortium (W3C) recommendations to render web pages according to the Document Object Model (DOM). Today, well-structured HTML pages are quite common and are better represented by tree structures than by linear ones. Common extractable structured formats include comma-separated values (CSV), tab-separated values (TSV), Extensible Markup Language (XML), and JavaScript Object Notation (JSON), the latter being the newest standard structured format used in data analytics. Kokkoras et al. further explained that web content creators increasingly use tools that employ templates and databases (e.g., WordPress), making much of the content readily extractable in one of the aforementioned formats.
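As a brief illustration of turning tree-structured (DOM) markup into one of these structured formats, the following Python sketch parses a small template-like HTML fragment and emits JSON. The fragment, tag names, and class names are illustrative assumptions, and the Beautiful Soup library is assumed.

    # Sketch of converting semi-structured HTML into structured JSON;
    # the fragment and its class names are contrived for illustration.
    import json
    from bs4 import BeautifulSoup

    html = """
    <ul class="products">
      <li class="product"><span class="name">Widget</span><span class="price">9.99</span></li>
      <li class="product"><span class="name">Gadget</span><span class="price">14.50</span></li>
    </ul>
    """

    soup = BeautifulSoup(html, "html.parser")
    records = []
    for item in soup.select("li.product"):    # walk the DOM tree, not a linear tag stream
        records.append({
            "name": item.select_one("span.name").get_text(),
            "price": float(item.select_one("span.price").get_text()),
        })

    # Structured output ready for a database or analysis tool.
    print(json.dumps(records, indent=2))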

Web Scraper Components

For the most part, research on web scraper components is intertwined in the literature with research on web crawler components, as previously discussed, and most of it is intended for large-scale enterprise crawling tasks such as those used by search engines. Three notable exceptions may provide a foundation for those interested in the extraction and parsing requirements needed to get started scraping web pages. The first is Ducky, described by Kanaoka, Fujii, and Toyama (2014), which uses Cascading Style Sheets (CSS) selectors rather than HTML tags in its extraction rules and relies largely on CSS identifiers and class names; the second is DEiXTo, described by Kokkoras et al. (2013), which uses a browser emulator that matches structural patterns defined by the user; and the third is OXPath, described by Grasso, Furche, and Schallhart (2013), which extends XPath, the XML Path Language used for selecting nodes from an XML document. Ducky and DEiXTo are frameworks that provide a graphical user interface (GUI) in which the user can enter extraction rules and other parameters with minimal knowledge of the underlying language, yet both are customizable depending on the user’s level of programming skill. Because all three are supervised approaches, the performance of each depends entirely on how well the user can identify the best structure of the site he or she is interested in scraping and articulate that structure to the scraper.
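The difference between CSS-selector extraction rules (as in Ducky) and XPath expressions (as in OXPath) can be seen in the short Python sketch below. The HTML fragment and the selector strings are illustrative, the Beautiful Soup and lxml libraries are assumed, and neither tool’s actual rule syntax is reproduced here.

    # Contrasting CSS-selector and XPath selection over the same fragment;
    # both selectors below are illustrative.
    from bs4 import BeautifulSoup
    from lxml import html as lxml_html

    fragment = '<div id="listing"><a class="title" href="/a">First</a><a class="title" href="/b">Second</a></div>'

    # CSS-selector extraction rule.
    soup = BeautifulSoup(fragment, "html.parser")
    css_titles = [a.get_text() for a in soup.select("div#listing a.title")]

    # Equivalent XPath extraction rule.
    tree = lxml_html.fromstring(fragment)
    xpath_titles = tree.xpath('//div[@id="listing"]/a[@class="title"]/text()')

    print(css_titles)    # ['First', 'Second']
    print(xpath_titles)  # ['First', 'Second']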

Web Scraping Strategies

Modern web scraping strategies rely heavily on a properly implemented DOM to extract relevant content and wrap it in accordance with user-defined rules. Whether the wrappers chosen and configured in the extraction rules should be HTML delimiters, CSS selectors, or something else depends on the structure of the site being scraped and the format of the desired output, bearing in mind that not all sites conform to DOM best practices. Therefore, defining extraction rules for a given scraping task may initially involve considerable trial and error and may be limited to a specific website (Kokkoras et al., 2013). According to Matsudaira (2014), the most important part of parsing extracted data is the ability of the parser to handle messy markup. The author further pointed out that the parser also needs to handle plain ASCII as well as Unicode text, and concluded by sharing a list of available parsers (a short example of handling messy markup appears after the list). These include:

  • Scrapy (http://scrapy.org/)
  • Beautiful Soup (http://www.crummy.com/software/BeautifulSoup/)
  • Boilerpipe (http://boilerpipe-web.appspot.com/)
  • Readability API (https://www.readability.com/developers/api)
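
To illustrate how a tolerant parser copes with messy markup and Unicode text, the following minimal Python sketch uses Beautiful Soup, one of the tools listed above; the malformed fragment is contrived for the example.

    # Parsing intentionally messy, Unicode-bearing markup with a lenient parser.
    from bs4 import BeautifulSoup

    messy = "<p>Caf\u00e9 review<p>Missing closing tags<br><b>and unbalanced <i>nesting</b></i>"
    soup = BeautifulSoup(messy, "html.parser")   # tolerant of malformed HTML

    for p in soup.find_all("p"):
        print(p.get_text())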

An alternative way of extracting desired content from specific websites is through the site’s application programming interface (API). Many large websites, such as Google, Yahoo, Facebook, and Twitter, make their APIs available for a wide variety of data collection tasks, and each API includes specifications for variables, data structures, and object classes that facilitate interaction with other software components. Because each has its own use policies and limitations, a discussion of them is outside the scope of this project, and the reader should refer to the site’s documentation for available options (Mair & Chamberlain, 2014).
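In general terms, API-based collection follows the pattern sketched below in Python with the requests library. The endpoint, query parameters, and token are hypothetical placeholders; real paths, authentication schemes, and rate limits are defined in each site’s API documentation.

    # Generic sketch of API-based collection; endpoint, parameters, and token
    # are hypothetical placeholders, not any particular site's API.
    import requests

    endpoint = "https://api.example.com/v1/search"       # hypothetical endpoint
    params = {"q": "web scraping", "count": 10}          # hypothetical query parameters
    headers = {"Authorization": "Bearer YOUR_API_TOKEN"} # placeholder credential

    response = requests.get(endpoint, params=params, headers=headers, timeout=10)
    response.raise_for_status()

    data = response.json()   # structured JSON per the API's documented schema
    print(data)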