Developing a Web Scraper

Conclusion

While the scraper application presented here is not yet production-ready, it offers a foundation for further research into how such tools can capture the data needed to make informed decisions about starting a new company or launching a new product line. The project was undertaken for an e-learning publisher startup, and its results have had a profound influence on the founder’s perspective of what the process entails. It also raised several important concerns, which are outlined below.

The ethical and security issues surrounding web mining are unclear at best. The literature provides numerous examples of legitimate uses of web mining techniques, which are essential to search engines and help scientists, marketers, executives, researchers, entrepreneurs, educators, and consumers find the data they need to make informed decisions. However, crawling and scraping websites also raises concerns about privacy, copyright, and other property rights. Invasion of privacy is generally not an issue with polite crawlers because the data they collect contains no personally identifying information. As for copyright, users of data extracted by polite crawlers should cite the source and, if there is any doubt, seek permission to use it. Polite crawlers also respect the target site’s resources and comply with its acceptable use policy (AUP) and with the Robots Exclusion Protocol (REP) directives in its robots.txt file.
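As a rough illustration of what "polite" can mean in practice, the sketch below uses Python's standard `urllib.robotparser` module to check whether a URL may be fetched and how long to wait between requests. It is not the scraper described in this article: the robots.txt content, the `ExampleScraper` user agent, the example.com URLs, and the helper names are all hypothetical, and the rules are parsed from a string so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download the
# target site's own file (for example with set_url() and read()).
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

USER_AGENT = "ExampleScraper"  # hypothetical user-agent name


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


def delay_seconds(parser: RobotFileParser, user_agent: str = USER_AGENT,
                  default: float = 1.0) -> float:
    """Return the site's requested crawl delay, or a default if none is given."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


rules = load_rules(SAMPLE_ROBOTS_TXT)
for url in ("https://example.com/catalog.html",
            "https://example.com/private/data.html"):
    verdict = "allowed" if is_allowed(rules, url) else "disallowed"
    print(f"{url}: {verdict}")
```

A crawler built along these lines would skip any disallowed URL and sleep for `delay_seconds(rules)` between requests to the same host. Checking robots.txt does not replace reading the site's AUP, which is usually written for people rather than machines.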

Unfortunately, not all crawlers are polite. Some are used maliciously by hackers, spammers, and identity thieves, and all of these pose security threats. Privacy violations can occur, for example, if a crawler breaches a restricted area of a website that contains personally identifying data. Such a breach is not only unethical; it may also violate the privacy rights of the individuals whose data was compromised. Data protected by the Health Insurance Portability and Accountability Act (HIPAA) or the Family Educational Rights and Privacy Act (FERPA) that is acquired in this way would be considered illegally obtained. A crawler that overwhelms a server and causes a denial of service (DoS), or one that does not comply with the AUP or REP, is unethical and may violate the rights of the site owner.

Scalability will likely become an issue with MySQL because of the resources needed to handle multiple concurrent queries. In addition, even though the process can be automated with cron jobs, a person or department still needs to oversee it to ensure that the application is capturing the needed data accurately and is not overloading the resources of either the scraping host or the target site. And as target websites change their layouts, the engines that harvest their content must change with them.
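Part of that oversight can be supported by a simple sanity check on each batch of scraped records. The following is a minimal sketch, not the application's actual code: the field names in `REQUIRED_FIELDS`, the 20% threshold, and the function names are assumptions chosen for illustration. The idea is that a sudden rise in records with missing fields often signals that the target site's layout has changed and the extraction rules need attention.

```python
# Hypothetical field names; a real scraper would list its own required columns.
REQUIRED_FIELDS = ("title", "price", "url")


def missing_fields(record: dict) -> list:
    """Return the required fields that are absent or blank in one record."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or str(value).strip() == "":
            missing.append(field)
    return missing


def failure_rate(records: list) -> float:
    """Fraction of records with at least one missing required field."""
    if not records:
        return 1.0  # an empty batch is itself a warning sign
    bad = sum(1 for record in records if missing_fields(record))
    return bad / len(records)


def needs_review(records: list, threshold: float = 0.2) -> bool:
    """Flag a batch for human review when too many records are incomplete."""
    return failure_rate(records) > threshold


batch = [
    {"title": "Course A", "price": "19.99", "url": "https://example.com/a"},
    {"title": "", "price": "24.99", "url": "https://example.com/b"},
    {"title": "Course C", "price": None, "url": "https://example.com/c"},
    {"title": "Course D", "price": "9.99", "url": "https://example.com/d"},
]
print(f"failure rate: {failure_rate(batch):.2f}, review needed: {needs_review(batch)}")
```

A check like this could run at the end of each cron-scheduled scrape and notify whoever oversees the process; it flags suspicious batches but cannot confirm that the captured values are correct.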

 
Kathleen Marrs, Ph.D.
Kathleen wants to live in a world filled with open books, open source, open hearts, and open minds in which diversity is embraced and creativity flourishes.

A longtime CPA turned online professor, Kathleen saw her life transformed upon completing her dissertation, “An Investigation of the Factors that Influence Faculty and Student Acceptance of Mobile Learning in Online Higher Education.”

Her statistical analysis was called “pioneering” by her committee chair, Dr. Marlyn K. Littman, and brought Kathleen full circle back to her number-crunching roots, inspiring her to earn a second master’s in Business Intelligence.

Kathleen plans to continue her studies of contemporary issues related to teaching, learning, and technology, and loves to help undergrad and grad students achieve their academic and professional goals. As a lifelong learner, she also plans to continue her quest to understand the problems posed by mobile and micro learning formats and to find innovative ways of helping people maximize the benefits these emerging technologies afford.
 
