Web-scraping

Earlier in the semester, I was trying to collect newspaper articles from online archives. When the structure of the archives changed from PDF to webpage links, I needed to find a way to automate the retrieval process.

Professor Settle then introduced me to Professor Van Der Veen, who uses web-scraping in his own research. He also held a workshop that walked through a web-scraping tutorial (linked in the references below).

Web-scraping means to “automatically get some information from a website instead of manually copying it.” There are several ways to go about doing this. We used Python along with several packages and tried it out on the William & Mary Government Department website.
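To make the idea concrete, here is a minimal sketch of what "automatically getting information" can look like in Python. It is not the workshop's code: it uses only the standard library's html.parser, and the page content, the LinkCollector class, and the extract_links function are all made up for illustration. The sample HTML stands in for a page that would normally be downloaded first.

```python
from html.parser import HTMLParser

# Hypothetical page content, standing in for a downloaded department page.
SAMPLE_HTML = """
<html><body>
  <h1>Faculty</h1>
  <ul>
    <li><a href="/people/smith">Jane Smith</a></li>
    <li><a href="/people/jones">Sam Jones</a></li>
  </ul>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> tag currently open, if any
        self._text = []    # pieces of text seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return every (href, text) pair found in the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


links = extract_links(SAMPLE_HTML)
for href, text in links:
    print(href, text)
```

For a live page, the HTML string could come from urllib.request.urlopen(url).read().decode() instead of the sample above. Third-party packages can make the parsing step shorter, but the idea stays the same: download the page, then pull out only the pieces you want.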

One aspect of web-scraping involves using Firebug, a Mozilla Firefox add-on that gives access to a variety of web development tools. Once you open a webpage and click on Firebug, the web development window at the bottom of the screen shows which HTML code corresponds to each part of the webpage.

[Screenshot: Firebug]
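Knowing which HTML tag holds the information is what lets a script target it. The sketch below assumes the inspector showed that each name sits inside a span with the class "faculty-name". That class name, the sample markup, and the ClassTextCollector helper are hypothetical, and the helper assumes the target tags are not nested inside one another.

```python
from html.parser import HTMLParser

# Hypothetical markup; the class names here are made up for illustration.
SAMPLE_HTML = """
<div class="person">
  <span class="faculty-name">Jane Smith</span>
  <span class="title">Professor</span>
</div>
<div class="person">
  <span class="faculty-name">Sam Jones</span>
  <span class="title">Lecturer</span>
</div>
"""


class ClassTextCollector(HTMLParser):
    """Collect the text of every <tag> element carrying a given class.

    Assumes matching elements do not contain another element of the same tag.
    """

    def __init__(self, tag, class_name):
        super().__init__()
        self.tag = tag
        self.class_name = class_name
        self.results = []
        self._inside = False  # True while inside a matching element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # The class attribute may hold several space-separated names.
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.class_name in classes:
            self._inside = True
            self._buffer = []

    def handle_data(self, data):
        if self._inside:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._inside and tag == self.tag:
            self.results.append("".join(self._buffer).strip())
            self._inside = False


collector = ClassTextCollector("span", "faculty-name")
collector.feed(SAMPLE_HTML)
names = collector.results
print(names)
```

Changing the two arguments to ClassTextCollector is all it takes to pull out a different part of the page, which is why finding the right tag and class in the inspector comes first.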

Personally, I have yet to make it all the way through the tutorial, because none of my attempts at programming ever seem to go smoothly, no matter what I try.

Although I’m no longer interested in getting the newspaper articles that originally made me want to learn web-scraping, it remains a useful skill that is worth learning.

References

http://www.sciedupress.com/journal/index.php/air/article/view/1390

http://stair.wm.edu/scraping.html

http://getfirebug.com/