Crawling

This is the bonus assignment. If you already finished the scraper and want to get more experience scraping data from the web, go ahead and implement the web crawler!

Web crawling & HTML scraping

For this exercise, we will practice web crawling: scraping data from multiple web pages rather than a single one. You will learn how to acquire data from multiple pages by following links. You’ll be scraping data from IMDB, the Internet Movie Database!
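As a rough illustration of the link-following idea, a crawl might look like the sketch below. This is not the assignment's scaffolding: it assumes the requests and BeautifulSoup libraries, and the URL and CSS selector are placeholders you would need to verify against the real pages.

    import requests
    from bs4 import BeautifulSoup

    # Assumed starting point; use the Top 250 index URL from the scaffolding.
    INDEX_URL = "https://www.imdb.com/chart/top/"

    index_soup = BeautifulSoup(requests.get(INDEX_URL).text, "html.parser")

    # Collect the links to the individual movie pages from the index page.
    # The CSS selector is an assumption: inspect the real page to find the right one.
    movie_urls = ["https://www.imdb.com" + a["href"]
                  for a in index_soup.select("td.titleColumn a")]

    # Follow each link and parse the movie page it points to.
    for url in movie_urls:
        movie_soup = BeautifulSoup(requests.get(url).text, "html.parser")
        # ... extract the attributes you need from movie_soup here ...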

For this exercise we have provided some scaffolding that handles making HTML backups of a few selected web pages. It also handles writing Unicode to your CSV file. You will provide the code that does the actual scraping of the top 250 movie index page and the code that scrapes the individual movie pages on IMDB. Download the scaffolding here. You should submit your version of this script along with the output CSV file (top250movies.csv), and the two backup HTML files that the script creates (index.html and movie-083.html).
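The scaffolding already takes care of the CSV writing, but purely as an illustration of what Unicode-safe CSV output looks like in Python 3, a sketch might resemble the following. The function name, header fields, and example row are assumptions for illustration, not the scaffolding's actual code.

    import csv

    def save_csv(outfile, rows):
        """Write a header plus a list of rows (lists of strings) to a CSV file."""
        writer = csv.writer(outfile)
        writer.writerow(["Title", "Rating", "Year", "Actors", "Writer"])  # example header only
        writer.writerows(rows)

    # Opening with encoding="utf-8" and newline="" keeps non-ASCII characters
    # intact and avoids blank lines between rows on Windows.
    with open("top250movies.csv", "w", encoding="utf-8", newline="") as f:
        save_csv(f, [["The Shawshank Redemption", "9.3", "1994",
                      "Tim Robbins; Morgan Freeman", "Stephen King; Frank Darabont"]])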

The goal of this exercise is to create a CSV file containing, for each movie in the Top 250, the following attributes:

Make sure that your Python script produces clean results that do not need additional cleanup. For example, the entry “Writer” should only contain “Stephen King; Frank Darabont”, not “Stephen King (short story “Rita Hayworth and Shawshank Redemption”); Frank Darabont (screenplay)”.
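One possible way to produce such a clean value is to strip the parenthetical notes with a regular expression and rejoin the names. This is a sketch only; the raw string below is the example from the text, and the exact text you scrape may look different.

    import re

    raw = ('Stephen King (short story "Rita Hayworth and Shawshank Redemption"); '
           'Frank Darabont (screenplay)')

    # Remove parenthetical notes such as "(screenplay)" from each name,
    # then rejoin the names with "; ".
    names = [re.sub(r"\s*\([^)]*\)", "", part).strip() for part in raw.split(";")]
    print("; ".join(names))  # -> Stephen King; Frank Darabont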

Your output should look like sample_HTML.csv.

Hints

Alternative exercise

You are allowed to create a script from scratch that crawls another data source, as long as it is of similar complexity to the IMDB data: at least 5 attributes per entry, at least 200 entries, and collecting the information must require following links to other web pages (not just scraping a single web page). If you are in doubt whether your data source is complex enough, please ask in class.

Submitting

Add a folder “Crawling” to your DataProcessing/Homework repository containing the following files:

your version of the Python script
top250movies.csv (the output CSV file)
index.html and movie-083.html (the two backup HTML files created by the script)