This is the bonus assignment. If you already finished the scraper and want to get more experience scraping data from the web, go ahead and implement the web crawler!
Web crawling & HTML scraping
For this exercise, we will practice scraping data from multiple web pages using web crawling. You will learn how to acquire data from multiple pages by following links. You’ll be scraping data from IMDB, the Internet Movie Database!
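To make "following links" concrete, here is a minimal sketch of the first half of a crawl: collecting the movie page URLs from an index page. It uses only the Python 3 standard library rather than the scaffolding's own tools. The class name, function name, base URL, `/title/` prefix, and sample HTML are all assumptions made for illustration, not the real IMDB markup.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class TitleLinkParser(HTMLParser):
    """Collect absolute URLs of <a> tags whose href starts with a prefix."""

    def __init__(self, base_url, prefix):
        super().__init__()
        self.base_url = base_url
        self.prefix = prefix
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.startswith(self.prefix):
            url = urljoin(self.base_url, href)
            if url not in self.urls:  # skip duplicates, keep page order
                self.urls.append(url)


def extract_movie_urls(index_html, base_url="https://www.imdb.com",
                       prefix="/title/"):
    """Return the de-duplicated movie page URLs found in index_html."""
    parser = TitleLinkParser(base_url, prefix)
    parser.feed(index_html)
    return parser.urls


# Hypothetical fragment of an index page (assumed structure, not real markup).
SAMPLE_INDEX = """
<table>
  <tr><td><a href="/title/tt0000001/">First Movie</a></td></tr>
  <tr><td><a href="/title/tt0000002/">Second Movie</a></td></tr>
  <tr><td><a href="/title/tt0000001/">First Movie (poster link)</a></td></tr>
  <tr><td><a href="/chart/top">Chart</a></td></tr>
</table>
"""

movie_urls = extract_movie_urls(SAMPLE_INDEX)
```

The second half of the crawl would download each URL in `movie_urls` and scrape the individual movie page in the same way.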
For this exercise we have provided some scaffolding that handles making HTML backups of a few selected web pages. It also handles writing Unicode to your CSV file. You will provide the code that does the actual scraping of the top 250 movie index page and the code that scrapes the individual movie pages on IMDB. Download the scaffolding here. You should submit your version of this script along with the output CSV file (top250movies.csv), and the two backup HTML files that the script creates.
The goal of this exercise is to create a CSV file containing for each movie in the top 250:
- Title of movie
- Genre (separated by semicolons if multiple)
- Director(s) (separated by semicolons if multiple)
- Writer(s) (separated by semicolons if multiple)
- Actors (only the first three actors listed, separated by semicolons)
- Number of Ratings
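As a rough sketch of how one scraped movie could be turned into a CSV row with the fields listed above: the dictionary keys, column names, and sample values below are placeholders chosen for illustration, and the scaffolding may use a different header or writer.

```python
import csv
import io

# Hypothetical column names; the scaffolding may define its own header.
FIELDS = ["title", "genres", "directors", "writers", "actors", "n_ratings"]


def movie_to_row(movie):
    """Flatten one movie dict into a list of CSV cell values."""
    return [
        movie["title"],
        "; ".join(movie["genres"]),
        "; ".join(movie["directors"]),
        "; ".join(movie["writers"]),
        "; ".join(movie["actors"][:3]),  # only the first three actors
        movie["n_ratings"],
    ]


# Placeholder data, not scraped from IMDB.
example_movie = {
    "title": "Example Movie",
    "genres": ["Drama", "Crime"],
    "directors": ["Director A"],
    "writers": ["Writer A", "Writer B"],
    "actors": ["Actor 1", "Actor 2", "Actor 3", "Actor 4"],
    "n_ratings": 12345,
}

row = movie_to_row(example_movie)

# Write to an in-memory buffer here; a real script would open a file instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(FIELDS)
writer.writerow(row)
csv_text = buffer.getvalue()
```

Joining multi-valued fields inside a single cell keeps one row per movie, which is the layout this exercise asks for.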
Make sure that your Python script produces clean results that do not need additional cleanup. For example, the entry “Writer” should only contain “Stephen King; Frank Darabont”, not “Stephen King (short story “Rita Hayworth and Shawshank Redemption”); Frank Darabont (screenplay)”.
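One possible way to do this kind of cleanup is sketched below. It assumes that the unwanted annotation always starts at the first opening parenthesis and that names themselves never contain parentheses. The function names are hypothetical.

```python
def clean_credit(raw):
    """Drop a trailing parenthesised annotation and normalise whitespace."""
    name = raw.split("(", 1)[0]   # keep only the text before the first "("
    return " ".join(name.split())  # collapse runs of whitespace, trim ends


def clean_credits(raw_names):
    """Clean each name, skip empties and duplicates, and join with '; '."""
    cleaned = []
    for raw in raw_names:
        name = clean_credit(raw)
        if name and name not in cleaned:
            cleaned.append(name)
    return "; ".join(cleaned)


# The raw writer credits from the example above.
raw_writers = [
    'Stephen King (short story "Rita Hayworth and Shawshank Redemption")',
    "Frank Darabont (screenplay)",
]

writers = clean_credits(raw_writers)
print(writers)  # Stephen King; Frank Darabont
```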
Your output should look like
- Have a look at the pattern.web.Element documentation and look for the attribute selectors; they come in handy for this exercise.
- The script will likely be slow, but that is no problem.
You are allowed to create a script from scratch that crawls another data source, but it should be of similar complexity to the IMDB data: at least 5 attributes per entry and at least 200 entries. Also, to collect all information you should at least follow links to other web pages (not just scrape a single web page). If you are in doubt whether your data source is complex enough, please ask in class.
Add a folder “Crawling” to your DataProcessing/Homework repository containing the following files: