Welcome to Data Processing. In this class we will use a variety of tools that require some initial configuration. To make sure everything goes smoothly later on, we will set up most of those tools in this homework. While some of this will likely be dull, doing it now lets us do more exciting work in the weeks that follow without getting bogged down in further software configuration. This homework is not graded and has no group reflection phase. However, it is essential that you complete it on time, since it enables us to set up your accounts.
First you will learn how to get started with Git in order to submit your homework. Please refer to this GitHub manual and answer the questions before proceeding with this homework. Structure your repository as follows: make two folders, Homework and Design. In each folder you will make a new folder every week, named with the week number, and put your documents from that week in there. Specific instructions on what should be committed each week can be found in the assignment descriptions. Please submit the URL of your repository below on this page.
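The folder layout described above can also be created with a few lines of Python. This is only a sketch: the function name make_course_folders and the week folder names (week_1, week_2, ...) are assumptions for illustration, so use whatever naming your assignment description prescribes.

```python
import os


def make_course_folders(root, weeks):
    """Create Homework/ and Design/ under root, each with one folder per week.

    The 'week_<n>' naming is an assumption; adapt it to the course's convention.
    """
    created = []
    for top in ("Homework", "Design"):
        for week in weeks:
            path = os.path.join(root, top, "week_%d" % week)
            if not os.path.isdir(path):
                os.makedirs(path)  # also creates the Homework/Design parent
            created.append(path)
    return created


# Example (run from inside your cloned repository):
#     make_course_folders(".", [1])
```

Keep in mind that Git does not track empty folders, so a week folder only shows up on GitHub once it contains at least one committed file.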
From the second week on you will publish your visualizations using GitHub pages. Refer to this GitHub Pages Manual and copy the link of your page to the README.
Text editor and console
Those of you who have followed Programming 1 and 2 are used to working with CS50’s IDE. In this course we remove those training wheels. This means you have to download and install a text editor on your own computer that suits your operating system and interface preference. Free text editors with syntax highlighting (colors that highlight your code) include Sublime Text 2 (limitless trial for Windows/Mac/Linux) and Notepad++. Besides a text editor you will also need to work with the command prompt or terminal. On Windows, type ‘cmd’ in the search bar if you don’t know where to find it. On a Mac, search for ‘terminal’.
Installing Python and the pattern library
In this course we will be working with Python to gather and process datasets. There are currently two flavors of the Python language, Python 2 and Python 3. For this class (and currently in most science and business) Python 3 is not used. We require that you install version 2.7.x of Python (any 2.7 release). Note that there are real incompatibilities between Python 2 and Python 3, so installing the wrong one will break most of the Python exercises in this course.
If you are a Windows user and have no pre-existing Python installation, we recommend that you install Python(x, y). This distribution includes Python 2.7 (the version of Python we use) and comes with all common scientific libraries, which saves you a lot of trouble in case you need NumPy, SciPy or Matplotlib later.
If you use OS X, you already have Python, so you only need to check which version. You do this by opening a Terminal (included in every Mac) and typing python followed by pressing return. This will start Python, and if your version of Mac OS X is recent enough you will have Python 2.7 already. If you don’t have Python 2.7, you should install it (i.e. download it from the Python website and install it).
If you are using a version of Linux, Python is probably already installed; if it is not, you can install it through the package manager of your Linux distribution.
Once you have successfully installed Python, you will be able to run python from your command line. On Windows you can get a command line by starting the “Command Prompt” and on a Mac by starting the “Terminal”. For Linux you should also look for the terminal (emulator) program.
Starting Python should look something like this:
~$ python
Python 2.7.3 (default, Sep 26 2012, 21:51:14)
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>
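If you prefer to check the version from a script rather than reading the startup banner, something like the following should work. It is a minimal sketch; the helper name is_python_27 is made up for this example and is not part of any library.

```python
import sys


def is_python_27(version_info=None):
    """Return True if the given (or the running) interpreter is Python 2.7."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only the major and minor number, so any 2.7.x release matches.
    return tuple(version_info[:2]) == (2, 7)


# Print the full version string and whether it is the version this course needs.
print(sys.version)
print("Python 2.7: %s" % is_python_27())
```

Save this as a file (for example check_version.py, a name chosen here for illustration) and run it with python check_version.py. Running python --version on the command line also prints the version without starting the interpreter.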
Installing the Pattern Library
The Pattern library supports web scraping and provides interfaces to many streaming APIs. To install Pattern, go to its website and follow the instructions. There you can download the library as a zip file. Simply extract this zip, open a terminal in the directory that contains the unpacked files, and execute the following command (from the command line, not from inside Python or IDLE):
~$ python setup.py install
This will run the script stored in “setup.py” which installs the library.
Make sure that you run this with administrator privileges (add sudo at the front on Mac and Linux, and use an Administrator command prompt on Windows)!
For details see the pattern website.
Executing Python Code
Being an interpreted language, Python requires an interpreter to execute a
script. You’ve already executed a script for the pattern installation! The
interpreter is called by typing python at the command line. If a file is passed
as the first argument (e.g.,
python myfile.py), the interpreter will execute
the specified script. If no parameters are given, the interpreter will launch
in interactive mode where you can type individual commands one at a time.
Interactive mode is similar to using the prompt in Matlab in the sense that
variables can be created, functions can be called, etc. In general, most things
that can be done via script can be done in interactive mode. However, in
practice, scripts are more common.
The usual process is therefore as follows: you edit a file, e.g., hello_world.py, in the editor of your choice (emacs, Notepad++, or a complete IDE like PyDev). You then run python and give your script as a parameter. Here is a simple example.
import math
a = math.ceil(math.pow(3, 2) * 19)
print "Hello World, and welcome to CS ", int(a), "!"
To get your feet wet with Python, go through this and this page of the Python tutorial and play around with strings, lists, control flow, functions, etc.
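As a starting point for playing around, here is a small warm-up sketch that touches strings, lists, control flow and functions. The function names word_lengths and longest_word are invented for this example. Each print call takes a single parenthesized argument, so the same lines should run unchanged under Python 2.7.

```python
def word_lengths(sentence):
    """Return a list of (word, length) pairs for each word in the sentence."""
    pairs = []
    for word in sentence.split():        # strings: split on whitespace
        pairs.append((word, len(word)))  # lists: grow with append
    return pairs


def longest_word(sentence):
    """Return the longest word, or None if the sentence has no words."""
    longest = None
    for word, length in word_lengths(sentence):
        if longest is None or length > len(longest):  # control flow
            longest = word
    return longest


print(word_lengths("welcome to data processing"))
print(longest_word("welcome to data processing"))
```

Try changing the sentence, or typing the function bodies line by line in interactive mode, to see how each piece behaves.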
Check whether Pattern is correctly installed
Copy the following code to a script and run it with python:
from pattern.web import Twitter, plaintext

for tweet in Twitter().search('"more important than"', cached=False):
    print plaintext(tweet.description)
If you get output similar to this, everything is fine:
$ python twitter_test.py
RT @BillSimmons: I love when coaches decide that their gimmick "system" is more important than figuring out to put their best players in the best situation.
RT @ri_zugg: Finally realizing that God is more important than everything else. #nowtoputittoplay
Note that the content you see will be different, as you are getting real tweets through the Twitter API!
If this happens:
$ python twitter_test.py
Traceback (most recent call last):
  File "twitter_test.py", line 1, in <module>
    from pattern.web import Twitter, plaintext
  File "/home/cs171/twitter_test.py", line 1, in <module>
    from pattern.web import Twitter, plaintext
ImportError: No module named web
then Python can’t find your Pattern modules. There are several fixes described on the Pattern start site. The easiest one is to grab the “pattern” folder from the zip and put it in the same directory as your Python script. If you’re still having trouble getting things working, just ask for help.
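To narrow down an import problem before asking for help, a short check like the one below may be useful. It is a sketch, not part of Pattern: the helper can_import is a name chosen for this example, and it only reports whether the interpreter you are running can import a module.

```python
import importlib


def can_import(module_name):
    """Return True if module_name can be imported by this interpreter."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


# "pattern.web" is the module that the test script imports from.
for name in ("pattern", "pattern.web"):
    print("%s importable: %s" % (name, can_import(name)))
```

Run the check from the same directory as your script, because Python also searches the directory of the script it is running. That is why copying the “pattern” folder next to your script works as a fix.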