Introduction

A computer program is a sequence of instructions, written in a programming language, that combine together to solve a problem or produce a desired result. Computer programs range from very simple examples with only a few lines of source code, to very complicated. For example, Windows 11 is estimated to contain around 60–100 million lines of code.

In general, programming involves the following steps:

  1. A program is written in some programming language (e.g., C++, Python, R, or Javascript), usually stored in one or more source code files.
  2. For compiled languages like C or C++, the source code is converted into machine language by a compiler, then combined into an executable file by a linker. The executable file is run on a target machine and operating system.
  3. For interpreted languages like Python and R, the source code is processed by an interpreter. Individual lines are converted to machine language and executed one by one, as they are encountered within the source code.

This tutorial will provide an introduction to programming using Python, an interpreted programming language. Python was conceived in the late 1980s by Guido van Rossum at Centrum Wiskunde & Informatica (National Research Institute for Mathematics and Computer Science) in the Netherlands. Python 1.0 was released in 1994. The most recent version is 3.10.2, released in January 2022. To maintain backwards compatibility, Python versions 3.0 and 3.1 coincided with versions 2.6 and 2.7. We will be using Python version 3.9.x.

Since Python is an interpreted language, individual lines of code are converted to machine language and executed as they are encountered (versus a compiled language, which converts an entire program to machine code in an explicit compile–link stage). One advantage of an interpreted language is the ability to enter individual commands on a command line prompt, and immediately see their results.

% print( 2 + 5 ) 7 % l = [ 1, 2, 3 ] % print( l ) [1, 2, 3] % print( l[ 0 ] ) 1 % email = { 'healey': 'healey@ncsu.edu', 'rappa': 'mrappa@ncsu.edu' } % print( email ) {'rappa': 'mrappa@ncsu.edu', 'healey': 'healey@ncsu.edu'} % print( email[ 'healey' ] ) healey@ncsu.edu % basket = [ 'orange', 'apple', 'pear', 'apple', 'durian', 'orange' ] % fruit = set( basket ) % print( fruit ) {'durian', 'orange', 'apple', 'pear'} % print( 'durian' in fruit ) True % print( 'pomegranate' in fruit ) False

Unless we're only issuing a few commands once or twice, we normally store the commands in source code files. This allows us to load and execute the commands as often as we want. It also makes it easier to modify a program, or correct it when we discover errors. A slightly modified version of the above code (to force output to appear when it's run) is available in the source code file tut-01-intro.py.

A source code file in Python is called a module. Every Python program includes a main module. When the program starts running, it begins interpreting code in the main module. You can also split your code across additional modules to organize it, or use libraries to access functionality provided by other programmers.

Assignments

The assignment for the Summer 2 session asks you to use Open Weather Map's API (Application Programming Interface) to query forecasted temperature data. You will write your results to a CSV file. The assignment web page provides full details on what is required.

The assignment will be completed as individuals, and not as a homework team. Grading for the assignments will be on a standard 0–100% scale. You will submit the Python code for your completed assignment through Moodle. Look for "Python Programming: OpenWeatherMap API Assignment".

Running Python

For this class, we'll be using the Anaconda Scientific with Python 3 for Windows. We will be using Anaconda's Windows Python 3.7 installer. Anaconda includes the NumPy (numerical Python), pandas (Python data analysis), scikit-learn, urllib (URL management), and BeautifulSoup libraries that we'll be using in this tutorial.

Once Anaconda is installed, I recommend running Anaconda Navigator and choosing either Jupyter notebook or spyder as your programming environment.

On most platforms, Python is normally installed by default. If it is, you can run the Anaconda Prompt (Anaconda 3) and type python at a command line prompt to bring up a Python shell. Make sure you do this using the Anaconda Prompt and not a regular Windows 10 command line, to ensure all the proper PATH variables to appropriate Python packages are set.

Why Python?

Python is a powerful programming language that can perform many functions. For this class, our interest is in Python's data management capabilities. Python offers efficient ways to read data from one or more input files, then modify, correct, convert, or extend that data and write the results to an output file.

For example, suppose we had the following comma-separated value (CSV) file:

Name,Height
Jim,181
Betty,167
Frank,154
...
Annabelle,201

We want to classify each individual based on his or her height hi versus the mean and standard deviation of height μ and σ as:

producing a new CSV output file with contents:

Name,Height,Class
Jim,181,2
Betty,167,1
Frank,154,3
...
Annabelle,201,4

Python isn't the only way to produce this result (e.g., we could probably do it in Excel, or SAS, or R), but it's often easier and faster to do it with a Python program. If you're curious, this source code file classifies a comma-delimited height file. A corresponding (very simple) height file is also available in CSV format.

Variables

Every programming language provides a way to maintain values as a program runs. Usually, this is done by create a named variable, then assigning a value to the variable. In Python, variables are created interactively by specifying their name and assigning them an initial value.

% name = 'Abraham Lincoln' % height = 6.33 % age = 56 % birthplace = 'Hodgenville KY' % born = 'February 12 1809' % deceased = 'April 15 1865'

Unlike languages like C++, Python does not require you to specify a variable's type. This is inferred from the value it maintains. In the above example the variables name, birthplace, born, and deceased are inferred to be strings, height is inferred to be floating point, and age is inferred to be integer. One advantage of Python's dynamically typed variables is that you can change them to hold different types of values whenever you want. You can also ask Python what type of value a variable contains with the type() function.

% name = 'Abraham Lincoln' % print( type( name ) ) <class 'str'> % name = 25 % print( type( name ) ) <class 'int'> % name = 6.3 % print( type( name ) ) <class 'float'> % name = 305127925769258727938193819283 % print( type( name ) ) <class 'int'> % name = False % print( type( name ) ) <class 'bool'>

Here is a quick list of some of Python's basic variable types. More complicated types will be discussed later in the tutorial.

Variable Practice Problem

Write a set of Python statements that assign the names and associated phone numbers: Christopher Healey, 9195138112, Michael Rappa, 9195130480, and Aric LaBarr, 9195132957 to Python variables, then prints three lines listing each person's name and corresponding phone number.

I recommend you write your program using Jupyter notebook or spyder, save it as a Python notebook or source code file, and then test it, rather than writing the program directly in the Python shell. This will let you write your code, run it to see what it does, edit it to fix problems, and run it again, without having to re-type the entire program at the command line.

Variable Assignment Solution

% name_0 = 'Christopher Healey' % phone_0 = '9195138112' % name_1 = 'Michael Rappa' % phone_1 = '9195130480' % name_2 = 'Aric LaBarr' % phone_2 = '9195132957' % % print( name_0, '@', phone_0 ) Christopher Healey @ 9195138112 % % print( name_1, '@', phone_1 ) Michael Rappa @ 9195130480 % % print( name_2, '@', phone_2 ) Aric LaBarr @ 9195132957

You can download the solution file and run it on your machine, if you want.

Your choice of variable names is probably different than ours, and you might have printed the name and phone number with slightly different formatting. Regardless, the basic idea is to use six separate variables to store the names and phone numbers, then print the contents of these variables in combinations that produce the correct output.

You might think, "This works, but it doesn't seem very efficient." That's true. Once you've learned more about Python, it's unlikely you'd write this code to solve the problem. Here's a more elegant and flexible solution. When you've finished the tutorial, you'll be able to understand, and to implement, this type of code.

% db = { } % db[ '9195138112' ] = 'Christopher Healey' % db[ '9195130480' ] = 'Michael Rappa' % db[ '9195132957' ] = 'Aric LaBarr' % for phone in db.keys(): % print( db[ phone ], '@', phone ) Michael Rappa @ 9195130480 Christopher Healey @ 9195138112 Aric LaBarr @ 9195132957

Operators

Python provides a set of built-in functions or operators to perform simple operations such as addition, subtraction, comparison, and boolean logic. An expression is a combination of variables, constants, and operators. Every expression has a result. Operators in Python have precedence associated with them. This means expressions using operators are not evaluated strictly left to right. Results from the operators with the highest precedence are computed first. Consider the following simple Python expression:

% 6 + 3 * 4 / 2 + 2 14

If this were evaluated left to right, the result would be 20. However, since multiplication and division have a higher precedence than addition in Python, the result returned is 14, computed as follows.

Of course, we can use parentheses to force a result of 20, if that's what we wanted, with the following expression:

% ((( 6 + 3 ) * 4 ) / 2 ) + 2 20

Below is a list of the common operators in Python, along with an explanation of what they do. The operators are group according to precedence, from highest to lowest.

Order of precedence for Python operators
Operator Description
( ) parentheses define the order in which groups of operators should be evaluated
** exponential
+x, -x make positive, make negative
*, /, //, % multiplication, division, floor division, remainder
+, - addition, subtraction
<, <=, >, >=, !=, == less, less or equal, greater, greater or equal, not equal, equal
and boolean AND
or boolean OR

Advanced Data Types

In addition to boolean and numeric variables, Python provides a number of more complex types, including strings (str), lists (list), dictionaries (dict), or tuples (tuple). Using these types effectively will make you a much more efficient programmer.

str

String variables (str) are a sequence of one or more characters. String values are denoted by single quotes, s = 'Abraham Lincoln', or double quotes, s = "Abraham Lincoln". Because strings are a more complex data type, they support more complex operations. Here are some common operations you can perform on strings.

Here are some examples of string operations executed in a Python shell.

% s = 'Hello world!' % print( len( s ) ) 12 % print( s[ 6 ] ) w % print( s[ 2: 8 ] ) llo wo % print( s[ -2 ] ) d % print( s[ -3: 12 ] ) ld! % t = 'Must.. try.. harder..' % print( s + ' ' + t ) Hello world! Must.. try.. harder..

There are many additional operations you can perform on strings, for example, s.capitalize() to capitalize a string, or s.find( t ) to find the first occurrence of substring t in s. The Python documentation describes all the available string methods, explaining how to use them and what they do.

list

List variables are ordered sequences of values. Most data types can be stored in a list, for example, a list of int's, a list of str's, or even a list of list's. List values are denoted by square brackets, l = [ 1, 2, 3, 4 ].

You might notice that strings look suspiciously like a list of characters. Indeed, both list and str are known as "sequence types" in Python, so lists support the same len, concatenation, indexing, and slicing operations as strings.

% l = [ 1, 2, 3, 4, 5 ] % print( len( l ) ) 5 % print( l[ 2 ] ) 3 % print( l[ 1: 3 ] ) [2, 3] % print( l[ -2 ] ) 4 % print( l[ -3: 5 ] ) [3, 4, 5] % m = [ -2, -3, -4 ] % print( l + m ) [1, 2, 3, 4, 5, -2, -3, -4]

As with strings, there are many additional operations you can perform on lists, for example, l.insert( i, x ) and l.remove( i ) to add and remove items, or l.sort() to reorder items into sorted order. The Python documentation describes the available list methods, explaining how to use them and what they do.

dict

Dictionary variables are a collection of key–value pairs. This is meant to be analogous to a real dictionary, where the key is a word, and the associated value is the word's definition. dict variables are designed to support efficient searching for elements in the dictionary based on key. Dictionary values are denoted by braces, d = { key: value }.

By design, dictionaries have one important requirement: every value you store in a dictionary must have its own, unique key. For example, we could not store a person's address using their last name as a key, because if two different people had the same last name, only one of their addresses could be saved in the dictionary.

Suppose instead we wanted to find a person's name based on their phone number. To do this, we could create a dictionary with phone number as a key and name as a value.

% d = { '9195138112': 'Christopher Healey' } % d[ '9195130480' ] = 'Michael Rappa' % d[ '9195152858' ] = 'Dept of Computer Science' % print( d ) {'9195130480': 'Michael Rappa', '9195138112': 'Christopher Healey', '9195152858': 'Dept of Computer Science'} % print( d[ '9195138112' ] ) Christopher Healey

The first statement creates a dictionary variable named d and assigns a single key–value pair made up of a string phone number key 9195138112 and a string name value Christopher Healey. The next two lines add two new key–value pairs to d by specifying a key as an index (an string phone number inside the square brackets) and assigned a name as the key's value. Printing d lists all its key–value pairs. The value attached to a specific key can be queried by indexing d with the target key.

Key Types

Why did we choose to make our key a string variable and not a numeric variable? The dictionary would work if we used d = { 9195138112: 'Christopher Healey' }, with a long integer for the key rather than a string. Our choice was semantic: we view a phone number as a sequence of (numeric) characters, and not as a generic numeric value. It doesn't make sense to add or subtract phone numbers, for example, so phone numbers don't really act like numbers.

Since the dictionary works identically either way, does it really matter? In terms of functionality, probably not. In terms of understandability, that depends. We try to find the best match between the context of a variable and its type. Here, the point is subtle, so it doesn't make a big difference. In other cases, though, a proper choice (rather than simply the first choice that works) can improve a program's effectiveness, and perhaps more importantly, make it easier to understand.

One interesting difference between a dictionary and a list is that dictionaries do not maintain order. The order that items are stored in a dictionary will not necessarily match the order that you added them to the dictionary. You can see this in the above example, where phone numbers were added in the order 9195138112, 9195130480, 9195152858, but they were stored in d in the order 9195130480, 9195138112, then 9195152858.

Dictionaries are a very powerful data structure. If you need to perform efficient search, if ordering the element's isn't critical, and if you can define a key for each of the elements you're storing, a dict might be a good candidate. There are many additional operations you can perform on dictionaries, for example, d.keys() to return a list of keys in the dictionary, d.values() to return a list of its values, or d.pop( k ) to remove an entry with key k (and return its value, if you want it). The Python documentation describes the available dictionary methods, explaining how to use them and what they do.

tuple

Tuple variables are ordered sequences of values, where the position of a value within the tuple often has a semantic meaning. Tuple values are denoted by parentheses, t = ( 2013, 10, 28, 14, 15, 0 ).

You might notice that tuples look identical to lists. Again, both are "sequence types" in Python, so tuples support the len, concatenation, and slicing operations, as well as querying by index.

% t = ( 2013, 10, 28, 14, 15, 0 ) % print( len( t ) ) 6 % print( t[ 0 ] ) 2013 % print( t[ 3: 6 ] ) (14, 15, 0) % print( t[ -3 ] ) 14 % print( t[ -6: 3 ] ) (2013, 10, 28) % u = ( 1, 232, 1 ) % print( t + u ) (2013, 10, 28, 14, 15, 0, 1, 232, 1)

There are important differences between lists and tuples. Although they are subtle, understanding them will help you decide when to use a list, and when to use a tuple.

Tuples support methods that lists support, as long as the method does not modify the tuple's values. So, for example, tuples support len() and +, but not remove() or sort(), since that changes the tuple's values or the order of its values.

Conditionals

We've already seen that a Python program runs by executing the first statement in the main module, and continuing with each successive statement until it reaches the end of the program. This doesn't allow for very complicated programs. What if want to control the flow of execution, that is, what if we want one part of the program to be executed in some cases, but another part to be executed in different cases?

Conditional statements allow you to control how your program executes. For example, a conditional statement could apply a comparison operator to a variable, then execute different a block of statements depending on the result of the comparison. Or, it could cause a block of statements to be executed repeatedly until some condition is met.

Understanding condition statements is necessary for writing even moderately complicated programs. We discuss some common Python conditional operators below, and give details on how to structure your code within a conditional statement.

if-then-else

To start, we'll discuss the if-then-else conditional. Described in simple terms, this is used in a program to say, "If some condition is true, then do this, else do that."

As an example, suppose we have a variable grade that holds a student's numeric grade on the range 0–100. We want to define a new variable passed that's set to True if the student's grade is 50 or higher, or False if the grade is less than 50. The following Python conditional will do this.

% grade = 75 % if grade >= 50: % passed = True % else: % passed = False % % print( passed ) True

Although this statement appears simple, there are a number of important details to discuss.

Interestingly, the else part of the conditional is optional. The following code will produce the same result as the first example.

% passed = False % grade = 75 % if grade >= 50: % passed = True % % print( passed ) True

Suppose we wanted to not only define pass or fail, but also assign a letter grade for the student. We could use a series of if-then statements, one for each possible letter grade. A better way is to use elif, which defines else-if code blocks. Now, we're telling a program, "If some condition is true, then do this, else if some other condition is true, then do this, else do that." You can include as many else-if statements as you want in an if-then-else conditional.

% grade = 75 % if grade >= 90: % passed = True % letter = 'A' % elif grade >= 80: % passed = True % letter = 'B' % elif grade >= 65: % passed = True % letter = 'C' % elif grade >= 50: % passed = True % letter = 'D' % else: % passed = False % letter = 'F' % % print( passed ) True % print( letter ) C

while

Another common situation is the need to execute a code block until some condition is met. This is done with a while conditional. Here, we're telling the program "While some condition is true, do this." For example, suppose we wanted to print the square roots of values on the range 1–15.

% import math % i = 1 % while i <= 15: % print( 'The square root of', i, 'is', math.sqrt( i ) ) % i = i + 1 % The square root of 1 is 1 The square root of 2 is 1.4142135623730951 … The square root of 15 is 3.872983346207417

(The import math statement is needed to give us access to mathematical functions like math.sqrt()).

Notice that the variable that's compared in the while conditional normally must be updated in the conditional's code block. If you don't update the conditional variable, a comparison that initially evaluates to True will never evaluate to False, which means the while loop will execute forever. For example, consider the following code block.

% import math % i = 1 % while i <= 15: % print( 'The square root of', i, 'is', math.sqrt( i ) ) % The square root of 1 is 1 The square root of 1 is 1 The square root of 1 is 1 The square root of 1 is 1 The square root of 1 is 1 The square root of 1 is 1 The square root of 1 is 1 …

Without the i = i + 1 statement to update i in the conditional's code block, the while conditional never fails, giving us the same output over and over. You can use Ctrl+C to halt your program if it's caught in an infinite loop like this.

for

A final conditional that is very common is a for loop. Here, we're telling a program "Execute this code block for some list of values." for can work on any list of values, but it's often applied to a numeric range. The range command is used to create a sequence of integers.

% print( range( 5, 10 ) ) range( 5, 10 ) % print( range( 10 ) ) range( 10 ) % print( list( range( -15, -10 ) ) ) [-15, -14, -13, -12, -11]

Giving two values to range like range( 2, 5 ) defines a starting value of 2 and an ending value of 5. range generates an integer list from the starting value, up to but not including the ending value: [2, 3, 4]. If you only give an ending value to range like range( 5 ), range assumes a starting value of 0, producing the list [0, 1, 2, 3, 4].

Once a list is produced with range, each value in the list is given to the for conditional's code block, in order. For example, suppose we wanted to print the same set of square roots from 1–15 using a for loop.

% import math % for i in range( 1, 16 ): % print( 'The square root of', i, 'is', math.sqrt( i ) ) % The square root of 1 is 1 The square root of 2 is 1.4142135623730951 … The square root of 15 is 3.872983346207417

The for statement defines a variable to hold the "current" list value. In our case, this variable is called i. range( 1, 16 ) generates the list [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]. The for conditional walks through this list and executes the code block 15 times, first with i set to 1, then with i set to 2, and so on up to the final value of 15. The statement inside the code block uses i to track the current list value, printing square roots from 1 to 15.

We don't need to use range to execute a for conditional. Any list can be used in a for loop.

% name = [ "Healey", "Rappa", "LaBarr" ] % for nm in name: % print( nm, '(', len( nm ), ')' ) % Healey ( 6 ) Rappa ( 5 ) LaBarr ( 6 )

break and continue

break

Sometimes we need to exit a for or while loop before its condition evaluates to False. The break statement allows us to do this. For example, suppose we wanted to print the elements of a list of strings, but terminate examining the list if we see the string stop.

% l = [ 'Healey', 'Rappa', 'LaBarr' ] % for i in range( len( l ) ): % if l[ i ] == 'stop': % break % print( l[ i ] ) % Healey Rappa LaBarr % l.insert( 1, 'stop' ) % print( l ) [ 'Healey', 'stop', 'Rappa', 'LaBarr' ] % for i in range( len( l ) ): % if l[ i ] == 'stop': % break % print( l[ i ] ) Healey

continue

Other times, we want to stop executing a loop's code block, and instead return to check its condition. The continue statement allows us to do this. For example, suppose we wanted to print only the odd numbers from 1 to 10.

% for i in range( 1, 10 ): % if i % 2 == 0: % continue % print( i, 'is odd' ) 1 is odd 3 is odd 5 is odd 7 is odd 9 is odd

Loop Practice Problem

Write a set of Python statements to compute the average of the following list of numbers.

I recommend you write your program using Jupyter notebook or spyder, save it as a Python notebook or source code file, and then test it, rather than writing the program directly in the Python shell. This will let you write your code, run it to see what it does, edit it to fix problems, and run it again, without having to re-type the entire program at the command line.

Loop Practice Solution

for loop

% num = [ 6, 12, -7, 29, 14, 38, 11, 7 ] % sum = 0 % for n in num: % sum = sum + n % % print( float( sum ) / len( num ) ) 13.75

while loop

% num = [ 6, 12, -7, 29, 14, 38, 11, 7 ] % i = 0 % sum = 0 % while i < len( num ): % sum = sum + num[ i ] % i = i + 1 % % print( float( sum ) / len( num ) ) 13.75

Notice that we have to convert the sum to a floating point value (in our case, by casting it with float()) to get the proper average of 13.75. If we had used the statement print( float sum / len( num ) ) instead, Python would have return an integer result of 13.

You can download the solution file and run it on your machine, if you want.

Debugging

Inevitably, you'll write some Python code that either doesn't do what you expect it to do, or that generates an error message when you try to execute it. When that happens, you'll need to debug the program to locate and correct the error. Consider the following code.

% l = [ '10', '20', '30' ] % sum = 0 % for val in l: % sum = sum + val

If you hit Return to close the for loop, Python would respond with an error message similar to this.

TypeError Traceback (most recent call last) <ipython-input-11-00773625d4c5>, in <module> 2 sum = 0 3 for val in l: ----> 4 sum = sum + val TypeError: unsupported operand type(s) for +: 'int' and 'str'

So, that didn't work. The first two lines of the error message give some details about how the error's being reported, and where the error occurred (on line 2 of code being entered on "<stdin>>>;", which is Python's way of saying "from the keyboard"). The most important part of the error is the last line, which tries to explain the problem Python encountered. This explanation suggests that Python doesn't know how to add (+) an int and a str.

If you look at where the error was reported (line 2 of the for loop), it attempted to execute sum = sum + val. Python is claiming the first variable in the add operation, sum, is an int, but the second variable val is a str. val is a value from the list variable l. And, when we look at l, we see that it contains three string values: '10', '20', and '30'. This is the problem that Python encountered.

There are various ways to fix this problem. One simple solution is to put integers in the list, l = [ 10, 20, 30 ]. If you wanted l to contain strings for some reason, you could cast val to be an integer in the add operation.

% l = [ '10', '20', '30' ] % sum = 0 % for val in l: % sum = sum + int( val ) % % print( sum ) 60

Now, Python accepts the for loop's body because it understands how to add to int variables. The resulting sum is printed after the loop finishes.

Computer "Bugs"

Why are errors in computer programs called bugs? Historically, the term "bug" was used in engineering to describe mechanical malfunctions.

A physical bug found during construction ofin the Harvard Mark II computer.

In 1947 computer engineers were designing the Harvard Mark II computer. An error in the machine was traced to a moth that had become trapped in one of the machine's relays. The moth was removed and taped to the engineers' log book, where they referred to it as "The first actual case of bug being found." This incident seems to have contributed to the widespread use of the term in Computer Science.

Files

One of the main reasons we're using Python is to read and write data to and from external files. Python has an extensive set of file input/output (file IO) operations to support this. The basic structure of modifying a file often follows this simple pattern.

  1. Open one or more input files to read the original data.
  2. Open an output file to write the new or modified data.
  3. Read data from the input file, usually line by line.
  4. Examine each input line, modifying it or generate new data based on its contents.
  5. Write the modified line and/or data to the output file.
  6. Close the input and output files when processing is completed.

Reading Files

Here are some operations you can use to open files and read from them.

As a file is being read, Python maintains a current position in the file (the file pointer). This is how Python finds things like the "next" line: it starts from its current position in the file, reads until it sees a newline character (\r\n) or the end of the file, then returns what it read as a string.

Maintaining a current position means that Python won't automatically "back up" for you if you want to go back and re-read some data. For example, consider the following code snippet.

% inp = open( 'input.txt', 'r', encoding='latin' ) % line = inp.readlines() % print( len( line ) ) 180 % line = inp.readlines() % print( len( line ) ) 0 % inp.close()

The first time we read input.txt and asked how many lines it contained, Python told us it had 180 lines. But the next time we read the file, Python said it had 0 lines. How is that possible?

Remember, after the first line = inp.readlines() statement, Python reads everything in the file, returning a list with 180 strings representing the file's 180 lines. Critically, the current position is now at the end of the file. We re-issue the same line = input.readlines() statement, Python starts from its current position at the end of the file, realizes there's nothing else left to read, and returns an empty list to tell us that. So, the length of that second list is 0, exactly as we saw.

How could we re-read the entire file? The easiest way to do this is to close the file, then re-open it. Doing this resets the current position back to the start of the file.

% inp = open( 'input.txt', 'r', encoding='latin' ) % line = inp.readlines() % print( len( line ) ) 180 % inp.close() % inp = open( 'input.txt', 'r', encoding='latin' ) % line = inp.readlines() % print( len( line ) ) 180 % inp.close()

seek and tell

seek

It's possible to change the current position in a file without closing and re-opening it. seek( pos ) is used to set a file's current position to pos. For example, we could re-read input.text as follows.

% inp = open( 'input.txt', 'r', encoding='latin' ) % line = inp.readlines() % print( len( line ) ) 180 % inp.seek( 0 ) % line = inp.readlines() % print( len( line ) ) 180 % inp.close()

The command inp.seek( 0 ) sets the current position to 0 bytes from the start of the file (i.e., to the start of the file). If you need to seek from the end of the file, you can specifying a negative offset and a second argument of 2 to seek, for example, inp.seek( -10, 2 ) seek 10 bytes backwards from the end of input.txt. You can also seek from the current position by specifying an offset and a second argument of 1 to seek, for example, inp.seek( 20, 1 ) to seek 20 bytes forwards from the current position, or seek( -5, 1 ) to seek 5 bytes backwards from the current position.

tell

The tell() command will return the current position in a file. For example, to determine the size of a file, you could do the following.

% inp = open( 'input.txt', 'r', encoding='latin' ) % inp.seek( 0, 2 ) % print( inp.tell() ) 4044 % inp.close()

Writing Files

Python provides similar operations open files and write to them.

Writing data is simple: put the data you want to write into a string variable (or convert a variable's value to a string value), then use write to write the data to an output file.

% import os % % out = open( 'output.txt', 'w', newline='', encoding='latin' ) % num = 50 % list = [ 'Healey', 'Rappa', 'Mostek' ] % for elem in list: % out.write( elem + os.linesep ) % % out.write( str( num ) ) % out.close()

This code snippet creates an output file output.txt and writes four lines containing Healey, Rappa, Mostek, and 50 to the file. Notice that if we need a newline after each value, we need to add it explicitly by appending '\r\n' to the string as it's being written.

newlines

Newlines are used to insert a carriage return between lines in a file. In Windows (or DOS) a newline is actually two characters: a return and a newline, denoted \r\n. This is unique to Windows. On other operating systems like Mac OS or Linux, only the newline \n is used.

Newlines also matter when you're reading data. For example, supposed you used readlines() to read all the lines in a file.

% inp = open( 'input.txt', 'r', encoding='latin' ) % line = inp.readlines() % print( line[ 0: 2 ] ) ['This is the first line\r\n', 'This is the second line\r\n']

The return and newline are included at the end of each line that's read. In some cases you want these "separators" removed when the file is read and parsed. One easy way to do this is to read the entire file using read(), then split the result using the string operator split().

To determine which character(s) represent the end of a line, use the os.linesep variable.

Windows:

% import os % os.linesep '\r\n'

Mac OS or Linux:

% import os % os.linesep '\n'

Now, we read and divide lines as follows to produce a list of lines in a file.

% import os % inp = open( 'input.txt', 'r', encoding='latin' ) % content = inp.read() % line = content.split( os.linesep ) % print( line[ 0: 2 ] ) ['This is the first line', 'This is the second line']

split() divides a string into a list of substrings based on a delimiter, throwing away the delimiter after each split. Splitting the file's contents on the delimiter os.linesep gives us what we want: the individual lines from the file without carriage returns or newlines at the end of each line.

CSV Files

A file type you are likely to encounter often is a comma-separated value (CSV) file. CSV files are text files containing a table of values. Each line represents one row in the table, with individual column values in the row identified by a separator character. The separator is often a comma (,), although it can be any character that's guaranteed not to appear in any column value. For example, the US Census Bureau maintains a file of estimated US city populations from 2010–2017. We will use an earlier file, compiled directly after the 2010 census, containing estimated 2011 city and town populations.

SUMLEV,STATE,COUNTY,PLACE,COUSUB,CONCIT,NAME,STNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,POPESTIMATE2011 040,01,000,00000,00000,00000,Alabama,Alabama,4779736,4779735,4785401,4802740 162,01,000,00124,00000,00000,Abbeville city,Alabama,2688,2688,2689,2704 162,01,000,00460,00000,00000,Adamsville city,Alabama,4522,4522,4522,4525 162,01,000,00484,00000,00000,Addison town,Alabama,758,758,754,754 162,01,000,00676,00000,00000,Akron town,Alabama,356,356,354,348 162,01,000,00820,00000,00000,Alabaster city,Alabama,30352,30352,30473,30799 162,01,000,00988,00000,00000,Albertville city,Alabama,21160,21160,21202,21421 162,01,000,01132,00000,00000,Alexander City city,Alabama,14875,14875,14846,14876 162,01,000,01228,00000,00000,Aliceville city,Alabama,2486,2486,2483,2438 … 157,56,045,79125,00000,00000,Upton town,Wyoming,1100,1100,1096,1084 157,56,045,99990,00000,00000,Balance of Weston County,Wyoming,2576,2576,2564,2539

We could use file and string operations to read and parse CSV files, but Python provides a csv module to help us with this. Modules are pre-written collections of operators, usually designed for a specific purpose or task. To use a module, you must first import it. To invoke one of it's operations, you precede the operator's name with the name of the module, followed by a period.

% import csv % csv.list_dialects() ['excel-tab', 'excel']

To read data from a CSV file, we normally perform the following steps.

  1. Open the CSV file to read with open(), exactly like any other input file.
  2. Attach a CSV reader to the CSV file.
  3. Use next( reader ) to read and parse any header line(s) in the CSV file.
  4. Use a for loop to read and parse the rows in the CSV file. Each row is returned as a list of column values found in the row's line.
  5. Close the CSV file.

For example, this code would read and parse the Census population file

% import csv % % inp = open( 'pop.csv', 'r', newline='', encoding='latin' ) % reader = csv.reader( inp ) % header = next( reader ) % % for row in reader: % if row[ 6 ] == row[ 7 ]: % print( 'The state of', row[ 7 ], 'has population', row[ 11 ] ) % else: % print( 'City', row[ 6 ], 'in state', row[ 7 ], 'has population', row[ 11 ] ) % The state of Alabama has population 4802740 City Abbeville city in state Alabama has population 2704 City Adamsville city in state Alabama has population 4525 … City Upton town in state Wyoming has population 1084 City Balance of Weston County in state Wyoming has population 2539 % % inp.close()

Note that we open the file with the newline='' argument. We will use the same argument we we open files intended for CSV writing below. This handles running Python 3.x on Windows, where the interpreter automatically inserts a carriage return (\r) whenever it sees a newline (\n). This built-in functionality means that, on Windows, when we read or write a CSV file, the newline \r\n on the file is automatically converted to \r\r\n, which adds an extra line break. Particularly on writing, this causes issues, making it look like there's a blank line between each data line when the file is imported into programs like Excel.

The csv module helps us read and parse a CSV file, but it doesn't tell us anything about what the rows and columns in the file represent. We need to provide that context based on our understanding of the file. For example, in the code above:

It's also possible to write data out in CSV format. This is useful, since CSV files can be easily imported into programs like Excel or SAS. A very similar sequence of steps is used to write a CSV file.

  1. Open the CSV file to write with open(), exactly like any other output file.
  2. Attach a CSV writer to the CSV file.
  3. For each row you want to write to the CSV file, store the row's column values in a list.
  4. Use writerow() to write the list's values as a comma-separated line in the CSV file.
  5. After all the rows are written, close the CSV file.

For example, the Census population file has a lot of columns we might not care about. Suppose we wanted to reduce the file to only include city name, state name, and 2011 population estimate. The following code would do this.

% import csv % % inp = open( 'pop.csv', 'r', newline='', encoding='latin' ) % reader = csv.reader( inp ) % header = next( reader ) % % out = open( 'pop-summary.csv', 'w', newline='', encoding='latin' ) % writer = csv.writer( out ) % writer.writerow( [ 'City', 'State', 'Population' ] ) % % for row in reader: % writer.writerow( [ row[ 6 ], row[ 7 ], row[ 11 ] ] ) % % inp.close() % out.close()

This will produce an output file pop-summary.csv with the following data.

City,State,Population Alabama,Alabama,4802740 Abbeville city,Alabama,2704 Adamsville city,Alabama,4525 Addison town,Alabama,754 Akron town,Alabama,348 Alabaster city,Alabama,30799 Albertville city,Alabama,21421 Alexander City city,Alabama,14876 Aliceville city,Alabama,2438 … Upton town,Wyoming,1084 Balance of Weston County,Wyoming,2539

CSV Practice Problem

Write a Python program that finds the city with the largest population in pop.csv, then prints this city's name, the name of its state, and its population.

Hint. When you read data with a CSV reader, the column values it returns are all strings. You'll want to convert the population value from a string to an integer. To do this, you cast the string using the int()operation.

% for row in reader: % pop = int( row[ 11 ] ) …

I recommend you write your program using Jupyter notebook or spyder, save it as a Python notebook or source code file, and then test it, rather than writing the program directly in the Python shell. This will let you write your code, run it to see what it does, edit it to fix problems, and run it again, without having to re-type the entire program at the command line.

Maximum Population Solution

% import csv % inp = open( 'pop.csv', 'r', encoding='latin' ) % reader = csv.reader( inp ) % header = next( reader ) % max_pop = 0 % max_city_nm = "Unknown" % max_st_nm = "Unknown" % for row in reader: % pop = int( row[ 11 ] ) % if row[ 6 ] != row[ 7 ] and pop > max_pop: % max_pop = pop % max_city_nm = row[ 6 ] % max_st_nm = row[ 7 ] % % inp.close() % print( max_city_nm, 'in state', max_st_nm, 'has population', max_pop ) Los Angeles County in state California has population 9889056

You can download the solution file and run it on your machine, if you want.

Functions

It's possible to write a program as a single, long sequence of statements in the main module. Even for small programs, however, this isn't efficient. First, writing a program this way makes it difficult to examine and understand. Second, if you're performing common operation on different variables, you need to duplicate the code every time you perform that operation.

For example, suppose we wanted to report the average of two numeric lists l and m. One obvious way to do it is to write two for loops.

% l = [ 1, 2, 3 ] % m = [ 7, 8, 14 ] % sum = 0 % for elem in l: % sum = sum + elem % % print( float( sum ) / 3 ) 2.0 % sum = 0 % for elem in m: % sum = sum + elem % % print( float( sum ) / 3 ) 9.66666666667

This has a number of problems, however. What if we had more than just two lists we wanted to average? We'd need to duplicate the for loop once for each list. What if we wanted to do something more complicated than calculating the average (e.g., what if we wanted population standard deviation instead)? The amount of code we'd need to duplicate would be much longer.

What we really want to do is to have some sort of avg() operation that we can call whenever we want to calculate the average of a numeric list.

% l = [ 1, 2, 3 ] % m = [ 7, 8, 14 ] % print( avg( l ) ) % 2.0 % print( avg( m ) ) 9.66666666667

In Python we can define a function to create new operations like avg(). A function is defined using the keyword def, followed by the function's name, followed by an argument list in parentheses, and then a colon. The function's code block defines what the function does when it's called.

% def avg( num ): % sum = 0 % for elem in num: % sum = sum + elem % return float( sum ) / len( num )

Functions can take zero or more arguments. A function with no arguments still needs open and close parentheses, def func():. A function with multiple arguments separates then with commas, def func( a, b ):.

Once a function is defined, it can be used anywhere, including in other functions. Suppose we now wanted to write a function stdev() to compute the population standard deviation of a numeric list. We can use our avg() function to help to do this.

% import math % % def stdev( num ): % sum = 0 % num_avg = avg( num ) % for elem in num: % sum = sum + ( ( elem - num_avg ) ** 2.0 ) % return math.sqrt( sum / len( num ) )

What if we wanted to allow a user to decide whether to calculate population standard deviation or sample standard deviation? We could write two separate functions to do this, but an easier way is to add an argument to the stdev() function to specify which standard deviation to calculate.

% import math % % def stdev( num, pop = True ): % sum = 0 % num_avg = avg( num ) % for elem in num: % sum = sum + ( ( elem - num_avg ) ** 2.0 ) % if pop == True: % return math.sqrt( sum / len( num ) ) % else: % return math.sqrt( sum / ( len( num ) - 1 ))

Now, stdev() takes a second argument pop. If pop is True, we return population standard deviation. Otherwise, we return sample standard deviation. Notice that in the function header we defined pop = True. This specifies a default value for pop. If the user doesn't specify a second argument, we return population standard deviation by default.

% l = [ 7, 8, 14 ] % print( 'Pop stdev of', l, 'is', stdev( l ) ) Pop stdev of [7, 8, 14] is 3.09120616517 % print( 'Sample stdev of', l, 'is', stdev( l, False ) ) Sample stdev of [7, 8, 14] is 3.7859388972

It's even possible for functions to call themselves. This is known as recursion. Consider the Fibonacci sequence:

Since Fibonacci numbers for n ≥ 2 are defined based on lower-order versions of themselves, they are a common candidate to demonstrate a recursive function.

% def fib( n ): % if n <= 0: % return 0 % elif n == 1: % return 1 % else: % return fib( n - 1 ) + fib( n - 2 ) % % print( fib( 0 ) ) 0 % print( fib( 2 ) ) 1 % print( fib( 20 ) ) 6765 % print( fib( 40 ) ) 102334155

If you look at fib(), you should see intuitively that it's very expensive to execute. If you tried to calculate fib( 100 ), for example, you'd be waiting a long time for it to finish. That's because each number in a Fibonacci sequence requires two recursive calls, which themselves each require two recursive calls, and so on.

Fibonacci, Rabbits, and Efficiency

Although the origins of Fibonacci numbers are credited to Indian mathematics, the number series is named after Leonardo of Pisa, also known as Fibonacci. In 1202 Fibonacci posed a question about an idealized rabbit population.

Fibonacci wondered, "How many pairs of rabbits would you have after n months passed?"

As you can see, this forms a numeric sequence that we now call the Fibonacci number series.

If we need to compute large Fibonacci numbers, the recursive formula is too inefficient. Instead, we use Binet's Formula to calculate F(n) directly.

% import math % % def fib_e( n ): % if n <= 0: % return 0 % elif n == 1: % return 1 % else: % sum = ( 1.0 + math.sqrt( 5 ) ) ** n % sum = sum - ( 1.0 - math.sqrt( 5 ) ) ** n % sum = sum / ( 2.0 ** n * math.sqrt( 5 )) % return round( sum )

If you're curious, F(100) = 3.542248481792631e+20. It would take about 4 years for rabbits to outnumber humans based on Fibonacci's scenario.

NumPy

NumPy (numerical Python, hereafter numpy, pronounced num-pie) is a library that provides advanced mathematical operations involving statistics and linear algebra. Our interest is mainly in numpy statistical capabilities, since this will allow us to calculate things like mean, variance, minimum, maximum, correlation, and covariance on lists of numbers.

numpy's standard data type is an array: a sequence of numbers, all of the same type. Arrays can be created in numerous ways. Common examples include:

% import numpy % % arr = numpy.array( [ 2, 3, 4 ] ) % print( arr ) [2 3 4] % print( type( arr ) ) <class 'numpy.ndarray'> % arr = numpy.zeros( 5 ) % print( arr ) [0. 0. 0. 0. 0.] % arr = numpy.arange( 10, 30, 5 ) % print( arr ) [10 15 20 25]

numpy also supports multidimensional arrays. For example, a table is a 2-dimensional (2D) array with rows and columns. A data cube is a 3-dimensional array with rows, columns, and slices. We restrict ourselves to 1D and 2D arrays in this tutorial. The easiest way to define a 2D array in numpy is to provide a list of equal-length sublists to the array() operator, one sublist for each row in the array.

% import numpy % % arr = numpy.array( [ [ 1, 2 ], [ 4, 5 ], [ 7, 8 ] ] ) % print( arr ) [[1 2] [4 5] [7 8]] % print( arr[ 1 ] ) [4 5] % print( arr[ 1, 1 ] ) 5 % print( arr[ -1 ] ) [7 8] % print( arr[ 1: 3 ] ) [[4 5] [7 8]] % print( arr[ 0: 2, 1 ] ) [2 5] % % print( arr.shape ) (3, 2) % arr = arr.reshape( 2, 3 ) % print( arr ) [[1 2 4] [5 7 8] % print( arr.shape ) (2, 3)

numpy provides access to elements of an array using the standard indexing operator [ ]. Negative indices and slicing can be used, similar to Python lists. It's also possible to ask for the shape of an array using shape(), which returns the number of rows for a 1D array, or a tuple with the number of rows and columns for a 2D array. It's even possible to reshape an array using reshape(), rearranging the array's values into a new (row, column) configuration.

As mentioned above, one of the main advantages of using numpy is access to a number of statistical operations. A few common examples are listed below. A full list of numpy's statistical operators is available online.

% import numpy % % arr = numpy.arange( 0, 500, 2 ) % print( arr.size ) 250 % hist = numpy.histogram( arr, 10 ) % print( hist ) (array([25, 25, 25, 25, 25, 25, 25, 25, 25, 25]), array([ 0. , 49.8, 99.6, 149.4, 199.2, 249. , 298.8, 348.6, 398.4, 448.2, 498. ]))

(Another use for numpy is to perform linear algebra operations. Although less common in the analytics program, numpy's ability to compute eigenvectors, invert matrices, or solve systems of equations is very powerful.)

pandas

The pandas (Python Data Analysis) library builds on numpy, offering an extended set of data manipulation and analysis tools. pandas is built on a few basic data types (or data structures, as they're called in pandas), together with operations on data stored using these types.

Series

One of pandas's fundamental data types is a Series, a 1D labelled array. You can think of this as a numpy array with an explicit label attached to each data value. The collection of labels is called the data's index.

A Series can be created in numerous ways: from Python lists, from a numpy array, or even from a Python dictionary.

% import numpy % import pandas % % s = pandas.Series( [ 1, 2, 3 ], [ 'a', 'b', 'c' ] ) % print( s ) a 1 b 2 c 3 dtype: int64 % % d = { 'a': 3.14, 'b': 6.29, 'x': -1.34 } % t = pandas.Series( d ) % print( t ) a 3.14 x -1.34 b 6.29 dtype: float64 % % a = numpy.array( [ 4.5, 5.5, 6.5 ] ) % u = pandas.Series( a ) % print( u ) 0 4.5 1 5.5 2 6.5 dtype: float64

Data in a Series can be queried using numeric indices and slicing, just like with Python lists and numpy arrays. It can also be accessed using index labels, similar to a Python dictionary.

% import numpy % import pandas % % s = pandas.Series( [ 1, 2, 3 ], [ 'a', 'b', 'c' ] ) % print( s[ 1 ] ) 2 % print( s[ 'c' ] ) 3 % print( s[ 1: 3 ] ) b 2 c 3 dtype: int64

More importantly, we can index by applying a conditional operation to every data element in a Series. This returns a new boolean Series with the result of applying the conditional (True or False) at each element position. The boolean Series is then used to select only those data elements that passed the conditional. For example, suppose we wanted to select the elements in a Series whose values were greater than 2, but less than 5.

% import numpy % import pandas % % s = pandas.Series( [ 1, 2, 3, 4, 5 ] ) % idx = ( s > 2 ) & ( s < 5 ) % print( idx ) 0 False 1 False 2 True 3 True 4 False dtype: bool % s_sub = s[ idx ] % print( s_sub ) 2 3 3 4 dtype: int64

This is how pandas interprets these commands.

Because pandas data have labels, we can perform operations that use data alignment. pandas will look at the variables involved in an operation, and automatically "match up" data elements with common labels.

% import numpy % import pandas % % s = pandas.Series( [ 1, 2, 3 ], [ 'a', 'b', 'c' ] ) % t = pandas.Series( { 'a': 3.14, 'b': 6.29, 'x': -1.34 } ) % print( s + t ) a 4.14 b 8.29 c NaN x NaN dtype: float64

When we apply s + t, pandas sees data with labels a and b in both variables, so it knows to add those entries together. If data with a common label is missing from any of the Series, the result is undefined, and set to NaN (not a number).

DataFrame

A DataFrame is the second fundamental data type in pandas. A DataFrame is a 2D table whose columns and rows are both labelled. You can think of a DataFrame as a table of Series arrays, one for each column in the DataFrame. Row labels are still referred to as the index, and column labels are simply called the columns.

% import numpy % import pandas % % d = [ [ 1, 2 ], [ 4, 5 ], [ 7, 8 ] ] % df = pandas.DataFrame( d, index = [ 'a', 'b', 'c' ], columns = [ 'C1', 'C2' ] ) % print( df ) C1 C2 a 1 2 b 4 5 c 7 8

Indexing is more complicated with DataFrames, since there are two separate dimensions: the columns and the rows. The following operators are used to index a DataFrame variable.

% import numpy % import pandas % % d = [ [ 1, 2 ], [ 4, 5 ], [ 7, 8 ] ] % df = pandas.DataFrame( d, index = [ 'a', 'b', 'c' ], columns = [ 'C1', 'C2' ] ) % print( df[ 'C2' ] ) a 2 b 5 c 8 Name: C2, dtype: int64 % print( df[ 'C1' ][ 'b' ] ) 4 % print( df.loc[ 'b' ] ) C1 4 C2 5 Name: b, dtype: int64 % print( df.iloc[ 2 ] ) C1 7 C2 8 Name: c, dtype: int64

Slicing DataFrames

It's possible to use slicing to return multiple rows (and/or columns), rather than a single result. This must be done in a particular way to properly handle and differentiate between rows and columns, however, making it more complex than slicing lists or arrays.

% import numpy % import pandas % % d = [ [ 1, 2, -1 ], [ 4, 5, -2 ], [ 7, 8, -3 ] ] % df = pandas.DataFrame( d, index = [ 'a', 'b', 'c' ], columns = [ 'C1', 'C2', 'C3' ] ) % print( df[ 'b': 'c' ] ) C1 C2 C3 b 4 5 -2 c 7 8 -3 % print( df.loc[ 'a': 'b' ] ) C1 C2 C3 a 1 2 -1 b 4 5 -2 % print( df.loc[ :, 'C2': 'C3' ] ) C2 C3 a 2 -1 b 5 -2 c 8 -3 % print( df.loc[ 'a': 'b', 'C1': 'C2' ] ) C1 C2 a 1 2 b 4 5

You can insert and delete rows and columns from a DataFrame, or change values at specific index positions, using the same indexing operations.

% import numpy % import pandas % % d = [ [ 1, 2 ], [ 4, 5 ], [ 7, 8 ] ] % df = pandas.DataFrame( d, index = [ 'a', 'b', 'c' ], columns = [ 'C1', 'C2' ] ) % % s = pandas.Series( { 'C1': 9, 'C2': 11 } ) % s.name = 'd' % df = df.append( s ) % print( df ) C1 C2 a 1 2 b 4 5 c 7 8 d 9 11 % % df[ 'C3' ] = df[ 'C1' ] + df[ 'C2' ] % df[ 'C1' ][ 'a' ] = 99 % % print( df ) C1 C2 C3 a 99 2 3 b 4 5 9 c 7 8 15 d 9 11 20

Maximum Population Revisited

To better exemplify the power of pandas, recall the previous problem of identifying the city with the maximum estimated 2011 population from a Census Bureau CSV file. Using pandas, we could identify the city with the largest population using only a few lines of code.

% import numpy % import pandas % % pop = pandas.read_csv( 'pop.csv', encoding='latin1' ) % pop.describe() SUMLEV STATE … POPESTIMATE2011 count 81746.000000 81746.000000 … 81746.000000 mean 114.380936 30.592849 … 15929.926565 std 47.934468 13.416473 … 244346.744241 min 40.000000 1.000000 … 0.000000 25% 61.000000 19.000000 … 339.000000 50% 157.000000 29.000000 … 1209.000000 75% 157.000000 41.000000 … 5041.000000 max 172.000000 56.000000 … 37691912.000000 % % idx = pop[ 'NAME' ] != pop[ 'STNAME' ] % city_pop = pop[ idx ] % i = city_pop[ 'POPESTIMATE2011' ].idxmax() % % nm = pop[ 'NAME' ][ i ] % state = pop[ 'STNAME' ][ i ] % max_pop = pop[ 'POPESTIMATE2011' ][ i ] % print( nm, 'in state', state, 'has population', max_pop ) Los Angeles County in state California has population 9889056

This is how pandas interprets these commands.

scikit-learn

scikit-learn (sklearn) is an open-source Python library that is used extensively for analytic tasks. Built as a Google Summer of Code project by David Cournapeau, the term scikit derives from the notion that the package provides a toolkit for scipy, a scientific Python library similar to numpy.

Originally, sklearn was designed for machine learning (ML), and retains an extensive list of algorithms to implement numerous ML techniques. More recently, sklearn has extended to include dimension reduction, model parameter comparison and validation, and preprocessing such as text analytics, data standardization, non-linear transformations, categorical encoding, and imputation. The combination of base Python and sklearn are often used as an all-purpose preprocessor and derivation engine, transforming raw input into a format designed for more sophisticated follow-on analytics.

sklearn is built on top of numpy, so it uses numpy's Array as its basic sequence data structure. This also means that it integrates well with pandas, which is also built on top of numpy.

Classification

Classification is a supervised learning approach that analyzes samples with known classes. It uses one subset of the samples to train a model, and a second subset to test the model's predictive accuracy. The model uses predictor variables to estimate the target class. The model can then be used to suggest a class for unclassified samples.

There is often confusion about the differences between classification and clustering. The following table identifies some of the characteristics that separate these two approaches.

scikit-learn: classification vs. clustering
Criteria Classification Clustering
Supervised Yes No
Use Case Classify new samples into estimated classes Suggest groups of samples based on similar data patterns
Algorithms Linear regression, ridge regression, lasso, elastic net, OMP, Bayesian regression, logistic regression, SGD, perceptron, LDA, QDA, kernel ridge regression, SVM, SVC, nearest neighbours, Gaussian process regression, naive Bayes, decision trees, random forest, XGBoost k-means, affinity propagation, mean-shift, spectral, Ward, agglomerative, DBSCAN, Gaussian mixture, Birch
Required Data Labelled samples from a set of classes Unlabelled samples

As an example of classification, we will use the Iris dataset, which is available in sklearn's datasets module. Recall that Fisher's Iris dataset contains petal and sepal lengths and widths for 50 samples of 3 classes of irises: Iris setosa, Iris virginica, and Iris versicolor. This represents an excellent example dataset, since it has already been pre-classified into one of the three iris types.

To begin, we import numpy and sklearn, load the Iris dataset, and explore it by printing its features (predictor variables) and target classes. We then create the predictor matrix using all samples' petal lengths and widths, as well as the corresponding target iris types. If you prefer, you can download a Jupyter notebook with the classification code from here.

% import numpy as np % from sklearn import datasets % % iris = datasets.load_iris() % % # Print some descriptive statistics about the data and target % print( 'Data features:', iris.feature_names ) Data features: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'] % % print( 'Target classes:', iris.target_names ) Target classes: ['setosa' 'versicolor' 'virginica'] % % print( 'Truncated data array:', iris.data[ :5 ], '\n---\n', iris.data[ -5: ] ) Truncated data array: [[5.1 3.5 1.4 0.2] [4.9 3. 1.4 0.2] [4.7 3.2 1.3 0.2] [4.6 3.1 1.5 0.2] [5. 3.6 1.4 0.2]] --- [[6.7 3. 5.2 2.3] [6.3 2.5 5. 1.9] [6.5 3. 5.2 2. ] [6.2 3.4 5.4 2.3] [5.9 3. 5.1 1.8]] % % print( 'Target labels:', iris.target ) Target labels: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2] % % % # Set predictors X as petal length and width % # Set target y as iris class % X = iris.data[ :, [ 2, 3 ] ] % y = iris.target

Notice that at the end of the code block we set the predictor matrix X to include all petal lengths and widths, and the target vector y to include the corresponding iris types for each predictor sample. You might wonder, why are we only using petal length and width, and not any sepal values? If you do some initial exploration of the data by building scatterplots of petal length versus width and sepal length versus width, you can see that petal data does a better job of separating the three classes, versus sepal data.

% # Plot petal data to inspect how samples cluster % % import numpy as np % from sklearn import datasets % from matplotlib.colors import ListedColormap % import matplotlib.pyplot as plt % % # Use petal length and width as predictor variables % iris = datasets.load_iris() % X = iris.data[ :, [ 2, 3 ] ] % y = iris.target % % markers = ( 's', 'x', 'o' ) % colors = ( 'red', 'blue', 'lightgreen' ) % cmap = ListedColormap( colors[ :len( np.unique( y ) ) ] ) % % fig = plt.figure() % for i, col in enumerate( np.unique( y ) ): % plt.scatter(\ % x=X[ y==col, 0 ], y=X[ y==col, 1 ],\ % c=[cmap( i )], marker=markers[ i ],\ % edgecolor='black', label=col ) % plt.xlabel( 'Petal Length (cm)' ) % plt.ylabel( 'Petal Width (cm)' ) % plt.legend( ['setosa', 'versicolor', 'virginica' ] ) % fig.suptitle( 'Petal Length vs. Width' )
% # Plot sepal data to inspect how samples cluster % % import numpy as np % from sklearn import datasets % from matplotlib.colors import ListedColormap % import matplotlib.pyplot as plt % % # Use sepal length and width as predictor variables % iris = datasets.load_iris() % X = iris.data[ :, [ 0, 2 ] ] % y = iris.target % % markers = ( 's', 'x', 'o' ) % colors = ( 'red', 'blue', 'lightgreen' ) % cmap = ListedColormap( colors[ :len( np.unique( y ) ) ] ) % % fig = plt.figure() % for i, col in enumerate( np.unique( y ) ): % plt.scatter(\ % x=X[ y==col, 0 ], y=X[ y==col, 1 ],\ % c=[cmap( i )], marker=markers[ i ],\ % edgecolor='black', label=col ) % plt.xlabel( 'Sepal Length (cm)' ) % plt.ylabel( 'Sepal Width (cm)' ) % plt.legend( ['setosa', 'versicolor', 'virginica' ] ) % fig.suptitle( 'Sepal Length vs. Width' )
petal-scatterplot
sepal-scatterplot

Once the predictor and target X and y are set, we need to split them into a test subset and a training subset. We will build our predictive model using the training data, then test its accuracy using the test data. sklearn provides built-in support to divide a labelled dataset using its test_train_split() function.

% # Split data in training and test % % from sklearn.model_selection import train_test_split % % X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.3, stratify=y ) % % print( 'Number of training samples:', X_train.shape[ 0 ] ) % print( 'Number of test samples:', X_test.shape[ 0 ] ) Number of training samples: 105 Number of test samples: 45 % uniq, n = np.unique( y, return_counts = True ) % print( 'Frequency of class values in y:', uniq, n ) % uniq, n = np.unique( y_train, return_counts = True ) % print( 'Frequency of class values in training dataset:', uniq, n ) % uniq, n = np.unique( y_test, return_counts = True ) % print( 'Frequency of class values in test dataset:', uniq, n ) Frequency of class values in y: [0 1 2] [50 50 50] Frequency of class values in training dataset: [0 1 2] [35 35 35] Frequency of class values in test dataset: [0 1 2 ] [15 15 15]

test_size defines the relative size of the test dataset to the training dataset, in this example, 70% for training, 30% for test. By default, test_train_split randomly assigns samples to the training and test datasets. The stratify parameter forces test_train_split to produce training and test sets with a proportion of target classes y in each set equivalent to the original target dataset. That is, the class proportions in each set are equivalent to the relative frequency of each class in the original target dataset.

Next, we want to standardize our predictive variables in the training and test datasets, to try to fit the assumptions of many modelling algorithms of a mean of zero and equal variance for each variable. sklearn's StandardScaler can be used to achieve this, assuming the underlying predictor variable distribution is bell-shaped. Centering and scaling happen on a per-variable basis, with mean and standard deviation saved within the StandardScaler variable for later use during variable transformation.

Note that, if your distribution is skewed, per Dr. Race's preprocessing instructions, you would not want to use a (μ=0, σ=1) standardization. Instead, you can use the MinMaxScaler to scale each feature based on its min–max range.

% from sklearn.preprocessing import StandardScaler % % # Fit a standard scalar to the predictive variables % sc = StandardScaler() % sc.fit( X_train ) % % # Transform predictive datasets based on the scalar % X_train_std = sc.transform( X_train ) % X_test_std = sc.transform( X_test ) % % # Join (stack) the data training and test sets into a sequence of rows % X_comb_std = np.vstack( ( X_train_std, X_test_std ) ) % % # Join (stack) the target training and test sets into a 1D horizontal column % y_comb = np.hstack( ( y_train, y_test ) ) % % # Print some descriptive statistics about training and target % print( 'Standardized data array:' ) % print( X_comb_std[ :5 ], '\n---\n', X_comb_std[ -5: ] ) % print( 'Target array:' ) % print( y_comb ) Standardized data array: [[ 0.20072503 0.12914551] [-1.28030022 -1.31903956] [-1.33726273 -1.31903956] [ 0.71338762 0.65575826] [-1.22333771 -1.31903956]] --- [[ 0.77035013 0.52410507] [ 0.77035013 0.91906464] [ 1.05516268 1.3140242 ] [ 0.42857507 0.12914551] [-1.28030022 -1.31903956]] Target array: [1 0 0 1 0 0 1 1 1 0 1 2 0 1 0 0 2 1 2 1 2 0 0 1 0 1 2 0 2 0 1 0 0 2 2 0 0 0 0 1 0 2 0 0 2 0 1 2 2 2 2 2 1 1 1 1 0 1 2 1 1 1 0 1 1 1 2 0 2 2 0 1 0 1 2 2 2 2 2 2 1 2 0 0 0 2 0 1 1 0 2 2 0 2 0 1 2 1 1 2 1 2 1 2 2 1 2 0 0 2 2 1 2 1 0 1 1 2 0 0 1 0 1 2 2 2 1 0 1 2 0 0 0 1 0 0 1 0 1 2 0 2 2 2 1 1 2 2 1 0]

At this point, the data has been preprocessed into a state that is ready for building classification models. sklearn supports many different modeling algorithms. We will explore four of these: Perceptron (neural net), SVC (support vector classification), KNN (k nearest neighbours), and random forest, an ensemble decision tree approach.

One of the useful advantages of using sklearn exclusively is that classification follows a fairly standard set of steps, regardless of the specific algorithm being used. We will look at Perceptron in some detail, then provide examples for the other three methods, with discussion limited to the specific differences (if any) for each.

The Perceptron classifier is sometimes described as a simple feed-forward single-layer neural net. In visual terms, consider a perceptron approach as one that attempts to choose linear boundaries that separate classes from one another, to the extent that this is possible. Technically, the perceptron method uses an input vector of attribute values and a weight vector of similar length (together with a bias) to compute a value for the input. Based on that value, one of the available classes is suggested. If the suggestion is incorrect, the weight vector is updated. This continues until some type of convergence is achieved.

In sklearn, the Perceptron algorithm falls within the linear_model group. The following code builds a perceptron prediction model for our Iris dataset example.

% from sklearn.linear_model import Perceptron % from sklearn.metrics import accuracy_score % % # Build a perceptron model based on training data, then classify test data % ppn = Perceptron( max_iter=40, eta0=0.1, random_state=0 ) % ppn.fit( X_train_std, y_train ) % y_pred = ppn.predict( X_test_std ) % % # Print results of classification % print( 'Perceptron Model:' ) % print( 'Misclassified samples: %d' % ( y_test != y_pred ).sum() ) % print( 'Accuracy on training data: {:.2f} (out of 1)'.format( ppn.score( X_train_std, y_train ) ) ) % print( 'Accuracy on test data: {:.2f} (out of 1)'.format( ppn.score( X_test_std, y_test ) ) ) Perceptron Model: Misclassified samples: 5 Accuracy on training data: 0.90 (out of 1) Accuracy on test data: 0.89 (out of 1)

As you can see, the code to build a prediction model is simple once the proper datasets have been defined and prepared. We build a Perceptron prediction model, fit it to the training data, and predict target classes with the predict() function. The Perceptron model constructor supports numerous options, including:

Next, we use ppn's score() function to report overall classification accuracy on both the training target dataset y_train and the test target dataset y_test (90% and 89% respectively in our example).

If we instead wanted to use an SVC classifier, only minor modifications would be needed to the perceptron code.

% from sklearn.svm import SVC % % # Build an SVC model based on training data, then classify test data % svc = SVC( kernel='rbf', gamma=0.1, C=1.0, random_state=0 ) % svc.fit( X_train_std, y_train ) % y_pred = svc.predict( X_test_std ) % % # Print results of classification % print( 'SVC Model:' ) % print( 'Misclassified samples: %d' % ( y_test != y_pred ).sum() ) % print( 'Accuracy on training data: {:.2f} (out of 1)'.format( svc.score( X_train_std, y_train ) ) ) % print( 'Accuracy on test data: {:.2f} (out of 1)'.format( svc.score( X_test_std, y_test ) ) )

As you can see, building and evaluating the model is nearly identical to the perceptron example. Next, let's look at k nearest neighbours.

% from sklearn.neighbors import KNeighborsClassifier % % # Build a KNN model based on training data, then classify test data % knn = KNeighborsClassifier( n_neighbors=5, p=2, metric='minkowski' ) % knn.fit( X_train_std, y_train ) % y_pred = knn.predict( X_test_std ) % % # Print results of classification % print( 'KNN Model:' ) % print( 'Misclassified samples: %d' % ( y_test != y_pred ).sum() ) % print( 'Accuracy on training data: {:.2f} (out of 1)'.format( knn.score( X_train_std, y_train ) ) ) % print( 'Accuracy on test data: {:.2f} (out of 1)'.format( knn.score( X_test_std, y_test ) ) ) KNN Model: Misclassified samples: 0 Accuracy on training data: 0.96 (out of 1) Accuracy on test data: 1.00 (out of 1)

Finally, here is code to run a random forest classifier on the iris data.

% from sklearn.ensemble import RandomForestClassifier % % # Build a random forest model based on training data, then classify test data % rf = RandomForestClassifier( n_estimators=100, oob_score=True ) % rf.fit( X_train_std, y_train ) % y_pred = rf.predict( X_test_std ) % % # Print results of classification % print( 'Random Forest Model:' ) % print( 'Misclassified samples: %d' % ( y_test != y_pred ).sum() ) % print( 'Accuracy on training data: {:.2f} (out of 1)'.format( rf.score( X_train_std, y_train ) ) ) % print( 'Accuracy on test data: {:.2f} (out of 1)'.format( rf.score( X_test_std, y_test ) ) ) % print( 'Out-of-bag accuracy: {:.2f} (out of 1)'.format( rf.oob_score_ ) ) Random Forest Model: Misclassified samples: 1 Accuracy on training data: 0.99 (out of 1) Accuracy on test data: 0.96 (out of 1) Out-of-bag accuracy: 0.95 (out of 1) Out-of-bag accuracy: 0.95 (out of 1)

As we discussed above, each approach subdivides 2D space (in our case, since we have two predictor variables) into regions representing the three different iris types. Below are images, along with the dataset samples, showing the class regions for each of the four example algorithms. Samples outlined in black represent the test data y, while the remaining samples represent the training data X.

ppn
svc
knn
forest

Clustering

Clustering attempts to divide a set of samples into clusters based on data properties or patterns identified within the samples. Clustering is unsupervised; no pre-labeled data is required to generate clusters.

During our examination of sklearn's clustering algorithms, we will continue to use the Iris dataset. X represents 150 samples defined by petal length and width, and y represents the class (iris type) for each corresponding sample. Since clustering is unsupervised, we will only use y to calculate accuracies. It will not be needed during the clustering process itself.

We will provide examples for three common clustering algorithms: k-means, agglomerative, and DBSCAN. Similar to classification, a common pattern is used by sklearn, regardless of which clustering algorithm is applied. We will study k-means, as well as some of its strengths and limitations, in some depth. Agglomerative clustering and DBSCAN will be presented, but with a focus on their differences in implementation, requirements, and the type of data they are designed to analyze. If you prefer, you can download a Jupyter notebook with the clustering code from here.

Accuracy. One issue to consider is how to measure accuracy. The idea of determining how well an algorithm splits the Iris dataset into three groups corresponds to how well cluster IDs are attached to elements in the target y. In a perfect scenario, elements with target value yi would have identical cluster IDs cj (remember, in general yi ≠ cj). We can assume the most frequent cj is the cluster ID assigned to elements in class yi. Any other cluster IDs can be considered errors.

The following misclassify() function takes target classes tg and corresponding cluster IDs pred, returning a total misclassification count and a list of misclassifications per target class as a list. Although we will not explain the function in detail, its comments should help interested readers understand how we are attempting to determine overall cluster accuracy.

% from collections import Counter % % def misclassify( tg, pred ): % # Try to determine number of misclassified elements during clustering % # % # tg: Target class % # pred: Predicted cluster ID % % # Get number of classes, clusters, create a "misclassified" label % tg_n = len( set( tg ) ) % pr_n = len( set( pred ) ) % miss = pr_n + 1 % % # Initialize errors per class, set of cluster IDs seen % err = [ 0 ] * tg_n % clust_ID = [] % % # For each target class, compute cluster ID errors % for i in range( 0, tg_n ): % # Grab subset of targets, predictions for current class % beg = i * 50 % tg_sub = tg[ beg: beg + 50 ] % pr_sub = pred[ beg: beg + 50 ] % % # "Mark as error" any cluster ID already assigned to a class % for j,val in enumerate( pr_sub ): % if val in clust_ID: pr_sub[ j ] = miss % % # Build match vector as target class ID minus cluster ID % # Copy over errors from cluster ID vector % match = tg_sub - pr_sub % for j,val in enumerate( pr_sub ): % match[ j ] = miss if val == miss else match[ j ] % % # Build dict of IDs, descending by frequency % count = Counter( match ) % freq = count.most_common() % % # Find most frequent ID that is not "miss" error % max_freq = miss % for val in freq: % if val[ 0 ] != miss: % max_freq = val[ 0 ] % err[ i ] = 50 - val[ 1 ] % break % % # If all entries "miss" error, all entries are incorrect % if max_freq == miss: % err[ i ] = 50 % % # Copy miss to entire subrange of original prediction list % for j in range( 0, len( pr_sub ) ): % pred[ j + beg ] = miss % continue % % max_ID = miss % for j,val in enumerate( match ): % if val != max_freq: pred[ j + beg ] = miss % if val == max_freq: max_ID = j % % # Add cluster ID to list of cluster IDs seen so far % clust_ID = clust_ID + [ pr_sub[ max_ID ] ] % % # Return total errors, error per class % return sum(err), err

k-means clustering divides a dataset into k clusters. Theoretically, the goal of k-means is to partition a set of n samples X = (x1, x2, ..., xn) into k subsets S = {s1, s2, ..., sk} such that within-cluster sum of squares (WCSS or variance) is minimized,

\[ \text{arg$\,$min}_S \sum_{i=1}^{k} \sum_{x \in S_i} ||x-\mu_i||^2 = \text{arg$\,$min}_S \sum_{i=1}^{k} |S_i|\,\text{Var}\,S_i \]

where μi is the mean of the samples in Si.

The basic approach for performing k-means (often known as Lloyd's algorithm) uses the following set of steps:

  1. Randomly choose k cluster centroids within the bounding region of the dataset.
  2. Assign each sample to the closest centroid, producing k clusters.
  3. Compute the centroids for the clusters, producing a new set of k centroids.
  4. Re-assign samples as needed based on the new centroid positions.
  5. Continue iterating in this manner until reasonable convergence is achieved, or until a fixed number of iterations have been performed.

The obvious problem with k-means is how to choose k, the "correct" number of clusters to generate. In our example, we know there are three classes of irises, so we will ask for k=3 clusters. If the proper number of clusters is unknown, the elbow method is often used: start at k=2, compute the sum of squared errors (SSE) between each point and its centroid, move to k=3, and continue until we find a minimum SSE. The k that produces this is selected at the "optimal" number of clusters. Notice that this choice is based completely on mathematical considerations, and may not necessarily match the semantics of the data being clustered, particularly since neither WCSS nor SSE consider data context.

% from sklearn.cluster import KMeans % % # Partition data into three clusters % km = KMeans( n_clusters = 3, random_state = 101 ) % km.fit( X ) % y_pred = km.predict( X ) % % n = len( y_pred ) % err = misclassify( y, y_pred ) % % # Print accuracy results % print( 'k-Means Clustering:' ) % print( 'Misassigned samples:', err[ 0 ] ) % print( 'Accuracy: {:.2f} (out of 1)'.format( ( n - err[ 0 ] ) / n ) ) k-Means Clustering: Misassigned samples: 6 Accuracy: 0.96 (out of 1)
k-means

Remember that, because the initial centroids are randomly chosen, k-means results are stochastic (not stable). Rerunning the algorithm with a different random_state seed value can produce different results, misassigned sample counts, and accuracies.

Another common clustering approach is agglomerative clustering, which builds a hierarchical, bottom-up collection of clusters. Here, we repeatedly take the two most similar elements and combine them into a cluster until X forms a single cluster. What we mean by "most similar" is critical to how the clusters will form, and is based on a linkage strategy and a distance metric (e.g., Euclidean, Manhattan, Hamming, cosine, and so on). In all cases, we assume a single element forms a 1-element cluster. Linkage strategies include:

sklearn supports ward, complete, and average when the agglomerative clustering object is created, with a default of Ward.

% from sklearn.cluster import AgglomerativeClustering % % Clustering( n_clusters=3, linkage='average' )

The sequence of combinations performed produces a dendrogram, a tree with the root at the top (a single cluster containing X), and each split generating two children as we walk down the dendrogram. To choose k clusters, we simply pick the k clusters that are closest to the root cluster.

Agglomerative clustering is implemented in a nearly identical manner to k-means. The one difference is that you can fit a model, or fit and return predictions. The AgglomerativeClustering module does not provide a separate predict() function.

% from sklearn.cluster import AgglomerativeClustering % % # Partition data into three clusters % agg = AgglomerativeClustering( n_clusters = 3 ) % y_pred = agg.fit_predict( X ) % % n = len( y_pred ) % err = misclassify( y, y_pred ) % % # Print accuracy results % print( 'Agglomerative Clustering:' ) % print( 'Misassigned samples:', err[ 0 ] ) % print( 'Accuracy: {:.2f} (out of 1)'.format( ( n - err[ 0 ] ) / n ) ) Agglomerative Clustering: Misassigned samples: 6 Accuracy: 0.96 (out of 1)
agglomerative

Our final example uses DBSCAN (Density-BaSed Clustering of Applications with Noise). DBSCAN is a spatial algorithm that clusters points that are "closely packed" to one another. Specifically, basic DBSCAN uses the following definitions to build its clusters.

Given these definitions, core point p forms a cluster with all points (core or non-core) that can reach p. Every cluster contains at least one core point. Non-core points form the cluster's edge, since they cannot be used to reach additional points.

Implementation of DBSCAN is identical to AgglomerativeClustering, except we do not specify a number of clusters. This is determined automatically by DBSCAN's ε and min values (eps and min_samples arguments during the creation of a DBSCAN object, with defaults of 0.5 and 5, respectively.)

% from sklearn.cluster import DBSCAN % % # Partition data into three clusters % db = DBSCAN() % y_pred = db.fit_predict( X ) % % n = len( y_pred ) % err = misclassify( y, y_pred ) % % # Print accuracy results % print( 'DBSCAN Clustering:' ) % print( 'Misassigned samples:', err[ 0 ] ) % print( 'Accuracy: {:.2f} (out of 1)'.format( ( n - err[ 0 ] ) / n ) ) Agglomerative Clustering: Misassigned samples: 50 Accuracy: 0.67 (out of 1)
dbscan

Notice that DBSCAN produces a much larger misassignment error than either k-means or agglomerative clustering. Examining the cluster plot makes the issue clear: DBSCAN merged iris classes 2 and 3 together, producing one class that was perfectly clustered, and one that was entirely misassigned. This points to a common issue with DBSCAN in particular, and spatial clustering algorithms in general: when semantically different groups of elements overlap, they are often merged, since DBSCAN has no notion of what each element represents, only distances, cluster sizes, and how they compare to ε and min.

Imputation

Imputation of values in a dataset is another function provided by sklearn. It is common to receive raw data files with missing values, errors, or other types of incorrect values. Imputation allows you to replace these values through techniques like estimation, replacement with constants, or replacement with statistically derived results on the values that are available. If you prefer, you can download a Jupyter notebook with the imputation code from here.

sklearn provides an SimpleImputer that, in its default form, implements three basic strategies for imputing missing values: (1) use the column's mean; (2) use the column's median; or (3) use the column's most frequent value. The axis argument can modify this: axis=0 for columns, axis=1 for rows (default is axis=0.) These approaches are reasonable, although not particularly sophisticated.

% import numpy as np % from sklearn.impute import SimpleImputer % % X = [[np.nan,2,3,1,5,2], [6,np.nan,1,8,2,1], [7,6,np.nan,np.nan,7,np.nan]] % print( 'X: ', X ) % % imp = SimpleImputer( missing_values=np.nan, strategy='mean' ) % imp.fit( X ) % % print( 'Impute w/mean:' ) % print( imp.transform( X ) ) % % imp = SimpleImputer( missing_values=np.nan, strategy='most_frequent' ) % imp.fit( X ) % % print( 'Impute w/most frequent:' ) % print( imp.transform( X ) ) X: [[nan, 2, 3, 1, 5, 2], [6, nan, 1, 8, 2, 1], [7, 6, nan, nan, 7, nan]] Impute columns w/mean: [[6.5 2. 3. 1. 5. 2. ] [6. 4. 1. 8. 2. 1. ] [7. 6. 2. 4.5 7. 1.5] Impute rows w/most frequent [[6. 2. 3. 1. 5. 2.] [6. 2. 1. 8. 2. 1.] [7. 6. 1. 1. 7. 1.]]

Rather than identifying missing values, you may want to apply a custom transformer to "derive" new values based on existing values in your dataset. sklearn allows this through its FunctionTransformer object. To use it, we build a FunctionTransformer with a specific function attached to it. We can then use that function to transform all values in a numpy Array, just like any built-in sklearn transformer.

% import numpy as np % from sklearn.preprocessing import FunctionTransformer % % X = [[0,1,2],[3,4,5],[6,7,8]] % print( 'X:', X ) % % trans = FunctionTransformer( np.log1p ) % print( 'Transform using 1+log(p):' ) % print( trans.transform( X ) ) X: [[0, 1, 2], [3, 4, 5], [6, 7, 8]] Transform using 1+log(p): [[0. 0.69314718 1.09861229] [1.38629436 1.60943791 1.79175947] [1.94591015 2.07944154 2.19722458]]

pandas also provides a number of functions specifically designed to handle NaN and perform different types of correction or imputation. For example, pandas's isnull() function returns True for NaN entries, and False otherwise. We can take advantage of the face that True evaluates to 1 and False evaluates to 0 to count the number of NaNs in a data frame. The following examples will use a U.S. Census construction CSV file to demonstrate pandas's capabilities. We will work with the EMP column, which is a Census definition for number of employees.

% import pandas as pd % import os % % os.chdir( 'C:/users/healey/Downloads' ) % census_df = pd.read_csv( 'construction-census.csv', sep=',', skiprows=[ 1 ] ) % % # Count NaN, True=1, False=0, so sum True/False isnull() value list % nan_n = census_df.isnull().values.sum() % print( nan_n, 'values out of', census_df.size, 'values total, {:.2f}%'.format( nan_n / census_df.size * 100 ) ) 25905 values out of 170100 values total, 15.23%

One common use of isnull is to identify and remove columns from a data frame that are entirely NaN. This is important, because these types of columns can interfere with normal pandas analytic operations.

Notice that in all the following examples, we make a copy of the data frame with the copy command. This ensures that we do not change any of the values in the original data frame. You cannot copy a data frame with df_cp = df. If you do this, df_cp will simply reference df. Referencing means the variables df and df_cp with both point to the same data frame, so any changes through either will affect the original data. The copy() function ensures an entirely new data frame is created, containing values identical to the original data frame.

% import os % % # First, make a copy of the original data frame % os.chdir( 'C:/users/healey/Downloads' ) % census_df = pd.read_csv( 'construction-census.csv', sep=',', skiprows=[ 1 ] ) % df_cp = census_df.copy() % % # Check first 15 columns, remove if entirely NaN % row_n = df_cp.shape[ 0 ] % col_nm = list( df_cp )[ :15 ] % % for nm in col_nm: % nan_n = df_cp[ nm ].isnull().values.sum() % if nan_n == row_n: % print( nm, 'has only NaN values, removing...' ) % del df_cp[ nm ] % elif nan_n > 0: % print( '{:s}, {:d} NaN values ({:.2f}%)'.format( nm, nan_n, ( nan_n / row_n * 100 ) ) ) % else: % print( nm, 'has no NaN values' ) GEO.id has no NaN values GEO.id2 has no NaN values GEO.display-label has no NaN values GEO.annotation.id has only NaN values, removing... NAICS.id has no NaN values NAICS.display-label has no NaN values NAICS.annotation.id has only NaN values, removing... YEAR.id has no NaN values ESTAB has no NaN values ESTAB_S has no NaN values EMP, 38 NaN values (2.41%) EMP_S, 38 NaN values (2.41%) PAYANN, 57 NaN values (3.62%) PAYANN_S, 57 NaN values (3.62%) BENEFIT, 61 NaN values (3.87%)

It's not necessary that columns be entirely NaN to be removed. It would be easy, for example, to remove any column whose percentage of NaN was over some reasonable threshold, for example, 60%, with if nan_n > 0.6 * row_n:.

In terms of replacing NaNs, pandas can fill them with a constant value with ffill(), moving the most recent non-NaN value forward, or bfill(), moving the next non-NaN value backwards.

% import os % % # First, make a copy of the original data frame % os.chdir( 'C:/users/healey/Downloads' ) % census_df = pd.read_csv( 'construction-census.csv', sep=',', skiprows=[ 1 ] ) % df_cp = census_df.copy() % % # We can fill NaNs in a column with a constant value % df_cp = census_df.copy() % % print( 'Original EMP column:' ) % print( df_cp.iloc[ 217:223 ][ 'EMP' ] ) % % print( 'Constant value imputation:' ) % df_cp[ 'EMP' ] = df_cp[ 'EMP' ].fillna( 0 ) % print( 'Column EMP now has', df_cp[ 'EMP' ].isnull().values.sum(), 'NaNs' ) % print( df_cp.iloc[ 217:223 ][ 'EMP' ] ) % % # We can backfill (move next value back) or pad (more last value forward) % df_cp = census_df.copy() % print( 'Backfill value imputation:' ) % df_cp[ 'EMP' ] = df_cp[ 'EMP' ].bfill() % print( df_cp.iloc[ 217:223 ][ 'EMP' ] ) % % df_cp = census_df.copy() % print( 'Pad value imputation:' ) % df_cp[ 'EMP' ] = df_cp[ 'EMP' ].pad() % print( df_cp.iloc[ 217:223 ][ 'EMP' ] ) Original EMP column: 217 659.0 218 NaN 219 431.0 220 944.0 221 NaN 222 1143.0 Name: EMP, dtype: float64 Constant value imputation: Column EMP now has 0 NaNs 217 659.0 218 0.0 219 431.0 220 944.0 221 0.0 222 1143.0 Name: EMP, dtype: float64 Backfill value imputation: 217 659.0 218 431.0 219 431.0 220 944.0 221 1143.0 222 1143.0 Name: EMP, dtype: float64 Pad value imputation: 217 659.0 218 659.0 219 431.0 220 944.0 221 944.0 222 1143.0 Name: EMP, dtype: float64

Finally, if we integrate into scipy, the scientific Python toolkit sklearn extends, we can gain access to a larger number of interpolation methods. For example, rather than performing a linear interpolation of missing values, we can use a cubic spline or a second or third-degree polynomial to produce a "smooth" interpolation of both available and missing values.

% from scipy.interpolate import interp1d % import matplotlib.pyplot as plt % import os % import numpy as np % % # More recent versions of pandas support new methods of interpolation through scipy % % # First, make a copy of the original data frame % os.chdir( 'C:/users/healey/Downloads' ) % census_df = pd.read_csv( 'construction-census.csv', sep=',', skiprows=[ 1 ] ) % df_cp = census_df.copy() % % # Original subset of EMP values with numerous NaNs % df_cp = census_df.copy() % nan_df = df_cp[ 'EMP' ].isnull() % % fig = plt.figure() % fig.suptitle( 'Employee Count by Sample ID' ) % plt.xlabel( 'Sample ID' ) % plt.ylabel( 'Number of Employees' ) % df_cp.iloc[ 219:247 ][ 'EMP' ].plot() % plt.show() % % # Replace NaN with linear interpolation % df_cp = census_df.copy() % df_cp[ 'EMP' ] = df_cp[ 'EMP' ].interpolate( method='linear' ) % % fig = plt.figure() % fig.suptitle( 'Linear vs. Cubic Interpolation of Employee Count' ) % plt.xlabel( 'Sample ID' ) % plt.ylabel( 'Number of Employees' ) % df_cp.iloc[ 219:247 ][ 'EMP' ].plot() % % # Replace NaN with cubic interpolation % df_cp = census_df.copy() % df_cp[ 'EMP' ] = df_cp[ 'EMP' ].interpolate( method='cubic' ) % x = df_cp.iloc[ 219:247 ][ 'EMP' ].index % y = df_cp.iloc[ 219:247 ][ 'EMP' ] % f = interp1d( x, y, kind='cubic' ) % % x = np.linspace( 219, 246, num=50, endpoint=True ) % plt.plot( x, f( x ), '--' ) % plt.legend( [ 'linear', 'cubic' ] ) % plt.show()
orig-EMP
lin-cubic-EMP

interp1d supports numerous additional interpolation methods like polynomials, splines, and nearest, previous, and next, which produce stair-step interpolation plots. It is also possible to interpolate multivariate data superimposed on a uniform grid using scipy's griddata() function.

Web Parsing

The ability read and parse web page content is often very useful, since data is commonly available on a web page, but in a format that's not convenient to simply copy and paste.

Two Python libraries are used to simplify web scraping: urllib and BeautifulSoup 4. urllib supports posting HTTP requests to a web server. In our case, this is normally a request for HTML content of a target web page. BeautifulSoup parses HTML into a parse tree, then provides operations that allow you to find specific information within the tree, for example, all the HTML links, or text attached to an HTML ID in the document.

These are the steps you would follow to read and parse a web page.

This code will read the HTML for this tutorial, then return all the HTML links in the document.

% import urllib % from bs4 import BeautifulSoup % % url = urllib.request.urlopen( 'https://www.csc2.ncsu.edu/faculty/healey/msa/python/index.html' ) % doc = url.read() % tree = BeautifulSoup( doc ) % links = tree.find_all( 'a' ) % for l in links: % print( l.get( 'href' ) ) % https://www.csc2.ncsu.edu/faculty/healey https://www.ncsu.edu https://www.python.org …
https://www.crummy.com/software/BeautifulSoup

One you have a parse tree, most of your work will involve navigating the tree to find the things you're looking for. BeautifulSoup provides numerous operations to search within the tree. This can be done by HTML tag, by ID, by string value, and so on. This requires work on your part, however.

For example, suppose you parsed this HTML page using BeautifulSoup into a variable called tree. You could search for the div with a specific id and the value within that div as follows.

Another common task is finding specific text within a web page. For example, we might want to find all the paragraph (<p> ... </p>) blocks that contain the string Python. The obvious way to do this might be as follows.

% tree.find_all( 'p', string='Python' ) []

Notice the result is an empty list, even though we know that the string Python occurs multiple times within this document. The problem is that the command above asks for a paragraph block that contains EXACTLY the string Python, not a paragraph block that contains at least one occurrence of the string Python. To correct this, we pass the string as a regular expression. There is insufficient space to delve into the details of regular expressions (REs), but suffice it to say that REs allow us to specify text with wildcards, options on parts of text that occur in the string we're specifying, whether a string has to start at the beginning or end of a line, the number of times a string has to occur, and many other conditions. REs are extremely powerful, and also fairly difficult to master. Moreover, although all programming languages include support for REs, they often specify REs in slightly different ways, making it difficult to port anything but the most simple REs between languages.

Regardless of these issues, in our case it is very simple to convert the from looking for a paragraph block with exactly Python to a paragraph block containing Python by simple asking the regular expression module re to convert the explicit string into a RE that represents "an occurrence of this string."

% import re % % tree.find_all( 'p', string=re.compile( 'Python' ) ) [<p>Since Python is an interpreted language, individual lines of code are converted to machine language and executed as they are % it works as it should.</p>, <p>If you want a quick summary of the steps I use to try to install Python packages in the Anaconda environment, they're listed in order below</p>]

Now, all paragraph blocks containing the term Python are returned as elements of a list of results.

Understanding the structure of a web page can't be done for you. It's something you'll need to complete as a first step for extracting data from the web page. Perhaps more importantly, if a web page doesn't have any useful structure, then it might be difficult (or impossible) to isolate the data you're interested in. Usually, however, web pages are built with a logical design. In these cases, web scraping can save you significant time and effort over manually copying or re-typing individual data values.

Practice Examples

For practice, here are five Practical Python practice problems. A short description of the problem is provided, with one possible solution presented. I strongly recommend that you attempt your own solution first, before looking at the one we provide.

Cows and Bulls

Create a program to play the game "cows and bulls." This is how the game works.

For example, if the target was 7934 and the user guessed 7274, they would receive two cows for the first 7 and the 4, and two bulls.

A basic pseudocode outline for how you might want to structure your program design could be something like this.

Cows and Bulls Pseudocode

  1. Initialize the guesses to proper starting value.
  2. Generate a target value to guess.
  3. While the number of cows isn't four, loop.
  4. Initialize the number of cows and bulls to proper starting values.
  5. Get an input on the valid range of 0 through 9999.
  6. Determine how many cows (correct digits in the correct position) and bulls (incorrect digits in a given position) the input contains, and report it to the user.
  7. Update the number of guesses prior to returning to the while loop's condition.
  8. Once the user has correctly guessed the target (four cows), you will exit the while loop. At this point you could end the game or ask if the user wants to play again.

Here is one solution to the Cows and Bulls game. If you're practising, you should program your own solution before looking at ours.

Cows and Bulls Solution

% import random % % def generate_tg(): % """Generate a random four-digit string % % Generate a random digit between 0-9999, then covert it to a string % with padded zeros 0000-9999, used for the cows and bulls guessing game. % % Returns: % str: A four-character string in the range 0000-4000 % """ % % v = random.randint( 0, 9999 ) % v_str = f'{v:04d}' % % return v_str % % % def cmp_tg( inp, tg ): % """Compare target and input % % Compare 4-digit target to 4-digit input, each matched digit is a cow, each % mismatched digit is a bull, return number of cows and bulls % % Args: % inp (str): 4-digit input as a string 0000-9999 % tg (str): 4-digit target as a string 0000-9999 % % Returns: % tuple (cow,bull): Number of cows and bulls as a tuple % """ % % # Ensure correct input and target format % % inp_v = int( inp ) % if len( inp ) != 4 or inp_v < 0 or inp_v > 9999: % print( 'cmp_tg(), incorrect input ' + inp + ', must be (0000-9999)' ) % return (0,4) % % tg_v = int( tg ) % if len( tg ) != 4 or tg_v < 0 or tg_v > 9999: % print( 'cmp_tg(), incorrect target ' + tg + ', must be (0000-9999)' ) % return (0,4) % % # Initialize number of cows and bulls % % cow = 0 % bull = 0 % % # Loop over four digits, matches increase cows, mismatches increase bulls % % for i in range( 0, 4 ): % if inp[ i ] == tg[ i ]: % cow = cow + 1 % else: % bull = bull + 1 % % return (cow,bull) % % # Main game loop % % # Initialize cows, bulls, and number of guesses made % % cow = 0 % bull = 4 % guess = 0 % % # Generate target to guess % % tg = generate_tg() % % # Continue until four cows, all input digits match target % % while cow != 4: % % # Continue looping until valid input % % inp = -1 % while inp < 0 or inp > 9999: % inp = input( 'Enter a value between 0000 and 9999: ' ) % try: % inp = int( inp ) % except: % inp = 0 % % # Convert input to 4-digit string, get cows and bulls % % inp_s = f'{inp:04d}' % cow,bull = cmp_tg( inp_s, tg ) % % guess += 1 % print( f'{cow} cow(s), {bull} bull(s)' ) % % print( f'It took you {guess} guess(es) to match the target' )

Birthday Plots

Create a program to assign a random number of birthdays to each month, based on a maximum number of birthdays defined by the user.

Birthday Plot Pseudocode

  1. Initialize the number of birthdays for each month to zero.
  2. Ask the user for the maximum number of birthdays for a month.
  3. Assign a random number between 0 and month-maximum for each month.
  4. Plot the results using pyplot.

Plot the number of birthdays in a bar chart using pyplot. You can Google search for simple instructions on how to import and use pyplot to plot a bar graph (e.g., from somewhere like here).

Birthday Plot Solution

% import random % import matplotlib.pyplot as plt % % def gen_birthday( bday_max): % """Generate birthdays for 12 months % % Generate random number of birthdays for each of the twelve months, up to a % maximum number specified by the caller. % % Args: % bday_max (int): An integer specifying the maximum number of birthdays for any month % % Returns: % int list: A list of 12 non-negative integers, number of birthdays per month % """ % % # Initialize birthdays to 0 for each month % % bday_n = [ 0 ] * 12 % % # Ensure maximum birthday is valid % % if bday_max < 0: % print( 'gen_birthday(), maximum birthdays per month', bday_max, 'is invalid' ) % return bday_n % % for i in range( 0, 12 ): % bday_n[ i ] = random.randint( 0, bday_max ) % % return bday_n % % # Mainline % % # Define months % % month = [ 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec' ] % % # Generate random birthdays per month % % birthday = gen_birthday( 30 ) % % # Use pyplot to plot birthday frequencies per month % fig = plt.figure( figsize=(8,6) ) % bar = plt.bar( month, birthday, 0.75 ) % % # Put frequency count on top of each bar % % for i,rect in enumerate( bar ): % ht = rect.get_height() % plt.text( rect.get_x() + rect.get_width() / 2, ht, '%s' % ht, ha='center', va='bottom' ) % % # Annotate the visualization % % plt.ylabel( 'Birthday(s)' ) % plt.grid( axis='y', linestyle='dotted' ) % plt.xlabel( 'Month' ) % plt.title( 'Birthday(s) per Month' ) % plt.show()

Morse Code

Morse code, also known as American or Railroad Morse code, was developed by Samuel F. B. Morse in the United States during the 1830s to represent letters of the alphabet, numerals, and punctuation for electric telegraphy using an arrangement of dots and dashes. The original Morse code was further improved by Alfred Lewis Vail, Morse's assistant and partner. Later use of Morse code in Europe identified that it was insufficient to support non-English text like diacritical marks. This led to the development of International Morse Code, or Continental Morse Code as it is now known, in 1951.

The most famous Morse code sequence is probably the one for SOS (Save Our Souls), denoted dot-dot-dot, dash-dash-dash, dot-dot-dot. This is because dot-dot-dot represents S, and dash-dash-dash represents O in Morse code.

Create a program to decode a message into Morse code. Morse code uses a sequence of dots and dashes to encode letters, numbers, and the space character. The table below shows the conversion from American Morse code to individual characters. A single space separates the dot–dash pattern for each character. Three spaces represent a space character. You can use period for dot and hyphen for dash in your program.

Code Character Code Character Code Character Code Character
· –A – · · ·B – · – ·C – · ·D
·E · · – ·F – – ·G · · · ·H
· ·I · – – –J – · –K · – · ·L
– –M – ·N – – –O · – – ·P
– – · –Q · – ·R · · ·S T
· · –U · · · –V · – –W – · · –X
– · – –Y – – · ·Z · – – – –1 · · – – –2
· · · – –3 · · · · –4 · · · · ·5 – · · · ·6
– – · · ·7 – – – · ·8 – – – – ·9 – – – – –0

A basic pseudocode outline for how you might want to structure your program design could be something like this.

Morse Code Pseudocode

  1. Build a data structure to hold the letter-number to dot-dash conversions.
  2. Ask the user to enter a sentence.
  3. Convert the sentence to uppercase.
  4. Use your data structure to convert each letter, number, and space to its equivalent dot-dash pattern. You can ignore anything that's not a letter, number, or space.
  5. Print the conversion for the user.
  6. Ask the user if they want to enter another sentence. Based on this, accept another sentence for conversion or quit the program.

Here is one solution to the Morse code program. If you're practising, you should program your own solution before looking at ours.

Morse Code Solution

% code_cvt = { % "A": ".-", % "B": "-...", % "C": "-.-.", % "D": "-..", % "E": ".", % "F": "..-.", % "G": "--.", % "H": "....", % "I": "..", % "J": ".---", % "K": "-.-", % "L": ".-..", % "M": "--", % "N": "-.", % "O": "---", % "P": ".--.", % "Q": "--.-", % "R": ".-.", % "S": "...", % "T": "-", % "U": "..-", % "V": "...-", % "W": ".--", % "X": "-..-", % "Y": "-.--", % "Z": "--..", % " ": " ", % "1": ".----", % "2": "..---", % "3": "...--", % "4": "....-", % "5": ".....", % "6": "-....", % "7": "--...", % "8": "---..", % "9": "----.", % "0": "-----" % } % % # Mainline % % cvt = True % % while cvt: % sent = input( 'Enter a sentence: ' ) % sent = sent.upper() % % code = '' % orig = '' % for c in sent: % c = c % if c in code_cvt.keys(): % code = code + f'{code_cvt[ c ]:5s} ' % if c == ' ': % c = 'spc' % orig = orig + f'{c:5s} ' % % print( code ) % print( orig ) % % inp = input( 'Do you want to enter another sentence (Y/N)? ' ) % inp = inp.upper() % % cvt = ( inp == 'Y' )

Hangman

Create a program to play the game "hangman." Here are instructions on how the game works.

Here is a JSON file with a few thousand words you can use to select from for the word to guess. You can read the contents of this file into a list using the following code.

% import json % with open( 'words.json' ) as inp: % data = json.load(inp) % words = data[ "data" ]

The basic game-play would follow a pattern something similar to this.

Hangman Pseudocode

  1. Randomly choose a target word.
  2. Present the user with the number of letters in the target word, and fill in any letters they'd properly guessed (initially there would be no letters guessed).
  3. Ask the user to enter a letter that hasn't already been selected.
  4. Determine if that letter is part of the target word that the user has not guessed yet.
  5. Let the user know whether their guess was in the (remainder of the) word or not.
  6. Update letters guessed correctly if a correct guess was made, or number of mistakes if an incorrect guess was made.
  7. Continue until the player guesses the entire word (success), or until six incorrect guesses are made (failure).

Here is one solution to the Hangman game. If you're practising, you should program your own solution before looking at ours.

Hangman Solution

% import json % import random % % hangman = [ # Hangman progress as mistakes are made % """ % +--| % | % | % | % """, % """ % +--| % | % | % / | % """, % """ % +--| % | % | % / \ | % """, % """ % +--| % | % / | % / \ | % """, % """ % +--| % | % /| | % / \ | % """, % """ % +--| % | % /|\ | % / \ | % """, % """ % +--| % o | % /|\ | % / \ | % """, % ] % % def choose_word(word_list): % try: % return random.choice(word_list).lower() % except: % return "random" % % def new_game(): % c = "" % while c != "Y" and c != "N": % c = input("Would you like to play again (Y or N)? ") % c = c.upper() % return c == "Y" % % def get_input(avail): % for c in avail: % print(f"{c} ", end="") % print() % % c = input("Choose an available character: ") % c = c.upper() % % if len(c) != 1: % print("You must choose a single character") % elif ord(c) < ord("A") or ord(c) > ord("Z"): % print('You must choose a character between "A" and "Z"') % elif c not in avail: % print(f"Character {c.upper()} has already been selected") % % pos = avail.index(c) % avail[pos] = "_" % % return (c.lower(), avail) % % def print_word(word, lbl=""): % if len(lbl) > 0: % print(f"{lbl}: ", end="") % % for c in word: % print(f"{c} ", end="") % print("\n") % % def read_word(): % with open("words.json") as inp: % data = json.load(inp) % % return data["data"] # Return word list % % # Mainline % % words = read_word() % % game = True % while game: % tg_word = choose_word(words) % known_word = "_" * len(tg_word) % % correct = 0 % mistake = 0 % avail = [chr(v) for v in range(ord("A"), ord("Z") + 1)] % % print() % print(hangman[mistake]) % print_word(known_word, "Word") % print() % % while correct != len(tg_word) and mistake < 6: % c, avail = get_input(avail) % % if c not in tg_word: % mistake += 1 % print(f"\n{c.upper()} is not in the word to guess. ", end="") % print(f"{6-mistake} guesses left.") % else: % print(f"\n{c.upper()} is in the word to guess.") % % pos = -1 % while c in tg_word[pos + 1 :]: % pos = tg_word.index(c, pos + 1) % known_word = ( known_word[0:pos] + c.upper() + known_word[pos + 1 :] ) % correct += 1 % % print(hangman[mistake]) % print_word(known_word, "Word") % print() % % if correct == len(tg_word): % print(f"Congratulations! You guessed the word {tg_word.upper()}") % elif mistake == 6: % print("Sorry, you've hung yourself...") % print(f"{tg_word.upper()} was the correct word\n") % % game = new_game()

Friday the 13th

Friday the 13th is considered an unlucky day in many countries, but not all. There is no definitive answer as to why this day is considered unlucky, but a common belief is that it originates from Judas Iscariot being the 13th guest to arrive at the last supper. Another explanation is that in Norse lore, 12 gods came together at Valhalla and Loki, the mischievous god, arrived unannounced as the 13th guest.

Regardless of its origin, in this problem you are asked to find all occurrences of Friday when it is the 13th day of the month, for a given year chosen by a user. This will allow you to practice using the datetime data structure in Python. datetime is specifically designed to allow date and time manipulation in a simple manner. For example, with a datetime variable, you can easily perform the following (often more complicated) tasks.

strptime

strptime converts a string containing a date, a time, or both into a Python datetime object, a standard data structure Python uses to record information about a specific date and/or time. To use the method, you specify the string to convert, and the format of the string, which defines the type and location of date, time, and character separator elements within the string like hour, minute, second, month, day, year, colon, backslash, and so on. Certain elements can be specified in different ways. For example, a day can be numeric (1, 2, \(\ldots\), 30), zero-padded (01, 02, \(\ldots\), 30), or a full name (Monday, Tuesday, \(\ldots\), Sunday). Month can be numeric from 1 to 12, zero-padded from 01 to 12, as a full name from January to Decemeber, or an abbreviated name from Jan to Dec. The documentation details all available formats strptime recognizes. Here are some examples of converting different types of date, time, and date plus time strings into datetime objects.

% from datetime import datetime % % dt_tm = datetime.strptime( "Jan 01, 2024", "%b %d, %Y" ) % dt_tm = datetime.strptime( "March 3/24", "%B %d/%y" ) % dt_tm = datetime.strptime( "11:30:30", "%I:%M:%S" ) % dt_tm = datetime.strptime( "10:00:00.005999PM", "%I:%M:%S.%f%p" ) % dt_tm = datetime.strptime( "18:15:00", "%H:%M:%S" ) % dt_tm = datetime.strptime( "Nov 09/24, 12:00:00PM", "%b %d/%y, %I:%M:%S%p" )

Once you've converted the date and/or time string into a datetime object, you can access any of the elements of the date and time as components of the object.

% from datetime import datetime; % % dt_tm = datetime.strptime( "Nov 09/22, 2:00:15PM", "%b %d/%y, %I:%M:%S%p" ); % print( dt_tm.year ); 2022 % print( dt_tm.month ); 11 % print( dt_tm.day ); 9 % print( dt_tm.hour ); 14 % print( dt_tm.minute ); 0 % print( dt_tm.second ); 15 % print( dt_tm.microsecond ); 0 % print( dt_tm.weekday() ); 2

One reason we want to work with datetime objects and not strings representing dates and/or times is because datetime objects understand and automatically handle things like number of days in a given month, leap years, and so on. Two very useful functions are timedelta() and relativedelta(), which use a datetime's understanding of date and time-specific information to allow you to increment or decrement a datetime object by specific properties and amounts of that property. If you plan on using relativedelta, be sure the read its documentation. It can handle incrementing and decrementing properties like month and year that timedelta cannot, in addition to a wide range of useful operations, e.g., finding the last Friday of the month, checking for leap years, and incrementing or replacing properties of a datetime object.

% from datetime import datetime; % from datetime import timedelta; % from dateutil.relativedelta import relativedelta; % % dt_tm = datetime.strptime( "Dec 31/22, 2:00:15PM", "%b %d/%y, %I:%M:%S%p" ); % dt_tm = dt_tm + timedelta( hours=12 ) % print( dt_tm.year ); 2023 % print( dt_tm.month ); 1 % print( dt_tm.day ); 1 % print( dt_tm.hour ); 2 % % dt_tm = dt_tm + timedelta( days=-15 ) % print( dt_tm.year ); 2022 % print( dt_tm.month ); 12 % print( dt_tm.day ); 17 % dt_tm = dt_tm + relativedelta( months=+2 ) % print( dt_tm.year ); 2023 % print( dt_tm.month ); 2 % print( dt_tm.day ); 17 % dt_tm = dt_tm + relativedelta( month=10 ) % print( dt_tm.month ); 10

strftime

strftime is the complement to strptime. Rather than converting a string to a datetime object, it converts a datetime object to a formatted string, suitable for printing in your program. The codes used to format the string are identical to those used to parse the string during datetime creation.

% from datetime import datetime; % from datetime import timedelta; % % dt_tm = datetime.strptime( "Jan 01, 2024", "%b %d, %Y" ); % dt_tm = dt_tm + timedelta( days=130 ); % dt_str = dt_tm.strftime( "%b %d, %Y" ); % print( dt_str ); May 10, 2024

Managing Zero Padding

Many of the online documentation pages for strptime and strftime refer to formatting options like %-d, %-m, or %-I to refer to days, month numbers, or hours without leading zeros. Unfortunately, this is Linux ONLY and WILL NOT WORK on Windows. So, how do you deal with non-zero padded date and time elements on Windows? For strptime this is normally handled automatically. For example, for a day of "5" or "05" the formatting code %d will correctly parse both examples. The same is true for other date and time properties that can optionally be zero padded.

For creating strings with strftime, we need to augment strftime with some type of string formatting. For example, we can use Python f-strings to format elements of a datetime object to avoid leading zeros.

% from datetime import datetime; % % dt_tm = datetime.strptime( "Jan 01, 2024", "%b %d, %Y" ); % % # Print Jan 01, 2024 % % dt_str = dt_tm.strftime( "%B %d, %Y" ); % print( dt_str ); Jan 01, 2024 % % # Notice that even tho our string is "Jan 1" not "Jan 01" % # the use of %d in strptime continues to work and recognize % # the day even without the leading zero % % dt_tm = datetime.strptime( "Jan 01, 2024", "%b %d, %Y" ); % % # Print January 1, 2024 % % # Here we use the strftime %-format codes %B for full month % # name and %Y for 4-digit year, but we explicitly query % # the day property of the date-time object and use :d to % # format it as a decimal without any leading zeros % % dt_str = f"{dt_tm:%B} {dt_tm.day:d}, {dt_tm:%Y}"; % print( dt_str ); January 1, 2024

Time Zones

As discussed in class, time zones are difficult to deal with because they are ambiguous. Python's datetime documentation states "...rules for time adjustment across the world are more political than rational, change frequently, and there is no standard suitable for every application..." However, there are times where we need to manage timezone-formatted times. Python has its own built-in tzinfo class built on Coordinated Universal Time (UTC) to provide basic timezone processing. More recently, zoneinfo was added to provide an implementation of the IANA (Internationally Assigned Numbers Authority) time zone database, a collection of codes and data that represent the local time for many locations throughout the world. External libraries build on this to offer much more extensive timezone manipulation. Some of the basic operations tzinfo provides include UTC offset, daylight savings time identification, time zone naming, and integration into strptime and strftime. All three are part of the datetime object structure.

% from datetime import datetime; % from datetime import tzinfo; % import zoneinfo; % % cur_dt_tm = datetime.now() % % # By default, no timezone information is included as % # part of a datetime object. This is knows as a "naive" % # datetime object in Python % % print( cur_dt_tm.tzinfo ); None % # Convert the current datetime to be timezone aware and % # print the offset from UTC, which for EST during daylight % # savings is four hours to the West (-4:00) % % cur_dt_tm_tz = cur_dt_tm.astimezone() % print( cur_dt_tm_tz ) 2024-04-29 14:59:48.010824-04:00 % print( cur_dt_tm_tz.tzinfo ) Eastern Daylight Time % % # As of Python 3.9, the zoneinfo class allows basic % # manipulation of timezone information % % cur_dt_tm_tz = cur_dt_tm.astimezone( zoneinfo.ZoneInfo( "US/Eastern" ) ) % % # Return daylight savings time as a timedelta object, 0 if % # we are not in DST, 1 hour if we are % % print( cur_dt_tm_tz.dst() ) 1:00:00 % print( zoneinfo.available_timezones() ) {'America/Grand_Turk', 'America/Detroit', 'America/Belem', 'Europe/Tirane', 'Asia/Baghdad', 'America/Regina', 'Europe/Amsterdam', 'Asia/Calcutta', 'Asia/Vladivostok', 'Pacific/Guam', 'Greenwich', 'GB-Eire', … 'Australia/Brisbane', 'Africa/Niamey', 'Asia/Brunei', 'America/Punta_Arenas', 'Asia/Ashkhabad', 'America/Eirunepe', 'Pacific/Honolulu', 'Etc/GMT+10', 'Indian/Christmas', 'Pacific/Auckland', 'Etc/GMT+7', 'Etc/UCT', 'Europe/Zagreb', 'Africa/Conakry', 'America/Indiana/Vevay'}

The most basic approach to solving the Friday the 13th problem is to ask the user for a target year, then using datetime and timedelta walk through all 365 days in the year and use datetime's properties to check each day to see if it is a Friday and the 13th day of the month.

Friday the 13th Pseudocode

  1. Ask the user to input a valid year.
  2. Convert the year into a datetime variable representing 01/01/year.
  3. Loop over potential Friday 13 days in the year and check each day to see if it is, in fact, Friday the 13th. You can use the timedelta() and weekday() functions, as well as the ability to directly access a datetime's year, month, and day using .year, .month, and .day as needed.
  4. Print each date that corresponds to Friday the 13th.

Here are three solutions to the Friday the 13th problem: basic, standard, and production-ready. If you're practising, you should program your own solution before looking at ours.

The basic solution is the simplest possible code to generate a correct result, with no error checking or consideration for efficency.

Friday the 13th Basic Solution

% from datetime import datetime % from datetime import timedelta % % inp = input( 'Enter a year to query in the form YYYY: ' ); % % # Convert year to first day of year % % inp = 'Jan 01 ' + inp; % cur_dt_tm = datetime.strptime( inp, '%b %d %Y' ); % % # Walk entire year, searching for Friday 13 occurrences % % n = 0; % yr = cur_dt_tm.year; % while( cur_dt_tm.year == yr ): % if ( cur_dt_tm.weekday() == 4 and cur_dt_tm.day == 13 ): % n = n + 1 % cur_dt_tm = cur_dt_tm + timedelta( days=1 ); % % # Report Friday 13 occurrences % % print( 'Number of Friday 13: ' + str( n ) );

A standard solution with basic error checking, support for the most general case, consideration for efficiency, and basic useful feedback.

Friday the 13th Standard Solution

% from datetime import datetime % from dateutil.relativedelta import relativedelta % % yr = 0 % while( yr < 1 ): % try: % yr = input( 'Enter a year to query in the form YYYY: ' ); % yr = int( yr ); % % if ( yr < 1 ): % print( 'Year must be a number > 0\n' ); % except: % print( yr + ' is not a year in the form YYYY\n' ); % yr = -1; % % # Convert year to first possible Friday 13th % % yr = 'Jan 13 ' + f'{yr:4d}'; % cur_dt_tm = datetime.strptime( yr, '%b %d %Y' ); % % # Search 13th day of each month to see if it's Friday % % n = 0; % yr = cur_dt_tm.year; % % print( '\n', end='' ); % while( cur_dt_tm.year == yr ): % if ( cur_dt_tm.weekday() == 4 and cur_dt_tm.day == 13 ): % print( cur_dt_tm.strftime( "%B %d, %Y" ) + ' is Friday the 13th' ); % n = n + 1 % cur_dt_tm = cur_dt_tm + relativedelta( months=+1 ); % % # Report Friday 13 occurrences % % print( '\n', end='' ); % print( 'Number of Friday 13: ' + str( n ) );

A production-level solution with docstring comments, comprehensive error checking, and advanced features like the ability to support BCE/BC (before common era/before Christ) and CE/AD (common era/anno Domini) year definitions.

Friday the 13th Production Solution

% #- FRIDAY-13.PY---------------------------------------------------------# % # A program to determine the days of a given year that are Friday # % # the 13th. # % # # % #- When: ------ Who: ------------------ Comments: ----------------------# % # # % # 30-Aug-24 Christopher G. Healey Initial implementation # % #-----------------------------------------------------------------------# % % from datetime import datetime % import numpy as np % % def datetime64_to_str( yr ): % """ % Return current year as MMM 13, YYYY[BC]. % % :param yr: Current year, month, day as numpy datetime64 object % :return: yr formatted as 'MMM DD, YYYY[BC]' % """ % suffix = '' # Assume AD and not BC % dt_str = str( np.datetime_as_string( [ yr ], unit='D' )[ 0 ] ) % % # BC dates start with 0000 or -x[x...] % % if dt_str.startswith( '0000' ) or dt_str.startswith( '-' ): % suffix = 'BC' % % # Ensure 1BC => 0000-mm-dd converts to -0000-mm-dd % % dt_str = '-' + dt_str if dt_str.startswith( '0' ) else dt_str % % # Remember 1BC is 0, 2BC is -1, so convert e.g., -1 to 2, -2 to 3, etc. % % tok = dt_str.split( '-' ); % tok[ 1 ] = str( int( tok[ 1 ] ) + 1 ); % dt_str = f'{int(tok[ 1 ] ):04d}-{tok[2]}-{tok[3]}' % % cur_dt = datetime.strptime( dt_str, '%Y-%m-%d' ) % dt_str = f'{cur_dt:%b} {cur_dt.day:d}, {cur_dt.year:d}{suffix}'; % % return dt_str % % # End function datetime64_to_str % % def get_year(): % """ % Determine valid year to query. Valid year format is Y[YYY][BC|AD]. % Examples: 2024, 2024AD, 12BC, if AD/BC not specified, AD assumed. % % :return: Year to query as a numpy datetime64 object % """ % yr_int = None % % while yr_int == None: # While valid year not yet entered % ad = False # AD and BC are initially not specified % bc = False % % # Get year, remove spaces, lowercase to standardize % % yr_str = input( 'Enter target year as Y[YYY][BC|AD]: ' ) % yr_str = yr_str.replace( ' ', '' ).lower() % % # If year entered ends in BC or AD, handle appropriately % % while yr_str.endswith( 'bc' ) or yr_str.endswith( 'ad' ): % if yr_str.endswith( 'bc' ): % bc = True % elif yr_str.endswith( 'ad' ): % ad = True % % yr_str = yr_str[ :-2 ] % % if bc and ad: # Ensure both BC and AD were not entered % print( 'Year cannot end in both BC and AD\n' ) % continue % % if len( yr_str ) < 1 or len( yr_str ) > 4: # Ensure valid year % print( 'Year does not meet format Y[YYY][BC|AD]\n' ) % continue % % try: # Try to convert string to integer % yr_int = int( yr_str ) % if yr_int == 0: # 0BC or 0AD are not valid years % print( 'Year cannot be 0, 0BC, or 0AD\n' ) % yr_int = None % continue % % if bc: # If BC specified, negate date: 1BC = 0, 2BC = -1, ... % yr_int = 1 - yr_int % except: # Catch invalid year % print( 'Year ' + yr_str + ' does not meet format Y[YYY][BC|AD]\n' ) % % # Account for "-" on BC dates (5 wide versus 4 wide) % % yr_str = f'{yr_int:05d}' if bc else f'{yr_int:04d}' % return np.datetime64( yr_str ) % % # End function get_year % % # Mainline % % yr_np = get_year() # Get target year to search % % for i in range( 0, 12 ): # For all 12 months in target year % # Set appropriate month and 13th day, then check if it's a Friday % % cur_yr = yr_np + np.timedelta64( i, 'M' ) + np.timedelta64( 12, 'D' ) % if np.is_busday( cur_yr, weekmask="Fri" ): % yr_str = datetime64_to_str( cur_yr ) % print( yr_str + " is Friday the 13th" )

Advanced Problems

Below are some more advanced problems for students who have previous Python programming experience or students who have finished the first four practice problems.

2048

2048 is a game played on a 4×4 grid. The goal is to combine values in squares on the grid to add up to 2048. Initially, two random positions are given values of 2. The remaining seven positions are empty. Players can push the non-empty squares up, down, left, or right. All the non-empty squares will move as far as possible in the given direction, until they hit the edge of the grid or they hit a non-empty square. If a square hits another square with the same value the two squares "merge," placing the sum of the values in the square that was hit, and emptying the second square. After each "push" a random empty square is selected and a value of 2 is placed in the square.

The game continues until the player merges two squares to a value of 2048, or until the board is full and no moves that form merges exist. You can try an online version of the game to see how it works and understand the rules for what happens when you push squares, when merges occur, and how the game ends.

                                                         
    2                       Push Right                          2 
                                                       
    2                                    2             2 
 
                                         2             4 
                         2  Push Up                            
                                2                      
           2             2                               
 
           2             4        2      4               
                            Push Left                            
    2                           2      2               
                                                         
       

We have provided one solution to the 2048 game in Python. You'll need to conda install conda-forge::keyboard to install the keyboard library so it can run properly.

String Matching

String matching and partial string matching are two critical tasks in computer science. The first asks, "What is the minimum number of operations required to convert a source string \(s\) into a target string \(t\)?" The second asks, "From a set of strings, which strings \(T = \{ t_1, t_2, \ldots, t_n\}\) are within a threshold similarity \(\epsilon\) of a source string \(s\)?"

String Conversion

Suppose we have two strings \(s\) and \(t\) and we want to perform the minimum number of "operations" on \(s\) so that it is equivalent to \(t\). Here, valid operations on \(s\) are inserting a new character, deleting an existing character, or replacing an existing character (i.e., swapping it for something else). The Levenshtein distance, proposed by Vladimir Levenshtein in 1965, defines the minimum number of operations needed to convert \(s\) to \(t\). The distance matrix \(D\) it builds can also be used to determine which operations to perform, which is probably as important (or perhaps more important) that the actual minimum operation count itself.

Levenshtein distance is calculated by recursively constructing a distance matrix \(D\) of size \(|s| \times |t|\). Entry \(D[i][j]\) holds the Levenshtein distance for \(s\) to \(t\) up to position \(i\) in \(s\) and \(j\) in \(t\). This means the lower-right corner of \(D\) holds the final distance for converting \(s\) to \(t\). A specific algorithm based on values of \(s\) and \(t\) is used to construct \(D\).

\begin{equation} \textrm{lev}(s,t) = \begin{cases} |s| & \textrm{if} \; |t| = 0 \\ |t| & \textrm{if} \; |s| = 0 \\ \textrm{lev}(\textrm{tail}(a), \textrm{tail}(b)) & \textrm{if} \; a[0] = b[0] \\ 1 + \textrm{min} \begin{cases} \textrm{lev}(\textrm{tail}(a),b) \\ \textrm{lev}(a,\textrm{tail}(b)) \\ \textrm{lev}(\textrm{tail}(a),\textrm{tail}(b)) \\ \end{cases} & \textrm{otherwise} \end{cases} \end{equation}

This tells us the Levenshtein distance for (sub)strings \(s\) and \(t\) is

  1. \(|s|\) if \(t\) is empty (delete all characters in \(s\), requiring \(|s|\) operations)
  2. \(|t|\) if \(s\) is empty (append all characters in \(t\) to the end of \(s\), requiring \(|t|\) operations)
  3. The Levenshtein distance of \(s[1:]\) and \(t[1:]\) if the first characters of \(s\) and \(t\) are identical (nothing needs to be done since the strings match at their first character)
  4. Otherwise, we know a deletion, insertion, or replacement (swap) is required, so we add 1 (for the operation) to the minimum of:

To actually construct \(D\), we use dynamic programming, which divides a large problem into a sequence of smaller ones. Obviously, the large problem is calculating the Levenshtein distance between \(s\) and \(t\). We can reduce this to calculating the Levenshtein distance between prefixes of \(s\) and \(t\). As an example, suppose \(s=\textrm{kelm}\) and \(t=\textrm{hello}\). The prefixes of kelm are k, ke, kel, and kelm. Similarly, the prefixes of hello are h, he, hel, hell, and hello. The Levenshtein distance between all pairs of prefixes forms the entries of \(D\).

To further reduce this to something we can easily work with, consider comparing the prefix k to all prefixes of hello. The first one, \((\textrm{k},\textrm{h})\) has a Levenshtein distance of \(\textrm{lev}(\textrm{k},\textrm{h})=1\), since we swap k for h. The second, \((\textrm{k},\textrm{he})\) requires a swap and an insertion of e, \(\textrm{lev}(\textrm{k},\textrm{he})=2\). Notice this is an extension of the first \((\textrm{k},\textrm{h})\) operation. For \((\textrm{k},\textrm{hel})\) it's a swap, an insertion, and an insertion, for \(\textrm{lev}(\textrm{k},\textrm{hel})=3\). You can easily guess that \(\textrm{lev}(\textrm{k},\textrm{hell})=4\) and \(\textrm{lev}(\textrm{k},\textrm{hello})=5\). This represents the first row of \(D\), or values of \(D[1,1]\) through \(D[1,5]\).

To describe this in a linear programming fashion, we will consider the case of calculating \(D[i,j]\) given \(D[i-1,j]\), \(D[i,j-1]\), and \(D[i-1,j-1]\), that is, the problem reduced to a \(2 \times 2\) matrix. Suppose \(D[i-1,j]=2\), \(D[i,j-1]=3\), and \(D[i-1,j-1]=5\).

5 2
3  

The missing element at \(D[i,j]\) will have a Levenshtein distance based on two options.

  1. The current character of the two prefixes are identical, and no operation is required if we do a replacement (since we'd be swapping the same characters), so \(D[i,j] = \min(1+D[i-1,j],1+D[i,j-1],D[i-1,j-1])\).
  2. The current character of the two prefixes are different, and a replacement does require a swap, so \(D[i,j] = \min(1+D[i-1,j],1+D[i,j-1],1+D[i-1,j-1])\).

To complete the discussion, notice that \(i\) and \(j\) index from \(1\), but to get the three values we need to calculate the fourth, we need a row and column indexed at \(0\) in \(D\). Therefore, this is the \(D\) we start with when comparing kelm to hello.

h e l l o
0 1 2 3 4 5
k 1
e 2
l 3
m 4

Given this table, to calculate \(D[1,1]\) we observe that the prefixes k and h are not the same, so the distance is \(\min(1 + D[0,1],1 + D[1,0],1 + D[0,0]) = 1 + 0 = 1\).

h e l l o
0 1 2 3 4 5
k 1 1
e 2
l 3
m 4

When the current characters of the prefixes are the same, the replacement cost is 0, and our options are insert, delete, or do nothing. The value for \(D[i,j]\) is therefore \(\min(1+D[i-1,j],1+D[i,j-1],D[i-1,j-1])\). If the prefix characters are different, then an deletion, insertion, or swap is required and the replacement cost is 1, so the value for \(D[i,j]\) is \(\min(1 + D[i-1,j],1 + D[i,j-1],1 + D[i-1,j-1])\). This procedure is used to fill in row \(D[1,:]\).

h e l l o
0 1 2 3 4 5
k 1 1 2 3 4 5
e 2
l 3
m 4

For the next row, we are comparing ke to hello. Remember, even though the second row is labelled e, it corresponds to the prefix from the first to the second row: ke. So as we walk through hello, we would say:

What about the first two entries in the next row, \(D[2,1]\) and \(D[2,2]\)? For \(D[2,1]\) we are comparing e to h, and since they are different the distance is \(\min(1+1,1+2,1+1) = 2\). For \(D[2,2]\), however, we are comparing e to e, and since these are the same, no operation is needed for replacement, meaning \(D[2,2] = \min(1+2,1+2,1) = 1\).

h e l l o
0 1 2 3 4 5
k 1 1 2 3 4 5
e 2 2 1
l 3
m 4

If we continue this for all entries in \(D\), we obtain a final table whose lower-right entry \(D[4,5]=3\), meaning \(\textrm{lev}(\textrm{kelm},\textrm{hello})=3\). Intuitively this seems correct, since we would swap k with h, insert an l, and swap m with o. You will see below, however, that our algorithm uses a different set of three operations to convert kelm to hello.

h e l l o
0 1 2 3 4 5
k 1 1 2 3 4 5
e 2 2 1 2 3 4
l 3 3 2 1 2 3
m 4 4 3 2 2 3

Given this description, write a program that calculates the Levenshtein distance between two words, then describes how to convert the source word to the target word in the number of Levenshtein distance operations allowed. You may get a different set of operations depending on whether you modify the source front-to-back or back-to-front and depending on how you break ties in minimum distances for insert, delete, and replace, but the number of operations (i.e., the Levenshtein distance) will always be the same.

kelm → hello Levenshtein distance: 3 kelm → kelmo (I) kelmo → kello (R) kello → hello (R) smelt → salmon Levenshtein distance: 5 smelt → smeltn (I) smeltn → smelton (I) smelton → smelmon (R) smelmon → smlmon (D) smlmon → salmon (R)

String Correction

Levenshtein reports the edit distance between two words, and with some extensions, how to perform edit operations to convert one word to another. Although interesting, this may not seem like a common problem. However, Levenshtein provides another critical property: a way to characterize the similarity between two words based on their edit distance. This can be used in any situation where "word similarity" is important. The example we will use is spell checking. The basic strategy would be to say: if a word is misspelled, suggest a small set of words that are "similar" to the misspelled word, that is, a small set of words that have a small Levenshtein distance from the misspelled word.

Since we need suggestions to be identified quickly, we need to encode all words in a dictionary in a way that supports rapid identification of dictionary words within a threshold Levenshtein distance of a target word. This can be done with a Burkhard-Keller tree, commonly called a BK-tree. In a BK-tree words are stored in tree nodes and the edge between two nodes contains the Levenshtein distance between the nodes' words. A BK-tree is constructed in the following manner.

  1. Choose any word in a dictionary \(w_\textrm{root}\) as the root of the BK-tree.
  2. For all remaining words \(w_i\) in the dictionary, compute the Levenshtein distance \(d_{i,\textrm{root}}\) between \(w_i\) and \(w_\textrm{root}\).
  3. If no child of \(w_\textrm{root}\) exists with an edge value of \(d_{i,\textrm{root}}\), add \(w_i\) as a child of \(w_\textrm{root}\).
  4. If a child does exist with an edge value of \(i,d_{\textrm{root}}\)

As an example, suppose our (small) dictionary had the words { help, helm, hello, shell, helper, loop, helps, troop }. We arbitrarily choose help as the root of the tree, then add helm Since \(\textrm{lev}(\textrm{help},\textrm{helm})=1\) and help has no child node with an edge value of \(1\), we add it to the BK-tree. Similarly, hello with a Levenshtein distance of \(2\) is also added as a child of help.

BK-01

Next we add shell. Since \(\textrm{lev}(\textrm{help},\textrm{shell})\) is \(2\), we already have a child with Levenshtein distance \(2\) (hello). Therefore, we move to node hello and attempt to add shell there. The Levenshtein distance between hello and shell is 2, and no child of hello with edge value 2 exists, so we insert shell as a child of hello.

BK-02

This process continues, adding the remaining words to create the final BK-tree.

agglomerative

Once the tree is built, it is searched for words "close" to a target word, for example, a misspelled word. The threshold Levenshtein distance \(\epsilon\) is called the radius. To search the BK-tree, for a target word \(w_\textrm{tg}\), perform the following iterative procedure.

  1. Create an empty match list \(M\).
  2. Create a candidate list \(C\) containing the root node \(n_\textrm{root}\).
  3. While \(C\) is not empty, remove the next node \(n_c \in C\).
  4. At this point, \(M\) contains all words in the BK-tree within Levenshtein distance \(\epsilon\) of \(w_\textrm{tg}\).

Suppose we were searching for suggestions to the misspelled word hllo with a threshold of \(\epsilon=1\). We start with an empty \(M\) and a candidate list \(C = \{ \textrm{help} \}\) containing the root node. We remove the first entry help from \(C\). Because the Levenshtein distance \(\textrm{lev}(\textrm{help},\textrm{hllo}) = 2 \gt \epsilon\), we do not add help to \(M\). Our soft threshold boundaries are \( 2 - \epsilon\) to \(2 + \epsilon = [1, \ldots, 3]\). helm and hello are added to \(C\) since their edge values are within the soft threshold.

We remove the next entry helm from \(C\). \(\textrm{lev}(\textrm{helm},\textrm{hllo}) = 2 \gt \epsilon\). Because the Levenshtein distance is greater than \(\epsilon\), helm is not added to \(M\). The soft threshold is \([1, \ldots, 3]\), so helps is added to \(C\).

We remove the next entry hello from \(C\). \(\textrm{lev}(\textrm{hello},\textrm{hllo}) = 1 \leq \epsilon\). Because the Levenshtein distance is less than of equal to \(\epsilon\), hello is added to \(M\). The soft threshold is \([0, \ldots, 2]\), so shell is added to \(C\).

To complete the example, \(\textrm{lev}(\textrm{helps},\textrm{hllo}) = 3 \gt \epsilon\), and \(\textrm{lev}(\textrm{shell},\textrm{hllo}) = 3 \gt \epsilon\), so neither is added to \(M\). The final set of recommended corrections for hllo is therefore \(M = \{\textrm{hello}\}\).

Add code to your program to construct a BK-tree from a set of words, then allow a user provide a target word and threshold \(\epsilon\). The search function should return the set of words that are within \(\epsilon\) of the target word. As an example, here is our Python program that includes both the Levenshtein algorithm and the routines to construct and search a BK-tree. As a dictionary, we use the same set of words we provided for the hangman practice problem. If you want a larger dictionary, an abridged version of the 2016 Webster's English Dictionary is also available, although it will take longer to build a BK-tree since this file contains significantly more words. To load data from the Webster's Dictionary, execute the following Python commands.

% import json % % with open( 'dicitionary.json' ) as inp: % data = json.load( inp ) % words = list( data.keys() )

Tic-Tac-Toe

The final practice problem is also your assignment for the Practical Python module. You are asked to implement a simple tic-tac-toe game to allow two individuals to play one another. The assignment web page provides full details on what is required.

The assignment is due by 5:00pm Friday, September 8. The submission link is included on the Moodle page in the Practical Python section. As with the OpenWeatherMap assignment, you must submit a Python .py source code so that we can automatically test it to ensure it works as it should.

Appendix: Installing Python Packages

In a typical Python environment, you can install packages using Python's pip command. pip automatically determines dependencies, that is, whether other packages are required to run the package you've requested. If any dependencies are not installed, pip will attempt to install them prior to installing the package you've requested.

In an Anaconda environment, however, we use the conda command the perform the same function. This can be done either in Anaconda Navigator or at the windows command line interface (CLI). Anaconda has an extensive help page on how to perform package installations, both those that Anaconda maintains in its own package library, and those that live outside Anaconda's servers. For specific details on installation, this is the best guide to reference.

If you want a quick summary of the steps I use to try to install Python packages in the Anaconda environment, they're listed in order below

  1. Run Anaconda Navigator and choose the "Environments" option on the left-hand panel. This will bring up a package manager with a list of installed packages on the right. Enter the name of your package in the "Search Packages" field to see if it's already installed as part of the default Anaconda package group. If the package doesn't appear as an installed package, use the drop-down menu to switch from "Installed" to "Not Installed." If your package appears in this list, it is a package Anaconda maintains in their package library. Select the package (click on the radio button to the left of its name), then click "Apply" to install the package.
  2. If the package is not in the "Not Installed" list, it is not a package Anaconda includes in its default installation. Hit Windows+Q (Windows key plus Q key) to bring up the search option in Windows and type "anaconda". The list of Recommended Apps should include "Anaconda Prompt." Select this, since it guarantees all the proper Anaconda environment information is set before the CLI is run.

    Search the web for the terms anaconda package where package is the package you want to install. Normally, the first result returned is an Anaconda page that specifies exactly how to install the given package.
  3. If all of these fail and you're willing to accept the possibility of corrupting your current Anaconda environment, you can try using pip. To do this, we first need to ensure pip itself is installed. Type
    conda list
    at the command line. This will list all installed package in alphabetical order. You can scroll backwards to see if pip is in the list of installed packages. If pip is not installed, you must install it with the command
    conda install pip
    This will install pip (which is a package Anaconda maintains) into the Anaconda environment. Once you know pip is available, you can install your desired package using
    pip install package
    where package is the name of the package you want to install. To check if the package installed properly, you can either use conda list again, or run python and try to import package to see if it loads properly.