We will start these lecture notes with a broader overview of what Object-Oriented Programming means. This might sound a bit generic, but it is an important concept to grasp in order to build a more general understanding of Python.
After that, we will move into a classic introduction to Python, with a focus on data types and data collections.
This notebook draws from materials developed by Dr. Eric Dunford for a previous iteration of PPOL 5203
Python is an object-oriented programming language (OOP) where the object plays a more fundamental role for how we structure a program. Specifically, OOP allows one to bundle properties and behavior into individual objects. In Python, objects can hold both the data and the methods used to manipulate that data.
As you are progressing in the DSPP, you are also being introduced to R. R, on the other hand, is a functional programming language where functions are objects and data is manipulated using functions.
At first glance, the distinction is subtle, but the way we build programs in R and Python differs considerably. In practice, the OOP vs. Functional distinction changes how one engages with objects instantiated in the environment.
In Python, methods (functions) are self-contained in the object, whereas in R functions are external to the object. In other words, while much of the work in R consists of writing functions that are stored outside of classes/objects, in Python you can borrow from general classes, inherit their methods/functions, or just add new functionality to objects created by others.
Objects are the core of OOP languages. They are also important in R, but they are not as flexible there as in Python.
As an example, see the differences between taking the mean of a vector in R (using a function) and the mean of a pandas series in Python (using a method).
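The contrast can be sketched in Python alone. In the sketch below, statistics.mean() plays the role of an R-style function that lives outside the data, while the small Series class is a hypothetical stand-in (it is not the pandas implementation) for an object that carries its own mean() method.

```python
import statistics

values = [2, 4, 6, 8]

# Function style (as in R's mean(x)): the function lives outside the object
function_result = statistics.mean(values)


class Series:
    """Hypothetical, minimal stand-in for a pandas Series (illustration only)."""

    def __init__(self, data):
        self.data = list(data)

    def mean(self):
        # The behavior is bundled with the data it operates on
        return sum(self.data) / len(self.data)


# Method style (as in a pandas series.mean()): the object carries the method
method_result = Series(values).mean()

print(function_result, method_result)
```

With pandas installed, the method-style call would look like pd.Series(values).mean() instead of the hypothetical class used here.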
= is the assignment operator in Python. Unlike R, which has multiple assignment operators, Python uses only = for ordinary assignment.
# creating an object in Python
x = 4
1) A reference is assigned to an object (e.g. above, x references the object 4 in the statement x = 4). This is the name of the object as it is saved in your environment. But notice this is not how the object is identified on your machine:
# in your machine
id(x)
Use names that make sense. This simple habit will make your code much easier to read.
2) An object's type is determined at runtime. Python is a dynamically typed language, which differs from languages where the type must be declared explicitly (e.g. C++, Java). This is related to, but not the same as, "duck typing", where what matters is the methods an object supports rather than its declared type. The type of an object cannot be changed once it is created (coercing an object into a different type actually creates a new object).
# Creating object in C
int result = 0;
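By contrast, a minimal Python sketch of dynamic typing looks like this. No type is declared, and the same name can later be bound to an object of a different type. The variable names first_type and second_type are only for illustration.

```python
x = 4
first_type = type(x)   # the name x currently references an int

# Rebinding the name x to a different object.
# The int object 4 is not altered; x simply points somewhere else now.
x = "four"
second_type = type(x)  # the name x now references a str

print(first_type, second_type)
```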
3) An object's class is set when the object is created. An object's class provides a blueprint for object behavior and functionality. We use the dot operator . to access an object's methods and attributes.
# what is the class?
type(x)
# Access methods (behaviors) using .
x.bit_length()
# see all
dir(x)
Every time we create an object, that object is an instance of a class. Creating it is what we call instantiating.
Classes are used to create user-defined data structures. A class is a blueprint for how something should be defined.
An instance is a realization of a particular class. And this instance inherits the characteristics of its class.
Imagine we have a class called dog(). Every time I use this class to create a new, concrete dog (object), I am creating an instance, or a realization, of this abstract class.
Classes have two major components:
Attributes: these are constant features, data, a characteristic of the broader class
Methods: these are actions, behaviors of this class.
Both attributes and methods are accessed through the . operator.
We will see later how to create our own classes. Most important for now is to understand that every object in Python has a class, and every realization of this class inherits both the attributes and the methods of the pre-defined class.
Let's see a quick example here.
# create a class
class dog():
    # __init__ runs when a new instance is created and stores its attributes
    def __init__(self, name, breed, age):
        self.name = name
        self.breed = breed
        self.age = age

    # a method: a behavior shared by every instance of the class
    def say_hello_to_my_friends(self):
        print('Hi, I am ' + self.name + " and I am a " + self.breed)

# Instantiate
brisa = dog(name="Brisa", breed="Beagle Mix", age="6 years old")
type(brisa)
# Attributes
print(brisa.name + " " + brisa.breed + " " + brisa.age)
# method
brisa.say_hello_to_my_friends()
Here we can print out all the different methods using the dir() function (which provides an internal directory of all the attributes and methods contained within the object). As we can see, there is a lot going on inside this single dog object!
dir(brisa)
There are two ways of instantiating a built-in data type in Python:
With a literal, e.g. []
With a constructor, e.g. list()
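As a short sketch of the two routes, the example below builds the same list once with a literal and once with the constructor. The variable names are chosen only for illustration.

```python
# Literal: the value is written directly in the code
list_literal = [1, 2, 3]

# Constructor: the class is called with an iterable as input
list_constructed = list((1, 2, 3))

# Both routes produce equal values, but two separate objects
same_value = list_literal == list_constructed
same_object = list_literal is list_constructed

print(same_value, same_object)
```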
Python comes with a number of built-in data types. When talking about data types, it's useful to differentiate between scalar types, which hold a single value, and collection types, which hold multiple values.
These built-in data types are the building blocks for more complex data types, like a pandas DataFrame (which we'll cover later).
Type | Description | Example | Literal | Constructor |
---|---|---|---|---|
int | integer types | 4 | x = 4 | int(4) |
float | 64-bit floating point numbers | 4.567 | x = 4.567 | float(4) |
bool | boolean logical values | True | x = True | bool(0) |
None | null object (serves as a valuable place holder) | None | x = None | |
Note two things from the above table:
int
Definition: An int, or integer, is a whole number, positive or negative, without decimals, of unlimited length.
Here we assign an integer (3) to the object x.
# int
x = 3
x
# check type
type(x)
Definition: Floating point numbers are decimal values or fractional numbers.
Now let's coerce the integer to a float using the constructor float(). A float represents a real number with both an integer and a fractional component.
# creating a float
y = 4.56
type(y)
# float
float(x)
x
# check type
type(x)
Note that behavior of the object being coerced depends both on the initial class and the output class.
#int
x=3
# add a int + float = float
type(x + 3.0)
Boolean objects that are equal to True are truthy (True), and those equal to False are falsy (False). Numerically, int values equal to zero are False, and all nonzero values (including negative ones) are True.
# literal
x=True
x
# constructor
x = bool(1)
id(x)
type(x)
x
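To make the truthiness rule concrete, the sketch below passes a handful of values through bool(). The list of examples and the dictionary name truthiness are assumptions made for illustration.

```python
# A few values to test, including a negative number and empty objects
examples = [0, 1, -1, 0.0, 2.5, "", "text", None]

# Map the printed form of each value to its boolean interpretation
truthiness = {repr(value): bool(value) for value in examples}

for shown, result in truthiness.items():
    print(shown, "->", result)
```

Zero, the empty string, and None come back as False. Every other value in the list, including -1, comes back as True.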
Finally, all scalar data types are immutable, meaning they can't be changed after assignment. When we make changes to a data type, say by coercing it to be another type as we do above, we're actually creating a new object. We can see this by looking at the object id.
id() tells us the "identity" of an object. The number itself shouldn't mean anything to you. Just know that when an object id is the same, it's referencing the same data in the computer. We'll explore the implications of this when we look at copying.
x = 4
id(x)
Here we coerce x to be a float and then look up its id(). As we can see, there is a new number associated with it. This means the coerced value is a different object from x; x itself is unchanged.
id(float(x))
x=6
id(x)
Python knows how to behave given the methods assigned to the object when we create an instance. The methods dictate how different data types deal with similar operations (such as addition, multiplication, comparative evaluations, etc.).
Using what we learned from OOP, this means that every class has its own specific methods. These methods can have specific names (any user-defined function) or they can share universal names (Magic or Dunder Methods). See the addition example with int instances.
# create int
x=4
# add literally
x + 4
# what is happening under the hood?
x.__add__(4)
Each class that supports addition defines its own __add__ method. For this reason, the results of adding two int objects and of adding an int and a float are different.
# create int
x=4
# add literally
type(x + 4.2)
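A slightly closer look at what happens under the hood, as a sketch: int's own __add__ does not know how to handle a float, so it returns the special value NotImplemented, and Python then falls back to the float's reflected method __radd__. The variable names are illustrative only.

```python
x = 4

# int + int is handled directly by int.__add__
int_plus_int = x.__add__(4)

# int.__add__ cannot handle a float and signals this with NotImplemented
int_plus_float = x.__add__(4.2)

# Python then asks the float to handle it through the reflected method
fallback = (4.2).__radd__(x)

print(int_plus_int, int_plus_float, fallback)
print(type(x + 4.2))
```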
Collection Data Types
Type | Description | Example | Mutable | Literal | Constructor |
---|---|---|---|---|---|
list | heterogeneous sequences of objects | [1,"2",True] | ✓ | x = ["c","a","t"] | x = list("cat") |
str | sequences of characters | "A word" | ✘ | x = "12345" | x = str(12345) |
tuples | heterogeneous sequence of objects | (1,2) | ✘ | x = (1,2) | x = tuple([1,2]) |
sets | unordered collection of distinct objects | {1,2} | ✓ | x = {1,2} | x = set([1,2]) |
dicts | associative array of key/value mappings | {"a": 1} | keys ✘ values ✓ | x = {'a':1} | x = dict(a = 1) |
Each built-in collection data type in Python is distinct in important ways. Recall that an object's class defines how the object behaves with operators and its methods.
I'll explore some of the differences in behavior for each class type so we can see what this means in practice
Note the column referring to Mutable and Immutable collection types. Simply put, mutable objects can be changed after they are created, while immutable objects cannot. All the scalar data types are immutable. Even when we coerce an object into a different class, we aren't changing the existing object; we are creating a new one.
Some collection types, however, allow us to edit the data values contained within without needing to create a new object. This can allow us to effectively use the computer's memory. It can also create some problems down the line if we aren't careful (see the tab on copies).
In practice, mutability means we can alter values in the collection on the fly.
my_list = ["sarah","susan","ralph","eddie"]
my_list
## see id
id(my_list)
my_list[1] = "josh"
my_list
## see id
# Still the same object, even though we changed something in it
id(my_list)
Immutability, on the other hand, means that we cannot alter values after the object is created. Python will throw an error at us if we try.
my_tuple =("sarah","susan","ralph","eddie")
my_tuple
my_tuple[1] = "josh"
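The error can be caught and inspected, as in the sketch below. It also shows one hedged way to get a "changed" tuple, which is to build a new one from slices of the old. The names message and new_tuple are introduced only for illustration.

```python
my_tuple = ("sarah", "susan", "ralph", "eddie")

try:
    my_tuple[1] = "josh"        # item assignment is not allowed on a tuple
except TypeError as error:
    message = str(error)
    print("TypeError:", message)

# Build a new tuple instead; the original is left untouched
new_tuple = my_tuple[:1] + ("josh",) + my_tuple[2:]
print(new_tuple)
```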
list
Lists allow for heterogeneous membership in the various object types. This means one can hold many different data types (even other collection types!). In a list, one can change items contained within the object after creating the instance.
x = [1, 2.2, "str", True, None]
x
A list constructor takes in an iterable object as input. (We'll delve more into what makes an object iterable when covering loops, but the key is that the object must have an .__iter__() method.)
list([1, 2.2, "str", True, None])
At its core, a list is a bucket for collecting different types of information. This makes it useful for collecting data items when one needs to store them. For example, we can store multiple container types in a list.
a = (1,2,3,4) # Tuple
b = {"a":1,"b":2} # Dictionary
c = [1,2,3,4] # List
# Combine these different container objects into a single list
together = [a,b,c]
type(together[0])
type(together[1])
type(together[2])
A list class has a range of specific methods geared toward querying, counting, sorting, and adding/removing elements in the container. For a list of all the list methods, see here.
Let's explore some of the common methods used.
country_list = ["Russia","Latvia","United States","Nigeria","Mexico","India","Costa Rica"]
country_list
Inserting values
Option 1: use the .append() method.
country_list.append("Germany")
country_list
Option 2: use the + (add) operator.
country_list = country_list + ['Canada']
country_list
Addition means "append"? Recall that an object's class dictates how it behaves with different operators. A list object has an .__add__() method built into it that provides instructions for what the object should do when it encounters the + operator. The same goes for the * multiplication operator, and so on. This is why it's so important to know the class that you're using. Different object classes == different behavior.
You can also combine lists through their reference names.
more_countries = ["Brazil", "Argentina"]
country_list + more_countries
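One difference between the two insertion options is worth a short sketch: .append() changes the existing list in place, while + builds a new list that then has to be assigned to a name. The list below is a shortened, illustrative version of the one used above.

```python
countries = ["Russia", "Latvia"]
original_id = id(countries)

# .append() mutates the list in place: same object, new content
countries.append("Germany")
same_after_append = id(countries) == original_id

# + creates a brand new list; the name is then rebound to it
countries = countries + ["Canada"]
same_after_plus = id(countries) == original_id

print(same_after_append, same_after_plus)
print(countries)
```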
Deleting values
Option 1: use the del statement with an index.
# Drop Latvia
del country_list[1]
country_list
Option 2: use the .remove() method.
country_list.remove("Nigeria")
country_list
Sorting values
country_list.sort()
country_list
dir(country_list)
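A common stumbling block with sorting, sketched below: the .sort() method reorders the list in place and returns None, whereas the built-in sorted() function returns a new sorted list and leaves the original alone. The names used here are illustrative.

```python
names = ["Mexico", "India", "Costa Rica"]

# sorted() returns a new list; names keeps its original order for now
ordered_copy = sorted(names)
print(names, ordered_copy)

# .sort() reorders names in place and returns None
result = names.sort()
print(names, result)
```

Writing names = names.sort() would therefore replace the list with None.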
str
Strings are containers too. String elements can be accessed using an index, much like objects in a list (see the tab on indices and keys).
s = "This is a string"
s[2]
The literal for a string is quotations: '' or "". When layering quotations, one needs to opt for the quotation type different from the one used to instantiate the string object.
s = 'This is a "string"'
print(s)
s = "This is a 'string'"
print(s)
A multiline string can be created using triple quotation marks. This is useful when writing documentation for a function.
s2 = '''
This is a long string!
With many lines
Many. Lines '''
print(s2)
Strings are quite versatile in Python! In fact, many of the manipulations that we like to perform on strings, such as splitting text up (also known as "tokenizing"), cleaning out punctuation and characters we don't care for, and changing the case (to name a few), are built into the string class as methods.
For example, say we wanted to convert a string to upper case.
str1 = "the professor is here!"
str1.upper()
str1.split(" ")
Or replace words.
str1.replace("professor","student")
This is just a taste. The best way to learn what we can do with a string is to use it. We'll deal with strings all the time when dealing with public policy data. So keep in mind that the str data type is a powerful tool in Python. For a list of all the str methods, see here.
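As one possible sketch of the cleaning steps mentioned above (removing punctuation, changing case, tokenizing), the example below chains a few string methods. The sample sentence is made up, and using str.translate() with string.punctuation is one option among several.

```python
import string

text = "The professor is here! Is the student here, too?"

# Build a translation table that deletes every ASCII punctuation character
remove_punctuation = str.maketrans("", "", string.punctuation)
no_punct = text.translate(remove_punctuation)

# Lower-case the text and split it on whitespace into tokens
tokens = no_punct.lower().split()

print(no_punct)
print(tokens)
```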
tuple
Like a list, a tuple allows for heterogeneous membership among the various scalar data types.
However, unlike a list, a tuple is immutable, meaning you cannot change the object after creating it.
The literal for a tuple is the parentheses ().
my_tuple = (1,"a",1.2,True)
my_tuple
The constructor is tuple(). Like the list constructor, tuple() takes an iterable object (like a list) as an input.
my_tuple = tuple([1,"a",1.2,True])
my_tuple
Tuples are valuable if you want a data value to be fixed, such as if it were an index on a data frame, denoting a unit of analysis, or key on a dictionary. Tuples pop up all the time in the wild when dealing with more complex data modules, like Pandas. So we'll see them again and again.
One nice thing about tuples is that they allow for unpacking. Unpacking allows one to deconstruct the tuple object into named references (i.e. assign the values in the tuple to their own objects). This allows for flexibility regarding which objects we want when performing sequential operations, like iterating.
# Unpacking
my_tuple = ("A","B","C")
# Here we're unpacking the three values into their own objects
obj1, obj2, obj3 = my_tuple
# Now let's print each object
print(obj1)
print(obj2)
print(obj3)
# Unpacking also works with a list
my_list = ["A","B","C"]
# Here we're unpacking the three values into their own objects
obj1, obj2, obj3 = my_list
type(my_list)
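Unpacking has two further forms that are handy in practice, sketched below with made-up values: a starred name collects whatever is left over, and a tuple on both sides of = swaps two references without a temporary variable.

```python
# A starred name gathers the remaining values into a list
first, *rest = ("A", "B", "C", "D")
print(first, rest)

# Swapping two references through tuple packing and unpacking
a, b = 1, 2
a, b = b, a
print(a, b)
```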
Also, like a list, a tuple can store different collection data types as well as the scalar types. For example, we can store multiple container types in a tuple.
a = (1,2,3,4) # Tuple
b = {"a":1,"b":2} # Dictionary
c = [1,2,3,4] # List
# Combine these different container objects into a single tuple
together = (a,b,c)
together
For a list of all the tuple methods, see here.
set
A set is an unordered collection of unique elements (this just means there can be no duplicates). set is a mutable data type (elements can be added and removed). Moreover, the set methods allow for set algebra. This will come in handy if we want to know something about unique values and membership.
The literal for set is the curly braces {}. Note that this is very similar to the literal for a dictionary, but in that data structure we define key/value pairs (see the dict tab). An empty pair of braces {} creates a dict, so an empty set must be created with set().
my_set = {1,2,3,3,3,4,4,4,5,1}
my_set
The constructor is set(). As before, it takes an iterable object as an input.
new_set1 = set([1,2,4,4,5])
new_set1
new_set2 = set("Georgetown")
new_set2
In the above, we can see that order isn't a thing for a set.
We can add elements to a set using the .add() or .update() methods.
my_set.add(6)
my_set
my_set.update({8})
my_set
Where a set really shines is with the set operations. Say we had a set of country names.
countries = {"nigeria","russia","united states","canada"}
And we wanted to see which countries from our set were in another set (say another data set). Not a problem for a set!
other_data = {"nigeria","netherlands","united kingdom","canada"}
Which countries are in both sets?
countries.intersection(other_data)
Which countries are in our data but not in the other data?
countries.difference(other_data)
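The same set algebra is also available through operators, which some readers find easier to scan. The sketch below reuses the two sets from above; the result names are illustrative.

```python
countries = {"nigeria", "russia", "united states", "canada"}
other_data = {"nigeria", "netherlands", "united kingdom", "canada"}

both = countries & other_data         # same as .intersection()
only_ours = countries - other_data    # same as .difference()
either = countries | other_data       # same as .union()
exactly_one = countries ^ other_data  # same as .symmetric_difference()

print(sorted(both))
print(sorted(only_ours))
print(len(either), len(exactly_one))
```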
Note that values in a set cannot be accessed using an index.
my_set[1]
Rather, we can either .pop() values out of the set.
my_set.pop()
Or we can .remove() specific values from the set.
my_set.remove(3)
my_set
Finally, note that sets can contain heterogeneous scalar types, but they cannot contain other mutable container data types.
set_a = {.5,6,"a",None}
set_a
In set_b, the list object is mutable, so Python throws an error when we try to place it in a set.
set_b = {.5,6,"a",None,[8,5,6]}
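The underlying rule is that set elements must be hashable, which mutable containers like lists are not. The sketch below catches the error and shows one workaround, which is to store an immutable tuple in place of the list. The name set_c is introduced only for illustration.

```python
try:
    set_b = {0.5, 6, "a", None, [8, 5, 6]}   # a list cannot be a set element
except TypeError as error:
    message = str(error)
    print("TypeError:", message)

# A tuple holding the same values is immutable and hashable, so this works
set_c = {0.5, 6, "a", None, (8, 5, 6)}
print((8, 5, 6) in set_c)
```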
All this is barely scratching the surface of what we can do with sets. For a list of all the set methods, see here.
dict
A dictionary is the true star of the Python data types. dict is an associative array of key-value pairs. That means we have some data (value) that we can quickly reference by calling its name (key). As we'll see next week, this allows for a very efficient way to look up data values, especially when the dictionary is quite large.
Keys are looked up by name rather than by position (since Python 3.7 a dict does remember insertion order, but it still cannot be indexed numerically). Keys can't be changed once created (that is, the keys are immutable), but the values can be changed, either by reassigning them or by modifying a mutable value, like a list, in place. Finally, keys cannot be duplicated. Recall we're going to use the keys to look up data values, so if those keys were the same, it would defeat the purpose!
The literal for a dict is {:} as in {<key>:<value>}.
my_dict = {'a': 4, 'b': 7, 'c': 9.2}
my_dict
The constructor is dict(). Note the special way we can designate the key value pairing when using the constructor.
my_dict = dict(a = 4.23, b = 10, c = 6.6)
my_dict
The dict class has a number of methods geared toward listing the information contained within. To access the dict's keys, use the .keys() method.
l = my_dict.keys()
l
Just want the values? Use .values()
my_dict.values()
Want both? Use .items(). Note how the data comes back to us: as tuples inside a list-like view object (dict_items)! This just goes to show you how intertwined the different data types are in Python.
a, s, b = my_dict.items()
print(a)
print(s)
print(b)
We can combine dictionary with other data types (such as a list) to make an efficient and effective data structure.
grades = {"John": [90,88,95,86],"Susan":[87,91,92,89],"Chad":[56,None,72,77]}
We can use the keys for efficient look up.
grades["John"]
We can also use the .get() method to get the values that correspond to a specific key.
grades.get("Susan")
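The two look-up styles differ when the key is missing, which the sketch below illustrates: square brackets raise a KeyError, while .get() returns None or a default value that we supply. The key "Wendy" is deliberately absent here, and the variable names are illustrative.

```python
grades = {"John": [90, 88, 95, 86], "Susan": [87, 91, 92, 89]}

# .get() returns None when the key is absent...
missing = grades.get("Wendy")

# ...or a default value of our choosing
with_default = grades.get("Wendy", [])

# Square brackets raise a KeyError for an absent key
try:
    grades["Wendy"]
    raised = False
except KeyError:
    raised = True

print(missing, with_default, raised)
```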
Updating Dictionaries
We can add new dictionary data entries using the .update() method.
new_entry = {"Wendy":[99,98,97,94]} # Another student dictionary entry with grades
grades.update(new_entry) # Update the current dictionary
grades
In a similar fashion, we can update the dictionary directly by providing a new key entry and storing the data.
grades["Seth"] = [10, "sh"]
grades
One can also drop keys by .pop()ing the key value pair out of the collection...
grades.pop("Seth")
...or deleting the key using the del statement.
del grades['Wendy']
grades
Likewise, one can drop every entry at once using the .clear() method:
# Example of using .clear()
grades.clear()
grades
This is barely scratching the surface. For a list of all the dict methods and all the things you can do with a dictionary, see here.
Learning how to access the data types is a foundation of your fluency as a data scientist.
As you transition across different languages, keeping track of how to access elements across different data types is actually quite challenging. You will definitely find yourself searching online many times for this. The important thing here is to make an effort to understand the general rules for accessing elements across languages and data types.
A first way to access elements in collections is through their index position.
Unlike R, Python starts its index at zero.
# Define a list
x = [1, 2.2, "str", True, None]
# first element in python
x[0]
x[1]
# can see how many values are in our container with len()
len(x)
# Can look up individual data values by referencing its location
x[3]
# Python throws an error if we reference an index location that doesn't exist
x[7]
# We use a negative index to count BACKWARDS in our collection data type.
x[-3]
This way of accessing data by index position is standard across a range of data types.
# tuples
tup = (1, 2, "no", True)
# first element
tup[0]
# Last element
tup[-1]
We use the : operator to slice (i.e. select ranges of values). This works using the numerical indices we just learned. In a nutshell, a slice x[start:stop] includes the element at start and excludes the element at stop.
# To pull out values in position 1 and 2
x[1:3]
# When we leave left or right side blank, Python implicitly goes to the beginning or end
x[:3]
x[2:]
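Slices also accept an optional third value, the step, and negative indices work inside a slice just as they do for single elements. The sketch below reuses the list x from above; the result names are illustrative.

```python
x = [1, 2.2, "str", True, None]

every_other = x[::2]   # start to end, taking every second element
last_two = x[-2:]      # from the second-to-last element to the end
reversed_x = x[::-1]   # a step of -1 walks the list backwards

print(every_other)
print(last_two)
print(reversed_x)
```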
# Define a dictionary
grades = {"John":[90,88,95,86],"Susan":[87,91,92,89],"Chad":[56,None,72,77]}
# Unlike lists/tuples, we use a key (not a position) to look up a value in a dictionary
grades["John"]
# We can then index in the data structure housed in that key's value position
# as is appropriate for that data object
grades["John"][0:2]
# Copies with mutable objects -----------------------
# Create a list object
x = ["a","b","c","d"]
x
# Dual assignment: when objects reference the same data.
y = x
print(id(x))
print(id(y))
# If we make a change in one
y[1] = "goat"
# That change is reflected in the other
print(x)
This happens because x and y aren't independent objects; both names reference the same list.
# We can get around this issue by making **copies**
y = x.copy() # Here y is a copy of x.
id(y)
This duplicates the list in memory, so that y and x are independent. There are three common ways to make such a copy:
# (1) Use copy method
y = x.copy()
# (2) Use constructor
y = list(x)
# (3) Slice it
y = x[:]
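One caveat, sketched below: all three approaches make a shallow copy, meaning the outer list is duplicated but any mutable objects nested inside it are still shared. When the nested objects need to be independent too, copy.deepcopy() from the standard library is one option. The nested list here is a made-up example.

```python
import copy

nested = [["a", "b"], ["c", "d"]]

shallow = nested.copy()        # new outer list, same inner lists
deep = copy.deepcopy(nested)   # new outer list and new inner lists

# Change a value inside the first inner list of the original
nested[0][0] = "goat"

print(shallow[0][0])   # the shallow copy sees the change
print(deep[0][0])      # the deep copy does not
```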
list_exercise = ["Ramy", "Victorie", "Letty", "Robin", "Antoine", "Griffin"]
1. Add "Cathy O'Neil" to the list. Insert " Professor Crast" as the first element of the list
2. Remove "Letty" from the list. Also remove the last element of the list.
3. Find the index of the occurrence of the name "Robin". Count the number of times None appears in the list.
4. Create a new list with the names in alphabetical order, copy this list as a new list without changing the values of the original list
5. Add the string "Lovell" to copied_list and ensure that list_exercise remains unchanged.
dict_exercise = {"Ramy": "India",
"Victorie":"Haiti",
"Letty":"England",
"Robin":"Canton",
"Antoine":"Nigeria",
"Griffin":"China"}
dict_exercise
Look up the keys in the dictionary, and store them in a list object called keys
Add yourself and two other colleagues to this dictionary. The values are the countries where the person in the key was born.
Remove "Ramy" from the dictionary, and save the result as another dictionary
babel = "That's just what translation is, I think. That's all speaking is. Listening to the other and trying to see past your own biases to glimpse what they're trying to say. Showing yourself to the world, and hoping someone else understands."
babel
Determine if the word "Babel" is present in the string.
Count how many times the word "translation" appears
Convert the entire string to upper case
Convert the pronoun "I" to "We" in the entire text.
Strip any punctuation (like commas, exclamation marks, etc.) from the string.