PPOL 5203 Data Science I: Foundations

Comprehensions & Generators

Tiago Ventura


Learning goals

In this notebook, we will cover:

  • List and Dictionary Comprehensions
  • Generators of iterables
    • itertools
    • zip
    • enumerate

As a data scientist, you will often need to iterate over a sequence. We have already learned one general approach to iteration: loops.

Python offers some more efficient and more readable alternatives to loops for repeating operations.

For efficiency, we will always prefer a vectorized approach over element-wise repetition. This will be covered in the NumPy notebook.

Instead of focusing on efficiency, this notebook focuses on readability. It discusses comprehensions and generators, which are different, and generally more readable, ways to repeat operations over a series of elements in Python.

Comprehensions

Comprehensions provide a readable and effective way of applying an expression to each item of an iterable.

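The general form of the comprehension:

[<expression> for <item> in <iterable> if <condition>]

For example: [e**2 for e in a_list if type(e) == int]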

See here for more details.

Equivalency in Loops

This operation would be equivalent to the following loop:

a_list = [...]
result = []
for e in a_list:
    if type(e) == int:  # use int for Python 3
        result.append(e**2)

Note on the output:

  • A comprehension already collects the results of your operation into a list, dict, or set.
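As a minimal, runnable sketch of this equivalence, the cell below builds the same result with a loop and with a list comprehension. The contents of a_list are made up for illustration, since the example above leaves them unspecified.

```python
# Hypothetical input: the original example does not say what a_list contains.
a_list = [1, "4", 9, "a", 0, 4]

# Loop version: keep only the integers and square them.
result = []
for e in a_list:
    if type(e) == int:
        result.append(e**2)

# Comprehension version: expression, then the for clause, then the filter.
squared = [e**2 for e in a_list if type(e) == int]

print(result)   # [1, 81, 0, 16]
print(squared)  # [1, 81, 0, 16]
```

Both versions return a fully built list; the comprehension simply skips the explicit empty list and the append call.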

List Comprehensions

Using the list literals [] (brackets), we construct a for loop from within.

In [1]:
words = "This is a such a long    course".split(" ")
words
Out[1]:
['This', 'is', 'a', 'such', 'a', 'long', '', '', '', 'course']

simple list comprehension

In [2]:
[w for w in words]
Out[2]:
['This', 'is', 'a', 'such', 'a', 'long', '', '', '', 'course']

easy to apply functions to elements

In [3]:
[len(w) for w in words]
Out[3]:
[4, 2, 1, 4, 1, 4, 0, 0, 0, 6]

as well as adding conditions

In [4]:
[w for w in words if "This" in w]
Out[4]:
['This']

List comprehensions are a tool for transforming one list (or any iterable object in Python) into another list. They are a syntactic alternative to the long-standing filter() and map() functions, or to an explicit loop.

In [8]:
# object
words = "This is a such a long   course".split(" ")

# Filter step: include those with more than one character
filtered_words = list(filter(lambda word: len(word) > 1, words))
print(filtered_words)
['This', 'is', 'such', 'long', 'course']
In [9]:
# Map step: apply a function to a container
lengths = list(map(len, filtered_words))
print(lengths)
[4, 2, 4, 4, 6]

The same result with a loop

In [10]:
result_list = []
for word in words:
    if len(word) > 1:
        result_list.append(len(word))
In [5]:
# much easier with list comprehensions
list_comp = [len(w) for w in words if len(w) > 1]
In [15]:
result_list
Out[15]:
[4, 2, 4, 4, 6]
In [16]:
list_comp
Out[16]:
[4, 2, 4, 4, 6]

Set Comprehensions

(Introduced in Python 3.0 and backported to Python 2.7)

Using the set literals {}, we construct a for loop from within.

In [10]:
# example 1
{len(word) for word in words}
Out[10]:
{0, 1, 2, 4, 6}
In [11]:
# example 2
{word for word in [1, 2, 3, 3, 3, 3, 4]}
Out[11]:
{1, 2, 3, 4}

Dictionary Comprehensions

(Introduced in Python 3.0 and backported to Python 2.7)

Using braces {} together with a key-value pair {key : value}, we construct a for loop from within.

As with lists/sets:

  • Dictionary comprehension can replace loops when creating dictionaries

  • Or for transforming one dictionary into another dictionary.

    • Items/Keys within the original dictionary can be conditionally included
    • Items/keys can be transformed as needed.
In [32]:
# object
words = "This is a such a long course".split(" ")

# dict comprehension
dict_ = {word:len(word) for word in words}

print(dict_)
{'This': 4, 'is': 2, 'a': 1, 'such': 4, 'long': 4, 'course': 6}
In [33]:
# modifying a fully formed dictionary => Use .items() methods
dict_.items()
Out[33]:
dict_items([('This', 4), ('is', 2), ('a', 1), ('such', 4), ('long', 4), ('course', 6)])
In [34]:
# dict comprehension modifying the keys (values are left unchanged)
{keys.upper(): values for keys, values in dict_.items()}
Out[34]:
{'THIS': 4, 'IS': 2, 'A': 1, 'SUCH': 4, 'LONG': 4, 'COURSE': 6}
In [37]:
# dict comprehension with a dict as input: lower-case the keys
{keys.lower(): values for keys, values in dict_.items()}
Out[37]:
{'this': 4, 'is': 2, 'a': 1, 'such': 4, 'long': 4, 'course': 6}
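The bullet points above mention that items can be conditionally included and transformed. A small sketch of both ideas, reusing the same sentence as before (the variable names word_lengths, long_words and by_length are only illustrative):

```python
words = "This is a such a long course".split(" ")
word_lengths = {word: len(word) for word in words}

# Conditionally include items: keep only entries whose value is greater than 2.
long_words = {key: value for key, value in word_lengths.items() if value > 2}
print(long_words)  # {'This': 4, 'such': 4, 'long': 4, 'course': 6}

# Transform items: swap keys and values.
# Keys must be unique, so words of equal length collapse and the last one wins.
by_length = {value: key for key, value in word_lengths.items()}
print(by_length)   # {4: 'long', 2: 'is', 1: 'a', 6: 'course'}
```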

if statements in comprehensions

In [22]:
# Quickly produce a series of numbers
[i for i in range(10)]
Out[22]:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
In [23]:
[i for i in range(10) if i > 5 ]
Out[23]:
[6, 7, 8, 9]

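An else clause isn't valid in the filter position of a comprehension (after the for), so the filter needs to be kept simple. To choose between two values, use a conditional expression in the output position instead, as described in the next section.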

In [24]:
[i for i in range(10) if i > 5 else "hello"]
  Cell In[24], line 1
    [i for i in range(10) if i > 5 else "hello"]
                                   ^
SyntaxError: invalid syntax

Conditional Expressions

Concise if-then statements

<this_thing> if <this_is_true> else <this_other_thing>
In [25]:
x = 4
"Yes" if x > 5 else "No"
Out[25]:
'No'
In [26]:
x = 6
"Yes" if x > 5 else "No"
Out[26]:
'Yes'
In [27]:
["Yes" if x > 5 else "No" for x in range(10)]
Out[27]:
['No', 'No', 'No', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes']
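To make the two positions of if easier to tell apart, here is a short sketch (with made-up labels) that uses a conditional expression, a filter, and then both in the same comprehension:

```python
# Conditional expression (if/else) sits BEFORE the for: it transforms every item.
labels = ["big" if i > 5 else "small" for i in range(10)]

# Filter (if only) sits AFTER the for: it drops items.
evens = [i for i in range(10) if i % 2 == 0]

# Both together: drop the odd numbers, then label what is left.
even_labels = ["big" if i > 5 else "small" for i in range(10) if i % 2 == 0]

print(evens)        # [0, 2, 4, 6, 8]
print(even_labels)  # ['small', 'small', 'small', 'big', 'big']
```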

Nested comprehensions

In [28]:
# Started with a nested list
nested_list = [[1, 2, 3, 4, 5], 
              ["this", "is", "starting", "to", "get", "weird"]]

nested_list
Out[28]:
[[1, 2, 3, 4, 5], ['this', 'is', 'starting', 'to', 'get', 'weird']]

Works as a nested loop: it goes from the outer to the inner element.

In [29]:
# Unnesting a nested list, for example. 
[element for sublist in nested_list for element in sublist] # notice the order: outer loop first, then the inner loop
Out[29]:
[1, 2, 3, 4, 5, 'this', 'is', 'starting', 'to', 'get', 'weird']
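One way to read the nested comprehension is to write out the loops it stands for. The sketch below shows that the for clauses appear in the same order as they would in an ordinary nested loop:

```python
nested_list = [[1, 2, 3, 4, 5],
               ["this", "is", "starting", "to", "get", "weird"]]

flat_loop = []
for sublist in nested_list:     # outer loop: written first in the comprehension
    for element in sublist:     # inner loop: written second
        flat_loop.append(element)

flat_comp = [element for sublist in nested_list for element in sublist]

print(flat_loop == flat_comp)  # True
```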

Alert, Alert, Alert!!

Loops, Map/Filter or Comprehensions?

Read here for an in-depth discussion about the performance of loops, map/filter, and comprehension techniques in Python.

TLDR:

  • Comprehensions are generally faster (no middle-man).

  • Map/Filter can improve performance on complex functions.

  • IMHO:

    • Learn all three, and become proficient at reading code written with any of them.
    • Write your code with the technique you feel most comfortable with.
    • If writing code at large scale, experiment with more efficient solutions.

Example of speed boost from list comprehension

Comprehensions not only make our code more concise, they are often faster as well.

In [125]:
import time
start = time.time()
container = []
for i in range(100000000):
    container.append(i)
end = time.time()
end-start
Out[125]:
2.597698926925659
In [126]:
import time
start = time.time()
container = []
container = [i for i in range(100000000)]
end = time.time()
end-start
Out[126]:
1.8831946849822998

The comprehension takes about 28% less time than the loop in this run (1.88 versus 2.60 seconds)!
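A single time.time() measurement can be noisy. As a hedged alternative, the sketch below uses the standard library's timeit module to repeat each version several times. The function names and the smaller N are choices made here for illustration, and the timings will differ from machine to machine.

```python
import timeit

N = 100_000  # much smaller than the example above so that this runs quickly

def with_loop():
    container = []
    for i in range(N):
        container.append(i)
    return container

def with_comprehension():
    return [i for i in range(N)]

# Run each version 20 times and report the total seconds for each.
loop_seconds = timeit.timeit(with_loop, number=20)
comp_seconds = timeit.timeit(with_comprehension, number=20)

print(f"loop: {loop_seconds:.3f}s  comprehension: {comp_seconds:.3f}s")
```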


Generators

We will now introduce an important tool in Python called generators.

Definition: A generator is a special type of function in Python that creates an iterator. It allows you to generate a series of values over time, rather than computing them all at once and holding them in memory.

Compare with list

Let's compare the idea of a generator with a simple iterable object, a list. A list stores all of its members up front; you can access any of its contents via indexing, or iterate over them with a loop. A generator, on the other hand, works with lazy evaluation and only creates its contents on request: it produces one value at a time, on the fly, only when you ask for it.

The whole point of this is that you can use a generator to produce a long sequence of items without having to store them all in memory.
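A rough way to see the memory difference is to compare the size of a list with the size of a generator object that would produce the same values. This is only a sketch: the squares function is a made-up example, it uses the yield keyword that is introduced later in this notebook, and the exact byte counts depend on your Python version.

```python
import sys

def squares(n):
    """Generator function: yields one square at a time."""
    for i in range(n):
        yield i * i

n = 1_000_000
squares_list = [i * i for i in range(n)]  # all values are stored at once
squares_gen = squares(n)                  # nothing has been computed yet

# getsizeof reports the size of the container object itself, not the items inside.
print(sys.getsizeof(squares_list))  # several megabytes
print(sys.getsizeof(squares_gen))   # a few hundred bytes at most

# Both give the same total once they are consumed.
print(sum(squares_list) == sum(squares_gen))  # True
```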

The range generator

An extremely popular built-in that behaves much like a generator is range. (Strictly speaking, range returns a lazy sequence object rather than a generator, but it likewise produces its values on demand.) range is often used as a way to implement loops. It takes the following inputs range(start, stop, step), where:

‘start’ (inclusive, default=0)

‘stop’ (exclusive)

‘step’ (default=1)

Like a generator, range will produce the corresponding sequence of integers (from start to stop, using the step size) upon iteration. But remember, this is a lazy evaluation: the integers are not stored in memory up front.

Build a very simple Generator:

In [38]:
def gen123():
    yield 1
    yield 2
    yield 3
In [39]:
gen123
Out[39]:
<function __main__.gen123()>
In [41]:
g = gen123() # instantiate in an object
g
Out[41]:
<generator object gen123 at 0x1102b4b40>

Generators are similar to functions; however, rather than using the return keyword, we use the yield keyword. If the yield keyword appears anywhere in a function, then that function is a generator function.

Understanding the yield keyword

When execution inside a generator function reaches a yield, the "state" of the generator is frozen: the values of all variables are saved and the next line of code to be executed is recorded, until the next value is requested from the generator. Once it is, the generator picks up where it left off and continues execution until it hits another yield statement.
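The pause-and-resume behaviour described above can be made visible with a few print calls. The countdown function below is a made-up example, not part of the original notebook:

```python
def countdown(n):
    print("starting")   # runs only when the first value is requested
    while n > 0:
        yield n         # pause here and hand n back to the caller
        n -= 1          # execution resumes here on the next request
    print("done")

c = countdown(2)   # nothing is printed yet: the body has not started running
print(next(c))     # prints "starting", then 2
print(next(c))     # prints 1 (the value of n was remembered between calls)
print(list(c))     # prints "done", then [] because nothing is left
```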

To recap, Generators:

  • Specify an iterable sequence that you evaluate lazily (compute on demand).
  • All generators are iterators
  • Can model infinite sequences (such as data streams with no definite end)

A generator behaves just like any other iterator; however, what is being requested isn't the next stored item, but rather the next computation.

In [42]:
next(g)
Out[42]:
1
In [43]:
next(g)
Out[43]:
2
In [44]:
next(g)
Out[44]:
3
In [45]:
next(g)
---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
Cell In[45], line 1
----> 1 next(g)

StopIteration: 
In [12]:
l_ =[1, 2, 3]
l_iter = iter(l_)
next(l_iter)
Out[12]:
1
In [13]:
next(l_iter)
Out[13]:
2
In [14]:
next(l_iter)
Out[14]:
3

To keep in mind:

A generator is a simple way to construct a new iterator object that is evaluated lazily. You can use a generator to turn a computation that is not itself iterable into something you can iterate over.

That flexibility gives Python many different ways to create iterable objects.
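The recap earlier noted that generators can model infinite sequences. A minimal sketch of that idea, using a made-up evens generator, is shown below. Because the generator never finishes on its own, the loop that consumes it has to decide when to stop.

```python
def evens():
    """Infinite generator: yields 0, 2, 4, ... forever."""
    n = 0
    while True:
        yield n
        n += 2

first_five = []
for value in evens():
    if len(first_five) == 5:
        break               # we must stop ourselves; the generator never will
    first_five.append(value)

print(first_five)  # [0, 2, 4, 6, 8]
```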

Let's see some examples of three built-in functions that return generator-like objects. Notice that when a function returns a generator, it doesn't have to generate all the output at once. Instead, it generates each item one at a time on-the-fly as you iterate over it. This means that a generator can generate a very large, or even infinite, amount of output while using very little memory.

We will focus on three that are heavily used in data science:

  • range()
  • zip()
  • enumerate()

range()

Consider the following example usages of range:

In [48]:
# create a range
r = range(0, 10, 2)
In [49]:
# type
type(r)
Out[49]:
range
In [50]:
# list comprehension
print([r_ for r_ in r])
[0, 2, 4, 6, 8]
In [51]:
for r_ in range(10):
    print(r_)
0
1
2
3
4
5
6
7
8
9
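One detail worth knowing: a range object is lazy like a generator, but strictly speaking it is a sequence, so it also supports len(), indexing and membership tests without ever building the full list. The following sketch illustrates this with a deliberately huge range:

```python
big = range(0, 10**12, 3)   # created instantly; no integers are stored

print(len(big))     # 333333333334
print(big[5])       # 15
print(999 in big)   # True, because 999 is a multiple of 3

# range is iterable but is not itself an iterator, so next() needs iter() first.
it = iter(big)
print(next(it), next(it))  # 0 3
```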

zip()

Pairs up the elements of two (or more) iterables into tuples.

In [52]:
a = list(range(10))
b = list(range(-10,0))
sync = zip(a,b) # its own object type
sync
Out[52]:
<zip at 0x1102c7fc0>

A zip object can be consumed only once. In the cell below, sync had already been exhausted by earlier calls in the session, so asking for another item raises StopIteration:

In [64]:
next(sync)
---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
Cell In[64], line 1
----> 1 next(sync)

StopIteration: 
In [65]:
[item for item in zip(a,b)]
Out[65]:
[(0, -10),
 (1, -9),
 (2, -8),
 (3, -7),
 (4, -6),
 (5, -5),
 (6, -4),
 (7, -3),
 (8, -2),
 (9, -1)]
In [66]:
# you can also unpack those
for a_, b_ in zip(a,b):
    print("element a:", a_, "element b:", b_)
element a: 0 element b: -10
element a: 1 element b: -9
element a: 2 element b: -8
element a: 3 element b: -7
element a: 4 element b: -6
element a: 5 element b: -5
element a: 6 element b: -4
element a: 7 element b: -3
element a: 8 element b: -2
element a: 9 element b: -1
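Two properties of zip are easy to miss: it stops at the shortest input, and the zip object can be consumed only once. The sketch below shows both with made-up data, plus a common use of zip for building a dictionary:

```python
names = ["Ana", "Ben", "Cy"]   # hypothetical data
scores = [90, 85]              # one element shorter on purpose

pairs = zip(names, scores)
print(list(pairs))  # [('Ana', 90), ('Ben', 85)]  zip stops at the shortest input
print(list(pairs))  # []  the zip object is single-pass and is now exhausted

# A common use: build a dictionary from two parallel lists.
lookup = dict(zip(names, scores))
print(lookup)       # {'Ana': 90, 'Ben': 85}
```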

enumerate()

Generates an index and value tuple pairing

In [67]:
my_list = 'Iterator tools are useful to move across iterable objects in complex ways.'.split()
print(my_list)
['Iterator', 'tools', 'are', 'useful', 'to', 'move', 'across', 'iterable', 'objects', 'in', 'complex', 'ways.']
In [68]:
my_list_gen = enumerate(my_list)
In [69]:
next(my_list_gen)
Out[69]:
(0, 'Iterator')
In [70]:
[i for i in enumerate(my_list)]
Out[70]:
[(0, 'Iterator'),
 (1, 'tools'),
 (2, 'are'),
 (3, 'useful'),
 (4, 'to'),
 (5, 'move'),
 (6, 'across'),
 (7, 'iterable'),
 (8, 'objects'),
 (9, 'in'),
 (10, 'complex'),
 (11, 'ways.')]
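In practice, the (index, value) pairs are usually unpacked directly in the loop header. The short sketch below also uses the optional start argument of enumerate to begin counting at 1:

```python
my_list = ["Iterator", "tools", "are", "useful"]

# Unpack the (index, value) pair directly in the loop header.
for position, word in enumerate(my_list, start=1):
    print(position, word)

# The same idea inside a comprehension: keep the words at even indices.
even_index_words = [word for i, word in enumerate(my_list) if i % 2 == 0]
print(even_index_words)  # ['Iterator', 'are']
```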

Additional Topic: itertools

Part of the Python standard library, itertools deals with Python's iterator objects. It provides robust functionality for iterable sequences. Functions in itertools operate on iterators to produce more complex iterators.

There are many functions in itertools. See the documentation here. Most importantly, try to understand what is going on behind each function just by reading the documentation. It is fun!

.combinations()

All possible combinations of a given length (order does not matter, no repeated elements).

In [40]:
import itertools
x = ['a','b','c','d']
[i for i in itertools.combinations(x,2)]
Out[40]:
[('a', 'b'), ('a', 'c'), ('a', 'd'), ('b', 'c'), ('b', 'd'), ('c', 'd')]

.permutations()

In [41]:
for i in itertools.permutations(x):
    print(i)
('a', 'b', 'c', 'd')
('a', 'b', 'd', 'c')
('a', 'c', 'b', 'd')
('a', 'c', 'd', 'b')
('a', 'd', 'b', 'c')
('a', 'd', 'c', 'b')
('b', 'a', 'c', 'd')
('b', 'a', 'd', 'c')
('b', 'c', 'a', 'd')
('b', 'c', 'd', 'a')
('b', 'd', 'a', 'c')
('b', 'd', 'c', 'a')
('c', 'a', 'b', 'd')
('c', 'a', 'd', 'b')
('c', 'b', 'a', 'd')
('c', 'b', 'd', 'a')
('c', 'd', 'a', 'b')
('c', 'd', 'b', 'a')
('d', 'a', 'b', 'c')
('d', 'a', 'c', 'b')
('d', 'b', 'a', 'c')
('d', 'b', 'c', 'a')
('d', 'c', 'a', 'b')
('d', 'c', 'b', 'a')
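The difference between the two functions is whether order matters. As a quick check, the sketch below compares pairs drawn with each function and confirms the counts against math.comb and math.perm (available in Python 3.8 and later):

```python
import itertools
import math

x = ['a', 'b', 'c', 'd']

pairs_unordered = list(itertools.combinations(x, 2))
pairs_ordered = list(itertools.permutations(x, 2))

print(len(pairs_unordered), math.comb(4, 2))  # 6 6
print(len(pairs_ordered), math.perm(4, 2))    # 12 12

print(('b', 'a') in pairs_unordered)  # False: ('a', 'b') already covers this pair
print(('b', 'a') in pairs_ordered)    # True: order matters for permutations
```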

.count()

Creates a count generator.

In [46]:
# don't loop over this: it is an infinite generator
counter = itertools.count(start=0,step=.3)
In [47]:
next(counter)
Out[47]:
0
In [48]:
next(counter)
Out[48]:
0.3
In [49]:
list(zip(itertools.count(step=5),"Georgetown"))
Out[49]:
[(0, 'G'),
 (5, 'e'),
 (10, 'o'),
 (15, 'r'),
 (20, 'g'),
 (25, 'e'),
 (30, 't'),
 (35, 'o'),
 (40, 'w'),
 (45, 'n')]
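Besides pairing it with a finite iterable through zip, another safe way to take values from an infinite counter is itertools.islice, which cuts a bounded slice out of any iterator. A minimal sketch:

```python
import itertools

counter = itertools.count(start=10, step=5)

# islice takes a bounded slice of any iterator, even an infinite one.
first_four = list(itertools.islice(counter, 4))

print(first_four)     # [10, 15, 20, 25]
print(next(counter))  # 30: the counter continues where islice stopped
```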

.chain()

Lazily concatenates iterables together without the memory overhead of duplication.

In [50]:
list(itertools.chain('ABC', 'DEF'))
Out[50]:
['A', 'B', 'C', 'D', 'E', 'F']
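When the iterables to join are already stored inside another list, chain.from_iterable can be used. The sketch below, with a made-up nested list, shows that it gives the same result as the nested comprehension from earlier in this notebook:

```python
import itertools

nested_list = [[1, 2, 3], ["a", "b"], [4.5]]

# chain.from_iterable takes ONE iterable of iterables and flattens one level.
flat = list(itertools.chain.from_iterable(nested_list))
print(flat)  # [1, 2, 3, 'a', 'b', 4.5]

# Equivalent nested comprehension.
same = [element for sublist in nested_list for element in sublist]
print(flat == same)  # True
```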

.repeat()

In [51]:
list(itertools.repeat("a",10))
Out[51]:
['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a']
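Without the second argument, repeat never stops, which makes it handy for supplying a constant alongside another iterable. A small sketch with made-up values, relying on zip and map stopping at the shortest input:

```python
import itertools

words = ["comprehensions", "generators"]

# zip stops when words runs out, so the endless repeat is safe here.
tagged = list(zip(words, itertools.repeat("topic")))
print(tagged)  # [('comprehensions', 'topic'), ('generators', 'topic')]

# Supplying a constant second argument to map: square each number.
squares = list(map(pow, range(5), itertools.repeat(2)))
print(squares)  # [0, 1, 4, 9, 16]
```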

These are but a few! Check out all that itertools has to offer here
