PPOL 5203 Data Science I: Foundations

File Management

Tiago Ventura


Learning Goals

In this notebook, we will cover:

  • Connection management: open(), close()
  • Reading/writing files
  • Using with to manage connections
  • Reading .csv files

TLDR: Most often we will use high-level functions from Pandas to load data into Python objects. However, if you are migrating from R or Stata to Python, note that explicit connection management (open(), close(), and the with statement) is very characteristic of Python code and is not heavily used in those languages. These file handlers are also important when working with non-tabular data, which you often do not need (or do not want) to fit into a dataframe.

**Working Directory:** Notice that we will read some local files in this notebook. These files need to be in the same working directory as your notebook, or else Python will not know where the files are on your computer. Please download the accompanying files (`news-story.txt` and `student_data.csv`) from the class website and save them somewhere you can easily point Python to, ideally the folder you use for this class.

Reading: Check Section 3.3 of Python for Data Analysis to learn more about the topics covered in the notebook.

In [1]:
import os
In [5]:
# check my working directory
os.getcwd()
Out[5]:
'/Users/tb186/Dropbox/courses/ds-1/ppol5203/lecture_notes/week-04'
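
A quick way to confirm that the files are where Python expects them is to test for them before opening anything. The sketch below is only an illustration: `missing_files` is a hypothetical helper (not part of the course materials), and it is demonstrated on a throwaway folder so that it runs on any machine.

```python
import os
import shutil
import tempfile

def missing_files(filenames, folder=None):
    """Return the names in `filenames` that do not exist in `folder`.

    `folder` defaults to the current working directory.
    """
    if folder is None:
        folder = os.getcwd()
    return [name for name in filenames
            if not os.path.isfile(os.path.join(folder, name))]

# Demonstration in a throwaway folder, so the example runs anywhere.
demo_dir = tempfile.mkdtemp()
f = open(os.path.join(demo_dir, "student_data.csv"), mode="w", encoding="utf-8")
f.write("Student,Grade\n")
f.close()

# Only one of the two expected files exists in the demo folder.
demo_missing = missing_files(["news-story.txt", "student_data.csv"], folder=demo_dir)
print(demo_missing)

shutil.rmtree(demo_dir)  # remove the throwaway folder
```

In the notebook itself, calling `missing_files(["news-story.txt", "student_data.csv"])` with no folder checks the current working directory; if files are reported missing, `os.chdir()` changes the working directory.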

Reading Files

open()

The built-in open() function opens files on our system. The function takes the following arguments:

  • a file path
  • a mode describing how to treat the file (e.g. read the file, write to the file, append to the file, etc.). Default is read mode ("r").
  • an encoding. If omitted, Python uses a platform-dependent default (UTF-8 on most macOS and Linux systems), so it is safest to pass it explicitly.
In [6]:
file = open("redrising.txt",mode='r',encoding='UTF-8')
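
The encoding argument starts to matter as soon as a file contains non-ASCII characters, such as the curly apostrophe in "father’s". The sketch below uses a hypothetical scratch file (`encoding_demo.txt`, created in a temporary folder) to show the same bytes decoded with the matching encoding and with a mismatched one.

```python
import os
import shutil
import tempfile

# Hypothetical scratch file, so the example does not depend on the class files.
demo_dir = tempfile.mkdtemp()
path = os.path.join(demo_dir, "encoding_demo.txt")

# Write one word containing a curly apostrophe (U+2019, a non-ASCII character).
f = open(path, mode="w", encoding="utf-8")
f.write("father\u2019s")
f.close()

# Same encoding on the way back in: the text round-trips unchanged.
f = open(path, mode="r", encoding="utf-8")
same_encoding = f.read()
f.close()

# A different encoding decodes the same bytes into different characters.
f = open(path, mode="r", encoding="latin-1")
wrong_encoding = f.read()
f.close()

print(same_encoding == "father\u2019s")
print(ascii(wrong_encoding))  # ascii() escapes the garbled characters for display

shutil.rmtree(demo_dir)  # remove the scratch folder
```

The apostrophe takes three bytes in UTF-8, and latin-1 turns each byte into its own character, which is why the mismatched read comes back longer and garbled.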

Alert:

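open() returns an object of type _io.TextIOWrapper. This is a file-like object (also called a file handle), a term that Python defines only loosely: roughly, any object that offers methods such as .read() or .write().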

In [7]:
type(file)
Out[7]:
_io.TextIOWrapper
In [8]:
file_ = file.read()
file_
Out[8]:
'1 - Helldiver\n\nThe first thing you should know about me is I am my father’s son. And when they came for him, I did as he asked. \n\nI did not cry. \n\nNot when the Society televised the arrest. \n\nNot when the Golds tried him. Not when the Grays hanged him.\n\nMother hit me for that. \n\nMy brother Kieran was supposed to be the stoic one. He was the elder, I the younger. I was supposed to cry. Instead, Kieran bawled like a girl when Little Eo tucked a haemanthus into Father’s left workboot and ran back to her own father’s side. My sister Leanna murmured a lament beside me. \n\nI just watched and thought it a shame that he died dancing but without his dancing shoes.\n\nOn Mars there is not much gravity. So you have to pull the feet to break the neck. \n\nThey let the loved ones do it.\n\n'
In [9]:
# read() leaves the position at the end of the file, so a second read() returns an empty string
print(file.read()) 
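
The file object keeps track of a position, and reading moves that position forward. As a minimal sketch (using a hypothetical scratch file rather than the class files), `.seek(0)` moves the position back to the start so that the contents can be read again without reopening the file.

```python
import os
import shutil
import tempfile

# Hypothetical scratch file created just for this demonstration.
demo_dir = tempfile.mkdtemp()
path = os.path.join(demo_dir, "rewind_demo.txt")
f = open(path, mode="w", encoding="utf-8")
f.write("line one\nline two\n")
f.close()

f = open(path, mode="r", encoding="utf-8")
first_read = f.read()    # reads everything; the position is now at the end
second_read = f.read()   # nothing is left to read, so this is an empty string
f.seek(0)                # move the position back to the start of the file
third_read = f.read()    # the full contents are available again
f.close()

print(repr(second_read))
print(first_read == third_read)

shutil.rmtree(demo_dir)  # remove the scratch folder
```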


close()

Once we are done with a file, we need to close it.

In [10]:
file.close()

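Opening and forgetting to close files can lead to a bunch of issues, mainly the mismanagement of computational resources on your machine (each open file holds on to an operating-system file handle).

Moreover, when writing, text is first buffered in memory; close() flushes that buffer, so it is what guarantees that everything you wrote actually reaches the file on disk.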
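
One common way to make sure close() runs even when something goes wrong in between is a try/finally block. The sketch below is an illustration under assumptions: the scratch file and its contents are made up, and the failure is provoked on purpose by trying to convert a non-numeric line to an integer.

```python
import os
import shutil
import tempfile

# Hypothetical scratch file whose second line cannot be converted to an integer.
demo_dir = tempfile.mkdtemp()
path = os.path.join(demo_dir, "cleanup_demo.txt")
f = open(path, mode="w", encoding="utf-8")
f.write("12\nnot a number\n")
f.close()

f = open(path, mode="r", encoding="utf-8")
error_seen = None
try:
    numbers = [int(line) for line in f]  # fails on the second line
except ValueError as err:
    error_seen = err                     # keep the error so we can inspect it
finally:
    f.close()                            # runs whether or not an error occurred

print(f.closed)
print(type(error_seen).__name__)

shutil.rmtree(demo_dir)  # remove the scratch folder
```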


Methods available when reading in files

Selected attributes and methods of `TextIOWrapper` objects (attributes are accessed without parentheses)

Name              Description
._CHUNK_SIZE      Private implementation detail (an integer)
._finalizing      Private implementation detail (a boolean)
.buffer           The underlying binary buffer that the text wrapper reads from and writes to
.closed           True if the file has been closed, False otherwise
.encoding         Name of the encoding used to decode and encode the text
.errors           The error-handling setting used for decoding and encoding errors
.line_buffering   True if line buffering is enabled
.mode             The mode the file was opened with
.name             The file name or path passed to open()
.readlines()      Return a list of lines from the stream.
.reconfigure()    Reconfigure the text stream with new parameters.
.write_through    True if writes are passed straight through to the underlying buffer
In [11]:
file = open("redrising.txt",mode='rt',encoding='UTF-8')
file.readlines() # convert all items to a list
Out[11]:
['1 - Helldiver\n',
 '\n',
 'The first thing you should know about me is I am my father’s son. And when they came for him, I did as he asked. \n',
 '\n',
 'I did not cry. \n',
 '\n',
 'Not when the Society televised the arrest. \n',
 '\n',
 'Not when the Golds tried him. Not when the Grays hanged him.\n',
 '\n',
 'Mother hit me for that. \n',
 '\n',
 'My brother Kieran was supposed to be the stoic one. He was the elder, I the younger. I was supposed to cry. Instead, Kieran bawled like a girl when Little Eo tucked a haemanthus into Father’s left workboot and ran back to her own father’s side. My sister Leanna murmured a lament beside me. \n',
 '\n',
 'I just watched and thought it a shame that he died dancing but without his dancing shoes.\n',
 '\n',
 'On Mars there is not much gravity. So you have to pull the feet to break the neck. \n',
 '\n',
 'They let the loved ones do it.\n',
 '\n']
In [12]:
# Is the file closed?
file.closed
Out[12]:
False
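
Several entries in the table above are plain attributes that describe the open file. The following sketch, which assumes a hypothetical scratch file instead of the class files, collects a few of them while the file is open and checks `.closed` again afterwards.

```python
import os
import shutil
import tempfile

# Hypothetical scratch file created just for this demonstration.
demo_dir = tempfile.mkdtemp()
path = os.path.join(demo_dir, "attributes_demo.txt")
f = open(path, mode="wt", encoding="utf-8")
f.write("first line\nsecond line\n")
f.close()

f = open(path, mode="rt", encoding="utf-8")
while_open = {
    "name": f.name,          # the path passed to open()
    "mode": f.mode,          # the mode string passed to open()
    "encoding": f.encoding,  # the encoding used to decode the file
    "closed": f.closed,      # False while the file is still open
}
lines = f.readlines()        # a method: note the parentheses
f.close()
closed_after = f.closed      # True once close() has been called

print(while_open["mode"], while_open["closed"], closed_after)
print(lines)

shutil.rmtree(demo_dir)  # remove the scratch folder
```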

File modes

Mode   Description
r      open for reading (default)
w      open for writing, truncating the file first
x      open for exclusive creation, failing if the file already exists
a      open for writing, appending to the end of the file if it exists
b      binary mode
t      text mode (default)

Examples:

  • mode = 'rb' → "read binary"
  • mode = 'wt' → "write text"
In [13]:
f = open('redrising.txt',mode="rt",encoding='utf-8')

# Print the mode
print(f.mode)

f.close()
rt
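
The difference between the writing modes is easiest to see side by side. The sketch below is a small illustration on a hypothetical scratch file; `read_all` is a helper defined only for this example.

```python
import os
import shutil
import tempfile

demo_dir = tempfile.mkdtemp()
path = os.path.join(demo_dir, "modes_demo.txt")  # hypothetical scratch file

def read_all(p):
    """Return the full text of the file at path `p`."""
    handle = open(p, mode="r", encoding="utf-8")
    text = handle.read()
    handle.close()
    return text

# 'w' creates the file (or empties it if it already exists).
f = open(path, mode="w", encoding="utf-8")
f.write("first\n")
f.close()

# 'a' keeps the existing contents and adds to the end.
f = open(path, mode="a", encoding="utf-8")
f.write("second\n")
f.close()
after_append = read_all(path)

# 'x' refuses to open a file that already exists.
try:
    f = open(path, mode="x", encoding="utf-8")
    f.close()
    exclusive_failed = False
except FileExistsError:
    exclusive_failed = True

# 'w' again: the earlier contents are discarded.
f = open(path, mode="w", encoding="utf-8")
f.write("third\n")
f.close()
after_rewrite = read_all(path)

# 'rb' returns raw bytes instead of decoded text.
f = open(path, mode="rb")
raw = f.read()
f.close()

print(repr(after_append), exclusive_failed, repr(after_rewrite), type(raw).__name__)

shutil.rmtree(demo_dir)  # remove the scratch folder
```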

Writing files

In [38]:
f = open('text_file.txt',mode="wt",encoding='utf-8')
f.write('This is an example\n') 
f.write('Of writing a file...\n')
f.write('Neat!\n')
f.close()

NOTE: writes are buffered, so you must close() the file (or call .flush()) to be sure your lines have been written to the file on disk.
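
To see the buffering at work, we can compare the size of the file on disk before and after the buffer is flushed. This is a sketch under assumptions: the scratch file is hypothetical, and the size before flushing is typically 0 for a short write like this one, although the exact buffering behavior can vary by platform.

```python
import os
import shutil
import tempfile

demo_dir = tempfile.mkdtemp()
path = os.path.join(demo_dir, "buffer_demo.txt")  # hypothetical scratch file

f = open(path, mode="wt", encoding="utf-8")
f.write("buffered line\n")
size_before = os.path.getsize(path)       # typically 0: the text is still in memory
f.flush()                                 # push the buffered text to the file
size_after_flush = os.path.getsize(path)  # now the bytes are on disk
f.close()                                 # close() also flushes before closing

f = open(path, mode="rt", encoding="utf-8")
contents = f.read()
f.close()

print(size_before, size_after_flush)
print(repr(contents))

shutil.rmtree(demo_dir)  # remove the scratch folder
```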

Now, read the file back in using read mode:

In [14]:
f = open('text_file.txt',mode="rt",encoding='utf-8')
print(f.read())
f.close() # close the file once we are done reading
This is an example
Of writing a file...
Neat!


Iterating over files

Look at the code below:

In [39]:
file = open("redrising.txt",mode='rt',encoding='UTF-8')
[i for i in dir(file) if (i=="__iter__" or i=="__next__")]
Out[39]:
['__iter__', '__next__']

Looking at the object's attributes, we see both an __iter__() and a __next__() method, which means we can iterate over the open file object line by line.

Most of the time, we iterate over the open file to collect its lines into a single list.
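
To make the role of __iter__() and __next__() concrete, the sketch below steps through a file by hand with the built-in next() before handing the rest to a list comprehension. The scratch file and its contents are hypothetical.

```python
import os
import shutil
import tempfile

demo_dir = tempfile.mkdtemp()
path = os.path.join(demo_dir, "iter_demo.txt")  # hypothetical scratch file
f = open(path, mode="wt", encoding="utf-8")
f.write("alpha\nbeta\ngamma\n")
f.close()

f = open(path, mode="rt", encoding="utf-8")
is_own_iterator = iter(f) is f            # the file object is its own iterator
first = next(f)                           # calls f.__next__(): the first line
second = next(f)                          # the second line
remaining = [line for line in f]          # iteration continues where next() stopped
finished = next(f, "no more lines")       # the default is returned once exhausted
f.close()

print(is_own_iterator, repr(first), repr(second), remaining, finished)

shutil.rmtree(demo_dir)  # remove the scratch folder
```

Because the position only moves forward, each line is produced once; this is the same behavior that makes a second .read() return an empty string.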

See a few options below:

In [42]:
# with a loop
file = open("redrising.txt",mode='rt',encoding='UTF-8')
text=[]
for line in file:
    text.append(line)
file.close()
In [43]:
## The lines of the file are now stored in a Python list
text
Out[43]:
['1 - Helldiver\n',
 '\n',
 'The first thing you should know about me is I am my father’s son. And when they came for him, I did as he asked. \n',
 '\n',
 'I did not cry. \n',
 '\n',
 'Not when the Society televised the arrest. \n',
 '\n',
 'Not when the Golds tried him. Not when the Grays hanged him.\n',
 '\n',
 'Mother hit me for that. \n',
 '\n',
 'My brother Kieran was supposed to be the stoic one. He was the elder, I the younger. I was supposed to cry. Instead, Kieran bawled like a girl when Little Eo tucked a haemanthus into Father’s left workboot and ran back to her own father’s side. My sister Leanna murmured a lament beside me. \n',
 '\n',
 'I just watched and thought it a shame that he died dancing but without his dancing shoes.\n',
 '\n',
 'On Mars there is not much gravity. So you have to pull the feet to break the neck. \n',
 '\n',
 'They let the loved ones do it.\n',
 '\n']
In [44]:
# with list comprehension
file = open("redrising.txt",mode='rt',encoding='UTF-8')
result = [line.replace("\n", "") for line in file if line!="\n"]
file.close()
result
Out[44]:
['1 - Helldiver',
 'The first thing you should know about me is I am my father’s son. And when they came for him, I did as he asked. ',
 'I did not cry. ',
 'Not when the Society televised the arrest. ',
 'Not when the Golds tried him. Not when the Grays hanged him.',
 'Mother hit me for that. ',
 'My brother Kieran was supposed to be the stoic one. He was the elder, I the younger. I was supposed to cry. Instead, Kieran bawled like a girl when Little Eo tucked a haemanthus into Father’s left workboot and ran back to her own father’s side. My sister Leanna murmured a lament beside me. ',
 'I just watched and thought it a shame that he died dancing but without his dancing shoes.',
 'On Mars there is not much gravity. So you have to pull the feet to break the neck. ',
 'They let the loved ones do it.']

Or you can assign the output of .read() to an object:

In [18]:
file = open("redrising.txt",mode='rt',encoding='UTF-8')
result = file.read()
file.close() # close the file once its contents are stored in `result`
result
Out[18]:
'1 - Helldiver\n\nThe first thing you should know about me is I am my father’s son. And when they came for him, I did as he asked. \n\nI did not cry. \n\nNot when the Society televised the arrest. \n\nNot when the Golds tried him. Not when the Grays hanged him.\n\nMother hit me for that. \n\nMy brother Kieran was supposed to be the stoic one. He was the elder, I the younger. I was supposed to cry. Instead, Kieran bawled like a girl when Little Eo tucked a haemanthus into Father’s left workboot and ran back to her own father’s side. My sister Leanna murmured a lament beside me. \n\nI just watched and thought it a shame that he died dancing but without his dancing shoes.\n\nOn Mars there is not much gravity. So you have to pull the feet to break the neck. \n\nThey let the loved ones do it.\n\n'

Example: How many words are in each line?

In [45]:
file = open("redrising.txt",mode='rt',encoding='UTF-8')

for line in file:
    if line == '\n':
        continue
    n_words_per_line = len(line.split())
    print(n_words_per_line)
    
file.close()
3
25
4
7
12
5
54
17
18
7

with: opening and closing files with a context manager

As you'll note, the need to open() and close() files can get a bit redundant after a while. Closing after opening to clean up resources is common enough that Python has a special protocol for it: the with statement, which closes the file automatically when its indented block ends, even if an error occurs inside it.

In [46]:
# using list comprehension
# with open() as alias:
with open("redrising.txt",mode='rt',encoding='UTF-8') as file:
    res=[len(line.split()) for line in file if line!="\n"]

print(res)
[3, 25, 4, 7, 12, 5, 54, 17, 18, 7]
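
We can check that the with block really does the closing for us, including when the block is interrupted by an error. The sketch below uses a hypothetical scratch file and a deliberately raised RuntimeError to simulate a failure.

```python
import os
import shutil
import tempfile

demo_dir = tempfile.mkdtemp()
path = os.path.join(demo_dir, "with_demo.txt")  # hypothetical scratch file

with open(path, mode="wt", encoding="utf-8") as f:
    f.write("one two three\nfour five\n")

with open(path, mode="rt", encoding="utf-8") as f:
    closed_inside = f.closed                 # False: still inside the block
    counts = [len(line.split()) for line in f]
closed_after = f.closed                      # True: closed when the block ended

# The file is also closed when the block exits because of an error.
try:
    with open(path, mode="rt", encoding="utf-8") as g:
        raise RuntimeError("simulated failure while the file is open")
except RuntimeError:
    pass
closed_after_error = g.closed                # True despite the error

print(closed_inside, closed_after, closed_after_error, counts)

shutil.rmtree(demo_dir)  # remove the scratch folder
```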

Reading Comma Separated Values (CSV)

Here we will pretty much always use pandas.read_csv() to import CSV files into Python. In case you want to learn a bit about the csv module, here are some examples.

See the Python documentation for more on the csv module, which is part of the standard library.

In [48]:
import csv
In [49]:
with open("student_data.csv",mode='rt') as file:
    data = csv.reader(file)
In [50]:
print(data)
<_csv.reader object at 0x111dc5f50>
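
Note that a csv.reader object is lazy: it only pulls rows from the file when we iterate over it. Since the with block above has already closed the file, trying to use the reader afterwards fails. The sketch below reproduces this on a hypothetical scratch file and then shows the rows being read while the file is still open.

```python
import csv
import os
import shutil
import tempfile

demo_dir = tempfile.mkdtemp()
path = os.path.join(demo_dir, "reader_demo.csv")  # hypothetical scratch file
with open(path, mode="w", newline="", encoding="utf-8") as f:
    f.write("Student,Grade\nSusan,A\n")

# The reader is created inside the block, but no rows are read yet.
with open(path, mode="rt", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)

# Outside the block the file is closed, so iterating raises a ValueError.
try:
    rows_too_late = list(reader)
except ValueError as err:
    rows_too_late = None
    message = str(err)

# Reading the rows while the file is still open works as expected.
with open(path, mode="rt", newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))

print(rows_too_late)
print(rows)

shutil.rmtree(demo_dir)  # remove the scratch folder
```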

Reading in .csv data

In [22]:
with open("student_data.csv",mode='rt') as file:
    data = csv.reader(file)
    for row in data:
        print(row)
['Student', 'Grade']
['Susan', 'A']
['Sean', 'B-']
['Cody', 'A-']
['Karen', 'B+']
In [51]:
with open("student_data.csv",mode='rt') as file:
    data = csv.reader(file)
    output = [row for row in data]
output
Out[51]:
[['Student', 'Grade'],
 ['Susan', 'A'],
 ['Sean', 'B-'],
 ['Cody', 'A-'],
 ['Karen', 'B+']]
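
When the first row holds column names, csv.DictReader (also part of the csv module) may be more convenient: it returns each row as a dictionary keyed by the header. A minimal sketch, assuming a scratch copy of the student data rather than the class file itself:

```python
import csv
import os
import shutil
import tempfile

demo_dir = tempfile.mkdtemp()
path = os.path.join(demo_dir, "dictreader_demo.csv")  # hypothetical scratch file
with open(path, mode="w", newline="", encoding="utf-8") as f:
    f.write("Student,Grade\nSusan,A\nSean,B-\n")

with open(path, mode="rt", newline="", encoding="utf-8") as f:
    records = list(csv.DictReader(f))   # the header row supplies the keys

# Build a simple lookup table from student name to grade.
grades = {record["Student"]: record["Grade"] for record in records}

print(grades)

shutil.rmtree(demo_dir)  # remove the scratch folder
```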

Writing csv data

In [52]:
# Student data as a nested list.
student_data = [["Student","Grade"],
                ["Susan","A"],
                ["Sean","B-"],
                ["Cody","A-"],
                ["Karen",'B+']]
In [53]:
# Write the rows with the .writerows() method
# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open("student_data_write.csv",mode='w',newline='') as file:
    csv_file = csv.writer(file)
    csv_file.writerows(student_data)
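
A simple way to check a written file is to read it back in. The sketch below is an illustration with made-up rows in a hypothetical scratch file; one name contains a comma on purpose, to show that csv.writer quotes such fields so that they survive the round trip.

```python
import csv
import os
import shutil
import tempfile

demo_dir = tempfile.mkdtemp()
path = os.path.join(demo_dir, "roundtrip_demo.csv")  # hypothetical scratch file

# Made-up rows; the second one has a comma inside a field.
rows = [["Student", "Grade"],
        ["Lee, Susan", "A"],
        ["Sean", "B-"]]

with open(path, mode="w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(rows[0])      # write a single row (the header)
    writer.writerows(rows[1:])    # write the remaining rows in one call

# The raw text shows the quoting added around the field with a comma.
with open(path, mode="rt", newline="", encoding="utf-8") as f:
    raw_text = f.read()

# Reading with csv.reader recovers the original nested list.
with open(path, mode="rt", newline="", encoding="utf-8") as f:
    rows_back = list(csv.reader(f))

print(repr(raw_text))
print(rows_back == rows)

shutil.rmtree(demo_dir)  # remove the scratch folder
```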