Comma Separated Values (CSV) Data Files

In this chapter you will learn how to write and read data to and from CSV files using Python. We will first review basic file output, and then move on to writing data in a CSV format that can be used by many other programs. After we learn to write data to CSV format, we will proceed to the slightly more involved topic of reading CSV files.

Basic File Output

Writing text to a file is a three step process. First, you must open a file for writing (using the “w” argument) with the open function. After you have created a file object, you can write to it. And of course, once you are finished writing text, you must close the file object.

1
2
3
4
5
myFileObj = open("filename.txt", "w")
myFileObj.write("This text is written to the file!")
myFileObj.write("Some more text to be written...")
myFileObj.write("\n") #Start a new line
myFileObj.close()

A few things to note about the code example above. First, the name of the variable used to refer to the file object is chosen by the user. In line 1 of this example it is specified to be myFileObj, but you could as easily use a different variable name such as myFile. If you used myFile, you would then use myFile.write(...) and myFile.close() when working with that file object. Second, unlike the print(...) function, the write(...) method DOES NOT add a newline character ( \n ) to the string it writes to the file. If you want the file to have newlines, you must manually add them as is done in line 4 of the example. Third, although we do not take advantage of it here, the write(...) method will return an integer representing the actual number of characters that were successfully written. Typically this is the length of the string that you gave it to write, unless something goes wrong!

CSV File Output

CSV stands for comma separated value(s), and is a simple text based file format used to store tabular data in a format that is easy for humans to read as well as being easy to load for programs. Here is an example CSV file that records the number of As Seen on TV products a factory produces each day.

Download asotv.csv

Date, ShamWow, Cami Lace, Instant Switch, Flowbee
Monday, 1232, 3221, 638, 893
Tuesday, 1532, 2832, 543, 789
Wednesday, 1132, 3148, 593, 827
Thursday, 1341, 2944, 601, 832
Friday, 1242, 1234, 621, 794

Note that each line is a data entry (data row), and in this example the first line is special, in that it has a text descriptor of each column. Also note that even though the columns are not lined up in the text CSV file, it is understood that the commas should be used to indicate column breaks, such that on Friday the factory produced 794 flowbees.

You can export data in CSV format simply by putting a comma between each data item, and placing a newline n character at the end of each record. Of course, the specific python code to accomplish this depends upon what internal format your data is stored in. Here is an example stored in a list:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
factoryOutput = [ ["Monday", 1232, 3221, 638, 893],
                  ["Tuesday", 1532, 2832, 543, 789],
                  ["Wednesday", 1132, 3148, 593, 827],
                  ["Thursday", 1341, 2944, 601, 832],
                  ["Friday", 1242, 1234, 621, 794]       ]

def writeOutput(filename, output):
   #Open the file
   ourFile = open(filename, "w")

   #First, write the 'special' header of column labels, followed by a newline
   ourFile.write("Date, ShamWow, Cami Lace, Instant Switch, Flowbee")
   ourFile.write("\n")

   #Second, iterate through the data elements (records) with a for loop
   for record in factoryOutput:
      for item in record:
         itemAsString = str(item)
         ourFile.write(itemAsString)
         ourFile.write(",")
      #End each full record with a newline.
      ourFile.write("\n")

   #Third, close the file.
   ourFile.close()


writeOutput("factoryOutput.csv", factoryOutput)

Note the lines of code that do the actual writing. They consist of a doubly nested for loop. Line 16, the first for record in factoryOutput loop goes through each day’s factory output, while the second for item in record loop (line 17) writes out each item. Note that because we can not directly write an integer, we have to convert each item to a string before writing it, including the string that represents the day. Line 18 does this conversion for us. It is safe to convert a string to a string, which has no real effect. If we failed to do this, python would give us an error such as the following as soon as we tried to write an integer.

File "factoryOutput2CSV.py", line 27, in <module>
   writeOutput("factoryOutput.csv", factoryOutput)
File "factoryOutput2CSV.py", line 19, in writeOutput
   ourFile.write(item)
TypeError: must be str, not int

Line 20 places a coma after each item. However, our output will NOT exactly match our example, as we will have a comma after EVERY data item, including the last!:

Date, ShamWow, Cami Lace, Instant Switch, Flowbee
Monday,1232,3221,638,893,
Tuesday,1532,2832,543,789,
Wednesday,1132,3148,593,827,
Thursday,1341,2944,601,832,
Friday,1242,1234,621,794,

Luckily for us, most programs that read CSV formated data will accept this, although some of them may have blank or null entries as the last item in each record, and some may complain that the first row (the header) does not have as many entries as the rest unless you also place an extra coma at the end of the header. It will be left as an exercise for the reader to write CSV files that omit the trailing comma.

Basic File Input

Opening a file for reading is very similar to writing, the only thing that changes is that you give the open(...) function a mode of “r” for reading.

aFile = open("fileName.txt","r")

Reading files is slightly more complicated than writing, because you have a choice of three different methods for reading from a file depending upon how you prefer to deal with the data. The methods are:

  • read() - This function will return a string of characters that may include every character in the file. If you give the Read() method an optional integer parameter, it will only return the first N characters, allowing you to read and deal with smaller sections of the file sequentially. When you are finished reading the file, it will return an empty string.

    1
    2
    3
    4
    5
    6
    myFile = open("asotv.csv","r")
    first10 = myFile.read(10)
    print("The first ten characters of the file are:", first10)
    rest = myFile.read()
    print("The rest of the file is:", rest)
    myFile.close()  #Closing the file when you are done is always good practice
    
  • readline() - This function will return a string of characters including the newline ( \n ) character at the end of the line. You call it repeatedly to read each line in the file. When you are finished reading the file, it will return an empty string. When reading CSV files, we typically either use readline or readlines().

    1
    2
    3
    4
    5
    aFile = open("asotv.csv", "r") #Note this time we used aFile instead of myFile
    firstLine = aFile.readline()
    secondLine = aFile.readline()
    print("First line is:", firstLine, "Second line is:", secondLine)
    aFile.close()
    
  • readlines() - This function will return a list containing all of the lines of text in the file. You can get the same behavior by calling ReadLine() multiple times and appending the strings it returns into a list. The only downside to ReadLines() is that it will attempt to read the entire file all at once, which is usually what you want, unless the file is of unmanageable size.

    f = open("asotv.csv", "r")
    allLines = f.readlines()
    f.close()
    print("File contents:", allLines)  #Prints the list of strings
    

As you can see, reading a text file is not difficult. However, we must do extra work to process the raw text into a nice list of data items. As an example, the variable allLines above references a list as follows:

['Date, ShamWow, Cami Lace, Instant Switch, Flowbee \n',
'Monday, 1232, 3221, 638, 893\n', 'Tuesday, 1532, 2832, 543, 789\n',
'Wednesday, 1132, 3148, 593, 827\n', 'Thursday, 1341, 2944, 601, 832\n',
'Friday, 1242, 1234, 621, 794\n', '\n']

You can see that the strings include a line of data each, and have trailing newline characters (\n) tacked on to the end. We even have one empty line at the very end which has nothing in it other than a newline character!

CSV File Input

Reading data from CSV files is not a difficult prospect if you break the problem down into small manageable pieces and solve each sub-problem individually. You need to do the following:

  1. Read the data from the file.
  2. Remove any blank lines.
  3. Split each line at commas, to separate each data item.

The following function will read in data from a file as a list of strings.

1
2
3
4
5
def readData(fileName):
   f = open(fileName, "r")
   data = f.readlines()
   f.close()
   return data

This function will process a list of strings, and remove any strings that have nothing but whitespace in them.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
#Data is a list of strings
def removeBlankLines(data):
   goodLines = []

   for thisLine in data:
       thisLine = thisLine.rstrip()
       if len(thisLine) != 0:
           goodLines.append(thisLine)


   return(goodLines)

This function will split lines up using a comma as the delimiter.

1
2
3
4
5
6
def splitLines(data):
   newLines = []
   for line in data:
       myList = line.split(",")
       newLines.append(myList)
   return(newLines)

Finally, you can use all of these functions together to get a list that itself contain one or more lists of data elements.

rawdata = readData("asotv.csv")
newData = removeBlankLines(rawdata)
csvData = splitLines(newData)
print(csvData)

This results in a data structure that looks like the following:

1
2
3
4
5
6
7
8
csvData = [
['Date', ' ShamWow', ' Cami Lace', ' Instant Switch', ' Flowbee'],
['Monday', ' 1232', ' 3221', ' 638', ' 893'],
['Tuesday', ' 1532', ' 2832', ' 543', ' 789'],
['Wednesday', ' 1132', ' 3148', ' 593', ' 827'],
['Thursday', ' 1341', ' 2944', ' 601', ' 832'],
['Friday', ' 1242', ' 1234', ' 621', ' 794']
]

Note a few things about this data structure:

  1. Every item is still a string, even the strings of digits that represent how many items were produced on each day.
  2. The first row of data contains the column headers.

We can use standard list indexing to access any element in this doubly nested list. For example, the number of Cami Lace’s produced on Thursday is accessed using:

csvData[4][2]

Note that the first index (4) is accessing the row, and the second index (2) is accessing the column, because we have stored our data in a row-major data structure. Also note that indexing starts from zero, so index 4 is actually the fifth “row” including the row of header data.

Reading CSV files using the CSV module

Although it is good for you to understand HOW to read CSV files, this operation is done so frequently that Python includes a module that will do the heavy lifting for you. The name of the module is “csv”. Here is how you would use a reader object from the csv module to do the same procedure:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
def readCSV(filename):
 import csv
 lines = []

 f = open(filename)
 csvReader = csv.reader( f, delimiter="," )

 for row in csvReader:
   lines.append(row)

 return lines


data = readCSV("asotv.csv")
print(data)

Note that we used a named parameter (delimiter) to specify what character our file used as it’s delimiter. Also note that the csv.reader constructor generated an object (referenced by the csvReader variable above) which is iterable. Each time you call the .__next()__ method on the object, it returns a new row from the file. [In the code above, the FOR loop does this for us. The FOR loop also handles the StopIteration exception that is raised when the data in the file runs out.] If we wanted to get the headers separately from the data, we could modify the above code as follows:

1
2
3
4
5
6
7
8
9
data = readCSV("asotv.csv")

#separate out the first row:
headers = data[0]
del data[0]  #remove the row we just copied out!

print(headers)
print("\n") #newline between the headers and the data.
print(data)

Real CSV Files

In our example asotv.csv file above, we were lucky in that the numbers were all formatted using only digits. What if we had formatted numbers such as 4,382? Then the comma in the numbers would have completely messed up our system of breaking apart lines at commas!

(Numbers such as $432,232.08 are very common in REAL CSV files...)

To handle this, most real programs that export data to CSV files actually “quote” each data using some OTHER character, such as a double-quote mark. For example, a line of data may be represented as follows:

“234,423.83”,”443,212.83”,”434,324.77”

Note that this data has commas inside the numbers, as well as commas separating each data item. Making our job harder is the edge conditions. Each line starts with a ” then data items are separated with a ”,” and finally the last data item ends with a single ”.

The use of double-quotes around each “column” of data allows the CSV module to parse this data. The only extra work we have to do is to specify the character that is used to “quote” the data when we create the csv.reader, as follows:

1
2
3
4
5
import csv
file = open("ourFileName.csv", "r")
csvReader = csv.reader( file,  delimiter=",", quotechar='"')
for row in csvReader:
  print(row)

Note how we use single quotes to make a string that contains a double-quote. Handling the edge conditions (starts, middle, end of the line) is done for us automatically by the csv module!

Instead of specifying the delimiter and quotechar every time you want to open a file, the CSV file module allows you to specify a “dialect” of CSV files, which is a way to easily specify a delimiter and quoting character. Common dialects are “excel” or “excel_tab”. If you do not specify a dialect, delimiter or quotechar, the csv module defaults to using the standard excel dialect.

You should read more about the csv module at the python.org documentaiton site.

Writing CSV files using the CSV module

In addition to reading CSV files, the csv.writer object makes it easy to write to CSV files! Instead of opening the file for reading, we open it for writing (“w”) and then create a csv.writer object. This object allows us to write a single row at a time, or to write multiple rows all at once. Due to a wierdness with how Python opens files on the windows platform, you need to tell the open method to not use newlines (the newline=”” bit on line 2 below). That newline named parameter is optional on linux and Mac platforms, but doesn’t hurt, so I suggest you use it regardless of what type of computer you are developing your code on, so that the code will work correctly if it is ran on a windows computer. If you leave it out, you would get extra blank lines between each line of data if you run the code on a windows computer.

1
2
3
4
5
6
import csv
file = open("outFile.csv", "w", newline="")
csvWriter = csv.writer( file )  #Defaults to the excel dialect
csvWriter.writerow( ['item1', 'item2','item3','4','5','6'] )
csvWriter.writerow( ['1stInSecondRow', "2", "22", str(3)] )
file.close()   #Required, or the data won't get flushed to the file!

Note that if you have all of the rows at once, you can write them all at the same time by using the “writerows” method instead of the “writerow” method:

1
2
3
4
5
6
import csv

file = open("outFile.csv", "w", newline="")
csvWriter = csv.writer( file )  #Defaults to the excel dialect
csvWriter.writerows( [ ['item1', '2','3'] , ['1stInSecondRow', "2",  str(3)]  ] )
file.close()   #Required, or the data won't get flushed to the file!

Note that we are giving the writerows method a list of lists.