Help support the author by donating or purchasing a copy of the book (not available yet)



Input, File & Text Processing

10.1 - What is file & text processing?

File processing is the process of creating, storing and accessing a file's contents. Up until now we've only been able to store data temporarily in our programs, in objects such as lists. In the real world, this wouldn't be much use. We need some way of saving data permanently. Until now we've been storing data in RAM, and if we reboot our system, RAM is cleared and we lose our data. To save data permanently, our programs need to store it on a hard disk. This is persistent storage, and the data in it will survive a system reboot or our program crashing.

When doing file processing we follow these steps: open the file, read from it or write to it, and then close it.

Up until now we've been using input() and print() for input and output (I/O). These are high-level interfaces to the underlying operating-system services and sometimes we need finer control, especially when we want to read and write files.

Python implements file objects as general interfaces to data streams. These file objects allow us to read and write to and from standard input and standard output (this is what input() and print() do and we'll get to standard input and output later). The file objects also allow us to read and write files.

Text processing, on the other hand, is the process of manipulating textual data. Text processing is a very common task in computing. Python has great support for text processing and we've even met some of the methods that Python provides to allow us to do this. These are split(), strip() and join() to name a few.

10.2 - The sys module

The sys module provides information about constants, functions and methods of the Python interpreter. A module is a file consisting of Python code. A module can define functions, classes and variables. To access a module you must import it into your code using the import keyword.

import sys

# YOUR CODE

Through the sys module we can access arguments that are passed to our script at the command line. Lots of Python scripts need access to these arguments. We get access to them via argv (sys.argv).

argv is shorthand for argument vector. This is a list containing the command-line arguments passed to the script. The first element in this list is the name of the script. The arguments for the script come after the name of the script. Let's look at how this works. Create a new Python file and save it as myscript.py and enter the following code:

import sys

print(sys.argv)

Then, at the command-line type the following, then Enter:

$ py myscript.py arg1 arg2

If you are on Linux or macOS, type: python3 myscript.py arg1 arg2

The $ at the beginning is the command prompt. Don't type it.

You will get the following output:

$ py myscript.py arg1 arg2
['myscript.py', 'arg1', 'arg2']

Let's use this in a more constructive way. Create a new file called calculator.py and paste in the following code:

import sys

total = 0

i = 1
while i < len(sys.argv):
    total += int(sys.argv[i])
    i += 1
print(total)

At the command line type the following, then Enter:

$ py calculator.py 5 3 6 
14

We can pass the numbers we want to add together to the script. This saves us having to ask the user for input each time we want new input. Remember though, argv is a list of strings, and that's why we had to convert the arguments to integers in the above code.
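The same sum can also be written with a for loop over a slice of the list. What follows is only a sketch: the function name sum_arguments is made up for illustration, and the hand-made list stands in for sys.argv so the code can be run without typing any arguments.

```python
def sum_arguments(argv):
    """Add up every argument after the script name."""
    total = 0
    # argv[1:] skips element 0, which is the name of the script.
    for arg in argv[1:]:
        total += int(arg)    # Arguments arrive as strings, so convert them.
    return total


# A hand-made list standing in for sys.argv.
print(sum_arguments(["calculator.py", "5", "3", "6"]))    # 14
```

In a real script we would call sum_arguments(sys.argv) after importing sys.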

The sys module also provides us with three standard file objects. These are: stdin, stdout, and stderr. These stand for standard input, standard output and standard error.

In the sections that follow we will look at each of them.

10.3 - Reading from standard input

Standard input is stream data (usually text) going into a program. In Python, we access standard input through the sys module which provides standard input (stdin) as a file object.

We have various methods available to us as to how we read from standard input.

Important: The following methods, unlike input() and print(), will never add or remove newline characters. We will have to handle newline characters ourselves.

The first is the read() method. It is the most basic of the three. read() will read the entire contents of the file and the entire contents will be assigned to a single string.

import sys

contents = sys.stdin.read()

print(contents)

Important: We can pass a file to our program using a command-line operator called the input redirection operator (<). When we use this operator we redirect input from our keyboard to the text file.

The syntax for this method is file.read() where file is a file object (sys.stdin here).

You should create a text file and fill it up with some text. I recommend you have about 10 lines of text. Each line can be a single word if you like.

To run the above code type the following at the command line. Make sure the text file and the Python script are in the same directory. Type the following, then Enter.

$ py myscript.py < input.txt
This is line one.
I'm line two.
And I'm line three

The redirection operator works the same on Linux, Windows and macOS.

This is the output we get. Yours will look different depending on what you have in your file. As you can see we've printed out that file you just created.

It reads the file in as a single string and the string looks like so:

"This is line one.\nI'm line two.\nAnd I'm line three"

The second method we have available to us is the readlines() method. This is a little more complicated as it stores each line as an element in a list.

import sys

contents = sys.stdin.readlines()

print(contents)

The syntax for this method is file.readlines() where file is a file object (sys.stdin here).

Now we run our script the same way as before:

$ py myscript.py < input.txt
['This is line one.\n', "I'm line two.\n", "And I'm line three"]

As you can see our text file is now stored in a list, with each line being an element in the list. Notice how the newline characters haven't been removed.

This method is good as we can now do some manipulation on each line of text more easily.
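As a small illustration of that kind of manipulation, here is a sketch that removes the trailing newline from every line. So that it can be run without redirection, it uses io.StringIO, an in-memory file object from the standard library, as a stand-in for sys.stdin. The names fake_stdin and cleaned are just for this example.

```python
import io

# io.StringIO builds an in-memory file object; here it stands in for sys.stdin.
fake_stdin = io.StringIO("This is line one.\nI'm line two.\nAnd I'm line three")

lines = fake_stdin.readlines()

cleaned = []
for line in lines:
    cleaned.append(line.rstrip("\n"))    # Remove only the trailing newline.

print(cleaned)    # ['This is line one.', "I'm line two.", "And I'm line three"]
```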

Rather than using the input redirection operator we can just read lines from the terminal.

You can run the above script again, this time omitting the redirection operator.

Important: With files we eventually reach the end of them. This is indicated by EOF (end of file). We don't need to worry about how this works as it's handled at a lower level. When we do not redirect standard input to read from a file, however, we do need to indicate EOF at the command line. On Windows this is done by pressing ctrl+z, then Enter. On Linux and macOS it is done by pressing ctrl+d. When you run the above script again and omit the redirect, you will be able to keep typing lines of text. When you are finished, press ctrl+z or ctrl+d.

$ py myscript.py
line one
line two
final line         # PRESS ENTER, THEN CTRL+Z OR CTRL+D TO INDICATE EOF.
['line one\n', 'line two\n', 'final line\n']

The third method available to us is the readline() method. This is arguably better than read() and readlines() as it only reads a single line from standard input. With the two previous methods, we could imagine a situation where a text file is really, really big and storing the entire contents of the file in memory might be a waste of valuable memory.

We can still read in many lines of input using this method, we'll just need a loop. Let's look at how that is done.

import sys

line = sys.stdin.readline()
while line:
    print(line.strip())     # Strip the newline character as print() will add one.
    line = sys.stdin.readline()

The syntax for this method is file.readline() where file is a file object (sys.stdin here).

Running this while redirecting standard input to a file yields:

$ py myscript.py < input.txt
This is line one.
I'm line two.
And I'm line three
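A file object can also be used directly in a for loop, which hands us one line per pass, much like calling readline() over and over. The sketch below uses io.StringIO as a stand-in for sys.stdin so it can be run as is. With real standard input we would loop over sys.stdin instead. The name fake_stdin is made up for this example.

```python
import io

# An in-memory file object standing in for sys.stdin.
fake_stdin = io.StringIO("This is line one.\nI'm line two.\nAnd I'm line three")

count = 0
for line in fake_stdin:          # One line per pass, newline still attached.
    print(line.rstrip("\n"))     # print() adds its own newline.
    count += 1

print(count)    # 3
```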

I want to take another look at the read() method. We can actually control how much of the input is read. If we redirect input to a file we can limit how much of the file we read at any given time. Similarly, if we don't redirect, we can limit how much of user input through the console we read. We do this by passing an integer as an argument to the method. This integer represents how many characters we want to read.

This is done as follows:

import sys

contents = sys.stdin.read(6)

print(contents)

Running the code and redirecting standard input to a text file would yield:

$ py myscript.py < input.txt
This i

As you can see we passed the integer 6 to the read() method, telling it to only read 6 characters.

Remember: Whitespace counts as a character and so does the newline character!
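To see the newline being counted, here is a short sketch. Again io.StringIO stands in for sys.stdin, and the sample text is invented for the example.

```python
import io

# An in-memory file object standing in for sys.stdin.
fake_stdin = io.StringIO("Hi\nthere")

first = fake_stdin.read(4)      # 'H', 'i', the newline, then 't'.
rest = fake_stdin.read()        # No argument: read whatever is left.

print(repr(first))    # 'Hi\nt'
print(repr(rest))     # 'here'
```

Notice that the second call carries on from where the first one stopped.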

Get used to these methods! They will be our primary way of taking input into our programs from now on!

10.4 - Writing to standard output

Standard output refers to the streams of data that are produced by programs. These standardized streams make it very easy to send output from these programs to a device's display monitor, printer, etc.

We write to standard output by using the write() method on the sys.stdout file object. The behaviour of write() has changed since Python 2.x. In Python 3, when we write to standard output we are returned the number of characters that were written.

Let's look at that in action in the Python REPL:

>>> import sys
>>> sys.stdout.write("Hello")
Hello5
>>>

We can see that we wrote "Hello" to standard output and we were returned the number of characters that were written (5). The 5 is on the same line because, unlike print(), write() does not add a newline character. We must append it ourselves, and it counts as a character too:

>>> import sys
>>> sys.stdout.write("Hello\n")
Hello
6
>>>
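In a script the returned number is not echoed for us the way it is in the REPL, but we can store it in a variable if we need it. A tiny sketch (the variable name count is just for illustration):

```python
import sys

count = sys.stdout.write("Hello\n")    # The newline counts as a character.
print(count)                           # 6
```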

We can also redirect the standard output stream to a file using the output redirection operator (>). Let's look at how this is done.

import sys

sys.stdout.write("Hello World!\n")

Then at the command-line run:

$ py myscript.py > output.txt

If you check the directory where your script is stored, you should now see a file called output.txt.

Open it and view its contents. It should contain what we just wrote to the output stream.

When we redirect standard output to a file, a file will be created if it doesn't exist already.

We also have the writelines() method available to us, which is not all that dissimilar to readlines() except that it writes a list of strings instead of reading one. Like write(), it does not add newline characters for us.

import sys

my_lines = ["Line one\n", "Line two"]
sys.stdout.writelines(my_lines)

Then, at the command-line, run:

$ py myscript.py > output.txt

If you open output.txt you'll find that it contains the strings in the list above on two separate lines.
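The two lines only end up separate because the first string ends in a newline character. The sketch below shows what happens when one is missing. It writes to an io.StringIO object, an in-memory file object standing in for sys.stdout, so that we can inspect exactly what was written. The name buffer is made up for this example.

```python
import io

buffer = io.StringIO()                # Stands in for sys.stdout.
buffer.writelines(["one", "two\n", "three\n"])

# "one" had no newline, so it runs straight into "two".
print(repr(buffer.getvalue()))        # 'onetwo\nthree\n'
```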

10.5 - Standard error

There is still one standard stream we have not covered yet and that's standard error (stderr). Again, it is a file object in Python so we can write to it. It is typical for programs to write error messages and diagnostics to standard error. Typically, standard error is output to the terminal which started the program.

Having a separate standard error stream solves the semipredicate problem by allowing output and errors to be distinguished. We can redirect standard error to a different destination from standard output (an error log file, for example).
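Here is a minimal sketch of a program that uses both streams. The function name report and the messages are invented for this example.

```python
import sys


def report(total, problems):
    """Write the result to standard output and any warnings to standard error."""
    sys.stdout.write("Total: {}\n".format(total))
    for problem in problems:
        sys.stderr.write("Warning: {}\n".format(problem))


report(14, ["could not convert 'abc'"])
```

If we run this with $ py myscript.py > output.txt, only the "Total" line ends up in output.txt. The warning still appears in the terminal because we only redirected standard output. In most shells, 2> errors.txt redirects standard error in the same way.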

That's all I want to say on standard error. We won't have to worry about it again in this book. It's just good to know about all the standard data streams we have access to.

10.6 - Opening & reading files

In this section we're going to look at opening and reading from files that are stored in permanent storage. Although the standard data streams from the previous sections are files, they were stored in RAM and were already open.

When reading a file, we initialize a file object that acts as a link from the program to the file stored on the disk.

To open a file we use the open() function. The open() function takes two arguments: The filename and the mode in which the file is to be opened. The syntax is: open(<filename>, <mode>)

There are various modes in which we can open a file, but we're only concerned with four of them here.

The four modes we'll be dealing with are:

  1. 'r' - This mode opens a file for reading only.
  2. 'w' - This mode opens a file for writing. If the file doesn't exist, it creates a new file with the specified file name. If the file does exist and we write to it, anything that was previously in the file is overwritten.
  3. 'a' - This opens a file in append mode. If the file doesn't exist, it creates a new file with the specified file name. If it does exist, we add to the file rather than overwriting it.
  4. 'x' - This opens a file for exclusive creation. If the file already exists, it fails and raises a FileExistsError.
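The difference between 'w' and 'a' is easiest to see by trying them. The sketch below does so in a throwaway directory created with the tempfile module, so no real files are touched. The file name modes_demo.txt and the variable names are made up for this example. It also uses read(), write() and close() on the file, which are covered in the rest of this chapter.

```python
import os
import tempfile

# Work in a throwaway directory so no real files are touched.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "modes_demo.txt")

f = open(path, 'w')     # 'w' creates the file (or empties an existing one).
f.write("first\n")
f.close()

f = open(path, 'a')     # 'a' keeps what is there and adds to the end.
f.write("second\n")
f.close()

f = open(path, 'r')
after_append = f.read()
f.close()

f = open(path, 'w')     # Opening with 'w' again wipes the old contents.
f.write("third\n")
f.close()

f = open(path, 'r')
after_rewrite = f.read()
f.close()

print(repr(after_append))     # 'first\nsecond\n'
print(repr(after_rewrite))    # 'third\n'
```

Opening the same path with 'x' at this point would raise a FileExistsError, since the file already exists.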

Once the file is open, we can read its contents using the methods from section 10.3: read(), readlines() and readline().

Let's look at an example of opening a file from the disk and reading in its contents:

my_file = open('input.txt', 'r')
content = my_file.readlines()
print(content)

Make sure you have a file called "input.txt" saved in the same directory as your script and for demo purposes make sure it has a few lines of text in it.

At the command-line, run:

$ py myscript.py
['This is line one.\n', "I'm line two.\n", "And I'm line three"]

You can now see that the contents of your file have been read in successfully. They are in list form as we used the readlines() method.

10.7 - Writing to files & closing files

In the previous section we looked at opening files. Now let's look at how to write to them and close them.

As you may have guessed, we write to files using the write() method we met in previous sections. To write to a file though, it must be opened in write mode.

Let's look at how to write to a file:

my_file = open('output.txt', 'w')
my_file.write("I am being written to a file\n")

Run this script:

$ py myscript.py

Now open the file called output.txt. Don't worry if it didn't exist, one will be created. You should now see the line "I am being written to a file" contained in that file.

Now we need to close the file.

When a file is opened, the operating system allocates memory to track that file's state and the operating system cannot reallocate that memory until the file is closed. If we don't close open files, then we are using up memory unnecessarily.

When your program exits, the file is closed automatically but maybe we don't want to exit straight after we are finished reading and writing to a file.

We close files by using the close() method. This is demonstrated below:

my_file = open("output.txt", "w")
my_file.write("Hello\n")
my_file.close()

The file has now been closed and memory can be reallocated for other uses.

Having to open files and remember to close them can become a bit of a pain. We can use something called the with statement to handle this for us.

The with statement clarifies code that previously would use try...finally blocks to ensure that clean-up code is executed. Don't worry about try and finally, we'll get to them a little later. For now, we can use with to handle the closing of files for us.

with open("data.txt", "w") as my_file:
    my_file.write("Some text here")

# THE REST OF YOUR PROGRAM HERE

So what is going on here? We can see the with statement but we also have the as keyword. The as keyword is used to assign the returned object to a new identifier. In this case, we open a file and are returned a file object. The file object is then assigned to the variable my_file using as.

We then do whatever we need to do. In this case we write to the file. When we exit the with block the file is closed. This is how most people open and close files.
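We can check this for ourselves. Every file object has a closed attribute that tells us whether the file has been closed. The sketch below writes to a throwaway location created with the tempfile module. The path is made up for this example.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "data.txt")    # Throwaway location.

with open(path, "w") as my_file:
    my_file.write("Some text here")
    print(my_file.closed)    # False: we are still inside the with block.

print(my_file.closed)        # True: the with block closed the file for us.
```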

Another use of the as keyword may clarify things. Suppose we were importing the sys module and we didn't like how long the name "sys" was (I know this is a silly example but some modules have long names). We could import it like this:

import sys as s
content = s.stdin.read()

Now, every time we want to refer to the sys module we call it by its alias, s.

10.8 - Putting it all together

We've learned a lot in this chapter so let's look at an example that puts everything together.

Let's assume we have a text file that contains many lines. Each line has a student's name followed by the mark they received for a specific class. Here is a sample from that text file:

Liam 84
Noah 33
William 43
James 37
Logan 59

What we want to do is process this file and output to a new file whether or not each student passed or failed the class.

The output should look like this:

Liam PASS
Noah FAIL
William PASS
James FAIL
Logan PASS

We want to pass the input filename and output filename as arguments to the script and we take a fail to be a mark less than 40.

Here's how we might do that:

import sys

src = sys.argv[1]     # The input source file
dst = sys.argv[2]     # The file to output to

with open(src, 'r') as fin, open(dst, 'a') as fout:      # Open in append mode

    student = fin.readline().strip()     # Strip the newline character
    while student:
        
        student_data = student.split()   # ['Liam', '84'] for example
        
        name = student_data[0]
        mark = int(student_data[1])

        if mark < 40:
            grade = "FAIL"
        else:
            grade = "PASS"

        fout.write('{:s} {:s}\n'.format(name, grade))      # Write to the output file

        student = fin.readline().strip()      # Get the next student

Notice how we can open multiple files at the same time!

We run this as follows:

$ py grades.py input.txt output.txt

Your output should now be:

Liam PASS
Noah FAIL
William PASS
James FAIL
Logan PASS
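If you want to try the pass/fail logic without creating any files, the body of the loop can be pulled out into a small function. This is only a sketch of one way to organise it. The function name grade_line is made up, and the "Sam 40" line is an invented entry that shows a mark of exactly 40 counts as a pass.

```python
def grade_line(line):
    """Turn a line such as 'Liam 84' into 'Liam PASS'."""
    student_data = line.split()      # ['Liam', '84'] for example
    name = student_data[0]
    mark = int(student_data[1])

    if mark < 40:
        grade = "FAIL"
    else:
        grade = "PASS"

    return '{:s} {:s}'.format(name, grade)


print(grade_line("Liam 84\n"))    # Liam PASS
print(grade_line("Noah 33\n"))    # Noah FAIL
print(grade_line("Sam 40\n"))     # Sam PASS
```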

File processing is a common task, as I've said, so get used to it. You'll probably be doing it a lot!

10.9 - Exercises

Important Note: These questions are tougher than previous ones and get considerably more difficult as they go on, especially questions 4 and 5. Don't be discouraged, however. Stick at them. If you manage to complete all 5 questions, you're on your way to becoming a great developer and problem solver!

Question 1

Write a program that multiplies an arbitrary number of command-line arguments together. The arguments will be numbers (convert them first).

Your program should be run as follows:

$ py calc.py 5 99 32 ....

To test that your solution is correct, you should get the following output:

$ py calc.py 8 8 9 3 2
3456

Question 2

Write a program that reads in lines from standard input (no redirection) and outputs them to standard output (no redirection).

Question 3

Write a program that reads a text file and outputs (using print() is fine) how many words are contained within that file. The name of the text file should be passed as a command-line argument to your script.

Your program shouldn't consider blank lines as words and you need not worry about punctuation. For example you can consider "trust," to be a single word.

Your program should be run as follows:

$ py num_words.py input.txt

To test that your solution is correct, use the Second Inaugural Address of Abraham Lincoln as your input text and your program should output: 701

Question 4

Write a program that reads in the contents of a file. Each line will contain a word or phrase. Your program should output (using print() is fine) whether or not each line is a palindrome, "True" or "False". The name of the text file should be passed as a command-line argument to your script.

A palindrome is a word or phrase that is the same backwards as it is forwards. For example, "racecar" is a palindrome while "AddE" is not.

Your program should not be case sensitive: "Racecar" should still be considered a palindrome. Spaces should NOT affect your program either. Consider removing white space characters.

Your program should be run as follows:

$ py palindrome.py input.txt

To test that your solution is correct, use the following as your input text:

racecar
AddE
HmllpH
Was it a car or a cat I saw
Hannah
T can arise in context where language is played wit
Able was I ere I saw Elba
Doc note I dissent A fast never prevents a fatness I diet on cod

Using the above input, your output should be:

True
False
False
True
True
False
True
True

Question 5

** THIS QUESTION IS HARD **

Similar to question 3, write a program that reads a text file. This time your program should output how many unique words are contained within the file.

This time you do need to care about punctuation and your solution should not be case sensitive. For example "trust" and "trust," and "Trust" should be considered the same word, therefore only one occurrence of that word should be recorded as unique and if you were to come across "trust" again, then don't record it.

Your program should be run as follows:

$ py unique_words.py input.txt

To test if your solution is correct, use the Second Inaugural Address of Abraham Lincoln as your input text and your program should output: 343

Hint 1: Take a look at the string module, in particular string.punctuation. If you import this, you may be able to use it to your advantage. Import it as follows: from string import punctuation. Figure out how string.punctuation works.

Hint 2: The solution to this exercise may make use of some string methods that we looked at back in the strings chapter.

Hint 3: Making use of a second list might be a good idea!


