UNIX text processing tools

by Allison Parrish

What is the command-line?

The command-line is a way for you to communicate with your computer.

Imagine that you needed some information from a distant library, in a city where a friend of yours lives. Your friend is willing to help, but knows nothing about the subject matter of the information you want—they can get to the library, but from there on out you’re going to have to tell them what to do. Your friend goes to the library, and you start a telephone conversation with each other.

You might start asking your friend questions in this situation, like:

  • What part of the library are you in?
  • Give me a list of some of the books you see there.

Based on that information, you might then ask your friend to start doing things for you, like:

  • Take from the shelf the book with the title “Cheese: A Cultural History”.
  • Read for me the first several lines of the book.

The command-line is kind of like this scenario, except your friend on the other end of the line is the computer. And you’re asking it questions not about books in a library, but files on the computer.

Another difference is that your friend in the library can understand human language, and (as a human) is clever enough to figure out your intent, even if you use ambiguous, misleading, or sarcastic language. A computer, on the other hand, can’t understand human language. You have to communicate with it through a more limited language of pre-programmed verbs and nouns, following very strict syntax.

But why? Surely we’ve advanced past such barbarities.

This style of interaction with a computer was invented soon after the invention of computers themselves. It’s a very simple way for a programmer to create an interface to the computer’s functionality, and an efficient way for human operators to unambiguously communicate their intent about what they want the computer to do.

Contemporary “graphic” user interfaces (GUIs) have existed in some form for a while. An early example of the GUI can be found in Doug Engelbart’s so-called “Mother of All Demos”, presented in 1968. The Xerox Alto project, developed in the 1970s, established many conventions in GUIs that we still use today, and served as the inspiration for the Apple Macintosh.

But the command-line has a number of advantages over the GUI. For example, this command-line command:

$ cp file1.html animals/feline.html

… takes a file called file1.html and makes a copy called feline.html in a folder called animals. (cp is the UNIX command for “copy.”) For an experienced user, performing this operation on the command-line can be much faster than performing the tasks necessary to do it in the GUI (which might involve opening several “windows,” dragging an icon with the mouse, performing right clicks, etc.).

The command-line interface also easily allows for multiple actions to be combined into a single action, or for one program to use another program’s output as input. Here’s another example command-line command:

$ cut -d ',' -f 2-3 data.txt | sort | uniq | grep 'cheese'

This command extracts the second and third fields of a comma-separated values file, sorts the values in alphabetical order, eliminates all of the duplicate lines, and then filters the result to include only lines that contain the string cheese. In order to perform this same task in a GUI, you’d need to either cut-and-paste your data between different programs that accomplished the individual tasks (one program to select parts of the data, another to sort it), or you’d have to find a single program that supported all of the desired features.
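If you’d like to see that pipeline run without hunting for a data file, here’s a minimal sketch. The contents of data.txt are made up for illustration, and the file is written to a throwaway directory. The $( ... ) wrapper just captures the pipeline’s output in a variable (result, a name chosen here) so it can be printed.

```shell
# Create a throwaway directory and a small comma-separated file (made-up data).
scratch=$(mktemp -d)
printf '%s\n' \
  '1,cheese,brie' \
  '2,bread,rye' \
  '3,cheese,gouda' \
  '4,cheese,brie' \
  '5,wine,rioja' >"$scratch/data.txt"

# Keep fields 2 and 3, sort the lines, drop duplicates, keep lines with "cheese".
result=$(cut -d ',' -f 2-3 "$scratch/data.txt" | sort | uniq | grep 'cheese')
printf '%s\n' "$result"

# Clean up the throwaway directory.
rm -r "$scratch"
```

With this sample data, two lines survive: cheese,brie and cheese,gouda. The repeated cheese,brie line is collapsed by uniq, which only removes duplicates that sit next to each other; that’s why sort comes first.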

The UNIX command-line

Nearly all operating systems provide a command-line interface of some sort or other. (I cut my teeth on the MS-DOS command-line, an analog of which is still available on most Windows machines as cmd.exe.) When most people think of the command-line in a contemporary context, they’re thinking of the UNIX command-line.

SIDEBAR: ‘Wait, what’s an “operating system”?’ I hear you ask. Good question! An operating system is the software that runs on your computer that provides the basic functionality necessary for other programs to function—everything from interfaces to your computer’s components, like its hard drive or peripherals (mouse, printer, etc.) to things that you see as a user, like the user interface. You probably use multiple operating systems throughout the day: Windows or OS X on your computer, Android or iOS on your mobile phone.

UNIX is a family of operating systems that originated in the 1970s and is still widely used today, in particular in the form of a clone/derivative called Linux. OS X is itself a derivative of UNIX (with a fancy proprietary GUI).

When UNIX was first being developed, and subsequently in its history, the programmers on the project came up with a series of command-line tools to help them accomplish tasks and solve common problems. It turns out that programmers, like other kinds of writers, deal with text a LOT, and so many of the tools they developed deal with text: filtering text, sorting text, modifying text. Over time, many other programmers have contributed to these tools, adding functionality and fixing bugs. For my money, they’re some of the most useful things that writers, researchers and computer users in general can learn.

You can also use these tools creatively. So we’re going to learn how to use them.

How do computers think about text?

Text can be divided into any number of different, overlapping units (document, page, section, subsection, chapter, clause, sentence, ascender, descender, act, stanza, syllable, foot, etc.) but only some of these are easy for computers to work with. (It’s harder than you think to teach a computer what a “sentence” is, for example.)

The two most obvious units of text in a computer are:

  • the character, i.e., the byte (or series of bytes) that represents a single element of written language (e.g., A through Z in English, any one of many glyphs in Chinese, etc.)
  • the file, i.e., an ordered collection of characters

Somewhere in between these two is the line, a formal unit of text that has been part of written language from the beginning. (Here’s an example of Cuneiform, an ancient writing system, written with lines.) The line arises in written text because writing transcribes speech, which is a one-dimensional medium, onto two-dimensional surfaces (paper, clay, stone, etc.). Line breaks in text are, fundamentally, a way of using up all of the space allotted on a surface.

But line breaks also serve syntactic, semantic, and metrical functions, as in poetry:

Rose, harsh rose, 
marred and with stint of petals, 
meagre flower, thin, 
spare of leaf,

more precious 
than a wet rose 
single on a stem -- 
you are caught in the drift.

Stunted, with small leaf, 
you are flung on the sand, 
you are lifted 
in the crisp sand 
that drives in the wind.

Can the spice-rose 
drip such acrid fragrance 
hardened in a leaf?

In computer text, the line is often used as a “record marker.” This is how a text file can be used as a rudimentary database, with one “record” per line. (For example, here are some NBA stats, written in plain text format, with one line per player.)

Perhaps because of these parallelisms (text layout/poetic structure/database structure), many programs that operate on text use the line as their fundamental unit—especially those in UNIX (coming right up). The programs that we write in this class will do the same.

Getting started

If you’re using OS X or Linux, then you’re good to go! You need to launch a “terminal emulator” application. On OS X, a terminal emulator comes pre-installed on your computer. Go to Applications > Utilities on your hard drive and select “Terminal” (or you can just search for “Terminal” with Spotlight; it’ll most likely be the first result). If you’re using Linux, you may have to consult the instructions for your particular distribution. But somewhere on your computer there should be a prominently displayed application called “Terminal” (or something similar). If you click on it and a prompt ending with $ appears, you’re probably on the right track.

Windows doesn’t come pre-installed with UNIX command-line tools. You’ll have to install them yourself. I suggest the “GNU on Windows” package, which installs a number of helpful UNIX programs and utilities. Download it here.

The program you use on Windows to access the command line is called “Command Prompt.” You can access this program from the Start menu, under “All Programs > Accessories > Command Prompt.” The command prompt looks different on Windows from how it looks on UNIX-ish machines; for one, the prompt will end in > instead of $. But everything else in this tutorial should look and work fine.

When you’ve successfully reached the command line (another line!), you should see something like this:

[aparrish@ip-172-30-0-159 ~]$

This is the “prompt” (because it “prompts” you to do something). It’s telling you your username, the server you’re logged into, and the current directory. More on that later.

Keystrokes you should know

The keys you type on the command-line generally do what you think they will: they print the character you typed to the screen. The command-line also has a number of special keystrokes that have particular meanings. Two are important to know from the very beginning.

Ctrl+D signals to a program that is waiting for you to type in something that you are done typing stuff in. For example, the sort command, when run on its own, will wait for you to type in the lines that you want to sort. Hit Ctrl+D to tell sort that you’ve entered your last line. (This is Ctrl+Z on Windows.)

Ctrl+C signals to a program that you want it to stop doing whatever it’s doing immediately, even if it hasn’t yet completed its task. If, for example, you accidentally used the wrong file in an operation and you want the operation to stop (because it’s printing out too many lines, or the wrong lines), hit Ctrl+C.

Summed up:

  • Ctrl+D: “I’m done typing things in. kthxbye.” (This is Ctrl+Z on Windows.)
  • Ctrl+C: “You’re doing something I don’t like. Please stop.”

Notably, Ctrl+D is also used to signal to the command-line that you’re done entering commands. At the prompt, hit Ctrl+D to log out. (You can accomplish the same thing by typing exit and hitting return.)

Your first UNIX commands

First off, we’re going to create a directory, so that you can find it later and you don’t risk overwriting something:

$ mkdir workshop 
$ cd workshop 

(Don’t type the $! That’s just there to indicate that those commands should be typed at the command line.)

The mkdir command means “make directory”; “directory” is just UNIX speak for “folder.” When you’re using the command line, there’s one directory on your machine that is considered your “current” directory, i.e., the directory you’re doing stuff in. The cd (“change directory”) command makes the directory you give it (workshop in this case) the current directory.
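Here’s a small sketch of the same two commands, followed by pwd (which prints the current directory) to confirm where you ended up. It starts in a throwaway directory made with mktemp -d so it can’t collide with anything you already have; scratch and current are just names picked for this example.

```shell
# Start somewhere harmless: a brand-new temporary directory.
scratch=$(mktemp -d)
cd "$scratch"

mkdir workshop    # create a new directory named "workshop"
cd workshop       # make it the current directory

# pwd prints the current directory; the path should now end in /workshop.
# (The temporary directory is left in place so you can look around in it.)
current=$(pwd)
printf '%s\n' "$current"
```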

Example text files

For the purposes of this chapter, it will be helpful to have some plain text files readily available to you. Download this zip file and extract it somewhere on your computer. Copy the files into the directory that you just made and return to the command line. If you type ls, you should see a list of the files that you just downloaded.

Kinds of commands

There are (broadly) two kinds of commands in UNIX: commands that work on lines of input/output, and commands that operate on files and directories. The mkdir and cd commands are examples of the latter. We’re primarily concerned with the former. Let’s start with cat:

$ cat

(Make sure to hit “return” after you type cat.) Now type. After you enter a line, cat will print out the same line. It’s the simplest text filter possible (one rule: let everything through).

When you’re done with cat, press Ctrl+D. Let’s try something more interesting, like grep:

$ grep foo

Now type some lines of text. Try typing, for example, I like drink and then I like food. The grep command only prints out lines that “match” the string of characters that follow the command (foo, in this case). Let’s try it again, this time with a different “pattern”:

$ grep you

If we cut and paste the poem above (“Sea Rose”) into the terminal application the resulting output would look like this:

you are caught in the drift.
you are flung on the sand,
you are lifted
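You don’t have to type or paste the input by hand every time. As a sketch, printf can stand in for the keyboard: each quoted argument below becomes one line of input, and the | character hands those lines to grep. The variable name matches is just a convenience for this example.

```shell
# printf '%s\n' prints each of its arguments on a line of its own.
# grep foo then keeps only the lines that contain the string "foo".
matches=$(printf '%s\n' 'I like drink' 'I like food' 'no match here' | grep foo)
printf '%s\n' "$matches"
```

Only I like food comes back, since it’s the only line that contains foo.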

The commands head and tail print out a certain number of lines from the beginning of a file and the end of a file, respectively. If you type in the following:

$ tail -3

… and then paste in the poem above, you’d get:

Can the spice-rose
drip such acrid fragrance
hardened in a leaf?
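A quick sketch of both commands on five made-up lines. It uses the spelled-out -n 2 and -n 3 forms, which mean the same thing as -2 and -3; input, first and last are names chosen for this example.

```shell
# Five short lines of sample input, reused for both commands.
input=$(printf '%s\n' one two three four five)

first=$(printf '%s\n' "$input" | head -n 2)   # the first two lines
last=$(printf '%s\n' "$input" | tail -n 3)    # the last three lines

printf 'head:\n%s\n' "$first"
printf 'tail:\n%s\n' "$last"
```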

Structure of UNIX commands

UNIX commands generally follow this structure:

name_of_command [options] arguments

(The “[options]” part of that schema is usually one or more characters preceded by hyphens. The -3 in tail -3 tells it to print the last three lines; grep takes an option, -i, which tells it to be case insensitive.)
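To see an option change a command’s behavior, here’s a sketch that runs grep twice on the same three sample lines (made up for this example), once without and once with -i.

```shell
input=$(printf '%s\n' 'than a wet rose' 'Rose of the sea' 'more precious')

# Without options, grep is case sensitive: "Rose" does not match "rose".
strict=$(printf '%s\n' "$input" | grep rose)

# With -i, upper and lower case are treated as the same.
relaxed=$(printf '%s\n' "$input" | grep -i rose)

printf 'without -i:\n%s\n' "$strict"
printf 'with -i:\n%s\n' "$relaxed"
```

The first run finds one line; the second finds two, because -i lets Rose count as a match for rose.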

You can think of UNIX commands like commands in English, but with a funny syntax: “Fetch thoroughly my slippers!”

You can figure out which options and arguments a command supports by typing man name_of_command at the command line.

Sorting and piping

The sort command takes every line of input and prints them back out, in alphabetical order. Try:

$ sort

… paste in the poem, and hit Ctrl+D. You’ll get something like:

Can the spice-rose 
drip such acrid fragrance 
hardened in a leaf?
in the crisp sand 
marred and with stint of petals, 
meagre flower, thin, 
more precious 
Rose, harsh rose, 
SEA ROSE
single on a stem -- 
spare of leaf,
Stunted, with small leaf, 
than a wet rose 
that drives in the wind.
you are caught in the drift.
you are flung on the sand, 
you are lifted 

(Why do you think there are so many blank lines at the beginning?)

So far, we’ve just been sending these commands input (by typing, or cutting and pasting), then letting the output be printed back to the screen. UNIX provides a means by which we can send the output of one program as the input of another program. We do this using the pipe character (| … usually shift+backslash). For example:

$ grep leaf | sort

… takes lines from input, displays only those that contain the string “leaf,” and then passes them to sort, which displays those lines in alphabetical order. The output from the poem:

hardened in a leaf?
spare of leaf,
Stunted, with small leaf,

cut

The cut command breaks up a line of text into its constituent parts. Let’s say we had a text file full of data, where each line contained multiple “fields” separated by commas. (This is a “comma-separated value” file, a common way of exporting data from a spreadsheet program like Excel—especially if you want to share that data with someone who doesn’t have Excel.) Here’s what the data looks like:

Geraldine,New York,welding
Roberto,Tennessee,birdwatching
Dana,Wyoming,poetry
Priya,Maine,rock climbing

The cut command allows us to easily process this text and “select” only particular items from each line. Here’s how, for example, we could print out just the “state” from each line:

$ cut -d , -f 2

Run that command, then cut-and-paste in the data from above. Here’s the output you’d get:

New York
Tennessee
Wyoming
Maine

The cut command takes two options, both of which themselves have parameters. (This is confusing, but stick with me here.) The -d option is followed by the “delimiter” string (i.e., what you want to split the line on—in the example above, the comma); the -f option is followed by which field you want.
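As a sketch of how the two options work together, this version feeds the same four lines to cut with printf and asks for the third field instead of the second. The variable name hobbies is just a label for this example.

```shell
# -d , splits each line on commas; -f 3 keeps only the third piece.
hobbies=$(printf '%s\n' \
  'Geraldine,New York,welding' \
  'Roberto,Tennessee,birdwatching' \
  'Dana,Wyoming,poetry' \
  'Priya,Maine,rock climbing' | cut -d , -f 3)
printf '%s\n' "$hobbies"
```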

Words in a line of text also have a “delimiter” between them—a space character. So we can use cut to transform some text by selecting only, say, the first word of each line. For example, try this command:

$ cut -d ' ' -f 1

And paste in ‘Sea Rose’. The output:

SEA

Rose,
marred
meagre
spare

more
than
single
you

Stunted,
you
you
in
that

Can
drip
hardened

SIDEBAR: What’s with the ' '? Why are those quotes just hanging out like that? That’s a good question! It turns out that the space character has a special meaning on the UNIX command-line: you use one or more space characters to separate commands and parameters from each other. If we want to give a command a literal space character as a parameter, we need some way to set that space character apart from its “normal” use. We do this by putting the character in quotes (' '). Quoting is extremely important to computer programming, and this isn’t the last time you’ll see it. Not by a long shot!

You can use the -f option to specify a range of values, or to give a comma-separated list of values to include. In our sample text files, there’s a CSV file called NBA_2015_games.csv that contains one line for every NBA game played in 2015. The names of the teams that competed in each game are in the third and fifth fields of the file. To get just those fields, use the cut command like so:

$ cut -d , -f 3,5 <NBA_2015_games.csv

The first few lines of the output will look like this:

Houston Rockets,Los Angeles Lakers
Orlando Magic,New Orleans Pelicans
Dallas Mavericks,San Antonio Spurs
Brooklyn Nets,Boston Celtics
Milwaukee Bucks,Charlotte Hornets
Detroit Pistons,Denver Nuggets
Philadelphia 76ers,Indiana Pacers
Minnesota Timberwolves,Memphis Grizzlies
Washington Wizards,Miami Heat
Chicago Bulls,New York Knicks
Los Angeles Lakers,Phoenix Suns
Oklahoma City Thunder,Portland Trail Blazers
Golden State Warriors,Sacramento Kings
Atlanta Hawks,Toronto Raptors
Houston Rockets,Utah Jazz
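If you don’t have the NBA file handy, here’s the same comma-separated-list idea sketched on the small names-and-hobbies data from earlier in this section: -f 1,3 keeps the first and third fields and drops the one in between. The variable name picked is made up for the example.

```shell
# -f 1,3 is a list of field numbers: keep field 1 and field 3, skip field 2.
picked=$(printf '%s\n' \
  'Geraldine,New York,welding' \
  'Roberto,Tennessee,birdwatching' \
  'Dana,Wyoming,poetry' \
  'Priya,Maine,rock climbing' | cut -d , -f 1,3)
printf '%s\n' "$picked"
```

Notice that cut puts the delimiter back between the fields it keeps, so the first line comes out as Geraldine,welding.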

Exercise: Use man cut to find out how to use “ranges” as a parameter to the -f option. Then use cut to print out, for each line of “Sea Rose,” the second and third words on the line.

tr

The tr command “translates” a set of characters in the original line to another set of characters. The source character set is the first parameter, and the second parameter is the characters you want them to be translated to. For example:

$ tr aeiou eioua
hello there, how are you?
hillu thiri, huw eri yua?

You can specify a range of characters with a hyphen:

$ tr a-z A-Z
hello there, how are you?
HELLO THERE, HOW ARE YOU?
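One more use of tr that comes in handy, sketched here as an addition to the examples above: translating every space into a newline, which puts each word on its own line. The '\n' is tr’s way of writing a newline character, and words is just a name chosen for this example.

```shell
# Replace each space with a newline so every word lands on its own line.
words=$(echo 'you are caught in the drift.' | tr ' ' '\n')
printf '%s\n' "$words"
```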

Multiple pipes

Of course, you can include more than one command in a “pipeline”:

$ sort | tail -6 | tr aeiou e

… which, if you send it our venerable poem, outputs the following:

Stented, weth smell leef,
then e wet rese
thet dreves en the wend.
yee ere ceeght en the dreft.
yee ere fleng en the send,
yee ere lefted

What happened? The input went to sort, which sorted the lines in alphabetical order. Then tail -6 grabbed only the last six lines of sort’s output and passed them along to tr, which, as the output shows, replaced each lowercase vowel with e. (You can build pipelines of any length using this technique.)
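Here’s another pipeline, as a sketch, that chains three of the commands covered so far: grep picks the lines containing you, cut keeps the third word of each, and tr upper-cases the result. The input is four lines of the poem supplied with printf; shouted is a name made up for this example.

```shell
# Stage 1: grep you        -> keep only lines containing "you"
# Stage 2: cut -d ' ' -f 3 -> keep the third space-separated word
# Stage 3: tr a-z A-Z      -> convert lowercase letters to uppercase
shouted=$(printf '%s\n' \
  'you are caught in the drift.' \
  'Stunted, with small leaf,' \
  'you are flung on the sand,' \
  'you are lifted' | grep you | cut -d ' ' -f 3 | tr a-z A-Z)
printf '%s\n' "$shouted"
```

The result is three lines: CAUGHT, FLUNG and LIFTED.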

Using files (“redirection”)

So far, we’ve been building “programs” that can only read from the keyboard (or from cut-and-paste) and can only send their output to the screen. What if we want to read from an existing file, and then output to a file?

No complicated code is needed. UNIX provides a method for us. It’s called “redirection.” Here’s how it works:

$ sort <sea_rose.txt

The < character means “instead of taking input from the keyboard, take input from this file.” Likewise:

$ grep were >some_file.txt

The > means “instead of sending your output to the screen, send it to this file.” You can use them both at the same time:

$ grep were <sea_rose.txt >some_file.txt

… in which case some_file.txt will end up with every line from sea_rose.txt that contains the string “were.” (If the output file doesn’t already exist, it will be created. If it does already exist, it will be overwritten, so be careful!)
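A minimal sketch of both kinds of redirection, including the overwriting behavior. Everything happens in a throwaway directory, and the file names (fruit.txt, sorted.txt) are invented for the example.

```shell
scratch=$(mktemp -d)

# > sends printf's output into a new file instead of the screen.
printf '%s\n' pear apple fig >"$scratch/fruit.txt"

# < reads input from fruit.txt; > writes the sorted lines to sorted.txt.
sort <"$scratch/fruit.txt" >"$scratch/sorted.txt"
sorted=$(cat "$scratch/sorted.txt")
printf 'sorted.txt:\n%s\n' "$sorted"

# Redirecting to the same file again replaces its old contents entirely.
echo plum >"$scratch/sorted.txt"
overwritten=$(cat "$scratch/sorted.txt")
printf 'sorted.txt after overwriting:\n%s\n' "$overwritten"

# Clean up the throwaway directory.
rm -r "$scratch"
```

After the second redirection, sorted.txt contains only plum; the three sorted lines are gone for good.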

Sorting lines randomly

For the purposes of creating interesting poetic juxtapositions, you might find it valuable to be able to sort the lines of a text file randomly. Unfortunately, there’s no standard UNIX command that does this everywhere (GNU systems ship one called shuf, but it isn’t part of every UNIX-like system). You can, however, use the following very small Python program to achieve the same results:

$ python -c "import sys,random;x=sys.stdin.readlines();random.shuffle(x);sys.stdout.write(''.join(x))" <sea_rose.txt

Copy and paste that entire line. You’ll get output that looks like this:

you are flung on the sand, 
spare of leaf,

in the crisp sand 
drip such acrid fragrance 
marred and with stint of petals, 
than a wet rose 

you are lifted 
single on a stem -- 
you are caught in the drift.
more precious 
Stunted, with small leaf, 

that drives in the wind.
meagre flower, thin, 
Rose, harsh rose, 
hardened in a leaf?
Can the spice-rose 

For now, we won’t worry about how the Python program works (but it’ll all make sense by the end of this course). You can use this small Python program with arbitrary input, and you can pipe and redirect its output however you please. So, for example, to get three randomly selected lines from sea_rose.txt:

$ python -c "import sys,random;x=sys.stdin.readlines();random.shuffle(x);sys.stdout.write(''.join(x))" <sea_rose.txt | head -3

This line might produce the following output:

spare of leaf,
in the crisp sand 
more precious 
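If Python isn’t available, one alternative (not the approach used above, just a sketch under the assumption that your system has awk, a tool this tutorial doesn’t otherwise cover) is to put a random number at the front of each line, sort on that number, and then cut it off again. The names lines and shuffled are made up for the example.

```shell
lines=$(printf '%s\n' one two three four five)

# 1. awk prefixes every line with a random number and a tab.
# 2. sort -n orders the lines by that number, i.e. randomly.
# 3. cut -f 2- (tab is cut's default delimiter) removes the number again.
shuffled=$(printf '%s\n' "$lines" |
  awk 'BEGIN { srand() } { printf "%.10f\t%s\n", rand(), $0 }' |
  sort -n |
  cut -f 2-)
printf '%s\n' "$shuffled"
```

The order changes from run to run. Because awk’s srand() seeds itself from the clock, two runs within the same second can produce the same order.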

Pasting different text files together

You can use the paste command to combine several text files together, “pasting” them next to each other as though they were columns in a spreadsheet. To use the paste command, type paste on the command line followed by a list of filenames whose contents you’d like to juxtapose. To demonstrate, let’s create a file that has 100 random words from sowpods.txt:

$ python -c "import sys,random;x=sys.stdin.readlines();random.shuffle(x);sys.stdout.write(''.join(x))" <sowpods.txt | head -100 >random_words.txt

Now, let’s create a file that has just the first word from the first 100 lines of Shakespeare’s sonnets:

$ cut -d ' ' -f 1 <sonnets.txt | head -100 >first_words.txt

Now, we can combine the two with paste:

$ paste -d ' ' first_words.txt random_words.txt

(The -d option specifies what character to use when gluing the files back together, just as the -d option for cut specifies where to break lines apart.) You’ll get output that looks like this:

 quail
From rehospitalised
That aconitums
But laundries
His endship
But shocked
Feed'st sweying
Making scorns
Thy jadedness
Thou thistly
And grandsir
Within extensions
And obconical
 extremals
 goniatite
 aeromancy
When brunts
And portance
Thy contrasexuals
Will ultracautious
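Here’s a tiny self-contained sketch of paste -d ' ', using two three-line files created on the spot in a throwaway directory (the words are borrowed from the output above; pasted is a name chosen for the example).

```shell
scratch=$(mktemp -d)

# Two small files with three lines each.
printf '%s\n' From That But >"$scratch/first_words.txt"
printf '%s\n' quail laundries shocked >"$scratch/random_words.txt"

# Line 1 of the first file is joined to line 1 of the second with a space,
# line 2 to line 2, and so on.
pasted=$(paste -d ' ' "$scratch/first_words.txt" "$scratch/random_words.txt")
printf '%s\n' "$pasted"

# Clean up the throwaway directory.
rm -r "$scratch"
```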

Try creating a few more lists of 100 random words, like so:

$ python -c "import sys,random;x=sys.stdin.readlines();random.shuffle(x);sys.stdout.write(''.join(x))" <sowpods.txt | head -100 >random_words2.txt
$ python -c "import sys,random;x=sys.stdin.readlines();random.shuffle(x);sys.stdout.write(''.join(x))" <sowpods.txt | head -100 >random_words3.txt

Now, you can use paste to create a series of 100 unusual word combinations, like so:

$ paste random_words.txt random_words2.txt random_words3.txt

You’ll get output like this:

ultracautious provisory avowedly
autotomic clobber lethargizes
processionary theftless tapernesses
floozy macintosh rootless
basicity drearest splents
fadeaways denigrated packaging
kissably paralogisms luv
parametrization heketara avas
restrove sdaining frostline
forecourses deracination hematuria
councillor quark narial
presentational bezzazzes haphazardnesses
matronizing frontierswoman re
grants fain mattoids
womanlike journeyer hexylene
pugh militate filed
pingrass sluff counterpoint
eyetooth reavowing file

Congratulations! You just made your first weird computational poem.

Helpful keystrokes

You may have noticed at this point that you can’t change the location of the cursor on the command line by just clicking where you want the cursor to go. No, that’d be too easy. The regular keystrokes you use to move to the beginning or end of the line won’t work either. But there are some special Terminal-specific keystrokes you can learn to make your life easier. Here are a few:

  • Up/Down: Moves between entries in your command line “history” (i.e., a list of all the commands you’ve typed in this session).
  • Ctrl+A: Moves to the beginning of the line.
  • Ctrl+E: Moves to the end of the line.
  • Ctrl+U: Erases everything from the current position of the cursor to the beginning of the line.
  • Tab: Attempts to “auto-complete” your current word. This will search the current directory for a filename or command beginning with the characters you’ve already typed. Hit Tab again to cycle through possible alternatives.

Other helpful commands

  • wc -w foo will print out the number of words in the file named foo. (wc -l will count the number of lines; wc -c will count the number of bytes, which is the same as the number of characters for plain ASCII text.)

  • curl -s http://some.url/ fetches the web page at the given URL and prints its content to standard output. (We’ll be using this command extensively!)

  • The ls command will give you a list (“ls” stands for “list”) of files in your current directory. If you give it a parameter, it will give you a listing of the files in the directory you gave to it. (On OS X, for example, try ls /Users/your_user_name/Desktop)

  • Type pwd to find out what your current directory is.

  • The cp command will make a copy (“cp” for “copy”) of a file. It takes two parameters: the first is the source file name, the second is the destination file name.

  • Type rm foo to delete the file named foo. (Note: this is permanent! The file won’t go to your Trash, so be careful.)
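The following sketch strings several of these commands together in a throwaway directory. Reading the file with < keeps wc from printing the file name, and tr -d ' ' strips the padding that some versions of wc put in front of the number; the file names and variable names are invented for the example.

```shell
# Work in a throwaway directory so no real files are touched.
scratch=$(mktemp -d)
cd "$scratch"

# A two-line file named foo.
printf '%s\n' 'you are lifted' 'in the crisp sand' >foo

word_count=$(wc -w <foo | tr -d ' ')   # 3 + 4 words
line_count=$(wc -l <foo | tr -d ' ')   # 2 lines
printf 'words: %s, lines: %s\n' "$word_count" "$line_count"

cp foo foo_copy     # copy foo to a new file called foo_copy
listing=$(ls)       # both files are now in the current directory
printf 'after cp:\n%s\n' "$listing"

rm foo              # permanently delete the original
remaining=$(ls)     # only the copy is left
printf 'after rm:\n%s\n' "$remaining"

# Step out of the throwaway directory and remove it.
cd /
rm -r "$scratch"
```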

Further reading