UNIX text processing tools
What is the command-line?
The command-line is a way for you to communicate with your computer.
Imagine that you needed some information from a distant library, in a city where a friend of yours lives. Your friend is willing to help, but knows nothing about the subject matter of the information you want—they can get to the library, but from there on out you’re going to have to tell them what to do. Your friend goes to the library, and you start a telephone conversation with each other.
You might start asking your friend questions in this situation, like:
- What part of the library are you in?
- Give me a list of some of the books you see there.
Based on that information, you might then ask your friend to start doing things for you, like:
- Take from the shelf the book with the title “Cheese: A Cultural History”.
- Read for me the first several lines of the book.
The command-line is kind of like this scenario, except your friend on the other end of the line is the computer. And you’re asking it questions not about books in a library, but files on the computer.
Another difference is that your friend in the library can understand human language, and (as a human) is clever enough to figure out your intent, even if you use ambiguous, misleading, or sarcastic language. A computer, on the other hand, can’t understand human language. You have to communicate with it through a more limited language of pre-programmed verbs and nouns, following very strict syntax.
But why? Surely we’ve advanced past such barbarities.
This style of interaction with a computer was invented soon after the invention of computers themselves. It’s a very simple way for a programmer to create an interface to the computer’s functionality, and an efficient way for human operators to unambiguously communicate what they want the computer to do.
Contemporary “graphical” user interfaces (GUIs) have existed in some form for a while. An early example of the GUI can be found in Doug Engelbart’s so-called “Mother of All Demos,” presented in 1968. The Xerox Alto project, developed in the 1970s, established many conventions in GUIs that we still use today, and served as the inspiration for the Apple Macintosh.
But the command-line has a number of advantages over the GUI. For example, this command-line command:
$ cp file1.html animals/feline.html
… takes a file called file1.html and makes a copy called feline.html in a folder called animals. (cp is the UNIX command for “copy.”) For an experienced user, performing this operation on the command-line can be much faster than performing the tasks necessary to do it in the GUI (which might involve opening several “windows,” dragging an icon with the mouse, performing right clicks, etc.).
The command-line interface also easily allows for multiple actions to be combined into a single action, or for one program to use another program’s output as input. Here’s another example command-line command:
$ cut -d ',' -f 2-3 data.txt | sort | uniq | grep 'cheese'
This command extracts the second and third fields of a comma-separated values file, sorts the values in alphabetical order, eliminates all of the duplicate lines, and then filters the result to include only lines that contain the string cheese. In order to perform this same task in a GUI, you’d need to either cut-and-paste your data between different programs that accomplished the individual tasks (one program to select parts of the data, another to sort it), or you’d have to find a single program that supported all of the desired features.
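To make that pipeline concrete, here is a minimal sketch you can run. The contents of data.txt are made up for illustration (the file itself isn't shown here), and the example works in a scratch directory created by mktemp so that nothing of yours gets overwritten.

```shell
# Work in a throwaway directory so no existing file is touched.
cd "$(mktemp -d)"

# Hypothetical contents for data.txt: an id, a food, and a category.
printf '1,cheddar,cheese\n2,apple,fruit\n3,cheddar,cheese\n4,brie,cheese\n' > data.txt

# Keep fields 2-3, sort, drop duplicate lines, keep lines containing "cheese".
cut -d ',' -f 2-3 data.txt | sort | uniq | grep 'cheese'
# prints:
#   brie,cheese
#   cheddar,cheese
```

Notice that the duplicate cheddar line shows up only once, and the apple line is filtered out at the very end.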
The UNIX command-line
Nearly all operating systems provide a command-line interface of some sort or other. (I cut my teeth on the MS-DOS command-line, an analog of which is still available on most Windows machines as cmd.exe.) When most people think of the command-line in a contemporary context, they’re thinking of the UNIX command-line.
SIDEBAR: ‘Wait, what’s an “operating system”?’ I hear you ask. Good question! An operating system is the software that runs on your computer that provides the basic functionality necessary for other programs to function—everything from interfaces to your computer’s components, like its hard drive or peripherals (mouse, printer, etc.) to things that you see as a user, like the user interface. You probably use multiple operating systems throughout the day: Windows or OS X on your computer, Android or iOS on your mobile phone.
UNIX is a family of common operating systems that originated in the 1970s, and are still frequently used today, in particular a clone/derivative called Linux. OS X is itself a derivative of UNIX (with a fancy proprietary GUI).
When UNIX was first being developed, and subsequently in its history, the programmers on the project came up with a series of command-line tools to help them accomplish tasks and solve common problems. It turns out that programmers, like other kinds of writers, deal with text a LOT, and so many of the tools they developed deal with text: filtering text, sorting text, modifying text. Over time, many other programmers have contributed to these tools, adding functionality and fixing bugs. For my money, they’re some of the most useful things that writers, researchers and computer users in general can learn.
You can also use these tools creatively. So we’re going to learn how to use them.
How do computers think about text?
Text can be divided into any number of different, overlapping units (document, page, section, subsection, chapter, clause, sentence, ascender, descender, act, stanza, syllable, foot, etc.) but only some of these are easy for computers to work with. (It’s harder than you think to teach a computer what a “sentence” is, for example.)
The two most obvious units of text in a computer are:
- the character, i.e., the byte (or series of bytes) that represents a single element of written language (e.g., A through Z in English, any one of many glyphs in Chinese, etc.)
- the file, i.e., an ordered collection of characters
Somewhere in between these two is the line, a formal unit of text that has been part of written language from the beginning. (Here’s an example of Cuneiform, an ancient writing system, written with lines.) The line arises in written text because writing transcribes speech, which is a one-dimensional medium, onto two-dimensional surfaces (paper, clay, stone, etc.). Line breaks in text are, fundamentally, a way of using up all of the space allotted on a surface.
But line breaks also serve syntactic, semantic, and metrical functions, as in poetry:
Rose, harsh rose,
marred and with stint of petals,
meagre flower, thin,
spare of leaf,
more precious
than a wet rose
single on a stem --
you are caught in the drift.
Stunted, with small leaf,
you are flung on the sand,
you are lifted
in the crisp sand
that drives in the wind.
Can the spice-rose
drip such acrid fragrance
hardened in a leaf?
In computer text, the line is often used as a “record marker.” This is how a text file can be used as a rudimentary database, with one “record” per line. (For example, here are some NBA stats, written in plain text format, with one line per player.)
Perhaps because of these parallelisms (text layout/poetic structure/database structure), many programs that operate on text use the line as their fundamental unit—especially those in UNIX (coming right up). The programs that we write in this class will do the same.
Getting started
If you’re using OS X or Linux, then you’re good to go! You need to launch a “terminal emulator” application. On OS X, a terminal emulator comes pre-installed on your computer. Go to Applications > Utilities on your hard drive and select “Terminal” (or you can just search for “Terminal” with Spotlight; it’ll most likely be the first result). If you’re using Linux, you may have to consult the instructions for your particular distribution. But somewhere on your computer there should be a prominently displayed application called “Terminal” (or something similar). If you click on it and a prompt ending with $ appears, you’re probably on the right track.
Windows doesn’t come pre-installed with UNIX command-line tools. You’ll have to install them yourself. I suggest the “GNU on Windows” package, which installs a number of helpful UNIX programs and utilities. Download it here.
The program you use on Windows to access the command line is called “Command Prompt.” You can access this program from the Start menu, under “All Programs > Accessories > Command Prompt.” The command prompt looks different on Windows from how it looks on UNIX-ish machines; for one, the prompt will end in > instead of $. But everything else in this tutorial should look and work fine.
When you’ve successfully reached the command line (another line!), you should see something like this:
[aparrish@ip-172-30-0-159 ~]$
This is the “prompt” (because it “prompts” you to do something). It’s telling you your username, the server you’re logged into, and the current directory. More on that later.
Keystrokes you should know
The keys you type on the command-line generally do what you think they will: they print the character you typed to the screen. The command-line also has a number of special keystrokes that have particular meanings. Two are important to know from the very beginning.
Ctrl+D signals to a program that is waiting for you to type something in that you are done typing. For example, the sort command, when run on its own, will wait for you to type in the lines that you want to sort. Hit Ctrl+D to tell sort that you’ve entered your last line. (This is Ctrl+Z on Windows.)
Ctrl+C signals to a program that you want it to stop doing whatever it’s doing immediately, even if it hasn’t yet completed its task. If, for example, you accidentally used the wrong file in an operation and you want the operation to stop (because it’s printing out too many lines, or the wrong lines), hit Ctrl+C.
Summed up:
- Ctrl+D: “I’m done typing things in. kthxbye.” (This is Ctrl+Z on Windows.)
- Ctrl+C: “You’re doing something I don’t like. Please stop.”
Notably, Ctrl+D is also used to signal to the command-line that you’re done entering commands. At the prompt, hit Ctrl+D to log out. (You can accomplish the same thing by typing exit and hitting return.)
Your first UNIX commands
First off, we’re going to create a directory to work in, so that you can find your files later and you don’t risk overwriting something:
$ mkdir workshop
$ cd workshop
(Don’t type the $! It’s just there to indicate that those commands should be typed at the command line.)
The mkdir command means “make directory”; “directory” is just UNIX speak for “folder.” When you’re using the command line, there’s one directory on your machine that is considered your “current” directory, i.e., the directory you’re doing stuff in. The cd (“change directory”) command makes the directory you give it (workshop in this case) the current directory.
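If you want to convince yourself that cd really changed your current directory, the pwd command prints it. Here's a small sketch; it starts in a scratch directory made by mktemp, which is an assumption of this example so that it doesn't clutter your real files.

```shell
# Start in a throwaway directory (an assumption of this sketch).
cd "$(mktemp -d)"

mkdir workshop   # create the new directory
cd workshop      # make it the current directory
pwd              # print the current directory; the path ends in /workshop
```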
Example text files
For the purposes of this chapter, it will be helpful to have some plain text files readily available to you. Download this zip file and extract it somewhere on your computer. Copy the files into the directory that you just made and return to the command line. If you type ls, you should see a list of the files that you just downloaded.
Kinds of commands
There are (broadly) two kinds of commands in UNIX: commands that work on lines of input/output, and commands that operate on files and directories. The mkdir and cd commands are examples of the latter. We’re primarily concerned with the former. Let’s start with cat:
$ cat
(Make sure to hit “return” after you type cat.) Now type. After you enter a line, cat will print out the same line. It’s the simplest text filter possible (one rule: let everything through).
When you’re done with cat, press Ctrl+D. Let’s try something more interesting, like grep:
$ grep foo
Now type some lines of text. Try typing, for example, I like drink and then I like food. Only the second line is printed back.
The grep command only prints out lines that “match” the string of characters that follows the command (foo, in this case). Let’s try it again, this time with a different “pattern”:
$ grep you
If we copy and paste the poem above (“Sea Rose”) into the terminal application, the resulting output would look like this:
you are caught in the drift.
you are flung on the sand,
you are lifted
The commands head and tail print out a certain number of lines from the beginning of a file and the end of a file, respectively. If you type in the following:
$ tail -3
… and then paste in the poem above, you’d get:
Can the spice-rose
drip such acrid fragrance
hardened in a leaf?
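Here's a sketch of head and tail that you can run without pasting anything. It relies on two things this tutorial doesn't otherwise cover, so treat them as assumptions of the example: a “here-document” (the <<'EOF' … EOF part), which hands the enclosed lines to the command as its input, and the -n spelling of the line count (-n 2 means the same as -2).

```shell
# The lines between <<'EOF' and EOF are handed to the command as its input,
# as if you had typed them and then pressed Ctrl+D.
head -n 2 <<'EOF'
Can the spice-rose
drip such acrid fragrance
hardened in a leaf?
EOF
# prints:
#   Can the spice-rose
#   drip such acrid fragrance

tail -n 1 <<'EOF'
Can the spice-rose
drip such acrid fragrance
hardened in a leaf?
EOF
# prints:
#   hardened in a leaf?
```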
Structure of UNIX commands
UNIX commands generally follow this structure:
name_of_command [options] arguments
(The “[options]” part of that schema is usually one or more characters preceded by hyphens. The -3 in tail -3 tells it to print the last three lines; grep takes an option, -i, which tells it to be case insensitive.)
You can think of UNIX commands like commands in English, but with a funny syntax: “Fetch thoroughly my slippers!”
You can figure out which options and arguments a command supports by typing man name_of_command at the command line.
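To see an option change a command's behavior, here's a small sketch comparing grep with and without -i. The three input lines are taken from the poem (including its title line), and they're fed in with a here-document (<<'EOF' … EOF), a shell feature used here only so the example runs without any typing or pasting.

```shell
# Without -i, only the lowercase "rose" matches.
grep rose <<'EOF'
SEA ROSE
more precious
than a wet rose
EOF
# prints:
#   than a wet rose

# With -i, case is ignored, so the title line matches too.
grep -i rose <<'EOF'
SEA ROSE
more precious
than a wet rose
EOF
# prints:
#   SEA ROSE
#   than a wet rose
```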
Sorting and piping
The sort command takes every line of input and prints them back out, in alphabetical order. Try:
$ sort
… paste in the poem, and hit Ctrl+D. You’ll get something like:
Can the spice-rose
drip such acrid fragrance
hardened in a leaf?
in the crisp sand
marred and with stint of petals,
meagre flower, thin,
more precious
Rose, harsh rose,
SEA ROSE
single on a stem --
spare of leaf,
Stunted, with small leaf,
than a wet rose
that drives in the wind.
you are caught in the drift.
you are flung on the sand,
you are lifted
(Why do you think there are so many blank lines at the beginning?)
So far, we’ve just been sending these commands input (by typing, or cutting and pasting), then letting the output be printed back to the screen. UNIX provides a means by which we can send the output of one program as the input of another program. We do this using the pipe character (| … usually shift+backslash).
For example:
$ grep leaf | sort
… takes lines from input, displays only those that contain the string “leaf,” and then passes them to sort, which displays those lines in alphabetical order. The output from the poem:
hardened in a leaf?
spare of leaf,
Stunted, with small leaf,
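One thing to watch for: the exact order you get from sort depends on your system's language (“locale”) settings. In the output above, the capitalized “Stunted” sorts among the lowercase words, but on some systems every capital letter sorts before every lowercase letter. The sketch below forces that second behavior by setting the standard LC_ALL environment variable to C for just this one command; printf is used only to supply the three lines as input.

```shell
# LC_ALL=C asks sort to compare raw byte values, so "S" sorts before "h".
printf 'spare of leaf,\nStunted, with small leaf,\nhardened in a leaf?\n' | LC_ALL=C sort
# prints:
#   Stunted, with small leaf,
#   hardened in a leaf?
#   spare of leaf,
```

If your own results don't match the listings in this tutorial exactly, this is the most likely reason.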
cut
The cut command breaks up a line of text into its constituent parts. Let’s say we had a text file full of data, where each line contained multiple “fields” separated by commas. (This is a “comma-separated value” file, a common way of exporting data from a spreadsheet program like Excel, especially if you want to share that data with someone who doesn’t have Excel.) Here’s what the data looks like:
Geraldine,New York,welding
Roberto,Tennessee,birdwatching
Dana,Wyoming,poetry
Priya,Maine,rock climbing
The cut command allows us to easily process this text and “select” only particular items from each line. Here’s how, for example, we could print out just the “state” from each line:
$ cut -d , -f 2
Run that command, then cut-and-paste in the data from above. Here’s the output you’d get:
New York
Tennessee
Wyoming
Maine
The cut command here takes two options, both of which themselves have parameters. (This is confusing, but stick with me here.) The -d option is followed by the “delimiter” string (i.e., what you want to split the line on; in the example above, the comma); the -f option is followed by the number of the field you want.
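Changing the number after -f picks a different column from the same data. Here's a quick sketch using two of the sample lines; printf and the pipe are there only to supply the input without any pasting.

```shell
# Field 1 is the name.
printf 'Geraldine,New York,welding\nPriya,Maine,rock climbing\n' | cut -d , -f 1
# prints:
#   Geraldine
#   Priya

# Field 3 is the hobby. Spaces are not the delimiter here, so
# "rock climbing" stays together as one field.
printf 'Geraldine,New York,welding\nPriya,Maine,rock climbing\n' | cut -d , -f 3
# prints:
#   welding
#   rock climbing
```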
Words in a line of text also have a “delimiter” between them: a space character. So we can use cut to transform some text by selecting only, say, the first word of each line. For example, try this command:
$ cut -d ' ' -f 1
And paste in ‘Sea Rose’. The output:
SEA
Rose,
marred
meagre
spare
more
than
single
you
Stunted,
you
you
in
that
Can
drip
hardened
SIDEBAR: What’s with the ' '? Why are those quotes just hanging out like that? That’s a good question! It turns out that the space character has a special meaning on the UNIX command-line: you use one or more space characters to separate commands and parameters from each other. If we want to give a command a literal space character as a parameter, we need some way to set that space character apart from its “normal” use. We do this by putting the character in quotes (' '). Quoting is extremely important to computer programming and this isn’t the last time you’ll see it, not by a long shot!
You can use the -f option to specify a range of values, or to give a comma-separated list of values to include. In our sample text files, there’s a CSV file called NBA_2015_games.csv that contains one line for every NBA game played in 2015. The names of the teams that competed in each game are in the third and fifth fields of the file. To get just those fields, use the cut command like so:
$ cut -d , -f 3,5 <NBA_2015_games.csv
The first few lines of the output will look like this:
Houston Rockets,Los Angeles Lakers
Orlando Magic,New Orleans Pelicans
Dallas Mavericks,San Antonio Spurs
Brooklyn Nets,Boston Celtics
Milwaukee Bucks,Charlotte Hornets
Detroit Pistons,Denver Nuggets
Philadelphia 76ers,Indiana Pacers
Minnesota Timberwolves,Memphis Grizzlies
Washington Wizards,Miami Heat
Chicago Bulls,New York Knicks
Los Angeles Lakers,Phoenix Suns
Oklahoma City Thunder,Portland Trail Blazers
Golden State Warriors,Sacramento Kings
Atlanta Hawks,Toronto Raptors
Houston Rockets,Utah Jazz
Exercise: Use man cut to find out how to use “ranges” as a parameter to the -f option. Then use cut to print out, for each line of “Sea Rose,” the second and third words on the line.
tr
The tr command “translates” a set of characters in the original line to another set of characters. The source character set is the first parameter, and the second parameter is the characters you want them to be translated to. For example:
$ tr aeiou eioua
hello there, how are you?
hillu thiri, huw eri yua?
You can specify a range of characters with a hyphen:
$ tr a-z A-Z
hello there, how are you?
HELLO THERE, HOW ARE YOU?
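Ranges can also be combined. As a sketch, the classic “rot13” cipher shifts each letter thirteen places through the alphabet, which you can write as two ranges in the second parameter (n-z followed by a-m). This particular example isn't part of the sample files; it's just an illustration of ranges.

```shell
# a-m become n-z, and n-z wrap around to a-m.
printf 'hello there\n' | tr 'a-z' 'n-za-m'
# prints:
#   uryyb gurer

# Shifting by 13 twice goes all the way around the 26-letter alphabet,
# so running the same translation again restores the original text.
printf 'hello there\n' | tr 'a-z' 'n-za-m' | tr 'a-z' 'n-za-m'
# prints:
#   hello there
```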
Multiple pipes
Of course, you can include more than one command in a “pipeline”:
$ sort | tail -6 | tr aeiou e
… which, if you send it our venerable poem, outputs the following:
Stented, weth smell leef,
then e wet rese
thet dreves en the wend.
yee ere ceeght en the dreft.
yee ere fleng en the send,
yee ere lefted
What happened? The input went to sort, which sorted the lines in alphabetical order. Then tail -6 grabbed only the last six lines of the output of sort and sent those lines through to tr, which replaced every vowel with e. (You can build pipelines of any length using this technique.)
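Pipelines can mix any of the commands we've seen so far. Here's a sketch that combines cut, sort, and tr on the sample data from the cut section: it pulls out the hobby column, alphabetizes it, and shouts it in capitals. The printf at the front just supplies the four lines as input.

```shell
printf 'Geraldine,New York,welding\nRoberto,Tennessee,birdwatching\nDana,Wyoming,poetry\nPriya,Maine,rock climbing\n' |
  cut -d , -f 3 |   # keep only the third field (the hobby)
  sort |            # alphabetize the hobbies
  tr a-z A-Z        # convert lowercase letters to uppercase
# prints:
#   BIRDWATCHING
#   POETRY
#   ROCK CLIMBING
#   WELDING
```

Ending a line with the pipe character lets you continue the pipeline on the next line, which makes longer pipelines easier to read.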
Using files (“redirection”)
So far, we’ve been building “programs” that can only read from the keyboard (or from cut-and-paste) and can only send their output to the screen. What if we want to read from an existing file, and then output to a file?
No complicated code is needed. UNIX provides a method for us. It’s called “redirection.” Here’s how it works:
$ sort <sea_rose.txt
The < character means “instead of taking input from the keyboard, take input from this file.” Likewise:
$ grep were >some_file.txt
The > means “instead of sending your output to the screen, send it to this file.” You can use them both at the same time:
$ grep were <sea_rose.txt >some_file.txt
… in which case some_file.txt will end up with every line from sea_rose.txt that contains the string “were.” (If the output file doesn’t already exist, it will be created. If it does already exist, it will be overwritten, so be careful!)
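Here's a small sketch of input and output redirection working together. The file names stanza.txt and you_lines.txt are made up for this example, stanza.txt holds just the second stanza of the poem, and everything happens in a scratch directory created by mktemp so that no existing file is overwritten.

```shell
# Work in a throwaway directory so nothing of yours is overwritten.
cd "$(mktemp -d)"

# Create a small input file (a hypothetical stand-in for sea_rose.txt).
printf 'Stunted, with small leaf,\nyou are flung on the sand,\nyou are lifted\nin the crisp sand\nthat drives in the wind.\n' >stanza.txt

# Read from stanza.txt and write the matching lines to you_lines.txt.
# Nothing is printed to the screen by this command.
grep you <stanza.txt >you_lines.txt

# Use cat to look at what ended up in the output file.
cat <you_lines.txt
# prints:
#   you are flung on the sand,
#   you are lifted
```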
Sorting lines randomly
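For the purposes of creating interesting poetic juxtapositions, you might find it valuable to be able to sort the lines of a text file randomly. Unfortunately, there’s no one standard UNIX command that does this on every system. (GNU systems, including most Linux distributions, provide a command called shuf, but you can’t count on it being installed everywhere.) You can, however, use the following very small Python program to achieve the same results: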
$ python -c "import sys,random;x=sys.stdin.readlines();random.shuffle(x);sys.stdout.write(''.join(x))" <sea_rose.txt
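Copy and paste that entire line. (If your system doesn’t have a command named python, try python3 instead; the program works with either.) You’ll get output that looks like this: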
you are flung on the sand,
spare of leaf,
in the crisp sand
drip such acrid fragrance
marred and with stint of petals,
than a wet rose
you are lifted
single on a stem --
you are caught in the drift.
more precious
Stunted, with small leaf,
that drives in the wind.
meagre flower, thin,
Rose, harsh rose,
hardened in a leaf?
Can the spice-rose
For now, we won’t worry about how the Python program works (but it’ll all make sense by the end of this course). You can use this small Python program with arbitrary input, and you can pipe and redirect its output however you please. So, for example, to get three randomly selected lines from sea_rose.txt:
$ python -c "import sys,random;x=sys.stdin.readlines();random.shuffle(x);sys.stdout.write(''.join(x))" <sea_rose.txt | head -3
This line might produce the following output:
spare of leaf,
in the crisp sand
more precious
Pasting different text files together
You can use the paste command to combine several text files together, “pasting” them next to each other as though they were columns in a spreadsheet. To use the paste command, type paste on the command line followed by a list of filenames whose contents you’d like to juxtapose. To demonstrate, let’s create a file that has 100 random words from sowpods.txt:
$ python -c "import sys,random;x=sys.stdin.readlines();random.shuffle(x);sys.stdout.write(''.join(x))" <sowpods.txt | head -100 >random_words.txt
Now, let’s create a file that has just the first word from the first 100 lines of Shakespeare’s sonnets:
$ cut -d ' ' -f 1 <sonnets.txt | head -100 >first_words.txt
Now, we can combine the two with paste:
$ paste -d ' ' first_words.txt random_words.txt
(The -d option specifies what character to use when gluing the files back together, just as the -d option for cut specifies where to break lines apart.) You’ll get output that looks like this:
quail
From rehospitalised
That aconitums
But laundries
His endship
But shocked
Feed'st sweying
Making scorns
Thy jadedness
Thou thistly
And grandsir
Within extensions
And obconical
extremals
goniatite
aeromancy
When brunts
And portance
Thy contrasexuals
Will ultracautious
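You may have noticed that a few lines in that output contain only one word. The likely explanation is that the matching line of first_words.txt is empty (sonnets.txt presumably has blank lines in it), so paste has nothing to put in the first column. Here's a sketch with made-up three-line versions of the two files, run in a scratch directory, that reproduces the effect:

```shell
# Work in a throwaway directory so the real files aren't overwritten.
cd "$(mktemp -d)"

# Miniature, made-up versions of the two files.
# Note the empty second line in first_words.txt.
printf 'From\n\nBut\n' >first_words.txt
printf 'quail\nshocked\nscorns\n' >random_words.txt

paste -d ' ' first_words.txt random_words.txt
# prints (the second line starts with a space):
#   From quail
#    shocked
#   But scorns
```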
Try creating a few more lists of 100 random words, like so:
$ python -c "import sys,random;x=sys.stdin.readlines();random.shuffle(x);sys.stdout.write(''.join(x))" <sowpods.txt | head -100 >random_words2.txt
$ python -c "import sys,random;x=sys.stdin.readlines();random.shuffle(x);sys.stdout.write(''.join(x))" <sowpods.txt | head -100 >random_words3.txt
Now, you can use paste to create a series of 100 unusual word combinations, like so:
$ paste random_words.txt random_words2.txt random_words3.txt
You’ll get output like this:
ultracautious provisory avowedly
autotomic clobber lethargizes
processionary theftless tapernesses
floozy macintosh rootless
basicity drearest splents
fadeaways denigrated packaging
kissably paralogisms luv
parametrization heketara avas
restrove sdaining frostline
forecourses deracination hematuria
councillor quark narial
presentational bezzazzes haphazardnesses
matronizing frontierswoman re
grants fain mattoids
womanlike journeyer hexylene
pugh militate filed
pingrass sluff counterpoint
eyetooth reavowing file
Congratulations! You just made your first weird computational poem.
Helpful keystrokes
You may have noticed at this point that you can’t change the location of the cursor on the command line by just clicking where you want the cursor to go. No, that’d be too easy. The regular keystrokes you use to move to the beginning or end of the line won’t work either. But there are some special Terminal-specific keystrokes you can learn to make your life easier. Here are a few:
- Up/Down: Moves between entries in your command line “history” (i.e., a list of all the commands you’ve typed in this session).
- Ctrl+A: Moves to the beginning of the line.
- Ctrl+E: Moves to the end of the line.
- Ctrl+U: Erases everything from the current position of the cursor to the beginning of the line.
- Tab: Attempts to “auto-complete” your current word. This will search for a command, or for a filename in the current directory, beginning with the characters you’ve already typed. If there’s more than one possibility, hit Tab again to list or cycle through the alternatives (which one depends on your shell).
Other helpful commands
- wc -w foo will print out the number of words in the file named foo. (wc -l will count the number of lines; wc -c will count the number of bytes, which for plain ASCII text is the same as the number of characters.)
- curl -s http://some.url/ fetches the web page at the given URL and prints its content to standard output. (We’ll be using this command extensively!)
- The ls command will give you a list (“ls” stands for “list”) of files in your current directory. If you give it a parameter, it will give you a listing of the files in the directory you gave to it. (On OS X, for example, try ls /Users/your_user_name/Desktop)
- Type pwd to find out what your current directory is.
- The cp command will make a copy (“cp” for “copy”) of a file. It takes two parameters: the first is the source file name, the second is the destination file name.
- Type rm foo to delete the file named foo. (Note: this is permanent! The file won’t go to your Trash, so be careful.)
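Like the other text commands in this chapter, wc will also read from a pipe if you don't give it a file name. Here's a small sketch that counts two lines of the poem three different ways. Some versions of wc pad the number with spaces, so your output may be indented.

```shell
# Two lines of input...
printf 'you are lifted\nin the crisp sand\n' | wc -l   # counts 2 lines

# ...containing seven words in total...
printf 'you are lifted\nin the crisp sand\n' | wc -w   # counts 7 words

# ...and 33 bytes, including the newline at the end of each line.
printf 'you are lifted\nin the crisp sand\n' | wc -c   # counts 33 bytes
```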
Further reading
- UNIX tutorial for beginners: humane with many helpful illustrations
- egrep for linguists: not just about egrep, but a number of other UNIX text processing utilities as well!