UNIX text processing tools
What is the command-line?
The command-line is a way for you to communicate with your computer.
Imagine that you needed some information from a distant library, in a city where a friend of yours lives. Your friend is willing to help, but knows nothing about the subject matter of the information you want—they can get to the library, but from there on out you’re going to have to tell them what to do. Your friend goes to the library, and you start a telephone conversation with each other.
You might start asking your friend questions in this situation, like:
- What part of the library are you in?
- Give me a list of some of the books you see there.
Based on that information, you might then ask your friend to start doing things for you, like:
- Take from the shelf the book with the title “Cheese: A Cultural History”.
- Read for me the first several lines of the book.
The command-line is kind of like this scenario, except your friend on the other end of the line is the computer. And you’re asking it questions not about books in a library, but files on the computer.
Another difference is that your friend in the library can understand human language, and (as a human) is clever enough to figure out your intent, even if you use ambiguous, misleading, or sarcastic language. A computer, on the other hand, can’t understand human language. You have to communicate with it through a more limited language of pre-programmed verbs and nouns, following very strict syntax.
But why? Surely we’ve advanced past such barbarities.
This style of interaction with a computer was invented soon after the invention of computers themselves: it’s a very simple way for a programmer to create an interface to the computer’s functionality, and an efficient way for human operators to unambiguously communicate their intent about what they want the computer to do.
Contemporary “graphic” user interfaces (GUIs) have existed in some form since the late 1960s. An early example of the GUI can be found in Doug Engelbart’s so-called “Mother of All Demos”, presented in 1968. The Xerox Alto project, developed in the 1970s, established many conventions in GUIs that we still use today, and served as the inspiration for the Apple Macintosh.
But the command-line has a number of advantages over the GUI. For example, this command-line command:
$ cp file1.html animals/feline.html
… takes a file called file1.html and makes a copy called feline.html in a directory called animals. (cp is the UNIX command for “copy.”) For an experienced user, performing this operation on the command-line can be much faster than performing the tasks necessary to do it in the GUI (which might involve opening several “windows,” dragging an icon with the mouse, performing right clicks, etc.).
The command-line interface also easily allows for multiple actions to be combined into a single action, or for one program to use another program’s output as input. Here’s another example command-line command:
$ cut -d ',' -f 2-3 data.txt | sort | uniq | grep 'cheese'
This command extracts the second and third fields of a comma-separated values file, sorts the values in alphabetical order, eliminates all of the duplicate lines, and then filters the result to include only lines that contain the string cheese. In order to perform this same task in a GUI, you’d need to either cut-and-paste your data between different programs that accomplished the individual tasks (one program to select parts of the data, another to sort it), or you’d have to find a single program that supported all of the desired features.
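To see that pipeline in action without hunting down a real data file, here’s a minimal sketch. The contents of data.txt are invented for illustration (the original file isn’t shown here), and everything happens in a scratch directory so nothing of yours gets overwritten.

```shell
# Work in a scratch directory so the made-up data.txt can't clobber a real file.
workdir=$(mktemp -d)
cd "$workdir"

# Invented sample data: name, food, category.
printf '%s\n' \
  'alice,cheddar cheese,dairy' \
  'bob,apple,fruit' \
  'carol,cheddar cheese,dairy' \
  'dave,goat cheese,dairy' > data.txt

# Keep fields 2-3, sort them, drop duplicate lines, keep lines containing "cheese".
result=$(cut -d ',' -f 2-3 data.txt | sort | uniq | grep 'cheese')
printf '%s\n' "$result"
```

Each stage only knows about lines of text coming in and lines of text going out, which is what makes the commands so easy to chain.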
The UNIX command-line
Nearly all operating systems provide a command-line interface of some sort or other. (I cut my teeth on the MS-DOS command-line, an analog of which is still available on most Windows machines as cmd.exe.) When most people think of the command-line in a contemporary context, they’re thinking of the UNIX command-line.
SIDEBAR: ‘Wait, what’s an “operating system”?’ I hear you ask. Good question! An operating system is the software that runs on your computer that provides the basic functionality necessary for other programs to function—everything from interfaces to your computer’s components, like its hard drive or peripherals (mouse, printer, etc.) to things that you see as a user, like the user interface. You probably use multiple operating systems throughout the day: Windows or OS X on your computer, Android or iOS on your mobile phone.
UNIX is a family of common operating systems that originated in the 1970s, and are still frequently used today, in particular a clone/derivative called Linux. OS X is itself a derivative of UNIX (with a fancy proprietary GUI).
When UNIX was first being developed, and subsequently in its history, the programmers on the project came up with a series of command-line tools to help them accomplish tasks and solve common problems. It turns out that programmers, like other kinds of writers, deal with text a LOT, and so many of the tools they developed deal with text: filtering text, sorting text, modifying text. Over time, many other programmers have contributed to these tools, adding functionality and fixing bugs. For my money, they’re some of the most useful things that writers, researchers and computer users in general can learn.
You can also use these tools creatively. So we’re going to learn how to use them.
How do computers think about text?
Text can be divided into any number of different, overlapping units (document, page, section, subsection, chapter, clause, sentence, ascender, descender, act, stanza, syllable, foot, etc.) but only some of these are easy for computers to work with. (It’s harder than you think to teach a computer what a “sentence” is, for example.)
The two most obvious units of text in a computer are:
- the character, i.e., the byte (or series of bytes) that represents a single element of written language (e.g., A through Z in English, any one of many glyphs in Chinese, etc.)
- the file, i.e., an ordered collection of characters
Somewhere in between these two is the line, a formal unit of text that has been part of written language from the beginning. (Here’s an example of Cuneiform, an ancient writing system, written with lines.) The line arises in written text because writing transcribes speech, which is a one-dimensional medium, onto two-dimensional surfaces (paper, clay, stone, etc.). Line breaks in text are, fundamentally, a way of using up all of the space allotted on a surface.
But line breaks also serve syntactic, semantic, and metrical functions, as in poetry:
Rose, harsh rose,
marred and with stint of petals,
meagre flower, thin,
spare of leaf,

more precious
than a wet rose
single on a stem --
you are caught in the drift.

Stunted, with small leaf,
you are flung on the sand,
you are lifted
in the crisp sand
that drives in the wind.

Can the spice-rose
drip such acrid fragrance
hardened in a leaf?
In computer text, the line is often used as a “record marker.” This is how a text file can be used as a rudimentary database, with one “record” per line. (For example, here are some NBA stats, written in plain text format, with one line per player.)
Perhaps because of these parallelisms (text layout/poetic structure/database structure), many programs that operate on text use the line as their fundamental unit—especially those in UNIX (coming right up). The programs that we write in this class will do the same.
If you’re using OS X or Linux, then you’re good to go! You need to launch a “terminal emulator” application. On OS X, a terminal emulator comes pre-installed on your computer. Go to Applications > Utilities on your hard drive and select “Terminal” (or you can just search for “Terminal” with Spotlight; it’ll be the first result, most likely). If you’re using Linux, you may have to consult the instructions for your particular distribution. But there should be somewhere on your computer a prominently displayed application called “Terminal” (or something similar). If you click on it and a $ prompt appears, you’re probably on the right track.
Windows doesn’t come pre-installed with UNIX command-line tools. You’ll have to install them yourself. I suggest the “GNU on Windows” package, which installs a number of helpful UNIX programs and utilities. Download it here.
The program you use on Windows to access the command line is called “Command Prompt.” You can access this program from the Start menu, under “All Programs > Accessories > Command Prompt.” The command prompt looks different on Windows from how it looks on UNIX-ish machines; for one, the prompt will end in > instead of $. But everything else in this tutorial should look and work fine.
When you’ve successfully reached the command line (another line!), you should see something like this:
This is the “prompt” (because it “prompts” you to do something). It’s telling you your username, the server you’re logged into, and the current directory. More on that later.
Keystrokes you should know
The keys you type on the command-line generally do what you think they will: they print the character you typed to the screen. The command-line also has a number of special keystrokes that have particular meanings. Two are important to know from the very beginning.
Ctrl+D signals to a program that is waiting for input that you are done typing. For example, the sort command, when run on its own, will wait for you to type in the lines that you want to sort. Hit Ctrl+D to tell sort that you’ve entered your last line. (This is Ctrl+Z on Windows.)
Ctrl+C signals to a program that you want it to stop doing whatever it’s doing immediately, even if it hasn’t yet completed its task. If, for example, you accidentally used the wrong file in an operation and you want the operation to stop (because it’s printing out too many lines, or the wrong lines), hit Ctrl+C.
- Ctrl+D: “I’m done typing things in. kthxbye.” (This is Ctrl+Z on Windows.)
- Ctrl+C: “You’re doing something I don’t like. Please stop.”
Notably, Ctrl+D is also used to signal to the command-line that you’re done
entering commands. At the prompt, hit Ctrl+D to log out. (You can
accomplish the same thing by typing
exit and hitting return.)
Your first UNIX commands
First off, we’re going to create a directory, so that you can find it later and you don’t risk overwriting something:
$ mkdir workshop
$ cd workshop
(Don’t type the $! That’s just there to indicate that those commands should be typed at the command line.)
The mkdir command means “make directory”; “directory” is just UNIX-speak for “folder.” When you’re using the command line, there’s one directory on your machine that is considered your “current” directory, i.e., the directory you’re doing stuff in. The cd (“change directory”) command makes the directory you give it (workshop in this case) the current directory.
Example text files
For the purposes of this chapter, it will be helpful to have some plain text files readily available to you. Download this zip file and extract it somewhere on your computer. Copy the files into the directory that you just made and return to the command line. If you type ls, you should see a list of the files that you just downloaded.
Kinds of commands
There are (broadly) two kinds of commands in UNIX: commands that work on lines of input/output, and commands that operate on files and directories. The mkdir and cd commands are examples of the latter. We’re primarily concerned with the former. Let’s start with cat:
$ cat
(Make sure to hit “return” after you type cat.) Now type some text. After you enter a line, cat will print out the same line. It’s the simplest text filter possible (one rule: let everything through).
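If you’d rather not type at a waiting program, you can hand cat its lines through a pipe instead (pipes get a fuller treatment further on). This is only a sketch: printf stands in for your keyboard, and echoed is just a made-up variable name to hold the result.

```shell
# printf plays the role of the keyboard: it sends two lines to cat.
# When printf finishes, cat sees the end of its input, just as if
# you had pressed Ctrl+D.
echoed=$(printf 'first line\nsecond line\n' | cat)

# Show what cat let through (everything, unchanged).
printf '%s\n' "$echoed"
```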
When you’re done with
cat, press Ctrl+D. Let’s try something more
interesting, like grep:
$ grep foo
Now type some lines of text. Try typing, for example,
I like drink and then
I like food. The
grep command only prints out lines that “match” the string
of characters that follow the command (
foo, in this case). Let’s try it
again, this time with a different “pattern”:
$ grep you
If we cut and paste the poem above (“Sea Rose”) into the terminal application the resulting output would look like this:
you are caught in the drift.
you are flung on the sand,
you are lifted
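Here’s the same idea as a sketch you can run in one go, with the example lines piped in rather than typed. The variable names are made up for the example.

```shell
# Send the two example lines to grep; only lines containing "foo" survive.
matches=$(printf 'I like drink\nI like food\n' | grep foo)
printf '%s\n' "$matches"

# The same kind of filter applied to three lines of the poem, looking for "you".
poem_matches=$(printf 'single on a stem --\nyou are caught in the drift.\nyou are lifted\n' | grep you)
printf '%s\n' "$poem_matches"
```

Notice that grep doesn’t care where in the line the pattern shows up: “food” matches because it contains “foo.”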
head and tail print out a certain number of lines from the beginning of a file and the end of a file, respectively. If you type in the following:
$ tail -3
… and then paste in the poem above, you’d get:
Can the spice-rose
drip such acrid fragrance
hardened in a leaf?
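A quick way to convince yourself what head and tail do is to feed them numbered lines. This sketch uses the spelled-out -n form of the option (tail -n 3), which is the form the POSIX standard describes; tail -3 is a shorthand that most systems accept. The five input lines and the variable names are invented for the example.

```shell
# Five short lines stand in for a longer text.
lines=$(printf 'one\ntwo\nthree\nfour\nfive\n')

# head keeps lines from the top, tail keeps lines from the bottom.
first_two=$(printf '%s\n' "$lines" | head -n 2)
last_three=$(printf '%s\n' "$lines" | tail -n 3)

printf 'head -n 2:\n%s\n' "$first_two"
printf 'tail -n 3:\n%s\n' "$last_three"
```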
Structure of UNIX commands
UNIX commands generally follow this structure:
name_of_command [options] arguments
(The “[options]” part of that schema is usually one or more characters preceded by hyphens. The -3 in tail -3 tells it to print the last three lines; grep takes an option, -i, which tells it to be case insensitive.)
You can think of UNIX commands like commands in English, but with a funny syntax: “Fetch thoroughly my slippers!”
You can figure out which options and arguments a command supports by typing
man name_of_command at the command line.
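To make the “name, options, arguments” shape concrete, here’s a small sketch that runs the same grep command with and without the -i option. The two input lines come from the poem; the variable names are made up.

```shell
# Two lines from the poem, one of which starts with a capital S.
poem_lines=$(printf 'Stunted, with small leaf,\nspare of leaf,\n')

# No options: grep is case-sensitive, so "stunted" does not match "Stunted".
# (grep reports "no match" as a failure, so "|| true" keeps a strict shell happy.)
plain=$(printf '%s\n' "$poem_lines" | grep stunted || true)

# Same command and argument, plus the -i option: now case is ignored.
with_i=$(printf '%s\n' "$poem_lines" | grep -i stunted)

printf 'without -i: [%s]\n' "$plain"
printf 'with -i:    [%s]\n' "$with_i"
```

The option changes how the command behaves; the argument (stunted) is still the thing the command acts on.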
Sorting and piping
The sort command takes every line of input and prints them back out, in alphabetical order. Try:
$ sort
… paste in the poem, and hit Ctrl+D. You’ll get something like:
Can the spice-rose
drip such acrid fragrance
hardened in a leaf?
in the crisp sand
marred and with stint of petals,
meagre flower, thin,
more precious
Rose, harsh rose,
SEA ROSE
single on a stem --
spare of leaf,
Stunted, with small leaf,
than a wet rose
that drives in the wind.
you are caught in the drift.
you are flung on the sand,
you are lifted
(Why do you think there are so many blank lines at the beginning?)
So far, we’ve just been sending these commands input (by typing, or cutting and pasting), then letting the output be printed back to the screen. UNIX provides a means by which we can send the output of one program as the input of another program. We do this using the pipe character (|, usually typed with Shift+backslash).
$ grep leaf | sort
… takes lines from input, displays only those that contain the string “leaf,” and then passes them to sort, which displays those lines in alphabetical order. The output from the poem:
hardened in a leaf?
spare of leaf,
Stunted, with small leaf,
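Here’s a runnable sketch of that two-stage pipe, using four lines of the poem as input instead of pasting the whole thing. I’ve picked all-lowercase lines on purpose, since how sort orders capital letters can vary with your system’s language settings. The variable names are made up.

```shell
# Four lines of the poem, in their original order.
poem=$(printf 'spare of leaf,\nmore precious\nhardened in a leaf?\nyou are lifted\n')

# grep keeps the lines containing "leaf"; sort then alphabetizes what's left.
sorted_leaves=$(printf '%s\n' "$poem" | grep leaf | sort)
printf '%s\n' "$sorted_leaves"
```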
The cut command breaks up a line of text into its constituent parts. Let’s say we had a text file full of data, where each line contained multiple “fields” separated by commas. (This is a “comma-separated value” file, a common way of exporting data from a spreadsheet program like Excel, especially if you want to share that data with someone who doesn’t have Excel.) Here’s what the data looks like:
Geraldine,New York,welding
Roberto,Tennessee,birdwatching
Dana,Wyoming,poetry
Priya,Maine,rock climbing
The cut command allows us to easily process this text and “select” only particular items from each line. Here’s how, for example, we could print out just the “state” from each line:
$ cut -d , -f 2
Run that command, then cut-and-paste in the data from above. Here’s the output you’d get:
New York
Tennessee
Wyoming
Maine
The cut command takes two options, both of which themselves have parameters. (This is confusing, but stick with me here.) The -d option is followed by the “delimiter” string (i.e., what you want to split the line on, which is the comma in the example above); the -f option is followed by the number of the field you want.
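To see how changing the number after -f changes what you get, here’s a sketch that runs cut on the same four records with two different field numbers. The records are the ones from the text; the variable names are made up.

```shell
# The sample records, one per line.
records=$(printf '%s\n' \
  'Geraldine,New York,welding' \
  'Roberto,Tennessee,birdwatching' \
  'Dana,Wyoming,poetry' \
  'Priya,Maine,rock climbing')

# -d , splits each line on commas; -f picks which piece to keep.
names=$(printf '%s\n' "$records" | cut -d , -f 1)
hobbies=$(printf '%s\n' "$records" | cut -d , -f 3)

printf 'field 1:\n%s\n' "$names"
printf 'field 3:\n%s\n' "$hobbies"
```

Fields are counted from 1, not 0, and a field can contain spaces (like “rock climbing”) as long as it doesn’t contain the delimiter.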
Words in a line of text also have a “delimiter” between them: a space character. So we can use cut to transform some text by selecting only, say, the first word of each line. For example, try this command:
$ cut -d ' ' -f 1
And paste in ‘Sea Rose’. The output:
SEA
Rose,
marred
meagre
spare
more
than
single
you
Stunted,
you
you
in
that
Can
drip
hardened
SIDEBAR: What’s with the ' '? Why are those quotes just hanging out like that? That’s a good question! It turns out that the space character has a special meaning on the UNIX command-line: you use one or more space characters to separate commands and parameters from each other. If we want to pass a literal space character to a command as a parameter, we need some way to set that space character apart from its “normal” use. We do this by putting the character in quotes (' '). Quoting is extremely important to computer programming and this isn’t the last time you’ll see it, not by a long shot!
You can use the -f option to specify a range of values, or to give a comma-separated list of values to include. In our sample text files, there’s a CSV file called NBA_2015_games.csv that contains one line for every NBA game played in 2015. The names of the teams that competed in each game are in the third and fifth fields of the file. To get just those fields, use the cut command like so:
$ cut -d , -f 3,5 <NBA_2015_games.csv
The first few lines of the output will look like this:
Houston Rockets,Los Angeles Lakers
Orlando Magic,New Orleans Pelicans
Dallas Mavericks,San Antonio Spurs
Brooklyn Nets,Boston Celtics
Milwaukee Bucks,Charlotte Hornets
Detroit Pistons,Denver Nuggets
Philadelphia 76ers,Indiana Pacers
Minnesota Timberwolves,Memphis Grizzlies
Washington Wizards,Miami Heat
Chicago Bulls,New York Knicks
Los Angeles Lakers,Phoenix Suns
Oklahoma City Thunder,Portland Trail Blazers
Golden State Warriors,Sacramento Kings
Atlanta Hawks,Toronto Raptors
Houston Rockets,Utah Jazz
Use man cut to find out how to use “ranges” as a parameter to the -f option. Then use cut to print out, for each line of “Sea Rose,” the second and third words on the line.
The tr command “translates” a set of characters in the original line to another set of characters. The source character set is the first parameter, and the second parameter is the characters you want them to be translated to. For example:
$ tr aeiou eioua
hello there, how are you?
hillu thiri, huw eri yua?
You can specify a range of characters with a hyphen:
$ tr a-z A-Z
hello there, how are you?
HELLO THERE, HOW ARE YOU?
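The “translate” idea isn’t limited to letters. As a sketch, here’s tr applied to one line of the poem twice: once to change case, and once to turn every space into a newline, which is a handy way to put each word on its own line. The variable names are made up, and the a-z range assumes plain English text.

```shell
# Map every lowercase letter to its uppercase counterpart.
shouted=$(printf 'you are lifted\n' | tr a-z A-Z)
printf '%s\n' "$shouted"

# Map each space to a newline, so every word lands on its own line.
words=$(printf 'you are lifted\n' | tr ' ' '\n')
printf '%s\n' "$words"
```

Once each word is on its own line, line-oriented tools like sort and grep can work on words instead of whole lines.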
Of course, you can include more than one command in a “pipeline”:
$ sort | tail -6 | tr aeiou e
… which, if you send it our venerable poem, outputs the following:
Stented, weth smell leef,
then e wet rese
thet dreves en the wend.
yee ere ceeght en the dreft.
yee ere fleng en the send,
yee ere lefted
What happened? The input went to sort, which sorted the lines in alphabetical order; tail -6 grabbed only the last six lines of the output of sort and sent those lines through to tr, which replaced every lowercase vowel with e. (You can build pipelines of arbitrary length using this technique.)
Using files (“redirection”)
So far, we’ve been building “programs” that can only read from the keyboard (or from cut-and-paste) and can only send their output to the screen. What if we want to read from an existing file, and then output to a file?
No complicated code is needed. UNIX provides a method for us. It’s called “redirection.” Here’s how it works:
$ sort <sea_rose.txt
The < character means “instead of taking input from the keyboard, take input from this file.” Likewise:
$ grep were >some_file.txt
> means “instead of sending your output to the screen, send it to this file.” You can use them both at the same time:
$ grep were <sea_rose.txt >some_file.txt
… in which case some_file.txt will end up with every line from sea_rose.txt that contains the string “were.” (If the output file doesn’t already exist, it will be created. If it does already exist, it will be overwritten, so be careful!)
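The “be careful” part is worth seeing for yourself. This sketch redirects output to the same file twice and checks what’s in it each time. The file names fruit.txt and out.txt are made up, and everything happens in a scratch directory so none of your real files are at risk.

```shell
# Work in a scratch directory so no real files get overwritten.
workdir=$(mktemp -d)
cd "$workdir"

# A made-up input file with three lines.
printf 'pear\napple\nfig\n' > fruit.txt

# Read from fruit.txt instead of the keyboard; write to out.txt instead of the screen.
sort < fruit.txt > out.txt
first=$(cat out.txt)

# Redirecting to out.txt again replaces its old contents completely.
grep fig < fruit.txt > out.txt
second=$(cat out.txt)

printf 'after sort:\n%s\n' "$first"
printf 'after grep:\n%s\n' "$second"
```

After the second command, the sorted lines are gone: out.txt holds only what grep wrote.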
Sorting lines randomly
For the purposes of creating interesting poetic juxtapositions, you might find it valuable to be able to sort the lines of a text file randomly. Unfortunately, there’s no single UNIX command for this that is available on every system. (GNU systems provide shuf, but you can’t count on it being installed everywhere.) You can, however, use the following very small Python program to achieve the same results:
$ python -c "import sys,random;x=sys.stdin.readlines();random.shuffle(x);sys.stdout.write(''.join(x))" <sea_rose.txt
Copy and paste that entire line. You’ll get output that looks like this:
you are flung on the sand,
spare of leaf,
in the crisp sand
drip such acrid fragrance
marred and with stint of petals,
than a wet rose
you are lifted
single on a stem --
you are caught in the drift.
more precious
Stunted, with small leaf,
that drives in the wind.
meagre flower, thin,
Rose, harsh rose,
hardened in a leaf?
Can the spice-rose
For now, we won’t worry about how the Python program works (but it’ll all make sense by the end of this course). You can use this small Python program with arbitrary input, and you can pipe and redirect its output however you please. So, for example, to get three randomly selected lines from sea_rose.txt:
$ python -c "import sys,random;x=sys.stdin.readlines();random.shuffle(x);sys.stdout.write(''.join(x))" <sea_rose.txt | head -3
This line might produce the following output:
spare of leaf,
in the crisp sand
more precious
Pasting different text files together
You can use the paste command to combine several text files together, “pasting” them next to each other as though they were columns in a spreadsheet. To use the paste command, type paste on the command line followed by a list of filenames whose contents you’d like to juxtapose. To demonstrate, let’s create a file that has 100 random words from sowpods.txt:
$ python -c "import sys,random;x=sys.stdin.readlines();random.shuffle(x);sys.stdout.write(''.join(x))" <sowpods.txt | head -100 >random_words.txt
Now, let’s create a file that has just the first word from the first 100 lines of Shakespeare’s sonnets:
$ cut -d ' ' -f 1 <sonnets.txt | head -100 >first_words.txt
Now, we can combine the two with paste:
$ paste -d ' ' first_words.txt random_words.txt
(The -d option specifies what character to use when gluing the files back together, just as the -d option for cut specifies where to break lines apart.) You’ll get output that looks like this:
 quail
From rehospitalised
That aconitums
But laundries
His endship
But shocked
Feed'st sweying
Making scorns
Thy jadedness
Thou thistly
And grandsir
Within extensions
And obconical
 extremals
 goniatite
 aeromancy
When brunts
And portance
Thy contrasexuals
Will ultracautious
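If you want to watch paste work on something small enough to check by eye, here’s a sketch with two three-line files. The files reuse the names from above but hold only a few made-up lines, and they live in a scratch directory so your real first_words.txt and random_words.txt stay untouched.

```shell
# Work in a scratch directory so the real word lists aren't overwritten.
workdir=$(mktemp -d)
cd "$workdir"

# Two tiny stand-in files, three lines each.
printf 'From\nThat\nBut\n' > first_words.txt
printf 'quail\nlaundries\nshocked\n' > random_words.txt

# With no -d option, paste glues corresponding lines together with a tab.
tabbed=$(paste first_words.txt random_words.txt)

# With -d ' ', it uses a space instead.
spaced=$(paste -d ' ' first_words.txt random_words.txt)

printf '%s\n' "$spaced"
```

Line 1 of the first file is joined to line 1 of the second, line 2 to line 2, and so on.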
Try creating a few more lists of 100 random words, like so:
$ python -c "import sys,random;x=sys.stdin.readlines();random.shuffle(x);sys.stdout.write(''.join(x))" <sowpods.txt | head -100 >random_words2.txt
$ python -c "import sys,random;x=sys.stdin.readlines();random.shuffle(x);sys.stdout.write(''.join(x))" <sowpods.txt | head -100 >random_words3.txt
Now, you can use paste to create a series of 100 unusual word combinations:
$ paste random_words.txt random_words2.txt random_words3.txt
You’ll get output like this:
ultracautious	provisory	avowedly
autotomic	clobber	lethargizes
processionary	theftless	tapernesses
floozy	macintosh	rootless
basicity	drearest	splents
fadeaways	denigrated	packaging
kissably	paralogisms	luv
parametrization	heketara	avas
restrove	sdaining	frostline
forecourses	deracination	hematuria
councillor	quark	narial
presentational	bezzazzes	haphazardnesses
matronizing	frontierswoman	re
grants	fain	mattoids
womanlike	journeyer	hexylene
pugh	militate	filed
pingrass	sluff	counterpoint
eyetooth	reavowing	file
Congratulations! You just made your first weird computational poem.
You may have noticed at this point that you can’t change the location of the cursor on the command line by just clicking where you want the cursor to go. No, that’d be too easy. The regular keystrokes you use to move to the beginning or end of the line won’t work either. But there are some special Terminal-specific keystrokes you can learn to make your life easier. Here are a few:
- Up/Down: Moves between entries in your command line “history” (i.e., a list of all the commands you’ve typed in this session).
- Ctrl+A: Moves to the beginning of the line.
- Ctrl+E: Moves to the end of the line.
- Ctrl+U: Erases everything from the current position of the cursor to the beginning of the line.
- Tab: Attempts to “auto-complete” your current word. This will search the current directory for a filename or command beginning with the characters you’ve already typed. Hit Tab again to cycle through possible alternatives.
Other helpful commands
- wc -w foo will print out the number of words in the file named foo. (wc -l will count the number of lines; wc -c will count the number of bytes, which is the same as the number of characters for plain ASCII text.)
- curl -s http://some.url/ fetches the web page at the given URL and prints its content to standard output. (We’ll be using this command extensively!)
- The ls command will give you a list (“ls” stands for “list”) of files in your current directory. If you give it a parameter, it will give you a listing of the files in the directory you gave to it. (On OS X, for example, try
- Type pwd to find out what your current directory is.
- The cp command will make a copy (“cp” for “copy”) of a file. It takes two parameters: the first is the source file name, the second is the destination file name.
- Use rm foo to delete the file named foo. (Note: this is permanent! The file won’t go to your Trash, so be careful.)
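Here’s a small sketch of wc at work on a two-line file. The file foo is created in a scratch directory with two lines from the poem, and the variable names are made up. Two details are worth knowing: when you give wc a filename, it prints the name after the count, so this sketch reads the file with < to get just the number; and -c counts bytes, which for plain ASCII text like this matches the number of characters.

```shell
# Work in a scratch directory and create a two-line file named foo.
workdir=$(mktemp -d)
cd "$workdir"
printf 'you are caught in the drift.\nyou are lifted\n' > foo

# Some systems pad the count with spaces; tr -d '[:space:]' strips that padding.
line_count=$(wc -l < foo | tr -d '[:space:]')
word_count=$(wc -w < foo | tr -d '[:space:]')
byte_count=$(wc -c < foo | tr -d '[:space:]')

printf 'lines: %s, words: %s, bytes: %s\n' "$line_count" "$word_count" "$byte_count"
```

The byte count includes the invisible newline character at the end of each line, which is why it comes out a little higher than you’d get by counting letters, spaces, and punctuation alone.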
- UNIX tutorial for beginners: humane with many helpful illustrations
- egrep for linguists: not just about egrep, but a number of other UNIX text processing utilities as well!