Linux Text Processing Commands and Programs


Below you’ll find some useful Linux tools and commands you can use to manipulate and process text.

1. less

In my opinion this is the best tool for viewing text files. It lets you easily scroll backward and forward (by line or by page), and it also comes with search and other functions.

Use:

less file.txt

to open the file. Then use /pattern to search for a regex pattern and n to move to the next match.

2. cat

This command is used for several purposes. First of all you can use it to read from stdin and output to a file, like this:

cat > file1.txt

This creates a file directly from the command line (press Ctrl+D when you are done).

You can also use it to copy the contents of one file to another:

cat file1.txt > file2.txt

Or to concatenate the contents of two or more files:

cat file1.txt file2.txt > file3.txt

3. tr

The translate command replaces one set of characters with another in the text it reads from standard input. For example, the command below transforms all lowercase letters from file1.txt into uppercase letters, and then sends the output to file2.txt:

tr "a-z" "A-Z" < file1.txt > file2.txt

You can also pipe the output into less instead of saving it to a file:

tr "a-z" "A-Z" < file1.txt | less

Now suppose you have a file with all of Shakespeare’s works (this example is taken from Prof. Dan Jurafsky, Stanford). You can download it here as a .txt.

Say we want to see each word of that file on a separate line. We can achieve this by translating all non-alphabetic characters into newline characters (‘\n’), like this:

tr -cs "a-zA-Z" "\n" < shakes.txt | less

The -c option takes the complement of “a-zA-Z”, which is basically all non-alphabetic characters. The -s option squeezes runs of repeated output characters into a single one, so we don’t end up with several blank lines in a row.
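To see what -c and -s do on a small input (the sample phrase below is just an illustration, not taken from the Shakespeare file):

```shell
# The comma and the spaces are all non-letters, so -c maps each of
# them to '\n'; -s then squeezes the resulting run of newlines
# between "be," and "or" into a single newline.
printf 'to be, or not to be' | tr -cs "a-zA-Z" "\n"
```

This prints the six words to, be, or, not, to, be, each on its own line.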

4. sort

This is another useful command when it comes to text processing. Still referring to the previous example we can now sort all the words of Shakespeare’s work in lexicographical order, like this:

tr -cs "a-zA-Z" "\n" < shakes.txt | sort | less

5. uniq

As you’ll notice, the command above is not very useful on its own, because we get a ton of repeated words. For instance, the first few thousand lines are only the word ‘a’. To solve the problem we can use the uniq command, which eliminates adjacent duplicate lines (sorting first ensures all duplicates are adjacent):

tr -cs "a-zA-Z" "\n" < shakes.txt | sort | uniq | less

So far the words ‘The’ and ‘the’ are being counted separately. To count them together we just need to convert all uppercase letters to lowercase:

tr -cs "a-zA-Z" "\n" < shakes.txt | tr "A-Z" "a-z" | sort | uniq | less

We can now use the -c option of the uniq program to count the number of times that each word appears, using the sort program again to sort the results:

tr -cs "a-zA-Z" "\n" < shakes.txt | tr "A-Z" "a-z" | sort | uniq -c | sort -n -r | less

The -n option of sort means numeric sort (instead of the default lexicographic sort), and -r means reverse order, so the most frequent words appear first.
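A quick way to see the difference the two options make (the three numbers below are made up for illustration):

```shell
# A plain lexicographic sort would order these as 10, 2, 33,
# because it compares character by character. Numeric reverse
# sort puts the largest value first instead.
printf '10\n2\n33\n' | sort -n -r
# prints 33, 10, 2, one per line
```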

6. wc

Use the wc command to display the number of lines, words, and characters in a file.
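For example, on a small two-line file (the name demo.txt is just a placeholder):

```shell
# Create a small sample file to count.
printf 'hello world\nsecond line here\n' > demo.txt

wc demo.txt      # prints lines, words, and bytes, then the file name
wc -l demo.txt   # lines only: 2
wc -w demo.txt   # words only: 5
wc -c demo.txt   # bytes only (use -m to count characters instead)
```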

