Commands

4 Essential and Practical Usage of Cut Command in Linux

The cut command in Linux allows removing data on each line of a file. Read this tutorial to know how to use it effectively to process text or CSV data file.

Sylvain Leroux
Sylvain Leroux

Table of Contents

The cut command is the canonical tool to remove “columns” from a text file. In this context, a “column” can be defined as a range of characters or bytes identified by their physical position on the line, or a range of fields delimited by a separator.

I have written about using AWK commands earlier. In this detailed guide, I’ll explain four essential and practical examples of cut command in Linux that will help you big time.

4 Practical examples of Cut command in Linux

If you prefer, you can watch this video explaining the same practical examples of cut command that I have listed in the article.

1. Working with character ranges

When invoked with the -c command line option, the cut command will remove character ranges.

Like any other filter, the cut command does not change the input file in place but it will copy the modified data to its standard output. It is your responsibility to redirect the command output to a file to save the result or to use a pipe to send it as input to another command.

If you’ve downloaded the sample test files used in the video above, you can see the BALANCE.txt data file, coming straight out of an accounting software my wife is using at her work:

sh$ head BALANCE.txt
ACCDOC    ACCDOCDATE    ACCOUNTNUM ACCOUNTLIB              ACCDOCLIB                        DEBIT          CREDIT
4         1012017       623477     TIDE SCHEDULE           ALNEENRE-4701-LOC                00000001615,00
4         1012017       445452     VAT BS/ENC              ALNEENRE-4701-LOC                00000000323,00
4         1012017       4356       PAYABLES                ALNEENRE-4701-LOC                               00000001938,00
5         1012017       623372     ACCOMODATION GUIDE      ALNEENRE-4771-LOC                00000001333,00
5         1012017       445452     VAT BS/ENC              ALNEENRE-4771-LOC                00000000266,60
5         1012017       4356       PAYABLES                ALNEENRE-4771-LOC                               00000001599,60
6         1012017       4356       PAYABLES                FACT FA00006253 - BIT QUIROBEN                  00000001837,20
6         1012017       445452     VAT BS/ENC              FACT FA00006253 - BIT QUIROBEN   00000000306,20
6         1012017       623795     TOURIST GUIDE BOOK      FACT FA00006253 - BIT QUIROBEN   00000001531,00

This is a fixed-width text file since the data fields are padded with a variable number of spaces to ensure they are displayed as a nicely aligned table.

As a corollary, a data column always starts and ends at the same character position on each line. There is a little pitfall though: despite its name, the cut command actually requires you to specify the range of data you want to keep, not the range you want to remove. So, if I need only the ACCOUNTNUM and ACCOUNTLIB columns in the data file above, I would write that:

sh$ cut -c 25-59 BALANCE.txt | head
ACCOUNTNUM ACCOUNTLIB
623477     TIDE SCHEDULE
445452     VAT BS/ENC
4356       /accountPAYABLES
623372     ACCOMODATION GUIDE
445452     VAT BS/ENC
4356       PAYABLES
4356       PAYABLES
445452     VAT BS/ENC
623795     TOURIST GUIDE BOOK

What’s a range?

As we have just seen it, the cut command requires we specify the range of data we want to keep. So, let’s introduce more formally what is a range: for the cut command, a range is defined by a starting and ending position separated by a hyphen. Ranges are 1-based, that is the first item of the line is the item number 1, not 0. Ranges are inclusive: the start and end will be preserved in the output, as well as all characters between them. It is an error to specify a range whose ending position is before (“lower”) than its starting position. As a shortcut, you can omit the start or end value as described in the table below:

  • a-b: the range between a and b (inclusive)
  • a: equivalent to the range a-a
  • -b: equivalent to 1-a
  • b-: equivalent to b-∞

The cut commands allow you to specify several ranges by separating them with a comma. Here are a couple of examples:

# Keep characters from 1 to 24 (inclusive)
cut -c -24 BALANCE.txt

# Keep characters from 1 to 24 and 36 to 59 (inclusive)
cut -c -24,36-59 BALANCE.txt

# Keep characters from 1 to 24, 36 to 59 and 93 to the end of the line (inclusive)
cut -c -24,36-59,93- BALANCE.txt

One limitation (or feature, depending on the way you see it) of the cut command is that it will never reorder the data. So, the following command will produce exactly the same result as the previous one, despite the ranges being specified in a different order:

cut -c 93-,-24,36-59 BALANCE.txt

You can check that easily using the diff command:

diff -s <(cut -c -24,36-59,93- BALANCE.txt) \
              <(cut -c 93-,-24,36-59 BALANCE.txt)
Files /dev/fd/63 and /dev/fd/62 are identical

Similarly, the cut command never duplicates data:

# One might expect that could be a way to repeat
# the first column three times, but no...
cut -c -10,-10,-10 BALANCE.txt | head -5
ACCDOC
4
4
4
5

Worth mentioning there was a proposal for a -o option to lift those two last limitations, allowing the cut utility to reorder or duplicate data. But this was rejected by the POSIX committee“because this type of enhancement is outside the scope of the IEEE P1003.2b draft standard.”

As of myself, I don’t know any cut version implementing that proposal as an extension. But if you do, please, share that with us using the comment section!

2. Working with byte ranges

When invoked with the -b command line option, the cut command will remove byte ranges.

At first sight, there is no obvious difference between character and byte ranges:

sh$ diff -s <(cut -b -24,36-59,93- BALANCE.txt) \
              <(cut -c -24,36-59,93- BALANCE.txt)
Files /dev/fd/63 and /dev/fd/62 are identical

That’s because my sample data file is using the US-ASCII character encoding (“charset”) as the file -i command can correctly guess it:

sh$ file -i BALANCE.txt
BALANCE.txt: text/plain; charset=us-ascii

In that character encoding, there is a one-to-one mapping between characters and bytes. Using only one byte, you can theoretically encode up to 256 different characters (digits, letters, punctuations, symbols, … ) In practice, that number is much lower since character encodings make provision for some special values (like the 32 or 65 control characters generally found). Anyway, even if we could use the full byte range, that would be far from enough to store the variety of human writing. So, today, the one-to-one mapping between characters and byte is more the exception than the norm and is almost always replaced by the ubiquitous UTF-8 multibyte encoding. Let’s see now how the cut command could handle that.

Working with multibyte characters

As I said previously, the sample data files used as examples for that article are coming from an accounting software used by my wife. It appends she updated that software recently and, after that, the exported text files were subtlely different. I let you try spotting the difference by yourself:

sh$ head BALANCE-V2.txt
ACCDOC    ACCDOCDATE    ACCOUNTNUM ACCOUNTLIB              ACCDOCLIB                        DEBIT          CREDIT
4         1012017       623477     TIDE SCHEDULE           ALNÉENRE-4701-LOC                00000001615,00
4         1012017       445452     VAT BS/ENC              ALNÉENRE-4701-LOC                00000000323,00
4         1012017       4356       PAYABLES                ALNÉENRE-4701-LOC                               00000001938,00
5         1012017       623372     ACCOMODATION GUIDE      ALNÉENRE-4771-LOC                00000001333,00
5         1012017       445452     VAT BS/ENC              ALNÉENRE-4771-LOC                00000000266,60
5         1012017       4356       PAYABLES                ALNÉENRE-4771-LOC                               00000001599,60
6         1012017       4356       PAYABLES                FACT FA00006253 - BIT QUIROBEN                  00000001837,20
6         1012017       445452     VAT BS/ENC              FACT FA00006253 - BIT QUIROBEN   00000000306,20
6         1012017       623795     TOURIST GUIDE BOOK      FACT FA00006253 - BIT QUIROBEN   00000001531,00

The title of this section might help you in finding what has changed. But, found or not, let see now the consequences of that change:

sh$ cut -c 93-,-24,36-59 BALANCE-V2.txt
ACCDOC    ACCDOCDATE    ACCOUNTLIB              DEBIT          CREDIT
4         1012017       TIDE SCHEDULE            00000001615,00
4         1012017       VAT BS/ENC               00000000323,00
4         1012017       PAYABLES                                00000001938,00
5         1012017       ACCOMODATION GUIDE       00000001333,00
5         1012017       VAT BS/ENC               00000000266,60
5         1012017       PAYABLES                                00000001599,60
6         1012017       PAYABLES                               00000001837,20
6         1012017       VAT BS/ENC              00000000306,20
6         1012017       TOURIST GUIDE BOOK      00000001531,00
19        1012017       SEMINAR FEES            00000000080,00
19        1012017       PAYABLES                               00000000080,00
28        1012017       MAINTENANCE             00000000746,58
28        1012017       VAT BS/ENC              00000000149,32
28        1012017       PAYABLES                               00000000895,90
31        1012017       PAYABLES                                00000000240,00
31        1012017       VAT BS/DEBIT             00000000040,00
31        1012017       ADVERTISEMENTS           00000000200,00
32        1012017       WATER                   00000000202,20
32        1012017       VAT BS/DEBIT            00000000020,22
32        1012017       WATER                   00000000170,24
32        1012017       VAT BS/DEBIT            00000000009,37
32        1012017       PAYABLES                               00000000402,03
34        1012017       RENTAL COSTS            00000000018,00
34        1012017       PAYABLES                               00000000018,00
35        1012017       MISCELLANEOUS CHARGES   00000000015,00
35        1012017       VAT BS/DEBIT            00000000003,00
35        1012017       PAYABLES                               00000000018,00
36        1012017       LANDLINE TELEPHONE        00000000069,14
36        1012017       VAT BS/ENC                00000000013,83

I have copied above the command output in-extenso so it should be obvious something has gone wrong with the column alignment.

The explanation is the original data file contained only US-ASCII characters (symbol, punctuations, numbers and Latin letters without any diacritical marks)

But if you look closely at the file produced after the software update, you can see that new export data file now preserves accented letters. For example, the company named “ALNÉENRE” is now properly spelled whereas it was previously exported as “ALNEENRE” (no accent)

The file -i utility did not miss that change since it reports now the file as being UTF-8 encoded:

sh$ file -i BALANCE-V2.txt
BALANCE-V2.txt: text/plain; charset=utf-8

To see how are encoded accented letters in an UTF-8 file, we can use the hexdump utility that allows us to look directly at the bytes in a file:

# To reduce clutter, let's focus only on the second line of the file
sh$ sed '2!d' BALANCE-V2.txt
4         1012017       623477     TIDE SCHEDULE           ALNÉENRE-4701-LOC                00000001615,00
sh$ sed '2!d' BALANCE-V2.txt  | hexdump -C
00000000  34 20 20 20 20 20 20 20  20 20 31 30 31 32 30 31  |4         101201|
00000010  37 20 20 20 20 20 20 20  36 32 33 34 37 37 20 20  |7       623477  |
00000020  20 20 20 54 49 44 45 20  53 43 48 45 44 55 4c 45  |   TIDE SCHEDULE|
00000030  20 20 20 20 20 20 20 20  20 20 20 41 4c 4e c3 89  |           ALN..|
00000040  45 4e 52 45 2d 34 37 30  31 2d 4c 4f 43 20 20 20  |ENRE-4701-LOC   |
00000050  20 20 20 20 20 20 20 20  20 20 20 20 20 30 30 30  |             000|
00000060  30 30 30 30 31 36 31 35  2c 30 30 20 20 20 20 20  |00001615,00     |
00000070  20 20 20 20 20 20 20 20  20 20 20 0a              |           .|
0000007c

On the line 00000030 of the hexdump output, after a bunch of spaces (byte 20), you can see:

  • the letter A is encoded as the byte 41,
  • the letter L is encoded a the byte 4c,
  • and the letter N is encoded as the byte 4e.

But, the uppercase LATIN CAPITAL LETTER E WITH ACUTE (as it is the official name of the letter É in the Unicode standard) is encoded using the two bytes c3 89

And here is the problem: using the cut command with ranges expressed as byte positions works well for fixed length encodings, but not for variable length ones like UTF-8 or Shift JIS. This is clearly explained in the following non-normative extract of the POSIX standard:

Earlier versions of the cut utility worked in an environment where bytes and characters were considered equivalent (modulo <backspace> and <tab> processing in some implementations). In the extended world of multi-byte characters, the new -b option has been added.

Hey, wait a minute! I wasn’t using the -b option in the “faulty” example above, but the -c option. So, shouldn’t that have worked?!?

Yes, it should: it is unfortunate, but we are in 2018 and despite that, as of GNU Coreutils 8.30, the GNU implementation of the cut utility still does not handle multi-byte characters properly. To quote the GNU documentation, the -c option is “The same as -b for now, but internationalization will change that[… ]” — a mention that is present since more than 10 years now!

On the other hand, the OpenBSD implementation of the cut utility is POSIX compliant, and will honor the current locale settings to handle multi-byte characters properly:

# Ensure subseauent commands will know we are using UTF-8 encoded
# text files
openbsd-6.3$ export LC_CTYPE=en_US.UTF-8

# With the `-c` option, cut works properly with multi-byte characters
openbsd-6.3$ cut -c -24,36-59,93- BALANCE-V2.txt
ACCDOC    ACCDOCDATE    ACCOUNTLIB              DEBIT          CREDIT
4         1012017       TIDE SCHEDULE           00000001615,00
4         1012017       VAT BS/ENC              00000000323,00
4         1012017       PAYABLES                               00000001938,00
5         1012017       ACCOMODATION GUIDE      00000001333,00
5         1012017       VAT BS/ENC              00000000266,60
5         1012017       PAYABLES                               00000001599,60
6         1012017       PAYABLES                               00000001837,20
6         1012017       VAT BS/ENC              00000000306,20
6         1012017       TOURIST GUIDE BOOK      00000001531,00
19        1012017       SEMINAR FEES            00000000080,00
19        1012017       PAYABLES                               00000000080,00
28        1012017       MAINTENANCE             00000000746,58
28        1012017       VAT BS/ENC              00000000149,32
28        1012017       PAYABLES                               00000000895,90
31        1012017       PAYABLES                               00000000240,00
31        1012017       VAT BS/DEBIT            00000000040,00
31        1012017       ADVERTISEMENTS          00000000200,00
32        1012017       WATER                   00000000202,20
32        1012017       VAT BS/DEBIT            00000000020,22
32        1012017       WATER                   00000000170,24
32        1012017       VAT BS/DEBIT            00000000009,37
32        1012017       PAYABLES                               00000000402,03
34        1012017       RENTAL COSTS            00000000018,00
34        1012017       PAYABLES                               00000000018,00
35        1012017       MISCELLANEOUS CHARGES   00000000015,00
35        1012017       VAT BS/DEBIT            00000000003,00
35        1012017       PAYABLES                               00000000018,00
36        1012017       LANDLINE TELEPHONE      00000000069,14
36        1012017       VAT BS/ENC              00000000013,83

As expected, when using the -b byte mode instead of the -c character mode, the OpenBSD cut implementation behave like the legacy cut:

openbsd-6.3$ cut -b -24,36-59,93- BALANCE-V2.txt
ACCDOC    ACCDOCDATE    ACCOUNTLIB              DEBIT          CREDIT
4         1012017       TIDE SCHEDULE            00000001615,00
4         1012017       VAT BS/ENC               00000000323,00
4         1012017       PAYABLES                                00000001938,00
5         1012017       ACCOMODATION GUIDE       00000001333,00
5         1012017       VAT BS/ENC               00000000266,60
5         1012017       PAYABLES                                00000001599,60
6         1012017       PAYABLES                               00000001837,20
6         1012017       VAT BS/ENC              00000000306,20
6         1012017       TOURIST GUIDE BOOK      00000001531,00
19        1012017       SEMINAR FEES            00000000080,00
19        1012017       PAYABLES                               00000000080,00
28        1012017       MAINTENANCE             00000000746,58
28        1012017       VAT BS/ENC              00000000149,32
28        1012017       PAYABLES                               00000000895,90
31        1012017       PAYABLES                                00000000240,00
31        1012017       VAT BS/DEBIT             00000000040,00
31        1012017       ADVERTISEMENTS           00000000200,00
32        1012017       WATER                   00000000202,20
32        1012017       VAT BS/DEBIT            00000000020,22
32        1012017       WATER                   00000000170,24
32        1012017       VAT BS/DEBIT            00000000009,37
32        1012017       PAYABLES                               00000000402,03
34        1012017       RENTAL COSTS            00000000018,00
34        1012017       PAYABLES                               00000000018,00
35        1012017       MISCELLANEOUS CHARGES   00000000015,00
35        1012017       VAT BS/DEBIT            00000000003,00
35        1012017       PAYABLES                               00000000018,00
36        1012017       LANDLINE TELEPHONE        00000000069,14
36        1012017       VAT BS/ENC                00000000013,83

3. Working with fields

In some sense, working with fields in a delimited text file is easier for the cut utility, since it will only have to locate the (one byte) field delimiters on each row, copying then verbatim the field content to the output without bothering with any encoding issues.

Here is a sample delimited text file:

sh$ head BALANCE.csv
ACCDOC;ACCDOCDATE;ACCOUNTNUM;ACCOUNTLIB;ACCDOCLIB;DEBIT;CREDIT
4;1012017;623477;TIDE SCHEDULE;ALNEENRE-4701-LOC;00000001615,00;
4;1012017;445452;VAT BS/ENC;ALNEENRE-4701-LOC;00000000323,00;
4;1012017;4356;PAYABLES;ALNEENRE-4701-LOC;;00000001938,00
5;1012017;623372;ACCOMODATION GUIDE;ALNEENRE-4771-LOC;00000001333,00;
5;1012017;445452;VAT BS/ENC;ALNEENRE-4771-LOC;00000000266,60;
5;1012017;4356;PAYABLES;ALNEENRE-4771-LOC;;00000001599,60
6;1012017;4356;PAYABLES;FACT FA00006253 - BIT QUIROBEN;;00000001837,20
6;1012017;445452;VAT BS/ENC;FACT FA00006253 - BIT QUIROBEN;00000000306,20;
6;1012017;623795;TOURIST GUIDE BOOK;FACT FA00006253 - BIT QUIROBEN;00000001531,00;

You may know that file format as CSV (for Comma-separated Value), even if the field separator is not always a comma. For example, the semi-colon (;) is frequently encountered as a field separator, and it is often the default choice when exporting data as “CSV” in countries already using the comma as the decimal separator (like we do in France — hence the choice of that character in my sample file). Another popular variant uses a tab character as the field separator, producing what is sometimes called a tab-separated values file. Finally, in the Unix and Linux world, the colon (:) is yet another relatively common field separator you may find, for example, in the standard /etc/passwd and /etc/group files.

When using a delimited text file format, you provide to the cut command the range of fields to keep using the -f option, and you have to specify the delimiter using the -d option (without the -d option, the cut utility defaults to a tab character for the separator):

sh$ cut -f 5- -d';' BALANCE.csv | head
ACCDOCLIB;DEBIT;CREDIT
ALNEENRE-4701-LOC;00000001615,00;
ALNEENRE-4701-LOC;00000000323,00;
ALNEENRE-4701-LOC;;00000001938,00
ALNEENRE-4771-LOC;00000001333,00;
ALNEENRE-4771-LOC;00000000266,60;
ALNEENRE-4771-LOC;;00000001599,60
FACT FA00006253 - BIT QUIROBEN;;00000001837,20
FACT FA00006253 - BIT QUIROBEN;00000000306,20;
FACT FA00006253 - BIT QUIROBEN;00000001531,00;

Handling lines not containing the delimiter

But what if some line in the input file does not contain the delimiter? It is tempting to imagine that as a row containing only the first field. But this is not what the cut utility does.

By default, when using the -f option, the cut utility will always output verbatim a line that does not contain the delimiter (probably assuming this is a non-data row like a header or comment of some sort):

sh$ (echo "# 2018-03 BALANCE"; cat BALANCE.csv) > BALANCE-WITH-HEADER.csv

sh$ cut -f 6,7 -d';' BALANCE-WITH-HEADER.csv | head -5
# 2018-03 BALANCE
DEBIT;CREDIT
00000001615,00;
00000000323,00;
;00000001938,00

Using the -s option, you can reverse that behavior, so cut will always ignore such line:

sh$ cut -s -f 6,7 -d';' BALANCE-WITH-HEADER.csv | head -5
DEBIT;CREDIT
00000001615,00;
00000000323,00;
;00000001938,00
00000001333,00;

If you are in a hackish mood, you can exploit that feature as a relatively obscure way to keep only lines containing a given character:

# Keep lines containing a `e`
sh$ printf "%s\n" {mighty,bold,great}-{condor,monkey,bear} | cut -s -f 1- -d'e'

Changing the output delimiter

As an extension, the GNU implementation of cut allows to use a different field separator for the output using the --output-delimiter option:

sh$ cut -f 5,6- -d';' --output-delimiter="*" BALANCE.csv | head
ACCDOCLIB*DEBIT*CREDIT
ALNEENRE-4701-LOC*00000001615,00*
ALNEENRE-4701-LOC*00000000323,00*
ALNEENRE-4701-LOC**00000001938,00
ALNEENRE-4771-LOC*00000001333,00*
ALNEENRE-4771-LOC*00000000266,60*
ALNEENRE-4771-LOC**00000001599,60
FACT FA00006253 - BIT QUIROBEN**00000001837,20
FACT FA00006253 - BIT QUIROBEN*00000000306,20*
FACT FA00006253 - BIT QUIROBEN*00000001531,00*

Notice, in that case, all occurrences of the field separator are replaced, and not only those at the boundary of the ranges specified on the command line arguments.

4. Non-POSIX GNU extensions

Speaking of non-POSIX GNU extension, a couple of them that can be particularly useful. Worth mentioning the following extensions work equally well with the byte, character (for what that means in the current GNU implementation) or field ranges:--complement

Think of that option like the exclamation mark in a sed address (!); instead of keeping the data matching the given range, cut will keep data NOT matching the range

# Keep only field 5
sh$ cut -f 5 -d';' BALANCE.csv |head -3
ACCDOCLIB
ALNEENRE-4701-LOC
ALNEENRE-4701-LOC

# Keep all but field 5
sh$ cut --complement -f 5 -d';' BALANCE.csv |head -3
ACCDOC;ACCDOCDATE;ACCOUNTNUM;ACCOUNTLIB;DEBIT;CREDIT
4;1012017;623477;TIDE SCHEDULE;00000001615,00;
4;1012017;445452;VAT BS/ENC;00000000323,00;

--zero-terminated (-z)

use the NUL character as the line terminator instead of the newline character. The -z option is particularly useful when your data may contain embedded newline characters, like when working with filenames (since newline is a valid character in a filename, but NUL isn’t).

To show you how the -z option works, let’s make a little experiment. First, we will create a file whose name contains embedded new lines:

bash$ touch

Let’s now assume I want to display the first 5 characters of each *.txt file name. A naive solution will miserably fail here:

sh$ ls -1 *.txt | cut -c 1-5
BALAN
BALAN
EMPTY
FILE
WITH
NAME.

You may have already read ls was designed for human consumption, and using it in a command pipeline is an anti-pattern (it is indeed). So let’s use the find command instead:

sh$ find . -name '*.txt' -printf "%f\n" | cut -c 1-5
BALAN
EMPTY
FILE
WITH
NAME.
BALAN

and … that produced basically the same erroneous result as before (although in a different order because ls implicitly sorts the filenames, something the find command does not do).

The problem is in both cases, the cut command can’t make the distinction between a newline character being part of a data field (the filename), and a newline character used as an end of record marker. But, using the NUL byte (\0) as the line terminator clears the confusion so we can finally obtain the expected result:

# I was told (?) some old versions of tr require using \000 instead of \0
# to denote the NUL character (let me know if you needed that change!)
sh$ find . -name '*.txt' -printf "%f\0" | cut -z -c 1-5| tr '\0' '\n'
BALAN
EMPTY
BALAN

With that latest example, we are moving away from the core of this article that was the cut command. So, I will let you try to figure by yourself the meaning of the funky "%f\0" after the -printf argument of the find command or why I used the tr command at the end of the pipeline.

A lot more can be done with Cut command

I just showed the most common and in my opinion the most essential usage of Cut command. You can apply the command in even more practical ways. It depends on your logical reasoning and imagination.

Don’t hesitate to use the comment section below to post your findings. And, as always, if you like this article, don’t forget to share it on your favorite websites and social media!



Join the conversation.