csplit: A Better Way to Split Files in Linux Based on Their Content
How to split a file in Linux based on its content? Learn some practical examples of the GNU coreutils csplit command. It's more useful than the popular split command.
When it comes to splitting a text file into multiple files in Linux, most people use the split command. There is nothing wrong with the split command, except that it relies on a byte count or a line count to split the files.
This is not convenient when you need to split a file based on its content rather than its size. Let me give you an example.
I manage my scheduled tweets using YAML files. A typical tweet file contains several tweets, separated by four dashes:
----
event:
repeat: { days: 180 }
status: |
I think I use the `sed` command daily. And you?
https://www.yesik.it/EP07
#Shell #Linux #Sed #YesIKnowIT
----
status: |
Print the first column of a space-separated data file:
awk '{print $1}' data.txt # Print out just the first column
For some unknown reason, I find that easier to remember than:
cut -f1 data.txt
#Linux #AWK #Cut
----
status: |
For the #shell #beginners :
[...]
When importing them into my system, I need to write each tweet to its own file. I do that to avoid registering duplicate tweets.
But how do you split a file into several parts based on its content? Well, you can probably obtain something convincing using an awk command:
sh$ awk < tweets.yaml '
> /----/ { OUTPUT="tweet." (N++) ".yaml" }
> { print > OUTPUT }
> '
However, despite its relative simplicity, such a solution is not very robust: for example, I didn’t properly close the various output files, so this might very well reach the open files limit. Or what if I forgot the separator before the very first tweet of the file? Of course, all that can be handled and fixed in the AWK script, at the expense of making it more complex. But why bother with that when we have the csplit tool to accomplish that task?
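For the record, here is a rough sketch of what handling those two issues in AWK might look like. The sample.yaml file and the part.N.yaml output names are made up for the illustration, and this is only one possible way to do it:

```shell
# Work in a scratch directory so nothing of yours gets overwritten
cd "$(mktemp -d)"

# A tiny made-up sample; the first record has no leading separator
printf '%s\n' 'status: one' '----' 'status: two' '----' 'status: three' > sample.yaml

awk '
    # On a separator line (or on the very first line of the input),
    # close the previous output file and switch to the next one
    /^----$/ || NR == 1 {
        if (OUTPUT) close(OUTPUT)
        OUTPUT = "part." (N++) ".yaml"
    }
    # Every line, separator included, goes to the current output file
    { print > OUTPUT }
' sample.yaml

ls part.*.yaml
```

It works, but the script already carries more bookkeeping than the three-line version above.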
Using csplit to split files in Linux
The csplit tool is a cousin of the split tool, which splits a file into fixed-size chunks. But csplit identifies the chunk boundaries based on the file content, rather than on a byte or line count.
In this tutorial, I’ll demonstrate csplit command usage and will also explain the output of this command.
So, for example, if I want to split my tweet file based on the ---- delimiter, I could write:
sh$ csplit tweets.yaml /----/
0
10846
You may have guessed that the csplit tool used the regex provided on the command line to identify the separator. And what are those 0 and 10846 values displayed on the standard output? Well, they are the sizes in bytes of each created chunk of data.
sh$ ls -l xx0*
-rw-r--r-- 1 sylvain sylvain 0 Jun 6 11:30 xx00
-rw-r--r-- 1 sylvain sylvain 10846 Jun 6 11:30 xx01
Wait a minute! Where do those xx00 and xx01 filenames come from? And why did csplit split the file into only two chunks? And why does the first data chunk have a length of zero bytes?
The answer to the first question is simple: xxNN (or more formally xx%02d) is the default filename format used by csplit. But you can change that using the --suffix-format and --prefix options. For example, I could change the format to something more meaningful for my needs:
sh$ csplit tweets.yaml \
> --prefix='tweet.' --suffix-format='%03d.yaml' \
> /----/
0
10846
sh$ ls -l tweet.*
-rw-r--r-- 1 sylvain sylvain 0 Jun 6 11:30 tweet.000.yaml
-rw-r--r-- 1 sylvain sylvain 10846 Jun 6 11:30 tweet.001.yaml
The prefix is a plain string, but the suffix is a format string like the one used by the standard C library printf function. Most characters of the format are used verbatim, except for conversion specifications, which are introduced by the percent sign (%) and end with a conversion specifier (here, d). In between, the format may also contain various flags and options. In my example, the %03d conversion specification means:
- display the chunk number as a decimal integer (d),
- in a field three characters wide (3),
- padded on the left with zeros if necessary (0).
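Since the shell's printf command understands the same conversion specifications, you can preview a suffix format before handing it to csplit. Here is a minimal sketch; the preview_names helper and the chunk numbers are made up for the illustration:

```shell
# Print the filenames a given prefix and suffix format would produce
# $1: prefix, $2: suffix format, remaining arguments: chunk numbers
preview_names() {
    prefix=$1
    format=$2
    shift 2
    for n in "$@"; do
        # The suffix format is spliced into the printf format string
        printf "%s$format\n" "$prefix" "$n"
    done
}

preview_names 'tweet.' '%03d.yaml' 0 7 42
# tweet.000.yaml
# tweet.007.yaml
# tweet.042.yaml
```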
But that does not address the other questions I had above: why do we have only two chunks, one of them containing zero bytes? Maybe you have already found the answer to the latter question by yourself: my data file starts with ---- on its very first line. So csplit considered it a delimiter, and since there was no data before that line, it created an empty first chunk. We can disable the creation of zero-byte files using the --elide-empty-files option:
sh$ rm tweet.*
rm: cannot remove 'tweet.*': No such file or directory
sh$ csplit tweets.yaml \
> --prefix='tweet.' --suffix-format='%03d.yaml' \
> --elide-empty-files \
> /----/
10846
sh$ ls -l tweet.*
-rw-r--r-- 1 sylvain sylvain 10846 Jun 6 11:30 tweet.000.yaml
OK: no more empty files. But in a sense, the result is worse now, since csplit split the file into just one chunk. We can barely call that “splitting” a file, can we?
The explanation for that surprising result is that csplit does not assume that every chunk should be split based on the same separator. Actually, csplit requires you to provide each separator used, even if it is the same one several times:
sh$ csplit tweets.yaml \
> --prefix='tweet.' --suffix-format='%03d.yaml' \
> --elide-empty-files \
> /----/ /----/ /----/
170
250
10426
I’ve put three (identical) separators on the command line. So csplit identified the end of the first chunk based on the first separator. That leads to a zero-byte chunk, which was elided. The second chunk was delimited by the next line matching /----/, leading to a 170-byte chunk. Finally, a third 250-byte chunk was identified based on the third separator. The remaining data, 10426 bytes, was put into the last chunk.
sh$ ls -l tweet.???.yaml
-rw-r--r-- 1 sylvain sylvain 170 Jun 6 11:30 tweet.000.yaml
-rw-r--r-- 1 sylvain sylvain 250 Jun 6 11:30 tweet.001.yaml
-rw-r--r-- 1 sylvain sylvain 10426 Jun 6 11:30 tweet.002.yaml
Obviously, it wouldn’t be practical if we had to provide as many separators on the command line as there are chunks in the data file, especially since that exact number is usually not known in advance. Fortunately, csplit has a special pattern meaning “repeat the previous pattern as many times as possible.” Although its syntax is reminiscent of the star quantifier in a regular expression, it is closer to the Kleene plus concept, since it repeats a separator that has already been matched once:
sh$ csplit tweets.yaml \
> --prefix='tweet.' --suffix-format='%03d.yaml' \
> --elide-empty-files \
> /----/ '{*}'
170
250
190
208
140
[...]
247
285
194
214
185
131
316
221
And this time, finally, I have split my tweet collection into individual parts. However, does csplit have some other nice “special” patterns like that? Well, I don’t know if we can call them “special”, but csplit definitely understands more patterns.
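If you don’t have my tweet file at hand, you can reproduce the difference between a single pattern and a repeated one with a tiny sample. The sample.txt file and the once./all. prefixes below are made up for the illustration, and I assume GNU csplit:

```shell
# Work in a scratch directory so nothing of yours gets overwritten
cd "$(mktemp -d)"

# Three short records, each introduced by a ---- line
printf '%s\n' '----' 'one' '----' 'two' '----' 'three' > sample.txt

# A single pattern splits only once, at the first match:
# the empty leading chunk is elided, so one file remains
csplit --prefix='once.' --elide-empty-files sample.txt '/^----$/'

# Adding '{*}' repeats the pattern until the input is exhausted:
# one file per record
csplit --prefix='all.' --elide-empty-files sample.txt '/^----$/' '{*}'

ls once.* all.*
```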
More csplit patterns
We’ve just seen in the preceding section how to use the '{*}' quantifier for unbounded repetitions. However, by replacing the star with a number, you can request an exact number of repetitions:
sh$ csplit tweets.yaml \
> --prefix='tweet.' --suffix-format='%03d.yaml' \
> --elide-empty-files \
> /----/ '{6}'
170
250
190
208
140
216
9672
That leads to an interesting corner case. What would happen if the number of repetitions exceeded the number of actual delimiters in the data file? Well, let’s see that with an example:
sh$ csplit tweets.yaml \
> --prefix='tweet.' --suffix-format='%03d.yaml' \
> --elide-empty-files \
> /----/ '{999}'
csplit: ‘/----/’: match not found on repetition 62
170
250
190
208
[...]
91
247
285
194
214
185
131
316
221
sh$ ls tweet.*
ls: cannot access 'tweet.*': No such file or directory
Interestingly, not only did csplit report an error, but it also removed all the chunk files created during the process. Pay special attention to my wording: it removed them. That means the files were created, and then, when csplit encountered the error, it deleted them. In other words, if you already have a file whose name looks like a chunk file, it will be removed:
sh$ touch tweet.002.yaml
sh$ csplit tweets.yaml \
> --prefix='tweet.' --suffix-format='%03d.yaml' \
> --elide-empty-files \
> /----/ '{999}'
csplit: ‘/----/’: match not found on repetition 62
170
250
190
[...]
87
91
247
285
194
214
185
131
316
221
sh$ ls tweet.*
ls: cannot access 'tweet.*': No such file or directory
In the above example, the tweet.002.yaml file we manually created was overwritten, then removed by csplit.
You can change that behavior using the --keep-files option. As its name implies, it tells csplit not to remove the chunks it has created when it encounters an error:
sh$ csplit tweets.yaml \
> --prefix='tweet.' --suffix-format='%03d.yaml' \
> --elide-empty-files \
> --keep-files \
> /----/ '{999}'
csplit: ‘/----/’: match not found on repetition 62
170
250
190
[...]
316
221
sh$ ls tweet.*
tweet.000.yaml
tweet.001.yaml
tweet.002.yaml
tweet.003.yaml
[...]
tweet.058.yaml
tweet.059.yaml
tweet.060.yaml
tweet.061.yaml
Notice that in this case, despite the error, csplit didn’t discard any data:
sh$ diff -s tweets.yaml <(cat tweet.*)
Files tweets.yaml and /dev/fd/63 are identical
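Here is a small sketch to observe both behaviors side by side on a throwaway file. The sample.txt file and the part. prefix are made up for the illustration, and I assume GNU csplit:

```shell
# Work in a scratch directory so nothing of yours gets overwritten
cd "$(mktemp -d)"

# Only two separators in this sample
printf '%s\n' '----' 'one' '----' 'two' > sample.txt

# Asking for more repetitions than there are separators is an error,
# and the chunks written so far are removed
csplit --prefix='part.' --elide-empty-files sample.txt '/^----$/' '{5}' \
    || echo "csplit failed with status $?"
after_error=$(find . -name 'part.*' | wc -l)
echo "chunks left after the failed run: $after_error"

# With --keep-files, the same error leaves the chunks in place
csplit --prefix='part.' --elide-empty-files --keep-files sample.txt '/^----$/' '{5}' \
    || echo "csplit failed with status $?"
ls part.*

# Concatenating the kept chunks gives back the original data
cat part.* | diff -s sample.txt -
```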
But what if there is some data in the file I want to discard? Well, csplit has some limited support for that using a %regex% pattern.
Skipping data in csplit
When using a percent sign (%) as the regex delimiter instead of a slash (/), csplit will skip data up to (but not including) the first line matching the regular expression. This may be useful to ignore some records, especially at the start or the end of the input file:
sh$ # Keep only the first two tweets
sh$ csplit tweets.yaml \
> --prefix='tweet.' --suffix-format='%03d.yaml' \
> --elide-empty-files \
> --keep-files \
> /----/ '{2}' %----% '{*}'
170
250
sh$ head tweet.00[012].yaml
==> tweet.000.yaml <==
----
event:
repeat: { days: 180 }
status: |
I think I use the `sed` command daily. And you?
https://www.yesik.it/EP07
#Shell #Linux #Sed #YesIKnowIT
==> tweet.001.yaml <==
----
status: |
Print the first column of a space-separated data file:
awk '{print $1}' data.txt # Print out just the first column
For some unknown reason, I find that easier to remember than:
cut -f1 data.txt
#Linux #AWK #Cut
sh$ # Skip the first two tweets
sh$ csplit tweets.yaml \
> --prefix='tweet.' --suffix-format='%03d.yaml' \
> --elide-empty-files \
> --keep-files \
> %----% '{2}' /----/ '{2}'
190
208
140
9888
sh$ head tweet.00[012].yaml
==> tweet.000.yaml <==
----
status: |
For the #shell #beginners :
« #GlobPatterns : how to move hundreds of files in not time [1/3] »
https://youtu.be/TvW8DiEmTcQ
#Unix #Linux
#YesIKnowIT
==> tweet.001.yaml <==
----
status: |
Want to know the oldest file in your disk?
find / -type f -printf '%TFT%.8TT %p\n' | sort | less
(should work on any Single UNIX Specification compliant system)
#UNIX #Linux
==> tweet.002.yaml <==
----
status: |
When using the find command, use `-iname` instead of `-name` for case-insensitive search
#Unix #Linux #Shell #Find
sh$ # Keep only the third, fourth, and fifth tweets
sh$ csplit tweets.yaml \
> --prefix='tweet.' --suffix-format='%03d.yaml' \
> --elide-empty-files \
> --keep-files \
> %----% '{2}' /----/ '{2}' %----% '{*}'
190
208
140
sh$ head tweet.00[012].yaml
==> tweet.000.yaml <==
----
status: |
For the #shell #beginners :
« #GlobPatterns : how to move hundreds of files in not time [1/3] »
https://youtu.be/TvW8DiEmTcQ
#Unix #Linux
#YesIKnowIT
==> tweet.001.yaml <==
----
status: |
Want to know the oldest file in your disk?
find / -type f -printf '%TFT%.8TT %p\n' | sort | less
(should work on any Single UNIX Specification compliant system)
#UNIX #Linux
==> tweet.002.yaml <==
----
status: |
When using the find command, use `-iname` instead of `-name` for case-insensitive search
#Unix #Linux #Shell #Find
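The same skip, keep, skip sequence can be tried on a small made-up sample. This is only a sketch assuming GNU csplit; the sample.txt file and the part. prefix are mine:

```shell
# Work in a scratch directory so nothing of yours gets overwritten
cd "$(mktemp -d)"

# Five short records, each introduced by a ---- line
printf '%s\n' '----' 'one' '----' 'two' '----' 'three' \
              '----' 'four' '----' 'five' > sample.txt

# 1. skip up to the second separator (record "one" is dropped),
# 2. write the next two records to their own files,
# 3. skip everything that remains
csplit --prefix='part.' --elide-empty-files sample.txt \
    '%^----$%' '{1}' '/^----$/' '{1}' '%^----$%' '{*}'

head part.*
```

Only the records "two" and "three" should end up on disk.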
Using offsets while splitting files with csplit
When using regular expressions (either /…/ or %…%), you can specify a positive (+N) or negative (-N) offset at the end of the pattern, so csplit will split the file N lines after or before the matching line. Remember, in all cases, the pattern specifies the end of the chunk:
sh$ csplit tweets.yaml \
> --prefix='tweet.' --suffix-format='%03d.yaml' \
> --elide-empty-files \
> --keep-files \
> %----%+1 '{2}' /----/+1 '{2}' %----% '{*}'
190
208
140
sh$ head tweet.00[012].yaml
==> tweet.000.yaml <==
status: |
For the #shell #beginners :
« #GlobPatterns : how to move hundreds of files in not time [1/3] »
https://youtu.be/TvW8DiEmTcQ
#Unix #Linux
#YesIKnowIT
----
==> tweet.001.yaml <==
status: |
Want to know the oldest file in your disk?
find / -type f -printf '%TFT%.8TT %p\n' | sort | less
(should work on any Single UNIX Specification compliant system)
#UNIX #Linux
----
==> tweet.002.yaml <==
status: |
When using the find command, use `-iname` instead of `-name` for case-insensitive search
#Unix #Linux #Shell #Find
----
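To see the effect of an offset in isolation, here is a sketch on a small made-up sample (sample.txt and the part. prefix are mine, and I assume GNU csplit). With +1, each chunk ends one line after the match, so the separator closes a chunk instead of opening the next one:

```shell
# Work in a scratch directory so nothing of yours gets overwritten
cd "$(mktemp -d)"

# Three short records, each introduced by a ---- line
printf '%s\n' '----' 'one' '----' 'two' '----' 'three' > sample.txt

# Break one line AFTER each separator
csplit --prefix='part.' --elide-empty-files sample.txt '/^----$/+1' '{*}'

head part.*
```

The first chunk holds only the leading separator, and the last one holds the final record without any separator.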
Split by line number
We have already seen how to use a regular expression to split files. In that case, csplit splits the file at the first line matching that regex. But you can also identify the split line by its line number, as we will see now.
Before switching to YAML, I used to store my scheduled tweets in a flat file.
In that file, a tweet was made of two lines: one containing an optional repetition, and the second containing the text of the tweet, with newlines replaced by \n. Once again, that sample file is available online.
With that “fixed size” format too, I was able to use csplit to put each individual tweet into its own file:
sh$ csplit tweets.txt \
> --prefix='tweet.' --suffix-format='%03d.txt' \
> --elide-empty-files \
> --keep-files \
> 2 '{*}'
csplit: ‘2’: line number out of range on repetition 62
1
123
222
161
182
119
184
81
148
128
142
101
107
[...]
sh$ diff -s tweets.txt <(cat tweet.*.txt)
Files tweets.txt and /dev/fd/63 are identical
sh$ head tweet.00[012].txt
==> tweet.000.txt <==
==> tweet.001.txt <==
{ days:180 }
I think I use the `sed` command daily. And you?\n\nhttps://www.yesik.it/EP07\n#Shell #Linux #Sed\n#YesIKnowIT
==> tweet.002.txt <==
{}
Print the first column of a space-separated data file:\nawk '{print $1}' data.txt # Print out just the first column\n\nFor some unknown reason, I find that easier to remember than:\ncut -f1 data.txt\n\n#Linux #AWK #Cut
The example above seems easy to understand, but there are two pitfalls here. First, the 2 given as an argument to csplit is a line number, not a line count. However, when using a repetition as I did, after the first match, csplit will use that number as a line count. If that’s not clear, I’ll let you compare the output of the following three commands:
sh$ csplit tweets.txt --keep-files 2 2 2 2 2
csplit: warning: line number ‘2’ is the same as preceding line number
csplit: warning: line number ‘2’ is the same as preceding line number
csplit: warning: line number ‘2’ is the same as preceding line number
csplit: warning: line number ‘2’ is the same as preceding line number
1
0
0
0
0
9030
sh$ csplit tweets.txt --keep-files 2 4 6 8 10
1
123
222
161
182
8342
sh$ csplit tweets.txt --keep-files 2 '{4}'
1
123
222
161
182
8342
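The same experiment fits in a few lines with a made-up seven-line file, which makes it easier to see which lines land in which chunk. This is a sketch assuming GNU csplit; sample.txt and the pair. prefix are mine:

```shell
# Work in a scratch directory so nothing of yours gets overwritten
cd "$(mktemp -d)"

# One leading line followed by three two-line records
printf '%s\n' 'header' 'a1' 'a2' 'b1' 'b2' 'c1' 'c2' > sample.txt

# "2" is a line number: the first chunk stops just before line 2.
# Each repetition then advances by two more lines (breaks before
# lines 4 and 6), and the remaining lines go to a last chunk.
csplit --prefix='pair.' sample.txt 2 '{2}'

head pair.*
```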
I mentioned a second pitfall, somewhat related to the first one. Maybe you noticed the empty line at the very top of the tweets.txt file? It leads to that tweet.000.txt chunk, which contains only the newline character. Unfortunately, it was required in that example because of the repetition: remember, I want two-line chunks. So the 2 is mandatory before the repetition. But that also means the first chunk will break at, but not including, line two. In other words, the first chunk contains one line. All the other ones will contain two lines. Maybe you could share your opinion in the comment section, but as for me, I think this was an unfortunate design choice.
You can mitigate that issue by skipping directly to the first non-empty line:
sh$ csplit tweets.txt \
> --prefix='tweet.' --suffix-format='%03d.txt' \
> --elide-empty-files \
> --keep-files \
> %.% 2 '{*}'
csplit: ‘2’: line number out of range on repetition 62
123
222
161
[...]
sh$ head tweet.00[012].txt
==> tweet.000.txt <==
{ days:180 }
I think I use the `sed` command daily. And you?\n\nhttps://www.yesik.it/EP07\n#Shell #Linux #Sed\n#YesIKnowIT
==> tweet.001.txt <==
{}
Print the first column of a space-separated data file:\nawk '{print $1}' data.txt # Print out just the first column\n\nFor some unknown reason, I find that easier to remember than:\ncut -f1 data.txt\n\n#Linux #AWK #Cut
==> tweet.002.txt <==
{}
For the #shell #beginners :\n« #GlobPatterns : how to move hundreds of files in not time [1/3] »\nhttps://youtu.be/TvW8DiEmTcQ\n\n#Unix #Linux\n#YesIKnowIT
Reading from stdin
Of course, like most command-line tools, csplit can read the input data from its standard input. In that case, you have to specify - as the input filename:
sh$ tr '[:lower:]' '[:upper:]' < tweets.txt | csplit - \
> --prefix='tweet.' --suffix-format='%03d.txt' \
> --elide-empty-files \
> --keep-files \
> %.% 2 '{3}'
123
222
161
8524
sh$ head tweet.???.txt
==> tweet.000.txt <==
{ DAYS:180 }
I THINK I USE THE `SED` COMMAND DAILY. AND YOU?\N\NHTTPS://WWW.YESIK.IT/EP07\N#SHELL #LINUX #SED\N#YESIKNOWIT
==> tweet.001.txt <==
{}
PRINT THE FIRST COLUMN OF A SPACE-SEPARATED DATA FILE:\NAWK '{PRINT $1}' DATA.TXT # PRINT OUT JUST THE FIRST COLUMN\N\NFOR SOME UNKNOWN REASON, I FIND THAT EASIER TO REMEMBER THAN:\NCUT -F1 DATA.TXT\N\N#LINUX #AWK #CUT
==> tweet.002.txt <==
{}
FOR THE #SHELL #BEGINNERS :\N« #GLOBPATTERNS : HOW TO MOVE HUNDREDS OF FILES IN NOT TIME [1/3] »\NHTTPS://YOUTU.BE/TVW8DIEMTCQ\N\N#UNIX #LINUX\N#YESIKNOWIT
==> tweet.003.txt <==
{}
WANT TO KNOW THE OLDEST FILE IN YOUR DISK?\N\NFIND / -TYPE F -PRINTF '%TFT%.8TT %P\N' | SORT | LESS\N(SHOULD WORK ON ANY SINGLE UNIX SPECIFICATION COMPLIANT SYSTEM)\N#UNIX #LINUX
{}
WHEN USING THE FIND COMMAND, USE `-INAME` INSTEAD OF `-NAME` FOR CASE-INSENSITIVE SEARCH\N#UNIX #LINUX #SHELL #FIND
{}
FROM A POSIX SHELL `$OLDPWD` HOLDS THE NAME OF THE PREVIOUS WORKING DIRECTORY:\NCD /TMP\NECHO YOU ARE HERE: $PWD\NECHO YOU WERE HERE: $OLDPWD\NCD $OLDPWD\N\N#UNIX #LINUX #SHELL #CD
{}
FROM A POSIX SHELL, "CD" IS A SHORTHAND FOR CD $HOME\N#UNIX #LINUX #SHELL #CD
{}
HOW TO MOVE HUNDREDS OF FILES IN NO TIME?\NUSING THE FIND COMMAND!\N\NHTTPS://YOUTU.BE/ZMEFXJYZAQK\N#UNIX #LINUX #MOVE #FILES #FIND\N#YESIKNOWIT
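Reading from standard input also means you can feed csplit with data that never touches the disk. A minimal sketch, assuming GNU csplit and a made-up up. prefix:

```shell
# Work in a scratch directory so nothing of yours gets overwritten
cd "$(mktemp -d)"

# Generate two records on the fly, upper-case them, and split the
# resulting stream; "-" tells csplit to read its standard input
printf '%s\n' '----' 'one' '----' 'two' \
    | tr '[:lower:]' '[:upper:]' \
    | csplit --prefix='up.' --elide-empty-files - '/^----$/' '{*}'

head up.*
```

Note the quotes around the tr character classes: without them, the shell could expand [:lower:] as a glob pattern if a matching filename exists in the current directory.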
And that’s pretty much all I wanted to show you today. I hope that in the future, you’ll use csplit to split files in Linux. If you’ve enjoyed this article, don’t forget to share and like it on your favorite social network!