Find non-ASCII Characters in Text Files in Linux

ASCII (American Standard Code for Information Interchange) was initially developed for encoding the English alphabet and is limited to 128 characters.

In this tutorial, I am going to explain various ways to find non-ASCII characters in a text file.

How to find Non-ASCII characters in Linux

Before going through the process, let's have a look at the sample that I'm going to use:

sagar@LHB:~$ cat Non-ASCII.txt 
Short guide on how to find Non-ASCII characters from text file by LHB
한국어 샘플 텍스트 ¶ÆÇφϖ℘ℑℜ
A ŠÄMρLë T∈XT to find Non-ÅŠÇÎI Characters
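
Every method below relies on the same property: an ASCII character is stored as a single byte in the range 0x00 to 0x7F, while UTF-8 encodes anything else using bytes of 0x80 and above. Here is a minimal sketch to see this for yourself, assuming a POSIX shell with od and tr available (the variable names are only for illustration):

```shell
# 'A' is ASCII: one byte, below 0x80.
ascii_hex=$(printf 'A' | od -An -tx1 | tr -d ' \n')

# 'é' given as its UTF-8 bytes in octal (\303\251): two bytes, both 0x80 or above.
utf8_hex=$(printf '\303\251' | od -An -tx1 | tr -d ' \n')

echo "A -> $ascii_hex, é -> $utf8_hex"
```

That is why byte ranges such as \x80-\xFF keep showing up in the commands that follow.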

1. Using Perl

It may surprise you, but Perl was originally intended to search, extract and print information from text, so let's put Perl to work:

perl -ne 'print if /[^[:ascii:]]/' Non-ASCII.txt

This prints every line that contains at least one non-ASCII character. Let me break down the options used here:

-ne is a combination of two flags: -n makes Perl loop over the input line by line without printing anything automatically, and -e passes the program on the command line.
print if /[^[:ascii:]]/ is the logic itself: print the current line if it contains any character outside the ASCII class.

2. Using grep command

In Linux, we generally use patterns to search for specific items, and in those cases utilities like grep can make the process a lot easier.

Remember, you may get different results depending on your locale settings (LANG and LC_ALL), as they decide whether grep reads the file as single bytes or as multibyte characters.

grep --color='auto' -P -n "[\x80-\xFF]" Non-ASCII.txt

Here,

--color='auto' highlights the matched pattern.
-P will interpret the Perl-compatible expression.
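-n prefixes each matching line with its line number.
"[\x80-\xFF]" is a range covering the values 128 to 255. In a UTF-8 locale, grep -P reads it as the code points U+0080 to U+00FF, so characters beyond that range (such as the Korean text in the sample) are missed.

Negating the ASCII range instead matches every character above 0x7F, which gave me better results compared to the previous query: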

grep --color='auto' -P -n "[^\x00-\x7F]" Non-ASCII.txt
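
If you would rather not depend on the current locale at all, one option is to force the C locale for that single command so grep compares raw bytes. The following is a sketch under that assumption; the scratch file and variable names are made up for illustration, and -a (treat the input as text) is added only as a precaution for grep builds that would otherwise report a binary file.

```shell
# Hypothetical scratch file: line 2 holds é (U+00E9), line 3 holds φ (U+03C6), both as UTF-8 bytes.
demo=$(mktemp)
printf 'plain\ncaf\303\251\n\317\206\n' > "$demo"

# Byte-wise search: any byte outside 0x00-0x7F counts as non-ASCII.
LC_ALL=C grep -a -n -P '[^\x00-\x7F]' "$demo"

# -c prints the number of matching lines instead of the lines themselves.
grep_count=$(LC_ALL=C grep -a -c -P '[^\x00-\x7F]' "$demo")
echo "lines with non-ASCII bytes: $grep_count"

rm -f "$demo"
```

Because the match is done on bytes, both the character below U+00FF and the one above it are caught.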

3. Using tr command

While the tr (or translate) command is mainly used to translate characters, it can also delete characters, and that's what I'm going to do here.

To be clear, it is not going to delete the actual contents of the file. It only filters the output: with '[:print:]', every printable ASCII character is removed, so what remains is the non-ASCII content (along with the newlines):

tr -d '[:print:]' < Non-ASCII.txt
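
A stricter variant deletes every ASCII byte, control characters and newlines included, so that only bytes from 0x80 upwards survive. This is a sketch rather than the one true way to do it; the scratch file and variable names are assumptions, and od is used just to show the surviving bytes in hex.

```shell
# Hypothetical scratch file: line 2 contains é as UTF-8 bytes (\303\251).
demo=$(mktemp)
printf 'plain\ncaf\303\251\n' > "$demo"

# Octal 000-177 is the whole ASCII range; deleting it leaves only non-ASCII bytes.
leftover=$(LC_ALL=C tr -d '\000-\177' < "$demo" | od -An -tx1 | tr -d ' \n')

if [ -n "$leftover" ]; then
    echo "non-ASCII bytes (hex): $leftover"
else
    echo "file is pure ASCII"
fi

rm -f "$demo"
```

An empty result means the file is pure ASCII, which makes this handy as a quick yes/no check in scripts.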

4. Using sed command

⚠️

Make sure to create a copy of the original text file as the sed command will modify the original file.

The sed utility is a stream editor, generally used to apply edits such as substitutions to text without opening it in an editor, and that is similar to what we are dealing with here.

But sed offers much more than that and if you are constantly dealing with complex workloads, you should check out the detailed guide on SED:

Getting Started With SED Command [Beginner's Guide], by Sylvain Leroux (Linux Handbook)

Now, let's strip the non-ASCII characters from the file with the following command:

LC_ALL=C sed -i 's/[^\x00-\x7F]//g' Non-ASCII.txt

It doesn't print anything. Don't worry, the command has been executed successfully and I'll show you where to look for the result in a moment.

But first, let me break down the executed command:

LC_ALL=C sets the locale to the plain C locale, so sed treats the file as a sequence of single bytes.
-i edits the file in place, meaning it will modify the original file.
's/[^\x00-\x7F]//g' replaces every byte outside the ASCII range (0x00 to 0x7F) with nothing.

In other words, sed has removed the non-ASCII characters from the file. Print it with the cat command and compare it with your backup copy to see where they were:

cat Non-ASCII.txt
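
If you would rather leave the file untouched, a non-destructive variant is to drop -i, write the stripped text to a second file, and compare the two. This is a sketch that assumes GNU sed (for the \x escapes) and uses 0x7F as the upper bound of ASCII; the scratch files and variable names are hypothetical.

```shell
# Hypothetical scratch files: demo is the input, clean receives the ASCII-only copy.
demo=$(mktemp)
clean=$(mktemp)
printf 'plain\ncaf\303\251\n' > "$demo"

# Without -i, sed writes to stdout, so the original file stays intact.
LC_ALL=C sed 's/[^\x00-\x7F]//g' "$demo" > "$clean"

# cmp -s is silent and exits non-zero when the files differ,
# which here means at least one non-ASCII byte was stripped.
if cmp -s "$demo" "$clean"; then
    sed_result="ASCII only"
else
    sed_result="non-ASCII found"
fi
echo "$sed_result"

rm -f "$demo" "$clean"
```

Running diff on the two files instead of cmp would also show which lines were affected.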

5. Using pcregrep

The pcregrep utility is a grep variant built around Perl-compatible regular expressions (PCRE). In simple terms, pcregrep behaves like grep with -P.

But it is usually not installed by default, so install it with the command for your distribution:

For Debian-based distros:

sudo apt install pcregrep

For RHEL-based distros:

sudo yum install pcregrep

So let's use pcregrep to search for non-ASCII characters:

pcregrep --color='auto' -n "[\x80-\xFF]" Non-ASCII.txt

Looks similar, right? To make it clearer, let me break it down for you:

--color='auto' highlights non-ASCII characters.
-n prefixes every line that has non-ASCII characters with its line number.
"[\x80-\xFF]" matches any byte from 0x80 to 0xFF, that is, anything outside the ASCII range.

Similarly, you can also use the [:ascii:] character class, negated with ^, to filter non-ASCII characters:

pcregrep --color='auto' -n "[^[:ascii:]]" Non-ASCII.txt

Final Words

This was my take on how you can find non-ASCII characters in a text file in Linux.

And if you have any queries, make sure to leave ASCII characters in the comments section.