ASCII (American Standard Code for Information Interchange) was initially developed for encoding the English alphabet and is limited to 128 characters.
In this tutorial, I am going to explain various ways to find non-ASCII characters in a text file.
How to find Non-ASCII characters in Linux
Before going through the process, let's have a look at the sample that I'm going to use:
$ cat Non-ASCII.txt
Short guide on how to find Non-ASCII characters from text file by LHB
한국어 샘플 텍스트
¶ÆÇφϖ℘ℑℜ
A ŠÄMρLë T∈XT to find Non-ÅŠÇÎI Characters
1. Using Perl
It may surprise you, but Perl was originally intended to search, extract and print information, so let's put Perl to work:
perl -ne 'print if /[^[:ascii:]]/' Non-ASCII.txt
In case you are confused, Perl prints every line that contains at least one non-ASCII character. So let me break down the options used here:
-ne is a combination of two flags:
-n makes Perl loop over the input line by line without printing each line automatically.
-e lets you pass the code to execute directly on the command line.
print if /[^[:ascii:]]/ is the logic that prints a line only when it contains a character outside the ASCII range.
2. Using grep command
In Linux, we generally use patterns to search for specific items, and in those cases, utilities like grep can make the process a lot easier.
Remember, you'd get different results depending on the locale configured for your shell.
grep --color='auto' -P -n "[\x80-\xFF]" Non-ASCII.txt
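--color='auto' highlights the matched pattern.
-P interprets the pattern as a Perl-compatible regular expression.
-n displays the line number of each line containing non-ASCII characters.
"[\x80-\xFF]" is the range 0x80 to 0xFF. In a UTF-8 locale, grep -P treats it as the code points U+0080 to U+00FF, so characters beyond that range (the Korean text in the sample, for instance) are not matched.
Negating the ASCII range instead matches every character outside 0x00 to 0x7F, and it gave me better results compared to the previous query: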
grep --color='auto' -P -n "[^\x00-\x7F]" Non-ASCII.txt
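To see exactly which characters were matched rather than the whole line, one option is to add -o and run grep in the C locale so that the range is applied byte by byte. The sketch below assumes GNU grep built with PCRE support (the -P option); the demo file is hypothetical.

```shell
#!/bin/sh
# Hypothetical demo file: only line 2 contains a non-ASCII character.
demo_file=$(mktemp)
printf 'plain line\ncaf\303\251 line\nanother plain line\n' > "$demo_file"

# LC_ALL=C makes grep treat the input as raw bytes, so [\x80-\xFF]+ matches
# any run of bytes above the ASCII range, whatever character they encode.
# -o prints only the matched part; -n prefixes it with the line number.
grep_hits=$(LC_ALL=C grep -n -o -P '[\x80-\xFF]+' "$demo_file")
printf '%s\n' "$grep_hits"

rm -f "$demo_file"
```

Matching runs of bytes (the + quantifier) keeps the bytes of a multi-byte character together, so the terminal can still display it.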
3. Using tr command
While the tr (or translate) command is mainly used to translate characters, it can also be used to delete characters, and that's what I'm going to do here.
To be clear, it is not going to delete the actual contents of the file; it only changes the output. The -d option deletes every printable character, and since GNU tr works byte by byte, the bytes that make up the non-ASCII characters are what remains:
tr -d '[:print:]' < Non-ASCII.txt
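The same idea can be turned into a quick check of how much non-ASCII data a file holds. This is a rough sketch under the assumption that counting bytes (not characters) is good enough; the demo file is hypothetical.

```shell
#!/bin/sh
# Hypothetical demo file: "é" is stored as two UTF-8 bytes.
demo_file=$(mktemp)
printf 'plain line\ncaf\303\251 line\nanother plain line\n' > "$demo_file"

# Delete every ASCII byte (octal 000 to 177), newlines included,
# then count what is left. The file itself is not modified.
non_ascii_bytes=$(LC_ALL=C tr -d '\000-\177' < "$demo_file" | wc -c | tr -d ' ')
echo "non-ASCII bytes: $non_ascii_bytes"

rm -f "$demo_file"
```

A result of 0 means the file is pure ASCII; anything else tells you there is something to look for with the other methods.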
4. Using sed command
The sed utility is a stream editor, generally used when a sequence of edits is too complex to make by hand, and that is similar to what we are dealing with here.
But sed offers much more than that, and if you are constantly dealing with complex workloads, you should check out the detailed guide on sed.
Now, let's deal with the non-ASCII characters using the given command. Note that, unlike the previous methods, it removes them from the file rather than printing them, so keep a copy of the original if you still need it:
LC_ALL=C sed -i 's/[^\x00-\x7F]//g' Non-ASCII.txt
It doesn't show any output. Don't worry, the command has been executed successfully; because of -i, the result is written to the file itself.
Let me break down the executed command:
LC_ALL=C sets the locale to the simplest one (C), so that sed processes the file byte by byte.
-i edits the file in place, meaning it will modify the original file.
's/[^\x00-\x7F]//g' is the expression that matches every byte outside the ASCII range (0x00 to 0x7F) and replaces it with nothing. The range is sometimes written as \x0-\xB1, which wrongly keeps the bytes 0x80 to 0xB1.
As I mentioned earlier, sed has changed the file itself, and the result can be viewed by running the cat command on the file.
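If you would rather not touch the original file, a minimal sketch is to drop -i, write the stripped text to a separate file, and compare the two. It assumes GNU sed (for the \x escapes) and uses the ASCII range 0x00 to 0x7F; both file names are temporary and hypothetical.

```shell
#!/bin/sh
# Hypothetical demo file with one non-ASCII character on line 2.
demo_file=$(mktemp)
clean_file=$(mktemp)
printf 'plain line\ncaf\303\251 line\n' > "$demo_file"

# Same kind of substitution, but without -i: the result goes to a separate file.
LC_ALL=C sed 's/[^\x00-\x7F]//g' "$demo_file" > "$clean_file"

# cmp -s is silent and returns non-zero when the two files differ,
# which here means at least one non-ASCII byte was stripped.
if cmp -s "$demo_file" "$clean_file"; then
    sed_result="no non-ASCII bytes found"
else
    sed_result="non-ASCII bytes found"
fi
echo "$sed_result"
cleaned_text=$(cat "$clean_file")

rm -f "$demo_file" "$clean_file"
```

Running diff on the two files instead of cmp would additionally show which lines were affected.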
5. Using pcregrep
The pcregrep utility is nothing but grep with built-in support for Perl-compatible regular expressions. In simple terms, pcregrep behaves as grep with the -P option.
But it requires manual installation and can be installed through the given command:
For Debian-based distros:
sudo apt install pcregrep
For RHEL-based distros:
yum install pcregrep
So let's use pcregrep to search for non-ASCII characters:
pcregrep --color='auto' -n "[\x80-\xFF]" Non-ASCII.txt
Looks similar, right? To make it more clear, let me break it down for you:
--color='auto' highlights non-ASCII characters.
-n displays the line number of every line having non-ASCII characters.
"[\x80-\xFF]" matches characters outside the range of ASCII characters.
Similarly, you can also use the [:ascii:] character class with ^ to filter non-ASCII characters:
pcregrep --color='auto' -n "[^[:ascii:]]" Non-ASCII.txt
This was my take on how you can find non-ASCII characters in a text file in Linux.
And if you have any queries, make sure to leave ASCII characters in the comments section.