Chapter 5: String Manipulation in AWK

AWK isn't just a text processor - it's your personal calculator and string manipulation wizard rolled into one. You have already seen AWK's mathematical capabilities. Now it's time to explore its string functions.

AWK offers a bunch of built-in functions to deal with strings. Here's a quick look at them:

Function	Purpose	Syntax	Example	Result
length(s)	String length	length(string)	length("hello")	5
substr(s,p,n)	Extract substring	substr(string, pos, len)	substr("hello", 2, 3)	"ell"
index(s,t)	Find substring position	index(string, target)	index("hello", "ll")	3
split(s,a,fs)	Split into array	split(string, array, sep)	split("a,b,c", arr, ",")	3 (returns count)
sub(r,s,t)	Replace first match	sub(regex, replacement, target)	s = "hello"; sub(/o/, "0", s)	1 (s becomes "hell0")
gsub(r,s,t)	Replace all matches	gsub(regex, replacement, target)	s = "hello"; gsub(/l/, "1", s)	2 (s becomes "he11o")
match(s,r)	Find regex position	match(string, regex)	match("hello", /ll/)	3
sprintf(f,...)	Format string	sprintf(format, args...)	sprintf("%.2f", 3.14159)	"3.14"
tolower(s)	Convert to lowercase	tolower(string)	tolower("HELLO")	"hello"
toupper(s)	Convert to uppercase	toupper(string)	toupper("hello")	"HELLO"

AWK's string functions let you clean, transform, and analyze text data with precision.
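
One detail from the table is easy to miss: sub() and gsub() change the target variable in place and return the number of replacements, while match() returns the start position and also sets the built-in variables RSTART and RLENGTH. Here's a minimal sketch of that behavior. The sample strings are made up for illustration.

```shell
result=$(awk 'BEGIN {
    s = "hello"
    n = gsub(/l/, "L", s)       # returns the number of replacements
    print n, s

    t = "hello"
    m = sub(/l/, "L", t)        # replaces only the first match
    print m, t

    # match() sets RSTART (start position) and RLENGTH (match length)
    if (match("error code 404 found", /[0-9]+/))
        print RSTART, RLENGTH
}')
echo "$result"
```

This should print "2 heLLo", "1 heLlo" and "12 3" on three lines.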

Let's create a few sample test files.

user_data.txt:

john.smith@company.com John Smith Active
mary.johnson@company.com Mary Johnson Inactive
bob.wilson@COMPANY.COM Bob Wilson Active
alice.brown@company.com Alice Brown Pending
charlie.davis@external.org Charlie Davis Active
sara.miller@company.com Sara Miller Active

And log_entries.txt:

2024-06-29 ERROR Database connection failed for user admin
2024-06-29 WARNING High memory usage detected on web01
2024-06-29 INFO Backup completed successfully
2024-06-29 ERROR Invalid login attempt from 192.168.1.100
2024-06-29 DEBUG Session cleanup started
2024-06-29 WARNING Disk space low on /var partition

Length and Substring Functions

Let's find the users who have long email addresses in the user_data.txt file:

awk '{
    email_length = length($1)
    printf "%-25s: %d characters", $1, email_length
    if (email_length > 23) printf " (long email)"
    printf "\n"
}' user_data.txt

The length() function returns the number of characters in the email address, and a simple comparison flags anything longer than 23 characters.

john.smith@company.com   : 22 characters
mary.johnson@company.com : 24 characters (long email)
bob.wilson@COMPANY.COM   : 22 characters
alice.brown@company.com  : 23 characters
charlie.davis@external.org: 26 characters (long email)
sara.miller@company.com  : 23 characters

Notice how the email address with 26 characters doesn't fit in the specified field of 25?
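
That happens because the width in a printf format is only a minimum. If you need a hard limit, you can add a precision, which caps how many characters of the string are printed. Here's a small sketch with a field of 12 instead of 25 to keep the output short. The first address is a made-up example.

```shell
result=$(printf '%s\n' 'bob@x.org' 'charlie.davis@external.org' | awk '{
    # %-12s pads to at least 12 characters; %-12.12s also cuts off after 12
    printf "[%-12s] [%-12.12s]\n", $1, $1
}')
echo "$result"
```

The short address is padded in both columns, while the long one overflows the first column and is truncated to "charlie.davi" in the second.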

Now let's take it to the next level by extracting the user and domain names from each email address. This puts the substr() function to use.

awk '{
    at_position = index($1, "@")
    if (at_position > 0) {
        domain = substr($1, at_position + 1)
        username = substr($1, 1, at_position - 1)
        printf "User: %-15s Domain: %s\n", username, domain
    }
}' user_data.txt

Here, index() finds the position of the @ symbol, and substr() splits the email into its username and domain parts.

User: john.smith      Domain: company.com
User: mary.johnson    Domain: company.com
User: bob.wilson      Domain: COMPANY.COM
User: alice.brown     Domain: company.com
User: charlie.davis   Domain: external.org
User: sara.miller     Domain: company.com
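
The same result can also be reached with split(), which is covered later in this chapter. This is only an alternative sketch, not a replacement for the index() and substr() approach. The second input line is a made-up entry without an @ sign, added to show the fallback branch.

```shell
result=$(printf '%s\n' \
    'john.smith@company.com John Smith Active' \
    'not-an-email Jane Doe Active' | awk '{
    # split() returns the number of pieces it produced
    n = split($1, parts, "@")
    if (n == 2)
        printf "User: %-15s Domain: %s\n", parts[1], parts[2]
    else
        print "Skipped: " $1
}')
echo "$result"
```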

Case Conversion Functions

Let's standardize email addresses to lowercase by using the tolower() function:

awk '{ printf "%-25s -> %s\n", $1, tolower($1) }' user_data.txt

It converts each email address to lowercase for consistent formatting.

john.smith@company.com   -> john.smith@company.com
mary.johnson@company.com -> mary.johnson@company.com
bob.wilson@COMPANY.COM   -> bob.wilson@company.com
alice.brown@company.com  -> alice.brown@company.com
charlie.davis@external.org -> charlie.davis@external.org
sara.miller@company.com  -> sara.miller@company.com
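
Lowercasing is most useful when you compare strings. As a rough sketch, here is one way to count the addresses in the company.com domain regardless of how the domain is capitalized. It uses three lines from user_data.txt as inline input, and the variable names are my own choice.

```shell
result=$(printf '%s\n' \
    'john.smith@company.com John Smith Active' \
    'bob.wilson@COMPANY.COM Bob Wilson Active' \
    'charlie.davis@external.org Charlie Davis Active' | awk '{
    # Take everything after the @ and lowercase it before comparing
    domain = tolower(substr($1, index($1, "@") + 1))
    if (domain == "company.com") count++
}
END { print count + 0, "company.com addresses" }')
echo "$result"
```

Without tolower(), the COMPANY.COM address would not be counted.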

Let's take it a bit further by creating display names from email addresses.

Extract the username first, then replace the dot (.) with a space and capitalize the first letter.

awk '{
    at_position = index($1, "@")
    username = substr($1, 1, at_position - 1)
    
    # Capitalize the first letter
    display_name = toupper(substr(username, 1, 1)) substr(username, 2)
    gsub(/\./, " ", display_name)  # Replace dots with spaces
    
    printf "Email: %-25s Display: %s\n", $1, display_name
}' user_data.txt

Worth noticing is substr(username, 2), which has no length specified and thus returns everything from position 2 to the end. I also used gsub() rather than sub(). The sample data doesn't need it, but if a username had several names separated by dots, all of them would be handled properly. More on that in the next section. Note that only the first letter of the username is capitalized, so the surname stays in lowercase.

Email: john.smith@company.com    Display: John smith
Email: mary.johnson@company.com  Display: Mary johnson
Email: bob.wilson@COMPANY.COM    Display: Bob wilson
Email: alice.brown@company.com   Display: Alice brown
Email: charlie.davis@external.org Display: Charlie davis
Email: sara.miller@company.com   Display: Sara miller
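
If you want every part of the name capitalized, one possible approach is to split the username at the dots and capitalize each piece in a loop. This is only a sketch: it uses an array and a for loop, and the address with three name parts is a made-up example.

```shell
result=$(echo 'mary.ann.johnson@company.com' | awk '{
    username = substr($1, 1, index($1, "@") - 1)
    n = split(username, words, /\./)    # one array element per name part
    display_name = ""
    for (i = 1; i <= n; i++) {
        # Uppercase the first letter, lowercase the rest
        word = toupper(substr(words[i], 1, 1)) tolower(substr(words[i], 2))
        if (i > 1) display_name = display_name " "
        display_name = display_name word
    }
    print display_name
}')
echo "$result"
```

This should print "Mary Ann Johnson".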

Pattern Replacement with sub

Clean up log entries by removing timestamps. Use the substitute function sub() to remove the date pattern from the beginning of each log line.

awk '{
    # Remove date at beginning
    sub(/^[0-9-]+ /, "", $0)
    print "Clean log:", $0
}' log_entries.txt

The regex matches one or more digits or dashes at the beginning of the line, followed by a space, and replaces them with ... nothing, removing that part completely.

Clean log: ERROR Database connection failed for user admin
Clean log: WARNING High memory usage detected on web01
Clean log: INFO Backup completed successfully
Clean log: ERROR Invalid login attempt from 192.168.1.100
Clean log: DEBUG Session cleanup started
Clean log: WARNING Disk space low on /var partition
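
Keep in mind that /^[0-9-]+ / strips any leading run of digits and dashes, not just dates. If your logs can start with other numbers, a stricter pattern may be safer. Here's a sketch comparing the two. The second input line is a made-up example that starts with a plain number.

```shell
result=$(printf '%s\n' \
    '2024-06-29 INFO Backup completed successfully' \
    '404 errors seen today' | awk '{
    loose = $0
    strict = $0
    sub(/^[0-9-]+ /, "", loose)
    # Only match the YYYY-MM-DD shape
    sub(/^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] /, "", strict)
    print "loose:  " loose
    print "strict: " strict
}')
echo "$result"
```

The loose pattern removes "404 " from the second line, while the strict one leaves it alone.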

Let's make our output more readable by adding colored emojis.

awk '{
    sub(/ERROR/, "🔴 ERROR", $0)
    sub(/WARNING/, "🟡 WARNING", $0)
    sub(/INFO/, "🔵 INFO", $0)
    sub(/DEBUG/, "🟢 DEBUG", $0)
    print $0
}' log_entries.txt

It looks better, right?

2024-06-29 🔴 ERROR Database connection failed for user admin
2024-06-29 🟡 WARNING High memory usage detected on web01
2024-06-29 🔵 INFO Backup completed successfully
2024-06-29 🔴 ERROR Invalid login attempt from 192.168.1.100
2024-06-29 🟢 DEBUG Session cleanup started
2024-06-29 🟡 WARNING Disk space low on /var partition
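
In each of those sub() calls, the matched word is typed again in the replacement. AWK also lets you write & in the replacement to stand for whatever the regex matched. Here's a small sketch that uses a plain text marker, [!], which is my own choice, in place of the emojis.

```shell
result=$(printf '%s\n' \
    '2024-06-29 ERROR Database connection failed' \
    '2024-06-29 INFO Backup completed' | awk '{
    # & stands for the matched text; with no third argument, sub() works on $0
    sub(/ERROR|WARNING/, "[!] &")
    print
}')
echo "$result"
```

Only the ERROR line gets the marker. The INFO line passes through unchanged.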

gsub is used to replace globally, for all the matches, not just the first one like sub.

sub vs gsub

Both sub() and gsub() replace text that matches a pattern. The major difference is that gsub() replaces all the matches, while sub() replaces only the first one.

Here's a simple expression that uses sub:

echo "2024-07-17" | awk '{ x = $0; sub("-", ":", x); print x }'

It outputs:

2024:07-17

And if I use gsub:

echo "2024-07-17" | awk '{ x = $0; gsub("-", ":", x); print x }'

This time, every - is replaced with :

2024:07:17

💡 Did you notice I wrote the entire program on one line?

It could also have been written like this:

echo "2024-07-17" | awk '{ x = $0
	gsub("-", ":", x)
	print x }'

But when you write an AWK program on one line, you need to separate the statements with a semicolon (;).

In a multi-line AWK program, a newline separates the statements.

String Splitting

Let's use the server_metrics.txt file from the previous chapter:

web01 75.5 4096 85.2 45
web02 82.1 2048 78.9 62
db01 68.9 8192 92.3 38
db02 91.2 4096 88.7 71
cache01 45.3 1024 65.4 22
backup01 88.8 2048 91.1 55

Now let's parse the server names into components: the server type (web, db, and so on) and the server number:

awk '{
    # Split hostname into parts
    n = split($1, parts, /[0-9]+/)
    if (n >= 2) {
        server_type = parts[1]
        # Extract number from a copy so that $1 stays intact
        server_num = $1
        gsub(/[^0-9]/, "", server_num)
        printf "Type: %-8s Number: %s\n", server_type, server_num
    }
}' server_metrics.txt

With the regex /[0-9]+/ as the separator, split() leaves the server type in parts[1] for hostnames like "web01" and "db02". gsub() then strips every non-digit character to isolate the number.

Type: web      Number: 01
Type: web      Number: 02
Type: db       Number: 01
Type: db       Number: 02
Type: cache    Number: 01
Type: backup   Number: 01
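
The match() function from the table at the start of the chapter offers another route. It reports where the digits start (RSTART) and how long they are (RLENGTH), so substr() can cut out both parts. This is only a sketch of that alternative. The hostname "gateway" is a made-up entry without a number, added to show the else branch.

```shell
result=$(printf '%s\n' 'web01 75.5' 'cache01 45.3' 'gateway 10.0' | awk '{
    # Look for a run of digits at the end of the hostname
    if (match($1, /[0-9]+$/)) {
        server_type = substr($1, 1, RSTART - 1)
        server_num  = substr($1, RSTART, RLENGTH)
        printf "Type: %-8s Number: %s\n", server_type, server_num
    } else {
        print "No number in: " $1
    }
}')
echo "$result"
```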

String Formatting

Create a server status report:

awk '{
    status = "OK"
    if ($2 > 85) status = "HIGH CPU"
    if ($4 > 90) status = "HIGH I/O"
    if ($5 > 70) status = "HOT"
    
    printf "%-12s | CPU:%5.1f%% | RAM:%4dMB | I/O:%5.1f | %3d°C | %s\n", 
           $1, $2, $3, $4, $5, status
}' server_metrics.txt

It creates a formatted status dashboard with aligned columns and status indicators. I'll let you figure out the code on your own as a practice exercise.
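
While printf prints the formatted text right away, sprintf() returns it as a string that you can store and reuse. Here's a small sketch of that. The input lines and the host naming scheme (type plus a two-digit number) are assumptions made for illustration.

```shell
result=$(printf '%s\n' 'web 7 75.5' 'db 12 68.9' | awk '{
    # Build a zero-padded hostname, then reuse it inside a second format
    host  = sprintf("%s%02d", $1, $2)
    label = sprintf("%-8s cpu=%5.1f%%", host, $3)
    print "[" label "]"
}')
echo "$result"
```

The format specifiers work the same way as in printf: %02d pads the number with a leading zero and %% prints a literal percent sign.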

🪧 Time to recall

In this chapter, you learned the following:

String functions: Manipulate text with length, substr, index, split
Case conversion: Standardize data with toupper and tolower
Pattern replacement: Clean and transform text with sub and gsub

These functions transform AWK from a simple text processor into a complete data analysis and reporting tool.

Practice Exercises

Try these exercises with the sample files I've provided:

1. Extract just the domain names from all email addresses

2. Clean email addresses by converting to lowercase and removing extra spaces

3. Calculate the percentage of disk space used across all partitions with the following data:

/dev/sda1 50G 35G 12G 75% /
/dev/sda2 100G 80G 15G 85% /home
/dev/sda3 20G 8G 11G 45% /var
/dev/sdb1 500G 300G 175G 65% /data

In the next chapter, you'll learn about arrays - AWK's most powerful feature for advanced data analysis and grouping!