Chapter 5: Manipulating Strings in AWK
AWK isn't just a text processor - it's your personal calculator and string manipulation wizard rolled into one. You have already experienced AWK's mathematical capabilities. Now explore the string functions.
AWK offers a bunch of built-in functions to deal with strings. Here's a quick look at them:
Function | Purpose | Syntax | Example | Result |
---|---|---|---|---|
length(s) | String length | length(string) | length("hello") | 5 |
substr(s,p,n) | Extract substring | substr(string, pos, len) | substr("hello", 2, 3) | "ell" |
index(s,t) | Find substring position | index(string, target) | index("hello", "ll") | 3 |
split(s,a,fs) | Split into array | split(string, array, sep) | split("a,b,c", arr, ",") | 3 (returns count) |
sub(r,s,t) | Replace first match | sub(regex, replacement, target) | s = "hello"; sub(/o/, "0", s) | 1 (s becomes "hell0") |
gsub(r,s,t) | Replace all matches | gsub(regex, replacement, target) | s = "hello"; gsub(/l/, "1", s) | 2 (s becomes "he11o") |
match(s,r) | Find regex position | match(string, regex) | match("hello", /ll/) | 3 |
sprintf(f,...) | Format string | sprintf(format, args...) | sprintf("%.2f", 3.14159) | "3.14" |
tolower(s) | Convert to lowercase | tolower(string) | tolower("HELLO") | "hello" |
toupper(s) | Convert to uppercase | toupper(string) | toupper("hello") | "HELLO" |
AWK's string functions let you clean, transform, and analyze text data with precision.
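The table lists match(), which the rest of this chapter does not demonstrate. As a minimal sketch: besides returning the start position, match() also sets the built-in variables RSTART and RLENGTH, which pair naturally with substr(). The helper name first_ip and the deliberately loose IP regex (it does not validate octet ranges) are assumptions made for illustration only.

```shell
# first_ip is a hypothetical helper: prints the first IPv4-looking token on each line
first_ip() {
    awk '{
        # match() returns the start position (0 if nothing matches)
        # and sets RSTART and RLENGTH as a side effect
        if (match($0, /[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/))
            print substr($0, RSTART, RLENGTH)
    }'
}

echo "2024-06-29 ERROR Invalid login attempt from 192.168.1.100" | first_ip
# prints: 192.168.1.100
```

Lines without a match print nothing, because match() returns 0 and the if branch is skipped.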
Let's create a few sample test files.
user_data.txt:
john.smith@company.com John Smith Active
mary.johnson@company.com Mary Johnson Inactive
bob.wilson@COMPANY.COM Bob Wilson Active
alice.brown@company.com Alice Brown Pending
charlie.davis@external.org Charlie Davis Active
sara.miller@company.com Sara Miller Active
And log_entries.txt:
2024-06-29 ERROR Database connection failed for user admin
2024-06-29 WARNING High memory usage detected on web01
2024-06-29 INFO Backup completed successfully
2024-06-29 ERROR Invalid login attempt from 192.168.1.100
2024-06-29 DEBUG Session cleanup started
2024-06-29 WARNING Disk space low on /var partition
Length and Substring Functions
Let's find the users who have long email addresses in the user_data.txt file:
awk '{
    email_length = length($1)
    printf "%-25s: %d characters", $1, email_length
    if (email_length > 23) printf " (long email)"
    printf "\n"
}' user_data.txt
The length() function returns the length of the string, and a simple comparison does the rest.
john.smith@company.com   : 22 characters
mary.johnson@company.com : 24 characters (long email)
bob.wilson@COMPANY.COM   : 22 characters
alice.brown@company.com  : 23 characters
charlie.davis@external.org: 26 characters (long email)
sara.miller@company.com  : 23 characters
Notice how the email address with 26 characters doesn't fit in the specified field of 25?
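If a strictly aligned column matters more than showing the full value, one option is to add a precision to the format: %-25.25s pads short strings and cuts long ones at 25 characters. A small sketch under that assumption (the helper name fixed_width is made up for this example):

```shell
# fixed_width is a hypothetical helper: pads or truncates field 1 to exactly 25 characters
fixed_width() {
    awk '{ printf "%-25.25s: %d characters\n", $1, length($1) }'
}

printf '%s\n' "charlie.davis@external.org Charlie Davis Active" \
              "john.smith@company.com John Smith Active" | fixed_width
# prints:
# charlie.davis@external.or: 26 characters
# john.smith@company.com   : 22 characters
```

The trade-off is that truncation hides data, so this suits display-only reports rather than output that another program will read.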
Now let's take it to the next level by extracting the user and domain names from each email address. This is a job for the substr() function.
awk '{
    at_position = index($1, "@")
    if (at_position > 0) {
        domain = substr($1, at_position + 1)
        username = substr($1, 1, at_position - 1)
        printf "User: %-15s Domain: %s\n", username, domain
    }
}' user_data.txt
What I did here was use index() to find the position of the @ symbol and substr() to split the email into its username and domain parts.
User: john.smith Domain: company.com
User: mary.johnson Domain: company.com
User: bob.wilson Domain: COMPANY.COM
User: alice.brown Domain: company.com
User: charlie.davis Domain: external.org
User: sara.miller Domain: company.com
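The same result can also be reached with split(), using @ as the separator. This is an alternative sketch, not a replacement for the index() and substr() approach; the helper name split_email is hypothetical.

```shell
# split_email is a hypothetical helper: splits field 1 at the @ sign
split_email() {
    awk '{
        # split() returns the number of pieces; a well-formed address yields exactly 2
        if (split($1, parts, "@") == 2)
            printf "User: %-15s Domain: %s\n", parts[1], parts[2]
    }'
}

echo "john.smith@company.com John Smith Active" | split_email
```

Checking the return value doubles as a light sanity check: a field with no @, or with more than one, is skipped.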
Case Conversion Functions
Let's standardize the email addresses to lowercase using the tolower function:
awk '{ printf "%-25s -> %s\n", $1, tolower($1) }' user_data.txt
It converts the whole email address to lowercase for consistent formatting.
john.smith@company.com -> john.smith@company.com
mary.johnson@company.com -> mary.johnson@company.com
bob.wilson@COMPANY.COM -> bob.wilson@company.com
alice.brown@company.com -> alice.brown@company.com
charlie.davis@external.org -> charlie.davis@external.org
sara.miller@company.com -> sara.miller@company.com
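The command above lowercases the whole address. If only the domain part should be normalized, index() and substr() from the previous section can be combined with tolower(). A sketch, with lower_domain as a hypothetical helper name and a mixed-case input line invented for the demonstration:

```shell
# lower_domain is a hypothetical helper: lowercases only the part after the @
lower_domain() {
    awk '{
        at = index($1, "@")
        if (at > 0)
            # keep everything up to and including @, lowercase the rest
            print substr($1, 1, at) tolower(substr($1, at + 1))
        else
            print $1    # no @ found: leave the value untouched
    }'
}

echo "Bob.Wilson@COMPANY.COM Bob Wilson Active" | lower_domain
# prints: Bob.Wilson@company.com
```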
Let's take it a bit further by creating display names from email addresses.
Extract the username first, capitalize its first letter, and then replace the dot (.) with a space.
awk '{
    at_position = index($1, "@")
    username = substr($1, 1, at_position - 1)
    # Capitalize the first letter of the username
    display_name = toupper(substr(username, 1, 1)) substr(username, 2)
    gsub(/\./, " ", display_name)   # Replace dots with spaces
    printf "Email: %-25s Display: %s\n", $1, display_name
}' user_data.txt
Worth noticing is substr(username, 2), which has no length specified and thus takes everything from position 2 to the end. I also used the global gsub. The sample data doesn't need it, but if a username had several names separated by dots, they would all be handled properly. More on it in the next section.
Email: john.smith@company.com Display: John smith
Email: mary.johnson@company.com Display: Mary johnson
Email: bob.wilson@COMPANY.COM Display: Bob wilson
Email: alice.brown@company.com Display: Alice brown
Email: charlie.davis@external.org Display: Charlie davis
Email: sara.miller@company.com Display: Sara miller
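The surnames in this output stay lowercase because only the first character of the whole username is capitalized. One way to capitalize every dot-separated part is to split() the username and rebuild it in a loop. A minimal sketch, assuming usernames are dot-separated; display_name here is a hypothetical shell helper, not part of the original command:

```shell
# display_name is a hypothetical helper: builds "First Last" from "first.last@..."
display_name() {
    awk '{
        at = index($1, "@")
        username = substr($1, 1, at - 1)
        n = split(username, parts, /\./)
        name = ""
        for (i = 1; i <= n; i++) {
            # uppercase the first letter, lowercase the remainder of each part
            word = toupper(substr(parts[i], 1, 1)) tolower(substr(parts[i], 2))
            name = (i == 1) ? word : name " " word
        }
        printf "Email: %-25s Display: %s\n", $1, name
    }'
}

echo "john.smith@company.com John Smith Active" | display_name
```

Looping from 1 to n keeps the parts in their original order, so a three-part username is handled the same way as a two-part one.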
Pattern Replacement with sub
Clean up log entries by removing timestamps. Use the substitute function sub() to remove the date pattern from the beginning of each log line.
awk '{
    # Remove date at beginning
    sub(/^[0-9-]+ /, "", $0)
    print "Clean log:", $0
}' log_entries.txt
The regex looks for one or more digits or dashes at the beginning of the line, followed by a space, and replaces the match with ... nothing, thus removing that part completely.
Clean log: ERROR Database connection failed for user admin
Clean log: WARNING High memory usage detected on web01
Clean log: INFO Backup completed successfully
Clean log: ERROR Invalid login attempt from 192.168.1.100
Clean log: DEBUG Session cleanup started
Clean log: WARNING Disk space low on /var partition
Let's make our output more readable by adding color emojis.
awk '{
    sub(/ERROR/, "🔴 ERROR", $0)
    sub(/WARNING/, "🟡 WARNING", $0)
    sub(/INFO/, "🔵 INFO", $0)
    sub(/DEBUG/, "🟢 DEBUG", $0)
    print $0
}' log_entries.txt
It looks better, right?
2024-06-29 🔴 ERROR Database connection failed for user admin
2024-06-29 🟡 WARNING High memory usage detected on web01
2024-06-29 🔵 INFO Backup completed successfully
2024-06-29 🔴 ERROR Invalid login attempt from 192.168.1.100
2024-06-29 🟢 DEBUG Session cleanup started
2024-06-29 🟡 WARNING Disk space low on /var partition
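The four sub() calls can also be collapsed into one by using alternation in the regex and & in the replacement, where & stands for whatever text the regex matched. A sketch of that idea, using plain brackets instead of emojis; tag_level is a hypothetical helper name:

```shell
# tag_level is a hypothetical helper: wraps the first log level on each line in brackets
tag_level() {
    awk '{
        # & in the replacement is the matched text; sub() changes only the first match
        # with no third argument, sub() works on $0
        sub(/ERROR|WARNING|INFO|DEBUG/, "[&]")
        print
    }'
}

echo "2024-06-29 ERROR Database connection failed for user admin" | tag_level
# prints: 2024-06-29 [ERROR] Database connection failed for user admin
```

This does not give each level its own marker the way the separate calls do, so it fits best when the same decoration applies to every level.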
gsub is used to replace globally, for all the matches, not just the first one like sub.
sub vs gsub
Both sub() and gsub() replace text that matches a pattern. The major difference is that gsub replaces all the matches while sub only replaces the first match.
Here's a simple expression that uses sub:
echo "2024-07-17" | awk '{ x = $0; sub("-", ":", x); print x }'
It outputs:
2024:07-17
And if I use gsub:
echo "2024-07-17" | awk '{ x = $0; gsub("-", ":", x); print x }'
The date is properly formatted with all - replaced with :
2024:07:17
💡 Did you notice I used the entire expression in one line?
It could also have been written like this:
echo "2024-07-17" | awk '{ x = $0
gsub("-", ":", x)
print x }'
But when you write AWK code on one line, you need to separate the statements with a semicolon (;).
In multiline AWK code, the statements are separated by newlines.
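Both functions also return the number of substitutions they made, which turns gsub() into a handy counter. A small sketch; count_dashes is a hypothetical helper, and the replacement "&" puts each match back unchanged so the text itself is not altered:

```shell
# count_dashes is a hypothetical helper: prints how many dashes a line has, then the line
count_dashes() {
    awk '{ x = $0; n = gsub(/-/, "&", x); print n, x }'
}

echo "2024-07-17" | count_dashes
# prints: 2 2024-07-17
```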
String Splitting
Let's use the server_metrics.txt file from the previous chapter:
web01 75.5 4096 85.2 45
web02 82.1 2048 78.9 62
db01 68.9 8192 92.3 38
db02 91.2 4096 88.7 71
cache01 45.3 1024 65.4 22
backup01 88.8 2048 91.1 55
Now, let's parse the server names into components: what type of server each one is (web, db, and so on) and what number it has:
awk '{
    # Split hostname into parts
    n = split($1, parts, /[0-9]+/)
    if (n >= 2) {
        server_type = parts[1]
        # Extract number from a copy so $1 and $0 stay intact
        server_num = $1
        gsub(/[^0-9]/, "", server_num)
        printf "Type: %-8s Number: %s\n", server_type, server_num
    }
}' server_metrics.txt
The split() function separates server type from number in hostname patterns like "web01", "db02".
Type: web Number: 01
Type: web Number: 02
Type: db Number: 01
Type: db Number: 02
Type: cache Number: 01
Type: backup Number: 01
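split() also accepts a plain string separator, and its return value tells you how many elements the array holds. As a sketch, this breaks the date field of a log line into its parts and walks the array by index; date_parts is a hypothetical helper name:

```shell
# date_parts is a hypothetical helper: lists the dash-separated pieces of field 1
date_parts() {
    awk '{
        n = split($1, d, "-")
        # loop by index: "for (k in d)" does not guarantee any particular order
        for (i = 1; i <= n; i++)
            printf "part %d = %s\n", i, d[i]
    }'
}

echo "2024-06-29 INFO Backup completed successfully" | date_parts
# prints:
# part 1 = 2024
# part 2 = 06
# part 3 = 29
```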
String Formatting
Create a server status report:
awk '{
    status = "OK"
    if ($2 > 85) status = "HIGH CPU"
    if ($4 > 90) status = "HIGH I/O"
    if ($5 > 70) status = "HOT"
    printf "%-12s | CPU:%5.1f%% | RAM:%4dMB | I/O:%5.1f | %3d°C | %s\n",
        $1, $2, $3, $4, $5, status
}' server_metrics.txt
It creates a formatted status dashboard with aligned columns and status indicators. I'll let you figure out the code on your own as a practice exercise.
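One detail to watch while reading it: each if assigns to the same variable, so a later match overwrites an earlier one. If you would rather see every condition that fired, a variant can append to the string instead. This is only a sketch under the same thresholds; all_flags and the hyphenated labels are hypothetical names chosen for the example:

```shell
# all_flags is a hypothetical helper: lists every threshold a server exceeds
all_flags() {
    awk '{
        status = ""
        # string concatenation keeps earlier flags instead of overwriting them
        if ($2 > 85) status = status " HIGH-CPU"
        if ($4 > 90) status = status " HIGH-IO"
        if ($5 > 70) status = status " HOT"
        if (status == "") status = " OK"
        printf "%-10s|%s\n", $1, status
    }'
}

echo "db02 91.2 4096 88.7 71" | all_flags
# prints: db02      | HIGH-CPU HOT
```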
🪧 Time to recall
In this chapter, you learned the following:
- String functions: Manipulate text with length, substr, index, split
- Case conversion: Standardize data with toupper and tolower
- Pattern replacement: Clean and transform text with sub and gsub
These functions transform AWK from a simple text processor into a complete data analysis and reporting tool.
Practice Exercises
Try these exercises with the sample files I've provided:
1. Extract just the domain names from all email addresses
2. Clean email addresses by converting to lowercase and removing extra spaces
3. Calculate the percentage of disk space used across all partitions with the following data:
/dev/sda1 50G 35G 12G 75% /
/dev/sda2 100G 80G 15G 85% /home
/dev/sda3 20G 8G 11G 45% /var
/dev/sdb1 500G 300G 175G 65% /data
In the next chapter, you'll learn about arrays - AWK's most powerful feature for advanced data analysis and grouping!