Understanding AWK

Jaxon · November 17, 2021, 13:10

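Background

I have a confession to make: I don't know how to use Awk. Or at least I didn't until I started writing this article. I would hear people mention Awk and how often they use it, and I was pretty sure I was missing out on a minor superpower.

Like this offhand comment from Bryan Cantrill:

I write three or four awk programs a day. And these are one-liners. These super quick programs.

It turns out Awk is pretty simple. It has only a couple of conventions and a small amount of syntax. As a result, it is straightforward to learn, and once you understand it, it will come in handy more often than you'd think.

So in this article, I will teach myself, and you, the basics of Awk. If you read through the article and maybe even try an example or two, you should have no problem writing Awk scripts by the end of it. And you probably don't even need to install anything, because Awk is everywhere.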

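The Plan

I will use Awk to look at book reviews and pick my next book to read. I will start with short Awk one-liners and build toward a simple 33-line program that ranks books by the 19 million reviews on Amazon.com.

What Is Awk

Awk is a record-processing tool written by Aho, Kernighan, and Weinberger in 1977. Its name is formed from the initials of their surnames.

They created it following the success of the line-processing tools sed and grep. Awk was initially an experiment by the authors into whether text-processing tools could be extended to deal with numbers. If grep lets you search for lines, and sed lets you do replacements in lines, then Awk was designed to let you do calculations on lines. It will be clear what that means once I take us through some examples.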

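How to Install `gawk`

The biggest reason to learn Awk is that it's on pretty much every single Linux distribution. You might not have perl or Python. You WILL have Awk. Only the most minimal of minimal Linux systems will exclude it. Even busybox includes awk. That's how essential it's viewed.

cogman10

Awk is part of the Portable Operating System Interface (POSIX). This means it's already on your MacBook and your Linux server. There are several versions of Awk, but for the basics, whatever Awk you have will do.

If you can run this, you can do the rest of this tutorial: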

$ awk --version

  GNU Awk 5.1.0, API: 3.0 (GNU MPFR 4.1.0, GNU MP 6.2.1)
  Copyright (C) 1989, 1991-2020 Free Software Foundation.

If you are doing something more involved with Awk, take the time to install GNU Awk (gawk). I did this using Homebrew (brew install gawk). Windows users can get gawk using Chocolatey (choco install gawk). If you are on Linux, you probably already have it.

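Awk Print

By default, Awk expects to receive its input on standard input and output its results to standard output. Thus, the simplest thing you can do in Awk is print a line of input.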

$ echo "one two three" | awk '{ print }'
one two three

Note the braces. This syntax will start to make sense after you see a couple of examples.

You can selectively choose columns (which Awk calls fields):

$ echo "one two three" | awk '{ print $1 }'
one
$ echo "one two three" | awk '{ print $2 }'
two
$ echo "one two three" | awk '{ print $3 }'
three

You may have been expecting the first column to be $0 and not $1, but $0 is something different:

$ echo "one two three" | awk '{ print $0 }'

one two three

It is the entire line! Incidentally, Awk refers to each line as a record and each column as a field.

I can do this across multiple lines:

$ echo "
 one two three
 four five six" \
 | awk '{ print $1 }'

one
four

I can print more than one column:

$ echo "
 one two three
 four five six" \
| awk '{ print $1, $2 }'

one two
four five

Awk also includes $NF for accessing the last column:

$ echo "
 one two three
 four five six" \
| awk '{ print $NF }'

three
six

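What I've Learned: Awk Field Variables

Awk creates a variable for each field (column) in a record (line): $1, $2 ... $NF. $0 refers to the whole record.

You can print out fields like this: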

$ awk '{ print $1, $2, $7 }'

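Awk Sample Data

To move beyond simple printing, I need some sample data. For the rest of this tutorial, I will use the book portion of the Amazon product reviews dataset.

This dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 to July 2014.

Amazon Review Dataset

You can grab the book portion of it like this: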

$ curl https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Books_v1_00.tsv.gz | \
  gunzip -c >> \
  bookreviews.tsv

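If you want to follow along with the entire dataset, repeat this for each of the three book files (v1_00, v1_01, v1_02).

Disk Space Warning

The above file is over 6 GB unzipped. If you grab all three, you will use up to 15 GB of disk space. If you don't have much space, you can play along by grabbing the first ten thousand rows of the first file: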

$ curl https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Books_v1_00.tsv.gz \
  | gunzip -c \
  | head -n 10000 \
  > bookreviews.tsv

Book Data

Once you've grabbed that data, you should have Amazon book review data that looks like this:

marketplace customer_id review_id product_id product_parent product_title product_category star_rating helpful_votes total_votes vine verified_purchase review_headline review_body review_date
US 22480053 R28HBXXO1UEVJT 0843952016 34858117 The Rising Books 5 0 0 N N Great Twist on Zombie Mythos I've known about this one for a long time, but just finally got around to reading it for the first time. I enjoyed it a lot!  What I liked the most was how it took a tired premise and breathed new life into it by creating an entirely new twist on the zombie mythos. A definite must read! 2012-05-03

Each row in this file represents the record of one book review. Amazon lays out the fields like so:

DATA COLUMNS:
01  marketplace       - 2 letter country code of the marketplace where the review was written.
02  customer_id       - Random identifier that can be used to aggregate reviews written by a single author.
03  review_id         - The unique ID of the review.
04  product_id        - The unique Product ID the review pertains to. 
05  product_parent    - Random identifier that can be used to aggregate reviews for the same product.
06  product_title     - Title of the product.
07  product_category  - Broad product category that can be used to group reviews 
08  star_rating       - The 1-5 star rating of the review.
09  helpful_votes     - Number of helpful votes.
10  total_votes       - Number of total votes the review received.
11  vine              - Review was written as part of the Vine program.
12  verified_purchase - The review is on a verified purchase.
13  review_headline   - The title of the review.
14  review_body       - The review text.
15  review_date       - The date the review was written.

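Printing Book Data

I can now test out my field printing skills on a bigger file. I can start by printing fields that I care about, like the marketplace: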

$ awk '{ print $1 }' bookreviews.tsv | head

marketplace
US
US
US
US
US
US
US
US
US

Or the customer_id:

$ awk '{ print $2 }' bookreviews.tsv | head

customer_id
22480053
44244451
20357422
13235208
26301786
27780192
13041546
51692331
23108524

However, when I try to print out the title, things do not go well:

$ awk '{ print $6 }' bookreviews.tsv | head

product_title
The
Sticky
Black
Direction
Until
Unfinished
The
Good
Patterns

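To fix this, I need to configure my field separators.

Field Separators

By default, Awk assumes that the fields in a record are whitespace-delimited. I can change the field separator to use tabs with the awk -F option: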

$ awk -F '\t' '{ print $6 }' bookreviews.tsv | head

product_title
The Rising
Sticky Faith Teen Curriculum with DVD: 10 Lessons to Nurture Faith Beyond High 
Black Passenger Yellow Cabs: Of Exile And Excess In Japan
Direction and Destiny in the Birth Chart
Until the Next Time
Unfinished Business
The Republican Brain: The Science of Why They Deny Science- and Reality
Good Food: 101 Cakes & Bakes
Patterns and Quilts (Mathzones)

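What I've Learned: Awk Field Separators

Awk assumes that the fields in a record are whitespace-delimited.

You can change this using the -F option like this: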

$ awk -F '\t' '{ print $6 }'
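
To try this without downloading the dataset, here is a minimal sketch on a single made-up record (the sample text is an assumption, not a line from the review file). It splits the same line first on whitespace and then on tabs:

```shell
# One made-up record: two tab-separated fields, each containing a space.

# Default splitting on whitespace: the second field is a single word.
printf 'The Rising\tGreat Twist\n' | awk '{ print $2 }'
# Rising

# Splitting on tabs: the second field is the whole second column.
printf 'The Rising\tGreat Twist\n' | awk -F '\t' '{ print $2 }'
# Great Twist
```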

I can also work backward from the last field by subtracting from NF.

$ awk -F '\t' '{ print $NF "\t" $(NF-2)}' bookreviews.tsv | head

review_date     review_headline
2012-05-03      Great Twist on Zombie Mythos
2012-05-03      Helpful and Practical
2012-05-03      Paul
2012-05-03      Direction and Destiny in the Birth Chart
2012-05-03      This was Okay
2012-05-03      Excellent read!!!
2012-05-03      A must read for science thinkers
2012-05-03      Chocoholic heaven
2012-05-03      Quilt Art Projects for Children

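Side Note: NF and NR

$NF seems like an unusual name for printing the last column, right? But actually, NF is a variable holding the number of fields in a record. So I'm just using its value as an index to refer to the last field.

I can print the actual value like this: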

$ awk -F '\t' '{ print NF }' bookreviews.tsv | head

Another variable Awk offers is NR, the number of records so far. NR is handy when I need to print line numbers:

$ awk -F '\t' '{ print NR " " $(NF-2) }' bookreviews.tsv | head

1 review_headline
2 Great Twist on Zombie Mythos
3 Helpful and Practical
4 Paul
5 Direction and Destiny in the Birth Chart
6 This was Okay
7 Excellent read!!!
8 A must read for science thinkers
9 Chocoholic heaven
10 Quilt Art Projects for Children
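
The two variables can be seen side by side in a small sketch on made-up input (two records with different field counts, chosen only for illustration). Each output line shows the record number, the field count, and the last field:

```shell
# Two made-up records: the first has three fields, the second has two.
printf 'one two three\nfour five\n' | awk '{ print NR, NF, $NF }'
# 1 3 three
# 2 2 five
```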

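Awk Pattern Matching With Regular Expressions

Everything I've done so far has applied to every line in our file, but the real power of Awk comes from pattern matching. You can give Awk a pattern to match each line on, like this: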

$ echo "aa
        bb
        cc" | awk '/bb/'
bb

You could replace grep this way. You can also combine this with the field access and printing we've done so far:

$ echo "aa 10
        bb 20
        cc 30" | awk '/bb/ { print $2 }'
20

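Using this knowledge, I can easily grab reviews by book title and print the book title ($6) and review score ($8).

The reviews I'm going to focus on today are for the book The Hunger Games. I chose it because it's part of a series with many reviews, and I remember liking the movie. So I'm wondering whether I should read it.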

$ awk -F '\t' '/Hunger Games/ { print $6, $8  }' bookreviews.tsv | head

The Hunger Games (Book 1) 5
The Hunger Games Trilogy Boxed Set 5
The Hunger Games Trilogy: The Hunger Games / Catching Fire / Mockingjay 5
Catching Fire |Hunger Games|2 4
The Hunger Games (Book 1) 5
Catching Fire |Hunger Games|2 5
The Hunger Games Trilogy: The Hunger Games / Catching Fire / Mockingjay 5
Blackout 3
The Hunger Games Trilogy: The Hunger Games / Catching Fire / Mockingjay 4
Tabula Rasa 3

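I should be able to pull valuable data from these reviews, but first there is a problem. I'm getting reviews for more than one book here. /Hunger Games/ matches anywhere in the line, so I'm getting all kinds of Hunger Games books returned. I'm even getting books that merely mention "Hunger Games" in the review text: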

$ awk -F '\t' '/Hunger Games/ { print $6 }' bookreviews.tsv | sort | uniq

Birthmarked
Blood Red Road
Catching Fire (The Hunger Games)
Divergent
Enclave
Fire (Graceling Realm Books)
Futuretrack 5
Girl in the Arena
...

I can fix this by matching on the product_id field instead:

$ awk -F '\t' '$4 == "0439023483" { print $6 }' bookreviews.tsv | sort |  uniq 
The Hunger Games (The Hunger Games, Book 1)

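I want to calculate the average review score for The Hunger Games, but first, let's take a look at the review_date ($15), the review_headline ($13), and the star_rating ($8) of our Hunger Games reviews to get a feel for the data: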

$ awk -F '\t' '$4 == "0439023483" { print $15 "\t" $13 "\t" $8}' bookreviews.tsv | head

2015-08-19      Five Stars      5
2015-08-17      Five Stars      5
2015-07-23      Great read      5
2015-07-05      Awesome 5
2015-06-28      Epic start to an epic series    5
2015-06-21      Five Stars      5
2015-04-19      Five Stars      5
2015-04-12      i lile the book 3
2015-03-28      What a Great Read, i could not out it down   5
2015-03-28      Five Stars      5

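Look at those star ratings. Yes, the book gets many 5-star reviews, but more importantly, the layout of my text table looks horrible: the width of the review titles is breaking the layout.

To fix this, I need to switch from using print to using printf.

What I've Learned: Awk Pattern Matching

I've learned that an Awk action, like { print $4 }, can be preceded by a pattern, like /regex/. Without a pattern, the action runs on all lines.

You can use a simple regular expression for the pattern. In that case, it matches anywhere in the line, like grep: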

$ awk '/hello/ { print "This line contains hello", $0}'

Or you can match within a specific field:

$ awk '$4~/hello/ { print "This field contains hello", $4}'

Or you can exact-match a field:

$ awk '$4 == "hello" { print "This field is hello:", $4}'
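
To see how the three forms differ, here is a small sketch that runs each of them against the same made-up records (the sample text and the choice of field 2 are assumptions for illustration). Each command prints the first field of the records it matches:

```shell
# Three made-up records; field 2 is "hello", "oh", and "hellos".
sample='1 hello there
2 oh hello
3 hellos all'

# Regex anywhere in the line: all three records match.
printf '%s\n' "$sample" | awk '/hello/ { print $1 }'
# 1
# 2
# 3

# Regex within field 2 only: "oh" does not match.
printf '%s\n' "$sample" | awk '$2 ~ /hello/ { print $1 }'
# 1
# 3

# Exact match on field 2: only the first record.
printf '%s\n' "$sample" | awk '$2 == "hello" { print $1 }'
# 1
```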

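Awk `printf`

printf works like it does in C and uses a format string and a list of values. You can use %s to print the next string value.

So my print $15 "\t" $13 "\t" $8 becomes printf "%s \t %s \t %s", $15, $13, $8.

From there, I can add right padding and fix my layout by changing %s to %-Ns, where N is my desired column width: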

$ awk -F '\t' '$4 == "0439023483" { printf "%s \t %-20s \t %s \n", $15, $13, $8}' bookreviews.tsv | head

2015-08-19       Five Stars              5 
2015-08-17       Five Stars              5 
2015-07-23       Great read              5 
2015-07-05       Awesome                 5 
2015-06-28       Epic start to an epic series    5 
2015-06-21       Five Stars              5 
2015-04-19       Five Stars              5 
2015-04-12       i lile the book         3 
2015-03-28       What a Great Read, i could not out it down   5 
2015-03-28       Five Stars              5

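This table is pretty close to what I want. However, some of the titles are just too long. I can shorten them to 20 characters with substr($13,1,20).

Putting it all together, I get: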

$ awk -F '\t' '$4 == "0439023483" { printf "%s \t %-20s \t %s \n", $15, substr($13,1,20), $8}' bookreviews.tsv | head

2015-08-19       Five Stars              5 
2015-08-17       Five Stars              5 
2015-07-23       Great read              5 
2015-07-05       Awesome                 5 
2015-06-28       Epic start to an epi    5 
2015-06-21       Five Stars              5 
2015-04-19       Five Stars              5 
2015-04-12       i lile the book         3 
2015-03-28       What a Great Read, i    5 
2015-03-28       Five Stars              5

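Alright, I think at this point I'm ready to move on to the star calculations.

What I've Learned: printf and Built-ins

If you need to print out a table, Awk lets you use printf and built-ins like substr to format your output.

It ends up looking something like this: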

$ awk '{ printf "%s \t %-5s", $1, substr($2,1,5) }'

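Awk's printf works much like C's printf. You can use %s to insert a string into the format string, and other flags let you set the width or precision. For more information on printf or other built-ins, you can consult an Awk reference document.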
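
As a rough illustration of width and precision together, the sketch below formats two made-up rows (the titles, ratings, and the 12-character width are assumptions). %-12s left-aligns and pads the title to 12 characters, substr trims it to fit, and %5.1f prints the number right-aligned with one decimal place:

```shell
# Two made-up, tab-separated rows: a headline and a rating.
printf 'Awesome\t5\nEpic start to an epic series\t4.5\n' |
  awk -F '\t' '{ printf "%-12s|%5.1f\n", substr($1, 1, 12), $2 }'
# Awesome     |  5.0
# Epic start t|  4.5
```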

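Awk `BEGIN` and `END` Actions

I want to calculate the average rating for book reviews in this data set. To do that, I need to use a variable. However, I don't need to declare the variable or its type. I can just use it.

I can add up and print out a running total of the star rating ($8) like this: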

$ awk -F '\t' '{ total = total + $8; print total }' bookreviews.tsv | head

0
5
10
...

And to turn this into an average, I can use NR to get the total number of records and END to run an action at the end of processing.

$ awk -F '\t' '
    { total = total + $8 }
END { print "Average book review is", total/NR, "stars" }
' bookreviews.tsv | head

Average book review is 4.24361 stars

I can also use BEGIN to run an action before Awk starts processing records.

$ awk -F '\t' '
BEGIN { print "Calculating Average ..." } 
      { total = total + $8 }
END   { print "Average book review is", total/NR, "stars" }
' bookreviews.tsv

Calculating Average ...
Average book review is 4.24361 stars

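What I've Learned: Awk BEGIN, END, and Variables

Awk provides two special patterns, BEGIN and END. You can use them to run actions before and after processing the records. For example, this is how you would initialize data, print headers and footers, or do any start-up or tear-down work in Awk.

It ends up looking like this: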

$ awk '
BEGIN { print "start up" }
      { print "line match" }
END   { print "tear down" }'

You can also easily use variables in Awk. No declaration is needed.

$ awk -F '\t' '{ total = total + $8 }'
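
One detail worth keeping in mind: NR counts every record, including a header row, and a non-numeric header cell counts as 0 when added up. On millions of rows the effect is tiny, but on small inputs it shows. Here is a sketch, on made-up data with a hypothetical two-column layout, that skips the header and keeps its own count:

```shell
# Made-up data: a header row followed by three ratings.
printf 'title\tstars\nA\t5\nB\t4\nC\t3\n' | awk -F '\t' '
NR > 1 { total = total + $2; count = count + 1 }   # skip the header row
END    { if (count > 0) printf "%.2f\n", total / count }'
# 4.00
```

Dividing by NR instead would give 12/4 = 3, because the header is counted as a fourth record.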

Fun Awk One-Liners

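Before we leave the world of one-liners behind, I reached out to my friends to ask when they use Awk day-to-day. Here are some of the examples I got back.

Printing files with a human-readable size: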

$ ls -lh | awk '{ print $5,"\t", $9 }'

7.8M     The_AWK_Programming_Language.pdf
6.2G     bookreviews.tsv

Getting the container IDs of your Docker containers (docker ps -a lists stopped ones too):

$ docker ps -a |  awk '{ print $1 }'

CONTAINER
08488f220f76
3b7fc673649f

You can combine that last one with a regex to focus on a line you care about. Here I stop postgres, regardless of its tag name:

$ docker stop "$(docker ps -a |  awk '/postgres/{ print $1 }')"

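You get the idea. If you have a table of space-delimited text returned by some tool, then Awk can easily slice and dice it.

Awk Scripting Examples

If you pick your constraints, you can make a particular envelope of uses easy and the ones you don't care about hard. Awk's choice to be a per-line processor, with optional sections for processing before all lines and after all lines, is self-limiting, but it defines a useful envelope of use.

Michael Feathers

In my mind, once an Awk program spans multiple lines, it's time to consider putting it into a file.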

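Side Note: Why Awk Scripting

Once we move beyond one-liners, a natural question is why. As in, "Why not use Python? Isn't it good at this type of thing?"

I have a couple of answers for that.

First, Awk is great for writing programs that are, at their core, a glorified for loop over some input. If that is what you are doing, and the control flow is limited, using Awk will be more concise than Python.

Second, if you need to rewrite your Awk program into Python at some point, so be it. It's not going to be more than 100 lines of code, and the translation process will be straightforward.

Third, why not? Learning a new tool can be fun.

We've now crossed over from one-liners into Awk scripting. With Awk, the transition is smooth. I can now embed Awk into a bash script: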

$ cat average

exec awk -F '\t' '
    { total = total + $8 }
END { print "Average book review is", total/NR, "stars" } 
' "$1"

$ ./average bookreviews.tsv

Average book review is 4.2862 stars

Or I can use a shebang (#!):

$ cat average.awk

#!/usr/bin/env -S gawk -f

BEGIN { FS = "\t" }
{ total = total + $8 }
END { print "Average book review is", total/NR, "stars" }

And run it like this:

$ ./average.awk bookreviews.tsv

Or you can also pass it to awk directly using -f:

$ awk -f average.awk bookreviews.tsv

Side Note: BEGIN FS

If you use a shebang or pass the script to Awk directly, it's easiest to set the field separator with FS = "\t" in a BEGIN action.

BEGIN { FS = "\t" }
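
As a quick check that the two approaches split fields the same way, this sketch runs both on one made-up tab-separated record (the sample text is an assumption):

```shell
# Both commands split the record on tabs and print "Great Twist".
printf 'The Rising\tGreat Twist\n' | awk -F '\t' '{ print $2 }'
printf 'The Rising\tGreat Twist\n' | awk 'BEGIN { FS = "\t" } { print $2 }'
```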

Awk Average Example

At this point, I should be ready to start calculating review scores for The Hunger Games:

exec awk -F '\t' '
$4 == "0439023483" { title=$6; count = count + 1; total = total + $8 }
END                { print "The Average book review for", title, "is", total/count, "stars" }  
' $1

Now that I'm in a file, I can format this out a bit better so it's easier to read:

$4 == "0439023483" { 
  title=$6
  count = count + 1; 
  total = total + $8 
}
END { 
  printf "Book: %-5s\n", title
  printf "Average Rating: %.2f\n", total/count 
}

Either way, I get this output:

Book: The Hunger Games (The Hunger Games, Book 1)
Average Rating: 4.67

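What I've Learned: Calling Awk From a File

Once you are beyond a single line, it makes sense to put your Awk script into a file.

You can then call your program using the -f option:

$ awk -f file.awk input

Using a shebang:

#!/usr/bin/env -S gawk -f

Or by using a bash exec command:

exec awk -F '\t' '{ print $0 }' "$1"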

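Awk Arrays

I'd like to know whether the series stays strong or whether it's one good book that the author stretched out into a trilogy. If the reviews decline quickly, that is not a good sign. I should be able to see which book was rated best and which was rated worst. Let's find out.

If I were going to calculate the averages in Python, I would loop over the list of reviews and use a dictionary to track the total stars and total number of reviews for each title.

In Awk, I can do the same: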

BEGIN { FS = "\t" }
$6~/\(The Hunger Games(, Book 1)?\)$/ { 
  title[$6]=$6
  count[$6]= count[$6] + 1
  total[$6]= total[$6] + $8
}
END { 
    for (i in count) {
        printf "---------------------------------------\n"
        printf "%s\n", title[i]
        printf "---------------------------------------\n"
        printf "Ave: %.2f\t Count: %s \n\n", total[i]/count[i], count[i]  
    }
}

$ awk -f hungergames.awk bookreviews.tsv

---------------------------------------
The Hunger Games (The Hunger Games, Book 1)
---------------------------------------
Ave: 4.55        Count: 1497 

---------------------------------------
Mockingjay (The Hunger Games)
---------------------------------------
Ave: 3.77        Count: 3801 

---------------------------------------
Catching Fire (The Hunger Games)
---------------------------------------
Ave: 4.52        Count: 2205 

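Look at that: the first book in the series is the most liked. The last book, Mockingjay, is much less liked. So that isn't a good sign.

Let me look at another trilogy to see whether this gradual decrease in rankings is common or specific to The Hunger Games: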

BEGIN { FS = "\t" }
$6~/\(The Lord of the Rings, Book .\)$/ {  # <-- changed this line
  title[$6]=$6
  count[$6]= count[$6] + 1
  total[$6]= total[$6] + $8
}
END { 
    for (i in title) {
        printf "---------------------------------------\n"
        printf "%s\n", title[i]
        printf "---------------------------------------\n"
        printf "Ave: %.2f\t Count: %s \n\n", total[i]/count[i], count[i]  
    }
}

---------------------------------------
The Return of the King (The Lord of the Rings, Book 3)
---------------------------------------
Ave: 4.79        Count: 38 

---------------------------------------
The Two Towers (The Lord of the Rings, Book 2)
---------------------------------------
Ave: 4.64        Count: 56 

---------------------------------------
The Fellowship of the Ring (The Lord of the Rings, Book 1)
---------------------------------------
Ave: 4.60        Count: 93

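The Lord of the Rings has a different pattern. The books are all within a pretty tight range. The number of reviews is also much smaller, so it's hard to say for sure that The Return of the King is the best book, but it certainly looks that way.

What I've Learned: Awk Associative Arrays

Awk has associative arrays built in, and you can use them much as you would use Python dictionaries.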

arr["key1"] = "one"
arr["key2"] = "two"
arr["key3"] = "three"

You can then use a for loop to iterate over them:

for (i in arr){
    print i, arr[i]
}

key1 one
key2 two
key3 three

Not bad for a language written in 1977!
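
The per-title bookkeeping can be tried on a handful of made-up reviews (the titles and ratings below are assumptions, not from the dataset). One thing to watch for: the order in which for (i in arr) visits keys is not specified, so this sketch pipes the result through sort to get stable output:

```shell
# Made-up reviews: a title and a star rating per line.
printf 'Dune 5\nEmma 4\nDune 3\nEmma 5\nDune 4\n' |
  awk '{ count[$1]++; total[$1] += $2 }
       END { for (t in count) printf "%s %.2f %d\n", t, total[t] / count[t], count[t] }' |
  sort
# Dune 4.00 3
# Emma 4.50 2
```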

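Awk `If` `Else`

I hate how every book on Amazon has a star rating between 3.0 and 4.5 stars. It makes it hard to judge purely based on numbers. So let's rescale things in terms of the average. Maybe if I normalize the reviews, it will be easier to tell how good or bad Mockingjay's 3.77 average is.

First, I need to calculate the global average by adding up the total and the count for every row: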

{
    # Global Average
    g_count = g_count + 1
    g_total = g_total + $8 
}

Then I calculate the global average:

END { 
    g_score = g_total/g_count 
    ...
}

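Once I have this, I can score books based on how much higher or lower they are than the average. All I need to do is add some if statements to my END pattern to accomplish this: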

END { 
    g_score = g_total/g_count 
    for (i in count) {
        score = total[i]/count[i]
        printf "%-30s\t", substr(title[i],1,30)
        if (score - g_score > .5)
            printf "👍👍👍" 
        else if (score - g_score > .25)
            printf "👍👍" 
        else if (score - g_score > 0)
            printf "👍" 
        else if (g_score - score  > 1)
            printf "👎👎👎" 
        else if (g_score - score  > .5)
            printf "👎👎" 
        else if (g_score - score  > 0)
            printf "👎"
        printf "\n"
    }
}

The values for the partitions are just a guess, but they make it easier for me to understand the rankings:

The Hunger Games (The Hunger G  👍
Catching Fire: The Official Il  👍👍👍
Mockingjay (The Hunger Games)   👎👎

The Two Towers (The Lord of th  👍👍
The Fellowship of the Ring (Th  👍👍
The Return of the King (The Lo  👍👍👍

It looks like Mockingjay, at least on Amazon and in this dataset, was not well received.
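
The normalization step itself, a per-title average minus the global average, can be sketched on made-up data (the titles, ratings, and two-column layout are assumptions). The output is sorted only because Awk's array iteration order is unspecified:

```shell
# Made-up reviews: title and star rating, tab-separated.
printf 'Dune\t5\nDune\t4\nEmma\t3\nEmma\t2\n' | awk -F '\t' '
{
    g_total += $2; g_count++            # global running total and count
    total[$1] += $2; count[$1]++        # per-title running total and count
}
END {
    g_score = g_total / g_count         # global average: 14 / 4 = 3.5
    for (t in count)
        printf "%s %+.2f\n", t, total[t] / count[t] - g_score
}' | sort
# Dune +1.00
# Emma -1.00
```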

You can easily modify the ranking script to query for books ad hoc:

exec gawk -F '\t' -v pattern="$1" '
{
    # Global Average
    g_count = g_count + 1
    g_total = g_total + $8 
}
$6 ~ pattern { # <-- Take match as input
  title[$6]=$6
  count[$6]= count[$6] + 1
  total[$6]= total[$6] + $8
}
END { 
    PROCINFO["sorted_in"] = "@val_num_desc"
    g_score = g_total/g_count 
    for (i in count) {
        score = total[i]/count[i]
        printf "%-50s\t", substr(title[i],1,50)
        if (score - g_score > .4)
            printf "👍👍👍" 
        else if (score - g_score > .25)
            printf "👍👍" 
        else if (score - g_score > 0)
            printf "👍" 
        else if (g_score - score  > 1)
            printf "👎👎👎" 
        else if (g_score - score  > .5)
            printf "👎👎" 
        else if (g_score - score  > 0)
            printf "👎"
        printf "\n"
    }
}
' bookreviews.tsv | head -n 1

And then run it like this:

$ ./average "Left Hand of Darkness"
The Left Hand of Darkness (Ace Science Fiction)         👎
$ ./average "Neuromancer"          
Neuromancer                                             👎👎
$ ./average "The Lifecycle of Software Objects"
The Lifecycle of Software Objects                       👎

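These are all great books, so I'm starting to question the taste of Amazon reviewers.

I want to test one more thing, though: how do the most popular books rate? Maybe popular books get lots of reviews, and that pushes them below the overall average?

What I've Learned

Awk has branching using if and else statements. It works exactly as you might expect it to: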

$ printf '1\n2\n3\n4\n5\n6\n' | awk '{
        if (NR % 2) 
            print "odd"
        else
            print $0
        }'

odd
2
odd
4
odd
6
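
An else if chain works the same way. The sketch below labels a few made-up ratings relative to a fixed average; the 4.2 average, the thresholds, and the labels are assumptions chosen only for illustration:

```shell
# Made-up ratings compared against a hypothetical average of 4.2.
printf 'A 4.9\nB 4.3\nC 3.1\n' | awk -v avg=4.2 '{
    diff = $2 - avg
    if (diff > 0.5)
        label = "well above"
    else if (diff > 0)
        label = "above"
    else
        label = "below"
    print $1, label
}'
# A well above
# B above
# C below
```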

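Awk Sort by Values

Awk (specifically gawk) lets you easily configure your iteration order using a magic variable called PROCINFO["sorted_in"]. This means that if I change our program to sort by value and drop the filtering, I will be able to see the most-reviewed books: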

exec gawk -F '\t' '
{
    # Global Average
    g_count = g_count + 1
    g_total = g_total + $8 
    title[$6]=$6
    count[$6]= count[$6] + 1
    total[$6]= total[$6] + $8
}
END { 
    PROCINFO["sorted_in"] = "@val_num_desc" # <-- Print in value order
    g_score = g_total/g_count 
    for (i in count) {
        score = total[i]/count[i]
        printf "%-50s\t", substr(title[i],1,50)
        if (score - g_score > .4)
            printf "👍👍👍" 
        else if (score - g_score > .25)
            printf "👍👍" 
        else if (score - g_score > 0)
            printf "👍" 
        else if (g_score - score  > 1)
            printf "👎👎👎" 
        else if (g_score - score  > .5)
            printf "👎👎" 
        else if (g_score - score  > 0)
            printf "👎"
        printf "\n"
    }
}
' bookreviews.tsv

Running it:

$ ./top_books | head

Harry Potter And The Sorcerer's Stone                 👍👍
Fifty Shades of Grey                                  👎👎
The Hunger Games (The Hunger Games, Book 1)           👍
The Hobbit                                            👍
Twilight                                              👎
Jesus Calling: Enjoying Peace in His Presence         👍👍👍
Unbroken: A World War II Story of Survival, Resili    👍👍👍
The Shack: Where Tragedy Confronts Eternity           👎
Divergent                                             👍
Gone Girl                                             👎👎

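It looks like about half (6/10) of the most-reviewed books were more liked than average. So I can't blame Mockingjay's low score on its popularity. I guess I'll have to pass on the series, or at least that book.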
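
PROCINFO["sorted_in"] is a gawk feature. If only a plain POSIX awk is available, a different technique gives a similar ordering: print the count first and let sort order the lines. This is a sketch on made-up titles, not the script used above:

```shell
# Made-up titles, one per review; count them, then sort by count, highest first.
printf 'Dune\nEmma\nDune\nIt\nDune\nEmma\n' |
  awk '{ count[$0]++ } END { for (t in count) print count[t] "\t" t }' |
  sort -rn
# 3	Dune
# 2	Emma
# 1	It
```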

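Conclusion

A good programmer uses the most powerful tool to do a job. A great programmer uses the least powerful tool that does the job.

vyuh

Awk has more to it than this. There are more built-in variables and built-in functions. It has range patterns and substitution rules, and you can easily use it to modify content, not just add things up.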
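
As a small taste of those two features, here is a sketch on made-up input (the marker lines BEGIN_NOTES and END_NOTES are hypothetical). The range pattern selects everything from the first marker through the second, and gsub rewrites text within each selected line before printing:

```shell
# Print only the lines between the two markers (inclusive),
# replacing "awk" with "Awk" along the way.
printf 'intro\nBEGIN_NOTES\nawk is fun\nEND_NOTES\noutro\n' |
  awk '/BEGIN_NOTES/,/END_NOTES/ { gsub(/awk/, "Awk"); print }'
# BEGIN_NOTES
# Awk is fun
# END_NOTES
```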

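If you want to learn more about Awk, The AWK Programming Language is the definitive book. It covers the language in depth. It also covers how to build a small programming language in Awk, how to build a database in Awk, and some other fun projects.

Even Amazon thinks it's great: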

$ ./average "The AWK "          
The AWK Programming Language                            👍👍

Original article: https://earthly.dev/blog/awk-examples/