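I plan to write up what I learn about real-time computing as a "real-time computing from scratch" series. I am new to the field myself, so if any description or concept in these posts is off, please point it out. Thanks in advance.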

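This post tests several of Storm's grouping strategies to build a better understanding of each one. To start with, Storm provides the following seven groupings: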

  • Shuffle grouping: Tuples are randomly distributed across the bolt’s tasks in a way such that each bolt is guaranteed to get an equal number of tuples.
  • Fields grouping: The stream is partitioned by the fields specified in the grouping. For example, if the stream is grouped by the “user-id” field, tuples with the same “user-id” will always go to the same task, but tuples with different “user-id”s may go to different tasks.
  • Partial Key grouping: The stream is partitioned by the fields specified in the grouping, like the Fields grouping, but are load balanced between two downstream bolts, which provides better utilization of resources when the incoming data is skewed. This paper provides a good explanation of how it works and the advantages it provides.
  • All grouping: The stream is replicated across all the bolt’s tasks. Use this grouping with care.
  • Global grouping: The entire stream goes to a single one of the bolt’s tasks. Specifically, it goes to the task with the lowest id.
  • None grouping: This grouping specifies that you don’t care how the stream is grouped. Currently, none groupings are equivalent to shuffle groupings. Eventually though, Storm will push down bolts with none groupings to execute in the same thread as the bolt or spout they subscribe from (when possible).
  • Direct grouping: This is a special kind of grouping. A stream grouped this way means that the producer of the tuple decides which task of the consumer will receive this tuple. Direct groupings can only be declared on streams that have been declared as direct streams. Tuples emitted to a direct stream must be emitted using one of the emitDirect methods. A bolt can get the task ids of its consumers by either using the provided TopologyContext or by keeping track of the output of the emit method in OutputCollector (which returns the task ids that the tuple was sent to).
    Partial Key grouping was not available in my test environment, and Direct grouping requires a different way of emitting tuples, so only the other five groupings are tested here.
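
To make the descriptions above more concrete, here is a minimal plain-Java sketch of how the four distinct behaviours among the five tested groupings could pick their target tasks. It is not Storm's actual code: the class `GroupingSketch`, the reshuffled round-robin used for Shuffle, and the `hashCode` modulo rule used for Fields are all assumptions made for illustration. None grouping is left out because, per the list above, it currently behaves like Shuffle.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustrative model only; names and hashing rules are assumptions, not Storm internals.
class GroupingSketch {

    // Shuffle: walk a randomly ordered list of task ids and reshuffle after each pass,
    // so every task receives the same number of tuples over complete passes.
    static class Shuffle {
        private final List<Integer> order = new ArrayList<>();
        private final Random rnd;
        private int next = 0;

        Shuffle(int numTasks, long seed) {
            rnd = new Random(seed);
            for (int t = 0; t < numTasks; t++) {
                order.add(t);
            }
            Collections.shuffle(order, rnd);
        }

        int choose() {
            if (next == order.size()) {
                Collections.shuffle(order, rnd);
                next = 0;
            }
            return order.get(next++);
        }
    }

    // Fields: hash the grouping field, so equal values always map to the same task index.
    static int fields(String fieldValue, int numTasks) {
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    // Global: the whole stream goes to the task with the lowest id.
    static int global(List<Integer> taskIds) {
        return Collections.min(taskIds);
    }

    // All: every task receives its own copy of the tuple.
    static List<Integer> all(List<Integer> taskIds) {
        return new ArrayList<>(taskIds);
    }
}
```

In a real topology these choices are not written by hand; they are selected when wiring a bolt to its input, with methods such as `shuffleGrouping`, `fieldsGrouping`, `globalGrouping`, `allGrouping` and `noneGrouping`.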

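The test setup:

  • One spout that cycles through one hundred words, configured with one thread
  • One bolt that counts the words, configured with two threads
    The test results (the digits below are the number of times a word has been counted):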

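  • Shuffle: the first 2 appears at the 108th count reported, and 1s are still interspersed after that
  • Fields: the first 2 appears at the 101st count reported; the pattern is one hundred 1s, then one hundred 2s, then one hundred 3s, and so on
  • Global: the first 2 appears at the 101st count reported; the pattern is the same as with Fields grouping
  • All: the first 2 appears at the 201st count reported, giving two hundred 2s, then two hundred 3s, and so on
  • None: the first 2 appears at the 100th count reported, with 1s interspersed afterwards; the counts show up in random order, the same as with Shuffle grouping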
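    Shuffle and None are both random: each tuple is sent to an arbitrary task of the downstream bolt. Fields grouping sends tuples with the same field value to the same task (tuples with different values may also end up on the same task). Global grouping gave the same result as Fields grouping here; according to the official documentation, it always sends to the task with the lowest id. All grouping sends every tuple to all of the bolt's tasks (so in the example above the cycle length is two hundred), which uses considerably more resources.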
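
The deterministic patterns above can be reproduced with a small plain-Java simulation of the test setup: one source cycling through one hundred words and a counting bolt with two tasks, each task keeping its own counts. This is only a sketch under assumptions. The class `GroupingExperiment`, the word names `word0` to `word99`, task ids 0 and 1, the `hashCode` modulo rule for Fields, and `Random.nextInt` for Shuffle and None are all invented for illustration and do not use Storm itself.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Simulation of the experiment; all names and selection rules are illustrative assumptions.
class GroupingExperiment {
    static final int WORDS = 100; // distinct words the spout cycles through
    static final int TASKS = 2;   // number of tasks of the counting bolt

    // Returns the counts in the order the bolt tasks would report them.
    static List<Integer> run(String grouping, int tuples, long seed) {
        Random rnd = new Random(seed);
        List<Map<String, Integer>> perTask = new ArrayList<>();
        for (int t = 0; t < TASKS; t++) {
            perTask.add(new HashMap<>());
        }
        List<Integer> reported = new ArrayList<>();
        for (int i = 0; i < tuples; i++) {
            String word = "word" + (i % WORDS);
            for (int task : targets(grouping, word, rnd)) {
                // Each task counts only the tuples it receives itself.
                reported.add(perTask.get(task).merge(word, 1, Integer::sum));
            }
        }
        return reported;
    }

    // Picks the task(s) that receive one tuple under the given grouping.
    static int[] targets(String grouping, String word, Random rnd) {
        switch (grouping) {
            case "fields":
                return new int[] { Math.floorMod(word.hashCode(), TASKS) };
            case "global":
                return new int[] { 0 }; // lowest task id
            case "all": {
                int[] every = new int[TASKS];
                for (int t = 0; t < TASKS; t++) {
                    every[t] = t;
                }
                return every;
            }
            case "shuffle":
            case "none":
                return new int[] { rnd.nextInt(TASKS) };
            default:
                throw new IllegalArgumentException("unknown grouping: " + grouping);
        }
    }

    // 1-based position of the first reported count equal to 2, or -1 if there is none.
    static int firstTwo(List<Integer> reported) {
        int idx = reported.indexOf(2);
        return idx < 0 ? -1 : idx + 1;
    }
}
```

For example, `GroupingExperiment.firstTwo(GroupingExperiment.run("fields", 300, 1L))` gives 101 and the same call with `"all"` gives 201, matching the Fields, Global and All results listed above. For `"shuffle"` and `"none"` the position depends on the random seed, since a word only reaches a count of 2 once it lands on the same task twice.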

Also, according to the documentation, Partial Key grouping adds load balancing on top of Fields grouping, and Direct grouping requires tuples to be emitted with emitDirect.
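
Since Partial Key grouping could not be tested, here is a rough plain-Java sketch of the idea as the documentation describes it: each key has two candidate tasks, and every tuple goes to whichever of the two has received fewer tuples so far. The class `PartialKeySketch`, the `"#2"` salt used to derive the second hash, and the use of `String.hashCode` are assumptions for illustration only, not Storm's implementation.

```java
// Illustrative model of two-choice load balancing; not Storm's PartialKeyGrouping code.
class PartialKeySketch {
    private final long[] sent; // tuples sent to each task so far

    PartialKeySketch(int numTasks) {
        sent = new long[numTasks];
    }

    // Picks the less loaded of the key's two candidate tasks and records the send.
    int choose(String key) {
        int first = Math.floorMod(key.hashCode(), sent.length);
        // Hypothetical second hash: the same key with a salt appended.
        int second = Math.floorMod((key + "#2").hashCode(), sent.length);
        int target = sent[first] <= sent[second] ? first : second;
        sent[target]++;
        return target;
    }
}
```

With Fields grouping a very frequent key would load a single task. In this sketch the same key is spread over at most two tasks, so a downstream step would have to merge the two partial counts for that key.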
