Bootstrap

PHP实现一致性Hash算法

目标:

  • 用你熟悉的编程语言实现一致性 hash 算法。

  • 编写测试用例测试这个算法,测试 100 万 KV 数据,10 个服务器节点的情况下,计算这些 KV 数据在服务器上分布数量的标准差,以评估算法的存储负载不均衡性。

代码编写

应该有以下几个步骤:

  • hash函数(把字符串hash成int)

  • 编写获取第一个大于等于该key的node查找函数

这其中比较重要的就是把字符串hash成int值的函数了,我在网上找到一个JShash,用php实现了一下

private static function jsHash(string $str) {
      $hash = 0;

      for ($i = 0; $i < strlen($str); $i++) {
         $hash ^= ($hash << 5) + (int)ord($str[$i]) + ($hash >> 2);
      }
      return $hash & 0x7FFFFFFF;
    }

完整代码:

config = $config;
        $this->initNodes();
    }

    private function initNodes()
    {
        $virtualNodeCount = $this->config['virtual_node_count'] ?? 150;
        $nodes = $this->config['nodes'] ?? ['default'];
        foreach ($nodes as $node) {
            $this->initVirtualNodes($node, $virtualNodeCount);
        }
        $this->hashKeys = array_keys($this->allNodes);
        sort($this->hashKeys);
        $this->keyCount = count($this->hashKeys);
    }

    private function initVirtualNodes($node, $virtualNodeCount)
    {
        for ($i = 0; $i < $virtualNodeCount; $i++) {
//            $virtualNodeName = $node.'#'.$i;
            $virtualNodeName = $node. '#' . $i . mt_rand(0, $virtualNodeCount);
            $this->allNodes[self::jsHash($virtualNodeName)] = $node;
        }
    }

    public function getNode(string $key)
    {
        $hashKey = self::jsHash($key);
        return $this->allNodes[$this->searchFirstLarger($hashKey)];
    }

    private static function jsHash(string $str) {
      $hash = 0;

      for ($i = 0; $i < strlen($str); $i++) {
         $hash ^= ($hash << 5) + (int)ord($str[$i]) + ($hash >> 2);
      }
      return $hash & 0x7FFFFFFF;
    }



    private function searchFirstLarger(int $search)
    {
        $len = $this->keyCount;
        $left = 0;
        $right= $len - 1;
        while ($left <= $right) {
            $mid = $left + floor(($right - $left) / 2);
            if ($this->hashKeys[$mid] >= $search) {
                $right = $mid - 1;
            } elseif ($this->hashKeys[$mid] < $search) {
                $left = $mid + 1;
            }

        }
        if ($left > $len-1) $left = 0;
        return $this->hashKeys[$left];
    }

}

测试

php里面没有标准差函数,就单独写了个。

代码:

function standardDeviation($arr) {
    $length = count($arr);
    if ($length == 0) {
        return 0;
    }
    $average = array_sum($arr)/$length;
    $count = 0;
    foreach ($arr as $v) {
        $count += pow($average-$v, 2);
    }
    $variance = $count/$length;
    return sqrt($variance);
}

$config = [
    'nodes' => [
        '192.168.1.1',
        '192.168.1.2',
        '192.168.1.3',
        '192.168.1.4',
        '192.168.1.5',
        '192.168.1.6',
        '192.168.1.7',
        '192.168.1.8',
        '192.168.1.9',
        '192.168.1.10'
    ],
    'virtual_node_count' => 250
];
$nodes_hit_count = [
    '192.168.1.1' => 0,
    '192.168.1.2' => 0,
    '192.168.1.3' => 0,
    '192.168.1.4' => 0,
    '192.168.1.5' => 0,
    '192.168.1.6' => 0,
    '192.168.1.7' => 0,
    '192.168.1.8' => 0,
    '192.168.1.9' => 0,
    '192.168.1.10' => 0
];
$start_time = microtime(true);
$consistent_hash = new ConsistentHashing($config);

for($i = 0; $i < 1000000; $i++) {
    $node_name = $consistent_hash->getNode('key'.(string)$i.mt_rand(0, 100000));
    $nodes_hit_count[$node_name]++;
}
foreach ($nodes_hit_count as $node => $count) {
    print $node.'存储key数'. $count ."\n";
}
print "标准差".standardDeviation($nodes_hit_count)."\n";

$end_time = microtime(true);
print "执行时间: ". ($end_time - $start_time) . "\n";

当虚拟node数设置为150时:

192.168.1.1存储key数49295
192.168.1.2存储key数56708
192.168.1.3存储key数59475
192.168.1.4存储key数264412
192.168.1.5存储key数150626
192.168.1.6存储key数113880
192.168.1.7存储key数73496
192.168.1.8存储key数95936
192.168.1.9存储key数104041
192.168.1.10存储key数32131
标准差64199.634646935
执行时间: 3.8713130950928
real	0m 3.94s
user	0m 3.85s
sys	0m 0.03s

为180时:

192.168.1.1存储key数49232
192.168.1.2存储key数56847
192.168.1.3存储key数61377
192.168.1.4存储key数265102
192.168.1.5存储key数150616
192.168.1.6存储key数113706
192.168.1.7存储key数72529
192.168.1.8存储key数96250
192.168.1.9存储key数104306
192.168.1.10存储key数30035
标准差64515.24847662
执行时间: 4.2461450099945
real	0m 4.29s
user	0m 4.22s
sys	0m 0.03s

200时:

192.168.1.1存储key数59224
192.168.1.2存储key数57170
192.168.1.3存储key数61129
192.168.1.4存储key数265007
192.168.1.5存储key数150797
192.168.1.6存储key数113896
192.168.1.7存储key数68320
192.168.1.8存储key数96682
192.168.1.9存储key数97474
192.168.1.10存储key数30301
标准差63943.531801113
执行时间: 3.6802899837494
real	0m 3.72s
user	0m 3.68s
sys	0m 0.02s

220时:

192.168.1.1存储key数86243
192.168.1.2存储key数55618
192.168.1.3存储key数51023
192.168.1.4存储key数239631
192.168.1.5存储key数136547
192.168.1.6存储key数148513
192.168.1.7存储key数66055
192.168.1.8存储key数73461
192.168.1.9存储key数97445
192.168.1.10存储key数45464
标准差57079.820346599
执行时间: 4.0990281105042
real	0m 4.20s
user	0m 4.09s
sys	0m 0.02s

结论

大致上来说虚拟node数在150-200之间,标准差和执行时间相对好些。

从结果可以看出来,node不是那么的均匀,特别有一个很高。基本上hash函数的问题,因为测试的key和服务器的node都是随机数,生成的数偏向某个node。

我也尝试用别人写的java的hash算法,但是计算出的hash值特别特别大,和java计算出来的不一致,还未找到原因。后面有找到了再改一下试试。