Darts: Double-ARray Trie System (Translated Documentation)

Author: rickjin

March 23, 2013 #Darts, #Chinese word segmentation, #double-array

Darts: Double-ARray Trie System

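Introduction

Darts is a simple C++ template library for building a double-array [Aoe1989]. A double-array is a data structure for implementing a trie, and it is faster than other trie-like implementations (Hash-Tree, Digital Trie, Patricia Tree, Suffix Array). The original double-array supports adding and deleting keys dynamically, but Darts only supports converting a sorted dictionary file into a static double-array.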
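
To make the double-array idea concrete, here is a minimal sketch of a single trie transition in the textbook formulation of [Aoe1989]: from state s on character code c, the candidate next state is base[s] + c, and it is accepted only if the check entry of that slot points back to s. The function name transition, the toy arrays, and the use of -1 to mark unused slots are assumptions made for illustration; the layout used inside darts.h may differ in its details.

```cpp
#include <cstddef>
#include <vector>

// Textbook double-array step (illustrative, not the code in darts.h):
// from state `s`, following character code `c`, the candidate next state is
// base[s] + c, and it is accepted only if check[next] == s.
// Returns the next state, or -1 when no such transition exists.
int transition(const std::vector<int>& base, const std::vector<int>& check,
               int s, int c) {
  if (s < 0 || static_cast<std::size_t>(s) >= base.size()) return -1;
  const int next = base[s] + c;
  if (next < 0 || static_cast<std::size_t>(next) >= check.size()) return -1;
  return check[next] == s ? next : -1;
}

// Toy trie with root state 0 and two outgoing edges, using codes a=1, b=2.
// Unused slots hold -1 in check so they can never match a real state.
const std::vector<int> kToyBase  = {1, 0, 0, 0};
const std::vector<int> kToyCheck = {-1, -1, 0, 0};
```

With these arrays, transition(kToyBase, kToyCheck, 0, 1) yields state 2, while code 3 has no edge and yields -1. Each step costs two array reads regardless of dictionary size, which is where the lookup speed comes from.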

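Darts can serve as a simple dictionary, much like a hash table, and it also performs very efficiently the Common Prefix Search operation that word-segmentation dictionaries require.

Since July 2003, two open-source Japanese word-segmentation systems, MeCab and ChaSen, have both used Darts.

Download

Darts is free software. It is distributed under the LGPL (GNU Lesser General Public License) and the BSD license, and may be modified and redistributed.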

Source

darts-0.32.tar.gz: HTTP

Installation

% ./configure 
% make
% make check
% make install
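Then include /usr/local/include/darts.h in your program.

Usage

Darts provides only the C++ template header darts.h; simply include it wherever it is needed.
It is distributed this way so that inline functions can deliver high efficiency.

Class interface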

namespace Darts {

   template <class NodeType, class NodeUType, class ArrayType,
                class ArrayUType, class LengthFunc = Length<NodeType> >

   class DoubleArrayImpl
   {
    public:
     typedef ArrayType   result_type;
     typedef NodeType    key_type;

     DoubleArrayImpl();
     ~DoubleArrayImpl();

     int    set_array(void *ptr, size_t = 0);
     void   *array();
     void   clear();
     size_t size ();
     size_t unit_size ();
     size_t nonzero_size ();
     size_t total_size ();

     int build (size_t            key_size,
                const key_type    **key,
                const size_t      *len = 0,
                const result_type *val = 0,
                int (*pg)(size_t, size_t) = 0);

     int open (const char *file,
               const char *mode = "rb",
               size_t offset = 0,
               size_t _size = 0);

     int save (const char *file,
               const char *mode = "wb",
               size_t offset = 0);

     result_type exactMatchSearch (const key_type *key,
                                   size_t len = 0,
                                   size_t node_pos = 0);

     size_t commonPrefixSearch  (const key_type *key,
                                 result_type *result,
                                 size_t result_size,
                                 size_t len = 0,
                                 size_t node_pos = 0);

     result_type traverse (const key_type *key,
                           size_t &node_pos,
                           size_t &key_pos,
                           size_t len = 0);
   };

   typedef Darts::DoubleArrayImpl<char, unsigned char,
              int, unsigned int> DoubleArray;
};

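Template parameters

NodeType	Type of a trie node. For searching ordinary C strings, char is sufficient.
NodeUType	Type used when a trie node is converted to an unsigned integer. For searching ordinary C strings, unsigned char is sufficient.
ArrayType	Type of the Base elements of the double-array, usually a signed 32-bit integer.
ArrayUType	Type of the Check elements of the double-array, usually an unsigned 32-bit integer.
LengthFunc	Function object used to obtain the length of a NodeType array; it overloads operator(). When NodeType is char, strlen is used by default; otherwise the length is computed by treating 0 as the end-of-array marker.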
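
The LengthFunc behavior described above can be sketched as a small function object. The name KeyLength is hypothetical and is used here to avoid implying that this is the exact code in darts.h; only the described behavior (strlen for char, 0 as the terminator otherwise) comes from the text.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical length functor mirroring the LengthFunc description:
// count elements until the terminating 0 is reached.
template <class T>
struct KeyLength {
  std::size_t operator()(const T* key) const {
    std::size_t n = 0;
    while (key[n] != static_cast<T>(0)) ++n;
    return n;
  }
};

// For char keys, defer to strlen.
template <>
struct KeyLength<char> {
  std::size_t operator()(const char* key) const { return std::strlen(key); }
};
```

A key type whose elements can legitimately be 0 would need its own functor, or explicit lengths passed through the len arguments.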

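Typedefs

Aliases for the template parameter types, for use when these types are needed outside the class.

key_type	Type of a single element of the key being searched. Same as NodeType.
result_type	Type of a single result. Same as ArrayType.

Methods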

int Darts::DoubleArrayImpl::build(size_t size, const key_type **str, const size_t *len = 0, const result_type *val = 0, int (*progress_func)(size_t, size_t) = 0)

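Builds the double-array.

size number of entries in the dictionary
str pointers to the entries (size pointers in total)
len array holding the length of each entry (size elements)
val array holding the value associated with each entry (size elements)
progress_func progress callback invoked during the build

The elements of str must be sorted in lexicographic order.
In addition, the elements of val must not be negative.
len, val, and progress_func may be omitted.
When they are omitted, len is computed with LengthFunc,
and the elements of val are consecutive numbers counting from 0.

Returns 0 on success and a negative value on failure.
The progress function progress_func takes two arguments.
The first size_t argument is the number of entries built so far.
The second size_t argument is the total number of entries.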
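
Because build expects sorted keys and non-negative values, it can help to validate the input before calling it. The sketch below is one way to do that for char keys. The helper name check_build_input is hypothetical and not part of Darts, and it assumes that "lexicographic order" means byte-wise order as compared by std::strcmp, with no duplicate keys.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical pre-flight check for the arguments of build().
// Returns 0 if the keys are in strictly ascending byte order and every value
// (when provided) is non-negative; otherwise returns the 1-based index of the
// first offending entry.
std::size_t check_build_input(std::size_t size, const char** str,
                              const int* val = 0) {
  for (std::size_t i = 0; i < size; ++i) {
    if (val && val[i] < 0) return i + 1;
    if (i > 0 && std::strcmp(str[i - 1], str[i]) >= 0) return i + 1;
  }
  return 0;
}
```

Running a check like this first gives a precise error position, whereas the build call itself only reports a negative return code.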

result_type Darts::DoubleArrayImpl::exactMatchSearch(const key_type *key, size_t len = 0, size_t node_pos = 0)

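Performs an exact match search: determines whether the given string is an entry in the dictionary.

key the string to search for
len length of the string
node_pos the node position in the double-array at which to start the search

Both len and node_pos may be omitted. When they are omitted, len is computed with LengthFunc,
and node_pos defaults to the root node.

On success, returns the value associated with key; on failure, returns -1.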

size_t Darts::DoubleArrayImpl::commonPrefixSearch (const key_type *key, result_type *result, size_t result_size, size_t len = 0, size_t node_pos = 0)

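Performs a common prefix search: finds which prefixes of the given string are entries in the dictionary.

key the string to search for
result array that receives the values of the matching entries
result_size size of the result array
len length of the string to search for
node_pos the node position in the double-array at which to start the search

Both len and node_pos may be omitted. When they are omitted, len is computed with LengthFunc,
and node_pos defaults to the root node.

For each matching entry, the associated value is stored in the result array in order. If the number of matches exceeds result_size, only result_size results are stored in result. The return value is the actual number of matching entries and may therefore exceed result_size.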
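
The contract described above (store at most result_size values, but return the full hit count) can be illustrated with a small reference model built on std::map. This is not how Darts works internally; the function name common_prefix_model is hypothetical, and the assumption that results are ordered by increasing prefix length is an interpretation of "in order".

```cpp
#include <cstddef>
#include <cstring>
#include <map>
#include <string>

// Reference model of the common-prefix-search contract (not Darts itself):
// every prefix of `key` that is a dictionary entry counts as a hit. At most
// `result_size` values are written to `result`, shortest prefix first, but
// the return value counts all hits.
std::size_t common_prefix_model(const std::map<std::string, int>& dict,
                                const char* key, int* result,
                                std::size_t result_size, std::size_t len = 0) {
  if (len == 0) len = std::strlen(key);
  std::size_t hits = 0;
  for (std::size_t n = 1; n <= len; ++n) {
    const auto it = dict.find(std::string(key, n));
    if (it == dict.end()) continue;
    if (hits < result_size) result[hits] = it->second;
    ++hits;
  }
  return hits;
}
```

A caller can detect truncation by comparing the return value with result_size, and must read no more than the smaller of the two from the result array.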

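result_type Darts::DoubleArrayImpl::traverse (const key_type *key, size_t &node_pos, size_t &key_pos, size_t len = 0)

Traverses the trie: searches for the given string and records the position reached by the search.

key the string to search for
node_pos the node position in the double-array at which to start the search
key_pos the position in the search string at which to start
len length of the string to search for

This function is very similar to exactMatchSearch: the traversal moves from node to node in the trie, following the search string key.
The difference from exactMatchSearch is that, after the call, both the last trie node reached and the last position reached in the string are available.

node_pos is normally set to the root position (0). After the call, node_pos holds the last double-array node reached.
key_pos is normally set to 0. After the call, key_pos holds the last position reached in the string key.

On failure, returns -1 or -2.
-1 means the search failed at a leaf node; -2 means it failed at an intermediate node.
On success, returns the value associated with key.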
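
Since traverse updates node_pos and key_pos, a search can be resumed from where an earlier call stopped. The sketch below models that behavior on an ordinary pointer-style trie rather than on a double-array. All names (ModelNode, model_insert, traverse_model) are hypothetical. Reading -2 as "the path breaks off partway through the key" and -1 as "the path exists but no entry ends there" is an interpretation of the description above, not a statement about the code in darts.h.

```cpp
#include <cstddef>
#include <cstring>
#include <map>
#include <string>
#include <vector>

// Pointer-style trie used only to model traverse(); Darts keeps the same
// information in a double-array instead.
struct ModelNode {
  std::map<char, std::size_t> next;  // outgoing edges: character -> node index
  int value;                         // value of the entry ending here, or -1
  ModelNode() : value(-1) {}
};

// Inserts `key` with a non-negative `value`; nodes[0] is the root.
void model_insert(std::vector<ModelNode>& nodes, const std::string& key,
                  int value) {
  if (nodes.empty()) nodes.push_back(ModelNode());
  std::size_t cur = 0;
  for (std::size_t i = 0; i < key.size(); ++i) {
    const auto it = nodes[cur].next.find(key[i]);
    if (it != nodes[cur].next.end()) {
      cur = it->second;
      continue;
    }
    nodes.push_back(ModelNode());  // create the missing child
    nodes[cur].next[key[i]] = nodes.size() - 1;
    cur = nodes.size() - 1;
  }
  nodes[cur].value = value;
}

// Walks key[key_pos..len) starting at node_pos, updating both as it goes.
// Assumed return codes: -2 if a needed edge is missing, -1 if the key was
// consumed but the node reached holds no value, otherwise that node's value.
// `nodes` must contain at least the root.
int traverse_model(const std::vector<ModelNode>& nodes, const char* key,
                   std::size_t& node_pos, std::size_t& key_pos,
                   std::size_t len = 0) {
  if (len == 0) len = std::strlen(key);
  for (; key_pos < len; ++key_pos) {
    const std::map<char, std::size_t>& edges = nodes[node_pos].next;
    const auto it = edges.find(key[key_pos]);
    if (it == edges.end()) return -2;
    node_pos = it->second;
  }
  return nodes[node_pos].value;
}
```

For example, after inserting "ARPA" and "ARPANET", traversing "ARP" from the root stops at an inner node with no value; a second call that starts from the returned node_pos with the key "A" then reaches the entry for "ARPA" without rescanning the first three characters.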

int Darts::DoubleArrayImpl::save(const char *file, const char *mode = "wb", size_t offset = 0)

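Saves the double-array to a file.

file name of the file to save to
mode file open mode
offset offset within the file at which to save; reserved for future use and not currently implemented

Returns 0 on success and -1 on failure.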

int Darts::DoubleArrayImpl::open (const char *file, const char *mode = "rb", size_t offset = 0, size_t size = 0)

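Loads a double-array file.

file name of the file to read
mode file open mode
offset offset within the file at which to start reading

When size is 0, the size of the file is used.

Returns 0 on success and -1 on failure.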

size_t Darts::DoubleArrayImpl::size()

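Returns the size of the double-array (the number of elements).

size_t Darts::DoubleArrayImpl::unit_size()

Returns the size of one double-array element, in bytes.

size() * unit_size() is the amount of memory, in bytes, needed to hold the double-array.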

size_t Darts::DoubleArrayImpl::nonzero_size()

Returns the number of double-array elements that are actually in use.
nonzero_size()/size() gives the compression ratio.
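
The two formulas above amount to the following arithmetic. The function names are hypothetical; in a real program the inputs would come from size(), unit_size(), and nonzero_size(). Expressing the ratio as a percentage is a choice made here, which appears to correspond to the "Compression Ratio" line printed by mkdarts in the usage example.

```cpp
#include <cstddef>

// Memory footprint in bytes: number of elements times bytes per element.
std::size_t memory_bytes(std::size_t size, std::size_t unit_size) {
  return size * unit_size;
}

// Share of elements in use, as a percentage (0 when the array is empty).
double compression_ratio_percent(std::size_t nonzero_size, std::size_t size) {
  if (size == 0) return 0.0;
  return 100.0 * static_cast<double>(nonzero_size) / static_cast<double>(size);
}
```

A ratio close to 100% means that few slots in the array are wasted.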

Example programs

Build a double-array from a static dictionary.

#include <iostream>
#include <darts.h>

int main (int argc, char **argv)
{
  using namespace std;

  const Darts::DoubleArray::key_type *str[] = { "ALGOL", "ANSI", "ARCO",  "ARPA", "ARPANET", "ASCII" }; // same as const char*
  Darts::DoubleArray::result_type val[] = { 1, 2, 3, 4, 5, 6 }; // same as int

  Darts::DoubleArray da;
  da.build (6, str, 0, val); 

  cout << da.exactMatchSearch("ALGOL") << endl;
  cout << da.exactMatchSearch("ANSI") << endl;
  cout << da.exactMatchSearch("ARCO") << endl;
  cout << da.exactMatchSearch("ARPA") << endl;
  cout << da.exactMatchSearch("ARPANET") << endl;
  cout << da.exactMatchSearch("ASCII") << endl;
  cout << da.exactMatchSearch("APPARE") << endl;

  da.save("some_file");
}

Output
1
2
3
4
5
6
-1

Read strings from standard input and run a Common Prefix Search against the double-array.

#include <iostream>
#include <iterator>
#include <string>
#include <algorithm>
#include <darts.h>

int main (int argc, char **argv)
{
  using namespace std;

  Darts::DoubleArray da;
  if (da.open("some_file") == -1) return -1;

  Darts::DoubleArray::result_type  r [1024];
  Darts::DoubleArray::key_type     buf [1024];

  while (cin.getline (buf, 1024)) {
    size_t result = da.commonPrefixSearch(buf, r, 1024);
    if (result == 0) {
       cout << buf << ": not found" << endl;
    } else {
       cout << buf << ": found, num=" << result << " ";
       copy (r, r + result, ostream_iterator<Darts::DoubleArray::result_type>(cout, " "));
       cout << endl;
    }
  }

  return 0;
}

Bundled programs

mkdarts

% ./mkdarts DictionaryFile DoubleArrayFile

Converts the sorted dictionary DictionaryFile into DoubleArrayFile.

darts

% ./darts DoubleArrayFile

Runs a common prefix search using DoubleArrayFile.

Usage example

% cd tests
% head -10 linux.words
ALGOL
ANSI
ARCO
ARPA
ARPANET
ASCII
 .. 

% ../mkdarts linux.words dar
Making Double Array: 100% |*******************************************|
Done!, Compression Ratio: 94.6903 %

% ../darts dar
Linux
Linux: found, num=2 3697 3713
Windows
Windows: not found
LaTeX
LaTeX: found, num=1 3529

References and links

[Aoe1989] Aoe, J. An Efficient Digital Search Algorithm by Using a Double-Array Structure. IEEE Transactions on Software Engineering. Vol. 15, 9 (Sep 1989). pp. 1066-1077.
[Datrie] Theppitak Karoonboonyanan. An Implementation of Double-Array Trie. http://www.links.nectec.or.th/~thep/datrie/
[単語と辞書] 松本裕治 et al. 単語と辞書. 岩波講座 言語の科学, Vol. 3, pp. 79-81.

Translation

Zhihui JIN (ZhihuiJin AT gmail.com)

My Japanese is limited, so please forgive any inaccuracies in the translation.


4 comments on "Darts: Double-ARray Trie System (Translated Documentation)"

blowyourheart said:

March 26, 2013, 23:36

Practically every company wraps Darts in a layer of its own before using it.

雷龙 said:

August 1, 2013, 18:21

A conceptual principle for making artificial-intelligence programs self-evolving: http://blog.csdn.net/liron71/article/details/8242670

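ggaaooppeenngg said:

November 18, 2013, 20:16

Hello, I have been reading your blog recently and find it very interesting. I am trying to read a Go library on GitHub that implements the MMSEG algorithm, which in turn means reading a double-array trie. Some of the code appears to be an optimization, and I could not follow it.
According to the algorithm description, the next states c1, c2, c3, c4, ..., cn of each state must all fit into the array, so a suitable pos has to be found such that
base[pos] check[pos] ……base[pos],check[pos] are all 0.
To find this pos, a loop iterates over the array indices.
But before the iteration there is this:
var pos int = max(int(siblings[0].code)+1, d.nextCheckPos) - 1
nonZeroNum = 0
first :=false
It assigns the larger of c0's code and nextCheckPos to base. I can understand assigning the code to pos, because the search starts from index 0, but I do not understand the nextCheck part.
Next comes a counter that records how many times no suitable position was found.
………………………………
if d.darts.Check[pos] > 0 {
nonZeroNum++
continue
} else if !first {
// If first is false, the next position to check becomes pos, but first is then set to true, and I don't know why
d.nextCheckPos = pos
first = true
}
// At the end there is a check like this, which I do not understand at all
if float32(nonZeroNum)/float32(pos-d.nextCheckPos+1) >= 0.95 {
d.nextCheckPos = pos
}
The original GitHub repository is https://github.com/awsong/go-darts, which appears to be a port of a C++ version.
My guess is that this is some kind of optimization, but I have not been able to find anything about it. I hope you can give me some pointers.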

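天志鹏飞 replied:
July 26, 2018 at 10:25
Let me start with the second snippet. It looks like an empirical rule: if the non-empty slots make up 0.95 or more of the scanned range, the next search no longer starts from that region, because scanning it again takes too much time. It trades space for time. With that understood, the first snippet simply sets up the count used to compute the proportion of non-empty slots, which is then compared against 0.95.
This is only my personal understanding. If anything is wrong, please go easy on me; corrections are welcome.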
