简介
我们首先应该从WAP-Tree说起,下面一段话摘自《Effective Web Log Mining using WAP Tree-Mine》原文
Abstract -World Wide Web is a huge data repository and is growing with the explosive rate of about 1 million pages a day,web log records each access of the web
page and number of entries in the web logs is increasing rapidly. These web logs,when mined properly can provide useful information for decision-making. Sequential pattern mining discovers frequent user access patterns from web logs. Since Apriori-like sequential
pattern mining techniques requires expensive multiple scans of database. But, recently a novel data structure, known as Web Access Pattern Tree (or WAP-tree), was developed. This proposed method an efficient WAP-tree mining algorithm,known as DLT-mine (Doubly
Linked Tree algorithm). Proposed recursive algorithm uses this doubly Linked tree to efficiently find all access patterns that satisfy user specified criteria. This mining algorithm is faster than the other Apriori-based mining algorithms.
这段话的大致意思就是:互联网数据十分巨大,如今每天大约有1百万的网页访问增加量,并且增长十分迅速。而这些访问的信息可能给我们提供许多有用的信息并且用来制定相应的决策,并且传统的搜索信息的方法并不是十分高效,于是后来一个神奇的数据结构,称为“Web Access Pattern Tree”(简称WAP-Tree)被人们提出来了。显然,它具有更优秀的性质,并比以往的方法要更快。
后来在实现WAP-Tree的算法的过程中,人们发现WAP-Tree在搜索频繁项的过程中还可以更进一步的优化,于是人们将它改进后成为“Pre-Order Linked WAP-Tree”(简称PLWAP-Tree),具体内容我们会在下面陈述。
WAP-Tree
首先给出伪代码
树的构建:
A. Algorithm 2 (Doubly Linked Tree Construction)
Input: A Web access sequence database WAS and a set of all possible events E.
Output: A doubly linked tree T.
Method:
Scan 1:
Scan 2:
其中Scan1首先计算输入序列的每个字符出现的频度,Scan2筛除其中频度低于阀值lamda的字符,然后构建一颗字典树T,同时要让上一个Si指向此Si,最后返回这颗树。
最后这颗树看起来应该是这个样子的
TID
Web access sequence
Frequent subsequence
100
abdac
abac
200
eaebcac
abcac
300
babfaec
babac
400
afbacfc
abacc
最后在树中进行搜索频发序列,伪代码如下:
B. Algorithm 2 (Mining all ξ-patterns in doubly linked tree)
Input: a Doubly linked tree T and support threshold ξ.
Output: the complete set of ξ-patterns.
Method:
最后我们就会得到频繁项如下:
{c, aac, bac, abac, ac, abc, bc, b, ab, a, aa,ba, aba}
PLWAP-Tree
人们在运用WAP-Tree的过程中,发现其在时间复杂度上并不理想,请看原文《PLWAP Sequential Mining: Open Source Code》中对PLWAP-Tree的一段介绍:
Abstract -PLWAP algorithm uses a preorder linked, position code dversion of WAP tree and eliminates the need to recursively re-construct intermediate WAP trees
during sequential mining as done by WAP tree technique. PLWAP produces sig-nificant reduction in response time achieved by the WAP algorithm and provides a position code mechanism for remembering the stored database, thus, eliminating the need to re-scan the
original database as would be necessary for applications like those incrementally maintaining mined frequent patterns, performing stream or dynamic mining.
大致意思就是:PLWAP 算法使用先序遍历整个树来建立Head-Table链表队列,并且为每个节点设置一个独一无二的编号,并且可以根据这个编号立刻知道一个节点是不是另一个节点的子节点。
最后相同数据下PLWAP-Tree构造如图:
看图应该很容易懂,这里提示几点方便大家理解:
1、上图中比如{c:1:1110}表示这个节点代表的字符是c,而其权重是1,即只有1个c,而1110表示这个节点的编号。编号规则是
①根节点编号为空
②对于节点u其编号为s,设其子节点从左到右分别为v1,v2,v3……,则其编号分别s1,s10,s100……以此类推,即每次多一个0
这样判断p是否是q的后辈点的方法就是:在q的后面加一个“1”,然后判断是否是p的前缀,如果是则p是q的后辈节点
2、关于Head-Table,在PLWAP-Tree中其是在整棵树构建成功后再构建PLWAP-Tree链表的(和WAP-Tree的不同,希望大家好好体会),构建的方案是按照先序遍历的顺序(上图的虚线部分)。大家可以和WAP-Tree的Head-Table的虚线箭头做一下对比,很容易就能发现它们的区别。
PLWAP-Tree代码实现(c++)
这里放上我自己实现的PLWAP-Tree代码,供给大家参考
#include
#include
#include
#include
#include
#include
#include
#include
#define alp_maxn 130
using namespace std;
struct Node{
char alp;
int alp_count;
struct Node * nex;
vector
string seq;
Node(int _siz, char _alp);
};
class PLWAPTREE{
private:
Node * root; //the root of the plwap-tree
Node * Head_Table[alp_maxn]; //Head_Table
Node * alp_las[alp_maxn];
int lamda; //lamda
int alp\_tot; //the number of valid words
char alp\_link\[alp\_maxn\]; //discratization
int alp\_count\[alp\_maxn\]; //discratization
map<char, int>alp\_translate; //discratization
public:
vector<string>reads;
vector<string>feq; //the frequent words
void Init(int \_lamda);
void AddString(string st);
void BuildTree();
void BuildTree(Node \*s, string id);
void SearchFeq(vector<string>R, string now\_feq);
void print\_tree(Node \*s); //debug only...
Node \* get\_root(); //debug only...
};
Node * PLWAPTREE::get_root(){
return root;
}
void PLWAPTREE::print_tree(Node *s){
if (s == NULL) return;
cout << "char : " << s->alp << " seq : " << s->seq << " alp_count : " << s->alp_count;
if (s->nex != NULL) cout << " nex_seq :" << s->nex->seq << endl;
else cout << endl;
for (int i = 0; i < alp_tot; i++)
print_tree(s->son[i]);
}
Node::Node(int _siz, char _alp = -1){
nex = NULL;
son.clear();
while (_siz--) {
son.push_back(NULL);
}
alp = _alp;
alp_count = 0;
}
void PLWAPTREE::Init(int _lamda){
root = new Node(alp_maxn);
for (int i = 0; i < alp_maxn; i++){
Head_Table[i] = NULL;
alp_count[i] = 0;
alp_las[i] = NULL;
}
reads.clear();
feq.clear();
alp_translate.clear();
alp_tot = 0;
lamda = _lamda;
}
void PLWAPTREE::AddString(string st){
int alp_tmp[alp_maxn];
memset(alp_tmp, 0, sizeof(alp_tmp));
for (int i = 0; i < st.length(); i++)
alp_tmp[(int)st[i]] = 1;
for (int i = 0; i < alp_maxn; i++)
alp_count[i] += alp_tmp[i];
reads.push_back(st);
}
void PLWAPTREE::BuildTree(){
for (int i = 0; i < alp_maxn; i++){
if (alp_count[i] >= lamda){
alp_link[alp_tot] = (char)i;
alp_translate[(char)i] = alp_tot;
alp_tot++;
}
} //discretization to save memory and time
printf("-discretization success !\\n");
for (int i = 0; i < reads.size(); i++){
string now\_string = reads\[i\];
Node \* pnow = root;
for (int j = 0; j < now\_string.length(); j++){
if (alp\_count\[(int)now\_string\[j\]\] < lamda) continue;
int sig = alp\_translate\[now\_string\[j\]\];
if (pnow->son\[sig\] == NULL){
Node \* tmp = new Node(alp\_tot, now\_string\[j\]);
pnow->son\[sig\] = tmp;
}
pnow = pnow->son\[sig\];
pnow->alp\_count++;
}
}
printf("-trip-build success !\\n");
BuildTree(root, "");
}
void PLWAPTREE::BuildTree(Node *s, string id){
string seq = id + "1";
for (int i = 0; i < alp_tot; i++){
if (s->son[i] == NULL) continue;
if (Head_Table[i] == NULL){
Head_Table[i] = s->son[i];
}
if (alp_las[i] != NULL){
alp_las[i]->nex = s->son[i];
}
alp_las[i] = s->son[i];
s->son[i]->seq = seq;
BuildTree(s->son[i], seq);
seq = seq + "0";
}
}
void PLWAPTREE::SearchFeq(vector
for (int i = 0; i < alp_tot; i++){
Node * p = Head_Table[i];
bool flag = true;
if (R.size() != 0){
flag = false;
while (p != NULL){
for (int j = 0; j < R.size(); j++){
string str = R[j] + "1";
int sig = p->seq.find(str);
if (sig == 0){
flag = true;
break;
}
}
if (flag) break;
p = p->nex;
}
}
if (flag == false) continue;
int C = p->alp\_count;
string S = p->seq;
vector<string>Rs; Rs.clear();
Rs.push\_back(p->seq);
for (p = p->nex; p != NULL; p = p->nex){
bool is\_son\_of\_R = false;
bool is\_son\_of\_S = false;
if (R.size() == 0) is\_son\_of\_R = true;
else{
for (int j = 0; j < R.size(); j++){
string str = R\[j\] + "1";
int sig = p->seq.find(str);
if (sig == 0){
is\_son\_of\_R = true;
break;
}
}
}
string str = S + "1";
int sig = p->seq.find(str);
if (sig == 0){
is\_son\_of\_S = true;
}
if (is\_son\_of\_R == true && is\_son\_of\_S == false){
C += p->alp\_count;
Rs.push\_back(p->seq);
S = p->seq;
}
}
if (C >= lamda){
feq.push\_back(now\_feq + alp\_link\[i\]);
SearchFeq(Rs, now\_feq + alp\_link\[i\]);
}
}
}
int main(){
PLWAPTREE pt;
pt.Init(3);
printf("Init success !\\n");
pt.AddString("abdac");
pt.AddString("eaebcac");
pt.AddString("babfaec");
pt.AddString("afbacfc");
printf("read string success !\\n");
pt.BuildTree();
printf("Buile tree success !\\n");
/\*
printf("tree just like :\\n");
pt.print\_tree(pt.get\_root());
\*/
vector<string>tmp; tmp.clear();
pt.SearchFeq(tmp, "");
printf("result : \\n");
for (int i = 0; i < pt.feq.size(); i++)
cout << pt.feq\[i\] << endl;
getchar();
return 0;
}
参考资料:
《Effective Web Log Mining using WAP Tree-Mine》
《PLWAP Sequential Mining: Open Source Code ∗》
等
手机扫一扫
移动阅读更方便
你可能感兴趣的文章