3: Link Analysis: Link Statistics

Our group is currently working on source-code analysis and statistical link analysis in parallel; progress reports on both will be published later.

This post covers how to parse link relationships for use in statistical analysis.

In one sentence: life is short, so I use Python.

The basic approach is to walk the pages under the mirror directory, extract link targets with a regular expression, and output the link relationships.
The resulting file then serves as input to the next program, which counts each page's out-degree and in-degree and computes PageRank (PR) values.

Here is the source code:

# coding: utf-8
# Walk the Heritrix mirror directory, pull the href targets out of every
# page with a regular expression, and write the link graph to links.data
# as an edge list of "src_index dst_index" pairs. The index-to-URL
# mapping goes to linkindex.txt.

import os
import re

rootdir = '/home/xxx/workspace/heritrix/jobs/ccer-20100930010817713/mirror/www.ccer.pku.edu.cn'
HOST = 'www.ccer.pku.edu.cn'

# non-page resources whose links we skip
SKIP_EXTS = ('.css', '.jpg', '.bmp', '.jpeg', '.ico', '.gif', '.pdf', '.ppt',
             '.doc', '.xls', '.pptx', '.docx', '.xlsx', '.zip', '.png')

# group 1: optional opening quote, group 2: the target, group 3: closing quote
HREF_RE = re.compile(r'''href=('|")?([^\s'"><()]+)(\1?)''')

dotfile = open('links.data', 'w', buffering=4096000)
count = 0
urllist = []    # index -> URL
urlindex = {}   # URL -> index (avoids the O(n) list.index() lookup)

def append2list(url):
    """Return the numeric index of url, assigning a new one if unseen."""
    if url not in urlindex:
        urlindex[url] = len(urllist)
        urllist.append(url)
    return urlindex[url]

def extract(dirr, name):
    global count
    with open(os.path.join(dirr, name), 'r', errors='replace') as f:
        html = f.read()
    cururl = 'http://' + dirr[dirr.find(HOST):] + '/' + name
    curindex = append2list(cururl)

    for quote, target, endquote in HREF_RE.findall(html):
        if (quote != endquote               # mismatched quotes: bogus match
                or target in ('#', './')
                or target.startswith('mailto:')
                or target.startswith('javascript')
                or target.endswith(SKIP_EXTS)):
            continue
        if not target.startswith('http'):   # relative link
            if '.asp?' in target:
                # the mirror stores "page.asp?a=b" on disk as "pagea=b.asp"
                target = target.replace('.asp?', '', 1) + '.asp'
            target = 'http://' + dirr[dirr.find(HOST):] + '/' + target
        refindex = append2list(target)
        dotfile.write('%d %d\n' % (curindex, refindex))
        count += 1
        if count % 10000 == 0:
            print(count)

for dirr, dirs, files in os.walk(rootdir):
    for name in files:
        if os.path.splitext(name)[1] in ('.asp', '.htm', '.html'):
            extract(dirr, name)

dotfile.close()

with open('linkindex.txt', 'w') as urlfile:
    for url in urllist:
        urlfile.write(url + '\n')
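The "next program" that counts out-degrees and in-degrees and computes PR values is not included in the post. The following is a minimal sketch of what it could look like, assuming the "src dst" edge-list format of links.data produced above; the function name `degrees_and_pagerank` and the damping factor and iteration count are illustrative choices, not from the original.

```python
from collections import defaultdict

def degrees_and_pagerank(edges, n, d=0.85, iters=50):
    """edges: iterable of (src, dst) index pairs; n: number of pages."""
    out_deg = defaultdict(int)
    in_deg = defaultdict(int)
    out_links = defaultdict(list)
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
        out_links[src].append(dst)

    # PageRank by power iteration
    pr = [1.0 / n] * n
    for _ in range(iters):
        new = [(1.0 - d) / n] * n
        for src, dsts in out_links.items():
            share = d * pr[src] / len(dsts)
            for dst in dsts:
                new[dst] += share
        # dangling pages (no out-links) spread their rank evenly
        dangling = d * sum(pr[i] for i in range(n) if out_deg[i] == 0) / n
        pr = [x + dangling for x in new]
    return out_deg, in_deg, pr

if __name__ == '__main__':
    # tiny example graph: 0 -> 1, 0 -> 2, 1 -> 0; page 2 is dangling
    edges = [(0, 1), (0, 2), (1, 0)]
    out_deg, in_deg, pr = degrees_and_pagerank(edges, 3)
    print(out_deg[0], in_deg[2])  # 2 1
```

To run it on the extractor's real output, build the edge list from the file, e.g. `edges = [tuple(map(int, line.split())) for line in open('links.data')]`.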

Content copyright notice: unless otherwise stated, all articles are original to this site.

Reprints must credit the source: https://www.heiqu.com/25787.html