I've recently been studying the government work reports of several provinces and wanted to tally the high-frequency words in them. I assumed this was such a basic task that plenty of tools would be available online, but I was disappointed: I found only a single online tool, and it wasn't very accurate.
In hindsight that makes sense: a long report runs to tens of thousands of characters and contains hundreds or thousands of distinct phrases, and Chinese sentence and word segmentation alone is enough to stump plenty of AI systems.
I saw that many people online use Python + jieba to extract high-frequency words, and it looked fairly simple. I don't really know how to code, but I can follow an example. I'm recording the process here for future reference.
First, install jieba:
pip install jieba
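To confirm the install worked, you can try a quick segmentation test. This snippet and its sample phrase are my own illustration, not part of the original workflow; it shows the two jieba modes used below:

# Minimal sanity check for jieba (sample phrase is just an illustration)
import jieba

text = '政府工作报告'
print('/'.join(jieba.cut(text, cut_all=False)))  # precise mode (the default)
print('/'.join(jieba.cut(text, cut_all=True)))   # full mode: lists every possible word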
Then, in the working directory, run the following Python program:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import jieba
import jieba.analyse
import codecs
import re
from collections import Counter

class WordCounter(object):
    def count_from_file(self, file, top_limit=0):
        with codecs.open(file, 'r', 'utf-8') as f:
            content = f.read()
        # Collapse runs of whitespace and periods into single spaces
        content = re.sub(r'\s+', r' ', content)
        content = re.sub(r'\.+', r' ', content)
        return self.count_from_str(content, top_limit=top_limit)

    def count_from_str(self, content, top_limit=0):
        if top_limit <= 0:
            top_limit = 100
        # Count only words that rank among the top 100 TF-IDF keywords
        tags = jieba.analyse.extract_tags(content, topK=100)
        words = jieba.cut(content, cut_all=True)  # set jieba's segmentation mode as you like
        counter = Counter()
        for word in words:
            if word in tags:
                counter[word] += 1
        return counter.most_common(top_limit)

if __name__ == '__main__':
    counter = WordCounter()
    result = counter.count_from_file(r'bj.txt', top_limit=20)  # input file bj.txt; take the top 20 words
    for k, v in result:
        print(k, v)
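As an aside, jieba can also return keywords together with their TF-IDF weights directly, which skips the manual counting loop above. A minimal sketch; note the second value is a TF-IDF score, not a raw count:

import codecs
import jieba.analyse

with codecs.open('bj.txt', 'r', 'utf-8') as f:
    content = f.read()
# withWeight=True returns (keyword, TF-IDF weight) pairs instead of bare words
for word, weight in jieba.analyse.extract_tags(content, topK=20, withWeight=True):
    print(word, weight)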
Alternatively, run the following program, which produces similar results:
#! python3
# -*- coding: utf-8 -*-
import codecs
import jieba
from collections import Counter

def get_words(txt):
    seg_list = jieba.cut(txt, cut_all=True)  # set jieba's segmentation mode as you like
    c = Counter()
    for x in seg_list:
        # Skip single characters and line breaks
        if len(x) > 1 and x != '\r\n':
            c[x] += 1
    print('Common word frequency statistics:')
    for (k, v) in c.most_common(100):
        # Pad short words for alignment, then draw a crude bar chart with asterisks
        print('%s%s %s %d' % (' '*(5-len(k)), k, '*'*int(v/3), v))

if __name__ == '__main__':
    with codecs.open('bj.txt', 'r', 'utf8') as f:  # input file bj.txt
        txt = f.read()
    get_words(txt)
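One accuracy issue with both programs is that common function words can crowd out the interesting vocabulary. A simple remedy is to filter against a stopword list. Here is a minimal sketch assuming a hypothetical stopwords.txt (one word per line) alongside bj.txt; both the file and the filtering step are my own addition:

import codecs
import jieba
from collections import Counter

# stopwords.txt is a hypothetical one-word-per-line stopword list
with codecs.open('stopwords.txt', 'r', 'utf8') as f:
    stopwords = set(f.read().split())

with codecs.open('bj.txt', 'r', 'utf8') as f:
    txt = f.read()

# Drop single characters and anything on the stopword list before counting
c = Counter(x for x in jieba.cut(txt) if len(x) > 1 and x not in stopwords)
for k, v in c.most_common(20):
    print(k, v)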