Drawing a Word Cloud of WeChat Chat History with WordCloud

zcz 3303

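1. Exporting the WeChat chat history

1.1 On a rooted Android phone, locate the database file that stores the chat history

Tool: MT Manager

Path: /data/data/com.tencent.mm/MicroMsg/<a folder with a long name>/EnMicroMsg.db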


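1.2 The file is encrypted, so find your decryption key

KEY = IMEI (the phone's equipment identity number) + UIN (the WeChat user information number), joined as one string

Dial *#06# on the phone to get the IMEI

The UIN can be found at: /data/data/com.tencent.mm/shared_prefs/system_config_prefs.xml

Paste this KEY into an online tool to compute its MD5 value: https://bbs.ntrqq.net/md5.html. In the commonly described scheme, the database password is the first 7 characters of the lowercase MD5 value.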
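If you would rather not paste your IMEI and UIN into a website, the same value can be computed locally. The snippet below is a minimal sketch, not the method used in this post: the function name `derive_db_key` and the sample IMEI/UIN are made up for illustration, and it assumes the commonly described scheme in which the password is the first 7 characters of the lowercase MD5 hex digest.

```python
import hashlib


def derive_db_key(imei: str, uin: str) -> str:
    """Return the first 7 hex characters of MD5(IMEI + UIN).

    Assumption: the password is the first 7 characters of the
    lowercase hex digest, as commonly described for EnMicroMsg.db.
    """
    digest = hashlib.md5((imei + uin).encode("utf-8")).hexdigest()
    return digest[:7]


# Placeholder values for illustration only; substitute your own IMEI and UIN.
print(derive_db_key("123456789012345", "1234567890"))
```

Use the UIN exactly as it appears in the XML file, including a leading minus sign if there is one.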

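1.3 Download SQLite Database Browser, the software used to open the database

Download link: https://pan.baidu.com/s/1dDBa4FZ

Enter the KEY, open the database file EnMicroMsg.db, and export the message table in CSV format

Reference method: https://www.zhihu.com/question/19924224/answer/69982884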

2. Processing the data and drawing the word cloud

2.1 Import the third-party libraries

import jieba
import codecs
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import imageio
from wordcloud import WordCloud,ImageColorGenerator

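2.2 Extract the required data with Pandas

# Read the exported CSV file
# ('ANSI' resolves to the Windows system code page and is only available on Windows;
#  on other platforms an explicit encoding such as 'gbk' may be needed)
pd1=pd.read_csv(r'C:\users\email\desktop\wx_message.csv',encoding='ANSI')
# Drop the first two unwanted records, skip empty cells, and make sure every entry is a string
pd1=pd1.content[2:].dropna().astype(str)
word_list=list(pd1)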
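The exported message table also contains rows that are not plain text (images, system notices and so on), and these can clutter the word cloud. The sketch below shows one way to keep only text rows before segmenting. It runs on a tiny made-up CSV rather than the real export, and it assumes that the table has a `type` column in which the value 1 marks plain-text messages; check your own export before relying on that.

```python
import io

import pandas as pd

# Tiny stand-in for the exported message table (hypothetical rows).
sample_csv = io.StringIO(
    "msgId,type,content\n"
    "1,1,今天天气不错\n"
    "2,3,\n"
    "3,1,晚上一起吃饭吗\n"
    "4,10000,你撤回了一条消息\n"
)
messages = pd.read_csv(sample_csv)

# Assumption: type == 1 marks plain-text messages.
text_only = messages[messages["type"] == 1]

# Drop empty cells and force str so that jieba.cut never receives NaN.
word_list = text_only["content"].dropna().astype(str).tolist()
print(word_list)
```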

2.3 Segment the text with Jieba, then filter and count

# Segment each message into words
segment=[]
for text in word_list:
    for seg in jieba.cut(text):
        # Keep tokens longer than one character and skip line breaks
        if len(seg)>1 and seg!='\r\n':
            segment.append(seg)
# Remove stopwords
words_df=pd.DataFrame({'segment':segment})
stopwords=pd.read_csv(r'C:\users\email\desktop\stopwords.txt',index_col=False,quoting=3,sep='\t',names=['stopword'])
words_df=words_df[~words_df.segment.isin(stopwords.stopword)]

Stopword list: https://blog.csdn.net/dorisi_h_n_q/article/details/82114913

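# Count word frequencies
# (passing a renaming dict to agg raises an error in pandas 1.0 and later, so size() is used instead)
words_stat=words_df.groupby('segment').size().reset_index(name='计数')
words_stat=words_stat.sort_values(by='计数',ascending=False)
# These three rows were emoticons, so they are dropped
# (the index labels are specific to the author's data and will differ for yours)
words_stat.drop([5335,3577,7225],inplace=True)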
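Dropping rows by index label only works for one particular dataset, because the labels change whenever the chat history changes. As an alternative sketch (not what this post does), the counting can be done with `collections.Counter` from the standard library, removing unwanted tokens by value. The token list and the `unwanted` set below are invented for illustration.

```python
from collections import Counter

# Hypothetical tokens standing in for the filtered `segment` list.
tokens = ["吃饭", "哈哈", "吃饭", "捂脸", "周末", "哈哈", "吃饭"]

# Tokens to exclude, e.g. emoticon names (chosen here for illustration).
unwanted = {"捂脸"}

# Count every token that is not in the unwanted set.
counts = Counter(t for t in tokens if t not in unwanted)

# A plain word -> frequency dict, the shape that WordCloud.fit_words expects.
frequencies = dict(counts)
print(counts.most_common(2))
```

The resulting `frequencies` dict could be passed to `fit_words` in place of `words["计数"]`.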

2.4 Draw the word cloud with WordCloud

# Plot the word cloud, using the image as both mask and color source
bimg=imageio.imread(r'C:\users\email\desktop\words.jpg')
wordcloud=WordCloud(background_color="white",mask=bimg,font_path='simhei.ttf',scale=7)
words = words_stat.set_index("segment").to_dict()
wordcloud=wordcloud.fit_words(words["计数"])
bimgColors=ImageColorGenerator(bimg)
plt.axis("off")
plt.imshow(wordcloud.recolor(color_func=bimgColors))
plt.show()
wordcloud.to_file(r'C:\users\email\desktop\temp.png')

I won't post the word cloud of the actual chat history, so here is a sample of the effect instead

pic6.png