
Importing RSS sources for Naive Bayes classification (from Machine Learning in Action), plus debugging and testing Chinese word cloud code

18 April 2019


Chinese word cloud code: debugging and testing

Word clouds are a fun thing.

I use jieba for word segmentation. The essay text goes in "mori.txt" and the stopword list in "stopword.txt". How good the segmentation turns out depends heavily on the stopwords, which need continual tuning and additions.

from wordcloud import WordCloud
import jieba

f = open(u'mori.txt', 'r').read()
##cuttext = " ".join(jieba.cut(f))
cuttext = jieba.cut(f)
final = []
stopwords = open(u'stopword.txt', 'r').read()

for seg in cuttext:
    ##seg = seg.encode('utf-8')
    if seg[0] not in ['0','1','2','3','4','5','6','7','8','9']:  ## skip tokens that start with a digit
        if seg not in stopwords:
            final.append(seg)  ## keep this token

font = r"c:/Windows/Fonts/simsun.ttc"  ## a CJK font is required, otherwise Chinese characters will not display
wordcloud = WordCloud(font_path=font, background_color="white", width=1000, height=860, margin=2).generate(" ".join(final))

import matplotlib.pyplot as plt
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

wordcloud.to_file('test.png')

Result:

[Figure 1: the generated word cloud]

Below is the code for counting word frequencies and sorting by frequency and by word length.

## Count word frequencies
freqD2 = {}
for word2 in final:
    freqD2[word2] = freqD2.get(word2, 0) + 1

## Sort by frequency (descending) and write the result out
counter_list = sorted(freqD2.items(), key=lambda x: x[1], reverse=True)
_2000 = counter_list[0][1] + 1
print(_2000)  ## max frequency + 1, used as the weight in the length-then-frequency sort below
fp = open('sort.txt', "w+", encoding='utf-8')
for d in counter_list:
    fp.write(d[0] + ':' + str(d[1]))
    fp.write('\n')
fp.close()

## Sort by word length first, then by frequency, and write the result out
## (multiplying the length by max frequency + 1 guarantees a longer word always ranks above a shorter one)
counter_list = sorted(freqD2.items(), key=lambda x: len(x[0])*_2000 + x[1], reverse=True)
fp = open('sortlen.txt', "w+", encoding='utf-8')
for d in counter_list:
    fp.write(d[0] + ':' + str(d[1]))
    fp.write('\n')
fp.close()

The sorting code is handy and worth keeping around. Python is a fine thing: powerful and easy to reuse.
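As a side note, the frequency count and the frequency-sorted output can also be written with the standard library's collections.Counter; a minimal sketch, assuming final is the token list built above:

from collections import Counter

freq = Counter(final)                    # word -> number of occurrences
for word, count in freq.most_common():   # most_common() is already sorted by descending count
    print(word + ':' + str(count))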

 


Importing RSS sources for the Naive Bayes classifier in Machine Learning in Action

While following the book's code I got stuck here. Other readers have probably run into the same problems, so I'm writing them down to share.

 


How do I install feedparser?

Installing feedparser directly from the address given in the book produces an error saying setuptools is missing. Going after setuptools, the official advice is that on Windows it's best to install via ez_setup.py, but I really couldn't download that ez_setup.py from the official site. This post gave a workaround:

ez_setup.py

Copy that file straight into the C:\python27 folder, then run from the command line: python ez_setup.py install

Then change to the folder holding the feedparser installation files and run: python setup.py install
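As an aside, on a reasonably recent Python installation the whole setuptools detour is usually unnecessary; installing straight from PyPI on the command line is normally enough:

pip install feedparser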

 

Let me grumble first: I suspect the main reason most people get stuck here is the mighty GFW, so friends with no way over the wall, by software or in person, probably cannot see this post at all. If you want to keep following along, please make use of your wall-jumping skills.


About the RSS source links given by the author: the book's idea is to treat the articles coming from one source as class-1 articles and the articles coming from the other source as class-0 articles.

To get the example code running, you can find any two working RSS sources as substitutes.

I used these two sources:

NASA Image of the Day: http://www.nasa.gov/rss/dyn/image_of_the_day.rss

Yahoo Sports – NBA – Houston Rockets News: http://sports.yahoo.com/nba/teams/hou/rss.xml

In other words, if the algorithm runs correctly, every article from NASA should be classified as 1, and every Houston Rockets item from Yahoo Sports should be classified as 0.
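With those two substitute feeds, a run boils down to something like the following; this uses the localWords function from the full listing further down, and the same feed URLs as its testRSS helper:

>>> import feedparser
>>> ny = feedparser.parse('http://www.nasa.gov/rss/dyn/image_of_the_day.rss')   # NASA, class 1
>>> sf = feedparser.parse('http://sports.yahoo.com/nba/teams/hou/rss.xml')      # Rockets news, class 0
>>> vocabList, pSF, pNY = localWords(ny, sf)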

 

 


Using my own RSS sources, the program throws an error when it reaches trainNB0(array(trainMat),array(trainClasses)). What to do?

Judging from the example in the book, the sources the author uses carry quite a few articles: len(ny['entries']) is 100, whereas the RSS sources I found only have around 10 to 18 entries each.

>>> import feedparser
>>> ny = feedparser.parse('')
>>> ny['entries']
>>> len(ny['entries'])
100

Because the author's two RSS sources each have 100 articles, he can afford to strip out the thirty-odd most frequent "stop words" in the code and randomly pick 20 articles as the test set. With the substitute RSS sources we only have about 10 articles, so drawing 20 test articles is obviously going to fail. Simply shrink the test set and the code will run; and if the articles contain too few words, removing fewer "stop words" can improve the algorithm's accuracy.
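A minimal sketch of that adjustment inside localWords, replacing the fixed holdout (20 in the book, 5 in the listing below) with one scaled to the corpus size; the 20% fraction is my own choice, not the book's:

    # hold out roughly 20% of the 2*minLen documents instead of a fixed number
    numTest = max(1, int(0.2 * 2 * minLen))
    trainingSet = range(2*minLen); testSet = []
    for i in range(numTest):
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])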

 


If you don't want to remove the 30 words that rank highest by frequency, how else can you drop "stop words"?

You can put the stop words you want to drop in a txt file and read it at run time (in place of the code that removes the high-frequency words). For which words to use, see the list the code comment below points to (http://www.ranks.nl/stopwords).

For the code below to run properly, the stop words need to be saved in stopword.txt.

My txt file holds the following words, and the results are decent (a small snippet for writing them out to the file follows the list):

a
about
above
after
again
against
all
am
an
and
any
are
aren’t
as
at
be
because
been
before
being
below
between
both
but
by
can’t
cannot
could
couldn’t
did
didn’t
do
does
doesn’t
doing
don’t
down
during
each
few
for
from
further
had
hadn’t
has
hasn’t
have
haven’t
having
he
he’d
he’ll
he’s
her
here
here’s
hers
herself
him
himself
his
how
how’s
i
i’d
i’ll
i’m
i’ve
if
in
into
is
isn’t
it
it’s
its
itself
let’s
me
more
most
mustn’t
my
myself
no
nor
not
of
off
on
once
only
or
other
ought
our
ours
ourselves
out
over
own
same
shan’t
she
she’d
she’ll
she’s
should
shouldn’t
so
some
such
than
that
that’s
the
their
theirs
them
themselves
then
there
there’s
these
they
they’d
they’ll
they’re
they’ve
this
those
through
to
too
under
until
up
very
was
wasn’t
we
we’d
we’ll
we’re
we’ve
were
weren’t
what
what’s
when
when’s
where
where’s
which
while
who
who’s
whom
why
why’s
with
won’t
would
wouldn’t
you
you’d
you’ll
you’re
you’ve
your
yours
yourself
yourselves
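If you would rather not type the file by hand, a one-off snippet like this writes the words out to the file name the code expects; here stop_words is assumed to be a Python list holding the words above (only the first few are shown):

stop_words = ['a', 'about', 'above', 'after']  # ... continue with the rest of the list above
with open('stopword.txt', 'w') as f:
    f.write('\n'.join(stop_words))  # one stop word per line, as stopWords() expects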

 

'''
Created on Oct 19, 2010

@author: Peter
'''
from numpy import *

def loadDataSet():
    postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help','my','dog', 'please'],
                 ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                 ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                 ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                 ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                 ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0,1,0,1,0,1]    #1 is abusive, 0 not
    return postingList,classVec

def createVocabList(dataSet):
    vocabSet = set([])  #create empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document) #union of the two sets
    return list(vocabSet)

def bagOfWords2Vec(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
        else: print "the word: %s is not in my Vocabulary!" % word
    return returnVec

def trainNB0(trainMatrix,trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory)/float(numTrainDocs)
    p0Num = ones(numWords); p1Num = ones(numWords)      #start counts at 1 (Laplace smoothing, so no word gets probability 0)
    p0Denom = 2.0; p1Denom = 2.0                        #and start denominators at 2.0 for the same reason
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = log(p1Num/p1Denom)          #take logs to avoid underflow when many small probabilities are multiplied
    p0Vect = log(p0Num/p0Denom)          #take logs to avoid underflow
    return p0Vect,p1Vect,pAbusive

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)    #element-wise mult
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else: 
        return 0

def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec

def testingNB():
    print '*** load dataset for training ***'
    listOPosts,listClasses = loadDataSet()
    print 'listOPost:\n',listOPosts
    print 'listClasses:\n',listClasses
    print '\n*** create Vocab List ***'
    myVocabList = createVocabList(listOPosts)
    print 'myVocabList:\n',myVocabList
    print '\n*** Vocab show in post Vector Matrix ***'
    trainMat=[]
    for postinDoc in listOPosts:
        trainMat.append(bagOfWords2Vec(myVocabList, postinDoc))
    print 'train matrix:',trainMat
    print '\n*** train P0V p1V pAb ***'
    p0V,p1V,pAb = trainNB0(array(trainMat),array(listClasses))
    print 'p0V:\n',p0V
    print 'p1V:\n',p1V
    print 'pAb:\n',pAb
    print '\n*** classify ***'
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(bagOfWords2Vec(myVocabList, testEntry))
    print testEntry,'classified as: ',classifyNB(thisDoc,p0V,p1V,pAb)
    testEntry = ['stupid', 'garbage']
    thisDoc = array(bagOfWords2Vec(myVocabList, testEntry))
    print testEntry,'classified as: ',classifyNB(thisDoc,p0V,p1V,pAb)

def textParse(bigString):    #input is big string, #output is word list
    import re
    listOfTokens = re.split(r'\W*', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2] 

def spamTest():
    docList=[]; classList = []; fullText =[]
    for i in range(1,26):
        wordList = textParse(open('email/spam/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(open('email/ham/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)#create vocabulary
    trainingSet = range(50); testSet=[]           #create test set
    for i in range(10):
        randIndex = int(random.uniform(0,len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])  
    trainMat=[]; trainClasses = []
    for docIndex in trainingSet:#train the classifier (get probs) trainNB0
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V,p1V,pSpam = trainNB0(array(trainMat),array(trainClasses))
    errorCount = 0
    for docIndex in testSet:        #classify the remaining items
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(array(wordVector),p0V,p1V,pSpam) != classList[docIndex]:
            errorCount += 1
            print "classification error",docList[docIndex]
    print 'the error rate is: ',float(errorCount)/len(testSet)
    #return vocabList,fullText

def calcMostFreq(vocabList,fullText):
    import operator
    freqDict = {}
    for token in vocabList:
        freqDict[token]=fullText.count(token)
    sortedFreq = sorted(freqDict.iteritems(), key=operator.itemgetter(1), reverse=True) 
    return sortedFreq[:30]       

def stopWords():
    import re
    wordList =  open('stopword.txt').read() # see http://www.ranks.nl/stopwords
    listOfTokens = re.split(r'\W*', wordList)
    print 'read stop word from \'stopword.txt\':',listOfTokens
    return [tok.lower() for tok in listOfTokens]

def localWords(feed1,feed0):
    import feedparser
    docList=[]; classList = []; fullText =[]
    print 'feed1 entries length: ', len(feed1['entries']), '\nfeed0 entries length: ', len(feed0['entries'])
    minLen = min(len(feed1['entries']),len(feed0['entries']))
    print '\nmin Length: ', minLen
    for i in range(minLen):
        wordList = textParse(feed1['entries'][i]['summary'])
        print '\nfeed1\'s entries[',i,']\'s summary - ','parse text:\n',wordList
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1) #NY is class 1
        wordList = textParse(feed0['entries'][i]['summary'])
        print '\nfeed0\'s entries[',i,']\'s summary - ','parse text:\n',wordList
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)#create vocabulary
    print '\nVocabList is ',vocabList
    print '\nRemove Stop Word:'
    stopWordList = stopWords()
    for stopWord in stopWordList:
        if stopWord in vocabList:
            vocabList.remove(stopWord)
            print 'Removed: ',stopWord
##    top30Words = calcMostFreq(vocabList,fullText)   #remove top 30 words
##    print '\nTop 30 words: ', top30Words
##    for pairW in top30Words:
##        if pairW[0] in vocabList:
##            vocabList.remove(pairW[0])
##            print '\nRemoved: ',pairW[0]
    trainingSet = range(2*minLen); testSet=[]           #create test set
    print '\n\nBegin to create a test set: \ntrainingSet:',trainingSet,'\ntestSet',testSet
    for i in range(5):
        randIndex = int(random.uniform(0,len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    print 'random select 5 sets as the testSet:\ntrainingSet:',trainingSet,'\ntestSet',testSet
    trainMat=[]; trainClasses = []
    for docIndex in trainingSet:#train the classifier (get probs) trainNB0
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    print '\ntrainMat length:',len(trainMat)
    print '\ntrainClasses',trainClasses
    print '\n\ntrainNB0:'
    p0V,p1V,pSpam = trainNB0(array(trainMat),array(trainClasses))
    #print '\np0V:',p0V,'\np1V',p1V,'\npSpam',pSpam
    errorCount = 0
    for docIndex in testSet:        #classify the remaining items
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        classifiedClass = classifyNB(array(wordVector),p0V,p1V,pSpam)
        originalClass = classList[docIndex]
        result =  classifiedClass != originalClass
        if result:
            errorCount += 1
        print '\n',docList[docIndex],'\nis classified as: ',classifiedClass,', while the original class is: ',originalClass,'. --',not result
    print '\nthe error rate is: ',float(errorCount)/len(testSet)
    return vocabList,p0V,p1V

def testRSS():
    import feedparser
    ny=feedparser.parse('http://www.nasa.gov/rss/dyn/image_of_the_day.rss')
    sf=feedparser.parse('http://sports.yahoo.com/nba/teams/hou/rss.xml')
    vocabList,pSF,pNY = localWords(ny,sf)

def testTopWords():
    import feedparser
    ny=feedparser.parse('http://www.nasa.gov/rss/dyn/image_of_the_day.rss')
    sf=feedparser.parse('http://sports.yahoo.com/nba/teams/hou/rss.xml')
    getTopWords(ny,sf)

def getTopWords(ny,sf):
    import operator
    vocabList,p0V,p1V=localWords(ny,sf)
    topNY=[]; topSF=[]
    for i in range(len(p0V)):
        if p0V[i] > -6.0 : topSF.append((vocabList[i],p0V[i]))
        if p1V[i] > -6.0 : topNY.append((vocabList[i],p1V[i]))
    sortedSF = sorted(topSF, key=lambda pair: pair[1], reverse=True)
    print "SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**"
    for item in sortedSF:
        print item[0]
    sortedNY = sorted(topNY, key=lambda pair: pair[1], reverse=True)
    print "NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**"
    for item in sortedNY:
        print item[0]

def test42():
    print '\n*** Load DataSet ***'
    listOPosts,listClasses = loadDataSet()
    print 'List of posts:\n', listOPosts
    print 'List of Classes:\n', listClasses

    print '\n*** Create Vocab List ***'
    myVocabList = createVocabList(listOPosts)
    print 'Vocab List from posts:\n', myVocabList

    print '\n*** Vocab show in post Vector Matrix ***'
    trainMat=[]
    for postinDoc in listOPosts:
        trainMat.append(bagOfWords2Vec(myVocabList,postinDoc))
    print 'Train Matrix:\n', trainMat

    print '\n*** Train ***'
    p0V,p1V,pAb = trainNB0(trainMat,listClasses)
    print 'p0V:\n',p0V
    print 'p1V:\n',p1V
    print 'pAb:\n',pAb
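Assuming the listing is saved as bayes.py, a typical interactive session with it looks like this (the function names are the ones defined above):

>>> import bayes
>>> bayes.test42()        # the toy abusive-post example
>>> bayes.testRSS()       # train and test on the two substitute RSS feeds
>>> bayes.testTopWords()  # print the most indicative words for each feed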
