[source code] Python Programming Tutorial - 35, 36, 37 - Word Frequency Counter

+1 Bucky Roberts · September 28, 2014
import requests
from bs4 import BeautifulSoup
import operator
 
 
# Create a list of words
def start(url):
    word_list = []
    source_code = requests.get(url).text
    # name the parser explicitly so BeautifulSoup does not have to guess one
    soup = BeautifulSoup(source_code, 'html.parser')
    # loop through each post and get text
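    for post_text in soup.find_all('a', {'class': 'index_singleListingTitles'}):
        content = post_text.string
        # .string is None when the link holds nested tags, so skip those titles
        if content is None:
            continue
        # break each post up into a list of words
        words = content.lower().split()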
        for each_word in words:
            word_list.append(each_word)
    clean_up_list(word_list)
 
 
# Remove odd symbols (words are already lowercased in start)
def clean_up_list(word_list):
    clean_word_list = []
    for word in word_list:
        symbols = "!@#$%^&*()_+{}|:<>?,./;'[]\\=-\""
        for i in range(0, len(symbols)):
            # remove one symbol at a time; passing the whole string would only match it as a unit
            word = word.replace(symbols[i], "")
        if len(word) > 0:
            clean_word_list.append(word)
    create_dictionary(clean_word_list)
 
 
# Create dictionary with word counts
def create_dictionary(clean_word_list):
    word_count = {}
    for word in clean_word_list:
        if word in word_count:
            word_count[word] += 1
        else:
            word_count[word] = 1
    # sort this dictionary by (0 for key, 1 for values)
    for key, value in sorted(word_count.items(), key=operator.itemgetter(1)):
        print(key, value)
 
 
start('https://buckysroom.org/tops.php?type=text&period=this-month')
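The counting and sorting step can also be written with collections.Counter from the standard library. A minimal sketch, assuming the cleaned words are already in a list; the function name count_words and the sample words are made up for illustration and are not from the tutorial:

```python
from collections import Counter


def count_words(clean_word_list):
    # Counter handles the "add one if seen, else start at one" bookkeeping
    word_count = Counter(clean_word_list)
    # sort by count (index 1 of each (word, count) pair), lowest first,
    # which is the same order the tutorial code prints in
    return sorted(word_count.items(), key=lambda pair: pair[1])


if __name__ == "__main__":
    sample = ["python", "code", "python", "help", "python", "code"]
    for key, value in count_words(sample):
        print(key, value)
```

Counter also offers most_common() if the highest counts should come first.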


Replies

0 qin zhongke · October 5, 2014
Hello Bucky:
  I have tested the code in Python 2.7 and found something wrong.
  In the second loop of the clean_up_list function, the symbols variable has to be indexed with [i], as follows:
  word = word.replace(symbols[i], "")
0 Kevin Devey · October 12, 2014
Some guy has put up a post in the location being crawled with a URL in the title, and the lower().split() call really doesn't like it! It coughs out an AttributeError on it... exceptions time!
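One way to get past that AttributeError without stopping the crawl, sketched under assumptions: in BeautifulSoup a tag's .string is None when the tag contains more than one child, so calling .lower() on it fails. The titles are modelled here as a plain list that may contain None, and the helper name words_from_titles and the sample titles are hypothetical:

```python
def words_from_titles(titles):
    """Split each title into lowercase words, skipping titles that are None."""
    word_list = []
    for content in titles:
        if content is None:
            # nothing to call .lower() on, so move on to the next title
            continue
        word_list.extend(content.lower().split())
    return word_list


if __name__ == "__main__":
    print(words_from_titles(["Need Help With Loops", None, "Python GUI question"]))
```

Calling get_text() on the tag instead of reading .string is another option, since it returns a string even when the tag has nested children.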
0 bai nie · September 21, 2015
Hi Bucky,
I tried to crawl the source code from this web page, as you suggested at the end of the tutorial video, but did not succeed. With my code, I can find each thread titled '[source code]'. The problem is that I don't know how to extract the code. Can you please give a hint? Thanks.
0 Thanos Ktistakis · October 14, 2014
Yeah I was the one that asked on youtube and I still can't figure out what's wrong :/ Bucky save us! Thanks for the videos by the way...
0 bai nie · September 23, 2015
Hi Bucky, please ignore my previous post. I figured out how to do that.
0 Jakob Jensen · September 30, 2014
Hi Bucky

Is this the last video in the series, or are more videos coming that I should watch before I go to the Python GUI series? :)

- Jakob 
0 Bucky Roberts · September 30, 2014
I am going to be making a lot more for both series very soon. 
0 Wayne [Im That Damn Good] Leyden · October 1, 2014
Could you use .lstrip(chars) instead of the for loop?
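For reference, a small sketch of how the string methods differ. lstrip(chars) only trims matching characters from the left end of a word, and strip(chars) trims both ends, so symbols in the middle of a word survive either way. str.translate with a table built by str.maketrans removes them everywhere in one pass. This assumes Python 3, and the sample word is made up:

```python
symbols = "!@#$%^&*()_+{}|:<>?,./;'[]\\=-\""

# the third argument to maketrans lists the characters to delete
remove_symbols = str.maketrans("", "", symbols)

word = "(don't!)"
left_only = word.lstrip(symbols)             # trims the left end only
both_ends = word.strip(symbols)              # trims both ends, keeps the inner '
everywhere = word.translate(remove_symbols)  # removes every listed symbol

if __name__ == "__main__":
    print(left_only, both_ends, everywhere)
```

Whether the inner apostrophe should go is a judgment call; the tutorial's loop removes it, so translate is the closer match.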
0 Simon Ward · September 12, 2015
So, I coded this the exact same way as you did, but there seems to be a problem with the encoding. As soon as the character " appears, there are problems and it stops. What encoding do you use? Because &amp, &lt, &gt are interpreted wrong by mine too: it takes each of the individual symbols and replaces them with nothing.
This is the error message:
  File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 1-2: character maps to <undefined>
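That traceback most likely comes from print() sending a character to a Windows console whose encoding (cp1252 here) has no mapping for it, rather than from the word-cleaning code itself. A minimal sketch of one workaround, assuming the goal is only to keep the loop running: re-encode each string for the console with errors="replace" so unmappable characters become "?". The helper name console_safe and the sample text are hypothetical:

```python
import sys


def console_safe(text, encoding=None):
    """Return text with characters the target encoding cannot represent replaced by '?'."""
    # fall back to the console's own encoding, or UTF-8 if it is unknown
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    return text.encode(encoding, errors="replace").decode(encoding)


if __name__ == "__main__":
    # U+2192 (rightwards arrow) has no cp1252 mapping, so it comes out as ?
    print(console_safe("tops \u2192 text", "cp1252"))
```

In the tutorial code this would mean printing console_safe(key) instead of key in the final loop.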