web crawler + word freq. question! HELP!

Sicong Ye · February 2, 2016
I am building a web crawler plus a word counter, but I don't know how to combine them, and I feel stupid about that. Please teach me how to do this. I use the same website for both, and here is the source code for each.

Web Crawler:

import requests
from bs4 import BeautifulSoup

def photo_spider(max_pages):
    # craigslist paginates results with the s= offset, 100 listings per page
    page = 0
    while page <= max_pages:
        url = r'https://bloomington.craigslist.org/search/apa?s=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.findAll('a', {'class': 'hdrlnk'}):
            href = "https://bloomington.craigslist.org" + link.get('href')
            title = link.string
            # print(href)
            # print(title)
            get_single_item_data(href)
        page += 100

def get_single_item_data(item_url):
    # visit one listing and print its price and map address
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")
    for item_price in soup.findAll('span', {'class': "price"}):
        print(item_price.string)
    for item_address in soup.findAll('div', {'class': "mapaddress"}):
        print(item_address.string)

    # for link in soup.findAll('a'):
    #     href = "https://bloomington.craigslist.org" + link.get('href')
    #     print(href)

photo_spider(1000)

Word Count:

import requests
from bs4 import BeautifulSoup
import operator

# broken up into 3 functions:
# 1. gather every word, 2. strip the symbols, 3. count and print

def start(url):
    # build a list of every single word used in the listing titles
    word_list = []
    source_code = requests.get(url).text
    soup = BeautifulSoup(source_code, "html.parser")
    for title_text in soup.findAll('a', {'class': 'hdrlnk'}):
        content = title_text.string
        words = content.lower().split()
        for each_word in words:
            # print(each_word)
            word_list.append(each_word)
    clean_up_list(word_list)

def clean_up_list(word_list):
    # strip punctuation and digits from every word
    clean_word_list = []
    symbols = "!@#$%^&*()_+|{}:<>?-=[]\";',./1234567890"
    for word in word_list:
        for symbol in symbols:
            word = word.replace(symbol, "")
        if len(word) > 0:
            # print(word)
            clean_word_list.append(word)
    create_dictionary(clean_word_list)

def create_dictionary(clean_word_list):
    # tally each word and print them sorted by frequency (ascending)
    word_count = {}
    for word in clean_word_list:
        if word in word_count:
            word_count[word] += 1
        else:
            word_count[word] = 1
    for key, value in sorted(word_count.items(), key=operator.itemgetter(1)):
        print(key, value)

start('https://bloomington.craigslist.org/search/apa')

Each of them works well; I just don't know how to combine them.
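To show what I mean, here is a rough, untested sketch of one way I imagine combining them (crawl the result pages, collect every title word, then hand the list to the counting functions from my word count code above):

import requests
from bs4 import BeautifulSoup

def crawl_and_count(max_pages):
    # gather the words of every listing title across all result pages
    word_list = []
    page = 0
    while page <= max_pages:
        url = 'https://bloomington.craigslist.org/search/apa?s=' + str(page)
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        for link in soup.findAll('a', {'class': 'hdrlnk'}):
            title = link.string
            if title:  # skip links whose text is nested inside other tags
                word_list.extend(title.lower().split())
        page += 100
    # clean_up_list() then calls create_dictionary(), as in my word count code
    clean_up_list(word_list)

crawl_and_count(1000)

Is that roughly the right idea?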
My ultimate objective is to work on more complex webpages, like larger online shopping sites. SO I HAVE RUN INTO SOME PROBLEMS!
1. If I want to count all the words used in class="brandName" and class="productName", instead of a single category, how should I do it?


2. There is a class named "productPrice" under each a href. How could I collect the number instead of a string, in order to do statistics on it?


3. The only way I can do it right now is to go to the actual webpage to get all the information; for example, I have to click Men's, then Shoes, in order to get to the page. How can I create a function that loops through all the categories (for example, all the shoe categories) starting from a main page?

Thanks a lot, folks. I am a newbie even though I am an incoming information science grad student!
PLEASE TEACH ME, I WILL BUY YOU GUYS COOKIES, SERIOUSLY!


Replies

Sicong Ye · February 3, 2016
Thanks, man. The first question does not apply to craigslist; I am trying to do it on Zappos.com.
sfolje 0 · February 2, 2016
I didn't understand you completely, but I tried to answer.

1. I don't fully understand, but if I read you correctly, you want to go into each category and count all the "productName" values. I am confused, because the class "productName" doesn't exist on https://bloomington.craigslist.org/search/apa.
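But if your real site does have those classes, here is a minimal, untested sketch (the shop URL and the class names "brandName"/"productName" are only guesses taken from your question, so swap in the real ones):

import requests
from bs4 import BeautifulSoup

def count_words(url, class_names):
    # find_all() accepts a list of classes and matches a tag with any of them
    word_count = {}
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    for tag in soup.find_all(class_=class_names):
        for word in tag.get_text().lower().split():
            word_count[word] = word_count.get(word, 0) + 1
    return word_count

counts = count_words('http://www.example-shop.com/mens-shoes', ['brandName', 'productName'])
for word, count in sorted(counts.items(), key=lambda item: item[1]):
    print(word, count)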


2. You can convert a string to a number. First convert the string "$100" -> "100", then "100" -> 100 using int("100"). Or, if you mean this: you can get the data-id, which is the string "5420428760" in <a href="/apa/5420428760.html" data-id="5420428760" class="hdrlnk">Meadow Lark Senior Apartments</a>; just use link.get("data-id").
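Here is a minimal sketch of the price part on the craigslist page (on craigslist the class is "price"; on your site swap in "productPrice"):

import requests
from bs4 import BeautifulSoup

def parse_price(text):
    # "$1,100" -> 1100
    return int(text.strip().lstrip('$').replace(',', ''))

url = 'https://bloomington.craigslist.org/search/apa'
soup = BeautifulSoup(requests.get(url).text, "html.parser")
prices = [parse_price(span.string) for span in soup.findAll('span', {'class': 'price'})
          if span.string]
if prices:
    print(len(prices), 'prices found, average:', sum(prices) / len(prices))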



3. It very much depends on the website. In your case, you can notice that the categories are tied to the 3 letters at the end of https://bloomington.craigslist.org/search/apa, in this case "apa". Your crawler should first go to https://bloomington.craigslist.org and gather all the possible 3-letter codes: ggg, jjj, apa, ... When you have this list, you just crawl through all of those pages in a loop.
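A rough, untested sketch of that loop (I am assuming the category links on the front page have hrefs containing "/search/", so check the real HTML first):

import requests
from bs4 import BeautifulSoup

def get_category_urls(base_url):
    # collect every link on the front page that points to a search category
    soup = BeautifulSoup(requests.get(base_url).text, "html.parser")
    urls = set()
    for link in soup.find_all('a', href=True):
        href = link['href']
        if '/search/' in href:
            urls.add(href if href.startswith('http') else base_url + href)
    return urls

for category_url in get_category_urls('https://bloomington.craigslist.org'):
    print(category_url)  # run your crawler on each of these in a loop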
