I got most of the web crawler working, but I need help with a few problems.

+1 Sicong Ye · January 30, 2016
So I got most of it working, but when I try to pull all the links from the website more "dynamically", an error occurs. Please tell me how to fix "Can't convert 'NoneType' object to str implicitly".

Source code and error message here.
import requests
from bs4 import BeautifulSoup

def photo_spider(max_pages):
    page = 100
    while page <= max_pages:
        url = r'https://bloomington.craigslist.org/search/apa?s=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.findAll('a', {'class': 'hdrlnk'}):
            href = "https://bloomington.craigslist.org" + link.get('href')
            title = link.string
            # print(href)
            # print(title)
            get_single_item_data(href)
        page += 100

def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")
    for item_price in soup.findAll('span', {'class': "price"}):
        print(item_price.string)
    for item_address in soup.findAll('div', {'class': "mapaddress"}):
        print(item_address.string)

    for link in soup.findAll('a'):
        href = "https://bloomington.craigslist.org" + link.get('href')
        print(href)

photo_spider(1000)


Message here.

$1250
E. 10th St. at IN State Rd. 45/46
https://bloomington.craigslist.org/
https://bloomington.craigslist.orghttps://bloomington.craigslist.org
https://bloomington.craigslist.org/
https://bloomington.craigslist.org/hhh
https://bloomington.craigslist.org/apa
https://bloomington.craigslist.orghttps://post.craigslist.org/c/bmg
https://bloomington.craigslist.orghttps://post.craigslist.org/c/bmg
https://bloomington.craigslist.orghttps://accounts.craigslist.org/login/home
https://bloomington.craigslist.orghttps://accounts.craigslist.org/login/home
https://bloomington.craigslist.org#
https://bloomington.craigslist.org/reply/bmg/apa/5397587859
https://bloomington.craigslist.orghttps://post.craigslist.org/flag?flagCode=28&postingID=5397587859&subareaid=0&areaid=229&cat=apa&area=bmg
https://bloomington.craigslist.orghttp://www.craigslist.org/about/prohibited
Traceback (most recent call last):
  File "C:/Users/Sicong/PycharmProjects/webcrawler/main.py", line 32, in <module>
    photo_spider(1000)
  File "C:/Users/Sicong/PycharmProjects/webcrawler/main.py", line 16, in photo_spider
    get_single_item_data(href)
  File "C:/Users/Sicong/PycharmProjects/webcrawler/main.py", line 29, in get_single_item_data
    href = "https://bloomington.craigslist.org" + link.get('href')
TypeError: Can't convert 'NoneType' object to str implicitly

Process finished with exit code 1


Also, one more minor thing: the URL of the first results page looks like this:

https://bloomington.craigslist.org/search/apa

It does not have a number to specify the page. The second page, however, is

https://bloomington.craigslist.org/search/apa?s=100

So basically I start getting everything from page 2. That is how I do it now, but I want to start from page 1.
The website is craigslist; I don't think anyone will bother me about the legal side of it.
Helllllllllllp.:P

Replies

+3 Nikita Volobuev · January 30, 2016
Your program gets everything correctly. But when it loops through the links on a web page, it finds something like this:


<a class="prev"> prev </a>


And when you try to get the "href" attribute value, you receive None (because the attribute doesn't exist), and adding None to a string raises the TypeError. To fix the error you just need to add a check:


for link in soup.findAll('a'):
    if link.has_attr('href'):  # checks if the link has an href attribute
        href = "https://bloomington.craigslist.org" + link.get('href')
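A related detail visible in the posted output: some links come out doubled, like https://bloomington.craigslist.orghttps://post.craigslist.org/c/bmg, because those href values are already absolute. Here is a minimal sketch of one way to handle both issues at once, using `urljoin` from the standard library instead of string concatenation. The names `absolute_links` and `BASE_URL` are made up for illustration, and the function takes plain href values (strings or None) so it can be tried without fetching anything.

```python
from urllib.parse import urljoin

# Hypothetical constant for this sketch; same site as in the question.
BASE_URL = "https://bloomington.craigslist.org"


def absolute_links(hrefs, base_url=BASE_URL):
    """Turn raw href values into absolute URLs, skipping missing ones."""
    result = []
    for href in hrefs:
        if href is None:
            # <a> tags without an href attribute give None; skip them.
            continue
        # urljoin leaves absolute URLs alone and resolves relative ones
        # against base_url, so no doubled prefix appears.
        result.append(urljoin(base_url, href))
    return result


# Sample values taken from the output in the question, plus a missing href.
raw_hrefs = ["/apa", None, "https://post.craigslist.org/c/bmg"]
for url in absolute_links(raw_hrefs):
    print(url)
```

With BeautifulSoup, the input could be built as `[link.get('href') for link in soup.findAll('a')]`.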
+2 Nikita Volobuev · January 31, 2016
https://bloomington.craigslist.org/search/apa?s=0
This URL is your first page (0 is the first page, 100 is the second, etc., because there are 100 items per page). So just change your code to:



def photo_spider(max_pages):
    page = 0  # start from the item with number 0 (first page)
    while page <= max_pages:
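To see which URLs the loop would visit when it starts at 0, here is a small sketch that only builds the search URLs and does not fetch anything. It assumes 100 items per page, as described in this reply; `page_urls` and `SEARCH_URL` are hypothetical names used only for this example.

```python
# Hypothetical constant for this sketch; the URL pattern is from the question.
SEARCH_URL = "https://bloomington.craigslist.org/search/apa?s="


def page_urls(max_offset, step=100):
    """List the search URLs for offsets 0, step, 2*step, ... up to max_offset."""
    # range() stops before its end value, so add 1 to include max_offset,
    # which mirrors the `while page <= max_pages` condition.
    return [SEARCH_URL + str(offset) for offset in range(0, max_offset + 1, step)]


for url in page_urls(200):
    print(url)
```

Each of these URLs could then be passed to `requests.get` in place of the hand-built `url` inside the while loop.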
+2 sfolje 0 · January 31, 2016
https://bloomington.craigslist.org/search/apa?s=0 shows the same page as https://bloomington.craigslist.org/search/apa, so just start with

page=0

instead of
page=100
+1 Sicong Ye · January 31, 2016
That helps, but how do you solve the last question?
Also, one more minor thing: the URL of the first results page looks like this:

https://bloomington.craigslist.org/search/apa

It does not have a number to specify the page. The second page, however, is

https://bloomington.craigslist.org/search/apa?s=100

So basically I start getting everything from page 2. That is how I do it now, but I want to start from page 1.
0 Sicong Ye · February 1, 2016
Thanks a lot to both of you!!!
0 sfolje 0 · February 1, 2016
thanks for points bra