Having problems with the Web Crawler lesson 25. Please help

+1 Frank Baldassarre · January 25, 2016

import requests
from bs4 import BeautifulSoup


def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'https://www.thenewboston.com/search.php?type=0&sort=reputation&page==' + str(page)
        source_code = requests.get(url, allow_redirects=False)
        plain_text = source_code.text.encode('ascii', 'replace')
        # just get the code, no headers or anything
        BeautifulSoup = source_code.text
        # To do something with this text, we need to call an object from BeautifulSoup.
        # Otherwise it is just a bunch of text that we cannot work on

        # BeautifulSoup objects can be sorted through easy
        soup = BeautifulSoup([plain_text], 'html.parser')
        for link in soup.findAll('a', {'class': 'index_singleListingTitles'}):
            href = "https://www.thenewboston.com/" + link.get('href')
            title = link.string
            # just the text, not the HTML
            print(href)
            print(title)
            get_single_item_data(href)
        page += 1


def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    # if you want to gather information from that page
    for item_name in soup.findAll('div', {'class': 'i-name'}):
        print(item_name.string)
    # if you want to gather links for a web crawler
    for link in soup.findAll('a'):
        href = "https://www.thenewboston.com/" + link.get('href')
        print(href)


trade_spider(3)

The above code is slightly different from the source code posted with the original lesson: 1. the URL is different; 2. I got an error about redirects, so I made a change to line 9; and 3. I was getting a suggestion to specify the parser, which after some research prompted me to change line 17.
Currently I am getting the following traceback, even though there is no str variable anywhere in my code.

Traceback (most recent call last):
  File "C:/Python exercises/New Boston/Python_language/web_crawler.py", line 41, in <module>
    trade_spider(3)
  File "C:/Python exercises/New Boston/Python_language/web_crawler.py", line 17, in trade_spider
    soup = BeautifulSoup([plain_text],'html.parser')
TypeError: 'str' object is not callable

Process finished with exit code 1
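The str in that message is not a variable in your script: Python is telling you that the object being *called* is a string. Here is a minimal, self-contained sketch (no bs4 needed; the stand-in class below is hypothetical) that reproduces the same failure through name shadowing:

```python
# Stand-in for the class you get from `from bs4 import BeautifulSoup`
class BeautifulSoup:
    pass

# The buggy assignment rebinds the name to a plain string...
BeautifulSoup = "just a bunch of page text"

# ...so the later "constructor call" is really calling a str object.
try:
    soup = BeautifulSoup("<html></html>", 'html.parser')
except TypeError as e:
    print(e)  # 'str' object is not callable
```

Deleting the assignment restores the name to the imported class, and the call works again.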

I would appreciate some help.
Thank you


Replies

0 sfolje 0 · January 27, 2016
Here is the code, because it looks nicer here than in a PM.
What the code does: it goes to the 'people' page listing the users with the most points, follows each of them, and prints the URLs of the photos on their accounts. To see a photo, click its URL.

The important new lines are added in place, and the lines that caused errors are commented out.
I also added more print lines, so you can see in more detail what is going on.

import requests
from bs4 import BeautifulSoup


def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'https://www.thenewboston.com/search.php?type=0&sort=reputation&page==' + str(page)
        source_code = requests.get(url, allow_redirects=False)
        plain_text = source_code.text.encode('ascii', 'replace')
        # just get the code, no headers or anything
        # BeautifulSoup = source_code.text  # this line caused the error!
        # To do something with this text, we need to call an object from BeautifulSoup.
        # Otherwise it is just a bunch of text that we cannot work on

        # BeautifulSoup objects can be sorted through easy
        # soup = BeautifulSoup([plain_text], 'html.parser')  # this line caused the error!
        soup = BeautifulSoup(plain_text, 'html.parser')

        # for link in soup.findAll('a', {'class': 'index_singleListingTitles'}):
        for link in soup.findAll('a', {'class': 'user-name'}):
            print(' <<---BEGINNING OF LINK--->>')
            print('link: ', link)
            # href = "https://www.thenewboston.com/" + link.get('href')  # this string looks like: https://www.thenewboston.com/https://www.thenewboston.com/profile.php?user=2
            href = link.get('href')
            title = link.string
            # just the text, not the HTML
            print('href: ', href)
            print('title: ', title)
            get_single_item_data(href)
            print(' <<---END OF LINK--->>')
        print('page: ', page)
        page += 1


def get_single_item_data(item_url):
    print(' <<--- BEGINNING OF get_single_item_data() --->>')
    source_code = requests.get(item_url)
    plain_text = source_code.text
    # soup = BeautifulSoup(plain_text)
    soup = BeautifulSoup(plain_text, "lxml")  # use this line, to avoid the error!
    # if you want to gather information from that page
    # for item_name in soup.findAll('div', {'class': 'i-name'}):
    for item_name in soup.findAll('img', {'class': 'img-responsive'}):  # all images of the user
        # for item_name in soup.findAll('img', {'class': 'follow-page-icon'}):  # this also works: all images of pages followed by the user
        print('item_name :', item_name)
        photo = 'https://www.thenewboston.com' + item_name.get('src')
        print('Click the link to open the photo: ', photo)
        # print(item_name.string)  # item_name <img class="img-responsive" src="/photos/users/2/resized/351109af7635e9df16ac399df5228d2a.gif"/> has no string, so this won't print anything

    # if you want to gather links for a web crawler
    # for link in soup.findAll('a'):
    #     href = "https://www.thenewboston.com/" + link.get('href')
    #     print(href)
    print(' <<--- END OF get_single_item_data() --->>')


trade_spider(1)

end of code.
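A side note on the commented-out href line above: prefixing the site root onto an href that is already absolute produces a doubled URL. The standard library's urllib.parse.urljoin handles relative and absolute hrefs uniformly, sketched here as one possible fix:

```python
from urllib.parse import urljoin

base = 'https://www.thenewboston.com/'

# A relative href is resolved against the base:
print(urljoin(base, 'profile.php?user=2'))
# https://www.thenewboston.com/profile.php?user=2

# An absolute href is returned unchanged instead of being double-prefixed:
print(urljoin(base, 'https://www.thenewboston.com/profile.php?user=2'))
# https://www.thenewboston.com/profile.php?user=2
```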
0 Frank Baldassarre · January 27, 2016
Thank you for taking the time to comment on my post.
Unfortunately, what you said is not clear to me.
Except for the changes I made, the code is the same as what was posted by Bucky.
Please tell me what I need to change to make it work, or point me to a location where the working code is.
Once it works, I will then experiment with it to learn how to make it parse a different website.
Thanks
0 Frank Baldassarre · January 27, 2016
How would you code this to make it work?
0 sfolje 0 · January 25, 2016
sup!
the line:

BeautifulSoup = source_code.text

before:

    # To do something with this text, we need to call an object from BeautifulSoup.
    # Otherwise it is just a bunch of text that we cannot work on

    # BeautifulSoup objects can be sorted through easy
    soup = BeautifulSoup([plain_text], 'html.parser')
    for link in soup.findAll('a', {'class': 'index_singleListingTitles'}):

is useless, well, in fact painful: it defines a new string variable called "BeautifulSoup", which is unhealthy, especially when using the beautifulsoup library, because it shadows the BeautifulSoup class you imported. What happens in your case is that the next uncommented line calls the string BeautifulSoup as if it were a function, which it is not, because it is a string.
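To answer "how would you code this to make it work": delete the assignment to the name BeautifulSoup entirely, and pass the page text itself (a string, not a one-element list) to the class. A minimal sketch against a hard-coded snippet of HTML, using the 'user-name' class from the fixed code (the snippet contents are made up for illustration):

```python
from bs4 import BeautifulSoup

plain_text = '<a class="user-name" href="/profile.php?user=2">Bucky Roberts</a>'

# Call the class on the string itself -- not on [plain_text] -- and name the parser.
soup = BeautifulSoup(plain_text, 'html.parser')

for link in soup.findAll('a', {'class': 'user-name'}):
    print(link.get('href'))  # /profile.php?user=2
    print(link.string)       # Bucky Roberts
```

The same two changes (removing the rebinding, unwrapping the list) are exactly what the fixed trade_spider above does.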
