Web crawler trouble

0 Shlesh Tiwari · November 9, 2015
I'm trying to use the web crawler program on this set of web pages (a forum). The URL is
https://www.thenewboston.com/forum/home.php?page=1

I just want to get the titles. 



import requests
from bs4 import BeautifulSoup

def forum_spider(max_page):
    page = 1
    while page <= max_page:
        url = 'https://www.thenewboston.com/forum/home.php?page=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        # To do something with this text, we need to wrap it in a BeautifulSoup object.
        # Otherwise it is just a bunch of text that we cannot work on.
        soup = BeautifulSoup(plain_text)
        for link in soup.findAll('a', {'class': 'post-title'}):
            title = link.string
            print(title)
        page += 1

forum_spider(2)

The output is:

C:\Python27\python.exe C:/Users/LENOVO/Documents/6.00.1xFile/webcrawler.py
Traceback (most recent call last):
  File "C:/Users/LENOVO/Documents/6.00.1xFile/webcrawler.py", line 18, in <module>
    forum_spider(2)
  File "C:/Users/LENOVO/Documents/6.00.1xFile/webcrawler.py", line 8, in forum_spider
    source_code = requests.get(url)
  File "C:\Python27\lib\site-packages\requests\api.py", line 69, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Python27\lib\site-packages\requests\api.py", line 50, in request
    response = session.request(method=method, url=url, **kwargs)
  File "C:\Python27\lib\site-packages\requests\sessions.py", line 468, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Python27\lib\site-packages\requests\sessions.py", line 597, in send
    history = [resp for resp in gen] if allow_redirects else []
  File "C:\Python27\lib\site-packages\requests\sessions.py", line 113, in resolve_redirects
    raise TooManyRedirects('Exceeded %s redirects.' % self.max_redirects)
requests.exceptions.TooManyRedirects: Exceeded 30 redirects.

Process finished with exit code 1

Can someone help with this?

I have tried similar code with just one web page to crawl and had no trouble with that. The URL used for that was
https://www.thenewboston.com/

Please help

Thanks!


Replies

+2 sfolje 0 · November 9, 2015
Type

requests.exceptions.TooManyRedirects: Exceeded 30 redirects

into Google. Let's give it a try.
+1 Paul D · November 10, 2015
source_code = requests.get(url, allow_redirects=False)
+1 sfolje 0 · November 10, 2015
url = 'https://www.thenewboston.com/forum/recent_activity.php?page=' + str(page)
and
source_code = requests.get(url, allow_redirects=False)
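
For reference, a minimal sketch of the spider with both suggested changes applied (the recent_activity.php URL and allow_redirects=False). Whether the 'post-title' class still matches the page markup is an assumption:

import requests
from bs4 import BeautifulSoup

def forum_spider(max_page):
    page = 1
    while page <= max_page:
        # Crawl the recent activity listing instead of home.php
        url = 'https://www.thenewboston.com/forum/recent_activity.php?page=' + str(page)
        # Do not follow redirects, so a redirect loop cannot raise TooManyRedirects
        source_code = requests.get(url, allow_redirects=False)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        for link in soup.findAll('a', {'class': 'post-title'}):
            print(link.string)
        page += 1

forum_spider(2)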
0 BERNARD NJUGUNA · January 28, 2016
Hey,

I am using PyCharm 5.0, and when I run it, it says there is no BeautifulSoup module, even though I have already installed it.

Help out, please.

Thank you.
0 sfolje 0 · January 30, 2016
@Bernard How did you install BeautifulSoup? With the pip command or with PyCharm? Can you see the BeautifulSoup module in Settings? Under File -> Settings -> Project -> Project Interpreter -> 3.5.0, all of your installed modules recognized by PyCharm should be listed. Is it there?
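
If it is missing from that list, installing it from a command prompt usually works (assuming pip is available for the interpreter PyCharm is using):

pip install beautifulsoup4

and then import it in the script as:

from bs4 import BeautifulSoup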
0 Shlesh Tiwari · January 6, 2016
Thank you so much people. This helped!
