[source code] Python Programming Tutorial - 25 - How to Make a Web Crawler

+38 Bucky Roberts · September 3, 2014

import requests
from bs4 import BeautifulSoup


def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = "https://buckysroom.org/trade/search.php?page=" + str(page)
        source_code = requests.get(url)
        # just get the code, no headers or anything
        plain_text = source_code.text
        # BeautifulSoup objects can be searched through easily
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.findAll('a', {'class': 'item-name'}):
            href = "https://buckysroom.org" + link.get('href')
            title = link.string  # just the text, not the HTML
            print(href)
            print(title)
            # get_single_item_data(href)
        page += 1


def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")
    # if you want to gather information from that page
    for item_name in soup.findAll('div', {'class': 'i-name'}):
        print(item_name.string)
    # if you want to gather links for a web crawler
    for link in soup.findAll('a'):
        href = "https://buckysroom.org" + link.get('href')
        print(href)


trade_spider(1)

Replies

0 Tanner Hoke · January 29, 2015
import requests
from bs4 import BeautifulSoup

source_code = requests.get('https://www.thenewboston.com/forum/topic.php?id=1610')
plain_text = source_code.text
soup = BeautifulSoup(plain_text)
print(str(soup.find('code')).replace('<code>', '').replace('<br>', '\n').replace('?', '\t').replace('</code>', '').replace('</br>', ''))
0 Ola Berglund · January 4, 2015
Yeah I had totally forgotten about this, thanks for reminding me that they are in his videos, great help!!
+1 Chris Nelson · January 4, 2015
<br> is a line break in old HTML.
I believe HTML 5 replaces that with.. <br />

I could be wrong, it has been so long since writing any HTML.

I haven't gone through the web crawler tutorial yet, but if those show up in Bucky's Python videos, I believe it's because you end up handling certain HTML tags in Python.

But please someone with more experience, feel free to correct me!
0 Ola Berglund · January 4, 2015
I tried to solve the problem Bucky gave us, but I couldn't figure it out, so I went to watch these spoilers. Now I'm completely lost. What is this "('<br>',)"? I really need to understand, since a web crawler is basic stuff!
0 Arthur lee · December 11, 2014
Hi bro, I have a question: how do I crawl the pictures down from a page? Which function should I use?
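One way to do that (a minimal sketch, not from the tutorial; it assumes the pictures are ordinary <img> tags with absolute src URLs) is to collect the <img> tags with BeautifulSoup and write each response body to disk with requests:

import os
import requests
from bs4 import BeautifulSoup

def download_images(page_url, out_dir='images'):
    # hypothetical helper, not part of Bucky's code
    os.makedirs(out_dir, exist_ok=True)
    soup = BeautifulSoup(requests.get(page_url).text, 'html.parser')
    for img in soup.find_all('img'):
        src = img.get('src')
        if not src or not src.startswith('http'):
            continue  # assumes absolute URLs; use urllib.parse.urljoin for relative ones
        image_bytes = requests.get(src).content  # .content gives the raw bytes
        file_name = os.path.basename(src).split('?')[0] or 'image'
        with open(os.path.join(out_dir, file_name), 'wb') as f:
            f.write(image_bytes)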
+1 jadava Umesh · December 10, 2014
I am sorry to say, Bucky, but I am getting an error like this. Please explain what I should do.

Traceback (most recent call last):
  File "try.py", line 36, in <module>
    trade_spider(1)
  File "try.py", line 9, in trade_spider
    source_code = requests.get(url)
  File "/usr/lib/python2.7/dist-packages/requests/api.py", line 55, in get
    return request('get', url, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/api.py", line 44, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 455, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 578, in send
    history = [resp for resp in gen] if allow_redirects else []
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 178, in resolve_redirects
    allow_redirects=False,
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 558, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/adapters.py", line 385, in send
    raise SSLError(e)
requests.exceptions.SSLError: hostname 'www.thenewboston.com' doesn't match either of 'www.buckysroom.org', 'buckysroom.org'
+2 DG Wright · December 8, 2014
"I'm getting an error when I try to import requests. Am I missing a package?"

Apparently so; from your Project Interpreter in Settings, add the package 'requests' (it's described as "Python HTTP for Humans"). Then the code should work 'as advertised' ;)
0 Alexander Mentyu · November 4, 2014
import requests
from bs4 import BeautifulSoup

def code_crawl(id):
    url = 'https://buckysroom.org/forum/topic.php?id=' + str(id)
    raw_html = requests.get(url)
    plain_text = raw_html.text
    soup = BeautifulSoup(plain_text)

    for code_line in soup.find('code'):
        result_line = str(code_line).replace('<br>', '\n').replace('</br>', '').replace('\ufffd', ' ')
        print(result_line)

code_crawl(1610)
+3 Russell Allen · October 11, 2014
I'm getting an error when I try to import requests.  Am I missing a package?  
+2 Nate Penner · October 4, 2014
Bucky! For some reason I cannot get anything from the trades page! Every page in the trade section says 'No data available', so I have to skip the Web Crawler tutorial :(
