Python Programming Tutorial - 25 - How to Make a Web Crawler (source code)

+39 Bucky Roberts · September 3, 2014

import requests
from bs4 import BeautifulSoup


def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = "https://buckysroom.org/trade/search.php?page=" + str(page)
        source_code = requests.get(url)
        # just the response body, no headers or anything
        plain_text = source_code.text
        # BeautifulSoup objects can be searched easily
        soup = BeautifulSoup(plain_text, 'html.parser')
        for link in soup.findAll('a', {'class': 'item-name'}):
            href = "https://buckysroom.org" + link.get('href')
            title = link.string  # just the text, not the HTML
            print(href)
            print(title)
            # get_single_item_data(href)
        page += 1


def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    # if you want to gather information from that page
    for item_name in soup.findAll('div', {'class': 'i-name'}):
        print(item_name.string)
    # if you want to gather links for a web crawler
    for link in soup.findAll('a'):
        href = "https://buckysroom.org" + link.get('href')
        print(href)


trade_spider(1)
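Newer versions of bs4 want the parser named explicitly (`BeautifulSoup(plain_text, 'html.parser')`), otherwise they emit a "no parser was explicitly specified" warning. Since buckysroom.org is no longer online, here is the same `findAll` pattern as a sketch against a made-up HTML snippet, so it runs without any network access:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for one page of the trade listing
# (the real buckysroom.org pages are no longer available).
html = """
<a class="item-name" href="/trade/item.php?id=1">Blue Widget</a>
<a class="item-name" href="/trade/item.php?id=2">Red Widget</a>
<a class="nav-link" href="/about">About</a>
"""

soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
results = []
for link in soup.findAll('a', {'class': 'item-name'}):
    href = "https://buckysroom.org" + link.get('href')
    title = link.string  # the link's visible text
    results.append((href, title))
    print(href, title)
```

Only the two `item-name` links match the class filter; the navigation link is skipped, which is the whole point of filtering on the class attribute.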


Replies

0 Poz Pozeidon · April 23, 2017
I want to add functionality to this web crawler: when crawling a website like nytimes.com, I want to collect all the text on the site and store it together with its URLs. What do you recommend?
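One way to approach this (a sketch, not tested against nytimes.com, which may block plain `requests` and paginates differently) is to strip the markup with bs4's `get_text()` and store the text keyed by its URL. The function and sample page below are hypothetical:

```python
from bs4 import BeautifulSoup

def page_text(url, html):
    """Return {url: visible text} for one fetched page."""
    # In a real crawler, html would come from requests.get(url).text;
    # it is passed in here so the sketch runs without network access.
    soup = BeautifulSoup(html, 'html.parser')
    return {url: soup.get_text(separator=' ', strip=True)}

# Hypothetical page standing in for a crawled article.
sample = "<html><body><h1>Headline</h1><p>Story text.</p></body></html>"
pages = page_text("https://example.com/story", sample)
print(pages)
```

Calling this once per crawled link and merging the resulting dicts gives the text-plus-URL mapping described above.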
0 Mark Drew · April 23, 2017
Hello all

I have tried to amend Bucky's code to work with a different website, given the one in the tutorial is no longer available. Since I am actually trying to follow the principle rather than just typing the code verbatim, it is probably a good thing.

Anyway, here is the code I have come up with to search a local buy and sell site.


import requests
from bs4 import BeautifulSoup

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'https://www.gumtree.com.au/s-appliances/page-' + str(page) + '/c20088'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        for link in soup.findAll('a', {'class': 'ad-listing__title-link'}):
            href = 'https://www.gumtree.com.au' + link.get('href')
            title = link.string
            print(href)
            print(title)
        page += 1

trade_spider(1)




I know it is because I am not understanding the HTML properly, but I am not seeing what I expect to see.


  • Rather than seeing all the ads, I only recognise the last 8 (out of 30) hrefs.

  • And for the title, I get None for each item.



I was hoping someone could have a look and help me understand why it is behaving like this.

Thanks
+1 Mark Drew · April 23, 2017
Hi All

Earlier today I posted a question (it hasn't yet been approved by moderators). The question I posed was why the results did not match what I saw at the URL. The answer is that the website I used was changing too frequently; I used a more static list and got the result I was after.
I also wondered why my code was printing None for all the titles. I have since found that was because the a tag did not contain the title text directly. I have been able to achieve the same result as Bucky with the following script:


import requests
from bs4 import BeautifulSoup

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'https://www.gumtree.com.au/s-property-for-rent/hobart-cbd-hobart/page-' + str(page) + '/c18364l3000302'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        titles = soup.findAll('span', {'itemprop': 'name'})
        links = soup.findAll('a', {'class': 'ad-listing__title-link'})
        # pair each title with its own link, rather than re-scanning
        # all the links for every title
        for title_tag, link in zip(titles, links):
            href = 'https://www.gumtree.com.au' + link.get('href')
            print(title_tag.string)
            print(href)
        page += 1

trade_spider(2)
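Mark's diagnosis matches how bs4 behaves: `tag.string` returns None whenever the tag has more than one child, while `get_text()` returns the concatenated text regardless of nesting. A small offline illustration (the listing markup below is made up):

```python
from bs4 import BeautifulSoup

# Made-up listing markup: the title lives in a <span> inside the <a>.
html = ('<a class="ad-listing__title-link" href="/ad/1">'
        '<span itemprop="name">Nice flat</span> $300/wk</a>')
soup = BeautifulSoup(html, 'html.parser')
link = soup.find('a', {'class': 'ad-listing__title-link'})

print(link.string)                     # None: the <a> has two children
print(link.get_text(' ', strip=True))  # the full text, nested or not
print(link.find('span').string)        # drilling into the single span works
```

So either drill down to the tag that holds only text (as Mark's script does with the `span`), or use `get_text()` on the outer tag.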
