Web crawler help

0 Tony Stark · June 26, 2015
/images/forum/upload/2015-06-26/c850eb33b1446603e9b4c5d57e62487e.png
I want to build a web crawler to extract the googledrive links shown above. However, if I do soup.findAll using 'a', there is no special class name that I can use to get the links. If I use soup.findAll('li', {'class' : 'gdf-docitem'}), I am able to get the text of the link using .string, but I don't know how to get the actual link. .get('href') does not work.

I'll post a picture of my code. The result is the text, but I want the googledrive links. The website I'm trying to crawl is http://www.physicsandmathstutor.com/a-level-maths-papers/c1-aqa/
/images/forum/upload/2015-06-26/c2faf9030197e2a5921ffe2211c8856b.png

Replies

+1 Halcyon Abraham Ramirez · June 27, 2015

import requests
from bs4 import BeautifulSoup

class Link:
    def __init__(self):
        url = "http://www.physicsandmathstutor.com/a-level-maths-papers/c1-aqa/"
        source = requests.get(url).text
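        self.soup = BeautifulSoup(source, "html.parser")

    def Links(self):
        # Keep only <a> tags whose href mentions "googledrive";
        # .get() avoids a KeyError on anchors that have no href.
        linkz = [i["href"] for i in self.soup.find_all("a") if "googledrive" in i.get("href", "")]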

        for i in linkz:
            print(i)

a = Link()
a.Links()

I don't know how you go down and extract the children using Beautiful Soup, but it's very easy to do with Selenium. Anyway,

we can extract all the <a> tags, and since you only want the ones from Google Drive, we can add a bit of logic to eliminate all the other <a> tags that don't contain the word "googledrive".

Output is:

https://googledrive.com/host/0B1ZiqBksUHNYdGRhbnBwZWlobFE/AQA%20Maths%20A-level%20Grade%20Boundaries.docx
https://googledrive.com/host/0B1ZiqBksUHNYdGRhbnBwZWlobFE/Combined%20MS%20-%20C1%20AQA.pdf
https://googledrive.com/host/0B1ZiqBksUHNYdGRhbnBwZWlobFE/Combined%20QP%20-%20C1%20AQA.pdf
https://googledrive.com/host/0B1ZiqBksUHNYdGRhbnBwZWlobFE/January%202006%20MS%20-%20C1%20AQA.PDF
https://googledrive.com/host/0B1ZiqBksUHNYdGRhbnBwZWlobFE/January%202006%20QP%20-%20C1%20AQA.PDF
https://googledrive.com/host/0B1ZiqBksUHNYdGRhbnBwZWlobFE/January%202007%20MS%20-%20C1%20AQA.PDF
https://googledrive.com/host/0B1ZiqBksUHNYdGRhbnBwZWlobFE/January%202007%20QP%20-%20C1%20AQA.PDF
https://googledrive.com/host/0B1ZiqBksUHNYdGRhbnBwZWlobFE/January%202008%20MS%20-%20C1%20AQA.PDF
https://googledrive.com/host/0B1ZiqBksUHNYdGRhbnBwZWlobFE/January%202008%20QP%20-%20C1%20AQA.PDF
https://googledrive.com/host/0B1ZiqBksUHNYdGRhbnBwZWlobFE/January%202009%20MS%20-%20C1%20AQA.PDF
https://googledrive.com/host/0B1ZiqBksUHNYdGRhbnBwZWlobFE/January%202009%20QP%20-%20C1%20AQA.PDF
https://googledrive.com/host/0B1ZiqBksUHNYdGRhbnBwZWlobFE/January%202010%20MS%20-%20C1%20AQA.PDF
https://googledrive.com/host/0B1ZiqBksUHNYdGRhbnBwZWlobFE/January%202010%20QP%20-%20C1%20AQA.PDF
https://googledrive.com/host/0B1ZiqBksUHNYdGRhbnBwZWlobFE/January%202011%20MS%20-%20C1%20AQA.PDF
https://googledrive.com/host/0B1ZiqBksUHNYdGRhbnBwZWlobFE/January%202011%20QP%20-%20C1%20AQA.pdf
https://googledrive.com/host/0B1ZiqBksUHNYdGRhbnBwZWlobFE/January%202012%20MS%20-%20C1%20AQA.PDF
https://googledrive.com/host/0B1ZiqBksUHNYdGRhbnBwZWlobFE/January%202012%20QP%20-%20C1%20AQA.pdf
https://googledrive.com/host/0B1ZiqBksUHNYdGRhbnBwZWlobFE/January%202013%20MS%20-%20C1%20AQA.PDF
https://googledrive.com/host/0B1ZiqBksUHNYdGRhbnBwZWlobFE/January%202013%20QP%20-%20C1%20AQA.pdf
https://googledrive.com/host/0B1ZiqBksUHNYdGRhbnBwZWlobFE/June%202006%20MS%20-%20C1%20AQA.PDF
https://googledrive.com/host/0B1ZiqBksUHNYdGRhbnBwZWlobFE/June%202006%20QP%20-%20C1%20AQA.PDF
https://googledrive.com/host/0B1ZiqBksUHNYdGRhbnBwZWlobFE/June%202007%20MS%20-%20C1%20AQA.PDF
https://googledrive.com/host/0B1ZiqBksUHNYdGRhbnBwZWlobFE/June%202007%20QP%20-%20C1%20AQA.PDF
https://googledrive.com/host/0B1ZiqBksUHNYdGRhbnBwZWlobFE/June%202008%20MS%20-%20C1%20AQA.PDF
https://googledrive.com/host/0B1ZiqBksUHNYdGRhbnBwZWlobFE/June%202008%20QP%20-%20C1%20AQA.PDF
https://googledrive.com/host/0B1ZiqBksUHNYdGRhbnBwZWlobFE/June%202009%20MS%20-%20C1%20AQA.PDF
https://googledrive.com/host/0B1ZiqBksUHNYdGRhbnBwZWlobFE/June%202010%20MS%20-%20C1%20AQA.PDF
https://googledrive.com/host/0B1ZiqBksUHNYdGRhbnBwZWlobFE/June%202010%20QP%20-%20C1%20AQA.PDF
https://googledrive.com/host/0B1ZiqBksUHNYdGRhbnBwZWlobFE/June%202011%20MS%20-%20C1%20AQA.PDF
https://googledrive.com/host/0B1ZiqBksUHNYdGRhbnBwZWlobFE/June%202011%20QP%20-%20C1%20AQA.pdf
https://googledrive.com/host/0B1ZiqBksUHNYdGRhbnBwZWlobFE/June%202012%20MS%20-%20C1%20AQA.PDF
https://googledrive.com/host/0B1ZiqBksUHNYdGRhbnBwZWlobFE/June%202012%20QP%20-%20C1%20AQA.pdf
https://googledrive.com/host/0B1ZiqBksUHNYdGRhbnBwZWlobFE/June%202013%20MS%20-%20C1%20AQA.PDF
https://googledrive.com/host/0B1ZiqBksUHNYdGRhbnBwZWlobFE/June%202013%20QP%20-%20C1%20AQA.pdf
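
On the original question of getting the link out of each li.gdf-docitem with Beautiful Soup itself: a likely reason .get('href') returns nothing is that the href attribute sits on an <a> tag inside the <li>, not on the <li>. Here is a minimal sketch of stepping down to that child. The markup in SAMPLE_HTML and the function name docitem_links are made up for illustration; the real page's structure is assumed, not checked.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: assumes each li.gdf-docitem wraps an <a> tag.
SAMPLE_HTML = """
<ul>
  <li class="gdf-docitem"><a href="https://googledrive.com/host/EXAMPLE/paper-1.pdf">Paper 1</a></li>
  <li class="gdf-docitem"><a href="https://googledrive.com/host/EXAMPLE/paper-2.pdf">Paper 2</a></li>
  <li class="gdf-docitem">No link here</li>
</ul>
"""


def docitem_links(html):
    """Return the href of the first <a> found inside each li.gdf-docitem."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for li in soup.find_all("li", {"class": "gdf-docitem"}):
        a = li.find("a")  # search the <li>'s descendants for an <a>
        if a is not None and a.get("href"):
            links.append(a["href"])
    return links


for link in docitem_links(SAMPLE_HTML):
    print(link)
```

Calling find on a tag searches only inside that tag, so this keeps the 'gdf-docitem' filter and still reaches the link. Items without an <a> are skipped instead of raising an error.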
0 Tony Stark · June 27, 2015
Thanks! But can you explain this line please? I don't understand it very well.

linkz = [i["href"] for i in self.soup.find_all("a") if "googledrive" in i["href"]]

for i in linkz:
    print(i)
+1 Halcyon Abraham Ramirez · June 27, 2015
That's a list comprehension.

It's basically

linkz = []
for i in self.soup.find_all("a"):
    if "googledrive" in i["href"]:
        linkz.append(i["href"])

but writing it as a list comprehension looks much cleaner.
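
To see that the two forms give the same result, here is a small sketch you can run without the website. The hrefs list is made-up sample data standing in for the values pulled from the page.

```python
# Hypothetical hrefs standing in for the values found on the page.
hrefs = [
    "https://googledrive.com/host/EXAMPLE/a.pdf",
    "/about/",
    "https://googledrive.com/host/EXAMPLE/b.pdf",
]

# Loop version: start with an empty list and append the matches one by one.
loop_result = []
for h in hrefs:
    if "googledrive" in h:
        loop_result.append(h)

# Comprehension version: same filter, written as a single expression.
comp_result = [h for h in hrefs if "googledrive" in h]

print(loop_result == comp_result)  # True
```

The part before "for" is what goes into the new list, and the "if" at the end decides which items are kept.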

0 Tony Stark · June 28, 2015
Oh thank you! That is more clear now, but yeah I agree list comprehension is clean (y)