Web crawler assignment to get the source code

+4 Shlesh Tiwari · January 6, 2016
Hi all, 

Bucky asked us to try something fun: crawl his page for the source code instead of copying it. So I tried doing that, but I'm only getting partial results. Kindly help me out here. Here is the code.

import requests
from bs4 import BeautifulSoup

def get_single_link_data(item_url):
    source_code = requests.get(item_url, allow_redirects=False)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")
    for sources in soup.findAll('span', {'class': 'pl-c1'}, {'class': 'pl-k'}):
        print(sources.string)


def web_crawler(url):
    source_code = requests.get(url, allow_redirects=False)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")
    for link in soup.findAll('a', {'class': 'js-directory-link js-navigation-open'}):
        href = 'https://github.com' + link.get('href')
        print(href)
        get_single_link_data(href)

web_crawler("https://github.com/buckyroberts/Source-Code-from-Tutorials/tree/master/Python")


It gives me only the numbers or the operators/characters; I'm unable to extract the full text. What am I doing wrong here?

Any help would be greatly appreciated. 

Thanks


Replies

+1 Sjoerd van den Belt · January 8, 2016
The problem is that some of the code on the GitHub pages sits outside of a span element. To grab the complete text from the code block, you need to take the table-data cell and grab all of the text from it, like so:

import requests
from bs4 import BeautifulSoup

def printable_text(text):
    # Replace any non-ASCII character with a space so print() can't fail
    return ''.join([i if ord(i) < 128 else ' ' for i in text])

def get_single_link_data(item_url):
    source_code = requests.get(item_url, allow_redirects=False)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")
    # Match the td cells that hold the code; note that findAll takes all
    # attribute filters in a single dict, not as separate positional arguments
    for sources in soup.findAll('td', {'class': 'blob-code'}):
        print(printable_text(sources.text))


def web_crawler(url):
    source_code = requests.get(url, allow_redirects=False)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")
    for link in soup.findAll('a', {'class': 'js-directory-link js-navigation-open'}):
        href = 'https://github.com' + link.get('href')
        print(href)
        get_single_link_data(href)

web_crawler("https://github.com/buckyroberts/Source-Code-from-Tutorials/tree/master/Python")


Besides this, I ran into a character that could not be printed and caused the program to fail, so I added the printable_text(text) function above to make sure it replaces any unprintable character with a space.
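A quick way to see what printable_text does: any character whose code point is 128 or above (i.e. outside plain ASCII) is swapped for a space, and everything else passes through unchanged. The sample strings below are just made-up inputs to illustrate the behavior:

```python
def printable_text(text):
    # Keep ASCII characters, replace everything else with a space
    return ''.join([i if ord(i) < 128 else ' ' for i in text])

print(printable_text("naïve café"))   # the accented letters become spaces
print(printable_text("plain ascii"))  # unchanged
```

Running this prints "na ve caf " and "plain ascii", since ï and é are the only non-ASCII characters in the inputs.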




0 Sjoerd van den Belt · January 10, 2016
Did it work for you?
0 Shlesh Tiwari · January 15, 2016
Sorry for the late reply. Yeah, I tried what you said and it worked. Thank you so much! I'm still confused about printable_text, though: I didn't understand the syntax or what exactly it's supposed to do. Can you elaborate on this?

Also, why do we have to use "html.parser"? I know I used it too, but that was because the error message prompted me to do so. 
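For context on the html.parser question: BeautifulSoup's second argument selects which parser backend to use, and bs4 warns when you leave it out (which is why the error message prompted it). Passing "html.parser" picks Python's built-in parser, so no third-party package like lxml is needed. A minimal sketch, using a made-up HTML snippet:

```python
from bs4 import BeautifulSoup

html = "<p>Hello, world</p>"

# Explicitly naming "html.parser" selects the standard-library parser
# and avoids the "No parser was explicitly specified" warning from bs4.
soup = BeautifulSoup(html, "html.parser")
print(soup.p.string)  # → Hello, world
```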

Thanks in advance. 
