Problem with web crawling output

+1 Hanna C. · November 17, 2014
Hello! I am new to Python and this forum. I've been watching the Python tutorials on YouTube, which have been great! I've been trying to scrape a Russian news site for practice, but have been having problems with my output: nothing prints! My console ends with "Process finished with exit code 0" but doesn't give me any output. Here is my code: 

[code posted as a screenshot]

I appreciate the help!


Replies

0 Hanna C. · November 18, 2014
Thank you so much! It is working perfectly now. Your changes to the code make absolute sense. I appreciate the feedback. 
0 Vaggelis Theodoridis · November 17, 2014
import requests
from bs4 import BeautifulSoup

def PC_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'https://www.oprf.ru/984/?offset=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')

        ul_tag = soup.find('ul', {'class': 'list allList'})
        p_tags = ul_tag.find_all('p', {'class': 'name'})
        a_tags = [p.a for p in p_tags]
        # relative paths
        hrefs = [a['href'] for a in a_tags]
        # absolute URLs
        links = ['https://www.oprf.ru' + href for href in hrefs]

        titles = [a.text for a in a_tags]

        for title, link in zip(titles, links):
            print(title, link)

        page += 1


PC_spider(1)

My first post lost its indentation and looks ugly, so I'm reposting it.
Btw, hello everybody!

Hanna C., look carefully at the site: there is no 'a' tag with class="name", BUT there are 'p' tags with class="name".
I tried to simplify the code so you can follow it step by step.
Also, in your loop you request every link just to get its title? There's no point in doing that, because the title is already in the 'p' tags; again, look at the site's source code ;)
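As an aside, the same extraction can be done in one pass with a CSS selector instead of nested find/find_all calls. Here's a minimal, self-contained sketch; the inline HTML snippet is a made-up stand-in shaped like the site's markup, not the site's actual source:

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the page markup (assumed structure, not the real site)
html = """
<ul class="list allList">
  <li><p class="name"><a href="/984/newsitem/12345">First headline</a></p></li>
  <li><p class="name"><a href="/984/newsitem/67890">Second headline</a></p></li>
</ul>
"""

soup = BeautifulSoup(html, 'html.parser')

# 'ul.list.allList p.name a' matches every anchor inside a name paragraph
items = [('https://www.oprf.ru' + a['href'], a.text)
         for a in soup.select('ul.list.allList p.name a')]

for link, title in items:
    print(title, link)
```

The selector does the work of find + find_all + the list comprehension in one step, which keeps the scraping logic in a single place.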
