Need help with this web crawler

0 lester Li · June 15, 2015
I want to extract the main text from Chinese-language pages
like this one:
/images/forum/upload/2015-06-15/8f65f560103927979f8e21de56965a48.png

 
I wrote this:

import urllib2
from bs4 import BeautifulSoup


def get_single_item_data(item_url):
    source_code = urllib2.urlopen(item_url)
    soup = BeautifulSoup(source_code)
    for item in soup.find_all("td", {"class": "t_f"}, limit=1):
        print item.encode('gbk')


url = "http://bbs.wjdaily.com/bbs/thread-466430-1-1.html"

get_single_item_data(url)


But when I run this, I get something I don't want:

/images/forum/upload/2015-06-15/63d0e1e01c7f881019d8ee0a56bdf8a3.png

Can anyone help me with this problem? Thank you very much!

Replies

0 Halcyon Abraham Ramirez · June 17, 2015
Alright. High five! You used Selenium?
0 lester Li · June 17, 2015
Thanks, Ramirez, this also solved my problem!
0 Halcyon Abraham Ramirez · June 16, 2015
lol it can't output Chinese characters fml :/
0 Halcyon Abraham Ramirez · June 16, 2015
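from selenium import webdriver
from selenium.webdriver.common.by import By


class Test:
    def __init__(self):
        # Opens a Chrome window; Chrome must be installed.
        self.driver = webdriver.Chrome()
        self.driver.get("http://bbs.wjdaily.com/bbs/thread-466430-1-1.html")
        self.message()

    def message(self):
        # find_element(By.XPATH, ...) is used because find_element_by_xpath
        # was removed in newer Selenium 4 releases.
        text = self.driver.find_element(By.XPATH, "//td[@class='t_f']").text
        print(text)


test = Test()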

output:

????????    ???????    ?????    ??????????????    ?????????????   ??????????   ?   ?????



I don't know much about Beautiful Soup, but Selenium seems much easier to me.
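
For comparison, here is a rough Beautiful Soup version in Python 3. The screenshots in the question are not reproduced here, so this is a guess at the cause: printing the tag itself outputs its HTML markup, while `get_text()` returns only the text. The function names and the GBK default are assumptions taken from the question's code, not something confirmed about the site.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def extract_post_text(html_bytes, encoding="gbk"):
    """Return the plain text of the first <td class="t_f"> cell, or None."""
    # Decode explicitly; GBK is assumed because the question's code used it.
    html = html_bytes.decode(encoding, errors="replace")
    soup = BeautifulSoup(html, "html.parser")
    cell = soup.find("td", class_="t_f")
    if cell is None:
        return None
    # get_text() drops the tags; line breaks separate the text fragments.
    return cell.get_text(separator="\n", strip=True)


def fetch_post_text(url):
    """Download a thread page and extract the first post's text."""
    with urlopen(url) as response:
        return extract_post_text(response.read())
```

Calling `print(fetch_post_text(url))` with the thread URL from the question should then print the post body without the surrounding `<td>` markup, provided the page really is GBK-encoded and the console can display Chinese.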