How to access a string inside tags that have &nbsp; or another tag inside them?

0 baqir khan · July 2, 2015

import requests
from bs4 import BeautifulSoup


def clean(words):
cleanlist=[]
symbol="~`!@#$%^&*()_+=-/?.>,< ;:'\"[{}]|\}]"
for word in words:
for j in range(0, len(symbol)):
word=word.replace(symbol[j],"")
if(len(word)>0):
print(word)
cleanlist.append(word)


def start(url):
words=[]
code=requests.get(url).text
soup= BeautifulSoup(code)
for link in soup.findAll('a', {'data-analytics-class' : 'nextclicks'}):
sent=link.string
#print(sent)
#print(url)
#print(link.get('href'))
single=sent.lower().split()
for i in single:
words.append(i)
clean(words)


start('http://www.theverge.com/2014/11/17/7234729/steve-jobs-daughter-lisa-is-the-heroine-of-aaron-sorkins-film')

This is my code, and its output is:


C:\Python34\python.exe C:/Users/baqir/PycharmProjects/untitled/day3.py
michigan
why
problem
reboot
expensive
services
friends

Process finished with exit code 0

The link I'm on is:

http://www.theverge.com/2014/11/17/7234729/steve-jobs-daughter-lisa-is-the-heroine-of-aaron-sorkins-film

I do not understand how to get all the text from the section below named "More from verge". I am only getting the last word of each link, which in the HTML code is followed by &nbsp;.
This also happens when there is a <br> tag; I saw it on the Amazon website in the comments section.

Help me!

Replies

+1 Halcyon Abraham Ramirez · July 2, 2015
It's as simple as:


import requests
from bs4 import BeautifulSoup

url = "http://www.theverge.com/2014/11/17/7234729/steve-jobs-daughter-lisa-is-the-heroine-of-aaron-sorkins-film"
contents = requests.get(url).content
# Name the parser explicitly so bs4 does not have to pick one itself
soup = BeautifulSoup(contents, "html.parser")
# .text joins all the text inside each link, even when it holds nested tags
links = [i.text for i in soup.find_all("a", {"data-analytics-class": "nextclicks"})]

for i in links:
    print(i)


The output is:

         Stephen Colbert just interviewed Eminem on a public access cable show in Monroe, Michigan
        

          Rihanna debuts NSFW video as she's crowned the most successful singles artist in US history    
        

          Sprint CEO calls T-Mobile's Uncarrier movement 'bullshit'
        

          We’re so close to seeing Pluto it hurts
        

          Facebook is talking with music labels, but why?
        

          Apple Music has an iCloud problem
        

          Terminator Genisys review: How far can nostalgia carry an ill-advised reboot?
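
On the &nbsp; and <br> part of the question: `.string` only returns text when a tag has exactly one child, and gives None otherwise, while `.text` / `get_text()` joins every piece of text inside the tag. Here is a minimal sketch of the difference. The HTML is made up for illustration and is not copied from the Verge or Amazon pages, and the class names `plain` and `nested` are hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; not copied from any real page.
html = (
    '<a class="plain">Apple Music has an iCloud&nbsp;problem</a>'
    '<a class="nested">Great product<br>would buy again</a>'
)
soup = BeautifulSoup(html, "html.parser")

plain = soup.find("a", {"class": "plain"})
nested = soup.find("a", {"class": "nested"})

# One text child: .string works, and &nbsp; is decoded to "\xa0".
plain_string = plain.string
# Several children (text, <br>, text): .string is None.
nested_string = nested.string
# get_text() joins every piece of text; the separator stands in for <br>.
nested_text = nested.get_text(" ", strip=True)
# split() with no arguments treats "\xa0" as whitespace as well.
plain_words = plain.get_text().split()

print(repr(plain_string))
print(nested_string)  # None
print(nested_text)
print(plain_words)
```

So &nbsp; on its own should not hide any words, but a nested tag such as <br> makes `.string` return None, and calling `.lower()` on that fails. Using `.text` or `get_text()` avoids both cases.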
0 baqir khan · July 2, 2015
Thanks for answering :)
But what is wrong with my code?
0 Halcyon Abraham Ramirez · July 2, 2015
Could you edit your code with proper indentation, so it'll be easier to break down?
+1 Halcyon Abraham Ramirez · July 3, 2015
The things I see wrong in your code are:


single=sent.lower().split() 




def clean(words):
cleanlist=[]
symbol="~`!@#$%^&*()_+=-/?.>,< ;:'\"[{}]|\}]"
for word in words:
for j in range(0, len(symbol)):
word=word.replace(symbol[j],"")
if(len(word)>0):
print(word)
cleanlist.append(word)


and basically the entire clean function here. It doesn't really do anything useful, because if you only wanted to extract the text, you would need just one function: your start function.




Your "sent" variable already contains the complete text of each link that has the data-analytics-class attribute.

You then split it, chunking it into individual words, which defeats the purpose of extracting all the text.

You then pass the chunks to your clean function, which removes any character found in the symbol variable from each word. That only matters for words with punctuation attached, such as "why?" or "labels,"; most of the words you chunked do not contain any of those symbols.

As for why you only get the last word in each group of words: I have run into that before, and it usually has something to do with nested for loops, but I don't know exactly why.
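
One possible explanation, offered only as a guess because the posted code has no indentation: if the `if` block in clean sits outside the `for word in words` loop, and clean(words) is called inside the link loop, then each call prints only the final word collected so far. The sketch below reproduces that behaviour. It makes several assumptions: the headlines are hard-coded instead of fetched, `string.punctuation` is swapped in for the original symbol string, the functions return a list instead of printing, and the names `clean_buggy` and `clean_fixed` are made up.

```python
import string


def clean_buggy(words):
    """Assumed indentation: the `if` runs once, after the loop ends."""
    cleanlist = []
    for word in words:
        for ch in string.punctuation:
            word = word.replace(ch, "")
    # Dedented by one level, so only the last `word` ever reaches this point.
    if len(word) > 0:
        cleanlist.append(word)
    return cleanlist


def clean_fixed(words):
    """The `if` is inside the loop, so every cleaned word is kept."""
    cleanlist = []
    for word in words:
        for ch in string.punctuation:
            word = word.replace(ch, "")
        if len(word) > 0:
            cleanlist.append(word)
    return cleanlist


# Hard-coded stand-ins for two of the link texts shown in the output above.
headlines = [
    "Facebook is talking with music labels, but why?",
    "Apple Music has an iCloud problem",
]

words = []
buggy_output = []
for sent in headlines:
    words.extend(sent.lower().split())
    buggy_output.extend(clean_buggy(words))  # called once per link

print(buggy_output)  # ['why', 'problem']
print(clean_fixed(words))
```

If that matches the real indentation, moving the `if` block one level deeper and calling clean once after the link loop should give every word.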