How to grab a link from multiple html tags?

+1 jhonathan macy · November 20, 2014
I am creating a web crawler and have run into an issue I can't figure out. Say I am trying to grab a link, but the link is nested like this:


<tag1><tag2><tag3><a href="example.com"></a></tag3></tag2></tag1>


How can I pull that link out from within all those tags?

Replies

+1 Vaggelis Theodoridis · November 20, 2014
Hi,
If you are using the BeautifulSoup module, you can do:

soup = BeautifulSoup(html_document, "html.parser")

link = soup.find("tag3").a['href']
+1 Vaggelis Theodoridis · November 20, 2014
I wasn't clear in my previous post, so here's some explanation.
My previous code returns the href of the FIRST tag3 it finds.

So to get the link you want, you have to specify something more. Does tag3, tag2, or tag1 have a class, an id, or any other attribute? For example:
<tag3 something="value">        </tag3>

If so, you can search like this:  soup.find('tag3', {'something':'value'})

Tags also have a name, so you can check the name of a tag's parent, for example:

soup.find('a').parent.name == 'tag3'
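
Here is a minimal sketch of both ideas together. The markup, the attribute something="value", and the example.com URLs are made up for illustration, so adjust them to whatever your real page contains.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two <tag3> blocks, only the second carries the attribute
html_document = """
<tag1><tag2>
  <tag3><a href="http://first.example.com">first</a></tag3>
  <tag3 something="value"><a href="http://second.example.com">second</a></tag3>
</tag2></tag1>
"""

soup = BeautifulSoup(html_document, "html.parser")

# find() on its own returns the first matching tag
first_link = soup.find("tag3").a["href"]

# Narrow the search with an attribute filter
filtered_link = soup.find("tag3", {"something": "value"}).a["href"]

# Or keep only the <a> tags whose direct parent is a <tag3>
links_in_tag3 = [a["href"] for a in soup.find_all("a", href=True)
                 if a.parent.name == "tag3"]

print(first_link)
print(filtered_link)
print(links_in_tag3)
```

The attribute filter is usually the more robust choice when the page gives you one, because it does not depend on the order of the tags.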
0 jhonathan macy · November 21, 2014
Vaggelis Theodoridis
Thank you for your reply. I got it working and can now grab the links inside the tags. Here is my code:



import requests
from bs4 import BeautifulSoup

string = ""

url = "http://www.example.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
# a['href'] is a single string, so loop over the <a> tags, not over its characters
for a in soup.find("tag2").find("tag3").find_all("a", href=True):
    string += a['href']
print(string)


But sometimes a website is so basic that the HTML looks like this:


<html>
<body>
<center>
<p>
<a href="https://www.youtube.com">
This is just a test link
</a>
</p>
<p>
<a href="https://www.google.com">
this is another test link
</a>
</p>
</center>
</body>
</html>

This is just an example file, but you can see there are two links inside the same kind of tags. Say I only want the second link; how would I do that?
0 Vaggelis Theodoridis · November 21, 2014
I see. Well, one solution that comes to mind is a string method from the standard library: str.startswith()
For example:
for a in soup.find_all("a", href=True):
    link = a['href']
    if link.startswith("https://www.youtube.com"):  # to exclude these links instead, use "if not link.startswith(...)"
        string += link

Or you can "play" with: str.endswith()
Or:
if "youtube" in link:
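
For the example page in the question, where only the second link is wanted, another option is find_all(), which returns the matches in document order so you can pick one by position. This is only a sketch: it assumes the HTML from the question (with the closing tags fixed) is pasted in as a string, and the variable names are my own.

```python
from bs4 import BeautifulSoup

# The example page from the question, with closing tags corrected
html_document = """
<html><body><center>
<p><a href="https://www.youtube.com">This is just a test link</a></p>
<p><a href="https://www.google.com">this is another test link</a></p>
</center></body></html>
"""

soup = BeautifulSoup(html_document, "html.parser")

# Every <a> that has an href, in document order
hrefs = [a["href"] for a in soup.find_all("a", href=True)]

# Pick by position (index 1 is the second link), guarding against short pages
second_link = hrefs[1] if len(hrefs) > 1 else None

# Or pick by content, using the string checks mentioned above
google_links = [h for h in hrefs if "google" in h]
not_youtube = [h for h in hrefs if not h.startswith("https://www.youtube.com")]

print(second_link)
```

Picking by position breaks if the page layout changes, so filtering on the link text is usually safer when you know what the link should look like.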