How to grab a link from multiple HTML tags?

+1 jhonathan macy · November 20, 2014
I am creating a web crawler and I have come across an issue I can't figure out. Say I am trying to grab a link, but the link is nested like this:

<tag1><tag2><tag3><a href=""></tag3></tag2></tag1>

How can I pull that link out from within all those tags?

+1 Vaggelis Theodoridis · November 20, 2014
If you use the BeautifulSoup module, you can do:

soup = BeautifulSoup(html_document, "html.parser")

link = soup.find("tag3").a['href']
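
To make the idea concrete, here is a minimal sketch of the same lookup run against a small sample document. The sample HTML and the example.com URL are placeholders made up for illustration, and "html.parser" is just one parser choice.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document; the URL is a placeholder, not from the thread.
html_document = (
    '<tag1><tag2><tag3>'
    '<a href="http://example.com/page">a link</a>'
    '</tag3></tag2></tag1>'
)

soup = BeautifulSoup(html_document, "html.parser")

# find() searches all descendants, so the outer tags do not need to be named.
tag3 = soup.find("tag3")

# Guard against a missing tag3 or a tag3 without an <a> child.
link = tag3.a["href"] if tag3 is not None and tag3.a is not None else None
print(link)
```

The guard matters in a crawler, because find() returns None when nothing matches and indexing None raises an error.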
+1 Vaggelis Theodoridis · November 20, 2014
I wasn't clear in my previous post, so here is some explanation.
My previous code returns only the FIRST tag3.a['href'] it finds.

So to get the one you want, you have to specify something more. Doesn't tag3, tag2, or tag1 have a class, an id, or some other attribute?
<tag3 something="value">        </tag3>

If so, you can search like this:  soup.find('tag3', {'something': 'value'})

Tags also have a name, and every tag knows its parent. You can check the parent's name, for example:

soup.find('a').parent.name == 'tag3'
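
A rough sketch of both ideas together, assuming a made-up document in which only one tag3 carries the attribute. The attribute name "something", its value, and the URLs are placeholders for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: two tag3 elements, only one with the attribute.
html_document = """
<tag3><a href="http://example.com/first">first</a></tag3>
<tag3 something="value"><a href="http://example.com/wanted">wanted</a></tag3>
<div><a href="http://example.com/other">other</a></div>
"""

soup = BeautifulSoup(html_document, "html.parser")

# 1) Narrow the search with an attribute filter.
wanted = soup.find("tag3", {"something": "value"})
wanted_link = wanted.a["href"] if wanted is not None else None

# 2) Keep only the links whose direct parent is a tag3.
links_in_tag3 = [
    a["href"]
    for a in soup.find_all("a", href=True)  # href=True skips <a> tags without href
    if a.parent.name == "tag3"
]

print(wanted_link)
print(links_in_tag3)
```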
0 jhonathan macy · November 21, 2014
Vaggelis Theodoridis
Thank you for your reply. I got it working to grab the links inside the tags. Here is my code:

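import requests
from bs4 import BeautifulSoup

url = ""  # page to crawl (left blank in the original post)
scode = requests.get(url)
text = scode.text
soup = BeautifulSoup(text, "html.parser")
# a['href'] is a single string, so looping over it would go character by character
link = soup.find("tag2").find("tag3").a['href']
print(link)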

But sometimes a website is so basic that the HTML looks like this:

<a href="">
This is just a test link
<a href="">
this is another test link

This is just an example file, but you can see there are two links in the same type of tag. Let's say I only want the second link. How would I do this?
0 Vaggelis Theodoridis · November 21, 2014
I see. One solution that comes to mind is a method from the standard library: str.startswith()
For example:
for a in soup.find_all("a", href=True):
    link = a['href']
    if link.startswith(""):  # or, to exclude links, skip those that start with...
        print(link)

Or you can "play" with str.endswith(), or test for a substring:
if "youtube" in link:
    print(link)
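
As a sketch of how this could look end to end: collect every href with find_all(), then either pick the second one by position or filter by the string tests above. Picking by index is an assumption about what "the second link" means here, and the sample HTML and URLs are placeholders made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page with several plain links.
html_document = """
<a href="http://example.com/one">This is just a test link</a>
<a href="https://www.youtube.com/watch">this is another test link</a>
<a href="http://example.com/file.pdf">a third link</a>
"""

soup = BeautifulSoup(html_document, "html.parser")

# find_all() returns the matches in document order.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]

# Pick the second link by position, guarding against pages with fewer links.
second_link = hrefs[1] if len(hrefs) > 1 else None

# Or select links by their text instead of their position.
youtube_links = [h for h in hrefs if "youtube" in h]
pdf_links = [h for h in hrefs if h.endswith(".pdf")]
http_only = [h for h in hrefs if h.startswith("http://")]

print(second_link)
print(youtube_links, pdf_links, http_only)
```

Selecting by position breaks as soon as the page adds or removes a link, so filtering on the URL text is usually the sturdier choice when the links have something recognizable in them.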