How to extract the text of a tag by searching for a tag nested inside it?

winedragon

Say I want to extract the "24 min. per ep." value under Duration or the "N13" value under Rating. This is only part of the page, and some of the span tags have a class other than dark_text. I can find the span that holds a label such as 'Rating' or 'Duration', but I can't extract the value itself, because N13 sits in the enclosing div tag, not in the span. Since I'm searching by label, I have to search for the span tag. Beautiful Soup doesn't let you chain findAll('div').findAll('span', {'class':'...'}), so I can't get back to the div tag once I've found the span I'm looking for.

When I loop over the results, it prints a lot of extra None values, among other things. Does anyone have tips on how to parse this cleanly?

The question is really how to locate something in a <span> tag nested under a div tag and, once it is found, extract the entire div tag, or preferably only the content that is in the div tag but not in the span tag. This has turned out to be more complicated than I anticipated.

from bs4 import BeautifulSoup
x = '''<div>
<a href="javascript:void(0);" onclick="$('#score143583').toggle()">Overall Rating</a>:
    2
  </div>
  <div class="spaceit">
  <span class="dark_text">Duration:</span>
    24 min. per ep.
    </div>
  <div>
  <span class="dark_text">Rating:</span>
    N13
    </div>'''


bs = BeautifulSoup(x, 'html.parser')
python html parsing beautifulsoup tags

Answers

answered 3 months ago Keyur Potdar #1

You can use the .next_sibling attribute to get the text located immediately after the span tag. To get the span tag itself, use find('span', class_='dark_text', text='Duration:').

Creating a simple function, you can use this:

def get_next_text(soup, text):
    return soup.find('span', class_='dark_text', text=text).next_sibling

# x is the HTML string defined in the question; the 'lxml' parser requires the lxml package
soup = BeautifulSoup(x, 'lxml')
duration = get_next_text(soup, 'Duration:')
print('Duration:', duration.strip())
rating = get_next_text(soup, 'Rating:')
print('Rating:', rating.strip())

Output:

Duration: 24 min. per ep.
Rating: N13
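If a label is missing from the page, find() returns None and the .next_sibling lookup raises an AttributeError. The following is a minimal sketch of a more defensive variant, not part of the original answer. The helper name get_next_text_or_default and the 'Studio:' label are hypothetical, the built-in 'html.parser' is used so that lxml is not required, and string= is the current name of the argument passed as text= above.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
html = '''<div>
<a href="javascript:void(0);" onclick="$('#score143583').toggle()">Overall Rating</a>:
    2
  </div>
  <div class="spaceit">
  <span class="dark_text">Duration:</span>
    24 min. per ep.
    </div>
  <div>
  <span class="dark_text">Rating:</span>
    N13
    </div>'''


def get_next_text_or_default(soup, label, default=None):
    """Hypothetical helper: return the text after the labelled span, or a default."""
    span = soup.find('span', class_='dark_text', string=label)
    # find() returns None when no span matches; next_sibling is None
    # when the span is the last node inside its parent.
    if span is None or span.next_sibling is None:
        return default
    # str() also copes with the sibling being a tag rather than plain text.
    return str(span.next_sibling).strip()


soup = BeautifulSoup(html, 'html.parser')
duration = get_next_text_or_default(soup, 'Duration:')
rating = get_next_text_or_default(soup, 'Rating:')
# 'Studio:' does not appear in the sample, so the default is returned.
missing = get_next_text_or_default(soup, 'Studio:', default='n/a')
print(duration, rating, missing, sep=' | ')
```

Returning a default instead of raising keeps a loop over many labels going when one of them is absent.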

If you want to get the whole div tag that contains the text you want, you can use .parent.

def get_parent(soup, text):
    return soup.find('span', class_='dark_text', text=text).parent

# x is the HTML string defined in the question; the 'lxml' parser requires the lxml package
soup = BeautifulSoup(x, 'lxml')
duration = get_parent(soup, 'Duration:')
print(duration)
rating = get_parent(soup, 'Rating:')
print(rating)

Output:

<div class="spaceit">
<span class="dark_text">Duration:</span>
    24 min. per ep.
</div>
<div>
<span class="dark_text">Rating:</span>
    N13
</div>
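To get only what is in the div tag but not in the span tag, and to avoid the None values that appear when looping over every div, one option is to start from the labelled spans and read the parent's own text nodes. This is a sketch under assumptions rather than part of the original answer: the function name collect_fields is hypothetical, and it assumes every label of interest is a span with class dark_text.

```python
from bs4 import BeautifulSoup, NavigableString

# Sample markup copied from the question.
html = '''<div>
<a href="javascript:void(0);" onclick="$('#score143583').toggle()">Overall Rating</a>:
    2
  </div>
  <div class="spaceit">
  <span class="dark_text">Duration:</span>
    24 min. per ep.
    </div>
  <div>
  <span class="dark_text">Rating:</span>
    N13
    </div>'''


def collect_fields(soup):
    """Hypothetical helper: map each dark_text label to its parent's own text."""
    fields = {}
    # Starting from the spans means divs without a label are never visited,
    # so there is nothing that could yield None.
    for span in soup.find_all('span', class_='dark_text'):
        label = span.get_text(strip=True).rstrip(':')
        # Keep only strings that are direct children of the parent div;
        # the span is a Tag, so its label text is skipped.
        own_text = ''.join(
            child for child in span.parent.children
            if isinstance(child, NavigableString)
        )
        fields[label] = own_text.strip()
    return fields


soup = BeautifulSoup(html, 'html.parser')
fields = collect_fields(soup)
print(fields)
```

The first div in the sample has an a tag rather than a dark_text span, so it is left out of the result.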
