extract text in bs4 element

gert

I am trying to extract some data from the following bs4 element (example below). Specifically, I want to build a loop that extracts all company names from it (and maybe also the locations):

    [<div class="views-field views-field-field-overigeonderdelen"> <span class="views-label views-label-field-overigeonderdelen">Nevenvestiging: </span> <div class="field-content"><div class="wrapper hidden">
 <p>Hak Industrial Services B.V., Hoogeveen<br/>Nederland<br/> blabla useless data<br/></p><hr/>
 Hak Industrial Services B.V., Nieuw Heeten<br/>Nederland<br/>blabla useless data<br/><hr/>
 Hak Industrial Services Middle East LLC, Abu Dhabi<br/>Verenigde Arabische Emiraten<br/>blabla useless data<br/><hr/>
 Hak Industrial Services SEA Sdn. Bhd., Petaling Jaya, Selangor<br/>Maleisië<br/>blabla useless data<br/><hr/>
 Hak Industrial Services USLLC, Houston<br/>Verenigde Staten van Amerika<br/>blabla useless data<br/><hr/>
 </div>
 <a class="toggle" href="#">Toon nevenvestigingen</a></div> </div>]

The names are the "Hak Industrial ..." strings.

Output: two lists like

[Hak Industrial Services B.V., Hak Industrial Services B.V., Hak Industrial Services Middle East LLC, Hak Industrial Services SEA Sdn. Bhd., Hak Industrial Services USLLC]

and

[Nederland, Nederland, Verenigde Arabische Emiraten, Maleisië, Verenigde Staten van Amerika]

Would anyone know how to do this in bs4?

thanks in advance,

python beautifulsoup

Answers

answered 3 months ago smbarz #1

I recently had to do something similar: I built a function to parse the HTML from emails. It goes something like this:

from bs4 import BeautifulSoup as bs

def parser(data):
    # Parse the HTML and print every non-empty text fragment as a list.
    parsed = bs(data, "lxml")
    # stripped_strings already drops whitespace-only strings and strips the rest.
    strings = list(parsed.stripped_strings)
    print(strings)

Passing in the HTML gives output like this:

[u'[', u'Nevenvestiging:', u'Hak Industrial Services B.V., Hoogeveen', u'Nederland', u'blabla useless data', u'Hak Industrial Services B.V., Nieuw Heeten', u'Nederland', u'blabla useless data', u'Hak Industrial Services Middle East LLC, Abu Dhabi', u'Verenigde Arabische Emiraten', u'blabla useless data', u'Hak Industrial Services SEA Sdn. Bhd., Petaling Jaya, Selangor', u'Maleisi\xeb', u'blabla useless data', u'Hak Industrial Services USLLC, Houston', u'Verenigde Staten van Amerika', u'blabla useless data', u'Toon nevenvestigingen', u']']

You could probably refactor this a little to get closer to what you're looking for, but I hope it points you in the right direction.
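One possible way to do that refactoring is sketched below. It is only a sketch under two assumptions that happen to hold for the sample in the question: every record inside the wrapper div yields exactly three strings ("name, city", country, other data), and the company name itself contains no comma. The function name `names_and_countries` and the shortened sample HTML are made up for illustration.

```python
from bs4 import BeautifulSoup

# Shortened sample, taken from the HTML in the question.
html = """<div class="wrapper hidden">
<p>Hak Industrial Services B.V., Hoogeveen<br/>Nederland<br/> blabla useless data<br/></p><hr/>
Hak Industrial Services Middle East LLC, Abu Dhabi<br/>Verenigde Arabische Emiraten<br/>blabla useless data<br/><hr/>
</div>"""


def names_and_countries(html):
    # html.parser ships with Python, so lxml is not required here.
    soup = BeautifulSoup(html, "html.parser")
    # Restrict the search to the wrapper div so the label and the
    # toggle link of the full page are not picked up.
    wrapper = soup.find("div", class_="wrapper")
    strings = list(wrapper.stripped_strings)
    # Assumption: three strings per record, in this order:
    # "name, city", country, other data.
    names = [s.split(",")[0].strip() for s in strings[0::3]]
    countries = strings[1::3]
    return names, countries


names, countries = names_and_countries(html)
print(names)      # ['Hak Industrial Services B.V.', 'Hak Industrial Services Middle East LLC']
print(countries)  # ['Nederland', 'Verenigde Arabische Emiraten']
```

If a record ever has more or fewer than three strings, the slicing goes out of step, so this only fits data as regular as the sample.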

answered 3 months ago xralf #2

What format does the output need to have? I had a go at parsing it:

# coding: utf-8
from __future__ import unicode_literals
from bs4 import BeautifulSoup
from bs4 import NavigableString, Tag

html = """<div class="views-field views-field-field-overigeonderdelen"> <span class="views-label views-label-field-overigeonderdelen">Nevenvestiging: </span> <div class="field-content"><div class="wrapper hidden">
 <p>Hak Industrial Services B.V., Hoogeveen<br/>Nederland<br/> blabla useless data<br/></p><hr/>
  Hak Industrial Services B.V., Nieuw Heeten<br/>Nederland<br/>blabla useless data<br/><hr/>
   Hak Industrial Services Middle East LLC, Abu Dhabi<br/>Verenigde Arabische Emiraten<br/>blabla useless data<br/><hr/>
    Hak Industrial Services SEA Sdn. Bhd., Petaling Jaya, Selangor<br/>Maleisië<br/>blabla useless data<br/><hr/>
     Hak Industrial Services USLLC, Houston<br/>Verenigde Staten van Amerika<br/>blabla useless data<br/><hr/>
      </div>
       <a class="toggle" href="#">Toon nevenvestigingen</a></div> </div>"""

if __name__ == "__main__":
    soup = BeautifulSoup(html, "lxml")
    # Every record ends with an <hr/>, so walk backwards from each <hr/>
    # and collect the nodes that belong to it.
    for child in soup.find("div", class_="wrapper hidden").contents:
        if isinstance(child, Tag) and child.name == "hr":
            siblings = []
            previous = child.previous_sibling
            # Stop at the previous <hr/> or at the start of the wrapper.
            while previous is not None and not (isinstance(previous, Tag) and previous.name == "hr"):
                siblings.append(previous)
                previous = previous.previous_sibling
            # The nodes were collected back to front, so reverse them.
            print(siblings[::-1])
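To get from these per-record groups to the two lists asked for in the question, something like the following might work. It is a sketch, not a tested solution for the real page: the helper name `split_records` is hypothetical, and it assumes the first string of each record is "name, city" and the second is the country. Unlike fixed-size slicing, it does not depend on every record having the same number of strings.

```python
from bs4 import BeautifulSoup, Tag

# Shortened sample, taken from the HTML in the question.
html = """<div class="wrapper hidden">
<p>Hak Industrial Services B.V., Hoogeveen<br/>Nederland<br/> blabla useless data<br/></p><hr/>
Hak Industrial Services Middle East LLC, Abu Dhabi<br/>Verenigde Arabische Emiraten<br/>blabla useless data<br/><hr/>
</div>"""


def split_records(wrapper):
    """Return one list of text fragments per <hr/>-terminated record."""
    records, current = [], []
    for node in wrapper.children:
        if isinstance(node, Tag) and node.name == "hr":
            # An <hr/> closes the current record.
            if current:
                records.append(current)
            current = []
        elif isinstance(node, Tag):
            # Tags such as <p> may hold several strings; <br/> holds none.
            current.extend(node.stripped_strings)
        else:
            # Bare text node directly inside the wrapper.
            text = node.strip()
            if text:
                current.append(text)
    return records


soup = BeautifulSoup(html, "html.parser")
records = split_records(soup.find("div", class_="wrapper"))

# Assumption: record[0] is "name, city" and record[1] is the country.
company_names = [record[0].split(",")[0].strip() for record in records]
company_countries = [record[1] for record in records]
print(company_names)
print(company_countries)
```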
