Python - BS4 - extracting a subtable from a wikipedia table using only table header + save as dictionary

yozhix

I am trying to define a function which extracts all rows of the 'Basisdaten' table on the page https://de.wikipedia.org/wiki/Stuttgart and returns a dictionary whose keys and values correspond to the first and second cells of each row of the table.

The 'Basisdaten' table is part of a much larger table, as the result of the following code shows:

from bs4 import BeautifulSoup
import requests
import re

r = requests.get("https://de.wikipedia.org/wiki/Stuttgart")
soup = BeautifulSoup(r.text, "html.parser")
soup.find('th', text=re.compile('Basisdaten')).find_parent('table')

Unfortunately, there is no unique ID I can use to select only the rows that make up the 'Basisdaten' table. These are the rows I hope to extract, shown in HTML form:

<tr>
<th colspan="2">Basisdaten
</th></tr>
<tr class="hintergrundfarbe2">
<td><a href="/wiki/Land_(Deutschland)" title="Land (Deutschland)">Bundesland</a>:</td>
<td><a href="/wiki/Baden-W%C3%BCrttemberg" title="Baden-Württemberg">Baden-Württemberg</a>
</td></tr>
<tr class="hintergrundfarbe2">
<td><a href="/wiki/Regierungsbezirk" title="Regierungsbezirk">Regierungsbezirk</a>:
</td>
<td><a href="/wiki/Regierungsbezirk_Stuttgart" title="Regierungsbezirk Stuttgart">Stuttgart</a>
</td></tr>
<tr class="hintergrundfarbe2">
<td><a href="/wiki/H%C3%B6he_%C3%BCber_dem_Meeresspiegel" title="Höhe über dem Meeresspiegel">Höhe</a>:
</td>
<td>247 m ü. <a href="/wiki/Normalh%C3%B6hennull" title="Normalhöhennull">NHN</a>
</td></tr>
<tr class="hintergrundfarbe2">
<td><a href="/wiki/Katasterfl%C3%A4che" title="Katasterfläche">Fläche</a>:
</td>
<td>207,35 km<sup>2</sup>
</td></tr>
<tr class="hintergrundfarbe2">
<td>Einwohner:
</td>
<td style="line-height: 1.2em;">628.032 <small><i>(31. Dez. 2016)</i></small><sup class="reference" id="cite_ref-Metadaten_Einwohnerzahl_DE-BW_1-0"><a href="#cite_note-Metadaten_Einwohnerzahl_DE-BW-1">[1]</a></sup>
</td></tr>
<tr class="hintergrundfarbe2">
<td><a href="/wiki/Bev%C3%B6lkerungsdichte" title="Bevölkerungsdichte">Bevölkerungsdichte</a>:
</td>
<td>3029 Einwohner je km<sup>2</sup>
</td></tr>
<tr class="hintergrundfarbe2">
<td style="vertical-align: top;"><a href="/wiki/Postleitzahl_(Deutschland)" title="Postleitzahl (Deutschland)">Postleitzahlen</a>:
</td>
<td>70173–70619
</td></tr>
<tr class="hintergrundfarbe2">
<td style="vertical-align: top;"><a href="/wiki/Telefonvorwahl_(Deutschland)" title="Telefonvorwahl (Deutschland)">Vorwahl</a>:
</td>
<td>0711
</td></tr>
<tr class="hintergrundfarbe2">
<td style="vertical-align: top;"><a href="/wiki/Kfz-Kennzeichen_(Deutschland)" title="Kfz-Kennzeichen (Deutschland)">Kfz-Kennzeichen</a>:
</td>
<td>S
</td></tr>
<tr class="hintergrundfarbe2">
<td style="vertical-align: top;"><a href="/wiki/Amtlicher_Gemeindeschl%C3%BCssel" title="Amtlicher Gemeindeschlüssel">Gemeindeschlüssel</a>:
</td>
<td>08 1 11 000
</td></tr>
<tr class="hintergrundfarbe2 metadata">
<td><a href="/wiki/UN/LOCODE" title="UN/LOCODE">LOCODE</a>:
</td>
<td>DE STR
</td></tr>
<tr class="hintergrundfarbe2 metadata">
<td><a href="/wiki/NUTS" title="NUTS">NUTS</a>:
</td>
<td>DE111
</td></tr>
<tr class="hintergrundfarbe2">
<td style="vertical-align: top;">Stadtgliederung:
</td>
<td>23 <a href="/wiki/Liste_der_Stadtbezirke_und_Stadtteile_von_Stuttgart" title="Liste der Stadtbezirke und Stadtteile von Stuttgart">Stadtbezirke</a><br/>mit 152 Stadtteilen
</td></tr>
<tr class="hintergrundfarbe2">
<td style="vertical-align: top;">Adresse der<br/>Stadtverwaltung:
</td>
<td>Marktplatz 1<br/>70173 Stuttgart
</td></tr>
<tr class="hintergrundfarbe2" style="vertical-align: top;">
<td>Webpräsenz:
</td>
<td style="max-width: 10em; overflow: hidden; word-wrap: break-word;"><a class="external text" href="//www.stuttgart.de/" rel="nofollow">www.stuttgart.de</a>
</td></tr>
<tr class="hintergrundfarbe2">
<td style="vertical-align: top;"><a href="/wiki/Oberb%C3%BCrgermeister" title="Oberbürgermeister">Oberbürgermeister</a>:
</td>
<td><a href="/wiki/Fritz_Kuhn" title="Fritz Kuhn">Fritz Kuhn</a> (<a href="/wiki/B%C3%BCndnis_90/Die_Gr%C3%BCnen" title="Bündnis 90/Die Grünen">Bündnis 90/Die Grünen</a>)
</td></tr>

I have succeeded in writing code that gives me the desired result as a dictionary:

def extractDict(y):
    # hard-coded slice: the 4th to 35th <td> cells of the infobox happen to be
    # the key/value cells of the 'Basisdaten' subtable on this particular page
    results = y.find("th", {"colspan": "2"}).find_parent('table').select('td')[3:35]
    data = []
    for cell in results:
        data.append(cell.text.strip().replace('\xa0', '').replace(':', '').replace('[1]', ''))
    # alternating cells are the keys and the values
    return dict(zip(data[::2], data[1::2]))

basisdaten = extractDict(soup)
basisdaten

Result:

{'Adresse derStadtverwaltung': 'Marktplatz 170173 Stuttgart',
 'Bevölkerungsdichte': '3029Einwohner je km2',
 'Bundesland': 'Baden-Württemberg',
 'Einwohner': '628.032 (31.Dez.2016)',
 'Fläche': '207,35km2',
 'Gemeindeschlüssel': '08111000',
 'Höhe': '247m ü.NHN',
 'Kfz-Kennzeichen': 'S',
 'LOCODE': 'DE STR',
 'NUTS': 'DE111',
 'Oberbürgermeister': 'Fritz Kuhn (Bündnis 90/Die Grünen)',
 'Postleitzahlen': '70173–70619',
 'Regierungsbezirk': 'Stuttgart',
 'Stadtgliederung': '23 Stadtbezirkemit 152 Stadtteilen',
 'Vorwahl': '0711',
 'Webpräsenz': 'www.stuttgart.de'}

However, I am looking for a better solution that does not involve simply picking the 4th to 35th <td> cells of the parent table. I intend to use this code on other, similar Wikipedia URLs, and the 'Basisdaten' tables may vary from page to page in their number of rows.

What all 'Basisdaten' tables have in common is that they are embedded in the first table on the page and have two columns, so each one starts with a <th colspan="2"> header. The parent table contains other subtables; in this case, for example, the subtable 'Lage der Stadt Stuttgart in Baden-Württemberg' comes after 'Basisdaten'.

Is it possible to write a loop that searches for the 'Basisdaten' subtable header, takes all rows after it, and stops when it reaches the next subtable header (<th colspan="2">)?

I have only gotten as far as finding the header cell that marks the start of the 'Basisdaten' table:

soup.find('th', text=re.compile('Basisdaten'))
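
Building on that snippet, here is a minimal sketch of the loop I have in mind. The helper name extract_basisdaten is just a placeholder, and it assumes that every data row between the 'Basisdaten' header and the next <th colspan="2"> has exactly two <td> cells, as in the HTML above. I am not sure whether walking siblings like this is the idiomatic way to do it:

import re
import requests
from bs4 import BeautifulSoup

def extract_basisdaten(soup):
    """Walk the rows after the 'Basisdaten' header row and stop at the
    next two-column subtable header."""
    header_row = soup.find('th', text=re.compile('Basisdaten')).find_parent('tr')
    data = {}
    for row in header_row.find_next_siblings('tr'):
        if row.find('th', {'colspan': '2'}):
            break  # next subtable ('Lage der Stadt ...') starts here
        cells = row.find_all('td')
        if len(cells) == 2:
            key = cells[0].get_text().strip().replace('\xa0', '').rstrip(':')
            # note: footnote markers such as [1] are left in the value here
            value = cells[1].get_text().strip().replace('\xa0', '')
            data[key] = value
    return data

r = requests.get("https://de.wikipedia.org/wiki/Stuttgart")
soup = BeautifulSoup(r.text, "html.parser")
print(extract_basisdaten(soup))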

Hope that made sense! I am very new to BeautifulSoup and Python, and this is a challenging problem for me.

python · beautifulsoup · wikipedia

Answers

answered 3 weeks ago bobrobbob #1

This should do it:

from bs4 import BeautifulSoup
import requests

html = requests.get("https://de.wikipedia.org/wiki/Stuttgart").text
soup = BeautifulSoup(html, "lxml")
trs = soup.select('table[id*="Infobox"] tr')

is_in_basisdaten = False
data = {}
# cell text without surrounding whitespace, non-breaking spaces or colons
clean_data = lambda x: x.get_text().strip().replace('\xa0', '').replace(':', '')

for tr in trs:
    if tr.th:  # a header row marks the start of a subtable
        if "Basisdaten" in tr.th.get_text():
            is_in_basisdaten = True
        elif is_in_basisdaten:
            break  # reached the next subtable header, stop collecting
    elif is_in_basisdaten:
        key, val = tr.select('td')
        data[clean_data(key)] = clean_data(val)

print(data)
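
If you plan to run this on several city pages, a small wrapper along these lines could make it reusable. The function name basisdaten_from_url and the two-cell check are my own additions, not part of the original snippet; the infobox selector is the same one used above:

from bs4 import BeautifulSoup
import requests

def basisdaten_from_url(url):
    """Return the 'Basisdaten' rows of a German-Wikipedia city page as a dict."""
    html = requests.get(url).text
    soup = BeautifulSoup(html, "lxml")

    def clean(cell):
        # cell text without surrounding whitespace, non-breaking spaces or colons
        return cell.get_text().strip().replace('\xa0', '').replace(':', '')

    result = {}
    in_basisdaten = False
    for tr in soup.select('table[id*="Infobox"] tr'):
        if tr.th:
            if "Basisdaten" in tr.th.get_text():
                in_basisdaten = True
            elif in_basisdaten:
                break  # next subtable header, stop
        elif in_basisdaten:
            cells = tr.select('td')
            if len(cells) == 2:  # skip any row that is not a simple key/value pair
                result[clean(cells[0])] = clean(cells[1])
    return result

print(basisdaten_from_url("https://de.wikipedia.org/wiki/Stuttgart"))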
