Beautiful Soup Wikipedia nested tables

fdrigo

I am new to Beautiful Soup and nested tables, so I am trying to get some experience by scraping a Wikipedia table.

I have searched the web for a good example but unfortunately have not found anything.

My goal is to parse the table "States of the United States of America" on this web page into a pandas DataFrame. As you can see from my code below, I have the following issues:

1) I cannot extract all the columns. Apparently my code does not import all the columns properly into a pandas DataFrame, and writes the entries of the third column of the HTML table below the first column.

2) I do not know how to deal with the colspan="2" that appears in some rows of the table. In my pandas DataFrame, I would like the capital and the largest city to have the same entry when they are the same city.
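To make that second issue concrete, here is a minimal sketch of how a colspan="2" row looks to Beautiful Soup. The table fragment is a toy example of my own, not the real Wikipedia markup: the merged row simply has one fewer td cell, which is what throws off column-by-index extraction.

```python
from bs4 import BeautifulSoup

# Toy fragment (not the real Wikipedia markup): the second row merges
# the Capital and Largest-city columns with colspan="2"
html = """
<table>
  <tr><td>Alabama</td><td>Montgomery</td><td>Birmingham</td></tr>
  <tr><td>Wyoming</td><td colspan="2">Cheyenne</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')
rows = soup.find_all('tr')

# The merged row has fewer cells, so fixed indices like cells[2] break
print([len(row.find_all('td')) for row in rows])   # [3, 2]
print(rows[1].find_all('td')[1].get('colspan'))    # '2'
```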

Here is my code. Note that I got stuck trying to overcome my first issue.

Code:

from urllib.request import urlopen
import pandas as pd

wiki='https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States'
page = urlopen(wiki)

from bs4 import BeautifulSoup

soup = BeautifulSoup(page)

right_table=soup.find_all('table')[0] # First table

rows = right_table.find_all('tr')[2:]

A=[]
B=[]
C=[]
D=[]
F=[]

for row in rows:
    cells = row.findAll('td')
#   print(len(cells))
    if len(cells)>=11: #Only extract table body not heading
        A.append(cells[0].find(text=True))
        B.append(cells[1].find(text=True))
        C.append(cells[2].find(text=True))
        D.append(cells[3].find(text=True))
        F.append(cells[4].find(text=True))

df=pd.DataFrame(A,columns=['State'])
df['Capital']=B
df['Largest']=C
df['Statehood']=D   
df['Population']=F
df
print(df)

Do you have any suggestions? Any help understanding BeautifulSoup better would be appreciated. Thanks in advance.

Tags: python, web-scraping, beautifulsoup, html-parsing

Answers

answered 11 months ago Bill Bell #1

Here's the strategy I would use.

I notice that each line in the table is complete but, as you say, some lines have two cities in the 'Cities' columns and some have only one. This means that we can use the number of items in a line to determine whether we need to 'double' the city name mentioned in that line or not.

I begin the way you did.

>>> import requests
>>> import bs4
>>> page = requests.get('https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States').content
>>> soup = bs4.BeautifulSoup(page, 'lxml')
>>> right_table=soup.find_all('table')[0]

Then I find all of the rows in the table and verify that it's at least approximately correct.

>>> trs = right_table('tr')
>>> len(trs)
52

I poke around until I find the lines for Alabama and Wyoming, the first and last rows, and display their texts. They're examples of the two types of rows!

>>> trs[2].text
'\n\xa0Alabama\nAL\nMontgomery\nBirmingham\n\nDec 14, 1819\n\n\n4,863,300\n\n52,420\n135,767\n50,645\n131,171\n1,775\n4,597\n\n7\n\n'
>>> trs[51].text
'\n\xa0Wyoming\nWY\nCheyenne\n\nJul 10, 1890\n\n\n585,501\n\n97,813\n253,335\n97,093\n251,470\n720\n1,864\n\n1\n\n'

I notice that I can split these strings on \n and \xa0. This can be done with a regex.

>>> import re
>>> [_ for _ in re.split(r'[\n\xa0]', trs[51].text) if _]
['Wyoming', 'WY', 'Cheyenne', 'Jul 10, 1890', '585,501', '97,813', '253,335', '97,093', '251,470', '720', '1,864', '1']
>>> [_ for _ in re.split(r'[\n\xa0]', trs[2].text) if _]
['Alabama', 'AL', 'Montgomery', 'Birmingham', 'Dec 14, 1819', '4,863,300', '52,420', '135,767', '50,645', '131,171', '1,775', '4,597', '7']

The if _ conditional in these list comprehensions is to discard empty strings.

The Wyoming list has 12 items, Alabama's has 13. I would leave Alabama's row as it is for pandas. I would extend Wyoming's (and all the others of length 12) using:

>>> row = [_ for _ in re.split(r'[\n\xa0]', trs[51].text) if _]
>>> row[:3]+row[2:]
['Wyoming', 'WY', 'Cheyenne', 'Cheyenne', 'Jul 10, 1890', '585,501', '97,813', '253,335', '97,093', '251,470', '720', '1,864', '1']
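Putting the pieces together, here is a self-contained sketch of that normalisation. The two row texts are copied from the session above so no network access is needed, and the full row length of 13 is assumed from those two examples:

```python
import re

# Row texts copied from the interactive session above
alabama = ('\n\xa0Alabama\nAL\nMontgomery\nBirmingham\n\nDec 14, 1819\n\n\n'
           '4,863,300\n\n52,420\n135,767\n50,645\n131,171\n1,775\n4,597\n\n7\n\n')
wyoming = ('\n\xa0Wyoming\nWY\nCheyenne\n\nJul 10, 1890\n\n\n585,501\n\n'
           '97,813\n253,335\n97,093\n251,470\n720\n1,864\n\n1\n\n')

def split_row(text):
    # Split on newlines and non-breaking spaces, discarding empty strings
    return [field for field in re.split(r'[\n\xa0]', text) if field]

def normalize(fields, full_length=13):
    # Rows whose capital is also the largest city come out one field short;
    # repeat the city name so every row has the same shape
    if len(fields) < full_length:
        fields = fields[:3] + fields[2:]
    return fields

rows = [normalize(split_row(text)) for text in (alabama, wyoming)]
```

From here the uniform rows can be handed straight to pd.DataFrame with a list of column names.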

answered 11 months ago A Magoon #2

The solution below should fix both issues you have mentioned.

from urllib.request import urlopen
import pandas as pd
from bs4 import BeautifulSoup

wiki='https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States?action=render'
page = urlopen(wiki)
soup = BeautifulSoup(page, 'html.parser')
right_table=soup.find_all('table')[0] # First table
rows = right_table.find_all('tr')[2:]

A=[]
B=[]
C=[]
D=[]
F=[]

for row in rows:
    cells = row.findAll('td')
    combine_cells = cells[1].get('colspan')  # Tells us whether the Capital and Largest-city columns are merged
    cells = [cell.text.strip() for cell in cells]  # Extracts text and removes whitespace for each cell
    index = 0  # allows us to modify columns below

    A.append(cells[index])  # State Code
    B.append(cells[index + 1])  # Capital
    if combine_cells:  # Shift columns over by one if columns 2 and 3 are combined
        index -= 1

    C.append(cells[index + 2])  # Largest
    D.append(cells[index + 3])  # Established
    F.append(cells[index + 4])  # Population

df=pd.DataFrame(A,columns=['State'])
df['Capital']=B
df['Largest']=C
df['Statehood']=D
df['Population']=F
df
print(df)

Edit: Here's a cleaner version of the above code

import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlopen

wiki = 'https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States'
page = urlopen(wiki)
soup = BeautifulSoup(page, 'html.parser')
table_rows = soup.find('table')('tr')[2:]  # Get all table rows
cells = [row('td') for row in table_rows]  # Get all cells from rows


def get(cell):  # Get stripped string from tag
    return cell.text.strip()


def is_span(cell):  # Check if cell has the 'colspan' attribute <td colspan="2"></td>
    return cell.get('colspan')


df = pd.DataFrame()
df['State'] = [get(cell[0]) for cell in cells]
df['Capital'] = [get(cell[1]) for cell in cells]
df['Largest'] = [get(cell[2]) if not is_span(cell[1]) else get(cell[1]) for cell in cells]
df['Statehood'] = [get(cell[3]) if not is_span(cell[1]) else get(cell[2]) for cell in cells]
df['Population'] = [get(cell[4]) if not is_span(cell[1]) else get(cell[3]) for cell in cells]
print(df)
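As an aside, since pandas is already in play: pandas.read_html expands colspan cells automatically, repeating the merged value across the spanned columns, so for tables like this one it can replace the manual index shifting entirely. A minimal sketch on a toy table of my own (a parser backend such as lxml is assumed to be installed):

```python
import io

import pandas as pd

# Toy table (not the real Wikipedia markup) with one merged Capital/Largest cell
html = """
<table>
  <tr><th>State</th><th>Capital</th><th>Largest</th></tr>
  <tr><td>Alabama</td><td>Montgomery</td><td>Birmingham</td></tr>
  <tr><td>Wyoming</td><td colspan="2">Cheyenne</td></tr>
</table>
"""

# read_html repeats a colspan cell's value across the merged columns
df = pd.read_html(io.StringIO(html))[0]
print(df)  # Cheyenne appears in both the Capital and Largest columns
```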
