Basic web scraping using Beautiful Soup: scrape a table

asked by gnicholas

I'm attempting to learn some basic web scraping. I initially set up Scrapy and found it a bit daunting, so I decided to first use BeautifulSoup for some single-page scraping practice before I move on to crawling. My project idea was to scrape the following table and output the information to a CSV file I can open in Excel.

The table is located at this page on wikipedia: http://en.wikipedia.org/wiki/List_of_largest_corporate_profits_and_losses

The output I got was quite successful! However, I am not sure my code is very "pythonic". I kind of brute-forced my way to the data using some regular expressions, and I feel there is definitely an easier and faster way to grab the table data and strip out the pesky u'Name' formatting and the image links that are scattered throughout the table. I would like to know the standard way of scraping a table and removing formatting, as opposed to my hacky approach.

Specifically, in column 3 of the table (cells[3], zero-indexed) there is an image of the country's flag alongside the information I care about (the country name). Because of this, I could not just do cells[3].find(text=True). I got around it by grabbing all the <a> tags in that cell only and then using regular expressions to pull out the country name contained in the title attribute:

for j, cell in enumerate(cells):
    if j % 3 == 0:
        text = cell.findAll('a')
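
(In hindsight, I suspect the title attribute could be read straight off the tag without any regex; an untested sketch on my part:

link = cells[3].find('a', title=True)  # first <a> in the cell that has a title attribute
if link is not None:
    country_name = link['title']       # e.g. u'United States'

but I was not sure that is any more idiomatic, hence this question.)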

Thanks and sorry for the long post!

from bs4 import BeautifulSoup
import urllib2
import re

wiki = "http://en.wikipedia.org/wiki/List_of_largest_corporate_profits_and_losses"
header = {'User-Agent': 'Mozilla/5.0'} #Needed to prevent 403 error on Wikipedia
req = urllib2.Request(wiki,headers=header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)

table = soup.find("table", { "class" : "wikitable sortable" })

f = open('output.csv', 'w')

num = []; company = []; industry = []; country = []; year = []; reportdate = []
earnings = []; usdinflation = []; usdrealearnings = []; country_names = []

for i,row in enumerate(table.findAll("tr")):
    cells = row.findAll("td")
    if len(cells) == 9:
        num.append(cells[0].find(text=True))
        company.append(cells[1].findAll(text=True))
        industry.append(cells[2].find(text=True))
        country.append(cells[3].find(text=True))
        year.append(cells[4].find(text=True))
        reportdate.append(cells[5].find(text=True))
        earnings.append(cells[6].find(text=True))
        usdinflation.append(cells[7].find(text=True))
        usdrealearnings.append(cells[8].find(text=True))
    for j, cell in enumerate(cells):
        if j % 3 == 0:
            text = cell.findAll('a')
            # Pull the country name out of each link's title="..." attribute.
            newstring = re.search(r'(title="\w+\s\w+")|(title="\w+")', str(text))
            if newstring is not None:
                newstring2 = re.search(r'("\w+")|("\w+\s\w+")', newstring.group())
                country_names.append(newstring2.group())


for i in range(len(num)):
    s = str(company[i])
    newstring = re.search(r'\w+\s|\w+\w+', s).group()
    write_to_file = (str(num[i]) + "," + newstring + "," + str(industry[i]) + "," +
                     country_names[i].encode('utf-8') + "," + str(year[i]) + "," +
                     str(reportdate[i]) + "," + earnings[i].encode('utf-8') + "," +
                     str(usdinflation[i]) + "," + str(usdrealearnings[i]) + "\n")
    f.write(write_to_file)

f.close()
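
As an aside, I also wondered whether Python's csv module would handle the comma-joining and quoting more cleanly than building each line by hand. A rough sketch of what I mean (untested, and the company entries would still need the same regex cleanup as above):

import csv

with open('output.csv', 'wb') as out:  # binary mode for Python 2's csv module
    writer = csv.writer(out)
    for fields in zip(num, company, industry, country_names, year,
                      reportdate, earnings, usdinflation, usdrealearnings):
        # Encode to UTF-8 so non-ASCII names like Nestlé survive the write.
        writer.writerow([unicode(field).encode('utf-8') for field in fields])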
Tags: python, web-scraping, beautifulsoup, wikipedia

Answers

Answer #1, by Amazingred (answered 4 years ago)

How's this:

from bs4 import BeautifulSoup
import urllib2
import re

wiki = "http://en.wikipedia.org/wiki/List_of_largest_corporate_profits_and_losses"
header = {'User-Agent': 'Mozilla/5.0'} #Needed to prevent 403 error on Wikipedia
req = urllib2.Request(wiki,headers=header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
table = soup.find("table", { "class" : "wikitable sortable" })
f = open('output.csv', 'w')
for row in table.findAll('tr'):
    f.write(','.join(''.join([str(i).replace(',','') for i in row.findAll('td',text=True) if i[0]!='&']).split('\n')[1:-1])+'\n')

f.close()

outputs to file:

#,Company,Industry,Country,Year,Report Date,Earnings (Billion),USD Inflation to December 2012[1],USD "Real" Earnings (Billion)
1,ExxonMobil,Oil and gas,United States,2008,31 December 2008,$45.22[2],9.40%,$49.50
2,ExxonMobil,Oil and gas,United States,2006,31 December 2006,$39.5[2],13.95%,$45.01
3,ExxonMobil,Oil and gas,United States,2007,31 December 2007,$40.61[2],9.50%,$44.47
4,ExxonMobil,Oil and gas,United States,2005,31 December 2005,$36.13[3],16.85%,$42.22
5,ExxonMobil,Oil and gas,United States,2011,31 December 2011,$41.06[4],1.90%,$41.84
6,Apple,Consumer electronics,United States,2012,29 September 2012,$41.73 [5],-0.63%,$41.47
-,Industrial & Commercial Bank of China,Banking,China,2012,31 December 2012,RMB 238.7[6],-,$38.07
7,Nestlé,Food processing,Switzerland,2010,31 December 2010,$37.88[7],4.92%,$39.74
.....and so on

Explanation

A couple of things to remember here about Python:

  • Converting u'foo' with str(u'foo') removes the Unicode literal notation (at least for ASCII text in Python 2).
  • Don't underestimate the value of an if/else or a comparison (!=) inside a list comprehension. It's a killer way to filter out garbage without having to code another section; both tricks appear in the snippet below.
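
For example, both in one line:

values = [u'foo', u'&nbsp;', u'bar']

# str() drops the u'' notation (fine for ASCII text in Python 2), and the
# trailing `if` filters out HTML entities starting with '&' in a single pass.
cleaned = [str(v) for v in values if v[0] != '&']
# cleaned == ['foo', 'bar']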

After running prettify() on the table, you'll notice the markup is fairly well formed: every line of data you want in your CSV sits inside its own <tr> tag.
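
For instance, to eyeball that structure:

print table.prettify()[:400]  # quick peek at how the rows and cells are nested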

Using table.findAll('tr'), I've split all of the data (not yet filtered) into a list of rows. findAll returns a list of every instance of the specified tag, in this case every <tr> contained in the table.

We only want the table text, not any of the extra garbage that comes with the formatting, so the text=True in row.findAll('td', text=True) grabs only the text that actually shows in the cells of the table.

I've nested this inside a list comprehension that converts whatever the search returns into a string (removing the u'foo'), joins each line's elements with ',' to match the required CSV format, and adds a couple of if conditions to filter out any remaining garbage such as entities and brackets.
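
Unrolled into ordinary statements, that one-liner is roughly equivalent to this (same logic, just expanded for readability):

for row in table.findAll('tr'):
    # Keep each cell's text, dropping entities ('&...') and stray commas.
    pieces = [str(i).replace(',', '')
              for i in row.findAll('td', text=True) if i[0] != '&']
    # Join, then split on newlines; [1:-1] trims the empty first/last chunks.
    lines = ''.join(pieces).split('\n')[1:-1]
    f.write(','.join(lines) + '\n')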
