pandas read_html on a Wikipedia page: second GDP table has NaN in the GDP column

Samuel M.

I'm trying to read the tables on a Wikipedia page into DataFrames:

pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)')

It works, and I've looked through the returned DataFrames to find the ones that hold the GDP table data. The first table looks okay: all the data is in place.


But the second table has NaN in every row of the GDP column.


This is unexpected. I could work around it with another tool or by collecting the data manually, but there may be a way to tweak pandas to fix this, or it may be worth reporting so that a future version fixes it, so I decided to post the question.

python pandas

Answers

answered 5 days ago tobsecret #1

I am using Python 3.6 and pandas 0.23.0 and got it to work by passing flavor='bs4' to pd.read_html. This makes pandas parse the page with BeautifulSoup and html5lib instead of the default parser, lxml, so you will need html5lib installed along with beautifulsoup4 (I have html5lib version 1.0.1).

Python 3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 17:14:51) 
Type 'copyright', 'credits' or 'license' for more information
IPython 6.4.0 -- An enhanced Interactive Python. Type '?' for help.


In [1]: import pandas as pd

In [2]: df = pd.read_html('http://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)',
                          flavor='bs4')

In [3]: df[2]
Out[3]: 
        0                               1           2
0    Rank                         Country  GDP(US$MM)
1     NaN                       World[19]    79865481
2       1                   United States    19390600
3       —         European Union[n 1][19]    17308862
4       2                      China[n 2]    12014610
5       3                           Japan     4872135
6       4                         Germany     3684816
7       5                  United Kingdom     2624529
8       6                           India     2611012
9       7                          France     2583560
10      8                          Brazil     2054969
11      9                           Italy     1937894
12     10                          Canada     1652412
13     11                     South Korea     1538030
14     12                     Russia[n 3]     1527469
15     13                       Australia     1379548
16     14                           Spain     1313951
17     15                          Mexico     1149236
18     16                       Indonesia     1015411
19     17                          Turkey      849480
20     18                     Netherlands      825745
21     19                    Saudi Arabia      683827
22     20                     Switzerland      678575
23     21                       Argentina      637717
24     22                          Taiwan      579302
25     23                          Sweden      538575
26     24                          Poland      524886
27     25                         Belgium      494733
28     26                        Thailand      455378
29     27                            Iran      431920
..    ...                             ...         ...
165   162                         Lesotho        2721
166   163                     Timor-Leste        2716
167   164                          Bhutan        2321
168   165                         Liberia        2140
169   166                        Djibouti        2082
170   167        Central African Republic        1992
171   168                          Belize        1819
172   169                      Cape Verde        1728
173   170                       St. Lucia        1717
174   171                      San Marino        1592
175   172             Antigua and Barbuda        1535
176   173                      Seychelles        1479
177   174                   Guinea-Bissau        1295
178   175                 Solomon Islands        1273
179   176                         Grenada        1111
180   177                      The Gambia        1038
181   178             St. Kitts and Nevis         939
182   179                           Samoa         844
183   180                         Vanuatu         837
184   181  St. Vincent and the Grenadines         815
185   182                         Comoros         659
186   183                        Dominica         608
187   184                           Tonga         437
188   185           São Tomé and Príncipe         372
189   186  Federated States of Micronesia         329
190   187                           Palau         321
191   188                Marshall Islands         199
192   189                        Kiribati         186
193   190                          Tuvalu          40
194   191                    Vatican City           2

[195 rows x 3 columns]

In [4]: df[3]
Out[4]: 
        0                                 1           2
0    Rank                           Country  GDP(US$MM)
1     NaN                             World    80683787
2       1                     United States    19390604
3       —           European Union[n 1][23]    17277698
4       2                        China[n 5]    12237700
5       3                             Japan     4872137
6       4                           Germany     3677439
7       5                    United Kingdom     2622434
8       6                             India     2597491
9       7                            France     2582501
10      8                            Brazil     2055506
11      9                             Italy     1934798
12     10                            Canada     1653043
13     11                       Russia[n 3]     1577524
14     12                       South Korea     1530751
15     13                         Australia     1323421
16     14                             Spain     1311320
17     15                            Mexico     1149919
18     16                         Indonesia     1015539
19     17                            Turkey      851102
20     18                       Netherlands      826200
21     19                      Saudi Arabia      683827
22     20                       Switzerland      678887
23     21                         Argentina      637590
24     22                            Sweden      538040
25     23                            Poland      524510
26     24                           Belgium      492681
27     25                          Thailand      455221
28     26                              Iran      439514
29     27                           Austria      416596
..    ...                               ...         ...
161   159                       Timor-Leste        2955
162   160                           Lesotho        2639
163   161                            Bhutan        2512
164   162                           Liberia        2158
165   163          Central African Republic        1949
166   164                          Djibouti        1845
167   165                            Belize        1838
168   166                        Cabo Verde        1754
169   167                       Saint Lucia        1712
170   168                        San Marino        1659
171   169               Antigua and Barbuda        1532
172   170                        Seychelles        1486
173   171                     Guinea-Bissau        1347
174   172                   Solomon Islands        1303
175   173                           Grenada        1119
176   174                        The Gambia        1015
177   175             Saint Kitts and Nevis         946
178   176                           Vanuatu         863
179   177                             Samoa         857
180   178  Saint Vincent and the Grenadines         790
181   179                           Comoros         649
182   180                          Dominica         563
183   181                             Tonga         426
184   182             Sao Tome and Principe         391
185   183    Federated States of Micronesia         336
186   184                             Palau         292
187   185                  Marshall Islands         199
188   186                          Kiribati         196
189   187                             Nauru         114
190   188                            Tuvalu          40

[191 rows x 3 columns]
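
The frames printed above keep the header text (Rank, Country, GDP(US$MM)) as row 0, and some country names carry footnote markers such as [n 1] or [19]. Below is a minimal sketch of one way to tidy such a frame, assuming that three-column layout. The function name tidy_gdp_table and the small sample frame are made up for illustration and are not part of pandas or of the session above.

```python
import pandas as pd


def tidy_gdp_table(raw):
    """Promote row 0 to the header, strip footnote markers like "[n 1]" or
    "[19]" from the country names, and convert the GDP column to numbers.

    Hypothetical helper: assumes the columns are named as in the output above.
    """
    # Everything after the first row is data; renumber the rows from 0.
    tidy = raw.iloc[1:].reset_index(drop=True)
    # The first row holds the column names.
    tidy.columns = [str(name) for name in raw.iloc[0]]
    # Remove every bracketed footnote marker, then trim stray whitespace.
    tidy["Country"] = (
        tidy["Country"]
        .astype(str)
        .str.replace(r"\[[^\]]*\]", "", regex=True)
        .str.strip()
    )
    # Values that cannot be parsed as numbers become NaN instead of raising.
    tidy["GDP(US$MM)"] = pd.to_numeric(tidy["GDP(US$MM)"], errors="coerce")
    return tidy


# A few rows copied from the output above, in the same raw shape.
sample = pd.DataFrame([
    ["Rank", "Country", "GDP(US$MM)"],
    [None, "World[19]", "79865481"],
    ["1", "United States", "19390600"],
    ["2", "China[n 2]", "12014610"],
])
print(tidy_gdp_table(sample))
```

With the real page, this would be called as tidy_gdp_table(df[2]) or tidy_gdp_table(df[3]). Passing header=0 to pd.read_html may make the first step unnecessary, depending on how the table markup is parsed.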

Relevant versions:

html5lib                  1.0.1            py36h2f9c1c0_0  
pandas                    0.23.0           py36h637b7d7_0  
python                    3.6.6                hc3d631a_0  
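
Since the fix here came down to switching parsers, the check can be automated. The sketch below tries each flavor in turn and keeps the first result in which no table has a column made up entirely of NaN. The helper names and the reader parameter are hypothetical and exist only to keep the example self-contained; the only pandas calls assumed are pd.read_html with its flavor argument and DataFrame.isna.

```python
import pandas as pd


def has_empty_column(frame):
    """Return True if any column of the frame holds only NaN values."""
    return bool(frame.isna().all(axis=0).any())


def read_tables_without_empty_columns(source, flavors=("lxml", "bs4"),
                                      reader=pd.read_html):
    """Try each parser flavor and return (flavor, tables) for the first one
    whose tables contain no all-NaN column.

    `reader` defaults to pd.read_html; it is a parameter only so that a
    stub can be swapped in (an illustration choice, not a pandas feature).
    """
    problems = []
    for flavor in flavors:
        try:
            tables = reader(source, flavor=flavor)
        except (ImportError, ValueError) as exc:
            # ImportError: the parser library is not installed.
            # ValueError: this parser found no tables in the source.
            problems.append("{}: {}".format(flavor, exc))
            continue
        if not any(has_empty_column(table) for table in tables):
            return flavor, tables
        problems.append("{}: a table has an all-NaN column".format(flavor))
    raise ValueError("no flavor gave clean tables; " + "; ".join(problems))


# Example call (needs network access and the parser libraries installed):
# url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)'
# flavor, tables = read_tables_without_empty_columns(url)
```

The check is deliberately strict: a page may contain layout tables with legitimately empty columns, in which case it would be better to apply has_empty_column only to the table of interest.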

Full environment:

# packages in environment at /home/user/miniconda3/envs/so_question:
#
# Name                    Version                   Build  Channel
backcall                  0.1.0                    py36_0  
beautifulsoup4            4.6.0            py36h49b8c8c_1    anaconda
blas                      1.0                         mkl  
ca-certificates           2018.03.07                    0  
certifi                   2018.4.16                py36_0  
decorator                 4.3.0                    py36_0  
html5lib                  1.0.1            py36h2f9c1c0_0  
intel-openmp              2018.0.3                      0  
ipython                   6.4.0                    py36_0  
ipython_genutils          0.2.0            py36hb52b0d5_0  
jedi                      0.12.0                   py36_1  
libedit                   3.1.20170329         h6b74fdf_2  
libffi                    3.2.1                hd88cf55_4  
libgcc-ng                 7.2.0                hdf63c60_3  
libgfortran-ng            7.2.0                hdf63c60_3  
libstdcxx-ng              7.2.0                hdf63c60_3  
mkl                       2018.0.3                      1  
mkl_fft                   1.0.1            py36h3010b51_0  
mkl_random                1.0.1            py36h629b387_0  
ncurses                   6.1                  hf484d3e_0  
numpy                     1.14.5           py36hcd700cb_3  
numpy-base                1.14.5           py36hdbf6ddf_3  
openssl                   1.0.2o               h20670df_0  
pandas                    0.23.0           py36h637b7d7_0  
parso                     0.2.1                    py36_0  
pexpect                   4.6.0                    py36_0  
pickleshare               0.7.4            py36h63277f8_0  
pip                       10.0.1                   py36_0  
prompt_toolkit            1.0.15           py36h17d85b1_0  
ptyprocess                0.6.0                    py36_0  
pygments                  2.2.0            py36h0d3125c_0  
python                    3.6.6                hc3d631a_0  
python-dateutil           2.7.3                    py36_0  
pytz                      2018.5                   py36_0  
readline                  7.0                  ha6073c6_4  
setuptools                39.2.0                   py36_0  
simplegeneric             0.8.1                    py36_2  
six                       1.11.0           py36h372c433_1  
sqlite                    3.24.0               h84994c4_0  
tk                        8.6.7                hc745277_3  
traitlets                 4.3.2            py36h674d592_0  
wcwidth                   0.1.7            py36hdf4376a_0  
webencodings              0.5.1            py36h800622e_1  
wheel                     0.31.1                   py36_0  
xz                        5.2.4                h14c3975_4  
zlib                      1.2.11               ha838bed_2  
