Extracting tables from Wikipedia XML dump

Asked by SKandeel

I want to parse the Wikipedia XML dump and extract all the different kinds of tables from it (not just infoboxes).

I am using wikixmlj to parse the dump, but the problem is handling the different types of tables it contains (tables with split cells, merged cells, or color coding).

I was able to parse the XML articles until I found the items marked as tables, but I have no standard to follow when parsing the tables into objects, and it appears that there are many types of tables with many different arrangements.

Is there a documented standard for table types that I can follow, so that I can cover them in the runtime objects I am going to create, or is there any way to get around this?

NOTE:

These are some examples to show what I mean:

http://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States (see the Andrew Jackson row: some rows are merged and split)

http://en.wikipedia.org/wiki/List_of_pharaohs

http://en.wikipedia.org/wiki/Open_Handset_Alliance

http://en.wikipedia.org/wiki/Comparison_of_web_server_software (sometimes the header appears at both the top and the bottom)

Tags: java, xml-parsing, extract, wikipedia, large-data

Answers

answered 6 years ago SKandeel #1

Okay, if you're interested only in the tables themselves, you need to do the following:

1. Download the Wikipedia dump (the full dump).

2. Extract the tables from the dump into a separate file or set of files, using the regex \{\|[\s|\S]+?\n\|-?\} (a sketch of this step follows the list).

3. Use the gwtwiki library to build a model for the dump and then convert only the tables file to HTML (a sketch is at the end of this answer):

- Add this class and this class to the project.

- Add the necessary gwtwiki libraries and the other required dependencies.
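
Here is a minimal sketch of step 2 in Java, assuming the page text has already been read out of the dump (for example with wikixmlj) and is available as a String. The class and method names are made up for illustration; the pattern is exactly the regex from step 2:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class TableExtractor {

        // Wikitable markup starts with "{|" and ends with "|}" at the start of a
        // line; [\s|\S] matches any character, so the pattern spans multiple
        // lines without needing the DOTALL flag.
        private static final Pattern TABLE_PATTERN =
                Pattern.compile("\\{\\|[\\s|\\S]+?\\n\\|-?\\}");

        // Returns the raw wiki markup of every table found in one page's text.
        public static List<String> extractTables(String pageWikiText) {
            List<String> tables = new ArrayList<String>();
            Matcher matcher = TABLE_PATTERN.matcher(pageWikiText);
            while (matcher.find()) {
                tables.add(matcher.group());
            }
            return tables;
        }
    }

Note that the non-greedy match stops at the first |} it finds, so a table nested inside another table will cut the outer match short; nested tables are one of the cases you will still have to handle yourself.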


You now have HTML files that hold the tables that appeared in the entire Wikipedia dump, and since the tables are in HTML they are easy to manipulate (note that if you manipulate any of these files through code, write the output as a Unicode file, because of the encoding of some of the characters in the tables).
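
As a rough sketch of step 3 together with the encoding note above, assuming the static WikiModel.toHtml convenience method of the gwtwiki (bliki) library; the sample table markup and the output file name are placeholders:

    import info.bliki.wiki.model.WikiModel;

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.OutputStreamWriter;
    import java.io.Writer;

    public class TableToHtml {

        public static void main(String[] args) throws IOException {
            // Raw table markup as produced by the extraction step above.
            String tableWikiText = "{|\n|-\n| cell 1 || cell 2\n|}";

            // gwtwiki (bliki) renders wiki markup to HTML.
            String html = WikiModel.toHtml(tableWikiText);

            // Write the result explicitly as UTF-8 so the non-ASCII characters
            // that appear in some tables survive (the "unicode file" note above).
            Writer out = new OutputStreamWriter(new FileOutputStream("tables.html"), "UTF-8");
            try {
                out.write(html);
            } finally {
                out.close();
            }
        }
    }

Writing through an OutputStreamWriter with an explicit UTF-8 charset is what keeps the non-ASCII characters in the tables intact.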
