Python libraries that can tokenize wikipedia pages

Mark L

I'd like to tokenize Wikipedia pages of interest with a Python library or libraries. I'm most interested in tables and listings. I then want to import this data into Postgres or Neo4j.

For example, here are three data sets that I'd be interested in:

The source of each of these is written in Wikipedia's own markup (wikitext), which is used to render the pages. The raw form uses many Wikipedia-specific tags and syntax constructs. The rendered HTML might almost be the easier route, since I could just use BeautifulSoup.

Anyone know of a better way of tokenizing? I feel I'd be reinventing the wheel if I took the final HTML and parsed it with BeautifulSoup. Also, even if I found a way to output these pages as XML, the table data might not be tokenized enough and would require further processing.



answered 6 years ago Burhan Khalid #1

Since Wikipedia is built on MediaWiki, there is an API you can use. There is also Special:Export, which you can use to dump a page's raw wikitext.
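As a rough sketch of the API route, you can build a request URL for the MediaWiki `action=parse` endpoint with the standard library alone; the page title below is just an example, and the exact JSON layout of the response may vary by API version:

```python
from urllib.parse import urlencode

def wikitext_api_url(title, endpoint="https://en.wikipedia.org/w/api.php"):
    """Build a MediaWiki API URL that asks for a page's raw wikitext as JSON."""
    params = {
        "action": "parse",
        "page": title,
        "prop": "wikitext",
        "format": "json",
    }
    return endpoint + "?" + urlencode(params)

url = wikitext_api_url("List of countries by population")
# Fetch with urllib.request.urlopen(url) and decode the JSON body;
# with this format the wikitext typically sits under
# data["parse"]["wikitext"]["*"] (key layout is an assumption here).
```

Building the URL separately from fetching it keeps the sketch testable offline and makes it easy to point at another MediaWiki installation via `endpoint`.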

Once you have the raw data, then you can run it through mwlib to parse it.
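If you only need table rows out of the wikitext, here is a minimal stdlib-only sketch of what that tokenization looks like; this is not mwlib's API, just an illustration that handles simple `{| ... |}` tables without nested templates or cell attributes:

```python
def parse_wikitext_table(text):
    """Split a simple wikitext table into a list of rows of cell strings."""
    rows, current = [], []
    for line in text.strip().splitlines():
        line = line.strip()
        if line.startswith("{|") or line.startswith("|}"):
            continue  # table open/close markers carry no cell data
        if line.startswith("|-"):
            if current:               # row separator: flush the row so far
                rows.append(current)
                current = []
        elif line.startswith("!") or line.startswith("|"):
            # header cells are separated by "!!", body cells by "||"
            sep = "!!" if line.startswith("!") else "||"
            cells = line.lstrip("!|").split(sep)
            current.extend(c.strip() for c in cells)
    if current:
        rows.append(current)
    return rows

sample = """{| class="wikitable"
! Name !! Population
|-
| Berlin || 3,700,000
|-
| Hamburg || 1,900,000
|}"""

rows = parse_wikitext_table(sample)
```

For real pages you'd want mwlib (or a maintained wikitext parser) rather than this sketch, since production wikitext nests templates, links, and attributes inside cells.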

answered 6 years ago jhonkola #2

This goes more in the semantic web direction, but DBpedia allows querying parts of Wikipedia's data (a community conversion effort) with SPARQL. That makes it theoretically straightforward to extract the data you need, though dealing with RDF triples can be cumbersome.

Furthermore, I don't know whether DBpedia yet contains any data that is of interest to you.
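For a concrete idea of what a query looks like, a GET request against DBpedia's public SPARQL endpoint can be assembled with the standard library; the `dbo:populationTotal` property in the example query is an assumption about DBpedia's current ontology:

```python
from urllib.parse import urlencode

DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"

# Illustrative query: countries and their populations. Whether DBpedia
# actually models this via dbo:populationTotal is an assumption.
QUERY = """
SELECT ?country ?population WHERE {
  ?country a dbo:Country ;
           dbo:populationTotal ?population .
} LIMIT 10
"""

def sparql_request_url(query, fmt="application/sparql-results+json"):
    """Build a GET URL for a SPARQL query against the public endpoint."""
    return DBPEDIA_ENDPOINT + "?" + urlencode({"query": query, "format": fmt})

url = sparql_request_url(QUERY)
# urllib.request.urlopen(url) would return JSON result bindings,
# which map more directly onto Postgres rows or Neo4j nodes than raw RDF.
```

The JSON results format is usually the easiest to flatten into rows for Postgres or into node/relationship inserts for Neo4j.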
