Using MediaWiki to pull text from a Wikia page, but it comes back in a big mess. Is there a better way to pull text from each section?


The easiest way, if you don't want to parse the wiki markup yourself, is to retrieve the parsed HTML version of the page and then process it with an HTML parser (like jsoup, as recommended by hasham).
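
As a rough illustration of the "process it with an HTML parser" step, here is a minimal sketch. jsoup is a Java library; this sketch swaps in Python's standard-library `html.parser` instead, so it is not what the answer itself used. The class name `SectionTextParser` and the function `split_sections` are hypothetical names made up for this example, and the sketch assumes that section headings are rendered as `<h2>` elements. On a real page the headings may also carry edit-link text, and script or style content is not filtered out, so expect to adjust it.

```python
from html.parser import HTMLParser


class SectionTextParser(HTMLParser):
    """Collects visible text, grouped under the most recent <h2> heading.

    Hypothetical helper for illustration; assumes sections start at <h2>.
    """

    def __init__(self):
        super().__init__()
        self.sections = {"": []}  # text before the first heading goes under ""
        self._current = ""
        self._in_heading = False
        self._heading_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_heading = True
            self._heading_parts = []

    def handle_endtag(self, tag):
        if tag == "h2" and self._in_heading:
            self._in_heading = False
            self._current = "".join(self._heading_parts).strip()
            self.sections.setdefault(self._current, [])

    def handle_data(self, data):
        if self._in_heading:
            # Text inside the heading becomes the section title.
            self._heading_parts.append(data)
        elif data.strip():
            # Any other non-blank text belongs to the current section.
            self.sections[self._current].append(data.strip())


def split_sections(html):
    """Return a dict mapping heading text to the plain text that follows it."""
    parser = SectionTextParser()
    parser.feed(html)
    parser.close()
    return {title: " ".join(chunks) for title, chunks in parser.sections.items()}
```

Calling `split_sections(page_html)` on the HTML of a page gives one entry per heading, plus an entry with an empty key for any text that appears before the first heading.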

Besides just scraping the normal wiki user interface (which gives you the page HTML wrapped in the navigation skin), there are two ways of getting the HTML text of a MediaWiki page:

  1. Use the API with action=parse, which returns the page HTML wrapped in a MediaWiki API XML (or JSON / YAML / etc.) response, like this:

    • http://scottlandminecraft.wikia.com/api.php?format=xml&action=parse&page=zackscott

  2. Or use the main index.php script with action=render, which returns just the page HTML:

    • http://scottlandminecraft.wikia.com/index.php?action=render&title=zackscott
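
The two options above can be sketched in a few lines of Python, using only the standard library. The base URL and the page name come from the examples in this answer. The function names (`parse_api_url`, `render_url`, `fetch`) are hypothetical, and the sketch assumes the wiki serves `api.php` and `index.php` directly under its root, which is not true of every MediaWiki installation.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL taken from the examples above; other wikis will differ.
WIKI_BASE = "http://scottlandminecraft.wikia.com"


def parse_api_url(page, fmt="json", base=WIKI_BASE):
    """URL for option 1: api.php with action=parse (HTML wrapped in an API response)."""
    query = urlencode({"format": fmt, "action": "parse", "page": page})
    return f"{base}/api.php?{query}"


def render_url(title, base=WIKI_BASE):
    """URL for option 2: index.php with action=render (just the page HTML)."""
    query = urlencode({"action": "render", "title": title})
    return f"{base}/index.php?{query}"


def fetch(url, timeout=10):
    """Download a URL and decode the body as UTF-8 (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

For example, `fetch(render_url("zackscott"))` would return the bare page HTML if the wiki is reachable, while `fetch(parse_api_url("zackscott"))` would return a JSON document that still has to be decoded before the HTML can be used.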

PS. Since you mention sections in your question, note that the action=parse API module can return information about the sections on the page using prop=sections (or even prop=sections|text). For an example, see this API query:

  • http://scottlandminecraft.wikia.com/api.php?format=xml&action=parse&page=zackscott&prop=sections
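
To show how the prop=sections output might be used, here is a small Python sketch that builds the query and reads the section list out of a JSON response. The names `sections_url`, `section_text_url` and `list_sections` are hypothetical, and `SAMPLE_RESPONSE` is a hand-written illustration of the response shape, not real data from the wiki. The `section` parameter used to request a single section's HTML is an assumption worth checking against the wiki's own api.php help page, since older installations may behave differently.

```python
import json
from urllib.parse import urlencode

# API endpoint taken from the example above; other wikis will differ.
WIKI_API = "http://scottlandminecraft.wikia.com/api.php"


def sections_url(page, api=WIKI_API):
    """URL that asks action=parse for the page's section list as JSON."""
    query = urlencode(
        {"format": "json", "action": "parse", "page": page, "prop": "sections"}
    )
    return f"{api}?{query}"


def section_text_url(page, section_index, api=WIKI_API):
    """URL that asks for the HTML of one section (assumes `section` is supported)."""
    query = urlencode(
        {
            "format": "json",
            "action": "parse",
            "page": page,
            "section": section_index,
            "prop": "text",
        }
    )
    return f"{api}?{query}"


def list_sections(response_body):
    """Return (index, heading) pairs from an action=parse JSON response."""
    data = json.loads(response_body)
    return [(s["index"], s["line"]) for s in data["parse"]["sections"]]


# Illustrative response only; the section names are made up.
SAMPLE_RESPONSE = """{"parse": {"title": "Example", "sections": [
  {"toclevel": 1, "level": "2", "line": "Biography", "number": "1", "index": "1", "anchor": "Biography"},
  {"toclevel": 1, "level": "2", "line": "Trivia", "number": "2", "index": "2", "anchor": "Trivia"}
]}}"""
```

The idea is to fetch `sections_url(page)` once, pass the body to `list_sections`, and then request each section separately with `section_text_url(page, index)`, so the text arrives already split by section.
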
By Hattan Shobokshi on August 28 2022
