Unable to retrieve JavaScript-generated data with Python


I have been attempting to scrape the data from this URL: http://www.thesait.org.za/search/newsearch.asp?bst=&cdlGroupID=&txt_country=South+Africa&txt_statelist=&txt_state=&ERR_LS_20161018_041816_21233=txt_statelist%7CLocation%7C20%7C0%7C%7C0 for most of the day, and I now realise I have been incredibly inefficient with my time. I have only recently learned to scrape ordinary HTML websites and seem to be getting the hang of that, but the JavaScript-driven ones are proving to be painful.

The scraper I have worked on so far has yielded the same result from every angle I have approached the problem. Below is the code I am using:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait

PHANTOMJS_PATH = './phantomjs.exe'

# use PhantomJS to navigate to the URL
browser = webdriver.PhantomJS(PHANTOMJS_PATH)
browser.get('http://www.thesait.org.za/search/newsearch.asp?bst=&cdlGroupID=&txt_country=South+Africa&txt_statelist=&txt_state=&ERR_LS_20161018_041816_21233=txt_statelist%7CLocation%7C20%7C0%7C%7C0')

wait = WebDriverWait(browser, 15)  # note: this wait object is created but never actually applied below

# parse the rendered HTML
soup = BeautifulSoup(browser.page_source, "html5lib")

# get all the table rows
test = soup.find_all('tr')

print(test)

My biggest problem is that I can't get the detail I'm looking for: the field highlighted in my screenshot.

I am unable to get the URL associated with that particular name. After getting the URL, I'd like to navigate further into that member's page to pull additional detail.
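
To show the direction I'm aiming for, here is a rough sketch of what I think should eventually work; it assumes the member links render as <a> elements whose ids start with MiniProfileLink_ (the prefix I see in the browser's element inspector, shown further down), so the selector and the wait condition are guesses on my part:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

PHANTOMJS_PATH = './phantomjs.exe'

browser = webdriver.PhantomJS(PHANTOMJS_PATH)
browser.get('http://www.thesait.org.za/search/newsearch.asp?bst=&cdlGroupID=&txt_country=South+Africa')

# wait up to 15 seconds for the member links to be rendered by the page's JavaScript
links = WebDriverWait(browser, 15).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, "a[id^='MiniProfileLink_']"))
)

# collect the profile URLs so each member page can be visited afterwards
urls = [link.get_attribute('href') for link in links]
print(urls)

browser.quit()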

So my questions are the following:

  1. Are there more effective ways of returning the data I am looking for programmatically (given limited time)?
  2. Are there better ways to see how you are navigating through a JavaScript-generated site when scraping?
  3. Please let me know if I need to give more clarity.

Thanks!


Part 2:

I have taken another approach, and am running into another issue.

I've tried getting the tags above using the following:

from selenium import webdriver  
from selenium.common.exceptions import NoSuchElementException  
from selenium.webdriver.common.keys import Keys  
from bs4 import BeautifulSoup

browser = webdriver.Chrome()  
browser.get('http://www.thesait.org.za/search/newsearch.asp?bst=&cdlGroupID=&txt_country=South+Africa')  
html_source = browser.page_source  
browser.quit()

soup = BeautifulSoup(html_source, 'html.parser')
links = soup.find_all('a')
print(links)

In the list of links I am printing, the particular element I'm looking for does not appear, i.e.

<a href="/members/?id=35097829" id="MiniProfileLink_35097829" onmouseover="MiniProfileLink_OnMouseOver(35097829);" onmouseout="HideMiniProfile();" target="_top">Namir Abraham</a>

I then went and attempted to use Selenium's own functionality:

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup

browser = webdriver.Chrome('C:/Users/rschilder/Desktop/Finance24 Scrape/Accountant_scraper/chromedriver.exe')
browser.get('http://www.thesait.org.za/search/newsearch.asp?bst=&cdlGroupID=&txt_country=South+Africa')  
browser.implicitly_wait(30)  # note: implicit waits only apply to find_element calls, not to execute_script

# pull the rendered DOM via JavaScript instead of page_source
html = browser.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
#browser.quit()

print(html)

The challenges I have on this are:

  1. I'm not too sure how to search for and get a specific element using Selenium's own lookup functionality (it's not as intuitive as BeautifulSoup); a rough sketch of what I've been trying is shown after this list.
  2. Even with the Selenium navigation, the element I am looking for (mentioned above) still doesn't appear in the output.
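
For point 1, this is roughly how I understand Selenium's own element lookup to work; the CSS selector is again my guess based on the anchor shown earlier, so it may well be the wrong one:

from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Chrome('C:/Users/rschilder/Desktop/Finance24 Scrape/Accountant_scraper/chromedriver.exe')
browser.get('http://www.thesait.org.za/search/newsearch.asp?bst=&cdlGroupID=&txt_country=South+Africa')
browser.implicitly_wait(30)  # applies to the find_elements call below

# Selenium's rough equivalent of soup.find_all('a', ...): look the elements up in the live DOM
profile_links = browser.find_elements(By.CSS_SELECTOR, "a[id^='MiniProfileLink_']")

for link in profile_links:
    # get_attribute reads from the rendered DOM, so it works on JavaScript-generated nodes
    print(link.get_attribute('href'), link.text)

browser.quit()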
Tags: javascript, python, web-scraping
