I am trying to scrape data with Python (the Requests and BeautifulSoup4 libraries, along with Selenium).
When I try to get data from a website where the content loads after a delay, it returns an empty value. I understand that for this task I have to use WebDriverWait.
```python
import requests
from bs4 import BeautifulSoup
# selenium imports
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

# Initialize a Chrome webdriver
driver = webdriver.Chrome()

# Grab the web page
driver.get("http://")

# Use selenium.webdriver.support.ui.Select (imported above) to grab
# the select element named lmStatType, then select the first value.
# We use .find_element_by_name here because we know the name.
dropdown = Select(driver.find_element_by_name("lmStatType"))
dropdown.select_by_value("1")

# select the year 2560
dropdown = Select(driver.find_element_by_name("lmYear"))
dropdown.select_by_value("60")

# Now we can grab the search button and click it
# (note: find_element_by_xpath, not find_elements_by_xpath --
# the plural form returns a list, which has no .click() method)
search_button = driver.find_element_by_xpath("//*[contains(text(), 'ตกลง')]")
search_button.click()

# Feed the driver's .page_source into Beautiful Soup
doc = BeautifulSoup(driver.page_source, "html.parser")

# It's a tricky table; also tried with class names
rows = doc.find('table', id='datatable')
print(rows)  # returns empty
```
In the example above I have left out the WebDriverWait and timeout-related statements I tried, to keep the steps easy to follow, even though I have attempted several workarounds with them.
I also tried grabbing the district-level data separately, like this (but I can't figure out the exact class/id):
```python
url = 'http://'
res = requests.get(url)
soup = BeautifulSoup(res.text, "lxml")
for tr in soup.find(class_="display").find_all("tr"):
    data = [item.get_text(strip=True) for item in tr.find_all(["th", "td"])]
    print(data)
```
Any help is appreciated. Thanks in advance. My apologies if this is a duplicate question.
As I stated in a comment, the HTML actually gives you the endpoint it loads that data from. From there on, it's quite easy to get the data using requests.
Your HTML reads: `"sAjaxSource": "../datasource/showStatProvince.php?statType=1&year=60"`. This is the endpoint the site uses, so you just need to go one level up in the site's URL structure and request "/datasource/..." instead.
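To make the relative path concrete: the `../` can be resolved against whatever page URL you started from using the standard library's `urllib.parse.urljoin` (the base URL below is a made-up placeholder, since the real one isn't shown in the question; the relative path is the `sAjaxSource` value from the HTML):

```python
from urllib.parse import urljoin

# Hypothetical page URL standing in for the real one (omitted in the question);
# ajax_source is the sAjaxSource value taken from the page's HTML.
page_url = "http://example.com/stats/index.php"
ajax_source = "../datasource/showStatProvince.php?statType=1&year=60"

# urljoin resolves "../" the same way a browser would
endpoint = urljoin(page_url, ajax_source)
print(endpoint)  # → http://example.com/datasource/showStatProvince.php?statType=1&year=60
```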
Here I'm printing the results, but say you wanted to follow the links and grab that data: you could store the results in a list of dicts and iterate over it afterwards, or do it inside the for loop.
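A minimal sketch of the "list of dicts" idea, assuming the endpoint returns an HTML table whose first row holds the column headers (the helper name and the sample usage below are illustrative, not from the original code):

```python
from bs4 import BeautifulSoup

def table_to_dicts(html):
    """Parse the first <table> in `html` into a list of dicts,
    keyed by the text of the header row."""
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.find("table").find_all("tr")
    headers = [cell.get_text(strip=True) for cell in rows[0].find_all(["th", "td"])]
    records = []
    for tr in rows[1:]:
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        records.append(dict(zip(headers, cells)))
    return records

# Usage against the endpoint (URL shortened, as in the answer above):
# res = requests.get("http://.../datasource/showStatProvince.php?statType=1&year=60")
# for record in table_to_dicts(res.text):
#     print(record)
```

Each dict then carries its column names with it, which makes a follow-up loop over the records much easier to read than indexing into bare lists.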