Webscrape JS rendered Website

brawlins4

I am trying to figure out how to scrape this website https://cnx.org/search?q=subject:%22Arts%22, which is rendered via JavaScript. When I view the page source, there is very little markup, so I know that BeautifulSoup alone can't do this. I have tried Selenium, but I am new to it. Any suggestions on how scraping this site could be accomplished?



answered 4 weeks ago user9973168 #1

Try Puppeteer, Google's official Node library for driving headless Chrome.


npm i puppeteer


const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await page.screenshot({path: 'example.png'});

  // Grab the fully rendered HTML if you want to parse it afterwards.
  const html = await page.content();

  await browser.close();
})();

It's easy to use and has good documentation.

answered 4 weeks ago Rick Sanchez #2

You can use Selenium to do this. You won't be reading the static HTML source, though. Press F12 in Chrome (or install Firebug on Firefox) to open the developer tools. Once there, you can select elements with the pointer icon at the top left of the dev-tools window. After clicking the element you want, right-click the highlighted portion in the "Elements" panel and choose Copy -> Copy XPath. Be careful with quoting in your code: copied XPaths usually contain double quotes, so wrap them in single quotes when passing them to the find_element_by_xpath method.
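For instance, the quoting issue looks like this in practice (using the same XPath as the example below):

```python
# XPath copied from the dev tools -- note the embedded double quotes
# around "results".
xpath = '//*[@id="results"]/div/table/tbody/tr[1]/td[2]/h4/a'

# Wrapping the whole expression in single quotes means no escaping is
# needed; the double-quoted alternative would require backslashes:
# "//*[@id=\"results\"]/div/table/tbody/tr[1]/td[2]/h4/a"
print(xpath)
```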

Essentially you instantiate your browser, go to the page, and find the element by XPath (a query language for addressing a specific node in an XML or HTML document, which works even on pages built with JavaScript). It's roughly like this:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()

# Load the page
driver.get('https://cnx.org/search?q=subject:%22Arts%22')

# Find your element via its XPath (see above for how to copy it).
# Wait for it to appear, since the page is rendered by JavaScript.
# The "Madlavning" entry on the page would be:
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located(
        (By.XPATH, '//*[@id="results"]/div/table/tbody/tr[1]/td[2]/h4/a')))

# Pull the text:
print(element.text)

# Ensure you don't get zombie/defunct chrome/firefox instances
# that suck up resources
driver.quit()

Selenium can be used for plenty of scraping; you just need to know what you want to do once you find the info.
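One common way to guarantee the browser is always cleaned up (the zombie-instance problem mentioned in the comment above) is to put `driver.quit()` in a `finally` block. A minimal sketch of the pattern, using a hypothetical stand-in class so it runs without a browser installed; with Selenium the `try`/`finally` shape is identical:

```python
class FakeDriver:
    """Stand-in for webdriver.Chrome() so this sketch runs anywhere."""
    def __init__(self):
        self.open = True

    def get(self, url):
        pass  # a real driver would load the page here

    def quit(self):
        self.open = False  # a real driver would kill the browser process

driver = FakeDriver()
try:
    driver.get('https://cnx.org/search?q=subject:%22Arts%22')
    # ... find elements, pull text ...
finally:
    # Always runs, even if an exception was raised above, so no
    # defunct browser process is left behind.
    driver.quit()

print(driver.open)
```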

answered 4 weeks ago Dan-Dev #3

You can use the API that the web page gets its data from (via JavaScript) directly: https://archive.cnx.org/search?q=subject:%22Arts%22. It returns JSON, so you just need to parse it.

import requests
import json

url = "https://archive.cnx.org/search?q=subject:%22Arts%22"
r = requests.get(url)
j = r.json()

# Print the whole JSON object, pretty-printed
print(json.dumps(j, indent=4, sort_keys=True))

# Or print specific values
for i in j['results']['items']:
    print(i['title'])
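If you want to experiment with the parsing loop offline, the same code works on a saved response. A sketch using only the stdlib, with a minimal stand-in payload that assumes the `results -> items -> title` shape from the answer above (the titles here are made up for illustration):

```python
import json

# Hypothetical stand-in for the API response body.
payload = '''
{
    "results": {
        "items": [
            {"title": "Music Appreciation"},
            {"title": "Art History"}
        ]
    }
}
'''

j = json.loads(payload)
for i in j['results']['items']:
    print(i['title'])
```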
