Web Scraping code review

Ezzy Source
from bs4 import BeautifulSoup
import requests
import pandas as pd

records=[]
keep_looking = True
url = 'https://www.tapology.com/fightcenter'
while keep_looking:
    re = requests.get(url)
    soup = BeautifulSoup(re.text,'html.parser')
    data = soup.find_all('section',attrs={'class':'fcListing'})
    for d in data:
        event = d.find('a').text
        date = d.find('span',attrs={'class':'datetime'}).text[1:-4]
        location = d.find('span',attrs={'class':'venue-location'}).text
        mainEvent = first.find('span',attrs={'class':'bout'}).text

    url_tag = soup.find('div',attrs={'class':'fightcenterEvents'})

    if not url_tag:
        keep_looking = False
    else:
        url = "https://www.tapology.com" + url_tag.find('a')['href']

I am wondering if there are any errors in my code? It is running, but it is taking a very long time to finish, and I am afraid it might be stuck in an infinite loop. Any feedback would be helpful. Please do not rewrite all of this and post it, as I would like to keep this format; I am learning and want to improve. Thank you.

python-3.x  web-scraping  beautifulsoup

Answers

answered 3 weeks ago SIM #1

Although this is not the right site to ask for a code review, I am giving a solution anyway, because from your description it sounds like you may be falling into an infinite loop.
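
The likely culprit (I am assuming the page markup here, so treat this as a guess rather than a verified diagnosis): your next-page URL is taken from the first <a> inside the fightcenterEvents container. If that anchor does not point at the actual next page, url never advances to a page without that container, so keep_looking never becomes False:

url_tag = soup.find('div',attrs={'class':'fightcenterEvents'})
if url_tag:
    # find('a') returns the FIRST anchor in the container, which may not be
    # the "next page" link; if it resolves to the same or an earlier page,
    # url repeats and the while loop never ends
    url = "https://www.tapology.com" + url_tag.find('a')['href']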

Try this to get the information from that site. It will keep going as long as there is a next-page link to traverse; once there is no next-page link left to follow, the script stops automatically.

from bs4 import BeautifulSoup
from urllib.parse import urljoin
import requests

url = 'https://www.tapology.com/fightcenter'

while True:
    res = requests.get(url)  # fetch the current page
    soup = BeautifulSoup(res.text,'html.parser')
    for data in soup.find_all('section',attrs={'class':'fcListing'}):
        event = data.select_one('.name a').get_text(strip=True)
        date = data.find('span',attrs={'class':'datetime'}).get_text(strip=True)[:-1]
        location = data.find('span',attrs={'class':'venue-location'}).get_text(strip=True)
        try:
            mainEvent = data.find('span',attrs={'class':'bout'}).get_text(strip=True)
        except AttributeError:
            # some listings have no bout span; fall back to an empty string
            mainEvent = ""

        print(f'{event} {date} {location} {mainEvent}')

    urltag = soup.select_one('.pagination a[rel="next"]')
    if not urltag:
        break  # no next-page link left, so stop crawling
    url = urljoin(url,urltag.get("href"))  # urljoin resolves the relative href, so no hardcoded domain prefix is needed
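
In case urljoin is new to you, here is a quick illustration of how it resolves an href against the current page URL (the hrefs below are made up purely to show the behaviour, not taken from the site):

from urllib.parse import urljoin

base = 'https://www.tapology.com/fightcenter'
print(urljoin(base, '/fightcenter?page=2'))    # https://www.tapology.com/fightcenter?page=2
print(urljoin(base, 'fightcenter?page=3'))     # https://www.tapology.com/fightcenter?page=3
print(urljoin(base, 'https://example.com/x'))  # absolute hrefs are returned unchanged

Either relative form works, which is why you do not need to hard-code the "https://www.tapology.com" prefix yourself.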

For future reference: feel free to post any question on this site to get your code reviewed.
