Python Web Scraping Guide: Step-by-Step with Code [2023]

Posted in Python, Web Analytics

Web scraping is a process that involves writing programs to fetch and parse publicly available data from websites. Depending on the website’s design, this may be a simple extraction of unstructured data or require you to simulate human actions like clicking on links or filling out forms. 

The Internet is an endlessly growing collection of websites, making it a massive informational data resource. By taking advantage of web scraping techniques, a range of industries, including data science, business intelligence, and more, can extract huge value from this information.

Python web scraping is one of the most popular ways to accomplish this activity. With an intuitive syntax and a range of powerful third-party web scraping libraries, web scraping in Python is an excellent way to produce structured data from public websites.

Why Learn Python Web Scraping?

  • Produce structured data: Gather and transform publicly available website data in a range of unstructured formats
  • Automation: Replace the slow and tedious process of manually collecting web data, saving time and effort and increasing productivity
  • Substitute for an API: Extract information from websites that don’t provide an API or other means for accessing their data
  • Data monitoring: Track competitors, SEO, news development, social media, etc
  • Marketing: Market analysis, lead generation, market trends, etc

Python Data Scraping Skills

Before you start data scraping with Python, you’ll need to understand various concepts.

  • Python Basics: Variables, data types, collections, loops, control structures, etc
  • Document Object Model (DOM): The tree structure of objects your browser creates when a page loads; it allows scripts to access and update a web page's content, structure, and style
  • HTML & XML Basics: These structure and format web pages; you'll need to understand tags and attributes to scrape website data with Python successfully
  • HTTP Methods: At the minimum, understand GET, POST, PUT, and DELETE methods
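To get a feel for how tags and attributes drive scraping, here's a minimal sketch using Python's built-in html.parser module. The class name and sample HTML are our own, purely for illustration:

```python
from html.parser import HTMLParser

class TitleCollector(HTMLParser):
    """Collect the text of every <p> tag that has class="title"."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes
        if tag == 'p' and ('class', 'title') in attrs:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'p':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data.strip())

html = '<ul><li><p class="title">First paper</p></li><li><p class="title">Second paper</p></li></ul>'
parser = TitleCollector()
parser.feed(html)
print(parser.titles)  # ['First paper', 'Second paper']
```

The third-party libraries covered below hide this event-driven machinery behind friendlier APIs, but the underlying idea of matching tags and attributes is the same.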

Are you a Python beginner who wants to start web scraping? To boost your skills, check out the:

Best Free Udemy Courses

Web Scraping Python Libraries 

We can pick from a range of popular Python libraries to scrape web data. We've selected the three most popular third-party libraries and compared their key features in the table below, so if you're wondering how to scrape data from a website with Python, this can help you choose.

While these are all useful for general web scraping, it helps to know when and why to use these tools based on your web scraping goals and the size of your task.

 

|  | Beautiful Soup | Selenium | Scrapy |
| --- | --- | --- | --- |
| Ease of Use | Straightforward | Medium | Medium-Hard |
| Speed | Fast | Medium | Fast |
| JavaScript Support | No | Yes | Requires middleware |
| Learning Curve | Easy to learn | Medium | Medium-Hard |
| Documentation | Excellent | Good | Good |
| Pros | Simple parser | Wide browser support | Complete API with Spider |
| Cons | Requires dependencies | Geared towards testing | Lacks JavaScript support |
| Use Case | Extract HTML data from a small number of pages | JavaScript page interactivity, including forms, navigating, etc. | Large professional projects that require advanced features |

Web Scraping Tutorial With Beautiful Soup

In this example, we’ll be web scraping with Python and Beautiful Soup via the Python library BeautifulSoup4.

We’ll scrape ArXiv, an open-access repository of scientific papers in math, physics, computer science, biology, finance, and more. We will focus on artificial intelligence papers by fetching the title, abstract, and authors.

Before we write any code, we need to head over to the webpage to examine the GUI and HTML content. Doing this shows that the papers are listed in a repeated format, as shown in the image below.

Begin By Researching The Webpage Structure

We can then inspect the DOM for this page by using our browser’s developer tools. You can either head to the browser menu or right-click any of the articles on the results page and select inspect, as shown in the image below. 

Access The Webpage DOM Using Developer Tools

We know that we want to scrape specific data fields relating to each paper, so we’ll need to examine the DOM until we find the HTML elements that contain this data.

Examine The Webpage DOM To Find HTML Elements

By examining the HTML elements within the DOM, we can see that the papers are contained within <li> tags with a class of ‘arxiv-result’. This information is essential, and we will use it to scrape the papers from the webpage’s HTML content within our Python program.

The final stage of our research is to look at the inner HTML elements within each of these <li> tags, as shown in the image below. 

We can see that each data element is contained within <p> tags with different class names to identify the different data fields. Again, this is essential information for us, and we will use it within our Python code to scrape this data.

Examine HTML Elements To Find Data Fields
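To make this structure concrete before we write the full program, here's a small sketch that parses a simplified, hypothetical mock of the markup described above with Beautiful Soup (the paper titles and authors are invented for illustration):

```python
from bs4 import BeautifulSoup

# A simplified mock of the arXiv results markup inspected above
html = '''
<ul>
  <li class="arxiv-result">
    <p class="title is-5 mathjax">Paper One</p>
    <p class="authors">Alice, Bob</p>
  </li>
  <li class="arxiv-result">
    <p class="title is-5 mathjax">Paper Two</p>
    <p class="authors">Carol</p>
  </li>
</ul>
'''

soup = BeautifulSoup(html, 'html.parser')
# Each <li class="arxiv-result"> is one paper; drill into its <p> tags
for result in soup.find_all('li', class_='arxiv-result'):
    title = result.find('p', class_='title').get_text(strip=True)
    authors = result.find('p', class_='authors').get_text(strip=True)
    print(title, '-', authors)
```

The real page has more markup inside each result, but the pattern of locating the repeated container tag and then searching within it is exactly what our full program does next.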

We’re now ready to write our Beautiful Soup web scraping program in Python to fetch, process, and store this data in a structured data file.

Looking at the source code below, you’ll see that we must install and import the requests and beautifulsoup4 libraries. We’ve also imported the json and csv modules to save our structured data after we’ve completed our processing.

Source Code:

'''
Python Web Scraping: Beautiful Soup
-------------------------------------------------------------
pip install requests beautifulsoup4
'''


import requests
from bs4 import BeautifulSoup
import json
import csv


# Build the search URL for artificial intelligence papers
base_url = 'https://arxiv.org/search/?'
query = 'query=artificial+intelligence&searchtype=all&source=header'
request_url = base_url + query

response = requests.get(request_url)
response.raise_for_status()  # fail fast on HTTP errors

soup = BeautifulSoup(response.text, 'html.parser')
papers = soup.find_all('li', class_='arxiv-result')

data = []
for paper in papers:
    # The class names below were identified by inspecting the DOM
    title = paper.find('p', class_='title is-5 mathjax').text.strip()
    abstract = paper.find('p', class_='abstract mathjax').text.strip()
    authors = paper.find('p', class_='authors').text.strip()
    paper_data = {
        'Title': title,
        'Abstract': abstract,
        'Authors': authors
    }
    data.append(paper_data)

with open('papers.json', 'w') as f:
    json.dump(data, f)

# newline='' prevents blank rows on Windows
with open('papers.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(["Title", "Abstract", "Authors"])
    for paper in data:
        writer.writerow(paper.values())

We start by fetching the webpage contents with the .get() method from requests. We can pass this as the first argument to the BeautifulSoup constructor, with the second argument being a parser parameter. We’ve selected the HTML parser as our webpage will be HTML content.

By creating a BeautifulSoup object, we can represent the HTML content of the page as a nested data structure object, allowing us to search via tag types and tag properties.

We use the .find_all() method to return all HTML elements within the nested structure that are <li> tags with a class of ‘arxiv-result’ (I told you this would be useful!). We can then iterate over this list of tag objects with a for loop, allowing us to process each article.

By iterating over the articles, we can use the .find() method to extract data fields for each page's title, abstract, and authors.

We use the <p> tag classes we found in our preliminary research to ensure we fetch the correct HTML elements. We also use the .text attribute to ensure we only return the text contained within the HTML tag.

Each loop adds the key information to a dictionary, which is then appended to our data list object to save our scraped data to a data structure. This stage represents the conversion of unstructured webpage data into a structured format via a dictionary and a list.

Once we’ve looped over the articles, we use two approaches to persist our structured data to a file. 

Firstly, we save our data to a JSON file using json.dump() to serialize the data into a JSON format. This can be useful for sharing with other applications that interact with JSON data. 

Secondly, we save our data to a standard CSV file, allowing us to use this with spreadsheet software or other applications. This requires us to loop over each element within our data list to write each element as a new row in the CSV file.
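As an aside, because each paper is already a dictionary, the standard library's csv.DictWriter offers an alternative that writes rows by key rather than by position. A small sketch with invented sample data, using an in-memory buffer in place of a real file:

```python
import csv
import io

# Invented sample records in the same shape our scraper produces
data = [
    {'Title': 'Paper One', 'Abstract': 'First abstract', 'Authors': 'Alice'},
    {'Title': 'Paper Two', 'Abstract': 'Second abstract', 'Authors': 'Bob'},
]

# StringIO stands in for a real file; swap in open('papers.csv', 'w', newline='')
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['Title', 'Abstract', 'Authors'])
writer.writeheader()          # writes the column headers from fieldnames
writer.writerows(data)        # each dict becomes one row, matched by key
print(buffer.getvalue())
```

Matching by key means the output stays correct even if the dictionaries were built with their keys in a different order.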

Check out the official documentation to learn more about Beautiful Soup.

Python Scraping Tutorial With Selenium

In this example, we’ll use the Python library, Selenium, to interact with a website via forms and buttons.

We’ll be using the ChemNetBase website, a comprehensive database of chemical information, including structures, properties, spectra, and reactions used by researchers, students, and professionals in the chemical industry.

We’ll keep it relatively simple by searching for compounds containing the word ‘amino’ in their names or descriptions.

Before we start with any code, we need to conduct preliminary research on the website’s HTML element structure.

Firstly, we head to the search page to examine the GUI, as shown in the image below. This lets us identify the search page elements that will interact with our Python program.

Research GUI Structure of Search Page

We can inspect the DOM for the search page by right-clicking on both the search bar and the search button, then selecting inspect. This highlights the DOM elements we need to interact with, as shown in the image below.

Examine Search Page DOM To Find Input & Button Elements

We can see that the search bar is contained within an <input> element with an id of ‘searchForm:searchTerm1’, and that the search button is contained within a <button> element with an id of ‘searchForm:j_idt101’. This will be essential for our Python program.

The second step is to examine the search results page to find the HTML elements that contain the data we want to scrape. We’ll run a search for the term ‘amino*’ and then right-click any of the results to inspect the DOM, as shown in the image below.

Examine Results Page DOM To Find HTML Elements

We see that the results are contained within a <table> with a <tbody> element that has an id of ‘PRODUCTtabs:resultsForm0:hitrowsTbl_data’. This represents the full set of results, but we’ll need to scrape data with Python by accessing the individual result <tr> elements.

We’re now ready to write our Python program to interact with the website, fetch data, process it, and store it in a structured CSV file.

Looking at the source code below, you’ll see that we must install Selenium and import various classes from the Selenium library.

At the center of this is the Selenium WebDriver, a simple but powerful object-oriented API that lets us programmatically interact with a webpage within a browser.

We’ll be using Google Chrome, which means we need to import Chrome from selenium.webdriver.

We’ll also import the Options class from the selenium.webdriver.chrome.options module. By creating an instance of this, we can set our WebDriver object to be headless, allowing our Chrome browser to run in the background without loading the GUI.

Depending on your OS, you may need to download and install ChromeDriver from the link in the source code. Selenium uses this to control your Chrome browser when interacting with webpages.

If you’d prefer to use Firefox, Safari, or another browser, our source code provides a template you can modify. We’d recommend heading to the official documentation to get specific details about each browser.

You’ll also see that we’ve imported the By class from the selenium.webdriver.common.by module to help locate webpage elements and TimeoutException from the selenium.common.exceptions module to handle webpage timeouts.

Source Code:

'''
Python Web Scraping: Selenium
-------------------------------------------------------------
Follow instructions to Download chromedriver from:
https://chromedriver.chromium.org/downloads/version-selection

pip install selenium
'''


from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
import csv

opts = Options()
opts.add_argument('--headless')  # run Chrome without loading the GUI
timeout = 10
url = 'https://www.chemnetbase.com'

with Chrome(options=opts) as driver:
    driver.get(url)
    driver.implicitly_wait(timeout)
    try:
        # IDs below were identified by inspecting the search page DOM
        search_form = driver.find_element(By.ID, 'searchForm:searchTerm1')
        print('Page is ready!')
        search_form.send_keys('amino*')
        search_button = driver.find_element(By.ID, 'searchForm:j_idt101')
        search_button.click()

    except TimeoutException:
        print('Search page loading took too much time!')

    try:
        result_table = driver.find_element(
            By.ID, 'PRODUCTtabs:resultsForm0:hitrowsTbl_data')
        print('Results are ready!')
        result_rows = result_table.find_elements(By.TAG_NAME, 'tr')
        data = []
        for row in result_rows:
            cols = row.find_elements(By.TAG_NAME, 'td')
            name = cols[1].text
            synonyms = cols[2].text
            molec_form = cols[3].text
            CAS_num = cols[5].text
            data.append([name, synonyms, molec_form, CAS_num])

        # newline='' prevents blank rows on Windows
        with open('amino.csv', 'w', newline='') as f:
            csv_writer = csv.writer(f)
            csv_writer.writerow(
                ['Name', 'Synonyms', 'Molecular Formula', 'CAS Number'])
            csv_writer.writerows(data)

    except TimeoutException:
        print('Results loading took too much time!')

Now we’ve imported all of the necessary modules and classes, we can get cracking! 

We start by creating an Options object and configuring it to run Chrome in headless mode. We must also ensure that the WebDriver allows page elements to load by using implicit waiting with a timeout parameter.

This tells the WebDriver to poll the page’s DOM for a defined period before throwing a TimeoutException. In our example, we’ve set a timeout variable of 10 seconds to use with the .implicitly_wait() method from the WebDriver class.

The main body of the code uses a Python context manager to create (and automatically quit) our WebDriver object. We can then call the .get() method for the WebDriver class to load our URL, followed by the implicit wait method outlined above.

We’ve used a try-except block for the search page to catch any timeouts. If the WebDriver does not find our HTML elements within 10 seconds, it will throw a TimeoutException.

We’ve used the .find_element() method from the WebDriver class to locate our search bar. This uses the .ID attribute from the By class to pass in the HTML element ID for the search bar (from our preliminary research).

After the search bar loads, we can invoke the .send_keys() method to enter ‘amino*’ into the search bar within the headless browser. We can then find the search button the same way before invoking .click() to complete our search and load the results page.

At this stage, we’ve concluded our webpage interactions and are ready to scrape the results.

We’ve again used a try-except block to handle timeouts for the results page. After the results load, we use the .find_element() method from the WebDriver class to locate the results table with the ID we discovered in our research of the DOM.

We then call the .find_elements() method on the table element to return a list of the individual <tr> elements. This requires us to use the .TAG_NAME attribute from the By class.

At this stage, we can iterate over our list of <tr> elements, allowing us to call .find_elements() on each and extract a list of <td> elements. These represent the data fields we want to scrape.

It’s then a matter of indexing the values we want from the list of <td> elements and assigning these to variables. Note that we use the .text attribute to return the text within the tags.

We can then append the variables to a list that houses our results, providing a structured data format for our scraped data.

The final stage is to save our data to a standard CSV file. We do this by looping over each element within our list and writing each as a new row in the CSV file.

If you’d like to learn more about using Selenium, check out the official documentation.

Best Practices for Python Web Scraping

It’s important to follow best practices to ensure your web scraping activities are legal, ethical, efficient, and cause no harm to the websites you want to scrape.

  • Respect Terms of Service & robots.txt: Scrape responsibly and ethically by adhering to a website's terms of service and restrictions in their robots.txt file
  • Proxies & Rotate IP Addresses: Websites may block IP addresses that make excessive requests, so use proxies or rotate your IP address to avoid being blocked
  • Handle CAPTCHAs: CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart) can prevent automated scraping, so you may need to manually solve these or use a CAPTCHA solving service
  • Avoid Excessive Requests: Scraping can strain a website's server if you make a large number of requests in a short time, so be mindful of your load to avoid denial of service
  • Appropriate User Agents: Identify yourself as a web scraper in the user agent field of your HTTP headers to help website owners block your Python scraper if it misbehaves
  • Export Data: After scraping data, you need to store it for future analysis by writing to a database, a file (txt, CSV, JSON, etc), or a programmatic data structure
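On the first point above, Python's standard library includes urllib.robotparser for checking a site's robots.txt rules before you scrape. A minimal sketch, parsing a hypothetical robots.txt offline for illustration (a real program would call rp.set_url() and rp.read() against the live site):

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt, parsed offline for illustration
robots_txt = """User-agent: *
Disallow: /private/
Crawl-delay: 5"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Check whether a given URL may be fetched, and how long to wait between requests
print(rp.can_fetch('*', 'https://example.com/private/page'))  # False
print(rp.can_fetch('*', 'https://example.com/public/page'))   # True
print(rp.crawl_delay('*'))                                    # 5
```

Honoring can_fetch() and crawl_delay() in your scraping loop addresses both the robots.txt and the excessive-requests points in one step.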

Common Data Scraping Challenges

No matter how much experience you have, there are some common challenges faced by anyone using web scraping to gather unstructured data from the Internet.

  • Website Changes: Websites are constantly updating, which can stop your automated web scraping programs from extracting data until you alter your scraping code
  • Blocks or Rate-limits: Many websites try to prevent or limit web scraping with CAPTCHA tests, or IP-based rate-limits or service blocks
  • Authentication: Some websites require logins to access content or data, requiring your Python web scraper to interact with webpage elements before scraping data
  • Poor Quality Data: If web scraping programs collect inaccurate, outdated, or otherwise unreliable data it can impact the quality of any analyses performed on the data
  • Legal Issues: Web scraping can be controversial and is seen by some as a form of unauthorized website access, leading to Terms of Service that prohibit web scraping
  • Ethical Considerations: Web scraping programs can raise ethical concerns if they collect personal data without a user's knowledge or consent

Web Scraping Alternatives

Web scraping is an excellent way to automate and speed up the process of collecting raw data from web pages; however, it is not the only way to gather data from the Internet.

  • Using APIs: More and more websites are providing APIs for you to access and retrieve data from their platforms. This can be a more reliable and efficient way to access data, removing the need to scrape websites directly.
  • Web Data Integration: Some websites offer pre-packaged data feeds that can be easily integrated into your application or platform. This can be a more straightforward way to access data from a website, as it avoids the need to write custom scraping scripts.
  • Manual Data Entry: In some cases, manually entering data from websites into your application or platform can be more efficient. While time-consuming, it may be necessary if the data you need is not readily available through other means.

Conclusion

With an endlessly growing collection of websites containing valuable and often unstructured data, the Internet is a massive informational data resource for industries like data science and business intelligence. Using web scraping techniques, we can write programs to fetch and parse publicly available data from these websites. 

Python web scraping is one of the most popular ways to accomplish this activity due to its intuitive syntax and range of third-party web scraping libraries.

This article covered the basics of how to web scrape with Python, including a comparison of the most popular third-party libraries for web scraping, best practices, and common challenges.

We also covered a detailed example of a Python data scraper with the Beautiful Soup library for a simple HTML website. This allowed us to scrape data, process it, and save it to a structured JSON and CSV file.

We then dove into a more complicated Python web scraping example with Selenium. This required us to interact with web page elements using the Selenium WebDriver to control a headless (no GUI) browser. We conducted a search, scraped the search results, then processed these before saving them to a structured CSV file.

Looking for ways to enhance your Python skills? Read this next:

10 Best Python Frameworks

Frequently Asked Questions

1. Is Python Good for Web Scraping?

Python is a popular choice for web scraping because it’s easy to learn, offers a range of third-party web scraping libraries and HTML parsing tools, and provides excellent documentation and community support for web scraping activities.

2. Which Module Is Used for Web Scraping in Python?

Python offers several modules for web scraping, including requests, Beautiful Soup, Selenium, Scrapy, and more. The best module to use for Python web scraping depends on the scope of your web scraping project, whether you need to interact with JavaScript elements, and your own experience with Python programming. 

3. How Long Does It Take To Learn Python Web Scraping?

This depends on your previous Python programming knowledge and your general understanding of HTML, XML, DOM, and HTTP.

If you are already familiar with these skills, you can start scraping within hours, though it may take a little longer if you’re an absolute beginner. Either way, our examples should allow you to quickly get to grips with the basic skills. Like all programming skills, mastery takes time, and the more time you invest in web scraping, the better you will be. 

4. Is Web Scraping Legal?

Generally, web scraping is not illegal, but there are limitations depending on a website’s Terms of Service, data protection laws, and copyright laws. It is essential to be aware of these legal considerations to ensure your web scraping activities are within the bounds of the law.

5. What Is the Best Web Scraping Language?

Several programming languages can be used for web scraping, including Python, Java, and Ruby. Python is popular due to its libraries and frameworks, while Java is known for stability and performance. Ruby is also a good choice due to its simplicity and range of libraries.

The best language for a particular project depends on your skills and the project requirements.
