Selenium + Beautiful Soup
Christopher G. Healey

Introduction

A common need is web scraping, the ability to copy content from an online web page. In data analytics, this is often used to extract data from a web page into a format amenable to follow-on analysis in languages like Python, R, SAS, or SQL.

At their inception, web pages were a simple combination of Hypertext Markup Language (HTML) to structure and style the content of a page, the Hypertext Transfer Protocol (HTTP) to transfer HTML documents over the Internet, and web browsers like Mosaic to convert HTML into a rendered presentation. HTML supported document structuring: paragraphs, tables, lists, and so on; and text styling: boldface, italics, and other types of visual modification of text.

Modern web pages are very different from their ancestors. Now, pages commonly contain complex styling and programming that a web browser must interpret and execute prior to displaying a page. Two common examples are Cascading Style Sheets (CSS) and Javascript programs to control both the appearance and the functionality of a web page.

In order to web scrape in this new reality, two steps are needed. First, the web page must be read and interpreted into its final format. Selenium is used to do this, since it has the ability to mimic a web browser by reading raw HTML, then performing the execution necessary to convert the HTML into its final format. Although Selenium is designed to perform web page testing, it can also deliver the HTML for a fully rendered page. Once that HTML is available, it needs to be read and parsed. In Python, we use Beautiful Soup to do this.

Even with Selenium and Beautiful Soup, web scraping is non-trivial. For example, web pages often have interactive controls that need to be invoked in a specific order to arrive at the page of interest. Selenium is fully capable of doing this, but the raw HTML must be examined to determine how to uniquely identify the web page controls to manipulate. This information is needed by Selenium to locate and modify widgets on the web page. Once the target page is scraped, the HTML must again be examined to determine how to tell Beautiful Soup what we want to scrape. Well written HTML will have easy-to-locate identifiers for all the main elements on a page. Poorly written HTML will not. Both can be parsed, but poorly written pages require more effort. Regardless, for both Selenium and Beautiful Soup, understanding how HTML works is a prerequisite for scraping most pages. If you need a quick introduction to HTML, refer back to our discussion of HTML5 in the plotly | Dash lessons.

HTML id & class

Extending our understanding of HTML, the most common way to identify a particular section of HTML is through its id or class, two attributes that can be attached to most HTML markup tags. As an example, consider the following simple HTML code.

<p>This is a paragraph.</p>
<p id="id-tag">This is a paragraph with an id attribute of id-tag.</p>
<p class="class-tag">This is a paragraph with a class attribute of class-tag.</p>
<p id="id-tag" class="class-tag">This is a paragraph with an id attribute of id-tag and a class attribute of class-tag.</p>

In HTML, the id attribute is used to identify a specific HTML structure. The class attribute is used to assign a pre-defined class to the structure, usually to style the structure in some common way throughout the document. Both Selenium and Beautiful Soup allow us to select HTML structures based on their id and class attributes, or combinations thereof. This is the most common way of identifying the target structure we wish to extract from a web page.
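
As a preview of what follows, here is a minimal sketch of how each library selects structures by id or class, using the paragraphs above; the html_doc string simply wraps the example HTML for illustration, and the equivalent Selenium calls are shown as comments since no browser has been started yet.

from bs4 import BeautifulSoup

# The example paragraphs above, stored as a string for illustration
html_doc = '''
<p>This is a paragraph.</p>
<p id="id-tag">This is a paragraph with an id attribute of id-tag.</p>
<p class="class-tag">This is a paragraph with a class attribute of class-tag.</p>
'''

tree = BeautifulSoup( html_doc, 'html.parser' )
by_id = tree.find( id='id-tag' )                  # select by id
by_class = tree.find( 'p', class_='class-tag' )   # select by class
print( by_id.text )
print( by_class.text )

# The equivalent Selenium selections on a loaded page would be
#   driver.find_element( By.ID, 'id-tag' )
#   driver.find_element( By.CLASS_NAME, 'class-tag' )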

Selenium

Selenium can be installed as a package in Python from the Anaconda Prompt by typing the following.

conda install -c conda-forge selenium

You may need to run the prompt in Administrator mode to allow conda to update your Anaconda installation. Previously, Selenium also required you to install a webdriver. However, the most recent versions of Selenium include webdrivers for most common web browsers as part of the library. This makes using Selenium much easier.

A webdriver allows you to programmatically drive a web page, exactly as though you were a user. You can ask Selenium to load a page, click elements on the page, fill in text fields, scroll the page, and do any of the other things a real user could do if they were viewing the page in their own web browser. The webdriver itself is a program that runs on your computer and mimics one of the common web browsers. Currently, Selenium includes webdrivers for Chrome, Firefox, Edge, IE, and Safari. Unless you have a need for specific browser compatibility, the driver you choose isn't particularly important.
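
For example, assuming a recent Selenium release that bundles driver management, any of the following lines creates a controllable browser instance; the only requirement is that the corresponding browser is installed on your machine.

from selenium import webdriver

# Each constructor launches a browser instance Selenium can control
driver = webdriver.Chrome()      # Chrome
# driver = webdriver.Firefox()   # Firefox
# driver = webdriver.Edge()      # Edge
# driver = webdriver.Safari()    # Safari (macOS)

driver.get( "https://www.ncsu.edu" )
print( driver.title )            # confirm the page loaded
driver.close()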

At this point, you have everything you need to load Selenium in Python, invoke a controllable version of one of the common web browsers, then use Selenium to load a page and manipulate its contents to navigate to the location where you want to scrape data. At that point, the page's HTML can be retrieved and passed to Beautiful Soup for further processing. Given this, the high-level order of processing for web scraping with Selenium and Beautiful Soup is as follows, with a minimal code skeleton sketched after the list.

  1. Ask Selenium to invoke a webdriver, creating a browser instance that we can control through Selenium.
  2. Load the initial web page in the browser instance.
  3. Use Selenium to navigate from the initial page to the target page exactly as a user would, by clicking buttons, selecting from lists, entering text in text fields, and so on.
  4. Once the target page is loaded, ask Selenium to return the HTML used to represent the target page.
  5. Pass the page's HTML to Beautiful Soup to parse its contents into a searchable parse tree.
  6. Use Beautiful Soup to retrieve information from the web page's parse tree.
  7. Save information scraped from the web page into Python variables for follow-on analysis.
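
A minimal skeleton of this pipeline, with the navigation in step 3 left as a placeholder, might look like the following.

from selenium import webdriver
from bs4 import BeautifulSoup

# 1. Invoke a webdriver to create a controllable browser instance
driver = webdriver.Firefox()

# 2. Load the initial web page
driver.get( "https://www.ncsu.edu" )

# 3. ...navigate to the target page by clicking, entering text, and so on...

# 4. Retrieve the HTML for the fully rendered target page
html = driver.page_source

# 5. Parse the HTML into a searchable parse tree
tree = BeautifulSoup( html, 'html.parser' )

# 6. & 7. Retrieve information from the parse tree and save it
title = tree.find( 'title' )
page_title = title.text if title is not None else ''

driver.close()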

Exploring Web Pages

As noted above, one of the fundamental requirements for Selenium or Beautiful Soup is properly identifying the HTML structures in a web page you want to manipulate or scrape. The easiest way to do this is to load a web page into your favourite web browser, then use the developer tools every browser provides to examine the underlying HTML source code in detail. The discussion in these notes will use Chrome as an example, since it provides a robust set of examination tools. The same functionality can be performed in Firefox, Safari, or any other browser, however, using whatever commands they make available for this type of exploration.

To begin, run Chrome and load NC State's homepage at https://www.ncsu.edu. Next, click on the three vertical dots in the top-right corner of the browser window to reveal Chrome's menu. Choose More tools → Developer tools to bring up Chrome's developer tools (you can also use the keyboard shortcut Ctrl+Shift+I to do the same thing). If this is the first time you've used the developer tools, they will appear in a dock inside the browser window to the left or right. To force the developer tools into their own, separate window, click the three vertical dots in the upper-right corner of the tools dock, and click on the Dock side option that shows two overlapping windows. This will pull the developer tools into a window separate from the browser window.

The developer tools are designed for a variety of tasks, including examining a web page's source code, debugging Javascript code, and confirming resources for the page loaded properly over the network. Since we're interested in examining HTML source code, choose the tab labelled Elements at the top of the page. This shows an overview of the code that makes up NC State's homepage, with exposure triangles to allow you to show and hide more detailed information contained in the page. Move your mouse over the different lines in the source code list. You should see different parts of the main browser window highlight. This shows you which parts of the web page correspond to which parts of the code you are moving over.

Example: Identifying the RESOURCES Link

Notice that if we click on the magnifying glass icon, with the three vertical dots to its right at the top of the page, a panel slides down with additional options to select. If we wanted to do this with Selenium, we would need to determine how to uniquely identify the magnifying glass button. The fastest way to do this is to ask Chrome's developer tools to identify the HTML that controls the icon. To do this, click on the dashed box with an arrow in its lower-right corner in the very upper-right corner of the developer tools window, or hit Ctrl+Shift+C. The page inspector icon should turn blue. Next, hover over the magnifying glass on NC State's homepage. A blue overlay will cover the icon, with an information popup shown underneath it. Click on the icon, and the exact line of HTML source code that controls the magnifying glass will be revealed.

When I hover over the search icon with the Page Inspector active, the svg element with id="Layer_1" is exposed and highlighted in the developer tools window.

Using Chrome's Developer Tools to highlight the HTML controlling the NC State homepage search button

This shows how you can use Chrome's developer tools to walk in and out of the code to find the specific elements you want to manipulate, and what unique identifiers can be used to allow Selenium to manipulate them.
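
For reference, Selenium's By class supports several locator strategies beyond id; a couple are sketched below using the Layer_1 id identified above (the CSS selector is an illustrative assumption based on that id).

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get( "https://www.ncsu.edu" )

# The same element can often be located several ways
btn_by_id  = driver.find_element( By.ID, "Layer_1" )
btn_by_css = driver.find_element( By.CSS_SELECTOR, "#Layer_1" )

# Other strategies include By.CLASS_NAME, By.TAG_NAME, By.LINK_TEXT,
# and By.XPATH, which is discussed later in these notes

driver.close()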

Coding Selenium

Now that we know how to identify the magnifying glass button, how would we use Selenium to automatically select it in a webdriver? The following Python code snippet will create a webdriver, load the NC State homepage, click the magnifying glass link, wait for 5 seconds, then terminate.

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

# Create a browser instance, load the NC State homepage, then click
# the magnifying glass (search) button, identified by its id
driver = webdriver.Firefox()
driver.get( "https://www.ncsu.edu" )
btn = driver.find_element( By.ID, value="Layer_1" )
btn.click()

time.sleep( 5 )
driver.close()

Moving further into our example, suppose I want to select the Directory link at the top of the panel. Some exploring identifies a class of go282567778 attached to the link. The following code duplicates our previous operation, then attempts to click the Directory link by its class.

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get( "https://www.ncsu.edu" )
btn = driver.find_element( By.ID, value="Layer_1" )
btn.click()

elem = driver.find_element( By.CLASS_NAME, value="go282567778" )
elem.click()

time.sleep( 5 )
driver.close()

XPATH

If you run this code, it will not pull up the directory page. This is because if you examine the HTML, either with the Page Inspector or explicitly, you will see every option in the RESOURCES list has an identical class of go282567778. So, if we try to identify the Directory entry by its class, we will get the first entry in the list instead, Academic and Student Affairs.
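
To see the problem concretely, the short sketch below continues from the previous code block (after the panel has been exposed) and uses find_elements, the plural form, which returns every element matching the locator.

…
# Every entry in the panel shares this class, so find_elements returns
# all of them; find_element (singular) simply returns the first match
links = driver.find_elements( By.CLASS_NAME, "go282567778" )
for link in links:
    print( link.text )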

To solve this, we need to look for a span entry that contains the text Directory. This can be done using an XPATH search. What is an XPATH? XPATH is short for XML path, the path used to navigate through the structure of an HTML document. By.XPATH allows you to locate elements using XML path expressions. XPATHs come in two flavours: absolute and relative. An absolute XPATH defines the exact path from the beginning of the HTML page to the page element you want to locate, for example:

/html/body/div[2]/div[1]/h4[1]

This absolute XPATH says start at the root node (/), then find the HTML element (which is the entire HTML for the page), then the body element, the second div in the body, the first div inside the body's second div, and finally the first h4 section heading within that div. Although this allows very explicit selection, it is also labourious, and if the format of the web page changes, the absolute XPATH will break. The much more common alternative is a relative XPATH, which allows searching within the web page for target elements. The basic format of a relative XPATH is:

//tagname[ @attribute = 'value' ]

where:

  //tagname searches the entire document for elements with the given tag name,
  @attribute is the name of an attribute attached to that tag (for example, @id or @class), and
  'value' is the value the attribute must match.

Alternatively, @attribute can be replaced by a function call that the specific tagname understands. We could use this to ask a span element to return its text with text(), then search for a span with the text Directory.

elem = driver.find_element( By.XPATH, "//span[text()='Directory']" )

Given this, we could change our code to find the Directory link using an XPATH, then click it.

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get( "https://www.ncsu.edu" )
btn = driver.find_element( By.ID, value="Layer_1" )
btn.click()

elem = driver.find_element( By.XPATH, "//span[text()='Directory']" )
elem.click()

time.sleep( 5 )
driver.close()

Unfortunately, identifying the Directory link is only one problem we may encounter. You might assume, "If a link is always present, we can just locate it and click it." This may not be true, because the Directory link only becomes available for interaction once the panel containing it is exposed. Trying to click the link programmatically before the panel is revealed produces an ElementNotInteractableException error, telling us the element is not yet available for interaction. Our code above does first click the magnifying glass button to make the panel visible, then finds and clicks the Directory link. Even so, we may still encounter an interaction error. Why would this happen?

When the web browser executes code for complicated web pages like the NC State homepage, it takes time for the operations to complete and the web page to update. Our code is running too quickly, so it asks for a reference to the Directory link before the webdriver has processed our first click and rendered the drop-down panel. This is a very common occurrence during web scraping of dynamic pages.

How can we solve this second error? An obvious way would be to create an infinite loop that locates the Directory link and, if it isn't available for interaction, sleeps for a short period of time, then tries again. This is strongly discouraged, however, since it is inefficient, and it also blocks the Python interpreter from performing any other actions while the sleep command runs. Selenium provides two methods for dealing with this issue: implicit waits and explicit waits. An implicit wait will wait a certain amount of time to locate an element before it gives up and returns an error. An explicit wait will wait a certain amount of time for a specific condition on the web page's contents to evaluate to True before it gives up and returns an error. It is also fairly easy to write our own function that waits a set number of attempts for a target element to become available on the web page before giving up and deciding something has gone wrong.
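
For comparison only, the sketch below shows how an implicit wait is set and what a hand-rolled retry helper might look like; the click_with_retry function is a hypothetical illustration, not something Selenium provides.

…
# Implicit wait: every subsequent find_element call polls for up to
# 5 seconds before raising an error if the element cannot be located
driver.implicitly_wait( 5 )

# A hand-rolled alternative (discouraged): retry a fixed number of
# times, sleeping briefly between attempts
import time
from selenium.common.exceptions import ElementNotInteractableException

def click_with_retry( elem, attempts=10, delay=0.5 ):
    for _ in range( attempts ):
        try:
            elem.click()
            return True
        except ElementNotInteractableException:
            time.sleep( delay )
    return False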

In our situation, an implicit wait will not work, since the Directory link is always present in the HTML whether it is visible (and clickable) or not. This means we need an explicit wait, with a specific condition to wait on. Selenium's expected_conditions class provides numerous conditions to wait for, including waiting until an element is clickable. We will wait up to five seconds for a span element with the text Directory to become clickable.

import time
import sys

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get( "https://www.ncsu.edu" )
btn = driver.find_element( By.ID, value="Layer_1" )
btn.click()

# Wait up to 5 seconds for the Directory link to become clickable
try:
    elem = WebDriverWait( driver, 5 ).until(
        EC.element_to_be_clickable(
            ( By.XPATH, "//span[text()='Directory']" )
        )
    )
except:
    print( 'mainline(), could not expose RESOURCES panel' )
    driver.close()
    sys.exit( 0 )

elem.click()

time.sleep( 5 )
driver.close()

Now, the second click registers as expected and we move to the Campus Directory web page, ready to search for information on campus members.

Programmatically navigating to NC State's Directory page

XPATH contains

As an alternative, we could also use contains to locate an element based on a partial text match. Rather than an XPATH of //tagname[ @attribute='value' ], we can use //tagname[ contains( @attribute, 'partial-value' ) ] to locate an element of type tagname with an attribute whose value contains partial-value, for example:

elem = driver.find_element( By.XPATH, "//span[contains(text(), 'Dir')]" )

This locates the span whose text contains Dir (the start of the Directory label). We could then call elem.click() exactly as before to reveal the Campus Directory web page.

Querying a User

To finish our example, we will enter a last and first name into the appropriate fields on the web form, then click the Search button to retrieve information about the given individual. At that point, we will have arrived at our target page, ready to extract the user's Email address from the resulting information. Fields are populated using the send_keys method, in the following way.

import time
import sys

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get( "https://www.ncsu.edu" )
btn = driver.find_element( By.ID, value="Layer_1" )
btn.click()

try:
    elem = WebDriverWait( driver, 5 ).until(
        EC.element_to_be_clickable(
            ( By.XPATH, "//span[text()='Directory']" )
        )
    )
except:
    print( 'mainline(), could not expose RESOURCES panel' )
    driver.close()
    sys.exit( 0 )

elem.click()

# Fill in the last (sn) and first (gn) name fields on the directory form
form = driver.find_element( By.ID, "sn" )
form.send_keys( "Healey" )
form = driver.find_element( By.ID, "gn" )
form.send_keys( "Christopher" )

# Wait for the Search button to become clickable, then click it
try:
    btn = WebDriverWait( driver, 5 ).until(
        EC.element_to_be_clickable( ( By.CLASS_NAME, "btn-primary" ) )
    )
except:
    print( 'mainline(), could not find Directory Search button' )
    driver.close()
    sys.exit( 0 )

btn.click()
Campus Directory information for Christopher Healey

At this point, if we want to retrieve the value attached to the Email field, we have two options. First, we can do this directly in Selenium. Second, we can ask Selenium to return the HTML for the current page, parse that HTML with Beautiful Soup, then retrieve the value attached to the Email field using Beautiful Soup's parse tree.

Selenium

…
# Wait for the contact information block to become visible
try:
    div = WebDriverWait( driver, 5 ).until(
        EC.visibility_of_element_located(
            ( By.CLASS_NAME, 'person__right' )
        )
    )
except:
    print( 'mainline(), could not find user phone/fax/email' )
    driver.close()
    sys.exit( 0 )

# Split the block's text into tokens and look for an NC State email address
tok = div.text.split()
email = 'unknown'

for s in tok:
    if '@ncsu.edu' in s:
        email = s
        break

print( 'Email address: ' + email )

driver.close()

Beautiful Soup

…
from bs4 import BeautifulSoup

# Parse the current page's HTML into a Beautiful Soup parse tree
tree = BeautifulSoup( driver.page_source, 'html.parser' )
div = tree.find( 'div', class_='person__right' )

# Split the div's text into tokens and look for an NC State email address
tok = div.text.split()
email = 'unknown'

for s in tok:
    if '@ncsu.edu' in s:
        email = s
        break

print( 'Email address: ' + email )

driver.close()

In both cases, the program returns healey@ncsu.edu, which is the correct NC State email for employee Christopher Healey. You might wonder, "This is a lot of work to get someone's email address. Why would we go through all this effort for that result?" In fact, we probably would not. However, suppose we had a list of 1,000 NC State employee first and last names, and we needed an email address for each of them. Doing this manually through the NC State web page would take a significant amount of time. With only a slight modification to the end of our program, however, we could query an email, go back one page, refill the fields to query a new email, and so on until we had all 1,000 emails. Not only would it be fully automated, it would also be much faster than a manual approach. This is the power of Selenium and Beautiful Soup: the ability to automate tedious or manually labourious tasks, even when they involve many dynamic interactions with a web page.
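
As a rough sketch of that modification, the loop below continues from our earlier program; the names list is a hypothetical input, the field ids sn and gn and the btn-primary class are the ones identified above, and in practice an explicit wait for the results page would be added after each search.

…
# Hypothetical list of ( last, first ) names to look up
names = [ ( 'Healey', 'Christopher' ), ( 'Doe', 'Jane' ) ]
emails = { }

for last, first in names:
    sn = driver.find_element( By.ID, 'sn' )
    sn.clear()
    sn.send_keys( last )
    gn = driver.find_element( By.ID, 'gn' )
    gn.clear()
    gn.send_keys( first )
    driver.find_element( By.CLASS_NAME, 'btn-primary' ).click()

    # Parse the results page and pull out the first NC State email
    tree = BeautifulSoup( driver.page_source, 'html.parser' )
    div = tree.find( 'div', class_='person__right' )
    tok = div.text.split() if div is not None else [ ]
    emails[ ( last, first ) ] = next(
        ( s for s in tok if '@ncsu.edu' in s ), 'unknown' )

    driver.back()   # return to the search form for the next query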

National Weather Service Example

As a practical example, we will use Beautiful Soup to scrape and print the extended forecast for Raleigh reported by the National Weather Service (NWS) web site. Since we have already discussed exploring a web page to identify target elements, we will limit our example to using Beautiful Soup to extract the elements we need. We obtained the URL for Raleigh's weather forecast by entering Raleigh, NC into the NWS homepage, producing a URL of https://forecast.weather.gov/MapClick.php?lat=35.7855&lon=-78.6427.

National Weather Service extended forecast for Raleigh, NC

To start, we retrieve the HTML for Raleigh's forecast page and confirm it was returned properly. Since the forecast content is present in the page's raw HTML, we can fetch it directly with the requests library rather than driving a browser with Selenium.

import requests
import sys
from bs4 import BeautifulSoup

page = requests.get( 'https://forecast.weather.gov/MapClick.php?lat=35.7855&lon=-78.6427' )

# A status code of 200 indicates the page was retrieved successfully
if page.status_code != 200:
    print( "mainline(), could not retrieve HTML for Raleigh's weather" )
    sys.exit( 0 )

tree = BeautifulSoup( page.content, 'html.parser' )

The individual extended weather entries are contained in an unordered list with an id of seven-day-forecast-list. Within each list item is a div with a class of tombstone-container. Within this div are four separate paragraphs: (1) the period name (e.g., This Afternoon); (2) an image whose alt tag contains the detailed forecast (this is the same text as in the detailed forecast list, but it appears as a tooltip when you hover over the image); (3) a short description (e.g., Sunny); and (4) a temperature (e.g., High: 48°F). At this point, we have two options. We can combine the first, third, and fourth text items to produce a short summary of the extended weather, or we can extract the alt text of the img in the second paragraph. Below is code for both options.

forecast_list = tree.find( 'ul', id='seven-day-forecast-list' )
item = forecast_list.find_all( 'li', class_='forecast-tombstone' )

for li in item:
    txt = ''
    para = li.find_all( 'p' )
    for html in para:
        # Replace line breaks with spaces, then re-parse to extract plain text
        html = str( html ).replace( '<br/>', ' ' )
        soup = BeautifulSoup( html, 'html.parser' )
        soup_txt = soup.get_text().strip()
        txt += ( soup_txt + '. ' if len( soup_txt ) > 0 else '' )
    print( txt )

Here, we extract the HTML for each list item, convert it to a string, and replace any line break <br/> with a space. Then, we re-use BeautifulSoup to parse the HTML, and ask for the text it contains. Joining these together produces a final set of extended forecast summary lines.

Tonight. Clear. Low: 27 °F.
Wednesday. Sunny. High: 50 °F.
Wednesday Night. Clear. Low: 29 °F.
Thursday. Sunny. High: 56 °F.
Thursday Night. Partly Cloudy. Low: 36 °F.
Friday. Chance Showers. High: 59 °F.
Friday Night. Chance Showers. Low: 42 °F.
Saturday. Mostly Sunny. High: 57 °F.
Saturday Night. Partly Cloudy. Low: 36 °F.

If we instead wanted to use the alt text for the image embedded in the second paragraph, the following code would extract the alt text.

forecast_list = tree.find( 'ul', id='seven-day-forecast-list' )
item = forecast_list.find_all( 'li', class_='forecast-tombstone' )

for li in item:
    para = li.find_all( 'p' )
    # The image in the second paragraph stores the detailed forecast in its alt text
    txt = para[ 1 ].find( 'img' )[ 'alt' ]
    print( txt )

This produces a result similar to the first code block, although with slightly more detail and in a grammatically correct format.

Tonight: Clear, with a low around 27. West wind 7 to 9 mph, with gusts as high as 18 mph.
Wednesday: Sunny, with a high near 50. West wind 6 to 9 mph.
Wednesday Night: Clear, with a low around 29. Light west wind.
Thursday: Sunny, with a high near 56. Calm wind.
Thursday Night: Partly cloudy, with a low around 36. Calm wind.
Friday: A chance of showers, mainly after 1pm. Mostly cloudy, with a high near 59. Chance of precipitation is 30%. New precipitation amounts of less than a tenth of an inch possible.
Friday Night: A chance of showers before 1am. Mostly cloudy, with a low around 42. Chance of precipitation is 30%.
Saturday: Mostly sunny, with a high near 57.
Saturday Night: Partly cloudy, with a low around 36.