Scraping Google Search Results with Python and Selenium using Proxies - A Comprehensive Guide



Scraping Google search results has become one of the most common techniques for gathering data for SEO analysis and research purposes. In this tutorial, we will learn how to scrape Google search results using Python, Selenium, and a proxy.

Introduction to Python Selenium and Proxy

Selenium is a browser automation framework, originally built for web testing, that provides official Python bindings. It is widely used for web scraping because it can simulate user interaction with websites and perform repetitive tasks such as filling forms, clicking buttons, and navigating through pages. Selenium supports several web browsers, including Google Chrome, Mozilla Firefox, and Microsoft Edge.
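As a minimal illustration (a sketch, not part of the scraper itself, assuming Google Chrome is installed and a matching driver is available; recent Selenium releases can fetch one automatically), the following opens a page, prints its title, and closes the browser:

from selenium import webdriver

driver = webdriver.Chrome()              # start a Chrome session
driver.get('https://www.example.com')    # navigate to a page
print(driver.title)                      # read the page title
driver.quit()                            # shut the browser down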

A proxy is an intermediary server that sits between the user's computer and the Internet. It provides security and privacy by masking the user's IP address and location, and it can bypass content filters and firewalls. Proxies can be used to scrape data anonymously and to avoid IP blocking, throttling, and other restrictions.

Installing Dependencies

Before starting the implementation, we need to install some dependencies. First, we install Selenium for Python using pip. Second, we install webdriver_manager, which downloads and manages the ChromeDriver binary that Selenium uses to control Google Chrome, so there is no need to download a driver manually. The csv module we will use to write the scraped data to a CSV file is part of Python's standard library, so it requires no installation.

To install these dependencies, we can use the following pip commands:

pip install selenium
pip install webdriver_manager

Why Use WebDriver Manager Instead of Downloading the Driver Manually?

Using WebDriver Manager's `ChromeDriverManager` class for the Chrome browser ensures that we always have a ChromeDriver version that matches the installed browser, without having to manually download and manage different versions of the driver.

By passing the path returned by `ChromeDriverManager().install()` to `webdriver.Chrome()` (wrapped in a `Service` object in Selenium 4), we tell Selenium which driver binary to use, and WebDriver Manager downloads a matching version if one is not already cached. This makes it easy to set up and use Selenium with Chrome, and it ensures that our code runs consistently across different machines and environments.
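As a quick sketch of what that looks like with Selenium 4 (assuming both selenium and webdriver_manager are installed), the driver can be created as follows:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# WebDriver Manager downloads (or reuses from cache) a ChromeDriver build
# that matches the locally installed Chrome, and returns its path
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)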

Setting up a Proxy

We can use a proxy to hide our IP address and location and to avoid IP blocking and other restrictions. To set up a proxy, we create an instance of the ChromeOptions class and pass the proxy address to Chrome with the add_argument method. In our case, we are using a local proxy server at "http://127.0.0.1" on port "24001". The complete code for setting up the proxy is shown below:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Route all browser traffic through the local proxy server
options = Options()
hostname = "http://127.0.0.1"
port = "24001"
options.add_argument('--proxy-server={}:{}'.format(hostname, port))

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
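To confirm that traffic is really going through the proxy, a quick check (a sketch, assuming the proxy at 127.0.0.1:24001 is running and the public httpbin.org service is reachable) is to load an IP-echo page and compare the reported address with your own:

from selenium.webdriver.common.by import By

# The response body should show the proxy's outgoing IP rather than your real one
driver.get('https://httpbin.org/ip')
print(driver.find_element(By.TAG_NAME, 'body').text)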

Once we have set up the proxy, we can proceed to scrape the Google search results. To do this, we need to create a function that will extract the relevant information from each search result. In our case, we want to extract the title, link, and description of each search result. We can achieve this using XPath expressions to locate the relevant elements in the HTML DOM.

from selenium.webdriver.common.by import By

def scrape_page():
    # Each organic result is a div with class "g" inside the #search container
    for element in driver.find_elements(By.XPATH, '//div[@id="search"]//div[@class="g"]'):
        title = element.find_element(By.XPATH, './/h3').text
        link = element.find_element(By.XPATH, './/div[@class="yuRUbf"]/a').get_attribute('href')
        detail = element.find_element(By.XPATH, './/span[@class="aCOpRe"]').text
        print(title, link, detail)

The above function finds all the search result elements using the XPath expression '//div[@id="search"]//div[@class="g"]', and then extracts the title, link, and detail with more specific XPath expressions. Finally, it prints the extracted information to the console. Note that class names such as "yuRUbf" and "aCOpRe" are generated by Google and change periodically, so these selectors may need to be updated.
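Because not every result block contains all three elements, and because Google's class names change over time, a slightly more defensive variant (a sketch, not part of the original script; it reuses the driver and the By import from above) simply skips results that are missing a field:

from selenium.common.exceptions import NoSuchElementException

def scrape_page_safe():
    for element in driver.find_elements(By.XPATH, '//div[@id="search"]//div[@class="g"]'):
        try:
            title = element.find_element(By.XPATH, './/h3').text
            link = element.find_element(By.XPATH, './/div[@class="yuRUbf"]/a').get_attribute('href')
            detail = element.find_element(By.XPATH, './/span[@class="aCOpRe"]').text
        except NoSuchElementException:
            # Skip ads, "People also ask" boxes, and results whose layout has changed
            continue
        print(title, link, detail)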

We also need to create a function that will loop through a list of keywords and perform the search for each keyword. In our case, we have a list of three keywords: "usa", "pakistan", and "canada". The complete code for the keywords function is shown below:

def keywords():
    key_words = ['usa', 'pakistan', 'canada']
    for keyword in key_words:
        # Clear cookies so each search starts from a clean session
        driver.delete_all_cookies()
        driver.get('https://www.google.com/search?&q={}&num=40'.format(keyword))
        sleep(3)
        scrape_page()

keywords()
driver.close()
f.close()

After deleting all cookies, we use the get method of the webdriver object to navigate to the Google search page for the current keyword. In this case, we are using `https://www.google.com/search?&q={}&num=40` as the URL format and passing in the current keyword through the format method. We also pass num=40 so that Google returns up to 40 results on a single page for each keyword.
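The single-word keywords used here can be interpolated into the URL directly, but keywords containing spaces or special characters should be URL-encoded first. A small sketch using only the standard library (the keyword below is just a hypothetical example):

from urllib.parse import quote_plus

keyword = 'web scraping tutorial'
url = 'https://www.google.com/search?q={}&num=40'.format(quote_plus(keyword))
print(url)  # https://www.google.com/search?q=web+scraping+tutorial&num=40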

After navigating to the search page, we call the scrape_page function, which scrapes the data from the page. The function finds all the search results on the page by calling find_elements with By.XPATH and the `//div[@id="search"]//div[@class="g"]` XPath expression. This expression matches every div with class "g" contained within the div with id "search", which is the container that holds the search results on the page.

For each search result element found, the function extracts the title, link, and detail using more specific XPath expressions with find_element and, for the link, the `get_attribute` method. The extracted information is then printed to the console and written to the output CSV file using the writerow method of the csv.writer object.

Finally, the `driver.close()` method is called to close the web browser and the output CSV file is closed using the close method of the file object.
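Note that if the script raises an exception partway through, those cleanup calls are never reached. A slightly safer pattern (a sketch, not part of the original script) opens the CSV file with a context manager and shuts the browser down in a finally block:

with open('googlesearch.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Title', 'Link', 'Detail'])
    try:
        keywords()      # run the searches; rows are written as they are scraped
    finally:
        driver.quit()   # always close the browser, even if an error occurred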

Here is the complete code to scrape Google search results using Python, Selenium, and a proxy:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import csv
from time import sleep

# Route all browser traffic through the local proxy server
options = Options()
hostname = "http://127.0.0.1"
port = "24001"
options.add_argument('--proxy-server={}:{}'.format(hostname, port))

# Open the output CSV file and write the header row
output_file = 'googlesearch.csv'
f = open(output_file, 'w', newline='')
writer = csv.writer(f, delimiter=',')
writer.writerow(['Title', 'Link', 'Detail'])

# Let WebDriver Manager provide a ChromeDriver that matches the installed Chrome
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

def scrape_page():
    # Each organic result is a div with class "g" inside the #search container
    for element in driver.find_elements(By.XPATH, '//div[@id="search"]//div[@class="g"]'):
        title = element.find_element(By.XPATH, './/h3').text
        link = element.find_element(By.XPATH, './/div[@class="yuRUbf"]/a').get_attribute('href')
        detail = element.find_element(By.XPATH, './/span[@class="aCOpRe"]').text
        print(title, link, detail)
        writer.writerow([title, link, detail])

def keywords():
    key_words = ['usa', 'pakistan', 'canada']
    for keyword in key_words:
        # Clear cookies so each search starts from a clean session
        driver.delete_all_cookies()
        driver.get('https://www.google.com/search?&q={}&num=40'.format(keyword))
        sleep(3)
        scrape_page()

keywords()
driver.close()
f.close()

Conclusion

This script demonstrates how to use Python and Selenium to scrape Google search results for multiple keywords through a proxy. Using a proxy reduces the risk of IP blocking and makes it less likely that our requests are flagged as automated, although it is not a guarantee. The script can also be modified to extract other data from the search results, such as images or videos, making it a useful starting point for a variety of web scraping tasks.