How to web scrape data using Selenium and BeautifulSoup in Python — an example on www.mobile.de (part 1/3)

Introduction

In this tutorial I will show you how I scraped mobile.de to collect data about cars. I hope that even experienced web scrapers will learn something.

The 2nd and 3rd parts will be available at the following links once they are published:

  • part 2: link
  • part 3: [link will be here later]

I am splitting this into a three-part series, otherwise it would be too long to digest everything at once. This article is the 1st part of the series.

All the scripts that I used can be found on the following GitHub link: https://github.com/krinya/mobile_de_scraping

If you want to scrape a website and need help with it, you can contact me.

Import some packages that we are going to use

If you do not have some of these packages, install them using pip/pip3:
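All of the imports below are third-party packages; a typical install line (assuming the current PyPI package names) looks like this:

```shell
# Install the third-party packages used in this tutorial
pip install selenium webdriver-manager beautifulsoup4 pandas numpy tqdm
```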

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
import time
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import re
from random import randrange
from tqdm import tqdm #progress bar

Start web browser using Selenium

By using Selenium we can imitate a real user browsing a website in a web browser. This is useful because with other methods not every element of a given webpage may be loaded, e.g. JavaScript-rendered elements.

We can also set some preferences that can speed up our web scraping. In this case, since I do not want to save images, I force the Selenium browser not to load them.

This is the easiest way to install and start Selenium from Python. As you will see, it opens a Chrome window. If it does not work, please look at other tutorials that discuss how to install and run Selenium.

chrome_options = webdriver.ChromeOptions()
prefs = {"profile.managed_default_content_settings.images": 2}
chrome_options.add_experimental_option("prefs", prefs)

driver = webdriver.Chrome(ChromeDriverManager().install(), chrome_options=chrome_options)
====== WebDriver manager ======
Current google-chrome version is 97.0.4692
Get LATEST driver version for 97.0.4692
Driver [C:\Users\menyh\.wdm\drivers\chromedriver\win32\97.0.4692.71\chromedriver.exe] found in cache
<ipython-input-2-6baf748b4601>:5: DeprecationWarning: use options instead of chrome_options
driver = webdriver.Chrome(ChromeDriverManager().install(), chrome_options=chrome_options)

Scraping

I will show you how to scrape www.mobile.de, the leading secondhand (and new) vehicle marketplace in Europe. Mobile.de is based in Germany, and since the German economy is the biggest in Europe and its car industry is huge (think of Volkswagen, BMW, Mercedes and Audi), the site is one of the most popular car listing websites in the world. It contains millions of advertisements.

My approach to scrape and save car related data is the following:

1st part) Get links for all the make and model combinations. (This is the part presented in this tutorial.)

2nd part) Get the links to the individual car advertisements: in this part we can also record the make and model (and maybe the price) for each link.

3rd part) Get all car-related data for each car based on the links collected in step 2.

[Screenshot: the mobile.de landing page]

Luckily mobile.de has an English UI, not only a German one, so we do not need to translate everything to English and we can grab the English elements directly.

Let's look at how we can use Selenium combined with BeautifulSoup to get the content of the page:

starting_link_to_scrape = "https://www.mobile.de/?lang=en"
driver.get(starting_link_to_scrape)
time.sleep(1)
base_source = driver.page_source
base_soup = BeautifulSoup(base_source, 'html.parser')

The content of the whole landing page (I commented it out because it is too long):

#base_soup

First we need to figure out how to scrape every car's data. Unfortunately, if we just list all the cars on the site we cannot assign the make and model to each car (or at least I was not able to figure out how to do that), so we need another method: we manually grab every car make and model combination and create a URL based on them.

If we inspect the “form-group” element on the page, we can see that it contains all the car makes. We need to grab each make and its id, which is indicated by the “value” attribute in the HTML code.

make_list = base_soup.findAll('div', {'class': 'form-group'})[0]
make_list
<div class="form-group"><label for="qsmakeBuy">Make</label><select "="" class="form-control form-control--dropdown form-control--m mmh-make-incl" id="qsmakeBuy" name="mk"><option selected="selected" value="">Any</option><option class="pmak" value="17200">Mercedes-Benz</option><option class="pmak" value="25200">Volkswagen</option><option class="pmak" value="3500">BMW</option><option class="pmak" value="1900">Audi</option><option disabled="disabled"></option><option value="140">Abarth</option><option value="203">AC</option><option value="375">Acura</option><option value="800">Aixam</option><option value="900">Alfa Romeo</option><option value="1100">ALPINA</option><option value="121">Artega</option><option value="1750">Asia Motors</option><option value="1700">Aston Martin</option><option value="1900">Audi</option><option value="2000">Austin</option><option value="1950">Austin Healey</option><option value="31863">BAIC</option><option value="3100">Bentley</option><option value="3500">BMW</option><option value="3850">Borgward</option><option value="4025">Brilliance</option><option value="4350">Bugatti</option><option value="4400">Buick</option><option value="4700">Cadillac</option><option value="112">Casalini</option><option value="5300">Caterham</option><option value="83">Chatenet</option><option value="5600">Chevrolet</option><option value="5700">Chrysler</option><option value="5900">Citroën</option><option value="6200">Cobra</option><option value="6325">Corvette</option><option value="3">Cupra</option><option value="6600">Dacia</option><option value="6800">Daewoo</option><option value="7000">Daihatsu</option><option value="7400">DeTomaso</option><option value="31864">DFSK</option><option value="7700">Dodge</option><option value="255">Donkervoort</option><option value="235">DS Automobiles</option><option value="8600">Ferrari</option><option value="8800">Fiat</option><option value="172">Fisker</option><option value="9000">Ford</option><option value="205">GAC 
Gonow</option><option value="204">Gemballa</option><option value="9900">GMC</option><option value="122">Grecav</option><option value="186">Hamann</option><option value="10850">Holden</option><option value="11000">Honda</option><option value="11050">Hummer</option><option value="11600">Hyundai</option><option value="11650">Infiniti</option><option value="11900">Isuzu</option><option value="12100">Iveco</option><option value="12400">Jaguar</option><option value="12600">Jeep</option><option value="13200">Kia</option><option value="13450">Koenigsegg</option><option value="13900">KTM</option><option value="14400">Lada</option><option value="14600">Lamborghini</option><option value="14700">Lancia</option><option value="14800">Land Rover</option><option value="14845">Landwind</option><option value="15200">Lexus</option><option value="15400">Ligier</option><option value="15500">Lincoln</option><option value="15900">Lotus</option><option value="16200">Mahindra</option><option value="16600">Maserati</option><option value="16700">Maybach</option><option value="16800">Mazda</option><option value="137">McLaren</option><option value="17200">Mercedes-Benz</option><option value="17300">MG</option><option value="30011">Microcar</option><option value="17500">MINI</option><option value="17700">Mitsubishi</option><option value="17900">Morgan</option><option value="18700">Nissan</option><option value="18875">NSU</option><option value="18975">Oldsmobile</option><option value="19000">Opel</option><option value="149">Pagani</option><option value="19300">Peugeot</option><option value="19600">Piaggio</option><option value="19800">Plymouth</option><option value="4">Polestar</option><option value="20000">Pontiac</option><option value="20100">Porsche</option><option value="20200">Proton</option><option value="20700">Renault</option><option value="21600">Rolls-Royce</option><option value="21700">Rover</option><option value="125">Ruf</option><option value="21800">Saab</option><option 
value="22000">Santana</option><option value="22500">Seat</option><option value="22900">Skoda</option><option value="23000">Smart</option><option value="188">speedART</option><option value="100">Spyker</option><option value="23100">Ssangyong</option><option value="23500">Subaru</option><option value="23600">Suzuki</option><option value="23800">Talbot</option><option value="23825">Tata</option><option value="189">TECHART</option><option value="135">Tesla</option><option value="24100">Toyota</option><option value="24200">Trabant</option><option value="24400">Triumph</option><option value="24500">TVR</option><option value="25200">Volkswagen</option><option value="25100">Volvo</option><option value="25300">Wartburg</option><option value="113">Westfield</option><option value="25650">Wiesmann</option><option value="1400">Other</option></select></div>

Get all the make elements:

one_make = make_list.findAll('option')

Put all the make and their corresponding ids to a Pandas DataFrame:

car_make = []
id1 = []

for i in tqdm(range(len(one_make))):
    # tqdm shows the progress of the loop; it is easy to use
    car_make.append(one_make[i].text.strip())
    try:
        id1.append(one_make[i]['value'])
    except KeyError:
        # some <option> tags have no value attribute
        id1.append('')

car_base_make_data = pd.DataFrame({
    'car_make': car_make,
    'id1': id1
})
100%|██████████| 124/124 [00:00<00:00, 125112.75it/s]
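The try/except above handles `<option>` tags without a `value` attribute. BeautifulSoup tags also support dict-style `.get()` with a default, so the same extraction can be written without exception handling. A small self-contained sketch on a toy dropdown (the HTML here is made up to mimic the mobile.de selector):

```python
from bs4 import BeautifulSoup
import pandas as pd

# A toy dropdown mimicking the structure of the mobile.de make selector
html = """
<div class="form-group">
  <select name="mk">
    <option selected="selected" value="">Any</option>
    <option value="1900">Audi</option>
    <option>Separator</option>
  </select>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
options = soup.find("div", {"class": "form-group"}).findAll("option")

# Tag.get() returns the attribute value or the default, no try/except needed
make_df = pd.DataFrame({
    "car_make": [o.text.strip() for o in options],
    "id1": [o.get("value", "") for o in options],
})
print(make_df)
```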

Let's check whether we did it right:

car_base_make_data
print(list(car_base_make_data['car_make']))
print(list(car_base_make_data['id1']))
['Any', 'Mercedes-Benz', 'Volkswagen', 'BMW', 'Audi', '', 'Abarth', 'AC', 'Acura', 'Aixam', 'Alfa Romeo', 'ALPINA', 'Artega', 'Asia Motors', 'Aston Martin', 'Audi', 'Austin', 'Austin Healey', 'BAIC', 'Bentley', 'BMW', 'Borgward', 'Brilliance', 'Bugatti', 'Buick', 'Cadillac', 'Casalini', 'Caterham', 'Chatenet', 'Chevrolet', 'Chrysler', 'Citroën', 'Cobra', 'Corvette', 'Cupra', 'Dacia', 'Daewoo', 'Daihatsu', 'DeTomaso', 'DFSK', 'Dodge', 'Donkervoort', 'DS Automobiles', 'Ferrari', 'Fiat', 'Fisker', 'Ford', 'GAC Gonow', 'Gemballa', 'GMC', 'Grecav', 'Hamann', 'Holden', 'Honda', 'Hummer', 'Hyundai', 'Infiniti', 'Isuzu', 'Iveco', 'Jaguar', 'Jeep', 'Kia', 'Koenigsegg', 'KTM', 'Lada', 'Lamborghini', 'Lancia', 'Land Rover', 'Landwind', 'Lexus', 'Ligier', 'Lincoln', 'Lotus', 'Mahindra', 'Maserati', 'Maybach', 'Mazda', 'McLaren', 'Mercedes-Benz', 'MG', 'Microcar', 'MINI', 'Mitsubishi', 'Morgan', 'Nissan', 'NSU', 'Oldsmobile', 'Opel', 'Pagani', 'Peugeot', 'Piaggio', 'Plymouth', 'Polestar', 'Pontiac', 'Porsche', 'Proton', 'Renault', 'Rolls-Royce', 'Rover', 'Ruf', 'Saab', 'Santana', 'Seat', 'Skoda', 'Smart', 'speedART', 'Spyker', 'Ssangyong', 'Subaru', 'Suzuki', 'Talbot', 'Tata', 'TECHART', 'Tesla', 'Toyota', 'Trabant', 'Triumph', 'TVR', 'Volkswagen', 'Volvo', 'Wartburg', 'Westfield', 'Wiesmann', 'Other']
['', '17200', '25200', '3500', '1900', '', '140', '203', '375', '800', '900', '1100', '121', '1750', '1700', '1900', '2000', '1950', '31863', '3100', '3500', '3850', '4025', '4350', '4400', '4700', '112', '5300', '83', '5600', '5700', '5900', '6200', '6325', '3', '6600', '6800', '7000', '7400', '31864', '7700', '255', '235', '8600', '8800', '172', '9000', '205', '204', '9900', '122', '186', '10850', '11000', '11050', '11600', '11650', '11900', '12100', '12400', '12600', '13200', '13450', '13900', '14400', '14600', '14700', '14800', '14845', '15200', '15400', '15500', '15900', '16200', '16600', '16700', '16800', '137', '17200', '17300', '30011', '17500', '17700', '17900', '18700', '18875', '18975', '19000', '149', '19300', '19600', '19800', '4', '20000', '20100', '20200', '20700', '21600', '21700', '125', '21800', '22000', '22500', '22900', '23000', '188', '100', '23100', '23500', '23600', '23800', '23825', '189', '135', '24100', '24200', '24400', '24500', '25200', '25100', '25300', '113', '25650', '1400']

Yes, luckily we did. We just need to drop some values that we do not need and remove duplicates:

car_make_filter_out = ['Any', 'Other', '']
car_base_make_data = car_base_make_data[~car_base_make_data.car_make.isin(car_make_filter_out)]
car_base_make_data = car_base_make_data.drop_duplicates()
car_base_make_data

We ended up with 117 different car makes. You can probably recognize some of them.

Now that we have all the car makes, we need to get all the models and their ids for each make separately. To do this, we use Selenium to select and click each make in the make dropdown list. This activates the dropdown menu next to it, which lists all the models for the given make, and we scrape those separately.

Again, I will do this in a loop and create a Pandas DataFrame:

car_base_model_data = pd.DataFrame()

for one_make in tqdm(car_base_make_data['car_make'], "Progress: "):

    # build the XPath of the dropdown option for this make and click it
    make_string = "//select[@name='mk']/option[text()='{}']".format(one_make)
    driver.find_element_by_xpath(make_string).click()

    time.sleep(3) # wait for the page to load

    base_source = driver.page_source
    base_soup = BeautifulSoup(base_source, 'html.parser')

    model_list = base_soup.findAll('div', {'class': 'form-group'})[1]
    models = model_list.findAll('option')

    car_model = []
    id2 = []

    for i in range(len(models)):
        car_model.append(models[i].text.strip())
        try:
            id2.append(models[i]['value'])
        except KeyError:
            id2.append('')

    car_base_model_data_aux = pd.DataFrame({'car_model': car_model, 'id2': id2})
    car_base_model_data_aux['car_make'] = one_make

    car_base_model_data = pd.concat([car_base_model_data, car_base_model_data_aux], ignore_index=True)
Progress: 100%|██████████| 117/117 [06:25<00:00, 3.30s/it]
car_base_model_data = car_base_model_data.drop_duplicates()
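One thing to watch in the XPath above: the make name is substituted into a single-quoted string literal, so a make containing an apostrophe would break the expression. None of the makes in this list do, but a small helper (plain Python, no browser needed) makes that assumption explicit:

```python
# Build the same XPath as in the loop and sanity-check the quoting.
def make_xpath(one_make):
    # Assumes the make name contains no single quote; otherwise the
    # XPath string literal would terminate early.
    assert "'" not in one_make, "make name would break the XPath literal"
    return "//select[@name='mk']/option[text()='{}']".format(one_make)

print(make_xpath("Austin"))
# "Alfa Romeo" contains a space but no quote, so it is fine too
print(make_xpath("Alfa Romeo"))
```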

Let's look at one make. In this case I looked at Austin cars. Notice that some of the car models do not have an id. That is fine, we will handle them later.

car_base_model_data[car_base_model_data['car_make'] == "Austin"]
car_data_base = pd.merge(car_base_make_data, car_base_model_data, left_on= ['car_make'], right_on=['car_make'], how = 'right')

Drop the rows that do not have ids:

car_data_base = car_data_base[~car_data_base.id2.isin([""])]
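The merge above uses `how='right'`, which keeps every model row and attaches the make's `id1` to it; the filter then drops model rows without an id. Here is roughly what that does on two tiny hand-made frames (values made up for illustration):

```python
import pandas as pd

# Tiny stand-ins for the two frames built above
makes = pd.DataFrame({"car_make": ["Audi"], "id1": ["1900"]})
models = pd.DataFrame({
    "car_make": ["Audi", "Audi"],
    "car_model": ["Any", "RS3"],
    "id2": ["", "36"],
})

# how='right' keeps every model row and attaches the make's id1
merged = pd.merge(makes, models, on="car_make", how="right")

# drop rows without a model id, as in the step above
merged = merged[~merged.id2.isin([""])]
print(merged)
```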

We can also notice that some ids are not just a number but have a letter at the very beginning of the id string. It turns out those correspond to a group of models. For example, there is a BMW 1-series group, and within this category there are BMW 116, BMW 118 and BMW 120; the BMW 1-series group itself is also listed as a separate category. These group categories have ids that start with a letter. In this example script I decided to scrape only the more granular categories, so I drop the ids that start with a letter (i.e. those that are not numeric strings):

car_data_base = car_data_base[car_data_base.id2.apply(lambda x: x.isnumeric())]
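`str.isnumeric()` is what separates the two kinds of ids; a quick illustration (the group id "C15" here is made up, the real letter-prefixed ids vary):

```python
# isnumeric() is True only for purely numeric strings, so a hypothetical
# group id like "C15" is filtered out while "36" and "1900" are kept
ids = ["36", "C15", "1900"]
granular = [x for x in ids if x.isnumeric()]
print(granular)  # ['36', '1900']
```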

Create a link:

car_data_base['link'] = "https://suchen.mobile.de/fahrzeuge/search.html?dam=0&isSearchRequest=true&ms=" + car_data_base['id1'] + ";" + car_data_base['id2'] + "&ref=quickSearch&sfmr=false&vc=Car"
car_data_base = car_data_base.reset_index(drop = True)
car_data_base
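For a single make/model pair, the link template boils down to one formatted string; here it is rebuilt with the Audi RS3 ids from the example below (1900 = Audi, 36 = RS3):

```python
# Rebuild the search URL for a single make/model id pair
base = ("https://suchen.mobile.de/fahrzeuge/search.html"
        "?dam=0&isSearchRequest=true&ms={id1};{id2}"
        "&ref=quickSearch&sfmr=false&vc=Car")
link = base.format(id1="1900", id2="36")
print(link)
```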

We are done with the first part

Look at one random example:

random_number = randrange(len(car_data_base))
print("make: ", car_data_base['car_make'][random_number], " model: ", car_data_base['car_model'][random_number])
print(car_data_base['link'][random_number])
make: Audi model: RS3
https://suchen.mobile.de/fahrzeuge/search.html?dam=0&isSearchRequest=true&ms=1900;36&ref=quickSearch&sfmr=false&vc=Car

As you can see, we generated a link that we can later use to visit that page, which is exactly what we wanted to achieve.

Save CSV

We can save our data to a .csv file or upload it to a database, so we do not need to re-scrape the site every time we want to use this list.

car_data_base.to_csv("data/make_and_model_links.csv", encoding='utf-8', index=False)
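Note that the `data/` folder must already exist, since `to_csv` does not create directories. A quick round-trip through an in-memory buffer shows the file layout (one toy row; the real file has one row per make/model combination):

```python
import io
import pandas as pd

# One made-up row with the same columns as car_data_base
df = pd.DataFrame({
    "car_make": ["Audi"], "id1": ["1900"],
    "car_model": ["RS3"], "id2": ["36"],
    "link": ["https://suchen.mobile.de/..."],
})

buf = io.StringIO()
df.to_csv(buf, index=False)      # same call as above, minus the file path
buf.seek(0)
reloaded = pd.read_csv(buf, dtype=str)  # dtype=str keeps ids as strings
print(reloaded.columns.tolist())
```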

We are basically finished. I will just put everything above together into one function, so we only need to call that one function to scrape the data.

Put everything together into one function

def get_all_make_model(mobile_de_eng_base_link="https://www.mobile.de/?lang=en", save_filename="make_and_model_links.csv"):

    from selenium import webdriver
    from webdriver_manager.chrome import ChromeDriverManager
    from selenium.webdriver.chrome.options import Options
    import time
    from bs4 import BeautifulSoup
    import pandas as pd
    import numpy as np
    import re
    from random import randrange
    from tqdm import tqdm # progress bar

    # start the browser without loading images
    chrome_options = webdriver.ChromeOptions()
    prefs = {"profile.managed_default_content_settings.images": 2}
    chrome_options.add_experimental_option("prefs", prefs)

    driver = webdriver.Chrome(ChromeDriverManager().install(), chrome_options=chrome_options)

    driver.get(mobile_de_eng_base_link)
    time.sleep(3)
    base_source = driver.page_source
    base_soup = BeautifulSoup(base_source, 'html.parser')

    # collect the makes and their ids
    make_list = base_soup.findAll('div', {'class': 'form-group'})[0]
    one_make = make_list.findAll('option')

    car_make = []
    id1 = []

    for i in range(len(one_make)):
        car_make.append(one_make[i].text.strip())
        try:
            id1.append(one_make[i]['value'])
        except KeyError:
            id1.append('')

    car_base_make_data = pd.DataFrame({'car_make': car_make, 'id1': id1})

    car_make_filter_out = ['Any', 'Other', '']
    car_base_make_data = car_base_make_data[~car_base_make_data.car_make.isin(car_make_filter_out)]
    car_base_make_data = car_base_make_data.drop_duplicates()
    car_base_make_data = car_base_make_data.reset_index(drop=True)

    # collect the models and their ids, make by make
    car_base_model_data = pd.DataFrame()

    for one_make in tqdm(car_base_make_data['car_make'], "Progress: "):

        make_string = "//select[@name='mk']/option[text()='{}']".format(one_make)
        driver.find_element_by_xpath(make_string).click()
        time.sleep(3)

        base_source = driver.page_source
        base_soup = BeautifulSoup(base_source, 'html.parser')

        model_list = base_soup.findAll('div', {'class': 'form-group'})[1]
        models = model_list.findAll('option')

        car_model = []
        id2 = []

        for i in range(len(models)):
            car_model.append(models[i].text.strip())
            try:
                id2.append(models[i]['value'])
            except KeyError:
                id2.append('')

        car_base_model_data_aux = pd.DataFrame({'car_model': car_model, 'id2': id2})
        car_base_model_data_aux['car_make'] = one_make

        car_base_model_data = pd.concat([car_base_model_data, car_base_model_data_aux], ignore_index=True)

    # merge, filter and build the search links
    car_data_base = pd.merge(car_base_make_data, car_base_model_data, left_on=['car_make'], right_on=['car_make'], how='right')
    car_data_base = car_data_base[~car_data_base.id2.isin([""])]
    car_data_base = car_data_base[car_data_base.id2.apply(lambda x: x.isnumeric())]
    car_data_base = car_data_base.drop_duplicates()

    car_data_base['link'] = "https://suchen.mobile.de/fahrzeuge/search.html?dam=0&isSearchRequest=true&ms=" + car_data_base['id1'] + ";" + car_data_base['id2'] + "&ref=quickSearch&sfmr=false&vc=Car"
    car_data_base = car_data_base.reset_index(drop=True)

    if len(save_filename) > 0:
        car_data_base.to_csv(save_filename, encoding='utf-8', index=False)

    return car_data_base
all_data = get_all_make_model(save_filename="data/make_and_model_links.csv")
====== WebDriver manager ======
Current google-chrome version is 97.0.4692
Get LATEST driver version for 97.0.4692
Driver [C:\Users\menyh\.wdm\drivers\chromedriver\win32\97.0.4692.71\chromedriver.exe] found in cache
<ipython-input-19-66173c854f05>:18: DeprecationWarning: use options instead of chrome_options
driver = webdriver.Chrome(ChromeDriverManager().install(), chrome_options=chrome_options)
Progress: 100%|██████████| 117/117 [06:17<00:00, 3.23s/it]
all_data

And we are done. We have a table with all the make and model combinations and their corresponding links.

Next

In the next part, the 2nd of this web scraping series, I will show you how to get the links to the individual ads for each of these make and model combinations. Then in the 3rd part we will collect data about the ads themselves.

I hope you will check out those too.

Once they are published, you will find them at the links listed at the beginning of this article.
