How to web scrape data using Selenium and BeautifulSoup in Python — an example on www.mobile.de (part 1/3)
Introduction
In this tutorial I will show you how I scraped mobile.de to collect data about cars. I hope that even experienced web scrapers will learn something.
The 2nd and 3rd parts will be found at the following links once they are published:
- part 2: link
- part 3: [link will be here later]
I am planning to make this a 3-part article, otherwise it would be too long to digest everything at once. This article is the 1st part of the series.
All the scripts that I used can be found on the following GitHub link: https://github.com/krinya/mobile_de_scraping
If you want to scrape a website and you need my help you can contact me.
Import some packages that we are going to use
If you do not have any of these packages, install them using pip/pip3.
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
import time
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import re
from random import randrange
from tqdm import tqdm #progress bar
Start web browser using Selenium
By using Selenium we can imitate browsing a website in a real web browser, just like any user would. This is useful because with other methods not every element of a given webpage may get loaded — e.g. JavaScript-rendered elements.
We can also set some preferences that can speed up our web scraping. In this case, since I do not want to save images, I tell the Selenium browser not to load them.
This is the easiest way to install and start Selenium from Python. As you will see, it will open a Chrome window. (Note that on newer Selenium versions you should pass these settings via the `options` keyword instead of `chrome_options`, hence the DeprecationWarning in the output below.) If it is not working, please look at other tutorials that discuss how to install and run Selenium.
chrome_options = webdriver.ChromeOptions()
prefs = {"profile.managed_default_content_settings.images": 2}
chrome_options.add_experimental_option("prefs", prefs)
driver = webdriver.Chrome(ChromeDriverManager().install(), chrome_options=chrome_options)
====== WebDriver manager ======
Current google-chrome version is 97.0.4692
Get LATEST driver version for 97.0.4692
Driver [C:\Users\menyh\.wdm\drivers\chromedriver\win32\97.0.4692.71\chromedriver.exe] found in cache
<ipython-input-2-6baf748b4601>:5: DeprecationWarning: use options instead of chrome_options
driver = webdriver.Chrome(ChromeDriverManager().install(), chrome_options=chrome_options)
Scraping
I will show you how to scrape a website based on www.mobile.de, which is the leading secondhand (and new) vehicle marketplace in Europe. Mobile.de is from Germany, and since the German economy is the biggest in Europe and its car industry is huge (think of Volkswagen, BMW, Mercedes and Audi), the site is one of the most popular car listing websites in the world. It contains millions of advertisements.
My approach to scrape and save car related data is the following:
1st part) Get links for all the make and model combinations. (This is the part that I am presenting in this tutorial.)
2nd part) Get the links for each car advertisement: in this part we can also get the car make and model (and maybe the price) for each link.
3rd part) Get all car-related data for each car based on the links that we collected in step 2.
Go to the landing page of mobile.de

Luckily mobile.de has an English UI, not only a German one, so we do not need to translate everything to English and we can grab the English elements.
Let's look at how we can use Selenium combined with BeautifulSoup to get the content of the page:
starting_link_to_scrape = "https://www.mobile.de/?lang=en"
driver.get(starting_link_to_scrape)
time.sleep(1)
base_source = driver.page_source
base_soup = BeautifulSoup(base_source, 'html.parser')
The content of the whole landing page (I commented it out because it is too long):
#base_soup
First we need to figure out how to scrape every car's data. Unfortunately, if we just list all the cars on the site, we cannot assign the make and the model to the cars (or at least I was not able to figure out how to do that), so we need another method: we manually grab every car make and model combination and create a URL based on them.
Get all make and their id
If we inspect the “form-group” element on the page, we can see that it contains all the car makes. We need to get these and their ids, which are indicated by the “value” attribute in the HTML code.
make_list = base_soup.findAll('div', {'class': 'form-group'})[0]
make_list
<div class="form-group"><label for="qsmakeBuy">Make</label><select "="" class="form-control form-control--dropdown form-control--m mmh-make-incl" id="qsmakeBuy" name="mk"><option selected="selected" value="">Any</option><option class="pmak" value="17200">Mercedes-Benz</option><option class="pmak" value="25200">Volkswagen</option><option class="pmak" value="3500">BMW</option><option class="pmak" value="1900">Audi</option><option disabled="disabled"></option><option value="140">Abarth</option><option value="203">AC</option><option value="375">Acura</option><option value="800">Aixam</option><option value="900">Alfa Romeo</option><option value="1100">ALPINA</option><option value="121">Artega</option><option value="1750">Asia Motors</option><option value="1700">Aston Martin</option><option value="1900">Audi</option><option value="2000">Austin</option><option value="1950">Austin Healey</option><option value="31863">BAIC</option><option value="3100">Bentley</option><option value="3500">BMW</option><option value="3850">Borgward</option><option value="4025">Brilliance</option><option value="4350">Bugatti</option><option value="4400">Buick</option><option value="4700">Cadillac</option><option value="112">Casalini</option><option value="5300">Caterham</option><option value="83">Chatenet</option><option value="5600">Chevrolet</option><option value="5700">Chrysler</option><option value="5900">Citroën</option><option value="6200">Cobra</option><option value="6325">Corvette</option><option value="3">Cupra</option><option value="6600">Dacia</option><option value="6800">Daewoo</option><option value="7000">Daihatsu</option><option value="7400">DeTomaso</option><option value="31864">DFSK</option><option value="7700">Dodge</option><option value="255">Donkervoort</option><option value="235">DS Automobiles</option><option value="8600">Ferrari</option><option value="8800">Fiat</option><option value="172">Fisker</option><option value="9000">Ford</option><option value="205">GAC 
Gonow</option><option value="204">Gemballa</option><option value="9900">GMC</option><option value="122">Grecav</option><option value="186">Hamann</option><option value="10850">Holden</option><option value="11000">Honda</option><option value="11050">Hummer</option><option value="11600">Hyundai</option><option value="11650">Infiniti</option><option value="11900">Isuzu</option><option value="12100">Iveco</option><option value="12400">Jaguar</option><option value="12600">Jeep</option><option value="13200">Kia</option><option value="13450">Koenigsegg</option><option value="13900">KTM</option><option value="14400">Lada</option><option value="14600">Lamborghini</option><option value="14700">Lancia</option><option value="14800">Land Rover</option><option value="14845">Landwind</option><option value="15200">Lexus</option><option value="15400">Ligier</option><option value="15500">Lincoln</option><option value="15900">Lotus</option><option value="16200">Mahindra</option><option value="16600">Maserati</option><option value="16700">Maybach</option><option value="16800">Mazda</option><option value="137">McLaren</option><option value="17200">Mercedes-Benz</option><option value="17300">MG</option><option value="30011">Microcar</option><option value="17500">MINI</option><option value="17700">Mitsubishi</option><option value="17900">Morgan</option><option value="18700">Nissan</option><option value="18875">NSU</option><option value="18975">Oldsmobile</option><option value="19000">Opel</option><option value="149">Pagani</option><option value="19300">Peugeot</option><option value="19600">Piaggio</option><option value="19800">Plymouth</option><option value="4">Polestar</option><option value="20000">Pontiac</option><option value="20100">Porsche</option><option value="20200">Proton</option><option value="20700">Renault</option><option value="21600">Rolls-Royce</option><option value="21700">Rover</option><option value="125">Ruf</option><option value="21800">Saab</option><option 
value="22000">Santana</option><option value="22500">Seat</option><option value="22900">Skoda</option><option value="23000">Smart</option><option value="188">speedART</option><option value="100">Spyker</option><option value="23100">Ssangyong</option><option value="23500">Subaru</option><option value="23600">Suzuki</option><option value="23800">Talbot</option><option value="23825">Tata</option><option value="189">TECHART</option><option value="135">Tesla</option><option value="24100">Toyota</option><option value="24200">Trabant</option><option value="24400">Triumph</option><option value="24500">TVR</option><option value="25200">Volkswagen</option><option value="25100">Volvo</option><option value="25300">Wartburg</option><option value="113">Westfield</option><option value="25650">Wiesmann</option><option value="1400">Other</option></select></div>
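Before looping over the real dropdown, here is the same extraction on a tiny hand-made snippet of it, so you can see what we are pulling out of each `option` tag (the values are real ids from the dump above; `.get('value', '')` is just a compact stand-in for the try/except used in the real loop):

```python
from bs4 import BeautifulSoup

# A hand-made miniature of the make dropdown
snippet = """<select name="mk">
<option selected="selected" value="">Any</option>
<option class="pmak" value="1900">Audi</option>
<option disabled="disabled"></option>
<option value="3500">BMW</option>
</select>"""

soup = BeautifulSoup(snippet, 'html.parser')

# Options without a "value" attribute get an empty id
pairs = [(opt.text.strip(), opt.get('value', '')) for opt in soup.findAll('option')]
print(pairs)  # → [('Any', ''), ('Audi', '1900'), ('', ''), ('BMW', '3500')]
```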
Get all the make elements:
one_make = make_list.findAll('option')
Put all the make and their corresponding ids to a Pandas DataFrame:
car_make = []
id1 = []
for i in tqdm(range(len(one_make))):
    # tqdm shows a progress bar for the loop, it's easy to use
    car_make.append(one_make[i].text.strip())
    try:
        id1.append(one_make[i]['value'])
    except KeyError:
        # some options have no "value" attribute
        id1.append('')

car_base_make_data = pd.DataFrame({
    'car_make': car_make,
    'id1': id1
})
100%|██████████| 124/124 [00:00<00:00, 125112.75it/s]
Let's check whether we did it right:
car_base_make_data
print(list(car_base_make_data['car_make']))
print(list(car_base_make_data['id1']))
['Any', 'Mercedes-Benz', 'Volkswagen', 'BMW', 'Audi', '', 'Abarth', 'AC', 'Acura', 'Aixam', 'Alfa Romeo', 'ALPINA', 'Artega', 'Asia Motors', 'Aston Martin', 'Audi', 'Austin', 'Austin Healey', 'BAIC', 'Bentley', 'BMW', 'Borgward', 'Brilliance', 'Bugatti', 'Buick', 'Cadillac', 'Casalini', 'Caterham', 'Chatenet', 'Chevrolet', 'Chrysler', 'Citroën', 'Cobra', 'Corvette', 'Cupra', 'Dacia', 'Daewoo', 'Daihatsu', 'DeTomaso', 'DFSK', 'Dodge', 'Donkervoort', 'DS Automobiles', 'Ferrari', 'Fiat', 'Fisker', 'Ford', 'GAC Gonow', 'Gemballa', 'GMC', 'Grecav', 'Hamann', 'Holden', 'Honda', 'Hummer', 'Hyundai', 'Infiniti', 'Isuzu', 'Iveco', 'Jaguar', 'Jeep', 'Kia', 'Koenigsegg', 'KTM', 'Lada', 'Lamborghini', 'Lancia', 'Land Rover', 'Landwind', 'Lexus', 'Ligier', 'Lincoln', 'Lotus', 'Mahindra', 'Maserati', 'Maybach', 'Mazda', 'McLaren', 'Mercedes-Benz', 'MG', 'Microcar', 'MINI', 'Mitsubishi', 'Morgan', 'Nissan', 'NSU', 'Oldsmobile', 'Opel', 'Pagani', 'Peugeot', 'Piaggio', 'Plymouth', 'Polestar', 'Pontiac', 'Porsche', 'Proton', 'Renault', 'Rolls-Royce', 'Rover', 'Ruf', 'Saab', 'Santana', 'Seat', 'Skoda', 'Smart', 'speedART', 'Spyker', 'Ssangyong', 'Subaru', 'Suzuki', 'Talbot', 'Tata', 'TECHART', 'Tesla', 'Toyota', 'Trabant', 'Triumph', 'TVR', 'Volkswagen', 'Volvo', 'Wartburg', 'Westfield', 'Wiesmann', 'Other']
['', '17200', '25200', '3500', '1900', '', '140', '203', '375', '800', '900', '1100', '121', '1750', '1700', '1900', '2000', '1950', '31863', '3100', '3500', '3850', '4025', '4350', '4400', '4700', '112', '5300', '83', '5600', '5700', '5900', '6200', '6325', '3', '6600', '6800', '7000', '7400', '31864', '7700', '255', '235', '8600', '8800', '172', '9000', '205', '204', '9900', '122', '186', '10850', '11000', '11050', '11600', '11650', '11900', '12100', '12400', '12600', '13200', '13450', '13900', '14400', '14600', '14700', '14800', '14845', '15200', '15400', '15500', '15900', '16200', '16600', '16700', '16800', '137', '17200', '17300', '30011', '17500', '17700', '17900', '18700', '18875', '18975', '19000', '149', '19300', '19600', '19800', '4', '20000', '20100', '20200', '20700', '21600', '21700', '125', '21800', '22000', '22500', '22900', '23000', '188', '100', '23100', '23500', '23600', '23800', '23825', '189', '135', '24100', '24200', '24400', '24500', '25200', '25100', '25300', '113', '25650', '1400']
Yes, luckily we did. We just need to drop some values that we do not need and remove duplicates:
car_make_filter_out = ['Any', 'Other', '']
car_base_make_data = car_base_make_data[~car_base_make_data.car_make.isin(car_make_filter_out)]
car_base_make_data = car_base_make_data.drop_duplicates()
car_base_make_data

We ended up with 117 different car makes. You can probably recognize some of them.
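As a quick standalone illustration of the `isin` filtering and deduplication used above, here it is on a made-up mini table (the `~` negates the boolean mask, so we keep everything that is NOT in the filter list):

```python
import pandas as pd

# Made-up mini table standing in for car_base_make_data
toy = pd.DataFrame({'car_make': ['Any', 'Audi', 'BMW', 'Other', 'BMW'],
                    'id1': ['', '1900', '3500', '1400', '3500']})

# Drop the placeholder rows, then remove the repeated BMW row
filtered = toy[~toy.car_make.isin(['Any', 'Other', ''])].drop_duplicates()
print(list(filtered['car_make']))  # → ['Audi', 'BMW']
```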
Get all models for each of the make
Now that we have all the car makes, we need to get all the models and their ids for each make separately. To do this, we use Selenium to select and click on each make in the dropdown list. This activates a new dropdown next to it that contains all the models for the given make. We scrape these one by one.
Again I will do this in a loop and create a Pandas DataFrame:
car_base_model_data = pd.DataFrame()
for one_make in tqdm(car_base_make_data['car_make'], "Progress: "):
    # find the dropdown option that we want to click
    make_string = "//select[@name='mk']/option[text()='{}']".format(one_make)
    driver.find_element_by_xpath(make_string).click()  # use selenium to click the option
    time.sleep(3)  # wait for the page to load
    base_source = driver.page_source
    base_soup = BeautifulSoup(base_source, 'html.parser')
    model_list = base_soup.findAll('div', {'class': 'form-group'})[1]
    models = model_list.findAll('option')
    car_model = []
    id2 = []
    for i in range(len(models)):
        car_model.append(models[i].text.strip())
        try:
            id2.append(models[i]['value'])
        except KeyError:
            id2.append('')
    car_base_model_data_aux = pd.DataFrame({'car_model': car_model, 'id2': id2})
    car_base_model_data_aux['car_make'] = one_make
    car_base_model_data = pd.concat([car_base_model_data, car_base_model_data_aux], ignore_index=True)
Progress: 100%|██████████| 117/117 [06:25<00:00, 3.30s/it]
car_base_model_data = car_base_model_data.drop_duplicates()
We can look at one make. In this case I looked at Austin cars. We can notice that some of the car models do not have an id. That is fine, we will handle them later.
car_base_model_data[car_base_model_data['car_make'] == "Austin"]

Join make data and model data together.
car_data_base = pd.merge(car_base_make_data, car_base_model_data, on=['car_make'], how='right')
Drop out rows that do not have ids:
car_data_base = car_data_base[~car_data_base.id2.isin([""])]
Create link that we can use to open for each make and model combinations
We can also notice that some of the ids are not just a number but have a letter at the very beginning of the id string. I figured out that those correspond to a group of models. For example, there is the BMW 1-series, and within this category there are BMW 116, BMW 118 and BMW 120; the group BMW 1-series is also listed here as a separate category, with an id that starts with a letter. In this example script I decided to scrape only the more granular categories, so I will drop the ids that start with a letter (i.e. those that are not numeric strings):
car_data_base = car_data_base[car_data_base.id2.apply(lambda x: x.isnumeric())]
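To see what this filter does, here is a tiny standalone example with made-up ids (`str.isnumeric()` is False for empty strings and for anything containing a letter, so the letter-prefixed "group" ids are dropped):

```python
ids = ["36", "1900", "C12", ""]

# Keep only pure-number id strings, as in the DataFrame filter above
granular_ids = [i for i in ids if i.isnumeric()]
print(granular_ids)  # → ['36', '1900']
```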
Create a link:
car_data_base['link'] = "https://suchen.mobile.de/fahrzeuge/search.html?dam=0&isSearchRequest=true&ms=" + car_data_base['id1'] + ";" + car_data_base['id2'] + "&ref=quickSearch&sfmr=false&vc=Car"
car_data_base = car_data_base.reset_index(drop = True)
car_data_base

We are done with the first part
Look at one random example:
random_number = randrange(len(car_data_base))
print("make: ", car_data_base['car_make'][random_number], " model: ", car_data_base['car_model'][random_number])
print(car_data_base['link'][random_number])
make: Audi model: RS3
https://suchen.mobile.de/fahrzeuge/search.html?dam=0&isSearchRequest=true&ms=1900;36&ref=quickSearch&sfmr=false&vc=Car
As we can see, we generated a link that we can later use to visit that page, which is exactly what we wanted to achieve.
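Since the link follows a fixed template, we can reproduce the Audi RS3 example by hand (1900 is the Audi make id and 36 the RS3 model id from the tables we scraped; `LINK_TEMPLATE` is just an illustrative name):

```python
# Same URL pattern as used above, with {id1}/{id2} placeholders
LINK_TEMPLATE = ("https://suchen.mobile.de/fahrzeuge/search.html"
                 "?dam=0&isSearchRequest=true&ms={id1};{id2}"
                 "&ref=quickSearch&sfmr=false&vc=Car")

link = LINK_TEMPLATE.format(id1="1900", id2="36")  # Audi RS3
print(link)
```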
Save CSV
We can save our data to a .csv file or upload it to a database, so we do not need to re-scrape the site when we want to use this list.
car_data_base.to_csv("data/make_and_model_links.csv", encoding='utf-8', index=False)
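One thing to watch out for when reading the file back later: `read_csv` would parse the id columns as integers by default, so it is worth passing `dtype=str`. A small round-trip sketch on a one-row stand-in table (using an in-memory buffer instead of a file):

```python
import io
import pandas as pd

# A one-row stand-in for car_data_base
df = pd.DataFrame({'car_make': ['Audi'], 'car_model': ['RS3'],
                   'id1': ['1900'], 'id2': ['36']})

# Round-trip through CSV text; dtype=str keeps the ids as strings
csv_text = df.to_csv(index=False)
df_back = pd.read_csv(io.StringIO(csv_text), dtype=str)
print(df_back['id1'][0])  # → '1900'
```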
We are basically finished. I will just put everything above together into one function, so that we only need to call that one function to scrape the data.
Put everything together into one function
def get_all_make_model(mobile_de_eng_base_link="https://www.mobile.de/?lang=en", save_filename="make_and_model_links.csv"):
    from selenium import webdriver
    from webdriver_manager.chrome import ChromeDriverManager
    from selenium.webdriver.chrome.options import Options
    import time
    from bs4 import BeautifulSoup
    import pandas as pd
    import numpy as np
    import re
    from random import randrange
    from tqdm import tqdm  # progress bar

    # start the browser without loading images
    chrome_options = webdriver.ChromeOptions()
    prefs = {"profile.managed_default_content_settings.images": 2}
    chrome_options.add_experimental_option("prefs", prefs)
    driver = webdriver.Chrome(ChromeDriverManager().install(), chrome_options=chrome_options)

    driver.get(mobile_de_eng_base_link)
    time.sleep(3)
    base_source = driver.page_source
    base_soup = BeautifulSoup(base_source, 'html.parser')

    # collect every make and its id
    make_list = base_soup.findAll('div', {'class': 'form-group'})[0]
    one_make = make_list.findAll('option')
    car_make = []
    id1 = []
    for i in range(len(one_make)):
        car_make.append(one_make[i].text.strip())
        try:
            id1.append(one_make[i]['value'])
        except KeyError:
            id1.append('')
    car_base_make_data = pd.DataFrame({'car_make': car_make, 'id1': id1})
    car_make_filter_out = ['Any', 'Other', '']
    car_base_make_data = car_base_make_data[~car_base_make_data.car_make.isin(car_make_filter_out)]
    car_base_make_data = car_base_make_data.drop_duplicates()
    car_base_make_data = car_base_make_data.reset_index(drop=True)

    # click through every make and collect its models
    car_base_model_data = pd.DataFrame()
    for one_make in tqdm(car_base_make_data['car_make'], "Progress: "):
        make_string = "//select[@name='mk']/option[text()='{}']".format(one_make)
        driver.find_element_by_xpath(make_string).click()
        time.sleep(3)
        base_source = driver.page_source
        base_soup = BeautifulSoup(base_source, 'html.parser')
        model_list = base_soup.findAll('div', {'class': 'form-group'})[1]
        models = model_list.findAll('option')
        car_model = []
        id2 = []
        for i in range(len(models)):
            car_model.append(models[i].text.strip())
            try:
                id2.append(models[i]['value'])
            except KeyError:
                id2.append('')
        car_base_model_data_aux = pd.DataFrame({'car_model': car_model, 'id2': id2})
        car_base_model_data_aux['car_make'] = one_make
        car_base_model_data = pd.concat([car_base_model_data, car_base_model_data_aux], ignore_index=True)

    # join, clean and build the links
    car_data_base = pd.merge(car_base_make_data, car_base_model_data, on=['car_make'], how='right')
    car_data_base = car_data_base[~car_data_base.id2.isin([""])]
    car_data_base = car_data_base[car_data_base.id2.apply(lambda x: x.isnumeric())]
    car_data_base = car_data_base.drop_duplicates()
    car_data_base['link'] = "https://suchen.mobile.de/fahrzeuge/search.html?dam=0&isSearchRequest=true&ms=" + car_data_base['id1'] + ";" + car_data_base['id2'] + "&ref=quickSearch&sfmr=false&vc=Car"
    car_data_base = car_data_base.reset_index(drop=True)

    if len(save_filename) > 0:
        car_data_base.to_csv(save_filename, encoding='utf-8', index=False)

    return car_data_base

all_data = get_all_make_model(save_filename="data/make_and_model_links.csv")
====== WebDriver manager ======
Current google-chrome version is 97.0.4692
Get LATEST driver version for 97.0.4692
Driver [C:\Users\menyh\.wdm\drivers\chromedriver\win32\97.0.4692.71\chromedriver.exe] found in cache
<ipython-input-19-66173c854f05>:18: DeprecationWarning: use options instead of chrome_options
driver = webdriver.Chrome(ChromeDriverManager().install(), chrome_options=chrome_options)
Progress: 100%|██████████| 117/117 [06:17<00:00, 3.23s/it]
all_data

And we are ready. We have a table where we have all the make and model combinations with their corresponding links.
Next
In the next part, the 2nd of this web scraping series, I will show you how to get the links for each of these make and model combinations. Then in the 3rd part we will collect data about the ads themselves.
I hope you will check out those too.
Once they are published, you will find them at the links listed at the beginning of this article.