Attention! If you had trouble following the previous step, you can visit the relevant lesson, download the archive of the previous step, install it, and start this lesson exactly where the previous one ended!
In this course, we are going through all the stages of creating a new project together.
Of course, the most convenient (and, by the way, more reliable) way would be for our site to interact with the supplier's site via an API. However, this requires the supplier to provide such an API in the first place.
The Django framework allows us to organize such interaction by its own means, without resorting to third-party packages. However, the Django REST framework (DRF) package does this job best.
Nevertheless, in our case, we will use another method: reading and extracting the necessary information directly from the HTML page. This technique is called scraping (parsing) the site.
Two popular Python libraries will be used for this purpose: beautifulsoup4 and requests. You can install them using two terminal commands:
pip install beautifulsoup4
pip install requests
Typically, data on a product page is grouped into blocks. Inside the blocks, the same type of data is under the same selectors (see figure):
If we download and parse an HTML page with a list of products, we can get a structured list of data. Specifically for our case, for each data block, we need to get the following dictionary:
{
'name': 'Труба профильная 40х20 2 мм 3м',
'image_url': 'https://my-website.com/30C39890-D527-427E-B573-504969456BF5.jpg',
'price': Decimal('493.00'),
'unit': 'за шт',
'code': '38140012'
}
The plan is ready - let's start its implementation!
Obviously, the script responsible for reading information from another site should be placed in a separate shop application module: shop/scraping.py. The scraping() function will be responsible for sending a request to URL_SCRAPING, reading data and writing this data to the Product table of the project database.
First of all, we need to get the HTML code of the product data page for further processing. This task will be assigned to the requests module:
import requests

def scraping():
    URL_SCRAPING = 'https://www.some-site.com'
    resp = requests.get(URL_SCRAPING, timeout=10.0)
    if resp.status_code != 200:
        raise Exception('HTTP error access!')
    data_list = []
    html = resp.text
It makes sense to immediately see what we got. To do this, let's add code that counts the number of characters in the html object and prints the object itself:
    html = resp.text
    print(f'HTML text consists of {len(html)} symbols')
    print(html)

if __name__ == '__main__':
    scraping()
The shop/scraping.py module does not require any Django settings (at least not yet), so you can run it like a regular Python script:
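For example, from the project root (the exact path is an assumption; adjust it to your layout):

python shop/scraping.py

The output begins like this: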
HTML text consists of 435395 symbols
<!DOCTYPE html>
<html lang="ru">
<head>
<link rel="shortcut icon" type="image/x-icon" href="/bitrix/templates/elektro_light/favicon_new.ico"/>
<meta name="robots" content="index, follow">
<meta name="keywords" content="Профильные трубы, уголки">
<meta name="description" content="Цены на профильные трубы, уголки от руб. Описание. Характеристики. Отзывы. Скидки на профильные трубы, уголки.">
<meta name="viewport" content="width=device-width, initial-scale=1.0 user-scalable=no"/>
<meta name="msapplication-TileColor" content="#ffffff">
As you can see, the result really looks like an HTML page.
The first part of the task is solved: access to the site data has been obtained, and those 435,395 characters printed to the screen contain all the information we need. All that remains is to extract this information and store the result in the database.
Further processing is most conveniently done with the beautifulsoup4 module. To do this, we first need to create a soup object, which is a nested data structure representing the HTML document:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
More information on how to get started with this package can be found in the documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#quick-start
You can also read more about the beautifulsoup4 CSS selectors here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors

On the supplier's page, we will be most interested in the product block: a layout of repeating product cards with a similar data structure. You can get a list of elements of the same type from the soup object using the select() method, which takes the CSS selector of this block as an argument. In our case it is class="catalog-item-card":
blocks = soup.select('.catalog-item-card')
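A quick way to make sure the selector actually matched something (a temporary check, not part of the final script):

print(f'{len(blocks)} product cards found')  # expected: one entry per card on the page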
In a loop, we can access each block and at the same time see what is inside the block object. This is what the modified code looks like:
    html = resp.text
    soup = BeautifulSoup(html, 'html.parser')
    blocks = soup.select('.catalog-item-card')
    for block in blocks:
        print(f'HTML text consists of {len(block.text)} symbols')
        print(50 * '=')
        # print the block itself (with its tags), but only the first one
        print(block)
        break
And this is what the printed block looks like (the output is truncated):
HTML text consists of 382 symbols
==================================================
<div class="catalog-item-card" itemprop="itemListElement" itemscope="" itemtype="http://schema.org/Product">
<div class="catalog-item-info">
<div class="item-all-title">
<a class="item-title" href="/catalog/profilnye_truby_ugolki/truba_profilnaya_40kh20_2_mm_3m/" itemprop="url" title="Труба профильная 40х20 2 мм 3м">
<span itemprop="name">Труба профильная 40х20 2 мм 3м</span>
</a>
As you can see, the number of characters in the block has been reduced to 382, which greatly simplifies our task.
We can parse these blocks into the elements of interest using the soup.select_one() method, which, unlike select(), does not return all elements that satisfy the condition (the method argument), but only the first matched element. It is also important to remember that the text of the element returned by select_one() can be extracted via its text attribute. Thus, by applying this method with the appropriate arguments, we fill almost the entire data dictionary, with the exception of the code field:
# at the top of shop/scraping.py:
import re
from decimal import Decimal

# URL_SCRAPING_DOMAIN is the scheme-and-host part of URL_SCRAPING,
# e.g. 'https://www.some-site.com'; relative links are joined to it
soup = BeautifulSoup(html, 'html.parser')
blocks = soup.select('.catalog-item-card')
for block in blocks:
    # target dictionary for each block:
    # {
    #     'name': 'Труба профильная 40х20 2 мм 3м',
    #     'image_url': 'https://my-website.com/30C39890-D527-427E-B573-504969456BF5.jpg',
    #     'price': Decimal('493.00'),
    #     'unit': 'за шт',
    #     'code': '38140012'
    # }
    data = {}
    name = block.select_one('.item-title[title]').get_text().strip()
    data['name'] = name
    image_url = URL_SCRAPING_DOMAIN + block.select_one('img')['src']
    data['image_url'] = image_url
    price_raw = block.select_one('.item-price').text
    # '\r\n \t\t\t\t\t\t\t\t\t\t\t\t\t\t493.00\t\t\t\t\t\t\t\t\t\t\t\t руб. '
    price = re.findall(r'\S\d+\.\d+\S', price_raw)[0]
    price = Decimal(price)
    data['price'] = price  # Decimal('493.00')
    unit = block.select_one('.unit').text.strip()
    # '\r\n \t\t\t\t\t\t\t\t\t\t\t\t\t\tза шт\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'
    data['unit'] = unit  # 'за шт'
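A quick standalone check of how this regular expression pulls the price out of the raw, tab-padded string (the sample value is shortened from the comment above):

import re
from decimal import Decimal

price_raw = '\r\n \t\t\t493.00\t\t\t руб. '
price = re.findall(r'\S\d+\.\d+\S', price_raw)[0]
print(price)           # 493.00
print(Decimal(price))  # 493.00

Note that the leading and trailing \S in the pattern are satisfied by the first and last digits of the number itself, since the price is surrounded by whitespace.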
Upon closer examination of the product block, it turned out that part of the product information is located on another page of the supplier's website: the product detail page. The link to this page is inside the block element itself.

Therefore, we have to repeat here the same algorithm that we used a few steps ago to get the data for the block:
    # find and open the detail page
    url_detail = block.select_one('.item-title')
    # <a class="item-title" href="/catalog/profilnye_truby_ugolki/truba_profilnaya_40kh20_2_mm_3m/" itemprop="url" title="Труба профильная 40х20 2 мм 3м">
    url_detail = url_detail['href']
    # '/catalog/profilnye_truby_ugolki/truba_profilnaya_40kh20_2_mm_3m/'
    url_detail = URL_SCRAPING_DOMAIN + url_detail
    html_detail = requests.get(url_detail, timeout=10.0).text
    # a separate soup object for the detail page, so the outer one is not shadowed
    soup_detail = BeautifulSoup(html_detail, 'html.parser')
    code_block = soup_detail.select_one('.catalog-detail-property')
    code = code_block.select_one('b').text
    data['code'] = code
    data_list.append(data)
    print(data)
If we do everything right, we will end up with a list of dictionaries with data for each block.
The success of site scraping depends on a number of parameters and circumstances, most of which do not depend on our Django code: the request may time out, the server may return an HTTP error, or something else entirely may go wrong. To handle these situations explicitly, let's define a hierarchy of custom exceptions:
class ScrapingError(Exception):
    pass


class ScrapingTimeoutError(ScrapingError):
    pass


class ScrapingHTTPError(ScrapingError):
    pass


class ScrapingOtherError(ScrapingError):
    pass
And then we make changes to the code:
try:
    resp = requests.get(URL_SCRAPING, timeout=10.0)
except requests.exceptions.Timeout:
    raise ScrapingTimeoutError("request timed out")
except Exception as e:
    raise ScrapingOtherError(f'{e}')
if resp.status_code != 200:
    raise ScrapingHTTPError(f"HTTP {resp.status_code}: {resp.text}")
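The code that actually writes the collected data to the Product table is not shown here. A minimal sketch of what it could look like, assuming the shop app's Product model has name, image_url, price, unit and code fields (the field names follow the data dictionary above, but the model itself is an assumption):

from shop.models import Product  # assumed model with the fields listed above

for data in data_list:
    # get_or_create() looks the product up by its unique code and
    # inserts a new row only if no product with that code exists yet
    product, created = Product.objects.get_or_create(
        code=data['code'],
        defaults={
            'name': data['name'],
            'image_url': data['image_url'],
            'price': data['price'],
            'unit': data['unit'],
        },
    )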
As you can see, a product is added only if it is not already in the database. The lookup is performed by the product code number (the code field).
Although the scraping() function already writes the data to the database, we still return the data_list list - just in case.
However, if we now try to reproduce this script, we will get an error:
"/home/su/Projects/django-apps/Projects/drop-ship-store/venv/lib/python3.8/site-packages/django/conf/__init__.py", line 67, in _setup
raise ImproperlyConfigured(
django.core.exceptions.ImproperlyConfigured: Requested setting INSTALLED_APPS, but settings are not configured. You must either define the environment variable DJANGO_SETTINGS_MODULE or call settings.configure() before accessing settings.
Process finished with exit code 1
The thing is that the script now accesses the database, which means that the Django settings must be loaded. One option is to run this code as a custom command in management/commands (more on this here: https://docs.djangoproject.com/en/4.0/howto/custom-management-commands/). But we will do otherwise: we will immediately add a launch page and check the operation of the scraping() function there.
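For reference, had we chosen the management-command route, a minimal sketch could look like this (the file path and command name are our own assumptions):

# shop/management/commands/scrape_products.py
from django.core.management.base import BaseCommand

from shop.scraping import scraping


class Command(BaseCommand):
    help = 'Scrape the supplier site and update the Product table'

    def handle(self, *args, **options):
        data_list = scraping()
        self.stdout.write(f'{len(data_list)} products processed')

It would then be run with python manage.py scrape_products, and Django would configure the settings for us.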
The algorithm for adding a new page remains the same: a view function, a URL route and, if needed, a template (a minimal sketch follows).
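A sketch of such a launch page, assuming a view named scraping_view (both the view and URL names are assumptions); note how the ScrapingError hierarchy defined above lets the view report failures cleanly:

# shop/views.py
from django.http import HttpResponse

from .scraping import scraping, ScrapingError


def scraping_view(request):
    try:
        data_list = scraping()
    except ScrapingError as e:
        return HttpResponse(f'Scraping failed: {e}', status=502)
    return HttpResponse(f'Scraping finished: {len(data_list)} products processed')

# shop/urls.py
from django.urls import path

from . import views

urlpatterns = [
    path('scraping/', views.scraping_view, name='scraping'),
]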
If, after successfully completing all these steps, we go to the admin panel, we will see that after running the script the Product table is filled with new values:
Now everything is ready for the last step: adding the pages of the online store itself and the code that will manage them. But we will deal with this in the next (seventh and final) lesson.
You can learn more about all the details of this stage from this video (RU voice):