
Drop Shipping Online Store on Django (Part 6)

6. Adding a Scraping (Parsing) Module and Autofilling the Database Directly From Another Website!

Attention! If you're having trouble following the previous step, you can visit the relevant lesson, download the archive of the previous step, install it, and start this lesson exactly where the previous one ended!


How to get information from another site?

Of course, the most convenient (and, by the way, more reliable) way would be for our site to interact with the supplier site via an API. However, this requires that:

  • such an opportunity is organized on the supplier site (access via an API),
  • and the supplier site's administration gives us a login and password for this access.

The Django framework allows us to organize such interaction with its own means, without resorting to third-party packages. However, the Django REST framework (DRF) package does this job best.

Nevertheless, in our case we will use another method: reading and extracting the necessary information directly from the HTML page. This action is called scraping (parsing) the site.

Two popular Python libraries will be used for this purpose: beautifulsoup4 and requests. You can install them using two terminal commands:

pip install beautifulsoup4
pip install requests

Web page structure

Typically, data on a product page is grouped into blocks. Inside the blocks, the same type of data sits under the same selectors (see figure).

If we download and parse an HTML page with a list of products, we can get a structured list of data. Specifically for our case, for each data block, we need to get the following dictionary:

{
    'name': 'Труба профильная 40х20 2 мм 3м',
    'image_url': 'https://my-website.com/30C39890-D527-427E-B573-504969456BF5.jpg',
    'price': Decimal('493.00'),
    'unit': 'за шт',
    'code': '38140012'
}

Action plan

  • Create a scraping.py module in the shop app
  • In this module, create a scraping() function that can:
    1. Get the page's HTML code (using the requests package)
    2. Process the resulting HTML code (using the beautifulsoup4 package)
    3. Save the result in the database
  • Test the scraping() function “manually”
  • Add a button to the site page that starts scraping

The plan is ready - let's start its implementation!

Create a scraping (parsing) module and get the HTML code using the requests package

Obviously, the script responsible for reading information from another site should be placed in a separate module of the shop application: shop/scraping.py. The scraping() function will be responsible for sending a request to URL_SCRAPING, reading the data, and writing it to the Product table of the project database.

First of all, we need to get the HTML code of the product data page for further processing. This task will be assigned to the requests module:

import requests

def scraping():
    URL_SCRAPING = 'https://www.some-site.com'
    resp = requests.get(URL_SCRAPING, timeout=10.0)
    if resp.status_code != 200:
        raise Exception('HTTP access error!')

    data_list = []
    html = resp.text

It makes sense to immediately look at what we've got. To do this, let's add code that counts the number of characters in the html object and prints the object itself:

    html = resp.text
    print(f'HTML text consists of {len(html)} symbols')
    print(html)


if __name__ == '__main__':
    scraping()

The shop/scraping.py module does not require any Django settings (at least not yet), so you can run it like a regular Python script:

HTML text consists of 435395 symbols
<!DOCTYPE html>
<html lang="ru">
  <head>
    <link rel="shortcut icon" type="image/x-icon" href="/bitrix/templates/elektro_light/favicon_new.ico"/>
    <meta name="robots" content="index, follow">
<meta name="keywords" content="Профильные трубы, уголки">
<meta name="description" content="Цены на профильные трубы, уголки от  руб. Описание. Характеристики. Отзывы. Скидки на  профильные трубы, уголки.">
    <meta name="viewport" content="width=device-width, initial-scale=1.0 user-scalable=no"/>
    <meta name="msapplication-TileColor" content="#ffffff">

As you can see, the result really looks like an HTML page.

The first part of the task is solved: access to the site's data has been obtained, and the 435,395 characters displayed on the screen contain all the information we need. All that remains is to extract this information and store the result in the database.

Processing the resulting HTML code with the BeautifulSoup package

Further processing will be most conveniently carried out using the beautifulsoup4 module. To do this, we first need to create a soup object, which is a nested data structure of an HTML document:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

More information on how to get started with this package can be found in the documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#quick-start
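As a quick illustration of the two selector methods we will use below, here is a tiny self-contained sketch (the HTML fragment is made up for the example): select() returns all matching elements, while select_one() returns only the first one:

from bs4 import BeautifulSoup

# a made-up HTML fragment, for illustration only
html = ('<div class="item"><span class="title">Pipe 40x20</span></div>'
        '<div class="item"><span class="title">Corner 25x25</span></div>')
soup = BeautifulSoup(html, 'html.parser')

print(len(soup.select('.item')))       # 2 - all matching elements
print(soup.select_one('.title').text)  # Pipe 40x20 - first match only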

You can also read more about beautifulsoup4 CSS selectors here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors

On the supplier's page, we will be most interested in the product block: a layout of repeating product cards with a similar data structure. You can get a list of elements of the same type from the soup object using the select() method, with the CSS selector of the block passed as an argument. In our case it is class="catalog-item-card":

blocks = soup.select('.catalog-item-card')

In the loop, we can access each block and at the same time see what is inside the block object. This is what the modified code looks like:

    html = resp.text

    soup = BeautifulSoup(html, 'html.parser')
    blocks = soup.select('.catalog-item-card')

    for block in blocks:
        print(f'HTML text consists of {len(block.text)} symbols')
        print(50 * '=')
        print(block)
        break

And this is what the printed block looks like:

HTML text consists of 382 symbols
==================================================
<div class="catalog-item-card" itemprop="itemListElement" itemscope="" itemtype="http://schema.org/Product">
<div class="catalog-item-info">
<div class="item-all-title">
<a class="item-title" href="/catalog/profilnye_truby_ugolki/truba_profilnaya_40kh20_2_mm_3m/" itemprop="url" title="Труба профильная 40х20 2 мм 3м">
<span itemprop="name">Труба профильная 40х20 2 мм 3м</span>
</a>

As you can see, the number of characters in the block has been reduced to 382, which greatly simplifies our task.

We can parse these blocks into the elements of interest using the soup.select_one() method, which, unlike select(), returns not all elements that satisfy the condition (the method's argument) but only the first matched element. It is also worth remembering that the text of an element returned by select_one() can be extracted via its text attribute. Thus, by applying this method with the appropriate arguments, we can fill almost the entire data dictionary, with the exception of the code field:

    # Requires at the top of shop/scraping.py:
    #   import re
    #   from decimal import Decimal
    # URL_SCRAPING_DOMAIN is the scheme-and-domain part of URL_SCRAPING
    soup = BeautifulSoup(html, 'html.parser')
    blocks = soup.select('.catalog-item-card')

    for block in blocks:
        """Target structure for each block:
        {
        'name': 'Труба профильная 40х20 2 мм 3м',
        'image_url': 'https://my-website.com/30C39890-D527-427E-B573-504969456BF5.jpg',
        'price': Decimal('493.00'),
        'unit': 'за шт',
        'code': '38140012'
        }
        """
        data = {}
        name = block.select_one('.item-title[title]').get_text().strip()
        data['name'] = name

        image_url = URL_SCRAPING_DOMAIN + block.select_one('img')['src']
        data['image_url'] = image_url

        price_raw = block.select_one('.item-price').text
        # '\r\n \t\t\t\t\t\t\t\t\t\t\t\t\t\t493.00\t\t\t\t\t\t\t\t\t\t\t\t  руб. '
        price = re.findall(r'\S\d+\.\d+\S', price_raw)[0]
        price = Decimal(price)
        data['price'] = price   # Decimal('493.00')

        unit = block.select_one('.unit').text.strip()
        # '\r\n \t\t\t\t\t\t\t\t\t\t\t\t\t\tза шт\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'
        data['unit'] = unit  # 'за шт'

Generating an additional link to go to the detail page and getting the product code

Upon closer examination of the product block, it turned out that some of the product information is located on another page of the supplier's website: the product detail page. The link to that page is inside the block object itself.

Therefore, we will have to repeat the same algorithm we used a few steps ago for the block object:

  • Generate the link to the detail page
  • Follow this link and read the new HTML code (of the detail page this time) with requests.get()
  • Load the received HTML into a new BeautifulSoup object
  • Extract the code number using the same select_one() method

        # find and open the detail page
        url_detail = block.select_one('.item-title')
        # <a class="item-title" href="/catalog/profilnye_truby_ugolki/truba_profilnaya_40kh20_2_mm_3m/" itemprop="url" title="Труба профильная 40х20 2 мм 3м">

        url_detail = url_detail['href']
        # '/catalog/profilnye_truby_ugolki/truba_profilnaya_40kh20_2_mm_3m/'

        url_detail = URL_SCRAPING_DOMAIN + url_detail

        html_detail = requests.get(url_detail, timeout=10.0).text
        soup_detail = BeautifulSoup(html_detail, 'html.parser')
        code_block = soup_detail.select_one('.catalog-detail-property')
        code = code_block.select_one('b').text
        data['code'] = code

        data_list.append(data)

        print(data)

If we do everything right, we will end up with a list of dictionaries with data for each block.

Adding error handling

The success of site scraping depends on a number of parameters and circumstances, most of which do not depend on our Django code at all, namely:

  • Availability (or inaccessibility) of the supplier site
  • Changes in the page layout
  • Internet connection problems
  • and so on…

To make such failures easier to detect and handle, let's define our own hierarchy of exceptions in shop/scraping.py:

class ScrapingError(Exception):
    pass


class ScrapingTimeoutError(ScrapingError):
    pass


class ScrapingHTTPError(ScrapingError):
    pass


class ScrapingOtherError(ScrapingError):
    pass

And then we make changes to the code:

    try:
        resp = requests.get(URL_SCRAPING, timeout=10.0)
    except requests.exceptions.Timeout:
        raise ScrapingTimeoutError("request timed out")
    except Exception as e:
        raise ScrapingOtherError(f'{e}')

    if resp.status_code != 200:
        raise ScrapingHTTPError(f"HTTP {resp.status_code}: {resp.text}")
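Since all of these exceptions inherit from ScrapingError, calling code can catch the base class or handle a specific failure separately. A minimal sketch of such a caller (illustrative, not part of the lesson's code):

from shop.scraping import scraping, ScrapingError, ScrapingTimeoutError

try:
    scraping()
except ScrapingTimeoutError:
    # the supplier site did not respond in time; try again later
    print('Request timed out')
except ScrapingError as e:
    # any other scraping failure (HTTP error, layout change, etc.)
    print(f'Scraping failed: {e}')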

Saving the received data in the database

A product is added only if it is not already in the database; the lookup is performed by the product code number (the value of the code field). A sketch of this step follows.
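The saving code itself is not reproduced in the text, so here is a minimal sketch of what it might look like, assuming the Product model declares the name, image_url, price, unit and code fields used in the data dictionary:

# at the top of shop/scraping.py
from shop.models import Product

# at the end of scraping(), once data_list has been filled
for data in data_list:
    # look up by code; create the product only if it does not exist yet
    defaults = {key: value for key, value in data.items() if key != 'code'}
    Product.objects.get_or_create(code=data['code'], defaults=defaults)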

Even though the data is already written to the database inside the scraping() function itself, we still return the data_list. Just in case.

However, if we now try to reproduce this script, we will get an error:

"/home/su/Projects/django-apps/Projects/drop-ship-store/venv/lib/python3.8/site-packages/django/conf/__init__.py", line 67, in _setup
    raise ImproperlyConfigured(
django.core.exceptions.ImproperlyConfigured: Requested setting INSTALLED_APPS, but settings are not configured. You must either define the environment variable DJANGO_SETTINGS_MODULE or call settings.configure() before accessing settings.

Process finished with exit code 1

The thing is that the script now accesses the database, which means the Django settings must be loaded. One option is to run this code as a custom management command in management/commands (more on this here: https://docs.djangoproject.com/en/4.0/howto/custom-management-commands/), as sketched below. But we will do otherwise: we will immediately add a launch page and test the scraping() function there.
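For completeness, here is a minimal sketch of that management-command alternative (the file name and command output are assumptions):

# shop/management/commands/scrape.py (sketch)
from django.core.management.base import BaseCommand

from shop.scraping import scraping


class Command(BaseCommand):
    help = 'Fill the Product table by scraping the supplier site'

    def handle(self, *args, **options):
        # scraping() writes to the database and returns the scraped data
        data_list = scraping()
        self.stdout.write(self.style.SUCCESS(f'Scraped {len(data_list)} products'))

It could then be run with python manage.py scrape.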

Transferring scraping control to the site page

The algorithm for adding a new page remains the same (a sketch of the URL configuration and the view follows this list):

  • Come up with a URL that will call it (shop/fill-database/)
  • Add a urls.py configurator to the shop application
  • Link the URL and the view in urls.py: path('fill-database/', views.fill_database, name='fill_database')
  • Move (copy) the template file into the project
  • Create a view in the shop/views.py module
  • Check the result!
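A minimal sketch of these pieces; the template name shop/fill_database.html and the POST-triggered flow are assumptions, not the lesson's exact code:

# shop/urls.py (sketch)
from django.urls import path

from . import views

app_name = 'shop'
urlpatterns = [
    path('fill-database/', views.fill_database, name='fill_database'),
]


# shop/views.py (sketch)
from django.shortcuts import render

from .scraping import scraping, ScrapingError


def fill_database(request):
    context = {}
    if request.method == 'POST':  # the button on the page submits a POST form
        try:
            context['products'] = scraping()  # runs the scraper and fills the DB
        except ScrapingError as e:
            context['error'] = str(e)
    return render(request, 'shop/fill_database.html', context)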

If, after the successful completion of all these points, we run the script and go to the admin panel, we will see that the Product table is filled with new values.

Now everything is ready for the last step: adding the online store pages themselves and the code that will manage them. But we will deal with that in the next, seventh and final lesson.

You can learn more about all the details of this stage from this video (RU voice):




To the next stage of the project