
Drop Shipping Online Store on Django (Part 6)

6. Adding a Scraping (Parsing) Module and Autofilling the Database Directly From Another Website!

Attention! If you had trouble with the previous step, you can visit the relevant lesson, download the archive of the previous step, install it, and start this lesson exactly where the previous one ended!

In this course, we go through all the stages of creating a new project together.

How to get information from another site?

Of course, the most convenient (and, incidentally, more reliable) way is for our site to interact with the supplier's site via an API. However, this requires that:

  • such access (via an API) is provided on the supplier's site,
  • and that the administration of the supplier's site gives us a login and password for this access.

The Django framework allows us to organize such interaction on its own, without installing third-party packages. However, the Django REST framework (DRF) package would do the job best.

Nevertheless, in our case we will use another method: reading and extracting the necessary information directly from the HTML page. This is called scraping (parsing) the site.

Two popular Python libraries will be used for this purpose: beautifulsoup4 and requests. You can install them using two terminal commands:

pip install beautifulsoup4
pip install requests

Web page structure

Typically, data on a product page is grouped into blocks. Inside the blocks, the same type of data is under the same selectors (see figure):

If we download and parse an HTML page with a list of products, we can get a structured list of data. Specifically for our case, for each data block, we need to get the following dictionary:

{
    'name': 'Труба профильная 40х20 2 мм 3м', 
    'image_url': 'https://my-website.com/30C39890-D527-427E-B573-504969456BF5.jpg', 
    'price': Decimal('493.00'), 
    'unit': 'за шт', 
    'code': '38140012'
 }

Action plan

  • Create a scraping.py module in the shop app
  • In this module, create a scraping() function that can:
    1. Get the page HTML code (using the requests package)
    2. Process the resulting HTML code (using the beautifulsoup4 package)
    3. Save the result in the database
  • Test the scraping() function "manually"
  • Add a button to the site page to start scraping

The plan is ready - let's start its implementation!
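Before implementing the steps one by one, here is a rough sketch (not the final code) of the frame that shop/scraping.py will grow into by the end of this lesson; each step is filled in below:

# shop/scraping.py - only the frame; the steps are implemented below
import requests
from bs4 import BeautifulSoup


def scraping():
    data_list = []

    # 1. Get the page HTML code (requests)
    # 2. Process the HTML and build a data dictionary for each product block (beautifulsoup4)
    # 3. Save the result in the database

    return data_list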

Create a scraping (parsing) module and get the HTML code using the requests package

Obviously, the script responsible for reading information from another site should be placed in a separate shop application module: shop/scraping.py. The scraping() function will be responsible for sending a request to URL_SCRAPING, reading data and writing this data to the Product table of the project database.

First of all, we need to get the HTML code of the product data page for further processing. This task will be assigned to the requests module:

import requests

def scraping():
    URL_SCRAPING = 'https://www.some-site.com'
    resp = requests.get(URL_SCRAPING, timeout=10.0)
    if resp.status_code != 200:
        raise Exception('HTTP error access!')

    data_list = []
    html = resp.text

It makes sense to immediately look at what we got. To do this, let's add code that counts the number of characters in the html object and prints the object itself:

    html = resp.text
    print(f'HTML text consists of {len(html)} symbols')
    print(html)


if __name__ == '__main__':
    scraping()

The shop/scraping.py module does not require any Django settings (at least not yet), so you can run it like a regular Python script:

HTML text consists of 435395 symbols
<!DOCTYPE html>
<html lang="ru">
  <head>
    <link rel="shortcut icon" type="image/x-icon" href="/bitrix/templates/elektro_light/favicon_new.ico"/>
    <meta name="robots" content="index, follow">
<meta name="keywords" content="Профильные трубы, уголки">
<meta name="description" content="Цены на профильные трубы, уголки от  руб. Описание. Характеристики. Отзывы. Скидки на  профильные трубы, уголки.">
    <meta name="viewport" content="width=device-width, initial-scale=1.0 user-scalable=no"/>
    <meta name="msapplication-TileColor" content="#ffffff">

As you can see, the result really looks like an HTML page.

The first part of the task is solved - access to the site data is obtained, and those 435,395 characters that are displayed on the screen contain all the information we need. All we now need is to simply extract this information and store the result in the database.

Processing the resulting HTML code with the BeautifulSoup package

Further processing will be most conveniently carried out using the beautifulsoup4 module. To do this, we first need to create a soup object, which is a nested data structure of an HTML document:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

More information on how to get started with this package can be found in its documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#quick-start

You can also read more about the beautifulsoup4 CSS selectors here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors

On the supplier's page we are most interested in the product block: a layout of repeating product cards with a similar data structure. You can get a list of elements of the same type from the soup object using the select() method, which takes the CSS selector of this block as an argument. In our case it is class="catalog-item-card":

blocks = soup.select('.catalog-item-card')

In the loop, we can access each block and at the same time see what is inside the block object. This is what the modified code looks like:

    html = resp.text

    soup = BeautifulSoup(html, 'html.parser')
    blocks = soup.select('.catalog-item-card')

    for block in blocks:
        print(f'HTML text consists of {len(block.text)} symbols')
        print(50 * '=')
        print(block)
        break

And this is what the printed block looks like:

HTML text consists of 382 symbols
==================================================
<div class="catalog-item-card" itemprop="itemListElement" itemscope="" itemtype="http://schema.org/Product">
<div class="catalog-item-info">
<div class="item-all-title">
<a class="item-title" href="/catalog/profilnye_truby_ugolki/truba_profilnaya_40kh20_2_mm_3m/" itemprop="url" title="Труба профильная 40х20 2 мм 3м">
<span itemprop="name">Труба профильная 40х20 2 мм 3м</span>
</a>

As you can see, the number of characters in the block's text has dropped to 382, which greatly simplifies our task.

We can parse these blocks into the elements we are interested in using the soup.select_one() method, which, unlike select(), returns not all elements that match the condition (the method argument) but only the first match. It is also important to remember that the text of the element returned by select_one() can be extracted via its text attribute. By applying this method with the appropriate arguments, we fill almost the entire data dictionary, except for the code field:

    # re and Decimal are used below, so shop/scraping.py also needs:
    # import re
    # from decimal import Decimal
    # URL_SCRAPING_DOMAIN is the scheme-and-host part of URL_SCRAPING,
    # used to turn relative links into absolute ones.

    soup = BeautifulSoup(html, 'html.parser')
    blocks = soup.select('.catalog-item-card')

    for block in blocks:
        """Target structure for each block:
        {
        'name': 'Труба профильная 40х20 2 мм 3м',
        'image_url': 'https://my-website.com/30C39890-D527-427E-B573-504969456BF5.jpg',
        'price': Decimal('493.00'),
        'unit': 'за шт',
        'code': '38140012'
        }
        """
        data = {}
        name = block.select_one('.item-title[title]').get_text().strip()
        data['name'] = name

        image_url = URL_SCRAPING_DOMAIN + block.select_one('img')['src']
        data['image_url'] = image_url

        price_raw = block.select_one('.item-price').text
        # '\r\n \t\t\t\t\t\t\t\t\t\t\t\t\t\t493.00\t\t\t\t\t\t\t\t\t\t\t\t  руб. '
        price = re.findall(r'\d+\.\d+', price_raw)[0]
        price = Decimal(price)
        data['price'] = price   # Decimal('493.00')

        unit = block.select_one('.unit').text.strip()
        # '\r\n \t\t\t\t\t\t\t\t\t\t\t\t\t\tза шт\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'
        data['unit'] = unit  # 'за шт'

Generating an additional link to go to the detail page and getting the product code

Upon closer examination of the product block, it turned out that part of the product information is located on another page of the supplier's website: the product detail page. The link to this page is inside the block object.

Therefore, we have to repeat here the same algorithm we used a few steps ago for the block itself:

  • Generate the link to the detail page
  • Follow this link and read the HTML code of the detail page with requests.get()
  • Save the received data in a new BeautifulSoup object
  • Extract the code number using the same select_one() method

        # find and open the detail page
        url_detail = block.select_one('.item-title')
        # <a class="item-title" href="/catalog/profilnye_truby_ugolki/truba_profilnaya_40kh20_2_mm_3m/" itemprop="url" title="Труба профильная 40х20 2 мм 3м">

        url_detail = url_detail['href']
        # '/catalog/profilnye_truby_ugolki/truba_profilnaya_40kh20_2_mm_3m/'

        url_detail = URL_SCRAPING_DOMAIN + url_detail

        html_detail = requests.get(url_detail).text
        soup = BeautifulSoup(html_detail, 'html.parser')
        code_block = soup.select_one('.catalog-detail-property')
        code = code_block.select_one('b').text
        data['code'] = code

        data_list.append(data)

        print(data)

If we do everything right, we will end up with a list of dictionaries with data for each block.

Adding error handling

The success of site scraping depends on a number of parameters and circumstances, most of which do not depend on our Django code, namely:

  • Availability (or unavailability) of the supplier site
  • Changes in the page layout
  • Internet connection problems
  • and so on…

To report these failures clearly, let's define our own exception hierarchy in shop/scraping.py:

class ScrapingError(Exception):
    pass


class ScrapingTimeoutError(ScrapingError):
    pass


class ScrapingHTTPError(ScrapingError):
    pass


class ScrapingOtherError(ScrapingError):
    pass

And then we make changes to the code:

    try:
        resp = requests.get(URL_SCRAPING, timeout=10.0)
    except requests.exceptions.Timeout:
        raise ScrapingTimeoutError("request timed out")
    except Exception as e:
        raise ScrapingOtherError(f'{e}')

    if resp.status_code != 200:
        raise ScrapingHTTPError(f"HTTP {resp.status_code}: {resp.text}")

Saving the received data in the database

The product is added only if it is not already in the database; the lookup is performed by the product code number (the value of the code field).

Even though the data is already written to the database inside scraping() itself, we still return the data_list list, just in case.
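The saving code itself is not shown above, so here is a minimal sketch of what it could look like at the end of scraping(), assuming the Product model in shop/models.py has the name, image_url, price, unit and code fields from the data dictionary (the field names may differ in your model):

# at the top of shop/scraping.py
from shop.models import Product

# ...and at the end of scraping(), after data_list has been filled:
    for data in data_list:
        # add the product only if its code is not in the database yet
        if not Product.objects.filter(code=data['code']).exists():
            Product.objects.create(
                name=data['name'],
                image_url=data['image_url'],
                price=data['price'],
                unit=data['unit'],
                code=data['code'],
            )

    return data_list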

However, if we now try to run this script, we will get an error:

"/home/su/Projects/django-apps/Projects/drop-ship-store/venv/lib/python3.8/site-packages/django/conf/__init__.py", line 67, in _setup
    raise ImproperlyConfigured(
django.core.exceptions.ImproperlyConfigured: Requested setting INSTALLED_APPS, but settings are not configured. You must either define the environment variable DJANGO_SETTINGS_MODULE or call settings.configure() before accessing settings.

Process finished with exit code 1

The thing is that the script now accesses the database, which means it needs the Django settings. One way to run such code is as a custom management command in management/commands (more on this here: https://docs.djangoproject.com/en/4.0/howto/custom-management-commands/). But we will do otherwise: we will add the launch page right away and check the operation of the scraping() function there.
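For completeness, here is a minimal sketch of such a management command (the file name scrape.py and the messages are only examples, not part of the project; the management and commands folders each need an empty __init__.py):

# shop/management/commands/scrape.py
from django.core.management.base import BaseCommand

from shop.scraping import scraping


class Command(BaseCommand):
    help = 'Fill the Product table with data scraped from the supplier site'

    def handle(self, *args, **options):
        data_list = scraping()
        self.stdout.write(self.style.SUCCESS(f'Scraped {len(data_list)} products'))

It could then be run with python manage.py scrape, and Django would load the settings for us.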

Transferring scraping control to the site page

The algorithm for adding a new page remains the same:

  • Come up with a URL that will call it (shop/fill-database/)
  • Add a urls.py configurator to the shop application
  • In urls.py, link the URL and the view: path('fill-database/', views.fill_database, name='fill_database')
  • Move (copy) the template file into the project
  • Create a view in the shop/views.py module (see the sketch below)
  • Check the result!
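A minimal sketch of what the two code pieces could look like (the template name and the context keys here are assumptions, not the exact code of the project):

# shop/urls.py
from django.urls import path

from . import views

app_name = 'shop'

urlpatterns = [
    path('fill-database/', views.fill_database, name='fill_database'),
]


# shop/views.py
from django.shortcuts import render

from .scraping import scraping, ScrapingError


def fill_database(request):
    context = {}
    if request.method == 'POST':  # the button on the page sends a POST request
        try:
            context['data_list'] = scraping()
        except ScrapingError as e:
            context['error'] = str(e)
    return render(request, 'shop/fill_database.html', context)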

If, after the successful completion of all these points, we go to the admin panel, we will see that, after running the script, the Product table is filled with new values:

Now everything is ready for the last step: adding the actual pages of the online store and the code that manages them. But we will deal with this in the next, seventh and final, lesson.

You can learn more about all the details of this stage from this video (RU voice):




To the next stage of the project




Parsing the site using Apps Script does not work. Why?

The parsing (scraping) method discussed in the article Parsing (scraping) using Google Apps Script is incredibly simple and easy to use. It requires no complex programs, high qualifications, special knowledge or long experience. All the code is written, stored and run directly in the Google Sheets script editor.

In addition, the script can be run on a schedule, 24/7, and the result can be processed right there and, if necessary, sent by email, or a new file with the received and processed data can be created on Google Drive.

However, there can be problems here. First of all, this script does not work with sites (more precisely, with pages) that require prior sign-in: logging in to a site means creating a session whose identifier is stored in cookies, and the standard UrlFetchApp service used in Apps Script to access a website does not manage cookies for you.

There is another problem that has been showing up more and more often lately: site owners are unwilling to share information with bots. As a result, you can often see that the HTML code the server returns to the script differs significantly from the code it returns to a browser.

This program was written precisely to find out quickly whether the site we need can be parsed (scraped) or not:

function onOpen() {
  var ui = SpreadsheetApp.getUi();
  // Or DocumentApp or FormApp.
  ui.createMenu('Custom Menu')
      .addItem('Check', 'fastAnalyzer')
      .addSeparator()
      .addItem('Create Sheet CHECK', 'createSheet')
      .addToUi();
}


function fastAnalyzer() {
  const ss = SpreadsheetApp.getActiveSpreadsheet().getSheetByName('Check');
  const urlDomain = ss.getRange(2, 1).getValue();
  
  let url = urlDomain;
  let count = 0;
  let response = UrlFetchApp.fetch(url);
  let lll = response.getContentText().length;
  ss.getRange(2, 2).setValue(response.getResponseCode());
  ss.getRange(2, 3).setValue(lll);
  let html = response.getContentText();
  let whatLookingFor = ss.getRange(3, 2).getValue();
  let pos = html.indexOf(whatLookingFor);
  ss.getRange(3, 3).setValue(pos);

  ss.getRange(4, 1, ss.getLastRow(), 2).clearContent();
  let totalNumber = html.length
  let i = 0;

  // Check the default value of delta - the number of symbols per row.
  // If the cell is empty, use the default value = 5000
  let delta = Number.parseInt(ss.getRange(2, 4).getValue());
  if (isNaN(delta)) {
    delta = 5000
    // Write the default number of symbols per row back to the sheet
    ss.getRange(2, 4).setValue(delta);
  };

  let iStart = 0;
  while (true) {
    let iEnd = iStart + delta;
    if (iEnd > totalNumber) iEnd = totalNumber; 
    ss.getRange(4 + i, 1).setValue("Символы от " + iStart + " до " + iEnd);
    ss.getRange(4 + i, 2).setValue(html.slice(iStart,iEnd));
    // let currentRow = (4+i) + ":" + (4+i)
    // ss.getRange(3, 3).setValue(currentRow);
    // ss.getRange(currentRow).activate();
    // ss.setRowHeight(4+i, 200);
    i++;
    if (iEnd - iStart < delta) break;
    iStart = iEnd;

  };
}

function createSheet() {

// does the Check sheet exist?
  let s = SpreadsheetApp.getActiveSpreadsheet();
  
// create it if it does not exist
  if(!s.getSheetByName('Check')) {
    s.insertSheet('Check', 0);
    let ssc = s.getSheetByName('Check');

    ssc.getRange(1, 1).setValue("Url сайта");
    ssc.getRange(1, 2).setValue("Код ответа сайта");
    ssc.getRange(1, 3).setValue("Число символов на выбранной странице");
    ssc.getRange(1, 4).setValue("Изменить вывод числа знаков в одной строке");
    ssc.getRange(2, 1).setBackground('#ffff00');  // yellow background cell
    ssc.getRange(3, 1).setValue("Какой элемент кода ищем?");
    ssc.getRange(3, 2).setBackground('#ffff00');  // yellow background cell
  };
}

So, all you need to do to work with this program is:

  • create a new Google Spreadsheet
  • open the script editor in this file
  • copy the code from this article in full into the script editor of the file you just created (and don't forget to save it!)
  • refresh the spreadsheet page and select the custom menu item: Create Sheet CHECK

The program is ready to go! All that is needed now is to enter the site URL and the code fragment you want to find on that page into the Check sheet, and run the script. The answer will be ready in a second!

You can find more information in this video (RU voice):


Web Scraping With Google Apps Script

The script below allows you to automatically find and download information from one of the freelance exchanges.

The domain name is taken from a Google Sheets page, and the search results are written back there.

The scraper() function, which contains two loops, acts as the control script. In the first (for) loop, we request the web pages of the site and save them in the html variable. In the second (while) loop, this variable is processed step by step using three auxiliary functions:

  • The getBlock function finds a part of the HTML code (a block) inside a tag (usually by a unique attribute value of that tag) and returns this block as a string;
  • The deleteBlock function, on the contrary, deletes the found fragment of HTML code inside the block and returns the remainder of the block as a string;
  • Unlike the first two, the getOpenTag function does not delete the found tag but returns it as a string - not the whole tag, though, only its first (opening) part.

function scraper() {
  
  const ss = SpreadsheetApp.getActiveSpreadsheet().getSheetByName('Sheet1');
  const urlDomain = ss.getRange(1, 1).getValue();
  
  let url = urlDomain;
  let count = 0;
  for (let page = 1; page < 5; page++) {
    url = urlDomain + page + '/';
    if (page == 1) url = urlDomain;
    let response = UrlFetchApp.fetch(url);
    ss.getRange(2, 1).setValue(response.getResponseCode());
    let html = response.getContentText();
    let p = 0;
    while (true) {
    
      let out = getBlock(html, 'div', html.indexOf('class="JobSearchCard-primary"', p));
      let block = out[0];
      p = out[1] + 1;
      if (p == 0) break;
      
      let title1 = getBlock(block, 'div', 0)[0];
      let title = getBlock(title1, 'a', 0)[0];
      
      let link = getOpenTag(title1, 'a', 0);
      link = getAttrName(link, 'href', 0)
      let formula = '=HYPERLINK("https://www.freelancer.com' +link + '", "' + title + '")';
      ss.getRange(3 + 3 * count, 2).setValue(formula);
      
      let price = getBlock(block, 'div', block.indexOf('class="JobSearchCard-primary-price'))[0];
      if (price.includes('span')) price =  deleteBlock(price, 'span', price.indexOf('span'));
      ss.getRange(3 + 3 * count + 1, 2).setValue(price).setHorizontalAlignment('right');
      
      let description = getBlock(block, 'p', block.indexOf('class="JobSearchCard-primary-description"'))[0];
      ss.getRange(3 + 3 * count, 1, 3).mergeVertically().setValue(description)
          .setBorder(true, true, true, true, null, null, '#000000', SpreadsheetApp.BorderStyle.SOLID)
          .setVerticalAlignment('middle')
          .setWrapStrategy(SpreadsheetApp.WrapStrategy.WRAP);
  
      ss.getRange(3 + 3 * count, 2, 3).setBorder(true, true, true, true, null, null, '#000000', SpreadsheetApp.BorderStyle.SOLID);
      
      let cat = getBlock(block, 'div', block.indexOf('class="JobSearchCard-primary-tags"'))[0];
      cat = cat.split('</a>').map(item => item.split('>')[1]);
      cat.pop();
      cat = cat.join(', ');
      ss.getRange(3 + 3 * count + 2, 2).setValue(cat);
    
      count++;
    
    }; 
  };
}

function getAttrName(html, attr, i) {
  let idxStart = html.indexOf(attr , i);
  if (idxStart == -1) return "Can't to find attr " + attr + ' !';
  idxStart = html.indexOf('"' , idxStart) + 1;
  let idxEnd = html.indexOf('"' , idxStart);
  return html.slice(idxStart,idxEnd).trim();
}

function getOpenTag(html, tag, idxStart) {
  let openTag = '<' + tag;
  let lenOpenTag = openTag.length;
  // where we are?
  if (html.slice(idxStart, idxStart + lenOpenTag) != openTag) {
    idxStart = html.lastIndexOf(openTag, idxStart);
    if (idxStart == -1) return "Can't to find openTag " + openTag + ' !';
  };
  // begin loop after openTag
  let idxEnd = html.indexOf('>', idxStart) + 1;
  if (idxStart == -1) return "Can't to find closing bracket '>' for openTag!";
  return html.slice(idxStart,idxEnd).trim();
}

function deleteBlock(html, tag, idxStart) { // delete opening & closing tag and info between them
  let openTag = '<' + tag;
  let lenOpenTag = openTag.length;
  let closeTag = '</' + tag + '>';
  let lenCloseTag = closeTag.length;
  let countCloseTags = 0;
  let iMax = html.length;
  let idxEnd = 0;
  // where we are?
  if (html.slice(idxStart, idxStart + lenOpenTag) != openTag) {
    idxStart = html.lastIndexOf(openTag, idxStart);
    if (idxStart == -1) return ["Can't to find openTag " + openTag + ' !', -1];
  };
  // begin the loop after the end of the opening tag
  let i = html.indexOf('>', idxStart) + 1;
  
  while (i <= iMax) {
    i++;
    if (i === iMax) {
      return ['Could not find closing tag for ' + tag, -1];
    };
    if (html[i] === '<'){
      if (html.slice(i, i + lenCloseTag) === closeTag) {
        if (countCloseTags === 0) {
          idxEnd = i + lenCloseTag;
          break;
        } else {
          countCloseTags -= 1;
        };
      } else if (html.slice(i, i + lenOpenTag) === openTag) {
        countCloseTags += 1;
      };
    };
  };
  return (html.slice(0, idxStart) + html.slice(idxEnd, iMax)).trim();
}

function getBlock(html, tag, idxStart) {  // <tag .... > Block </tag>
  let openTag = '<' + tag;
  let lenOpenTag = openTag.length;
  let closeTag = '</' + tag + '>';
  let lenCloseTag = closeTag.length;
  let countCloseTags = 0;
  let iMax = html.length;
  let idxEnd = 0;
  // where we are?
  if (html.slice(idxStart, idxStart + lenOpenTag) != openTag) {
    idxStart = html.lastIndexOf(openTag, idxStart);
    if (idxStart == -1) return ["Can't to find openTag " + openTag + ' !', -1];
  };
  // change start - will start after openTag!
  idxStart = html.indexOf('>', idxStart) + 1;
  let i = idxStart;
  
  while (i <= iMax) {
    i++;
    if (i === iMax) {
      return ['Could not find closing tag for ' + tag, -1];
    };
    if (html[i] === '<'){
      if (html.slice(i, i + lenCloseTag) === closeTag) {
        if (countCloseTags === 0) {
          idxEnd = i - 1;
          break;
        } else {
          countCloseTags -= 1;
        };
      } else if (html.slice(i, i + lenOpenTag) === openTag) {
        countCloseTags += 1;
      };
    };
  };
  return [html.slice(idxStart,idxEnd + 1).trim(), idxEnd];
}

You can find more information in this video (RU voice):

