Attention! If you're having trouble following the previous step, you can visit the relevant lesson, download the archive of the previous step, install it, and start this lesson exactly where the previous one ended!
In this course, we will go through all the stages of creating a new project together:
Of course, the most convenient (and, incidentally, more reliable) way would be for our site to interact with the supplier's site via an API. Indeed, this requires:
The Django framework allows us to organize such interaction on its own, without installing third-party packages. However, the Django REST framework (DRF) package handles this job best.
Nevertheless, in our case we will use another method: reading and extracting the necessary information directly from the HTML page. This is called scraping (parsing) the site.
Two popular Python libraries will be used for this purpose: beautifulsoup4 and requests. You can install them using two terminal commands:
pip install beautifulsoup4
pip install requests
Typically, data on a product page is grouped into blocks. Inside the blocks, the same type of data is under the same selectors (see figure):
If we download and parse an HTML page with a list of products, we can get a structured list of data. Specifically for our case, for each data block, we need to get the following dictionary:
{
'name': 'Труба профильная 40х20 2 мм 3м',
'image_url': 'https://my-website.com/30C39890-D527-427E-B573-504969456BF5.jpg',
'price': Decimal('493.00'),
'unit': 'за шт',
'code': '38140012'
}
The plan is ready - let's start its implementation!
Obviously, the script responsible for reading information from another site should be placed in a separate module of the shop application: shop/scraping.py. The scraping() function will be responsible for sending a request to URL_SCRAPING, reading the data, and writing it to the Product table of the project database.
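For reference, this assumes that the Product model built in the previous lessons has fields matching the keys of the dictionary shown above. A rough sketch of such a model might look like this (the exact field definitions in your project may differ):

from django.db import models


class Product(models.Model):
    # fields mirror the keys of the scraped data dictionary
    name = models.CharField(max_length=255)
    image_url = models.URLField()
    price = models.DecimalField(max_digits=10, decimal_places=2)
    unit = models.CharField(max_length=20)
    code = models.CharField(max_length=20, unique=True)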
First of all, we need to get the HTML code of the product data page for further processing. This task will be assigned to the requests module:
import requests


def scraping():
    URL_SCRAPING = 'https://www.some-site.com'
    resp = requests.get(URL_SCRAPING, timeout=10.0)
    if resp.status_code != 200:
        raise Exception('HTTP error access!')
    data_list = []
    html = resp.text
It makes sense to immediately see what we got. To do this, let's add code that counts the number of characters in the html string and prints it:
    html = resp.text
    print(f'HTML text consists of {len(html)} symbols')
    print(html)


if __name__ == '__main__':
    scraping()
The shop/scraping.py module does not require any Django settings (at least not yet), so you can run it like a regular Python script:
HTML text consists of 435395 symbols
<!DOCTYPE html>
<html lang="ru">
<head>
<link rel="shortcut icon" type="image/x-icon" href="/bitrix/templates/elektro_light/favicon_new.ico"/>
<meta name="robots" content="index, follow">
<meta name="keywords" content="Профильные трубы, уголки">
<meta name="description" content="Цены на профильные трубы, уголки от руб. Описание. Характеристики. Отзывы. Скидки на профильные трубы, уголки.">
<meta name="viewport" content="width=device-width, initial-scale=1.0 user-scalable=no"/>
<meta name="msapplication-TileColor" content="#ffffff">
As you can see, the result really looks like an HTML page.
The first part of the task is solved - access to the site data is obtained, and those 435,395 characters that are displayed on the screen contain all the information we need. All we now need is to simply extract this information and store the result in the database.
Further processing is most conveniently carried out using the beautifulsoup4 module. To do this, we first need to create a soup object, which represents the HTML document as a nested data structure:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
More information on how to get started with this package can be found in its documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#quick-start
You can also read more about beautifulsoup4 CSS selectors here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
On the supplier's page, we are most interested in the product block: a layout of repeating product cards with a similar data structure. You can get a list of such elements from the soup object using the select() method, which takes the CSS selector of the block as an argument. In our case it is class="catalog-item-card":
blocks = soup.select('.catalog-item-card')
In a loop, we can access each block and at the same time see what is inside the block object. This is what the modified code looks like:
    html = resp.text
    soup = BeautifulSoup(html, 'html.parser')
    blocks = soup.select('.catalog-item-card')
    for block in blocks:
        print(f'HTML text consists of {len(block.text)} symbols')
        print(50 * '=')
        print(block)
        break
And this is what the printed block looks like:
HTML text consists of 382 symbols
==================================================
<div class="catalog-item-card" itemprop="itemListElement" itemscope="" itemtype="http://schema.org/Product">
<div class="catalog-item-info">
<div class="item-all-title">
<a class="item-title" href="/catalog/profilnye_truby_ugolki/truba_profilnaya_40kh20_2_mm_3m/" itemprop="url" title="Труба профильная 40х20 2 мм 3м">
<span itemprop="name">Труба профильная 40х20 2 мм 3м</span>
</a>
As you can see, the number of text characters in the block has been reduced to 382, which greatly simplifies our task.
We can parse these blocks into the elements we need using the soup.select_one() method, which, unlike select(), returns not all elements that satisfy the condition (the method's argument) but only the first matching one. It is also worth remembering that the text of the element returned by select_one() can be extracted via its text attribute. Thus, applying this method with the appropriate arguments, we fill almost the entire data dictionary, with the exception of the code field:
    # re and Decimal are imported at the top of shop/scraping.py:
    # import re
    # from decimal import Decimal
    soup = BeautifulSoup(html, 'html.parser')
    blocks = soup.select('.catalog-item-card')
    for block in blocks:
        """{
        'name': 'Труба профильная 40х20 2 мм 3м',
        'image_url': 'https://my-website.com/30C39890-D527-427E-B573-504969456BF5.jpg',
        'price': Decimal('493.00'),
        'unit': 'за шт',
        'code': '38140012'
        }
        """
        data = {}
        name = block.select_one('.item-title[title]').get_text().strip()
        data['name'] = name
        image_url = URL_SCRAPING_DOMAIN + block.select_one('img')['src']
        data['image_url'] = image_url
        price_raw = block.select_one('.item-price').text
        # '\r\n \t\t\t\t\t\t\t\t\t\t\t\t\t\t493.00\t\t\t\t\t\t\t\t\t\t\t\t руб. '
        price = re.findall(r'\d+\.\d+', price_raw)[0]  # extract the numeric part of the price
        price = Decimal(price)
        data['price'] = price  # Decimal('493.00')
        unit = block.select_one('.unit').text.strip()
        # '\r\n \t\t\t\t\t\t\t\t\t\t\t\t\t\tза шт\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'
        data['unit'] = unit  # 'за шт'
Upon closer examination of the product block, it turned out that some of the product information is located on another page of the supplier's website: the product detail page. The link to this page is contained in the block object itself.
Therefore, we have to repeat here the same algorithm that we used a few steps ago to get the data for the block object:
        # find and open the detail page
        url_detail = block.select_one('.item-title')
        # <a class="item-title" href="/catalog/profilnye_truby_ugolki/truba_profilnaya_40kh20_2_mm_3m/" itemprop="url" title="Труба профильная 40х20 2 мм 3м">
        url_detail = url_detail['href']
        # '/catalog/profilnye_truby_ugolki/truba_profilnaya_40kh20_2_mm_3m/'
        url_detail = URL_SCRAPING_DOMAIN + url_detail
        html_detail = requests.get(url_detail, timeout=10.0).text
        soup_detail = BeautifulSoup(html_detail, 'html.parser')
        code_block = soup_detail.select_one('.catalog-detail-property')
        code = code_block.select_one('b').text
        data['code'] = code
        data_list.append(data)
        print(data)
If we do everything right, we will end up with a list of dictionaries with data for each block.
The success of site scraping depends on a number of parameters and circumstances, most of which do not depend on our Django code: the request may time out, the remote server may return an error status, or something else may go wrong on the network. To handle these situations explicitly, let's define our own exception hierarchy for scraping errors:
class ScrapingError(Exception):
    pass


class ScrapingTimeoutError(ScrapingError):
    pass


class ScrapingHTTPError(ScrapingError):
    pass


class ScrapingOtherError(ScrapingError):
    pass
And then we make changes to the code:
    try:
        resp = requests.get(URL_SCRAPING, timeout=10.0)
    except requests.exceptions.Timeout:
        raise ScrapingTimeoutError("request timed out")
    except Exception as e:
        raise ScrapingOtherError(f'{e}')
    if resp.status_code != 200:
        raise ScrapingHTTPError(f"HTTP {resp.status_code}: {resp.text}")
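The part of scraping() that writes the collected data to the Product table is not shown above. Here is a minimal sketch of that step, assuming the Product model fields match the keys of the data dictionary; the save_products() helper name is illustrative, and in the lesson this logic sits at the end of scraping() itself:

from shop.models import Product


def save_products(data_list):
    """Write scraped data to the Product table, skipping codes that already exist."""
    for data in data_list:
        # add a product only if its code is not in the database yet
        if not Product.objects.filter(code=data['code']).exists():
            Product.objects.create(**data)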
As you can see, a product is added only if it is not already in the database; the check is performed by the product code (the code field).
Although the scraping() function already writes the data to the database, we still return the data_list list, just in case.
However, if we now try to run this script, we will get an error:
"/home/su/Projects/django-apps/Projects/drop-ship-store/venv/lib/python3.8/site-packages/django/conf/__init__.py", line 67, in _setup
raise ImproperlyConfigured(
django.core.exceptions.ImproperlyConfigured: Requested setting INSTALLED_APPS, but settings are not configured. You must either define the environment variable DJANGO_SETTINGS_MODULE or call settings.configure() before accessing settings.
Process finished with exit code 1
The thing is that the script now accesses the database, which means the Django settings must be loaded. One way is to run this code as a custom management command in management/commands (more on this here: https://docs.djangoproject.com/en/4.0/howto/custom-management-commands/). But we will do otherwise: we will immediately add a launch page and check the operation of the scraping() function there.
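For reference, a minimal sketch of such a management command might look like this (the scrape.py file name is an assumption, not part of the lesson); it would then be run with python manage.py scrape:

# shop/management/commands/scrape.py
# (both shop/management/ and shop/management/commands/ need an empty __init__.py)
from django.core.management.base import BaseCommand

from shop.scraping import scraping


class Command(BaseCommand):
    help = 'Scrape the supplier site and update the Product table'

    def handle(self, *args, **options):
        data_list = scraping()
        self.stdout.write(self.style.SUCCESS(f'{len(data_list)} products processed'))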
The algorithm for adding a new page remains the same:
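As in the previous lessons, this means adding a view, registering a URL for it, and, if needed, a template. A minimal sketch of such a launch page might look like this (the view name, URL pattern, and response text are illustrative assumptions):

# shop/views.py
from django.http import HttpResponse

from .scraping import scraping


def scraping_view(request):
    data_list = scraping()
    return HttpResponse(f'Scraping finished: {len(data_list)} products processed')


# shop/urls.py
from django.urls import path

from . import views

urlpatterns = [
    path('scraping/', views.scraping_view, name='scraping'),
]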
If, after successfully completing all these steps, we go to the admin panel, we will see that after running the script the Product table is filled with new values:
Now everything is ready for the last step: adding the online store pages themselves and the code that will manage them. But we will deal with that in the next, seventh and final, lesson.
You can learn more about all the details of this stage from this video (RU voice):
The method of parsing (scraping) discussed in the article Parsing (scraping) using Google Apps Script is incredibly simple and easy to use. There is no need for complex programs, high qualifications, special knowledge, or a lot of experience. All the code is written, stored, and run directly in the Google Sheets script editor.
In addition, the script can be run on a schedule, 24/7, and the result can be processed right there and, if necessary, sent by email, or saved as a new file with the received and processed data on Google Drive.
However, there can be problems here. First of all, this script does not work with sites (more precisely, with pages) that require prior registration: to log in, you need to create a session and store its identifier in cookies, and the standard UrlFetchApp function used in Apps Script to access websites does not manage cookies automatically.
There is another problem that has been showing up more and more often recently: the unwillingness of site owners to share information with bots. As a result, the HTML returned by the server in response to the script's request often differs significantly from the HTML served to a browser.
This program was written precisely to quickly find out whether the site we need can be parsed (scraped) or not:
function onOpen() {
  var ui = SpreadsheetApp.getUi();
  // Or DocumentApp or FormApp.
  ui.createMenu('Custom Menu')
      .addItem('Check', 'fastAnalyzer')
      .addSeparator()
      .addItem('Create Sheet CHECK', 'createSheet')
      .addToUi();
}
function fastAnalyzer() {
  const ss = SpreadsheetApp.getActiveSpreadsheet().getSheetByName('Check');
  const urlDomain = ss.getRange(2, 1).getValue();
  let url = urlDomain;
  let count = 0;
  let response = UrlFetchApp.fetch(url);
  let lll = response.getContentText().length;
  ss.getRange(2, 2).setValue(response.getResponseCode());
  ss.getRange(2, 3).setValue(lll);
  let html = response.getContentText();
  let whatLookingFor = ss.getRange(3, 2).getValue();
  let pos = html.indexOf(whatLookingFor);
  ss.getRange(3, 3).setValue(pos);
  ss.getRange(4, 1, ss.getLastRow(), 2).clearContent();
  let totalNumber = html.length;
  let i = 0;
  // Check the default value of delta - the number of symbols in one row
  // If the cell is empty, use the default value = 5000
  let delta = Number.parseInt(ss.getRange(2, 4).getValue());
  if (isNaN(delta)) {
    delta = 5000;
    // Print the default number of symbols in one row
    ss.getRange(2, 4).setValue(delta);
  }
  let iStart = 0;
  while (true) {
    let iEnd = iStart + delta;
    if (iEnd > totalNumber) iEnd = totalNumber;
    ss.getRange(4 + i, 1).setValue("Символы от " + iStart + " до " + iEnd);
    ss.getRange(4 + i, 2).setValue(html.slice(iStart, iEnd));
    // let currentRow = (4 + i) + ":" + (4 + i);
    // ss.getRange(3, 3).setValue(currentRow);
    // ss.getRange(currentRow).activate();
    // ss.setRowHeight(4 + i, 200);
    i++;
    if (iEnd - iStart < delta) break;
    iStart = iEnd;
  }
}
function createSheet() {
  // does the Check sheet exist?
  let s = SpreadsheetApp.getActiveSpreadsheet();
  // create it if it does not exist
  if (!s.getSheetByName('Check')) {
    s.insertSheet('Check', 0);
    const ssc = s.getSheetByName('Check');
    ssc.getRange(1, 1).setValue("Url сайта");
    ssc.getRange(1, 2).setValue("Код ответа сайта");
    ssc.getRange(1, 3).setValue("Число символов на выбранной странице");
    ssc.getRange(1, 4).setValue("Изменить вывод числа знаков в одной строке");
    ssc.getRange(2, 1).setBackground('yellow'); // yellow background cell
    ssc.getRange(3, 1).setValue("Какой элемент кода ищем?");
    ssc.getRange(3, 2).setBackground('yellow'); // yellow background cell
  }
}
So, all you need to work with this program is
The program is ready to go! All that is needed now is to add the site URL to the Google Sheets page, specify the code element you want to find on that page, and run the script. The answer will be ready in a second!
You can find more information in this video (RU voice):
The script below allows you to automatically find and download information from one of the freelance exchanges.
The domain name is taken from a Google Sheets page, and the search results are uploaded there as well.
The scraper() function, which contains two loops, serves as the controlling script. In the first loop (a for loop), we request the site's web pages and save each one in the html variable. In the second loop (a while loop), this variable is processed sequentially using the auxiliary functions getBlock(), getOpenTag(), getAttrName(), and deleteBlock():
function scraper() {
  const ss = SpreadsheetApp.getActiveSpreadsheet().getSheetByName('Sheet1');
  const urlDomain = ss.getRange(1, 1).getValue();
  let url = urlDomain;
  let count = 0;
  for (let page = 1; page < 5; page++) {
    url = urlDomain + page + '/';
    if (page == 1) url = urlDomain;
    let response = UrlFetchApp.fetch(url);
    ss.getRange(2, 1).setValue(response.getResponseCode());
    let html = response.getContentText();
    let p = 0;
    while (true) {
      let out = getBlock(html, 'div', html.indexOf('class="JobSearchCard-primary"', p));
      let block = out[0];
      p = out[1] + 1;
      if (p == 0) break;
      let title1 = getBlock(block, 'div', 0)[0];
      let title = getBlock(title1, 'a', 0)[0];
      let link = getOpenTag(title1, 'a', 0);
      link = getAttrName(link, 'href', 0);
      let formula = '=HYPERLINK("https://www.freelancer.com' + link + '", "' + title + '")';
      ss.getRange(3 + 3 * count, 2).setValue(formula);
      let price = getBlock(block, 'div', block.indexOf('class="JobSearchCard-primary-price'))[0];
      if (price.includes('span')) price = deleteBlock(price, 'span', price.indexOf('span'));
      ss.getRange(3 + 3 * count + 1, 2).setValue(price).setHorizontalAlignment('right');
      let description = getBlock(block, 'p', block.indexOf('class="JobSearchCard-primary-description"'))[0];
      ss.getRange(3 + 3 * count, 1, 3).mergeVertically().setValue(description)
          .setBorder(true, true, true, true, null, null, '#000000', SpreadsheetApp.BorderStyle.SOLID)
          .setVerticalAlignment('middle')
          .setWrapStrategy(SpreadsheetApp.WrapStrategy.WRAP);
      ss.getRange(3 + 3 * count, 2, 3).setBorder(true, true, true, true, null, null, '#000000', SpreadsheetApp.BorderStyle.SOLID);
      let cat = getBlock(block, 'div', block.indexOf('class="JobSearchCard-primary-tags"'))[0];
      cat = cat.split('</a>').map(item => item.split('>')[1]);
      cat.pop();
      cat = cat.join(', ');
      ss.getRange(3 + 3 * count + 2, 2).setValue(cat);
      count++;
    }
  }
}
function getAttrName(html, attr, i) {
  let idxStart = html.indexOf(attr, i);
  if (idxStart == -1) return "Can't find attr " + attr + ' !';
  idxStart = html.indexOf('"', idxStart) + 1;
  let idxEnd = html.indexOf('"', idxStart);
  return html.slice(idxStart, idxEnd).trim();
}
function getOpenTag(html, tag, idxStart) {
  let openTag = '<' + tag;
  let lenOpenTag = openTag.length;
  // where are we?
  if (html.slice(idxStart, idxStart + lenOpenTag) != openTag) {
    idxStart = html.lastIndexOf(openTag, idxStart);
    if (idxStart == -1) return "Can't find openTag " + openTag + ' !';
  }
  // find the closing bracket of the opening tag
  let idxEnd = html.indexOf('>', idxStart) + 1;
  if (idxEnd == 0) return "Can't find closing bracket '>' for openTag!";
  return html.slice(idxStart, idxEnd).trim();
}
function deleteBlock(html, tag, idxStart) { // delete opening & closing tag and everything between them
  let openTag = '<' + tag;
  let lenOpenTag = openTag.length;
  let closeTag = '</' + tag + '>';
  let lenCloseTag = closeTag.length;
  let countCloseTags = 0;
  let iMax = html.length;
  let idxEnd = 0;
  // where are we?
  if (html.slice(idxStart, idxStart + lenOpenTag) != openTag) {
    idxStart = html.lastIndexOf(openTag, idxStart);
    if (idxStart == -1) return ["Can't find openTag " + openTag + ' !', -1];
  }
  // begin the loop after the opening tag
  let i = html.indexOf('>', idxStart) + 1;
  while (i <= iMax) {
    i++;
    if (i === iMax) {
      return ['Could not find closing tag for ' + tag, -1];
    }
    if (html[i] === '<') {
      if (html.slice(i, i + lenCloseTag) === closeTag) {
        if (countCloseTags === 0) {
          idxEnd = i + lenCloseTag;
          break;
        } else {
          countCloseTags -= 1;
        }
      } else if (html.slice(i, i + lenOpenTag) === openTag) {
        countCloseTags += 1;
      }
    }
  }
  return (html.slice(0, idxStart) + html.slice(idxEnd, iMax)).trim();
}
function getBlock(html, tag, idxStart) { // <tag .... > Block </tag>
  let openTag = '<' + tag;
  let lenOpenTag = openTag.length;
  let closeTag = '</' + tag + '>';
  let lenCloseTag = closeTag.length;
  let countCloseTags = 0;
  let iMax = html.length;
  let idxEnd = 0;
  // where are we?
  if (html.slice(idxStart, idxStart + lenOpenTag) != openTag) {
    idxStart = html.lastIndexOf(openTag, idxStart);
    if (idxStart == -1) return ["Can't find openTag " + openTag + ' !', -1];
  }
  // change the start - the block content begins after the opening tag!
  idxStart = html.indexOf('>', idxStart) + 1;
  let i = idxStart;
  while (i <= iMax) {
    i++;
    if (i === iMax) {
      return ['Could not find closing tag for ' + tag, -1];
    }
    if (html[i] === '<') {
      if (html.slice(i, i + lenCloseTag) === closeTag) {
        if (countCloseTags === 0) {
          idxEnd = i - 1;
          break;
        } else {
          countCloseTags -= 1;
        }
      } else if (html.slice(i, i + lenOpenTag) === openTag) {
        countCloseTags += 1;
      }
    }
  }
  return [html.slice(idxStart, idxEnd + 1).trim(), idxEnd];
}
You can find more information in this video (RU voice):