Configuring the Scraping Browser
How do I configure the Scraping Browser?
Before using the Scraping Browser, some configuration is required.
This guide walks you through setting your credentials in the console, configuring the Scraping Browser, running the sample scripts, and working with a live browser session. Follow the steps below to use the Scraping Browser effectively for web scraping.
Before you start using the Scraping Browser, obtain your credentials - the username and password you will use with the automation tool. We assume you already have valid credentials; if not, get them from ABCproxy.
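In the samples below, the credentials are embedded directly in the connection URL. If your password contains URL-reserved characters such as `@` or `/`, percent-encode it first so the URL still parses. A minimal sketch, assuming the endpoint host used in the samples below; `build_ws_endpoint` is a hypothetical helper, not part of the product API:

```python
from urllib.parse import quote

def build_ws_endpoint(username: str, password: str) -> str:
    # Percent-encode both parts so characters like '@' or '/' in the
    # password cannot break URL parsing; host taken from the samples below.
    user = quote(username, safe='')
    pwd = quote(password, safe='')
    return f'wss://{user}:{pwd}@upg-scbr.abcproxy.com'

print(build_ws_endpoint('PROXY-FULL-ACCOUNT', 'PASSWORD'))
# -> wss://PROXY-FULL-ACCOUNT:PASSWORD@upg-scbr.abcproxy.com
```

The resulting string can be passed wherever the samples below use `SBR_WS_SERVER` or `SBR_WS_ENDPOINT`.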
Sample Code
We provide several scraping samples to help you get started with our Scraping Browser more efficiently. Simply substitute your credentials and target URL, then customize the script for your business scenario.
To run a script in your local environment, refer to the samples below. Make sure the required dependencies are installed locally, remember to configure your credentials, and then execute the script to fetch the data you need.
If the page you visit presents a CAPTCHA or other verification challenge, don't worry - we handle it for you seamlessly.
import asyncio
from playwright.async_api import async_playwright

# Enter your credentials - the username and password
AUTH = 'PROXY-FULL-ACCOUNT:PASSWORD'
SBR_WS_SERVER = f'wss://{AUTH}@upg-scbr.abcproxy.com'

async def run(pw):
    print('Connecting to Scraping Browser...')
    browser = await pw.chromium.connect_over_cdp(SBR_WS_SERVER)
    try:
        print('Connected! Navigating to target...')
        page = await browser.new_page()
        await page.goto('https://example.com', timeout=2 * 60 * 1000)
        # Screenshot
        print('Taking screenshot of page')
        await page.screenshot(path='./remote_screenshot_page.png')
        # HTML content
        print('Scraping page content...')
        html = await page.content()
        print(html)
    finally:
        # Be sure to close the browser so the session is released
        await browser.close()

async def main():
    async with async_playwright() as playwright:
        await run(playwright)

if __name__ == '__main__':
    asyncio.run(main())
from selenium.webdriver import Remote, ChromeOptions
from selenium.webdriver.chromium.remote_connection import ChromiumRemoteConnection
from selenium.webdriver.common.by import By

# Enter your credentials - the zone name and password
AUTH = 'PROXY-FULL-ACCOUNT:PASSWORD'
REMOTE_WEBDRIVER = f'https://{AUTH}@hs-scbr.abcproxy.com'

def main():
    print('Connecting to Scraping Browser...')
    sbr_connection = ChromiumRemoteConnection(REMOTE_WEBDRIVER, 'goog', 'chrome')
    with Remote(sbr_connection, options=ChromeOptions()) as driver:
        # Navigate to the target URL
        print('Connected! Navigating to target...')
        driver.get('https://example.com')
        # Screenshot
        print('Saving screenshot to ./remote_page.png')
        driver.get_screenshot_as_file('./remote_page.png')
        # HTML content
        print('Getting page content...')
        html = driver.page_source
        print(html)

if __name__ == '__main__':
    main()
const puppeteer = require('puppeteer-core');

const AUTH = 'PROXY-FULL-ACCOUNT:PASSWORD';
const SBR_WS_ENDPOINT = `wss://${AUTH}@upg-scbr.abcproxy.com`;

(async () => {
    console.log('Connecting to Scraping Browser...');
    const browser = await puppeteer.connect({
        browserWSEndpoint: SBR_WS_ENDPOINT,
        defaultViewport: { width: 1920, height: 1080 }
    });
    try {
        console.log('Connected! Navigating to target URL...');
        const page = await browser.newPage();
        await page.goto('https://example.com', { timeout: 2 * 60 * 1000 });
        // 1. Screenshot
        console.log('Saving screenshot to remote_screenshot.png');
        await page.screenshot({ path: 'remote_screenshot.png' });
        console.log('Screenshot saved');
        // 2. Get content
        console.log('Getting page content...');
        const html = await page.content();
        console.log('Source HTML: ', html);
    } finally {
        // Be sure to close the browser after the script finishes
        await browser.close();
    }
})();
const pw = require('playwright');

const AUTH = 'PROXY-FULL-ACCOUNT:PASSWORD';
const SBR_CDP = `wss://${AUTH}@upg-scbr.abcproxy.com`;

async function main() {
    console.log('Connecting to Scraping Browser...');
    const browser = await pw.chromium.connectOverCDP(SBR_CDP);
    try {
        console.log('Connected! Navigating to target...');
        const page = await browser.newPage();
        // Target URL
        await page.goto('https://www.windows.com', { timeout: 2 * 60 * 1000 });
        // Screenshot
        console.log('Taking screenshot of page');
        await page.screenshot({ path: './remote_screenshot_page.png' });
        // HTML content
        console.log('Scraping page content...');
        const html = await page.content();
        console.log(html);
    } finally {
        // Be sure to close the browser after the script finishes
        await browser.close();
    }
}

if (require.main === module) {
    main().catch(err => {
        console.error(err.stack || err);
        process.exit(1);
    });
}
Scraping Browser Initial Navigation and Workflow Management
The Scraping Browser session architecture allows exactly one initial navigation per session.
The initial navigation is the first load of the target site from which data will be extracted. After this initial phase, you can browse the site freely within the same session via clicks, scrolls, and other interactions. However, to start a new scraping task from the initial navigation phase - whether on the same site or a different one - you must create a new session.
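The one-initial-navigation rule can be pictured as a small guard on the session state. This is an illustrative sketch only; the `SessionGuard` class is hypothetical and not part of the product API:

```python
class SessionGuard:
    """Illustrates the session rule: one initial navigation per session."""

    def __init__(self):
        self.navigated = False

    def navigate(self, url: str) -> str:
        # The first goto in a session is the initial navigation;
        # a second one requires opening a brand-new session.
        if self.navigated:
            raise RuntimeError(f'Initial navigation already used; open a new session for {url}')
        self.navigated = True
        return f'loaded {url}'

    def interact(self, action: str) -> str:
        # Clicks, scrolls, etc. are allowed freely after the initial navigation.
        if not self.navigated:
            raise RuntimeError('Navigate first')
        return f'performed {action}'
```

In practice this means one `page.goto(...)` (or `driver.get(...)`) per connection in the samples above; to scrape a second site, close the browser and connect again.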
Session Time Limits
1. Whatever method you use, be aware of the session timeout. If your script does not explicitly close the browser session, the system terminates it automatically after at most 60 minutes.
2. When using the Scraping Browser through the web console, the system strictly enforces one active session per account. For the best performance and experience, always close the browser session explicitly in your script.
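Because an unclosed session lingers until the 60-minute server-side timeout and counts against the one-active-session limit, it helps to guarantee the close in one place. A minimal sketch with a generic context manager; `scraping_session` is a hypothetical wrapper (shown here with stand-in connect/close callables, not the real browser API):

```python
import contextlib

@contextlib.contextmanager
def scraping_session(connect, close):
    # Guarantees close() runs even if scraping raises, mirroring the
    # try/finally pattern used in the samples above.
    session = connect()
    try:
        yield session
    finally:
        close(session)

# Usage with stand-in connect/close functions:
events = []
with scraping_session(lambda: 'session-1', lambda s: events.append(('closed', s))) as s:
    events.append(('using', s))
print(events)
# -> [('using', 'session-1'), ('closed', 'session-1')]
```

With a real client you would pass the connect and close calls of your chosen library (e.g. Playwright's `connect_over_cdp` and `browser.close`) in place of the stand-ins.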