KitPloit - PenTest Tools!

Firecrawl-Mcp-Server - Official Firecrawl MCP Server - Adds Powerful Web Scraping To Cursor, Claude And Any Other LLM Clients

By: Unknown β€” May 6th 2025 at 12:30


A Model Context Protocol (MCP) server implementation that integrates with Firecrawl for web scraping capabilities.

Big thanks to @vrknetha, @cawstudios for the initial implementation!

You can also play around with our MCP Server on MCP.so's playground. Thanks to MCP.so for hosting and @gstarwd for integrating our server.


Features

  • Scrape, crawl, search, extract, deep research and batch scrape support
  • Web scraping with JS rendering
  • URL discovery and crawling
  • Web search with content extraction
  • Automatic retries with exponential backoff
  • Efficient batch processing with built-in rate limiting
  • Credit usage monitoring for cloud API
  • Comprehensive logging system
  • Support for cloud and self-hosted Firecrawl instances
  • Mobile/Desktop viewport support
  • Smart content filtering with tag inclusion/exclusion

Installation

Running with npx

env FIRECRAWL_API_KEY=fc-YOUR_API_KEY npx -y firecrawl-mcp

Manual Installation

npm install -g firecrawl-mcp

Running on Cursor

Configuring Cursor πŸ–₯️

Note: Requires Cursor version 0.45.6+. For the most up-to-date configuration instructions, please refer to the official Cursor documentation on configuring MCP servers: Cursor MCP Server Configuration Guide.

To configure Firecrawl MCP in Cursor v0.45.6

  1. Open Cursor Settings
  2. Go to Features > MCP Servers
  3. Click "+ Add New MCP Server"
  4. Enter the following:
     • Name: "firecrawl-mcp" (or your preferred name)
     • Type: "command"
     • Command: env FIRECRAWL_API_KEY=your-api-key npx -y firecrawl-mcp

To configure Firecrawl MCP in Cursor v0.48.6

  1. Open Cursor Settings
  2. Go to Features > MCP Servers
  3. Click "+ Add new global MCP server"
  4. Enter the following code:

{
  "mcpServers": {
    "firecrawl-mcp": {
      "command": "npx",
      "args": ["-y", "firecrawl-mcp"],
      "env": {
        "FIRECRAWL_API_KEY": "YOUR-API-KEY"
      }
    }
  }
}

If you are using Windows and are running into issues, try cmd /c "set FIRECRAWL_API_KEY=your-api-key && npx -y firecrawl-mcp"

Replace your-api-key with your Firecrawl API key. If you don't have one yet, you can create an account and get it from https://www.firecrawl.dev/app/api-keys

After adding, refresh the MCP server list to see the new tools. The Composer Agent will automatically use Firecrawl MCP when appropriate, but you can explicitly request it by describing your web scraping needs. Access the Composer via Command+L (Mac), select "Agent" next to the submit button, and enter your query.

Running on Windsurf

Add this to your ./codeium/windsurf/model_config.json:

{
  "mcpServers": {
    "mcp-server-firecrawl": {
      "command": "npx",
      "args": ["-y", "firecrawl-mcp"],
      "env": {
        "FIRECRAWL_API_KEY": "YOUR_API_KEY"
      }
    }
  }
}

Installing via Smithery (Legacy)

To install Firecrawl for Claude Desktop automatically via Smithery:

npx -y @smithery/cli install @mendableai/mcp-server-firecrawl --client claude

Configuration

Environment Variables

Required for Cloud API

  • FIRECRAWL_API_KEY: Your Firecrawl API key
    • Required when using the cloud API (default)
    • Optional when using a self-hosted instance with FIRECRAWL_API_URL
  • FIRECRAWL_API_URL (Optional): Custom API endpoint for self-hosted instances
    • Example: https://firecrawl.your-domain.com
    • If not provided, the cloud API will be used (requires an API key)

Optional Configuration

Retry Configuration

  • FIRECRAWL_RETRY_MAX_ATTEMPTS: Maximum number of retry attempts (default: 3)
  • FIRECRAWL_RETRY_INITIAL_DELAY: Initial delay in milliseconds before first retry (default: 1000)
  • FIRECRAWL_RETRY_MAX_DELAY: Maximum delay in milliseconds between retries (default: 10000)
  • FIRECRAWL_RETRY_BACKOFF_FACTOR: Exponential backoff multiplier (default: 2)

Credit Usage Monitoring

  • FIRECRAWL_CREDIT_WARNING_THRESHOLD: Credit usage warning threshold (default: 1000)
  • FIRECRAWL_CREDIT_CRITICAL_THRESHOLD: Credit usage critical threshold (default: 100)
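
One way to read these two thresholds (an illustrative sketch only; the MCP server itself is a Node.js package, and the exact check it performs may differ):

# Illustrative reading of the credit thresholds; not the server's actual code.
def credit_level(credits_remaining: int,
                 warning_threshold: int = 1000,
                 critical_threshold: int = 100) -> str:
    if credits_remaining <= critical_threshold:
        return "critical"  # triggers a critical alert
    if credits_remaining <= warning_threshold:
        return "warning"   # triggers a warning
    return "ok"

print(credit_level(1500), credit_level(800), credit_level(50))  # ok warning critical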

Configuration Examples

For cloud API usage with custom retry and credit monitoring:

# Required for cloud API
export FIRECRAWL_API_KEY=your-api-key

# Optional retry configuration
export FIRECRAWL_RETRY_MAX_ATTEMPTS=5 # Increase max retry attempts
export FIRECRAWL_RETRY_INITIAL_DELAY=2000 # Start with 2s delay
export FIRECRAWL_RETRY_MAX_DELAY=30000 # Maximum 30s delay
export FIRECRAWL_RETRY_BACKOFF_FACTOR=3 # More aggressive backoff

# Optional credit monitoring
export FIRECRAWL_CREDIT_WARNING_THRESHOLD=2000 # Warning at 2000 credits
export FIRECRAWL_CREDIT_CRITICAL_THRESHOLD=500 # Critical at 500 credits

For self-hosted instance:

# Required for self-hosted
export FIRECRAWL_API_URL=https://firecrawl.your-domain.com

# Optional authentication for self-hosted
export FIRECRAWL_API_KEY=your-api-key # If your instance requires auth

# Custom retry configuration
export FIRECRAWL_RETRY_MAX_ATTEMPTS=10
export FIRECRAWL_RETRY_INITIAL_DELAY=500 # Start with faster retries

Usage with Claude Desktop

Add this to your claude_desktop_config.json:

{
  "mcpServers": {
    "mcp-server-firecrawl": {
      "command": "npx",
      "args": ["-y", "firecrawl-mcp"],
      "env": {
        "FIRECRAWL_API_KEY": "YOUR_API_KEY_HERE",

        "FIRECRAWL_RETRY_MAX_ATTEMPTS": "5",
        "FIRECRAWL_RETRY_INITIAL_DELAY": "2000",
        "FIRECRAWL_RETRY_MAX_DELAY": "30000",
        "FIRECRAWL_RETRY_BACKOFF_FACTOR": "3",

        "FIRECRAWL_CREDIT_WARNING_THRESHOLD": "2000",
        "FIRECRAWL_CREDIT_CRITICAL_THRESHOLD": "500"
      }
    }
  }
}

System Configuration

The server includes several configurable parameters that can be set via environment variables. Here are the default values if not configured:

const CONFIG = {
  retry: {
    maxAttempts: 3,     // Number of retry attempts for rate-limited requests
    initialDelay: 1000, // Initial delay before first retry (in milliseconds)
    maxDelay: 10000,    // Maximum delay between retries (in milliseconds)
    backoffFactor: 2,   // Multiplier for exponential backoff
  },
  credit: {
    warningThreshold: 1000, // Warn when credit usage reaches this level
    criticalThreshold: 100, // Critical alert when credit usage reaches this level
  },
};

These configurations control:

  1. Retry Behavior
     • Automatically retries failed requests due to rate limits
     • Uses exponential backoff to avoid overwhelming the API
     • Example: with default settings, retries will be attempted at:
       • 1st retry: 1 second delay
       • 2nd retry: 2 seconds delay
       • 3rd retry: 4 seconds delay (capped at maxDelay)
  2. Credit Usage Monitoring
     • Tracks API credit consumption for cloud API usage
     • Provides warnings at specified thresholds
     • Helps prevent unexpected service interruption
     • Example: with default settings:
       • Warning at 1000 credits remaining
       • Critical alert at 100 credits remaining

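The retry schedule above follows a standard exponential-backoff formula. Here is a minimal sketch of how those delays are derived from the documented defaults (illustrative only; the server itself is written in Node.js):

# Illustrative exponential-backoff schedule using the documented defaults.
def retry_delay_ms(attempt: int,
                   initial_delay: int = 1000,
                   backoff_factor: float = 2,
                   max_delay: int = 10000) -> float:
    """Delay before the given retry attempt (1-based), capped at max_delay."""
    return min(initial_delay * backoff_factor ** (attempt - 1), max_delay)

for attempt in (1, 2, 3):
    print(f"retry {attempt}: {retry_delay_ms(attempt) / 1000:g}s")
# retry 1: 1s
# retry 2: 2s
# retry 3: 4s
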
Rate Limiting and Batch Processing

The server utilizes Firecrawl's built-in rate limiting and batch processing capabilities:

  • Automatic rate limit handling with exponential backoff
  • Efficient parallel processing for batch operations
  • Smart request queuing and throttling
  • Automatic retries for transient errors

Available Tools

1. Scrape Tool (firecrawl_scrape)

Scrape content from a single URL with advanced options.

{
  "name": "firecrawl_scrape",
  "arguments": {
    "url": "https://example.com",
    "formats": ["markdown"],
    "onlyMainContent": true,
    "waitFor": 1000,
    "timeout": 30000,
    "mobile": false,
    "includeTags": ["article", "main"],
    "excludeTags": ["nav", "footer"],
    "skipTlsVerification": false
  }
}

2. Batch Scrape Tool (firecrawl_batch_scrape)

Scrape multiple URLs efficiently with built-in rate limiting and parallel processing.

{
  "name": "firecrawl_batch_scrape",
  "arguments": {
    "urls": ["https://example1.com", "https://example2.com"],
    "options": {
      "formats": ["markdown"],
      "onlyMainContent": true
    }
  }
}

Response includes operation ID for status checking:

{
  "content": [
    {
      "type": "text",
      "text": "Batch operation queued with ID: batch_1. Use firecrawl_check_batch_status to check progress."
    }
  ],
  "isError": false
}

3. Check Batch Status (firecrawl_check_batch_status)

Check the status of a batch operation.

{
  "name": "firecrawl_check_batch_status",
  "arguments": {
    "id": "batch_1"
  }
}

4. Search Tool (firecrawl_search)

Search the web and optionally extract content from search results.

{
  "name": "firecrawl_search",
  "arguments": {
    "query": "your search query",
    "limit": 5,
    "lang": "en",
    "country": "us",
    "scrapeOptions": {
      "formats": ["markdown"],
      "onlyMainContent": true
    }
  }
}

5. Crawl Tool (firecrawl_crawl)

Start an asynchronous crawl with advanced options.

{
  "name": "firecrawl_crawl",
  "arguments": {
    "url": "https://example.com",
    "maxDepth": 2,
    "limit": 100,
    "allowExternalLinks": false,
    "deduplicateSimilarURLs": true
  }
}

6. Extract Tool (firecrawl_extract)

Extract structured information from web pages using LLM capabilities. Supports both cloud AI and self-hosted LLM extraction.

{
  "name": "firecrawl_extract",
  "arguments": {
    "urls": ["https://example.com/page1", "https://example.com/page2"],
    "prompt": "Extract product information including name, price, and description",
    "systemPrompt": "You are a helpful assistant that extracts product information",
    "schema": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "price": { "type": "number" },
        "description": { "type": "string" }
      },
      "required": ["name", "price"]
    },
    "allowExternalLinks": false,
    "enableWebSearch": false,
    "includeSubdomains": false
  }
}

Example response:

{
  "content": [
    {
      "type": "text",
      "text": {
        "name": "Example Product",
        "price": 99.99,
        "description": "This is an example product description"
      }
    }
  ],
  "isError": false
}

Extract Tool Options:

  • urls: Array of URLs to extract information from
  • prompt: Custom prompt for the LLM extraction
  • systemPrompt: System prompt to guide the LLM
  • schema: JSON schema for structured data extraction
  • allowExternalLinks: Allow extraction from external links
  • enableWebSearch: Enable web search for additional context
  • includeSubdomains: Include subdomains in extraction

When using a self-hosted instance, the extraction will use your configured LLM. For cloud API, it uses Firecrawl's managed LLM service.

7. Deep Research Tool (firecrawl_deep_research)

Conduct deep web research on a query using intelligent crawling, search, and LLM analysis.

{
  "name": "firecrawl_deep_research",
  "arguments": {
    "query": "how does carbon capture technology work?",
    "maxDepth": 3,
    "timeLimit": 120,
    "maxUrls": 50
  }
}

Arguments:

  • query (string, required): The research question or topic to explore.
  • maxDepth (number, optional): Maximum recursive depth for crawling/search (default: 3).
  • timeLimit (number, optional): Time limit in seconds for the research session (default: 120).
  • maxUrls (number, optional): Maximum number of URLs to analyze (default: 50).

Returns:

  • Final analysis generated by an LLM based on research. (data.finalAnalysis)
  • May also include structured activities and sources used in the research process.

8. Generate LLMs.txt Tool (firecrawl_generate_llmstxt)

Generate a standardized llms.txt (and optionally llms-full.txt) file for a given domain. This file defines how large language models should interact with the site.

{
  "name": "firecrawl_generate_llmstxt",
  "arguments": {
    "url": "https://example.com",
    "maxUrls": 20,
    "showFullText": true
  }
}

Arguments:

  • url (string, required): The base URL of the website to analyze.
  • maxUrls (number, optional): Max number of URLs to include (default: 10).
  • showFullText (boolean, optional): Whether to include llms-full.txt contents in the response.

Returns:

  • Generated llms.txt file contents and optionally the llms-full.txt (data.llmstxt and/or data.llmsfulltxt)

Logging System

The server includes comprehensive logging:

  • Operation status and progress
  • Performance metrics
  • Credit usage monitoring
  • Rate limit tracking
  • Error conditions

Example log messages:

[INFO] Firecrawl MCP Server initialized successfully
[INFO] Starting scrape for URL: https://example.com
[INFO] Batch operation queued with ID: batch_1
[WARNING] Credit usage has reached warning threshold
[ERROR] Rate limit exceeded, retrying in 2s...

Error Handling

The server provides robust error handling:

  • Automatic retries for transient errors
  • Rate limit handling with backoff
  • Detailed error messages
  • Credit usage warnings
  • Network resilience

Example error response:

{
  "content": [
    {
      "type": "text",
      "text": "Error: Rate limit exceeded. Retrying in 2 seconds..."
    }
  ],
  "isError": true
}

Development

# Install dependencies
npm install

# Build
npm run build

# Run tests
npm test

Contributing

  1. Fork the repository
  2. Create your feature branch
  3. Run tests: npm test
  4. Submit a pull request

License

MIT License - see LICENSE file for details



KitPloit - PenTest Tools!

Scrapling - An Undetectable, Powerful, Flexible, High-Performance Python Library That Makes Web Scraping Simple And Easy Again!

By: Unknown β€” April 28th 2025 at 12:30


Dealing with failing web scrapers due to anti-bot protections or website changes? Meet Scrapling.

Scrapling is a high-performance, intelligent web scraping library for Python that automatically adapts to website changes while significantly outperforming popular alternatives. For both beginners and experts, Scrapling provides powerful features while maintaining simplicity.

>> from scrapling.defaults import Fetcher, AsyncFetcher, StealthyFetcher, PlayWrightFetcher
# Fetch websites' source under the radar!
>> page = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True)
>> print(page.status)
200
>> products = page.css('.product', auto_save=True) # Scrape data that survives website design changes!
>> # Later, if the website structure changes, pass `auto_match=True`
>> products = page.css('.product', auto_match=True) # and Scrapling still finds them!

Key Features

Fetch websites as you prefer with async support

  • HTTP Requests: Fast and stealthy HTTP requests with the Fetcher class.
  • Dynamic Loading & Automation: Fetch dynamic websites with the PlayWrightFetcher class through your real browser, Scrapling's stealth mode, Playwright's Chrome browser, or NSTbrowser's browserless!
  • Anti-bot Protections Bypass: Easily bypass protections with StealthyFetcher and PlayWrightFetcher classes.

Adaptive Scraping

  • πŸ”„ Smart Element Tracking: Relocate elements after website changes, using an intelligent similarity system and integrated storage.
  • 🎯 Flexible Selection: CSS selectors, XPath selectors, filters-based search, text search, regex search and more.
  • πŸ” Find Similar Elements: Automatically locate elements similar to the element you found!
  • 🧠 Smart Content Scraping: Extract data from multiple websites without specific selectors using Scrapling's powerful features.

High Performance

  • πŸš€ Lightning Fast: Built from the ground up with performance in mind, outperforming most popular Python scraping libraries.
  • πŸ”‹ Memory Efficient: Optimized data structures for minimal memory footprint.
  • ⚑ Fast JSON serialization: 10x faster than standard library.

Developer Friendly

  • πŸ› οΈ Powerful Navigation API: Easy DOM traversal in all directions.
  • 🧬 Rich Text Processing: All strings have built-in regex, cleaning methods, and more. All elements' attributes are optimized dictionaries that take less memory than standard dictionaries, with added methods.
  • πŸ“ Auto Selectors Generation: Generate robust short and full CSS/XPath selectors for any element.
  • πŸ”Œ Familiar API: Similar to Scrapy/BeautifulSoup and the same pseudo-elements used in Scrapy.
  • πŸ“˜ Type hints: Complete type/doc-strings coverage for future-proofing and best autocompletion support.

Getting Started

from scrapling.fetchers import Fetcher

fetcher = Fetcher(auto_match=False)

# Do http GET request to a web page and create an Adaptor instance
page = fetcher.get('https://quotes.toscrape.com/', stealthy_headers=True)
# Get all text content from all HTML tags in the page except `script` and `style` tags
page.get_all_text(ignore_tags=('script', 'style'))

# Get all quotes elements, any of these methods will return a list of strings directly (TextHandlers)
quotes = page.css('.quote .text::text') # CSS selector
quotes = page.xpath('//span[@class="text"]/text()') # XPath
quotes = page.css('.quote').css('.text::text') # Chained selectors
quotes = [element.text for element in page.css('.quote .text')] # Slower than bulk query above

# Get the first quote element
quote = page.css_first('.quote') # same as page.css('.quote').first or page.css('.quote')[0]

# Tired of selectors? Use find_all/find
# Get all 'div' HTML tags that one of its 'class' values is 'quote'
quotes = page.find_all('div', {'class': 'quote'})
# Same as
quotes = page.find_all('div', class_='quote')
quotes = page.find_all(['div'], class_='quote')
quotes = page.find_all(class_='quote') # and so on...

# Working with elements
quote.html_content # Get Inner HTML of this element
quote.prettify() # Prettified version of Inner HTML above
quote.attrib # Get that element's attributes
quote.path # DOM path to element (List of all ancestors from <html> tag till the element itself)

To keep it simple, all methods can be chained on top of each other!

Parsing Performance

Scrapling isn't just powerful - it's also blazing fast. Scrapling implements many best practices, design patterns, and numerous optimizations to save fractions of seconds. All of that while focusing exclusively on parsing HTML documents. Here are benchmarks comparing Scrapling to popular Python libraries in two tests.

Text Extraction Speed Test (5000 nested elements).

| # | Library | Time (ms) | vs Scrapling |
|---|---------|-----------|--------------|
| 1 | Scrapling | 5.44 | 1.0x |
| 2 | Parsel/Scrapy | 5.53 | 1.017x |
| 3 | Raw Lxml | 6.76 | 1.243x |
| 4 | PyQuery | 21.96 | 4.037x |
| 5 | Selectolax | 67.12 | 12.338x |
| 6 | BS4 with Lxml | 1307.03 | 240.263x |
| 7 | MechanicalSoup | 1322.64 | 243.132x |
| 8 | BS4 with html5lib | 3373.75 | 620.175x |

As you see, Scrapling is on par with Scrapy and slightly faster than Lxml, which both libraries are built on top of. These are the closest results to Scrapling. PyQuery is also built on top of Lxml, but Scrapling is still 4 times faster.

Extraction By Text Speed Test

| Library | Time (ms) | vs Scrapling |
|---------|-----------|--------------|
| Scrapling | 2.51 | 1.0x |
| AutoScraper | 11.41 | 4.546x |

Scrapling can find elements with more methods, and it returns full element Adaptor objects, not just the text like AutoScraper. So, to make this test fair, both libraries extract an element by its text, find similar elements, and then extract the text content of all of them. As you see, Scrapling is still 4.5 times faster at the same task.

All benchmarks' results are an average of 100 runs. See our benchmarks.py for methodology and to run your comparisons.

Installation

Scrapling is a breeze to get started with; starting from version 0.2.9, it requires at least Python 3.9 to work.

pip3 install scrapling

Then run this command to install the browser dependencies needed to use the Fetcher classes:

scrapling install

If you have any installation issues, please open an issue.

Fetching Websites

Fetchers are interfaces built on top of other libraries with added features. They perform requests or fetch pages for you in a single-request fashion and then return an Adaptor object. This feature was introduced because, previously, the only option was to fetch the page yourself however you wanted and then pass it manually to the Adaptor class to create an Adaptor instance and start playing around with the page.
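
To make the difference concrete, here is a minimal sketch of both paths (assuming httpx for the manual fetch; the Adaptor usage mirrors the example in the "Handling Structural Changes" section below):

# The old way: fetch the page yourself, then wrap the source in an Adaptor manually
import httpx
from scrapling.parser import Adaptor

response = httpx.get('https://quotes.toscrape.com/')
page = Adaptor(response.text, url='https://quotes.toscrape.com/')

# The Fetcher way: one call that fetches the page and returns a ready-to-use object
from scrapling.fetchers import Fetcher

page = Fetcher().get('https://quotes.toscrape.com/')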

Features

You might be slightly confused by now so let me clear things up. All fetcher-type classes are imported in the same way

from scrapling.fetchers import Fetcher, StealthyFetcher, PlayWrightFetcher

All of them can take these initialization arguments: auto_match, huge_tree, keep_comments, keep_cdata, storage, and storage_args, which are the same ones you give to the Adaptor class.

If you don't want to pass arguments to the generated Adaptor object and want to use the default values, you can use this import instead for cleaner code:

from scrapling.defaults import Fetcher, AsyncFetcher, StealthyFetcher, PlayWrightFetcher

then use it right away without initializing like:

page = StealthyFetcher.fetch('https://example.com') 

Also, the Response object returned from all fetchers is the same as the Adaptor object except it has these added attributes: status, reason, cookies, headers, history, and request_headers. All cookies, headers, and request_headers are always of type dictionary.
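
A quick sketch of those extra attributes in use (the header and cookie values shown here are just placeholders):

>> from scrapling.defaults import Fetcher
>> page = Fetcher.get('https://quotes.toscrape.com/')
>> page.status
200
>> page.reason
'OK'
>> page.headers['content-type']  # headers, cookies, and request_headers behave like dictionaries
'text/html; charset=utf-8'
>> page.cookies
{}
>> page.css_first('.quote .text::text')  # and everything from the Adaptor API still works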

[!NOTE] The auto_match argument is enabled by default which is the one you should care about the most as you will see later.

Fetcher

This class is built on top of httpx with additional configuration options, here you can do GET, POST, PUT, and DELETE requests.

For all methods, you have stealthy_headers, which makes Fetcher create and use real browser headers and then create a referer header as if this request came from a Google search of this URL's domain. It's enabled by default. You can also set the number of retries with the retries argument for all methods, and this will make httpx retry requests if they fail for any reason. The default number of retries for all Fetcher methods is 3.

Note: all headers generated by the stealthy_headers argument can be overwritten by you through the headers argument.

You can route all traffic (HTTP and HTTPS) to a proxy for any of these methods in this format http://username:password@localhost:8030

>> page = Fetcher().get('https://httpbin.org/get', stealthy_headers=True, follow_redirects=True)
>> page = Fetcher().post('https://httpbin.org/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030')
>> page = Fetcher().put('https://httpbin.org/put', data={'key': 'value'})
>> page = Fetcher().delete('https://httpbin.org/delete')

For Async requests, you will just replace the import like below:

>> from scrapling.fetchers import AsyncFetcher
>> page = await AsyncFetcher().get('https://httpbin.org/get', stealthy_headers=True, follow_redirects=True)
>> page = await AsyncFetcher().post('https://httpbin.org/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030')
>> page = await AsyncFetcher().put('https://httpbin.org/put', data={'key': 'value'})
>> page = await AsyncFetcher().delete('https://httpbin.org/delete')

StealthyFetcher

This class is built on top of Camoufox, bypassing most anti-bot protections by default. Scrapling adds extra layers of flavors and configurations to increase performance and undetectability even further.

>> page = StealthyFetcher().fetch('https://www.browserscan.net/bot-detection')  # Running headless by default
>> page.status == 200
True
>> page = await StealthyFetcher().async_fetch('https://www.browserscan.net/bot-detection') # the async version of fetch
>> page.status == 200
True

Note: all requests done by this fetcher are waiting by default for all JS to be fully loaded and executed so you don't have to :)

The complete list of arguments:

| Argument | Description | Optional |
|:---:|---|:---:|
| url | Target url | ❌ |
| headless | Pass `True` to run the browser in headless/hidden (**default**), `virtual` to run it in virtual screen mode, or `False` for headful/visible mode. The `virtual` mode requires having `xvfb` installed. | βœ”οΈ |
| block_images | Prevent the loading of images through Firefox preferences. _This can help save your proxy usage but be careful with this option as it makes some websites never finish loading._ | βœ”οΈ |
| disable_resources | Drop requests of unnecessary resources for a speed boost. It depends, but it made requests ~25% faster in my tests for some websites. Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`. _This can help save your proxy usage but be careful with this option as it makes some websites never finish loading._ | βœ”οΈ |
| google_search | Enabled by default, Scrapling will set the referer header to be as if this request came from a Google search for this website's domain name. | βœ”οΈ |
| extra_headers | A dictionary of extra headers to add to the request. _The referer set by the `google_search` argument takes priority over the referer set here if used together._ | βœ”οΈ |
| block_webrtc | Blocks WebRTC entirely. | βœ”οΈ |
| page_action | Added for automation. A function that takes the `page` object, does the automation you need, then returns `page` again. | βœ”οΈ |
| addons | List of Firefox addons to use. **Must be paths to extracted addons.** | βœ”οΈ |
| humanize | Humanize the cursor movement. Takes either True or the MAX duration in seconds of the cursor movement. The cursor typically takes up to 1.5 seconds to move across the window. | βœ”οΈ |
| allow_webgl | Enabled by default. Disabling WebGL is not recommended, as many WAFs now check whether WebGL is enabled. | βœ”οΈ |
| geoip | Recommended to use with proxies; automatically use the IP's longitude, latitude, timezone, country, and locale, and spoof the WebRTC IP address. It will also calculate and spoof the browser's language based on the distribution of language speakers in the target region. | βœ”οΈ |
| disable_ads | Disabled by default; this installs the `uBlock Origin` addon on the browser if enabled. | βœ”οΈ |
| network_idle | Wait for the page until there are no network connections for at least 500 ms. | βœ”οΈ |
| timeout | The timeout in milliseconds that is used in all operations and waits through the page. The default is 30000. | βœ”οΈ |
| wait_selector | Wait for a specific css selector to be in a specific state. | βœ”οΈ |
| proxy | The proxy to be used with requests; it can be a string or a dictionary with the keys 'server', 'username', and 'password' only. | βœ”οΈ |
| os_randomize | If enabled, Scrapling will randomize the OS fingerprints used. The default is Scrapling matching the fingerprints with the current OS. | βœ”οΈ |
| wait_selector_state | The state to wait for the selector given with `wait_selector`. _Default state is `attached`._ | βœ”οΈ |

This list isn't final so expect a lot more additions and flexibility to be added in the next versions!

PlayWrightFetcher

This class is built on top of Playwright and currently provides 4 main run options, but they can be mixed as you want.

>> page = PlayWrightFetcher().fetch('https://www.google.com/search?q=%22Scrapling%22', disable_resources=True)  # Vanilla Playwright option
>> page.css_first("#search a::attr(href)")
'https://github.com/D4Vinci/Scrapling'
>> page = await PlayWrightFetcher().async_fetch('https://www.google.com/search?q=%22Scrapling%22', disable_resources=True) # the async version of fetch
>> page.css_first("#search a::attr(href)")
'https://github.com/D4Vinci/Scrapling'

Note: all requests done by this fetcher are waiting by default for all JS to be fully loaded and executed so you don't have to :)

Using this Fetcher class, you can make requests with:

1. Vanilla Playwright without any modifications other than the ones you chose.
2. Stealthy Playwright with the stealth mode I wrote for it. It's still a WIP, but it bypasses many online tests like Sannysoft's. Some of the things this fetcher's stealth mode does include:
   • Patching the CDP runtime fingerprint.
   • Mimicking some of the real browsers' properties by injecting several JS files and using custom options.
   • Using custom flags on launch to hide Playwright even more and make it faster.
   • Generating real browser headers of the same type and same user OS, then appending them to the request's headers.
3. Real browsers, by passing the real_chrome argument or the CDP URL of your browser to be controlled by the Fetcher; most of the options can be enabled on it.
4. NSTBrowser's docker browserless option, by passing the CDP URL and enabling the nstbrowser_mode option.

Note: using the real_chrome argument requires that you have the Chrome browser installed on your device.

Add that to a lot of controlling/hiding options as you will see in the arguments list below.
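
For example (an illustrative sketch; see the argument table below for what each option does):

# Stealth mode with a visible window, idle-network wait, and a proxy
page = PlayWrightFetcher().fetch(
    'https://example.com',
    headless=False,       # headful/visible mode
    stealth=True,         # enable the stealth mode described above
    network_idle=True,    # wait until there are no network connections for 500 ms
    proxy='http://username:password@localhost:8030',
)

# Or drive your locally installed Chrome instead of the bundled browser
page = PlayWrightFetcher().fetch('https://example.com', real_chrome=True)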

The complete list of arguments:

| Argument | Description | Optional |
|:---:|---|:---:|
| url | Target url | ❌ |
| headless | Pass `True` to run the browser in headless/hidden (**default**), or `False` for headful/visible mode. | βœ”οΈ |
| disable_resources | Drop requests of unnecessary resources for a speed boost. It depends, but it made requests ~25% faster in my tests for some websites. Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`. _This can help save your proxy usage but be careful with this option as it makes some websites never finish loading._ | βœ”οΈ |
| useragent | Pass a useragent string to be used. **Otherwise the fetcher will generate a real Useragent of the same browser and use it.** | βœ”οΈ |
| network_idle | Wait for the page until there are no network connections for at least 500 ms. | βœ”οΈ |
| timeout | The timeout in milliseconds that is used in all operations and waits through the page. The default is 30000. | βœ”οΈ |
| page_action | Added for automation. A function that takes the `page` object, does the automation you need, then returns `page` again. | βœ”οΈ |
| wait_selector | Wait for a specific css selector to be in a specific state. | βœ”οΈ |
| wait_selector_state | The state to wait for the selector given with `wait_selector`. _Default state is `attached`._ | βœ”οΈ |
| google_search | Enabled by default, Scrapling will set the referer header to be as if this request came from a Google search for this website's domain name. | βœ”οΈ |
| extra_headers | A dictionary of extra headers to add to the request. The referer set by the `google_search` argument takes priority over the referer set here if used together. | βœ”οΈ |
| proxy | The proxy to be used with requests; it can be a string or a dictionary with the keys 'server', 'username', and 'password' only. | βœ”οΈ |
| hide_canvas | Add random noise to canvas operations to prevent fingerprinting. | βœ”οΈ |
| disable_webgl | Disables WebGL and WebGL 2.0 support entirely. | βœ”οΈ |
| stealth | Enables stealth mode; always check the documentation to see what stealth mode does currently. | βœ”οΈ |
| real_chrome | If you have the Chrome browser installed on your device, enable this and the Fetcher will launch an instance of your browser and use it. | βœ”οΈ |
| locale | Set the locale for the browser if wanted. The default value is `en-US`. | βœ”οΈ |
| cdp_url | Instead of launching a new browser instance, connect to this CDP URL to control real browsers/NSTBrowser through CDP. | βœ”οΈ |
| nstbrowser_mode | Enables NSTBrowser mode; **it has to be used with the `cdp_url` argument or it will get completely ignored.** | βœ”οΈ |
| nstbrowser_config | The config you want to send with requests to the NSTBrowser. _If left empty, Scrapling defaults to an optimized NSTBrowser's docker browserless config._ | βœ”οΈ |

This list isn't final so expect a lot more additions and flexibility to be added in the next versions!

Advanced Parsing Features

Smart Navigation

>>> quote.tag
'div'

>>> quote.parent
<data='<div class="col-md-8"> <div class="quote...' parent='<div class="row"> <div class="col-md-8">...'>

>>> quote.parent.tag
'div'

>>> quote.children
[<data='<span class="text" itemprop="text">"The...' parent='<div class="quote" itemscope itemtype="h...'>,
<data='<span>by <small class="author" itemprop=...' parent='<div class="quote" itemscope itemtype="h...'>,
<data='<div class="tags"> Tags: <meta class="ke...' parent='<div class="quote" itemscope itemtype="h...'>]

>>> quote.siblings
[<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
...]

>>> quote.next # gets the next element, the same logic applies to `quote.previous`
<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>

>>> quote.children.css_first(".author::text")
'Albert Einstein'

>>> quote.has_class('quote')
True

# Generate new selectors for any element
>>> quote.generate_css_selector
'body > div > div:nth-of-type(2) > div > div'

# Test these selectors on your favorite browser or reuse them again in the library's methods!
>>> quote.generate_xpath_selector
'//body/div/div[2]/div/div'

If your case needs more than the element's parent, you can iterate over the whole ancestors' tree of any element like below

for ancestor in quote.iterancestors():
    print(ancestor.tag)  # do something with each ancestor...

You can search for a specific ancestor of an element that satisfies a function; all you need to do is pass a function that takes an Adaptor object as an argument and returns True if the condition is satisfied or False otherwise, like below:

>>> quote.find_ancestor(lambda ancestor: ancestor.has_class('row'))
<data='<div class="row"> <div class="col-md-8">...' parent='<div class="container"> <div class="row...'>

Content-based Selection & Finding Similar Elements

You can select elements by their text content in multiple ways, here's a full example on another website:

>>> page = Fetcher().get('https://books.toscrape.com/index.html')

>>> page.find_by_text('Tipping the Velvet') # Find the first element whose text fully matches this text
<data='<a href="catalogue/tipping-the-velvet_99...' parent='<h3><a href="catalogue/tipping-the-velve...'>

>>> page.urljoin(page.find_by_text('Tipping the Velvet').attrib['href']) # We use `page.urljoin` to return the full URL from the relative `href`
'https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html'

>>> page.find_by_text('Tipping the Velvet', first_match=False) # Get all matches if there are more
[<data='<a href="catalogue/tipping-the-velvet_99...' parent='<h3><a href="catalogue/tipping-the-velve...'>]

>>> page.find_by_regex(r'Β£[\d\.]+') # Get the first element whose text content matches my price regex
<data='<p class="price_color">Β£51.77</p>' parent='<div class="product_price"> <p class="pr...'>

>>> page.find_by_regex(r'Β£[\d\.]+', first_match=False) # Get all elements whose text matches my price regex
[<data='<p class="price_color">Β£51.77</p>' parent='<div class="product_price"> <p class="pr...'>,
<data='<p class="price_color">Β£53.74</p>' parent='<div class="product_price"> <p class="pr...'>,
<data='<p class="price_color">Β£50.10</p>' parent='<div class="product_price"> <p class="pr...'>,
<data='<p class="price_color">Β£47.82</p>' parent='<div class="product_price"> <p class="pr...'>,
...]

Find all elements that are similar to the current element in location and attributes

# For this case, ignore the 'title' attribute while matching
>>> page.find_by_text('Tipping the Velvet').find_similar(ignore_attributes=['title'])
[<data='<a href="catalogue/a-light-in-the-attic_...' parent='<h3><a href="catalogue/a-light-in-the-at...'>,
<data='<a href="catalogue/soumission_998/index....' parent='<h3><a href="catalogue/soumission_998/in...'>,
<data='<a href="catalogue/sharp-objects_997/ind...' parent='<h3><a href="catalogue/sharp-objects_997...'>,
...]

# You will notice that the number of elements is 19 not 20 because the current element is not included.
>>> len(page.find_by_text('Tipping the Velvet').find_similar(ignore_attributes=['title']))
19

# Get the `href` attribute from all similar elements
>>> [element.attrib['href'] for element in page.find_by_text('Tipping the Velvet').find_similar(ignore_attributes=['title'])]
['catalogue/a-light-in-the-attic_1000/index.html',
'catalogue/soumission_998/index.html',
'catalogue/sharp-objects_997/index.html',
...]

To increase the complexity a little bit, let's say we want to get all books' data using that element as a starting point for some reason

>>> for product in page.find_by_text('Tipping the Velvet').parent.parent.find_similar():
...     print({
...         "name": product.css_first('h3 a::text'),
...         "price": product.css_first('.price_color').re_first(r'[\d\.]+'),
...         "stock": product.css('.availability::text')[-1].clean()
...     })
{'name': 'A Light in the ...', 'price': '51.77', 'stock': 'In stock'}
{'name': 'Soumission', 'price': '50.10', 'stock': 'In stock'}
{'name': 'Sharp Objects', 'price': '47.82', 'stock': 'In stock'}
...

The documentation will provide more advanced examples.

Handling Structural Changes

Let's say you are scraping a page with a structure like this:

<div class="container">
  <section class="products">
    <article class="product" id="p1">
      <h3>Product 1</h3>
      <p class="description">Description 1</p>
    </article>
    <article class="product" id="p2">
      <h3>Product 2</h3>
      <p class="description">Description 2</p>
    </article>
  </section>
</div>

And you want to scrape the first product, the one with the p1 ID. You will probably write a selector like this

page.css('#p1')

When website owners implement structural changes like

<div class="new-container">
  <div class="product-wrapper">
    <section class="products">
      <article class="product new-class" data-id="p1">
        <div class="product-info">
          <h3>Product 1</h3>
          <p class="new-description">Description 1</p>
        </div>
      </article>
      <article class="product new-class" data-id="p2">
        <div class="product-info">
          <h3>Product 2</h3>
          <p class="new-description">Description 2</p>
        </div>
      </article>
    </section>
  </div>
</div>

The selector will no longer function and your code needs maintenance. That's where Scrapling's auto-matching feature comes into play.

from scrapling.parser import Adaptor
# Before the change
page = Adaptor(page_source, url='example.com')
element = page.css('#p1', auto_save=True)
if not element:  # One day the website changes?
    element = page.css('#p1', auto_match=True)  # Scrapling still finds it!
# the rest of the code...

How does the auto-matching work? Check the FAQs section for that and other possible issues while auto-matching.

Real-World Scenario

Let's use a real website as an example and use one of the fetchers to fetch its source. To do this we need to find a website that will change its design/structure soon, take a copy of its source then wait for the website to make the change. Of course, that's nearly impossible to know unless I know the website's owner but that will make it a staged test haha.

To solve this issue, I will use The Web Archive's Wayback Machine. Here is a copy of StackOverflow's website in 2010, pretty old, huh? Let's test if the auto-match feature can extract the same button in the old design from 2010 and the current design using the same selector :)

If I want to extract the Questions button from the old design, I can use a selector like this: #hmenus > div:nth-child(1) > ul > li:nth-child(1) > a. This selector is too specific because it was generated by Google Chrome. Now let's test the same selector in both versions.

>> from scrapling.fetchers import Fetcher
>> selector = '#hmenus > div:nth-child(1) > ul > li:nth-child(1) > a'
>> old_url = "https://web.archive.org/web/20100102003420/http://stackoverflow.com/"
>> new_url = "https://stackoverflow.com/"
>>
>> page = Fetcher(automatch_domain='stackoverflow.com').get(old_url, timeout=30)
>> element1 = page.css_first(selector, auto_save=True)
>>
>> # Same selector but used in the updated website
>> page = Fetcher(automatch_domain="stackoverflow.com").get(new_url)
>> element2 = page.css_first(selector, auto_match=True)
>>
>> if element1.text == element2.text:
... print('Scrapling found the same element in the old design and the new design!')
'Scrapling found the same element in the old design and the new design!'

Note that I used a new argument called automatch_domain. This is because, for Scrapling, these are two different URLs, not the same website, so it isolates their data. To tell Scrapling they are the same website, we pass the domain we want to use for saving auto-match data for them both, so Scrapling doesn't isolate them.

In a real-world scenario, the code will be the same except it will use the same URL for both requests so you won't need to use the automatch_domain argument. This is the closest example I can give to real-world cases so I hope it didn't confuse you :)

Notes:

1. For the two examples above, I used the Adaptor class one time and the Fetcher class the other time, just to show you that you can create the Adaptor object yourself if you have the source, or fetch the source using any Fetcher class and it will create the Adaptor object for you.
2. Passing the auto_save argument with the auto_match argument set to False while initializing the Adaptor/Fetcher object will only result in ignoring the auto_save argument value and the following warning message: "Argument `auto_save` will be ignored because `auto_match` wasn't enabled on initialization. Check docs for more info." This behavior is purely for performance reasons, so the database gets created/connected only when you are planning to use the auto-matching features. The same applies to the auto_match argument.
3. The auto_match parameter works only for Adaptor instances, not Adaptors, so if you do something like page.css('body').css('#p1', auto_match=True) you will get an error, because you can't auto-match a whole list; you have to be specific and do something like page.css_first('body').css('#p1', auto_match=True).

Find elements by filters

Inspired by BeautifulSoup's find_all function you can find elements by using find_all/find methods. Both methods can take multiple types of filters and return all elements in the pages that all these filters apply to.

To be more specific:

  • Any string passed is considered a tag name.
  • Any iterable passed like List/Tuple/Set is considered an iterable of tag names.
  • Any dictionary is considered a mapping of HTML element(s) attribute names and attribute values.
  • Any regex patterns passed are used as filters to elements by their text content.
  • Any functions passed are used as filters.
  • Any keyword argument passed is considered as an HTML element attribute with its value.

So the way it works is after collecting all passed arguments and keywords, each filter passes its results to the following filter in a waterfall-like filtering system.
It filters all elements in the current page/element in the following order:

  1. All elements with the passed tag name(s).
  2. All elements that match all passed attribute(s).
  3. All elements whose text content matches all passed regex patterns.
  4. All elements that fulfill all passed function(s).

Note: The filtering process always starts from the first filter it finds in the filtering order above so if no tag name(s) are passed but attributes are passed, the process starts from that layer and so on. But the order in which you pass the arguments doesn't matter.

Examples to clear any confusion :)

>> from scrapling.fetchers import Fetcher
>> page = Fetcher().get('https://quotes.toscrape.com/')
# Find all elements with tag name `div`.
>> page.find_all('div')
[<data='<div class="container"> <div class="row...' parent='<body> <div class="container"> <div clas...'>,
<data='<div class="row header-box"> <div class=...' parent='<div class="container"> <div class="row...'>,
...]

# Find all div elements with a class that equals `quote`.
>> page.find_all('div', class_='quote')
[<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
...]

# Same as above.
>> page.find_all('div', {'class': 'quote'})
[<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
...]

# Find all elements with a class that equals `quote`.
>> page.find_all({'class': 'quote'})
[<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
...]

# Find all div elements with a class that equals `quote`, and contains the element `.text` which contains the word 'world' in its content.
>> page.find_all('div', {'class': 'quote'}, lambda e: "world" in e.css_first('.text::text'))
[<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>]

# Find all elements that have children.
>> page.find_all(lambda element: len(element.children) > 0)
[<data='<html lang="en"><head><meta charset="UTF...'>,
<data='<head><meta charset="UTF-8"><title>Quote...' parent='<html lang="en"><head><meta charset="UTF...'>,
<data='<body> <div class="container"> <div clas...' parent='<html lang="en"><head><meta charset="UTF...'>,
...]

# Find all elements that contain the word 'world' in their content.
>> page.find_all(lambda element: "world" in element.text)
[<data='<span class="text" itemprop="text">"The...' parent='<div class="quote" itemscope itemtype="h...'>,
<data='<a class="tag" href="/tag/world/page/1/"...' parent='<div class="tags"> Tags: <meta class="ke...'>]

# Find all span elements that match the given regex
>> page.find_all('span', re.compile(r'world'))
[<data='<span class="text" itemprop="text">"The...' parent='<div class="quote" itemscope itemtype="h...'>]

# Find all div and span elements with class 'quote' (No span elements like that so only div returned)
>> page.find_all(['div', 'span'], {'class': 'quote'})
[<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
...]

# Mix things up
>> page.find_all({'itemtype':"http://schema.org/CreativeWork"}, 'div').css('.author::text')
['Albert Einstein',
'J.K. Rowling',
...]

Is That All?

Here's what else you can do with Scrapling:

  • Accessing the lxml.etree object itself of any element directly:

    >>> quote._root
    <Element div at 0x107f98870>

  • Saving and retrieving elements manually to auto-match them outside the css and the xpath methods, but you have to set the identifier by yourself.

    • To save an element to the database:

      >>> element = page.find_by_text('Tipping the Velvet', first_match=True)
      >>> page.save(element, 'my_special_element')

    • Later, when you want to retrieve it and relocate it inside the page with auto-matching, it would be like this:

      >>> element_dict = page.retrieve('my_special_element')
      >>> page.relocate(element_dict, adaptor_type=True)
      [<data='<a href="catalogue/tipping-the-velvet_99...' parent='<h3><a href="catalogue/tipping-the-velve...'>]
      >>> page.relocate(element_dict, adaptor_type=True).css('::text')
      ['Tipping the Velvet']

    • If you want to keep it as an lxml.etree object, leave out the adaptor_type argument:

      >>> page.relocate(element_dict)
      [<Element a at 0x105a2a7b0>]

  • Filtering results based on a function:

    # Find all products over $50
    expensive_products = page.css('.product_pod').filter(
        lambda p: float(p.css('.price_color').re_first(r'[\d\.]+')) > 50
    )

  • Searching results for the first one that matches a function:

    # Find the first product with price '54.23'
    page.css('.product_pod').search(
        lambda p: float(p.css('.price_color').re_first(r'[\d\.]+')) == 54.23
    )

  • Doing operations on element content is the same as in Scrapy:

    quote.re(r'regex_pattern')        # Get all strings (TextHandlers) that match the regex pattern
    quote.re_first(r'regex_pattern')  # Get the first string (TextHandler) only
    quote.json()                      # If the content text is jsonable, convert it to json using `orjson`, which is 10x faster than the standard json library and provides more options

    except that you can do more with them, like:

    quote.re(
        r'regex_pattern',
        replace_entities=True,  # Character entity references are replaced by their corresponding character
        clean_match=True,       # This will ignore all whitespaces and consecutive spaces while matching
        case_sensitive=False,   # Set the regex to ignore letters case while compiling it
    )

    All of these methods are actually methods of the TextHandler that holds the text content, so the same can be done directly if you call the .text property or the equivalent selector function.

  • Doing operations on the text content itself includes:

    • Cleaning the text from any white spaces and replacing consecutive spaces with a single space:

      quote.clean()

    • You already know about the regex matching and the fast JSON parsing, but did you know that all strings returned from the regex search are actually TextHandler objects too? So in cases where you have, for example, a JS object assigned to a JS variable inside JS code and want to extract it with regex and then convert it to a JSON object, in other libraries this would take more than one line of code, but here you can do it in one line:

      page.xpath('//script/text()').re_first(r'var dataLayer = (.+);').json()

    • Sort all characters in the string as if it were a list and return the new string:

      quote.sort(reverse=False)

    To be clear, TextHandler is a sub-class of Python's str, so all normal operations/methods that work with Python strings will work with it.

  • Any element's attributes are not exactly a dictionary but a read-only sub-class of mapping called AttributesHandler, so it's faster, and the string values returned are actually TextHandler objects, so all the operations above can be done on them, plus standard dictionary operations that don't modify the data, and more :)

    • Unlike standard dictionaries, here you can search by values too and can do partial searches. It might be handy in some cases (returns a generator of matches):

      >>> for item in element.attrib.search_values('catalogue', partial=True):
      ...     print(item)
      {'href': 'catalogue/tipping-the-velvet_999/index.html'}

    • Serialize the current attributes to JSON bytes:

      >>> element.attrib.json_string
      b'{"href":"catalogue/tipping-the-velvet_999/index.html","title":"Tipping the Velvet"}'

    • Converting it to a normal dictionary:

      >>> dict(element.attrib)
      {'href': 'catalogue/tipping-the-velvet_999/index.html', 'title': 'Tipping the Velvet'}

Scrapling is under active development so expect many more features coming soon :)

More Advanced Usage

There are a lot of deep details skipped here to make this as short as possible, so to take a deep dive, head to the docs section. I will try to keep it as updated as possible and add complex examples. There, I will explain points like how to write your own storage system, how to write spiders that don't depend on selectors at all, and more...

Note that implementing your storage system can be complex as there are some strict rules such as inheriting from the same abstract class, following the singleton design pattern used in other classes, and more. So make sure to read the docs first.

[!IMPORTANT] A website is needed to provide detailed library documentation.
I'm trying to rush creating the website, researching new ideas, and adding more features/tests/benchmarks but time is tight with too many spinning plates between work, personal life, and working on Scrapling. I have been working on Scrapling for months for free after all.

If you like Scrapling and want it to keep improving then this is a friendly reminder that you can help by supporting me through the sponsor button.

⚑ Enlightening Questions and FAQs

This section addresses common questions about Scrapling, please read this section before opening an issue.

How does auto-matching work?

  1. You need to get a working selector and run it at least once with methods css or xpath with the auto_save parameter set to True before structural changes happen.
  2. Before returning results for you, Scrapling uses its configured database and saves unique properties about that element.
  3. Now because everything about the element can be changed or removed, nothing from the element can be used as a unique identifier for the database. To solve this issue, I made the storage system rely on two things:

    1. The domain of the URL you gave while initializing the first Adaptor object
    2. The identifier parameter you passed to the method while selecting. If you didn't pass one, then the selector string itself will be used as an identifier but remember you will have to use it as an identifier value later when the structure changes and you want to pass the new selector.

    Together, both are used to retrieve the element's unique properties from the database later.

  4. Later, when you enable the auto_match parameter for both the Adaptor instance and the method call, the element properties are retrieved and Scrapling loops over all elements in the page, comparing each one's unique properties to the unique properties we already have for this element, and a score is calculated for each one.
  5. Comparing elements is not exact but more about finding how similar these values are, so everything is taken into consideration, even the values' order, like the order in which the element class names were written before and the order in which the same element class names are written now.
  6. The score for each element is stored in the table, and the element(s) with the highest combined similarity scores are returned.
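
To make the idea concrete, here is a toy sketch of similarity-based relocation. This is not Scrapling's actual algorithm, storage schema, or scoring function; it only illustrates the general principle of comparing saved properties against every candidate element and keeping the best match:

# Toy illustration only; Scrapling's real implementation and schema differ.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def score(saved: dict, candidate: dict) -> float:
    # Average the similarity of a few saved properties against the candidate's
    keys = ('tag', 'text', 'classes', 'parent_tag')
    return sum(similarity(str(saved.get(k, '')), str(candidate.get(k, ''))) for k in keys) / len(keys)

saved = {'tag': 'article', 'text': 'Product 1', 'classes': 'product', 'parent_tag': 'section'}
candidates = [
    {'tag': 'article', 'text': 'Product 1', 'classes': 'product new-class', 'parent_tag': 'section'},
    {'tag': 'article', 'text': 'Product 2', 'classes': 'product new-class', 'parent_tag': 'section'},
]
best = max(candidates, key=lambda candidate: score(saved, candidate))
print(best['text'])  # -> Product 1 (the highest-scoring candidate wins)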

How does the auto-matching work if I didn't pass a URL while initializing the Adaptor object?

Not a big problem, as it depends on your usage. The word default will be used in place of the URL field while saving the element's unique properties. So this will only be an issue if you later use the same identifier for a different website that you also initialized without passing the URL parameter. The save process will overwrite the previous data, and auto-matching uses only the latest saved properties.

If all things about an element can change or get removed, what are the unique properties to be saved?

For each element, Scrapling will extract:

  • Element tag name, text, attributes (names and values), siblings (tag names only), and path (tag names only).
  • Element's parent tag name, attributes (names and values), and text.
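
As a purely illustrative example of what such a record could look like (the field names and layout here are hypothetical, not Scrapling's real storage schema):

# Hypothetical example of saved unique properties; not Scrapling's actual schema.
saved_properties = {
    'tag': 'a',
    'text': 'Tipping the Velvet',
    'attributes': {'href': 'catalogue/tipping-the-velvet_999/index.html', 'title': 'Tipping the Velvet'},
    'siblings': [],                                               # sibling tag names only
    'path': ['html', 'body', 'div', 'section', 'article', 'h3'],  # ancestor tag names only
    'parent': {'tag': 'h3', 'attributes': {}, 'text': 'Tipping the Velvet'},
}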

I have enabled the auto_save/auto_match parameter while selecting and it got completely ignored with a warning message

That's because passing the auto_save/auto_match argument without setting auto_match to True while initializing the Adaptor object will only result in ignoring the auto_save/auto_match argument value. This behavior is purely for performance reasons so the database gets created only when you are planning to use the auto-matching features.

I have done everything as the docs but the auto-matching didn't return anything, what's wrong?

It could be one of these reasons:

1. No data were saved/stored for this element before.
2. The selector passed is not the one used while storing the element data. The solution is simple:
   • Pass the old selector again as an identifier to the method called.
   • Retrieve the element with the retrieve method using the old selector as the identifier, then save it again with the save method and the new selector as the identifier.
   • Start using the identifier argument more often if you are planning to use every new selector from now on.
3. The website had some extreme structural changes, like a completely new design. If this happens a lot with this website, the solution would be to make your code as selector-free as possible, using Scrapling's features.

Can Scrapling replace code built on top of BeautifulSoup4?

Pretty much yeah, almost all features you get from BeautifulSoup can be found or achieved in Scrapling one way or another. In fact, if you see there's a feature in bs4 that is missing in Scrapling, please make a feature request from the issues tab to let me know.

Can Scrapling replace code built on top of AutoScraper?

Of course. You can find elements by text/regex, find similar elements in a more reliable way than AutoScraper, and save/retrieve elements manually to reuse later, much like AutoScraper's model feature. I have pulled all the top articles about AutoScraper from Google and tested Scrapling against the examples in them. In all examples, Scrapling got the same results as AutoScraper in much less time.

Is Scrapling thread-safe?

Yes, Scrapling instances are thread-safe. Each Adaptor instance maintains its own state.

More Sponsors!

Contributing

Everybody is invited and welcome to contribute to Scrapling. There is a lot to do!

Please read the contributing file before doing anything.

Disclaimer for Scrapling Project

[!CAUTION] This library is provided for educational and research purposes only. By using this library, you agree to comply with local and international laws regarding data scraping and privacy. The authors and contributors are not responsible for any misuse of this software. This library should not be used to violate the rights of others, for unethical purposes, or to use data in an unauthorized or illegal manner. Do not use it on any website unless you have permission from the website owner or within their allowed rules like the robots.txt file, for example.

License

This work is licensed under the BSD-3-Clause license.

Acknowledgments

This project includes code adapted from:
  • Parsel (BSD License) - used for the translator submodule

Thanks and References

Known Issues

  • In the auto-matching save process, only the unique properties of the first element from the selection results get saved. So if your selector matches several different elements in different locations on the page, auto-matching will probably return only the first element when you relocate it later. This doesn't apply to combined CSS selectors (for example, using commas to combine more than one selector), as those selectors are separated and each one is executed alone.

Designed & crafted with ❤️ by Karim Shoair.



☐ β˜† βœ‡ WIRED

Amazon Is Investigating Perplexity Over Claims of Scraping Abuse

By: Dhruv Mehrotra, Andrew Couts β€” June 27th 2024 at 22:15
AWS hosted a server linked to the Bezos family- and Nvidia-backed search startup that appears to have been used to scrape the sites of major outlets, prompting an inquiry into potential rules violations.
☐ β˜† βœ‡ KitPloit - PenTest Tools!

JAW - A Graph-based Security Analysis Framework For Client-side JavaScript

By: Zion3R β€” May 19th 2024 at 12:30

An open-source, prototype implementation of property graphs for JavaScript, based on the esprima parser and the EsTree SpiderMonkey spec. JAW can be used for analyzing the client-side of web applications and JavaScript-based programs.

This project is licensed under GNU AFFERO GENERAL PUBLIC LICENSE V3.0. See here for more information.

JAW has a Github pages website available at https://soheilkhodayari.github.io/JAW/.

Release Notes:


Overview of JAW

The architecture of JAW is shown below.

Test Inputs

JAW can be used in two distinct ways:

  1. Arbitrary JavaScript Analysis: Utilize JAW for modeling and analyzing any JavaScript program by specifying the program's file system path.

  2. Web Application Analysis: Analyze a web application by providing a single seed URL.

Data Collection

  • JAW features several JavaScript-enabled web crawlers for collecting web resources at scale.

HPG Construction

  • Use the collected web resources to create a Hybrid Program Graph (HPG), which will be imported into a Neo4j database.

  • Optionally, supply the HPG construction module with a mapping of semantic types to custom JavaScript language tokens, facilitating the categorization of JavaScript functions based on their purpose (e.g., HTTP request functions).

Analysis and Outputs

  • Query the constructed Neo4j graph database for various analyses. JAW offers utility traversals for data flow analysis, control flow analysis, reachability analysis, and pattern matching. These traversals can be used to develop custom security analyses.

  • JAW also includes built-in traversals for detecting client-side CSRF, DOM Clobbering and request hijacking vulnerabilities.

  • The outputs will be stored in the same folder as the input.

Setup

The installation script relies on the following prerequisites:
  • Latest version of the npm package manager (Node.js)
  • Any stable version of Python 3.x
  • The Python pip package manager

Afterwards, install the necessary dependencies via:

$ ./install.sh

For detailed installation instructions, please see here.

Quick Start

Running the Pipeline

You can run an instance of the pipeline in a background screen via:

$ python3 -m run_pipeline --conf=config.yaml

The CLI provides the following options:

$ python3 -m run_pipeline -h

usage: run_pipeline.py [-h] [--conf FILE] [--site SITE] [--list LIST] [--from FROM] [--to TO]

This script runs the tool pipeline.

optional arguments:
-h, --help show this help message and exit
--conf FILE, -C FILE pipeline configuration file. (default: config.yaml)
--site SITE, -S SITE website to test; overrides config file (default: None)
--list LIST, -L LIST site list to test; overrides config file (default: None)
--from FROM, -F FROM the first entry to consider when a site list is provided; overrides config file (default: -1)
--to TO, -T TO the last entry to consider when a site list is provided; overrides config file (default: -1)

Input Config: JAW expects a .yaml config file as input. See config.yaml for an example.

Hint. The config file specifies different passes (e.g., crawling, static analysis, etc) which can be enabled or disabled for each vulnerability class. This allows running the tool building blocks individually, or in a different order (e.g., crawl all webapps first, then conduct security analysis).

Quick Example

For running a quick example demonstrating how to build a property graph and run Cypher queries over it, do:

$ python3 -m analyses.example.example_analysis --input=$(pwd)/data/test_program/test.js

Crawling and Data Collection

This module collects the data (i.e., JavaScript code and state values of web pages) needed for testing. If you want to test a specific JavaScript file that you already have on your file system, you can skip this step.

JAW has crawlers based on Selenium (JAW-v1), Puppeteer (JAW-v2, v3) and Playwright (JAW-v3). For most up-to-date features, it is recommended to use the Puppeteer- or Playwright-based versions.

Playwright CLI with Foxhound

This web crawler employs foxhound, an instrumented version of Firefox, to perform dynamic taint tracking as it navigates through webpages. To start the crawler, do:

$ cd crawler
$ node crawler-taint.js --seedurl=https://google.com --maxurls=100 --headless=true --foxhoundpath=<optional-foxhound-executable-path>

The foxhoundpath is by default set to the following directory: crawler/foxhound/firefox which contains a binary named firefox.

Note: you need a build of foxhound to use this version. An Ubuntu build is included in the JAW-v3 release.

Puppeteer CLI

To start the crawler, do:

$ cd crawler
$ node crawler.js --seedurl=https://google.com --maxurls=100 --browser=chrome --headless=true

See here for more information.

Selenium CLI

To start the crawler, do:

$ cd crawler/hpg_crawler
$ vim docker-compose.yaml # set the websites you want to crawl here and save
$ docker-compose build
$ docker-compose up -d

Please refer to the documentation of the hpg_crawler here for more information.

Graph Construction

HPG Construction CLI

To generate an HPG for a given (set of) JavaScript file(s), do:

$ node engine/cli.js  --lang=js --graphid=graph1 --input=/in/file1.js --input=/in/file2.js --output=$(pwd)/data/out/ --mode=csv

optional arguments:
--lang: language of the input program
--graphid: an identifier for the generated HPG
--input: path of the input program(s)
--output: path of the output HPG, must be i
--mode: determines the output format (csv or graphML)

HPG Import CLI

To import an HPG inside a neo4j graph database (docker instance), do:

$ python3 -m hpg_neo4j.hpg_import --rpath=<path-to-the-folder-of-the-csv-files> --id=<xyz> --nodes=<nodes.csv> --edges=<rels.csv>
$ python3 -m hpg_neo4j.hpg_import -h

usage: hpg_import.py [-h] [--rpath P] [--id I] [--nodes N] [--edges E]

This script imports a CSV of a property graph into a neo4j docker database.

optional arguments:
-h, --help show this help message and exit
--rpath P relative path to the folder containing the graph CSV files inside the `data` directory
--id I an identifier for the graph or docker container
--nodes N the name of the nodes csv file (default: nodes.csv)
--edges E the name of the relations csv file (default: rels.csv)

HPG Construction and Import CLI (v1)

In order to create a hybrid property graph for the output of the hpg_crawler and import it inside a local neo4j instance, you can also do:

$ python3 -m engine.api <path> --js=<program.js> --import=<bool> --hybrid=<bool> --reqs=<requests.out> --evts=<events.out> --cookies=<cookies.pkl> --html=<html_snapshot.html>

Specification of Parameters:

  • <path>: absolute path to the folder containing the program files for analysis (must be under the engine/outputs folder).
  • --js=<program.js>: name of the JavaScript program for analysis (default: js_program.js).
  • --import=<bool>: whether the constructed property graph should be imported to an active neo4j database (default: true).
  • --hybrid=<bool>: whether the hybrid mode is enabled (default: false). This implies that the tester wants to enrich the property graph by inputting files for any of the HTML snapshot, fired events, HTTP requests and cookies, as collected by the JAW crawler.
  • --reqs=<requests.out>: for hybrid mode only, name of the file containing the sequence of observed network requests, pass the string false to exclude (default: request_logs_short.out).
  • --evts=<events.out>: for hybrid mode only, name of the file containing the sequence of fired events, pass the string false to exclude (default: events.out).
  • --cookies=<cookies.pkl>: for hybrid mode only, name of the file containing the cookies, pass the string false to exclude (default: cookies.pkl).
  • --html=<html_snapshot.html>: for hybrid mode only, name of the file containing the DOM tree snapshot, pass the string false to exclude (default: html_rendered.html).

For more information, you can use the help CLI provided with the graph construction API:

$ python3 -m engine.api -h

Security Analysis

The constructed HPG can then be queried using Cypher or the NeoModel ORM.

Running Custom Graph traversals

You should place and run your queries in analyses/<ANALYSIS_NAME>.

Option 1: Using the NeoModel ORM (Deprecated)

You can use the NeoModel ORM to query the HPG. To write a query:

  • (1) Check out the HPG data model and syntax tree.
  • (2) Check out the ORM model for HPGs
  • (3) See the example query file provided; example_query_orm.py in the analyses/example folder.
$ python3 -m analyses.example.example_query_orm  

For more information, please see here.

Option 2: Using Cypher Queries

You can use Cypher to write custom queries. For this:

  • (1) Check out the HPG data model and syntax tree.
  • (2) See the example query file provided; example_query_cypher.py in the analyses/example folder.
$ python3 -m analyses.example.example_query_cypher

For more information, please see here.
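
As a schema-agnostic sketch of querying the imported HPG with Cypher from Python (the connection settings below are assumptions for a local docker instance; JAW's own example files remain the reference):

from neo4j import GraphDatabase  # official neo4j Python driver

# Assumed connection settings; adjust the bolt URL and credentials to your setup.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "your-password"))

with driver.session() as session:
    # Schema-agnostic example: count all nodes in the imported property graph.
    record = session.run("MATCH (n) RETURN count(n) AS nodes").single()
    print(f"HPG contains {record['nodes']} nodes")

driver.close()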

Vulnerability Detection

This section describes how to configure and use JAW for vulnerability detection, and how to interpret the output. JAW contains, among others, self-contained queries for detecting client-side CSRF and DOM Clobbering.

Step 1. Enable the analysis component for the vulnerability class in the input config.yaml file:

request_hijacking:
enabled: true
# [...]
#
domclobbering:
enabled: false
# [...]

cs_csrf:
enabled: false
# [...]

Step 2. Run an instance of the pipeline with:

$ python3 -m run_pipeline --conf=config.yaml

Hint. You can run multiple instances of the pipeline under different screens:

$ screen -dmS s1 bash -c 'python3 -m run_pipeline --conf=conf1.yaml; exec sh'
$ screen -dmS s2 bash -c 'python3 -m run_pipeline --conf=conf2.yaml; exec sh'
$ # [...]

To generate parallel configuration files automatically, you may use the generate_config.py script.

How to Interpret the Output of the Analysis?

The outputs will be stored in a file called sink.flows.out in the same folder as that of the input. For Client-side CSRF, for example, for each HTTP request detected, JAW outputs an entry marking the set of semantic types (a.k.a. semantic tags or labels) associated with the elements constructing the request (i.e., the program slices). For example, an HTTP request marked with the semantic type ['WIN.LOC'] is forgeable through the window.location injection point. However, a request marked with ['NON-REACH'] is not forgeable.

An example output entry is shown below:

[*] Tags: ['WIN.LOC']
[*] NodeId: {'TopExpression': '86', 'CallExpression': '87', 'Argument': '94'}
[*] Location: 29
[*] Function: ajax
[*] Template: ajaxloc + "/bearer1234/"
[*] Top Expression: $.ajax({ xhrFields: { withCredentials: "true" }, url: ajaxloc + "/bearer1234/" })

1:['WIN.LOC'] variable=ajaxloc
0 (loc:6)- var ajaxloc = window.location.href

This entry shows that on line 29, there is a $.ajax call expression, and this call expression triggers an ajax request with the URL template value ajaxloc + "/bearer1234/", where the parameter ajaxloc is a program slice reading its value at line 6 from window.location.href, and is thus forgeable through ['WIN.LOC'].

Test Web Application

In order to streamline the testing process for JAW and ensure that your setup is accurate, we provide a simple node.js web application which you can test JAW with.

First, install the dependencies via:

$ cd tests/test-webapp
$ npm install

Then, run the application in a new screen:

$ screen -dmS jawwebapp bash -c 'PORT=6789 npm run devstart; exec sh'

Detailed Documentation.

For more information, visit our wiki page here. Below is a table of contents for quick access.

The Web Crawler of JAW

Data Model of Hybrid Property Graphs (HPGs)

Graph Construction

Graph Traversals

Contribution and Code Of Conduct

Pull requests are always welcome. This project is intended to be a safe, welcoming space, and contributors are expected to adhere to the contributor code of conduct.

Academic Publication

If you use the JAW for academic research, we encourage you to cite the following paper:

@inproceedings{JAW,
title = {JAW: Studying Client-side CSRF with Hybrid Property Graphs and Declarative Traversals},
author= {Soheil Khodayari and Giancarlo Pellegrino},
booktitle = {30th {USENIX} Security Symposium ({USENIX} Security 21)},
year = {2021},
address = {Vancouver, B.C.},
publisher = {{USENIX} Association},
}

Acknowledgements

JAW has come a long way and we want to give our contributors a well-deserved shoutout here!

@tmbrbr, @c01gide, @jndre, and Sepehr Mirzaei.



☐ β˜† βœ‡ KitPloit - PenTest Tools!

Huntr-Com-Bug-Bounties-Collector - Keep Watching New Bug Bounty (Vulnerability) Postings

By: Zion3R β€” February 27th 2024 at 11:30


New bug bounty (vulnerability) collector


Requirements
  • Chrome with GUI (if you encounter trouble with script execution, check the status of the VM's GPU features, if available)
  • Chrome WebDriver

Preview
# python3 main.py

*2024-02-20 16:14:47.836189*

1. Arbitrary File Reading due to Lack of Input Filepath Validation
- Feb 6th 2024 / High (CVE-2024-0964)
- gradio-app/gradio
- https://huntr.com/bounties/25e25501-5918-429c-8541-88832dfd3741/

2. View Barcode Image leads to Remote Code Execution
- Jan 31st 2024 / Critical (CVE: Not yet)
- dolibarr/dolibarr
- https://huntr.com/bounties/f0ffd01e-8054-4e43-96f7-a0d2e652ac7e/

(delimiter-based file database)

# vim feeds.db

1|2024-02-20 16:17:40.393240|7fe14fd58ca2582d66539b2fe178eeaed3524342|CVE-2024-0964|https://huntr.com/bounties/25e25501-5918-429c-8541-88832dfd3741/
2|2024-02-20 16:17:40.393987|c6b84ac808e7f229a4c8f9fbd073b4c0727e07e1|CVE: Not yet|https://huntr.com/bounties/f0ffd01e-8054-4e43-96f7-a0d2e652ac7e/
3|2024-02-20 16:17:40.394582|7fead9658843919219a3b30b8249700d968d0cc9|CVE: Not yet|https://huntr.com/bounties/d6cb06dc-5d10-4197-8f89-847c3203d953/
4|2024-02-20 16:17:40.395094|81fecdd74318ce7da9bc29e81198e62f3225bd44|CVE: Not yet|https://huntr.com/bounties/d875d1a2-7205-4b2b-93cf-439fa4c4f961/
5|2024-02-20 16:17:40.395613|111045c8f1a7926174243db403614d4a58dc72ed|CVE: Not yet|https://huntr.com/bounties/10e423cd-7051-43fd-b736-4e18650d0172/
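
Reading the pipe-delimited database back is straightforward; the sketch below (field order assumed from the sample above) collects the hashes already recorded so that a re-run only reports new postings:

def load_seen_hashes(db_path="feeds.db"):
    # Each line: id|timestamp|sha1-hash|CVE id|URL (as in the sample above).
    seen = set()
    with open(db_path, encoding="utf-8") as fh:
        for line in fh:
            fields = line.rstrip("\n").split("|")
            if len(fields) >= 5:
                seen.add(fields[2])  # the hash uniquely identifies a posting
    return seen

print(load_seen_hashes())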

Notes
  • This code is designed to parse HTML elements from huntr.com, so it may not function correctly if the HTML page structure changes.
  • In case of errors during parsing, exception handling has been included, so if it doesn't work as expected, please inspect the HTML source for any changes.
  • If you run into trouble: in a typical cloud environment, scripts may not function properly within virtual machines (VMs).


☐ β˜† βœ‡ KitPloit - PenTest Tools!

PhantomCrawler - Boost Website Hits By Generating Requests From Multiple Proxy IPs

By: Zion3R β€” January 4th 2024 at 11:30


PhantomCrawler allows users to simulate website interactions through different proxy IP addresses. It leverages Python, requests, and BeautifulSoup to offer a simple and effective way to test website behaviour under varied proxy configurations.

Features:

  • Utilizes a list of proxy IP addresses from a specified file.
  • Supports both HTTP and HTTPS proxies.
  • Allows users to input the target website URL, proxy file path, and a static port.
  • Makes HTTP requests to the specified website using each proxy.
  • Parses HTML content to extract and visit links on the webpage.

Usage:

  • POC Testing: Simulate website interactions to assess functionality under different proxy setups.
  • Web Traffic Increase: Boost website hits by generating requests from multiple proxy IPs.
  • Proxy Rotation Testing: Evaluate the effectiveness of rotating proxy IPs.
  • Web Scraping Testing: Assess web scraping tasks under different proxy configurations.
  • DDoS Awareness: Caution: The tool has the potential for misuse as a DDoS tool. Ensure responsible and ethical use.

Get new proxies (with port) and add them to proxies.txt in this format: 50.168.163.176:80
  • You can get them from https://free-proxy-list.net/. These free proxies are not validated; some might not work, so validate them before adding (see the sketch below).
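
A minimal sketch (not part of PhantomCrawler itself) for pre-validating the proxies in proxies.txt before a run:

import requests

def working_proxies(path="proxies.txt", test_url="https://httpbin.org/ip", timeout=5):
    good = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            proxy = line.strip()
            if not proxy:
                continue
            proxies = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
            try:
                requests.get(test_url, proxies=proxies, timeout=timeout)
                good.append(proxy)
            except requests.RequestException:
                pass  # dead or unreachable proxy, skip it
    return good

print(working_proxies())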

How to Use:

  1. Clone the repository:
git clone https://github.com/spyboy-productions/PhantomCrawler.git
  2. Install dependencies:
pip3 install -r requirements.txt
  3. Run the script:
python3 PhantomCrawler.py

Disclaimer: PhantomCrawler is intended for educational and testing purposes only. Users are cautioned against any misuse, including potential DDoS activities. Always ensure compliance with the terms of service of websites being tested and adhere to ethical standards.


Snapshots:

If you find this GitHub repo useful, please consider giving it a star!



☐ β˜† βœ‡ KitPloit - PenTest Tools!

Crawlector - Threat Hunting Framework Designed For Scanning Websites For Malicious Objects

By: Zion3R β€” November 12th 2023 at 11:30


Crawlector (the name Crawlector is a combination of Crawler & Detector) is a threat hunting framework designed for scanning websites for malicious objects.

Note-1: The framework was first presented at the No Hat conference in Bergamo, Italy on October 22nd, 2022 (Slides, YouTube Recording). Also, it was presented for the second time at the AVAR conference, in Singapore, on December 2nd, 2022.

Note-2: The accompanying tool EKFiddle2Yara (a tool that takes EKFiddle rules and converts them into Yara rules), mentioned in the talk, was also released at both conferences.


Features

  • Supports spidering websites to find additional links for scanning (up to 2 levels only)
  • Integrates Yara as a backend engine for rule scanning
  • Supports online and offline scanning
  • Supports crawling for domains/sites digital certificate
  • Supports querying URLhaus for finding malicious URLs on the page
  • Supports hashing the page's content with TLSH (Trend Micro Locality Sensitive Hash), and other standard cryptographic hash functions such as md5, sha1, sha256, and ripemd128, among others
    • TLSH won't return a value if the page size is less than 50 bytes or there is not enough randomness in the data
  • Supports querying the rating and category of every URL
  • Supports expanding on a given site, by attempting to find all available TLDs and/or subdomains for the same domain
    • This feature uses the Omnisint Labs API (this site is down as of March 10, 2023) and RapidAPI APIs
    • TLD expansion implementation is native
    • This feature along with the rating and categorization, provides the capability to find scam/phishing/malicious domains for the original domain
  • Supports domain resolution (IPv4 and IPv6)
  • Saves scanned websites pages for later scanning (can be saved as a zip compressed)
  • The entirety of the framework's settings is controlled via a single customizable configuration file
  • All scanning sessions are saved into a well-structured CSV file with a plethora of information about the website being scanned, in addition to information about the Yara rules that have triggered
  • All HTTP(S) communications are proxy-aware
  • One executable
  • Written in C++

URLHaus Scanning & API Integration

This is for checking for malicious urls against every page being scanned. The framework could either query the list of malicious URLs from URLHaus server (configuration: url_list_web), or from a file on disk (configuration: url_list_file), and if the latter is specified, then, it takes precedence over the former.

It works by searching the content of every page against all URL entries in url_list_web or url_list_file, checking for all occurrences. Additionally, upon a match, and if the configuration option check_url_api is set to true, Crawlector will send a POST request to the API URL set in the url_api configuration option, which returns a JSON object with extra information about a matching URL. Such information includes urlh_status (ex., online, offline, unknown), urlh_threat (ex., malware_download), urlh_tags (ex., elf, Mozi), and urlh_reference (ex., https://urlhaus.abuse.ch/url/1116455/). This information will be included in the log file cl_mlog_<current_date><current_time><(pm|am)>.csv (check below), only if check_url_api is set to true. Otherwise, the log file will include the columns urlh_url (list of matching malicious URLs) and urlh_hit (number of occurrences for every matching malicious URL), conditional on whether check_url is set to true.

URLHaus feature could be disabled in its entirety by setting the configuration option check_url to false.

It is important to note that this feature could slow down scanning, considering the huge number of malicious URLs (~130 million entries at the time of this writing) that need to be checked, and the time it takes to get extra information from the URLHaus server (if the option check_url_api is set to true).
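
For reference, the per-URL lookup described above can be reproduced against the public URLHaus API. The Python sketch below assumes the standard v1 /url/ endpoint is what the url_api option points to (Crawlector itself is written in C++, so this only illustrates the exchange):

import requests

def urlhaus_lookup(url):
    # Assumed endpoint: the public URLHaus v1 URL lookup API.
    resp = requests.post("https://urlhaus-api.abuse.ch/v1/url/", data={"url": url}, timeout=10)
    data = resp.json()
    if data.get("query_status") != "ok":
        return None
    # Fields analogous to the urlh_* CSV columns described above.
    return {
        "status": data.get("url_status"),
        "threat": data.get("threat"),
        "tags": data.get("tags"),
        "reference": data.get("urlhaus_reference"),
    }

print(urlhaus_lookup("http://example.com/suspicious.exe"))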

Files and Folders Structures

  1. \cl_sites
    • this is where the list of sites to be visited or crawled is stored.
    • supports multiple files and directories.
  2. \crawled
    • where all crawled/spidered URLs are saved to a text file.
  3. \certs
    • where all domains/sites digital certificates are stored (in .der format).
  4. \results
    • where visited websites are saved.
  5. \pg_cache
    • program cache for sites that are not part of the spider functionality.
  6. \cl_cache
    • crawler cache for sites that are part of the spider functionality.
  7. \yara_rules
    • this is where all Yara rules are stored. All rules that exist in this directory will be loaded by the engine, parsed, validated, and evaluated before execution.
  8. cl_config.ini
    • this file contains all the configuration parameters that can be adjusted to influence the behavior of the framework.
  9. cl_mlog_<current_date><current_time><(pm|am)>.csv
    • log file that contains a plethora of information about visited websites
    • date, time, the status of Yara scanning, list of fired Yara rules with the offsets and lengths of each of the matches, id, URL, HTTP status code, connection status, HTTP headers, page size, the path to a saved page on disk, and other columns related to URLHaus results.
    • file name is unique per session.
  10. cl_offl_mlog_<current_date><current_time><(pm|am)>.csv
    • log file that contains information about files scanned offline.
    • list of fired Yara rules with the offsets and lengths of the matches, and path to a saved page on disk.
    • file name is unique per session.
  11. cl_certs_<current_date><current_time><(pm|am)>.csv
    • log file that contains a plethora of information about found digital certificates
  12. \expanded\exp_subdomain_<pm|am>.txt
    • contains discovered subdomains (part of the [site] section)
  13. \expanded\exp_tld_<pm|am>.txt
    • contains discovered domains (part of the [site] section)

Configuration File (cl_config.ini)

It is very important that you familiarize yourself with the configuration file cl_config.ini before running any session. All of the sections and parameters are documented in the configuration file itself.

The Yara offline scanning feature is a standalone option, meaning that, if enabled, Crawlector will execute only this feature, irrespective of other enabled features. The same is true for the crawling for domains/sites digital certificate feature. Either way, it is recommended that you disable all unused features in the configuration file.

  • Depending on the configuration settings (log_to_file or log_to_cons), if a Yara rule references only a module's attributes (ex., PE, ELF, Hash, etc...), then Crawlector will display only the rule's name upon a match, excluding offset and length data.

Sites Format Pattern

To visit/scan a website, the list of URLs must be stored in text files, in the directory "cl_sites".

Crawlector accepts three types of URLs:

  1. Type 1: one URL per line
    • Crawlector will assign a unique name to every URL, derived from the URL hostname
  2. Type 2: one URL per line, with a unique name [a-zA-Z0-9_-]{1,128} = <url>
  3. Type 3: for the spider functionality, a unique format is used. One URL per line is as follows:

<id>[depth:<0|1>-><\d+>,total:<\d+>,sleep:<\d+>] = <url>

For example,

mfmokbel[depth:1->3,total:10,sleep:0] = https://www.mfmokbel.com

which is equivalent to: mfmokbel[d:1->3,t:10,s:0] = https://www.mfmokbel.com

where, <id> := [a-zA-Z0-9_-]{1,128}

depth, total and sleep, can also be replaced with their shortened versions d, t and s, respectively.

  • depth: the spider supports going two levels deep for finding additional URLs (this is a design decision).
  • A value of 0 indicates a depth of level 1, with the value that comes after the "->" ignored.
  • A depth of level-1 is controlled by the total parameter. So, first, the spider tries to find that many additional URLs off of the specified URL.
  • The value after the "->" represents the maximum number of URLs to spider for each of the URLs found (as per the total parameter value).
  • A value of 1 indicates a depth of level 2, with the value that comes after the "->" representing the maximum number of URLs to find, for every URL found per the total parameter. For clarification, and as shown in the example above, first, the spider will look for 10 URLs (as specified in the total parameter), and then, each of those found URLs will be spidered up to a max of 3 URLs; therefore, in the best-case scenario, we would end up with 40 (10 + (10*3)) URLs.
  • The sleep parameter takes an integer value representing the number of milliseconds to sleep between every HTTP request.

Note 1: Type 3 URL could be turned into type 1 URL by setting the configuration parameter live_crawler to false, in the configuration file, in the spider section.

Note 2: Empty lines and lines that start with ";" or "//" are ignored.

The Spider Functionality

The spider functionality is what gives Crawlector the capability to find additional links on the targeted page. The Spider supports the following features:

  • The domain has to be of Type 3, for the Spider functionality to work
  • You may specify a list of wildcarded patterns (pipe delimited) to prevent spidering matching urls via the exclude_url config. option. For example, *.zip|*.exe|*.rar|*.7z|*.pdf|*.bat|*.db
  • You may specify a list of wildcarded patterns (pipe delimited) to spider only urls that match the pattern via the include_url config. option. For example, */checkout/*|*/products/*
  • You may exclude HTTPS urls via the config. option exclude_https
  • You may account for outbound/external links as well, for the main page only, via the config. option add_ext_links. This feature honours the exclude_url and include_url config. option.
  • You may account for outbound/external links of the main page only, excluding all other urls, via the config. option ext_links_only. This feature honours the exclude_url and include_url config. option.

Site Ranking Functionality

  • This is for checking the ranking of the website
  • You give it a file with a list of websites, with their ranking, in a csv file format
  • Services that provide lists of websites ranking include, Alexa top-1m (discontinued as of May 2022), Cisco Umbrella, Majestic, Quantcast, Farsight and Tranco, among others
  • CSV file format (2 columns only): first column holds the ranking, and the second column holds the domain name
  • If a cell contains quoted data, it will be automatically dequoted
  • Line breaks aren't allowed in quoted text
  • Leading and trailing spaces are trimmed from cells read
  • Empty and comment lines are skipped
  • The section site_ranking in the configuration file provides some options to alter how the CSV file is to be read
  • The performance of this query is dependent on the number of records in the CSV file
  • Crawlector compares every entry in the CSV file against the domain being investigated, and not the other way around
  • Only the registered/pay-level domain is compared

Finding TLDs and Subdomains - [site] Section

  • The site section provides the capability to expand on a given site, by attempting to find all available top-level domains (TLDs) and/or subdomains for the same domain. If found, new tlds/subdomains will be checked like any other domain
  • This feature uses the Omnisint Labs (https://omnisint.io/) and RapidAPI APIs
  • Omnisint Labs API returns subdomains and tlds, whereas RapidAPI returns only subdomains (the Omnisint Labs API is down as of March 10, 2023, however, the implementation is still available in case the site is back up)
  • For RapidAPI, you need a valid "Domains records" API key that you can request from RapidAPI, and plug it into the key rapid_api_key in the configuration file
  • With find_tlds enabled, in addition to Omnisint Labs API tlds results, the framework attempts to find other active/registered domains by going through every tld entry, either, in the tlds_file or tlds_url
  • If tlds_url is set, it should point to a url that hosts tlds, each one on a new line (lines that start with either of the characters ';', '#' or '//' are ignored)
  • tlds_file, holds the filename that contains the list of tlds (same as for tlds_url; only the tld is present, excluding the '.', for ex., "com", "org")
  • If tlds_file is set, it takes precedence over tlds_url
  • tld_dl_time_out, this is for setting the maximum timeout for the dnslookup function when attempting to check if the domain in question resolves or not
  • tld_use_connect, this option enables the functionality to connect to the domain in question over a list of ports, defined in the option tlds_connect_ports
  • The option tlds_connect_ports accepts a list of ports, comma separated, or a list of ranges, such as 25-40,90-100,80,443,8443 (range start and end are inclusive)
    • tld_con_time_out, this is for setting the maximum timeout for the connect function
  • tld_con_use_ssl, enable/disable the use of ssl when attempting to connect to the domain
  • If save_to_file_subd is set to true, discovered subdomains will be saved to "\expanded\exp_subdomain_<pm|am>.txt"
  • If save_to_file_tld is set to true, discovered domains will be saved to "\expanded\exp_tld_<pm|am>.txt"
  • If exit_here is set to true, then Crawlector bails out after executing this [site] function, irrespective of other enabled options. It means found sites won't be crawled/spidered

Design Considerations

  • A URL page is retrieved by sending a GET request to the server, reading the server response body, and passing it to Yara engine for detection.
  • Some of the GET request attributes are defined in the [default] section in the configuration file, including, the User-Agent and Referer headers, and connection timeout, among other options.
  • Although Crawlector logs a session's data to a CSV file, converting it to an SQL file is recommended for better performance, manipulation and retrieval of the data. This becomes evident when you're crawling thousands of domains.
  • Repeated domains/urls in the cl_sites are allowed.

Limitations

  • Single threaded
  • Static detection (no dynamic evaluation of a given page's content)
  • No headless browser support, yet!

Third-party libraries used

Contributing

Open for pull requests and issues. Comments and suggestions are greatly appreciated.

Author

Mohamad Mokbel (@MFMokbel)



☐ β˜† βœ‡ KitPloit - PenTest Tools!

Cve-Collector - Simple Latest CVE Collector

By: Zion3R β€” November 1st 2023 at 11:30


Simple Latest CVE Collector Written in Python

  • There are various methods for collecting the latest CVE (Common Vulnerabilities and Exposures) information.
  • This code was created to provide guidance on how to collect, what information to include, and how to code when creating a CVE collector.
  • The code provided here is one of many ways to implement a CVE collector.
  • It is written using a method that involves crawling a specific website, parsing HTML elements, and retrieving the data.

This collector uses a search query on https://www.cvedetails.com to collect information on vulnerabilities with a severity score of 6 or higher.

  • It creates a simple delimiter-based file to function as a database (no DBMS required).
  • When a new CVE is discovered, it retrieves "vulnerability details" as well.

  1. Set the cvss_min_score variable.
  2. Add additional code to receive results, such as a webhook (see the sketch after this list).
  • The location for calling this code is marked as "Send the result to webhook."
  3. If you want to run it automatically, register it in crontab or a similar scheduler.
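
Where the source marks "Send the result to webhook", a generic HTTP webhook call could look like the sketch below; the webhook URL and payload shape are placeholders, not part of this project:

import requests

def send_to_webhook(message, webhook_url="https://example.com/your-webhook"):
    # Placeholder webhook: most chat/webhook services accept a small JSON payload.
    resp = requests.post(webhook_url, json={"text": message}, timeout=10)
    resp.raise_for_status()

send_to_webhook("1. CVE-2023-44832 / CVSS: 7.5 (HIGH) ...")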

# python3 main.py

*2023-10-10 11:05:33.370262*

1. CVE-2023-44832 / CVSS: 7.5 (HIGH)
- Published: 2023-10-05 16:15:12
- Updated: 2023-10-07 03:15:47
- CWE: CWE-120 Buffer Copy without Checking Size of Input ('Classic Buffer Overflow')

D-Link DIR-823G A1V1.0.2B05 was discovered to contain a buffer overflow via the MacAddress parameter in the SetWanSettings function. Th...
>> https://www.cve.org/CVERecord?id=CVE-2023-44832

- Ref.
(1) https://www.dlink.com/en/security-bulletin/
(2) https://github.com/bugfinder0/public_bug/tree/main/dlink/dir823g/SetWanSettings_MacAddress



2. CVE-2023-44831 / CVSS: 7.5 (HIGH)
- Published: 2023-10-05 16:15:12
- Updated: 2023-10-07 03:16:56
- CWE: CWE-120 Buffer Copy without Checking Size of Input ('Classic Buffer Overflow')

D-Link DIR-823G A1V1.0.2B05 was discovered to contain a buffer overflow via the Type parameter in the SetWLanRadioSettings function. Th...
>> https://www.cve.org/CVERecord?id=CVE-2023-44831

- Ref.
(1) https://www.dlink.com/en/security-bulletin/
(2) https://github.com/bugfinder0/public_bug/tree/main/dlink/dir823g/SetWLanRadioSettings_Type

(delimiter-based file database)

# vim feeds.db

1|2023-10-10 09:24:21.496744|0d239fa87be656389c035db1c3f5ec6ca3ec7448|CVE-2023-45613|2023-10-09 11:15:11|6.8|MEDIUM|CWE-295 Improper Certificate Validation
2|2023-10-10 09:24:27.073851|30ebff007cca946a16e5140adef5a9d5db11eee8|CVE-2023-45612|2023-10-09 11:15:11|8.6|HIGH|CWE-611 Improper Restriction of XML External Entity Reference
3|2023-10-10 09:24:32.650234|815b51259333ed88193fb3beb62c9176e07e4bd8|CVE-2023-45303|2023-10-06 19:15:13|8.4|HIGH|Not found CWE ids for CVE-2023-45303
4|2023-10-10 09:24:38.369632|39f98184087b8998547bba41c0ccf2f3ad61f527|CVE-2023-45248|2023-10-09 12:15:10|6.6|MEDIUM|CWE-427 Uncontrolled Search Path Element
5|2023-10-10 09:24:43.936863|60083d8626b0b1a59ef6fa16caec2b4fd1f7a6d7|CVE-2023-45247|2023-10-09 12:15:10|7.1|HIGH|CWE-862 Missing Authorization
6|2023-10-10 09:24:49.472179|82611add9de44e5807b8f8324bdfb065f6d4177a|CVE-2023-45246|2023-10-06 11:15:11|7.1|HIGH|CWE-287 Improper Authentication
7|2023-10-10 09:24:55.049191|b78014cd7ca54988265b19d51d90ef935d2362cf|CVE-2023-45244|2023-10-06 10:15:18|7.1|HIGH|CWE-862 Missing Authorization

The methods for collecting CVE (Common Vulnerabilities and Exposures) information are divided into different stages. They are primarily categorized into two:

(1) Method for retrieving CVE information after vulnerability analysis and risk assessment have been completed.

  • This method involves collecting CVE information after all the processes have been completed.
  • Naturally, there is a time lag of several days (it is slower).

(2) Method for retrieving CVE information at the stage when it is included as a vulnerability.

  • This refers to the stage immediately after a CVE ID has been assigned and the vulnerability has been publicly disclosed.
  • At this stage, there may only be basic information about the vulnerability, or the CVSS score may not have been evaluated, and there may be a lack of necessary content such as reference documents.

  • This code is designed to parse HTML elements from cvedetails.com, so it may not function correctly if the HTML page structure changes.
  • In case of errors during parsing, exception handling has been included, so if it doesn't work as expected, please inspect the HTML source for any changes.

  • Get the latest information for free. If it's useful to someone, it will stay free for all (absolutely nothing paid).
  • ID 2 is the channel created using this repository's source code.
  • If you find this helpful, please give it a "star" to support further improvements.


☐ β˜† βœ‡ KitPloit - PenTest Tools!

Pinkerton - A JavaScript File Crawler And Secret Finder Developed In Python

By: Zion3R β€” September 28th 2023 at 11:30


Pinkerton is a Python tool created to crawl JavaScript files and search for secrets.


Installing / Getting started

A quick guide on how to install and use Pinkerton.

1. Clone the repository with: git clone https://github.com/oppsec/pinkerton.git
2. Install the libraries with: pip3 install -r requirements.txt
3. Run Pinkerton with: python3 main.py -u https://example.com

Docker

If you want to use Pinkerton in a Docker container, follow these commands:

1. Clone the repository - git clone https://github.com/oppsec/pinkerton.git
2. Build the image - sudo docker build -t pinkerton:latest .
3. Run container - sudo docker run pinkerton:latest



Pre-requisites

  • Python 3 installed on your machine.
  • Install the libraries with pip3 install -r requirements.txt

Features

  • Works with ProxyChains
  • Fast scan
  • Low RAM and CPU usage
  • Open-Source
  • Python ❤️

To-Do

  • Add more secret regex patterns
  • Improve the JavaScript file extraction function
  • Improve the pattern matching system
  • Add pass list file method
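
The core idea, matching secret-looking strings in downloaded JavaScript, can be sketched in a few lines; the patterns below are common public examples, not Pinkerton's actual rule set:

import re
import requests

# Two well-known example patterns; a real rule set would be much larger.
PATTERNS = {
    "AWS access key ID": re.compile(r"AKIA[0-9A-Z]{16}"),
    "Google API key": re.compile(r"AIza[0-9A-Za-z_-]{35}"),
}

def find_secrets(js_url):
    body = requests.get(js_url, timeout=10).text
    hits = []
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(body):
            hits.append((name, match.group(0)))
    return hits

print(find_secrets("https://example.com/static/app.js"))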

Contributing

A quick guide on how to contribute to the project.

1. Create a fork from Pinkerton repository
2. Clone the repository with git clone https://github.com/your/pinkerton.git
3. Type cd pinkerton/
4. Create a branch and make your changes
5. Commit and make a git push
6. Open a pull request


Credits


Warning

  • The developer is not responsible for any malicious use of this tool.


☐ β˜† βœ‡ KitPloit - PenTest Tools!

Xcrawl3R - A CLI Utility To Recursively Crawl Webpages

By: Zion3R β€” August 11th 2023 at 12:30


xcrawl3r is a command-line interface (CLI) utility to recursively crawl webpages, i.e. systematically browse webpages' URLs and follow links to discover linked webpages' URLs.


Features

  • Recursively crawls webpages for URLs.
  • Parses URLs from files (.js, .json, .xml, .csv, .txt & .map).
  • Parses URLs from robots.txt.
  • Parses URLs from sitemaps.
  • Renders pages (including Single Page Applications such as Angular and React).
  • Cross-Platform (Windows, Linux & macOS)

Installation

Install release binaries (Without Go Installed)

Visit the releases page and find the appropriate archive for your operating system and architecture. Download the archive from your browser or copy its URL and retrieve it with wget or curl:

  • ...with wget:

     wget https://github.com/hueristiq/xcrawl3r/releases/download/v<version>/xcrawl3r-<version>-linux-amd64.tar.gz
  • ...or, with curl:

     curl -OL https://github.com/hueristiq/xcrawl3r/releases/download/v<version>/xcrawl3r-<version>-linux-amd64.tar.gz

...then, extract the binary:

tar xf xcrawl3r-<version>-linux-amd64.tar.gz

TIP: The above steps, download and extract, can be combined into a single step with this one-liner:

curl -sL https://github.com/hueristiq/xcrawl3r/releases/download/v<version>/xcrawl3r-<version>-linux-amd64.tar.gz | tar -xzv

NOTE: On Windows systems, you should be able to double-click the zip archive to extract the xcrawl3r executable.

...move the xcrawl3r binary to somewhere in your PATH. For example, on GNU/Linux and OS X systems:

sudo mv xcrawl3r /usr/local/bin/

NOTE: Windows users can follow How to: Add Tool Locations to the PATH Environment Variable in order to add xcrawl3r to their PATH.

Install source (With Go Installed)

Before you install from source, you need to make sure that Go is installed on your system. You can install Go by following the official instructions for your operating system. For this, we will assume that Go is already installed.

go install ...

go install -v github.com/hueristiq/xcrawl3r/cmd/xcrawl3r@latest

go build ... the development Version

  • Clone the repository

     git clone https://github.com/hueristiq/xcrawl3r.git 
  • Build the utility

     cd xcrawl3r/cmd/xcrawl3r && \
    go build .
  • Move the xcrawl3r binary to somewhere in your PATH. For example, on GNU/Linux and OS X systems:

     sudo mv xcrawl3r /usr/local/bin/

    NOTE: Windows users can follow How to: Add Tool Locations to the PATH Environment Variable in order to add xcrawl3r to their PATH.

NOTE: While the development version is a good way to take a peek at xcrawl3r's latest features before they get released, be aware that it may have bugs. Officially released versions will generally be more stable.

Usage

To display help message for xcrawl3r use the -h flag:

xcrawl3r -h

help message:

                             _ _____      
__ _____ _ __ __ ___ _| |___ / _ __
\ \/ / __| '__/ _` \ \ /\ / / | |_ \| '__|
> < (__| | | (_| |\ V V /| |___) | |
/_/\_\___|_| \__,_| \_/\_/ |_|____/|_| v0.1.0

A CLI utility to recursively crawl webpages.

USAGE:
xcrawl3r [OPTIONS]

INPUT:
-d, --domain string domain to match URLs
--include-subdomains bool match subdomains' URLs
-s, --seeds string seed URLs file (use `-` to get from stdin)
-u, --url string URL to crawl

CONFIGURATION:
--depth int maximum depth to crawl (default 3)
TIP: set it to `0` for infinite recursion
--headless bool If true the browser will be displayed while crawling.
-H, --headers string[] custom header to include in requests
e.g. -H 'Referer: http://example.com/'
TIP: use multiple flag to set multiple headers
--proxy string[] Proxy URL (e.g: http://127.0.0.1:8080)
TIP: use multiple flag to set multiple proxies
--render bool utilize a headless chrome instance to render pages
--timeout int time to wait for request in seconds (default: 10)
--user-agent string User Agent to use (default: web)
TIP: use `web` for a random web user-agent,
`mobile` for a random mobile user-agent,
or you can set your specific user-agent.

RATE LIMIT:
-c, --concurrency int number of concurrent fetchers to use (default 10)
--delay int delay between each request in seconds
--max-random-delay int maximux extra randomized delay added to `--dalay` (default: 1s)
-p, --parallelism int number of concurrent URLs to process (default: 10)

OUTPUT:
--debug bool enable debug mode (default: false)
-m, --monochrome bool coloring: no colored output mode
-o, --output string output file to write found URLs
-v, --verbosity string debug, info, warning, error, fatal or silent (default: debug)
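
For example, a typical invocation combining the flags above (the values are illustrative) might be:

xcrawl3r -u https://example.com -d example.com --depth 2 --concurrency 20 -o example.com.urls.txt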

Contributing

Issues and Pull Requests are welcome! Check out the contribution guidelines.

Licensing

This utility is distributed under the MIT license.

Credits


