A Model Context Protocol (MCP) server implementation that integrates with Firecrawl for web scraping capabilities.
Big thanks to @vrknetha, @cawstudios for the initial implementation!
You can also play around with our MCP Server on MCP.so's playground. Thanks to MCP.so for hosting and @gstarwd for integrating our server.
To run the server directly with npx:
env FIRECRAWL_API_KEY=fc-YOUR_API_KEY npx -y firecrawl-mcp
Or install it globally:
npm install -g firecrawl-mcp
Configuring Cursor
Note: Requires Cursor version 0.45.6+. For the most up-to-date configuration instructions, please refer to the official Cursor documentation on configuring MCP servers: Cursor MCP Server Configuration Guide.
To configure Firecrawl MCP in Cursor v0.45.6, add a new MCP server with the following command:
env FIRECRAWL_API_KEY=your-api-key npx -y firecrawl-mcp
To configure Firecrawl MCP in Cursor v0.48.6, add the following to your MCP configuration:
{
  "mcpServers": {
    "firecrawl-mcp": {
      "command": "npx",
      "args": ["-y", "firecrawl-mcp"],
      "env": {
        "FIRECRAWL_API_KEY": "YOUR-API-KEY"
      }
    }
  }
}
If you are using Windows and are running into issues, try
cmd /c "set FIRECRAWL_API_KEY=your-api-key && npx -y firecrawl-mcp"
Replace your-api-key with your Firecrawl API key. If you don't have one yet, you can create an account and get it from https://www.firecrawl.dev/app/api-keys
After adding, refresh the MCP server list to see the new tools. The Composer Agent will automatically use Firecrawl MCP when appropriate, but you can explicitly request it by describing your web scraping needs. Access the Composer via Command+L (Mac), select "Agent" next to the submit button, and enter your query.
Add this to your ./codeium/windsurf/model_config.json:
{
"mcpServers": {
"mcp-server-firecrawl": {
"command": "npx",
"args": ["-y", "firecrawl-mcp"],
"env": {
"FIRECRAWL_API_KEY": "YOUR_API_KEY"
}
}
}
}
To install Firecrawl for Claude Desktop automatically via Smithery:
npx -y @smithery/cli install @mendableai/mcp-server-firecrawl --client claude
- FIRECRAWL_API_KEY: Your Firecrawl API key
- FIRECRAWL_API_URL (Optional): Custom API endpoint for self-hosted instances, e.g. https://firecrawl.your-domain.com
- FIRECRAWL_RETRY_MAX_ATTEMPTS: Maximum number of retry attempts (default: 3)
- FIRECRAWL_RETRY_INITIAL_DELAY: Initial delay in milliseconds before the first retry (default: 1000)
- FIRECRAWL_RETRY_MAX_DELAY: Maximum delay in milliseconds between retries (default: 10000)
- FIRECRAWL_RETRY_BACKOFF_FACTOR: Exponential backoff multiplier (default: 2)
- FIRECRAWL_CREDIT_WARNING_THRESHOLD: Credit usage warning threshold (default: 1000)
- FIRECRAWL_CREDIT_CRITICAL_THRESHOLD: Credit usage critical threshold (default: 100)

For cloud API usage with custom retry and credit monitoring:
# Required for cloud API
export FIRECRAWL_API_KEY=your-api-key
# Optional retry configuration
export FIRECRAWL_RETRY_MAX_ATTEMPTS=5 # Increase max retry attempts
export FIRECRAWL_RETRY_INITIAL_DELAY=2000 # Start with 2s delay
export FIRECRAWL_RETRY_MAX_DELAY=30000 # Maximum 30s delay
export FIRECRAWL_RETRY_BACKOFF_FACTOR=3 # More aggressive backoff
# Optional credit monitoring
export FIRECRAWL_CREDIT_WARNING_THRESHOLD=2000 # Warning at 2000 credits
export FIRECRAWL_CREDIT_CRITICAL_THRESHOLD=500 # Critical at 500 credits
For self-hosted instance:
# Required for self-hosted
export FIRECRAWL_API_URL=https://firecrawl.your-domain.com
# Optional authentication for self-hosted
export FIRECRAWL_API_KEY=your-api-key # If your instance requires auth
# Custom retry configuration
export FIRECRAWL_RETRY_MAX_ATTEMPTS=10
export FIRECRAWL_RETRY_INITIAL_DELAY=500 # Start with faster retries
Add this to your claude_desktop_config.json:
{
"mcpServers": {
"mcp-server-firecrawl": {
"command": "npx",
"args": ["-y", "firecrawl-mcp"],
"env": {
"FIRECRAWL_API_KEY": "YOUR_API_KEY_HERE",
"FIRECRAWL_RETRY_MAX_ATTEMPTS": "5",
"FIRECRAWL_RETRY_INITIAL_DELAY": "2000",
"FIRECRAWL_RETRY_MAX_DELAY": "30000",
"FIRECRAWL_RETRY_BACKOFF_FACTOR": "3",
"FIRECRAWL_CREDIT_WARNING_THRESHOLD": "2000",
"FIRECRAWL_CREDIT_CRITICAL_THRESHOLD": "500"
}
}
}
}
The server includes several configurable parameters that can be set via environment variables. Here are the default values if not configured:
const CONFIG = {
retry: {
maxAttempts: 3, // Number of retry attempts for rate-limited requests
initialDelay: 1000, // Initial delay before first retry (in milliseconds)
maxDelay: 10000, // Maximum delay between retries (in milliseconds)
backoffFactor: 2, // Multiplier for exponential backoff
},
credit: {
warningThreshold: 1000, // Warn when credit usage reaches this level
criticalThreshold: 100, // Critical alert when credit usage reaches this level
},
};
These configurations control:
- Retry behavior: requests that fail due to rate limits are automatically retried with exponential backoff. With the default settings, retries are attempted after roughly 1 s, 2 s, and then 4 s delays (each delay is capped at maxDelay).
- Credit usage monitoring: credit consumption is tracked and warnings are emitted when usage reaches the warning and critical thresholds.
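To make the backoff schedule concrete, here is a small illustrative Python sketch (not part of the server) that reproduces the delay calculation implied by the defaults above:

# Illustrative only: compute the retry delays implied by the default CONFIG values
def retry_delays(max_attempts=3, initial_delay=1.0, max_delay=10.0, backoff_factor=2):
    """Return the wait time in seconds before each retry attempt."""
    return [min(initial_delay * backoff_factor ** i, max_delay) for i in range(max_attempts)]

print(retry_delays())  # [1.0, 2.0, 4.0] -> 1 s, 2 s, then 4 s before giving up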
The server utilizes Firecrawl's built-in rate limiting and batch processing capabilities:
firecrawl_scrape: Scrape content from a single URL with advanced options.
{
"name": "firecrawl_scrape",
"arguments": {
"url": "https://example.com",
"formats": ["markdown"],
"onlyMainContent": true,
"waitFor": 1000,
"timeout": 30000,
"mobile": false,
"includeTags": ["article", "main"],
"excludeTags": ["nav", "footer"],
"skipTlsVerification": false
}
}
firecrawl_batch_scrape: Scrape multiple URLs efficiently with built-in rate limiting and parallel processing.
{
"name": "firecrawl_batch_scrape",
"arguments": {
"urls": ["https://example1.com", "https://example2.com"],
"options": {
"formats": ["markdown"],
"onlyMainContent": true
}
}
}
Response includes operation ID for status checking:
{
"content": [
{
"type": "text",
"text": "Batch operation queued with ID: batch_1. Use firecrawl_check_batch_status to check progress."
}
],
"isError": false
}
firecrawl_check_batch_status: Check the status of a batch operation.
{
"name": "firecrawl_check_batch_status",
"arguments": {
"id": "batch_1"
}
}
firecrawl_search: Search the web and optionally extract content from search results.
{
"name": "firecrawl_search",
"arguments": {
"query": "your search query",
"limit": 5,
"lang": "en",
"country": "us",
"scrapeOptions": {
"formats": ["markdown"],
"onlyMainContent": true
}
}
}
firecrawl_crawl: Start an asynchronous crawl with advanced options.
{
"name": "firecrawl_crawl",
"arguments": {
"url": "https://example.com",
"maxDepth": 2,
"limit": 100,
"allowExternalLinks": false,
"deduplicateSimilarURLs": true
}
}
firecrawl_extract: Extract structured information from web pages using LLM capabilities. Supports both cloud AI and self-hosted LLM extraction.
{
"name": "firecrawl_extract",
"arguments": {
"urls": ["https://example.com/page1", "https://example.com/page2"],
"prompt": "Extract product information including name, price, and description",
"systemPrompt": "You are a helpful assistant that extracts product information",
"schema": {
"type": "object",
"properties": {
"name": { "type": "string" },
"price": { "type": "number" },
"description": { "type": "string" }
},
"required": ["name", "price"]
},
"allowExternalLinks": false,
"enableWebSearch": false,
"includeSubdomains": false
}
}
Example response:
{
"content": [
{
"type": "text",
"text": {
"name": "Example Product",
"price": 99.99,
"description": "This is an example product description"
}
}
],
"isError": false
}
- urls: Array of URLs to extract information from
- prompt: Custom prompt for the LLM extraction
- systemPrompt: System prompt to guide the LLM
- schema: JSON schema for structured data extraction
- allowExternalLinks: Allow extraction from external links
- enableWebSearch: Enable web search for additional context
- includeSubdomains: Include subdomains in extraction

When using a self-hosted instance, the extraction will use your configured LLM. For the cloud API, it uses Firecrawl's managed LLM service.
firecrawl_deep_research: Conduct deep web research on a query using intelligent crawling, search, and LLM analysis.
{
"name": "firecrawl_deep_research",
"arguments": {
"query": "how does carbon capture technology work?",
"maxDepth": 3,
"timeLimit": 120,
"maxUrls": 50
}
}
Arguments: query (required), plus optional maxDepth, timeLimit, and maxUrls, as shown in the example above.
Returns: a final analysis of the query generated from the crawled and searched sources.
firecrawl_generate_llmstxt: Generate a standardized llms.txt (and optionally llms-full.txt) file for a given domain. This file defines how large language models should interact with the site.
{
"name": "firecrawl_generate_llmstxt",
"arguments": {
"url": "https://example.com",
"maxUrls": 20,
"showFullText": true
}
}
Arguments: url (required), plus optional maxUrls and showFullText, as shown in the example above.
Returns: the generated llms.txt contents (and llms-full.txt contents when showFullText is enabled).
The server includes comprehensive logging. Example log messages:
[INFO] Firecrawl MCP Server initialized successfully
[INFO] Starting scrape for URL: https://example.com
[INFO] Batch operation queued with ID: batch_1
[WARNING] Credit usage has reached warning threshold
[ERROR] Rate limit exceeded, retrying in 2s...
The server provides robust error handling:
Example error response:
{
"content": [
{
"type": "text",
"text": "Error: Rate limit exceeded. Retrying in 2 seconds..."
}
],
"isError": true
}
# Install dependencies
npm install
# Build
npm run build
# Run tests
npm test
MIT License - see LICENSE file for details
Dealing with failing web scrapers due to anti-bot protections or website changes? Meet Scrapling.
Scrapling is a high-performance, intelligent web scraping library for Python that automatically adapts to website changes while significantly outperforming popular alternatives. For both beginners and experts, Scrapling provides powerful features while maintaining simplicity.
>> from scrapling.defaults import Fetcher, AsyncFetcher, StealthyFetcher, PlayWrightFetcher
# Fetch websites' source under the radar!
>> page = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True)
>> print(page.status)
200
>> products = page.css('.product', auto_save=True) # Scrape data that survives website design changes!
>> # Later, if the website structure changes, pass `auto_match=True`
>> products = page.css('.product', auto_match=True) # and Scrapling still finds them!
- Make fast, stealthy HTTP requests with the Fetcher class.
- Fetch dynamic websites with the PlayWrightFetcher class through your real browser, Scrapling's stealth mode, Playwright's Chrome browser, or NSTbrowser's browserless!
- Bypass anti-bot protections with the StealthyFetcher and PlayWrightFetcher classes.

from scrapling.fetchers import Fetcher
fetcher = Fetcher(auto_match=False)
# Do http GET request to a web page and create an Adaptor instance
page = fetcher.get('https://quotes.toscrape.com/', stealthy_headers=True)
# Get all text content from all HTML tags in the page except `script` and `style` tags
page.get_all_text(ignore_tags=('script', 'style'))
# Get all quotes elements, any of these methods will return a list of strings directly (TextHandlers)
quotes = page.css('.quote .text::text') # CSS selector
quotes = page.xpath('//span[@class="text"]/text()') # XPath
quotes = page.css('.quote').css('.text::text') # Chained selectors
quotes = [element.text for element in page.css('.quote .text')] # Slower than bulk query above
# Get the first quote element
quote = page.css_first('.quote') # same as page.css('.quote').first or page.css('.quote')[0]
# Tired of selectors? Use find_all/find
# Get all 'div' HTML tags that one of its 'class' values is 'quote'
quotes = page.find_all('div', {'class': 'quote'})
# Same as
quotes = page.find_all('div', class_='quote')
quotes = page.find_all(['div'], class_='quote')
quotes = page.find_all(class_='quote') # and so on...
# Working with elements
quote.html_content # Get Inner HTML of this element
quote.prettify() # Prettified version of Inner HTML above
quote.attrib # Get that element's attributes
quote.path # DOM path to element (List of all ancestors from <html> tag till the element itself)
To keep it simple, all methods can be chained on top of each other!
Scrapling isn't just powerful - it's also blazing fast. Scrapling implements many best practices, design patterns, and numerous optimizations to save fractions of seconds. All of that while focusing exclusively on parsing HTML documents. Here are benchmarks comparing Scrapling to popular Python libraries in two tests.
# | Library | Time (ms) | vs Scrapling |
---|---|---|---|
1 | Scrapling | 5.44 | 1.0x |
2 | Parsel/Scrapy | 5.53 | 1.017x |
3 | Raw Lxml | 6.76 | 1.243x |
4 | PyQuery | 21.96 | 4.037x |
5 | Selectolax | 67.12 | 12.338x |
6 | BS4 with Lxml | 1307.03 | 240.263x |
7 | MechanicalSoup | 1322.64 | 243.132x |
8 | BS4 with html5lib | 3373.75 | 620.175x |
As you can see, Scrapling is on par with Parsel/Scrapy and slightly faster than raw lxml, the library both of them are built on top of; these are the closest results to Scrapling. PyQuery is also built on top of lxml, yet Scrapling is still about 4 times faster.
Library | Time (ms) | vs Scrapling |
---|---|---|
Scrapling | 2.51 | 1.0x |
AutoScraper | 11.41 | 4.546x |
Scrapling can find elements with more methods and it returns full element Adaptor objects, not just the text as AutoScraper does. So, to make this test fair, both libraries extract an element by its text, find similar elements, and then extract the text content of all of them. As you can see, Scrapling is still 4.5 times faster at the same task.
All benchmarks' results are an average of 100 runs. See our benchmarks.py for methodology and to run your comparisons.
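For reference, the Scrapling side of that comparison task looks roughly like the sketch below (illustrative only, using methods demonstrated later in this README):

from scrapling.fetchers import Fetcher

# Extract an element by its text, find the elements similar to it, then collect their text content
page = Fetcher().get('https://books.toscrape.com/index.html')
anchor = page.find_by_text('Tipping the Velvet')
titles = [element.text for element in anchor.find_similar(ignore_attributes=['title'])]
print(len(titles))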
Scrapling is a breeze to get started with; starting from version 0.2.9, it requires at least Python 3.9 to work.
pip3 install scrapling
Then run this command to install the browser dependencies needed by the Fetcher classes:
scrapling install
If you have any installation issues, please open an issue.
Fetchers are interfaces built on top of other libraries with added features. They make requests or fetch pages for you in a single-request fashion and then return an Adaptor object. This feature was introduced because, previously, the only option was to fetch the page however you wanted and then pass it manually to the Adaptor class to create an Adaptor instance and start playing around with the page.
You might be slightly confused by now so let me clear things up. All fetcher-type classes are imported in the same way
from scrapling.fetchers import Fetcher, StealthyFetcher, PlayWrightFetcher
All of them can take these initialization arguments: auto_match, huge_tree, keep_comments, keep_cdata, storage, and storage_args, which are the same ones you give to the Adaptor class.

If you don't want to pass arguments to the generated Adaptor object and want to use the default values, you can use this import instead for cleaner code:
from scrapling.defaults import Fetcher, AsyncFetcher, StealthyFetcher, PlayWrightFetcher
then use it right away without initializing like:
page = StealthyFetcher.fetch('https://example.com')
Also, the Response object returned from all fetchers is the same as the Adaptor object except that it has these added attributes: status, reason, cookies, headers, history, and request_headers. All cookies, headers, and request_headers are always of type dictionary.
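For example, a minimal sketch of reading those attributes after a fetch (using the quotes site from the earlier examples):

from scrapling.defaults import Fetcher

page = Fetcher.get('https://quotes.toscrape.com/')
print(page.status)           # e.g. 200
print(page.reason)           # e.g. 'OK'
print(page.cookies)          # response cookies as a dictionary
print(page.headers)          # response headers as a dictionary
print(page.request_headers)  # the headers that were actually sent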
[!NOTE] The auto_match argument is enabled by default, and it is the one you should care about the most, as you will see later.
This class is built on top of httpx with additional configuration options. Here you can make GET, POST, PUT, and DELETE requests.
For all methods, you have the stealthy_headers argument, which makes Fetcher create and use real browser headers and then set a referer header as if the request came from a Google search for this URL's domain. It's enabled by default. You can also set the number of retries for all methods with the retries argument, which makes httpx retry requests that fail for any reason. The default number of retries for all Fetcher methods is 3.
Note: all headers generated by the stealthy_headers argument can be overwritten by you through the headers argument.
You can route all traffic (HTTP and HTTPS) to a proxy for any of these methods in this format http://username:password@localhost:8030
>> page = Fetcher().get('https://httpbin.org/get', stealthy_headers=True, follow_redirects=True)
>> page = Fetcher().post('https://httpbin.org/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030')
>> page = Fetcher().put('https://httpbin.org/put', data={'key': 'value'})
>> page = Fetcher().delete('https://httpbin.org/delete')
For Async requests, you will just replace the import like below:
>> from scrapling.fetchers import AsyncFetcher
>> page = await AsyncFetcher().get('https://httpbin.org/get', stealthy_headers=True, follow_redirects=True)
>> page = await AsyncFetcher().post('https://httpbin.org/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030')
>> page = await AsyncFetcher().put('https://httpbin.org/put', data={'key': 'value'})
>> page = await AsyncFetcher().delete('https://httpbin.org/delete')
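A minimal way to actually run one of those awaitables from a plain script (standard asyncio, nothing Scrapling-specific):

import asyncio
from scrapling.fetchers import AsyncFetcher

async def main():
    # Same call as above, wrapped in a coroutine so it can be awaited
    page = await AsyncFetcher().get('https://httpbin.org/get', stealthy_headers=True)
    print(page.status)

asyncio.run(main())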
This class is built on top of Camoufox, bypassing most anti-bot protections by default. Scrapling adds extra layers of flavors and configurations to increase performance and undetectability even further.
>> page = StealthyFetcher().fetch('https://www.browserscan.net/bot-detection') # Running headless by default
>> page.status == 200
True
>> page = await StealthyFetcher().async_fetch('https://www.browserscan.net/bot-detection') # the async version of fetch
>> page.status == 200
True
Note: all requests made by this fetcher wait by default for all JS to be fully loaded and executed, so you don't have to :)
This list isn't final so expect a lot more additions and flexibility to be added in the next versions!
This class is built on top of Playwright and currently provides 4 main run options, which can be mixed as you want.
>> page = PlayWrightFetcher().fetch('https://www.google.com/search?q=%22Scrapling%22', disable_resources=True) # Vanilla Playwright option
>> page.css_first("#search a::attr(href)")
'https://github.com/D4Vinci/Scrapling'
>> page = await PlayWrightFetcher().async_fetch('https://www.google.com/search?q=%22Scrapling%22', disable_resources=True) # the async version of fetch
>> page.css_first("#search a::attr(href)")
'https://github.com/D4Vinci/Scrapling'
Note: all requests made by this fetcher wait by default for all JS to be fully loaded and executed, so you don't have to :)
Using this Fetcher class, you can make requests with:
1. Vanilla Playwright, without any modifications other than the ones you chose.
2. Stealthy Playwright, with the stealth mode I wrote for it. It's still a WIP, but it bypasses many online tests like Sannysoft's. Some of the things this fetcher's stealth mode does:
   - Patches the CDP runtime fingerprint.
   - Mimics some real browser properties by injecting several JS files and using custom options.
   - Uses custom flags on launch to hide Playwright even more and make it faster.
   - Generates real browser headers of the same type and same user OS, then appends them to the request's headers.
3. Real browsers, by passing the real_chrome argument or the CDP URL of your browser to be controlled by the Fetcher; most of the options can be enabled with it.
4. NSTBrowser's docker browserless option, by passing the CDP URL and enabling the nstbrowser_mode option.
Note: using the real_chrome argument requires that you have the Chrome browser installed on your device.
Add that to a lot of controlling/hiding options as you will see in the arguments list below.
This list isn't final so expect a lot more additions and flexibility to be added in the next versions!
>>> quote.tag
'div'
>>> quote.parent
<data='<div class="col-md-8"> <div class="quote...' parent='<div class="row"> <div class="col-md-8">...'>
>>> quote.parent.tag
'div'
>>> quote.children
[<data='<span class="text" itemprop="text">"The...' parent='<div class="quote" itemscope itemtype="h...'>,
<data='<span>by <small class="author" itemprop=...' parent='<div class="quote" itemscope itemtype="h...'>,
<data='<div class="tags"> Tags: <meta class="ke...' parent='<div class="quote" itemscope itemtype="h...'>]
>>> quote.siblings
[<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
...]
>>> quote.next # gets the next element, the same logic applies to `quote.previous`
<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>
>>> quote.children.css_first(".author::text")
'Albert Einstein'
>>> quote.has_class('quote')
True
# Generate new selectors for any element
>>> quote.generate_css_selector
'body > div > div:nth-of-type(2) > div > div'
# Test these selectors on your favorite browser or reuse them again in the library's methods!
>>> quote.generate_xpath_selector
'//body/div/div[2]/div/div'
If your case needs more than the element's parent, you can iterate over the whole ancestors' tree of any element like below
for ancestor in quote.iterancestors():
# do something with it...
You can search for a specific ancestor of an element that satisfies a function. All you need to do is pass a function that takes an Adaptor object as an argument and returns True if the condition is satisfied or False otherwise, like below:
>>> quote.find_ancestor(lambda ancestor: ancestor.has_class('row'))
<data='<div class="row"> <div class="col-md-8">...' parent='<div class="container"> <div class="row...'>
You can select elements by their text content in multiple ways, here's a full example on another website:
>>> page = Fetcher().get('https://books.toscrape.com/index.html')
>>> page.find_by_text('Tipping the Velvet') # Find the first element whose text fully matches this text
<data='<a href="catalogue/tipping-the-velvet_99...' parent='<h3><a href="catalogue/tipping-the-velve...'>
>>> page.urljoin(page.find_by_text('Tipping the Velvet').attrib['href']) # We use `page.urljoin` to return the full URL from the relative `href`
'https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html'
>>> page.find_by_text('Tipping the Velvet', first_match=False) # Get all matches if there are more
[<data='<a href="catalogue/tipping-the-velvet_99...' parent='<h3><a href="catalogue/tipping-the-velve...'>]
>>> page.find_by_regex(r'£[\d\.]+') # Get the first element whose text content matches my price regex
<data='<p class="price_color">£51.77</p>' parent='<div class="product_price"> <p class="pr...'>
>>> page.find_by_regex(r'£[\d\.]+', first_match=False) # Get all elements whose text content matches my price regex
[<data='<p class="price_color">£51.77</p>' parent='<div class="product_price"> <p class="pr...'>,
<data='<p class="price_color">£53.74</p>' parent='<div class="product_price"> <p class="pr...'>,
<data='<p class="price_color">£50.10</p>' parent='<div class="product_price"> <p class="pr...'>,
<data='<p class="price_color">£47.82</p>' parent='<div class="product_price"> <p class="pr...'>,
...]
Find all elements that are similar to the current element in location and attributes
# For this case, ignore the 'title' attribute while matching
>>> page.find_by_text('Tipping the Velvet').find_similar(ignore_attributes=['title'])
[<data='<a href="catalogue/a-light-in-the-attic_...' parent='<h3><a href="catalogue/a-light-in-the-at...'>,
<data='<a href="catalogue/soumission_998/index....' parent='<h3><a href="catalogue/soumission_998/in...'>,
<data='<a href="catalogue/sharp-objects_997/ind...' parent='<h3><a href="catalogue/sharp-objects_997...'>,
...]
# You will notice that the number of elements is 19 not 20 because the current element is not included.
>>> len(page.find_by_text('Tipping the Velvet').find_similar(ignore_attributes=['title']))
19
# Get the `href` attribute from all similar elements
>>> [element.attrib['href'] for element in page.find_by_text('Tipping the Velvet').find_similar(ignore_attributes=['title'])]
['catalogue/a-light-in-the-attic_1000/index.html',
'catalogue/soumission_998/index.html',
'catalogue/sharp-objects_997/index.html',
...]
To increase the complexity a little bit, let's say we want to get all books' data using that element as a starting point for some reason
>>> for product in page.find_by_text('Tipping the Velvet').parent.parent.find_similar():
print({
"name": product.css_first('h3 a::text'),
"price": product.css_first('.price_color').re_first(r'[\d\.]+'),
"stock": product.css('.availability::text')[-1].clean()
})
{'name': 'A Light in the ...', 'price': '51.77', 'stock': 'In stock'}
{'name': 'Soumission', 'price': '50.10', 'stock': 'In stock'}
{'name': 'Sharp Objects', 'price': '47.82', 'stock': 'In stock'}
...
The documentation will provide more advanced examples.
Let's say you are scraping a page with a structure like this:
<div class="container">
<section class="products">
<article class="product" id="p1">
<h3>Product 1</h3>
<p class="description">Description 1</p>
</article>
<article class="product" id="p2">
<h3>Product 2</h3>
<p class="description">Description 2</p>
</article>
</section>
</div>
And you want to scrape the first product, the one with the p1
ID. You will probably write a selector like this
page.css('#p1')
When website owners implement structural changes like
<div class="new-container">
<div class="product-wrapper">
<section class="products">
<article class="product new-class" data-id="p1">
<div class="product-info">
<h3>Product 1</h3>
<p class="new-description">Description 1</p>
</div>
</article>
<article class="product new-class" data-id="p2">
<div class="product-info">
<h3>Product 2</h3>
<p class="new-description">Description 2</p>
</div>
</article>
</section>
</div>
</div>
The selector will no longer function and your code needs maintenance. That's where Scrapling's auto-matching feature comes into play.
from scrapling.parser import Adaptor
# Before the change
page = Adaptor(page_source, url='example.com')
element = page.css('#p1', auto_save=True)
if not element: # One day website changes?
element = page.css('#p1', auto_match=True) # Scrapling still finds it!
# the rest of the code...
How does the auto-matching work? Check the FAQs section for that and other possible issues while auto-matching.
Let's use a real website as an example and use one of the fetchers to fetch its source. To do this, we need to find a website that will change its design/structure soon, take a copy of its source, then wait for the website to make the change. Of course, that's nearly impossible to know unless I know the website's owner, but that would make it a staged test haha.
To solve this issue, I will use The Web Archive's Wayback Machine. Here is a copy of StackOverflow's website in 2010, pretty old, huh? Let's test if the auto-match feature can extract the same button from the old 2010 design and the current design using the same selector :)
If I want to extract the Questions button from the old design I can use a selector like this #hmenus > div:nth-child(1) > ul > li:nth-child(1) > a
This selector is too specific because it was generated by Google Chrome. Now let's test the same selector in both versions
>> from scrapling.fetchers import Fetcher
>> selector = '#hmenus > div:nth-child(1) > ul > li:nth-child(1) > a'
>> old_url = "https://web.archive.org/web/20100102003420/http://stackoverflow.com/"
>> new_url = "https://stackoverflow.com/"
>>
>> page = Fetcher(automatch_domain='stackoverflow.com').get(old_url, timeout=30)
>> element1 = page.css_first(selector, auto_save=True)
>>
>> # Same selector but used in the updated website
>> page = Fetcher(automatch_domain="stackoverflow.com").get(new_url)
>> element2 = page.css_first(selector, auto_match=True)
>>
>> if element1.text == element2.text:
... print('Scrapling found the same element in the old design and the new design!')
'Scrapling found the same element in the old design and the new design!'
Note that I used a new argument called automatch_domain. This is because, for Scrapling, these are two different URLs, not the same website, so it isolates their data. To tell Scrapling they are the same website, we pass the domain we want to use for saving auto-match data for both of them, so Scrapling doesn't isolate them.
In a real-world scenario, the code will be the same except it will use the same URL for both requests, so you won't need to use the automatch_domain argument. This is the closest example I can give to real-world cases, so I hope it didn't confuse you :)
Notes:
1. For the two examples above, I used the Adaptor class one time and the Fetcher class the second time, just to show you that you can create the Adaptor object by yourself if you have the source, or fetch the source using any Fetcher class and have it create the Adaptor object for you.
2. Passing the auto_save argument while the auto_match argument is set to False when initializing the Adaptor/Fetcher object will only result in the auto_save argument value being ignored, along with the following warning message: Argument `auto_save` will be ignored because `auto_match` wasn't enabled on initialization. Check docs for more info. This behavior is purely for performance reasons, so the database gets created/connected only when you are planning to use the auto-matching features. The same applies to the auto_match argument.
3. The auto_match parameter works only for Adaptor instances, not Adaptors, so if you do something like page.css('body').css('#p1', auto_match=True) you will get an error, because you can't auto-match a whole list; you have to be specific and do something like page.css_first('body').css('#p1', auto_match=True).
Inspired by BeautifulSoup's find_all function, you can find elements with the find_all/find methods. Both methods can take multiple types of filters and return all elements in the pages that all these filters apply to.

The way it works is that, after collecting all passed arguments and keywords, each filter passes its results to the following filter in a waterfall-like filtering system. It filters all elements in the current page/element in the following order:

Note: The filtering process always starts from the first filter it finds in the filtering order above, so if no tag name(s) are passed but attributes are passed, the process starts from that layer, and so on. But the order in which you pass the arguments doesn't matter.
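Conceptually, the waterfall works by repeatedly narrowing a candidate list. A rough, library-agnostic sketch of the idea (this is not Scrapling's actual implementation):

# Illustrative only: each filter sees only what the previous filter kept
def waterfall(elements, filters):
    candidates = list(elements)
    for keep in filters:
        candidates = [el for el in candidates if keep(el)]
    return candidates

# e.g. filter by tag name, then by an attribute, then by an arbitrary condition
filters = [
    lambda el: el['tag'] == 'div',
    lambda el: el.get('class') == 'quote',
    lambda el: 'world' in el.get('text', ''),
]
print(waterfall([{'tag': 'div', 'class': 'quote', 'text': 'Hello world'}], filters))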
Examples to clear any confusion :)
>> from scrapling.fetchers import Fetcher
>> page = Fetcher().get('https://quotes.toscrape.com/')
# Find all elements with tag name `div`.
>> page.find_all('div')
[<data='<div class="container"> <div class="row...' parent='<body> <div class="container"> <div clas...'>,
<data='<div class="row header-box"> <div class=...' parent='<div class="container"> <div class="row...'>,
...]
# Find all div elements with a class that equals `quote`.
>> page.find_all('div', class_='quote')
[<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
...]
# Same as above.
>> page.find_all('div', {'class': 'quote'})
[<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
...]
# Find all elements with a class that equals `quote`.
>> page.find_all({'class': 'quote'})
[<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
...]
# Find all div elements with a class that equals `quote`, and contains the element `.text` which contains the word 'world' in its content.
>> page.find_all('div', {'class': 'quote'}, lambda e: "world" in e.css_first('.text::text'))
[<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>]
# Find all elements that have children.
>> page.find_all(lambda element: len(element.children) > 0)
[<data='<html lang="en"><head><meta charset="UTF...'>,
<data='<head><meta charset="UTF-8"><title>Quote...' parent='<html lang="en"><head><meta charset="UTF...'>,
<data='<body> <div class="container"> <div clas...' parent='<html lang="en"><head><meta charset="UTF...'>,
...]
# Find all elements that contain the word 'world' in their content.
>> page.find_all(lambda element: "world" in element.text)
[<data='<span class="text" itemprop="text">"The...' parent='<div class="quote" itemscope itemtype="h...'>,
<data='<a class="tag" href="/tag/world/page/1/"...' parent='<div class="tags"> Tags: <meta class="ke...'>]
>> import re
# Find all span elements whose text matches the given regex
>> page.find_all('span', re.compile(r'world'))
[<data='<span class="text" itemprop="text">"The...' parent='<div class="quote" itemscope itemtype="h...'>]
# Find all div and span elements with class 'quote' (No span elements like that so only div returned)
>> page.find_all(['div', 'span'], {'class': 'quote'})
[<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
...]
# Mix things up
>> page.find_all({'itemtype':"http://schema.org/CreativeWork"}, 'div').css('.author::text')
['Albert Einstein',
'J.K. Rowling',
...]
Here's what else you can do with Scrapling:
- Get the lxml.etree object itself of any element directly:
  >>> quote._root
  <Element div at 0x107f98870>
- Saving and retrieving elements manually to auto-match them outside of the css and xpath methods, but you have to set the identifier yourself.
  To save an element to the database:
  >>> element = page.find_by_text('Tipping the Velvet', first_match=True)
  >>> page.save(element, 'my_special_element')
  To retrieve it later and relocate it inside the page:
  >>> element_dict = page.retrieve('my_special_element')
  >>> page.relocate(element_dict, adaptor_type=True)
  [<data='<a href="catalogue/tipping-the-velvet_99...' parent='<h3><a href="catalogue/tipping-the-velve...'>]
  >>> page.relocate(element_dict, adaptor_type=True).css('::text')
  ['Tipping the Velvet']
  If you want to keep it as an lxml.etree object, leave out the adaptor_type argument:
  >>> page.relocate(element_dict)
  [<Element a at 0x105a2a7b0>]
Filtering results based on a function
# Find all products over $50
expensive_products = page.css('.product_pod').filter(
lambda p: float(p.css('.price_color').re_first(r'[\d\.]+')) > 50
)
# Find all the products with price '54.23'
page.css('.product_pod').search(
lambda p: float(p.css('.price_color').re_first(r'[\d\.]+')) == 54.23
)
Doing operations on element content is the same as in Scrapy:
quote.re(r'regex_pattern')        # Get all strings (TextHandlers) that match the regex pattern
quote.re_first(r'regex_pattern')  # Get the first string (TextHandler) only
quote.json()                      # If the content text is jsonable, convert it to json using `orjson`, which is 10x faster than the standard json library and provides more options
except that you can do more with them, like:
quote.re(
    r'regex_pattern',
    replace_entities=True,   # Character entity references are replaced by their corresponding character
    clean_match=True,        # This will ignore all whitespaces and consecutive spaces while matching
    case_sensitive=False,    # Set the regex to ignore letter case while compiling it
)
In fact, all of these methods are methods of the TextHandler that contains the text content, so the same can be done directly if you call the .text property or the equivalent selector function.
Doing operations on the text content itself includes:
- Cleaning the text: quote.clean()
- The regex and selector methods above return TextHandler objects too, so if you have, for example, a JS object assigned to a JS variable inside JS code and want to extract it with regex and then convert it to a json object, in other libraries this would take more than one line of code, but here you can do it in one line:
  page.xpath('//script/text()').re_first(r'var dataLayer = (.+);').json()
- Sorting all characters in the string as if it were a list and returning the new string: quote.sort(reverse=False)
To be clear, TextHandler is a sub-class of Python's str, so all normal operations/methods that work with Python strings will work with it.
Any element's attributes are not exactly a dictionary but a read-only sub-class of mapping called AttributesHandler, so it's faster, and the string values returned are actually TextHandler objects, so all the operations above can be done on them, along with standard dictionary operations that don't modify the data, and more :)
- Search attribute values by partial match:
  >>> for item in element.attrib.search_values('catalogue', partial=True):
          print(item)
  {'href': 'catalogue/tipping-the-velvet_999/index.html'}
- Get the attributes as a JSON string:
  >>> element.attrib.json_string
  b'{"href":"catalogue/tipping-the-velvet_999/index.html","title":"Tipping the Velvet"}'
- Convert them to a standard dictionary:
  >>> dict(element.attrib)
  {'href': 'catalogue/tipping-the-velvet_999/index.html', 'title': 'Tipping the Velvet'}
Scrapling is under active development so expect many more features coming soon :)
There are a lot of deep details skipped here to keep this as short as possible, so to take a deep dive, head to the docs section. I will try to keep it as updated as possible and add complex examples. There I will explain points like how to write your own storage system, write spiders that don't depend on selectors at all, and more...
Note that implementing your storage system can be complex as there are some strict rules such as inheriting from the same abstract class, following the singleton design pattern used in other classes, and more. So make sure to read the docs first.
[!IMPORTANT] A website is needed to provide detailed library documentation.
I'm trying to rush creating the website, researching new ideas, and adding more features/tests/benchmarks but time is tight with too many spinning plates between work, personal life, and working on Scrapling. I have been working on Scrapling for months for free after all.
If you like Scrapling and want it to keep improving, then this is a friendly reminder that you can help by supporting me through the sponsor button.
This section addresses common questions about Scrapling; please read it before opening an issue.
How does the auto-matching work? The element you want to auto-match must first be selected, with css or xpath, with the auto_save parameter set to True, before structural changes happen. Now, because everything about the element can be changed or removed, nothing from the element can be used as a unique identifier for the database. To solve this issue, I made the storage system rely on two things:
1. The URL you passed while initializing the Adaptor/Fetcher object (or the automatch_domain, if you set one).
2. The identifier parameter you passed to the method while selecting. If you didn't pass one, then the selector string itself will be used as an identifier, but remember you will have to use it as the identifier value later, when the structure changes and you want to pass the new selector.

Together, both are used to retrieve the element's unique properties from the database later. Then, when you enable the auto_match parameter for both the Adaptor instance and the method call, the element's properties are retrieved and Scrapling loops over all elements in the page, comparing each one's unique properties to the unique properties we already have for this element; a score is calculated for each one. Comparing elements is not exact but is more about finding how similar these values are, so everything is taken into consideration, even the values' order, like the order in which the element class names were written before and the order in which the same element class names are written now. The score for each element is stored in the table, and the element(s) with the highest combined similarity scores are returned.
If you didn't pass a URL while initializing the Adaptor object, it's not a big problem, as it depends on your usage. The word default will be used in place of the URL field while saving the element's unique properties. So this will only be an issue if you later use the same identifier for a different website for which you also didn't pass the URL parameter while initializing. The save process will overwrite the previous data, and auto-matching uses only the latest saved properties.
For each element, Scrapling will extract:
- The element's tag name, text, attributes (names and values), siblings (tag names only), and path (tag names only).
- The element's parent tag name, attributes (names and values), and text.
I passed the auto_save/auto_match parameter while selecting and it got completely ignored with a warning message. That's because passing the auto_save/auto_match argument without setting auto_match to True while initializing the Adaptor object will only result in the auto_save/auto_match argument value being ignored. This behavior is purely for performance reasons, so the database gets created only when you are planning to use the auto-matching features.
If auto-matching doesn't find your element, it could be for one of these reasons:
1. No data was saved/stored for this element before.
2. The selector passed is not the one used while storing the element data. The solution is simple:
   - Pass the old selector again as an identifier to the method called.
   - Retrieve the element with the retrieve method using the old selector as the identifier, then save it again with the save method and the new selector as the identifier.
   - Start using the identifier argument more often if you are planning to use a new selector from now on.
3. The website had some extreme structural changes, like a full redesign. If this happens a lot with a website, the solution would be to make your code as selector-free as possible, using Scrapling's features.
Pretty much, yeah; almost all features you get from BeautifulSoup can be found or achieved in Scrapling one way or another. In fact, if you find a feature in bs4 that is missing in Scrapling, please make a feature request from the issues tab to let me know.
Of course, you can find elements by text/regex, find similar elements in a more reliable way than AutoScraper, and finally save/retrieve elements manually to use later as the model feature in AutoScraper. I have pulled all top articles about AutoScraper from Google and tested Scrapling against examples in them. In all examples, Scrapling got the same results as AutoScraper in much less time.
Yes, Scrapling instances are thread-safe. Each Adaptor instance maintains its state.
Everybody is invited and welcome to contribute to Scrapling. There is a lot to do!
Please read the contributing file before doing anything.
[!CAUTION] This library is provided for educational and research purposes only. By using this library, you agree to comply with local and international laws regarding data scraping and privacy. The authors and contributors are not responsible for any misuse of this software. This library should not be used to violate the rights of others, for unethical purposes, or to use data in an unauthorized or illegal manner. Do not use it on any website unless you have permission from the website owner or are acting within their allowed rules, such as the robots.txt file.
This work is licensed under BSD-3
This project includes code adapted from: - Parsel (BSD License) - Used for translator submodule
An open-source, prototype implementation of property graphs for JavaScript based on the esprima parser, and the EsTree SpiderMonkey Spec. JAW can be used for analyzing the client-side of web applications and JavaScript-based programs.
This project is licensed under the GNU AFFERO GENERAL PUBLIC LICENSE V3.0. See here for more information.
JAW has a Github pages website available at https://soheilkhodayari.github.io/JAW/.
Release Notes:
- JAW v2 is available in the JAW-V2 branch.
- JAW v1 is available in the JAW-V1 branch.

The architecture of JAW is shown below.
JAW can be used in two distinct ways:

1. Arbitrary JavaScript Analysis: Utilize JAW for modeling and analyzing any JavaScript program by specifying the program's file system path.

2. Web Application Analysis: Analyze a web application by providing a single seed URL.
   - Use the collected web resources to create a Hybrid Program Graph (HPG), which will be imported into a Neo4j database.
   - Optionally, supply the HPG construction module with a mapping of semantic types to custom JavaScript language tokens, facilitating the categorization of JavaScript functions based on their purpose (e.g., HTTP request functions).
   - Query the constructed Neo4j graph database for various analyses. JAW offers utility traversals for data flow analysis, control flow analysis, reachability analysis, and pattern matching. These traversals can be used to develop custom security analyses. JAW also includes built-in traversals for detecting client-side CSRF, DOM Clobbering, and request hijacking vulnerabilities.
   - The outputs will be stored in the same folder as that of the input.
The installation script relies on the following prerequisites:
- Latest version of the npm package manager (Node.js)
- Any stable version of Python 3.x
- The Python pip package manager
Afterwards, install the necessary dependencies via:
$ ./install.sh
For detailed installation instructions, please see here.
You can run an instance of the pipeline in a background screen via:
$ python3 -m run_pipeline --conf=config.yaml
The CLI provides the following options:
$ python3 -m run_pipeline -h
usage: run_pipeline.py [-h] [--conf FILE] [--site SITE] [--list LIST] [--from FROM] [--to TO]
This script runs the tool pipeline.
optional arguments:
-h, --help show this help message and exit
--conf FILE, -C FILE pipeline configuration file. (default: config.yaml)
--site SITE, -S SITE website to test; overrides config file (default: None)
--list LIST, -L LIST site list to test; overrides config file (default: None)
--from FROM, -F FROM the first entry to consider when a site list is provided; overrides config file (default: -1)
--to TO, -T TO the last entry to consider when a site list is provided; overrides config file (default: -1)
Input Config: JAW expects a .yaml config file as input. See config.yaml for an example.

Hint: the config file specifies different passes (e.g., crawling, static analysis, etc.) which can be enabled or disabled for each vulnerability class. This allows running the tool's building blocks individually, or in a different order (e.g., crawl all webapps first, then conduct the security analysis).
For running a quick example demonstrating how to build a property graph and run Cypher queries over it, do:
$ python3 -m analyses.example.example_analysis --input=$(pwd)/data/test_program/test.js
This module collects the data (i.e., JavaScript code and state values of web pages) needed for testing. If you want to test a specific JavaScript file that you already have on your file system, you can skip this step.
JAW has crawlers based on Selenium (JAW-v1), Puppeteer (JAW-v2, v3) and Playwright (JAW-v3). For most up-to-date features, it is recommended to use the Puppeteer- or Playwright-based versions.
This web crawler employs foxhound, an instrumented version of Firefox, to perform dynamic taint tracking as it navigates through webpages. To start the crawler, do:
$ cd crawler
$ node crawler-taint.js --seedurl=https://google.com --maxurls=100 --headless=true --foxhoundpath=<optional-foxhound-executable-path>
The foxhoundpath is by default set to the following directory: crawler/foxhound/firefox, which contains a binary named firefox.
Note: you need a build of foxhound to use this version. An ubuntu build is included in the JAW-v3 release.
To start the crawler, do:
$ cd crawler
$ node crawler.js --seedurl=https://google.com --maxurls=100 --browser=chrome --headless=true
See here for more information.
To start the crawler, do:
$ cd crawler/hpg_crawler
$ vim docker-compose.yaml # set the websites you want to crawl here and save
$ docker-compose build
$ docker-compose up -d
Please refer to the documentation of the hpg_crawler here for more information.
To generate an HPG for a given (set of) JavaScript file(s), do:
$ node engine/cli.js --lang=js --graphid=graph1 --input=/in/file1.js --input=/in/file2.js --output=$(pwd)/data/out/ --mode=csv
optional arguments:
--lang: language of the input program
--graphid: an identifier for the generated HPG
--input: path of the input program(s)
--output: path of the output HPG, must be i
--mode: determines the output format (csv or graphML)
To import an HPG inside a neo4j graph database (docker instance), do:
$ python3 -m hpg_neo4j.hpg_import --rpath=<path-to-the-folder-of-the-csv-files> --id=<xyz> --nodes=<nodes.csv> --edges=<rels.csv>
$ python3 -m hpg_neo4j.hpg_import -h
usage: hpg_import.py [-h] [--rpath P] [--id I] [--nodes N] [--edges E]
This script imports a CSV of a property graph into a neo4j docker database.
optional arguments:
-h, --help show this help message and exit
--rpath P relative path to the folder containing the graph CSV files inside the `data` directory
--id I an identifier for the graph or docker container
--nodes N the name of the nodes csv file (default: nodes.csv)
--edges E the name of the relations csv file (default: rels.csv)
In order to create a hybrid property graph for the output of the hpg_crawler and import it inside a local neo4j instance, you can also do:
$ python3 -m engine.api <path> --js=<program.js> --import=<bool> --hybrid=<bool> --reqs=<requests.out> --evts=<events.out> --cookies=<cookies.pkl> --html=<html_snapshot.html>
Specification of Parameters:
- <path>: absolute path to the folder containing the program files for analysis (must be under the engine/outputs folder).
- --js=<program.js>: name of the JavaScript program for analysis (default: js_program.js).
- --import=<bool>: whether the constructed property graph should be imported to an active neo4j database (default: true).
- --hybrid=<bool>: whether the hybrid mode is enabled (default: false). This implies that the tester wants to enrich the property graph by inputting files for any of the HTML snapshot, fired events, HTTP requests, and cookies, as collected by the JAW crawler.
- --reqs=<requests.out>: for hybrid mode only, name of the file containing the sequence of observed network requests; pass the string false to exclude (default: request_logs_short.out).
- --evts=<events.out>: for hybrid mode only, name of the file containing the sequence of fired events; pass the string false to exclude (default: events.out).
- --cookies=<cookies.pkl>: for hybrid mode only, name of the file containing the cookies; pass the string false to exclude (default: cookies.pkl).
- --html=<html_snapshot.html>: for hybrid mode only, name of the file containing the DOM tree snapshot; pass the string false to exclude (default: html_rendered.html).

For more information, you can use the help CLI provided with the graph construction API:
$ python3 -m engine.api -h
The constructed HPG can then be queried using Cypher or the NeoModel ORM.
You should place and run your queries in analyses/<ANALYSIS_NAME>.
You can use the NeoModel ORM to query the HPG. To write a query:
- Create a Python file for your query, e.g., example_query_orm.py, in the analyses/example folder.
- Run it with:
$ python3 -m analyses.example.example_query_orm
For more information, please see here.
You can use Cypher to write custom queries. For this:
- Create a Python file for your query, e.g., example_query_cypher.py, in the analyses/example folder.
- Run it with:
$ python3 -m analyses.example.example_query_cypher
For more information, please see here.
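As a starting point, a bare-bones query file could look like the sketch below (illustrative only; it uses the official neo4j Python driver directly rather than JAW's helpers, and the connection URI and credentials are placeholders for your local docker instance):

from neo4j import GraphDatabase

# Placeholder connection details for a local Neo4j instance
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Count the nodes of the imported HPG as a quick sanity check
    result = session.run("MATCH (n) RETURN count(n) AS total")
    print(result.single()["total"])

driver.close()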
This section describes how to configure and use JAW for vulnerability detection, and how to interpret the output. JAW contains, among others, self-contained queries for detecting client-side CSRF and DOM Clobbering.
Step 1. enable the analysis component for the vulnerability class in the input config.yaml file:
request_hijacking:
enabled: true
# [...]
#
domclobbering:
enabled: false
# [...]
cs_csrf:
enabled: false
# [...]
Step 2. Run an instance of the pipeline with:
$ python3 -m run_pipeline --conf=config.yaml
Hint. You can run multiple instances of the pipeline under different screens:
$ screen -dmS s1 bash -c 'python3 -m run_pipeline --conf=conf1.yaml; exec sh'
$ screen -dmS s2 bash -c 'python3 -m run_pipeline --conf=conf2.yaml; exec sh'
$ # [...]
To generate parallel configuration files automatically, you may use the generate_config.py script.
The outputs will be stored in a file called sink.flows.out in the same folder as that of the input. For client-side CSRF, for example, for each HTTP request detected, JAW outputs an entry marking the set of semantic types (a.k.a. semantic tags or labels) associated with the elements constructing the request (i.e., the program slices). For example, an HTTP request marked with the semantic type ['WIN.LOC'] is forgeable through the window.location injection point. However, a request marked with ['NON-REACH'] is not forgeable.
An example output entry is shown below:
[*] Tags: ['WIN.LOC']
[*] NodeId: {'TopExpression': '86', 'CallExpression': '87', 'Argument': '94'}
[*] Location: 29
[*] Function: ajax
[*] Template: ajaxloc + "/bearer1234/"
[*] Top Expression: $.ajax({ xhrFields: { withCredentials: "true" }, url: ajaxloc + "/bearer1234/" })
1:['WIN.LOC'] variable=ajaxloc
0 (loc:6)- var ajaxloc = window.location.href
This entry shows that on line 29 there is a $.ajax call expression, and this call expression triggers an ajax request with the URL template value of ajaxloc + "/bearer1234/", where the parameter ajaxloc is a program slice reading its value at line 6 from window.location.href, and is thus forgeable through ['WIN.LOC'].
In order to streamline the testing process for JAW and ensure that your setup is accurate, we provide a simple node.js web application which you can test JAW with.
First, install the dependencies via:
$ cd tests/test-webapp
$ npm install
Then, run the application in a new screen:
$ screen -dmS jawwebapp bash -c 'PORT=6789 npm run devstart; exec sh'
For more information, visit our wiki page here. Below is a table of contents for quick access.
Pull requests are always welcome. This project is intended to be a safe, welcoming space, and contributors are expected to adhere to the contributor code of conduct.
If you use the JAW for academic research, we encourage you to cite the following paper:
@inproceedings{JAW,
title = {JAW: Studying Client-side CSRF with Hybrid Property Graphs and Declarative Traversals},
author= {Soheil Khodayari and Giancarlo Pellegrino},
booktitle = {30th {USENIX} Security Symposium ({USENIX} Security 21)},
year = {2021},
address = {Vancouver, B.C.},
publisher = {{USENIX} Association},
}
JAW has come a long way and we want to give our contributors a well-deserved shoutout here!
@tmbrbr, @c01gide, @jndre, and Sepehr Mirzaei.
New bug bounty (vulnerabilities) collector
# python3 main.py
*2024-02-20 16:14:47.836189*
1. Arbitrary File Reading due to Lack of Input Filepath Validation
- Feb 6th 2024 / High (CVE-2024-0964)
- gradio-app/gradio
- https://huntr.com/bounties/25e25501-5918-429c-8541-88832dfd3741/
2. View Barcode Image leads to Remote Code Execution
- Jan 31st 2024 / Critical (CVE: Not yet)
- dolibarr/dolibarr
- https://huntr.com/bounties/f0ffd01e-8054-4e43-96f7-a0d2e652ac7e/
(delimiter-based file database)
# vim feeds.db
1|2024-02-20 16:17:40.393240|7fe14fd58ca2582d66539b2fe178eeaed3524342|CVE-2024-0964|https://huntr.com/bounties/25e25501-5918-429c-8541-88832dfd3741/
2|2024-02-20 16:17:40.393987|c6b84ac808e7f229a4c8f9fbd073b4c0727e07e1|CVE: Not yet|https://huntr.com/bounties/f0ffd01e-8054-4e43-96f7-a0d2e652ac7e/
3|2024-02-20 16:17:40.394582|7fead9658843919219a3b30b8249700d968d0cc9|CVE: Not yet|https://huntr.com/bounties/d6cb06dc-5d10-4197-8f89-847c3203d953/
4|2024-02-20 16:17:40.395094|81fecdd74318ce7da9bc29e81198e62f3225bd44|CVE: Not yet|https://huntr.com/bounties/d875d1a2-7205-4b2b-93cf-439fa4c4f961/
5|2024-02-20 16:17:40.395613|111045c8f1a7926174243db403614d4a58dc72ed|CVE: Not yet|https://huntr.com/bounties/10e423cd-7051-43fd-b736-4e18650d0172/
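Since feeds.db is just a pipe-delimited text file, it can be read back with a few lines of standard Python (an illustrative sketch, not part of the collector itself):

# Illustrative only: parse the pipe-delimited feeds.db shown above
with open("feeds.db", encoding="utf-8") as fh:
    for line in fh:
        idx, collected_at, entry_hash, cve, url = line.rstrip("\n").split("|")
        print(cve, url)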
PhantomCrawler allows users to simulate website interactions through different proxy IP addresses. It leverages Python, requests, and BeautifulSoup to offer a simple and effective way to test website behaviour under varied proxy configurations.
Features:
Usage:
Add your proxies to proxies.txt in this format: 50.168.163.176:80
How to Use:
git clone https://github.com/spyboy-productions/PhantomCrawler.git
pip3 install -r requirements.txt
python3 PhantomCrawler.py
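For reference, this is roughly what a single proxied request looks like with the requests library that PhantomCrawler builds on (a standalone sketch; the target URL is a placeholder and this is not the tool's own code):

import requests

# Illustrative only: send one request through the first proxy listed in proxies.txt
with open("proxies.txt", encoding="utf-8") as fh:
    proxy = fh.readline().strip()  # e.g. 50.168.163.176:80

proxies = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
response = requests.get("https://example.com", proxies=proxies, timeout=10)
print(response.status_code)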
Disclaimer: PhantomCrawler is intended for educational and testing purposes only. Users are cautioned against any misuse, including potential DDoS activities. Always ensure compliance with the terms of service of websites being tested and adhere to ethical standards.
Crawlector (the name Crawlector is a combination of Crawler & Detector) is a threat hunting framework designed for scanning websites for malicious objects.
Note-1: The framework was first presented at the No Hat conference in Bergamo, Italy on October 22nd, 2022 (Slides, YouTube Recording). Also, it was presented for the second time at the AVAR conference, in Singapore, on December 2nd, 2022.
Note-2: The accompanying tool EKFiddle2Yara (is a tool that takes EKFiddle rules and converts them into Yara rules) mentioned in the talk, was also released at both conferences.
This is for checking for malicious urls against every page being scanned. The framework could either query the list of malicious URLs from URLHaus server (configuration: url_list_web), or from a file on disk (configuration: url_list_file), and if the latter is specified, then, it takes precedence over the former.
It works by searching the content of every page against all URL entries in url_list_web or url_list_file, checking for all occurrences. Additionally, upon a match, and if the configuration option check_url_api is set to true, Crawlector will send a POST request to the API URL set in the url_api configuration option, which returns a JSON object with extra information about a matching URL. Such information includes urlh_status (ex., online, offline, unknown), urlh_threat (ex., malware_download), urlh_tags (ex., elf, Mozi), and urlh_reference (ex., https://urlhaus.abuse.ch/url/1116455/). This information will be included in the log file cl_mlog_<current_date><current_time><(pm|am)>.csv (check below), only if check_url_api is set to true. Otherwise, the log file will include the columns urlh_url (list o f matching malicious URLs) and urlh_hit (number of occurrences for every matching malicious URL), conditional on whether check_url is set to true.
The URLHaus feature can be disabled in its entirety by setting the configuration option check_url to false.
It is important to note that this feature could slow scanning, considering the huge number of malicious URLs (~130 million entries at the time of this writing) that need to be checked, and the time it takes to get extra information from the URLHaus server (if the option check_url_api is set to true).
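To make that flow concrete, here is a minimal illustrative sketch (not Crawlector's code): it counts occurrences of known-bad URLs in a page, mirroring the urlh_url/urlh_hit columns, and enriches a hit roughly the way check_url_api is described to. The endpoint and JSON field names are assumptions based on the public abuse.ch URLHaus API and may differ from whatever url_api is configured to point at.

```python
import requests

def find_malicious_urls(page_content, url_list):
    """Count occurrences of each known-malicious URL inside the page content."""
    hits = {}
    for bad_url in url_list:
        count = page_content.count(bad_url)
        if count:
            hits[bad_url] = count  # analogous to the urlh_url / urlh_hit columns
    return hits

def enrich_match(bad_url):
    """Query the (assumed) URLHaus lookup API for extra context on a match."""
    response = requests.post(
        "https://urlhaus-api.abuse.ch/v1/url/",  # assumed public endpoint
        data={"url": bad_url},
        timeout=15,
    )
    info = response.json()
    return {
        "urlh_status": info.get("url_status"),
        "urlh_threat": info.get("threat"),
        "urlh_tags": info.get("tags"),
        "urlh_reference": info.get("urlhaus_reference"),
    }
```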
It is very important that you familiarize yourself with the configuration file cl_config.ini before running any session. All of the sections and parameters are documented in the configuration file itself.
The Yara offline scanning feature is a standalone option: if enabled, Crawlector executes only this feature, irrespective of other enabled features. The same is true for the feature that crawls for domains'/sites' digital certificates. Either way, it is recommended that you disable all unused features in the configuration file.
With either logging option (log_to_file or log_to_cons), if a Yara rule references only a module's attributes (e.g., PE, ELF, Hash), then Crawlector displays only the rule's name upon a match, excluding offset and length data.
To visit/scan a website, the list of URLs must be stored in text files in the directory "cl_sites".
Crawlector accepts three types of URLs:
[a-zA-Z0-9_-]{1,128} = <url>
<id>[depth:<0|1>-><\d+>,total:<\d+>,sleep:<\d+>] = <url>
For example,
mfmokbel[depth:1->3,total:10,sleep:0] = https://www.mfmokbel.com
which is equivalent to: mfmokbel[d:1->3,t:10,s:0] = https://www.mfmokbel.com
where <id> := [a-zA-Z0-9_-]{1,128}
The keywords depth, total and sleep can also be replaced with their shortened versions d, t and s, respectively.
With the settings in the example above, the spider would visit up to 40 (10 + (10*3)) URLs.
Note 1: A Type 3 URL can be turned into a Type 1 URL by setting the configuration parameter live_crawler to false, in the spider section of the configuration file.
Note 2: Empty lines and lines that start with ";" or "//" are ignored.
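To make the entry format concrete, below is a small hypothetical parser for such lines; it accepts both the long (depth/total/sleep) and short (d/t/s) keywords and skips comment lines per Note 2, but it is not Crawlector's own parser.

```python
import re

ENTRY_RE = re.compile(
    r"^(?P<id>[a-zA-Z0-9_-]{1,128})"    # <id>
    r"(?:\[(?P<params>[^\]]*)\])?"      # optional [depth:...,total:...,sleep:...]
    r"\s*=\s*(?P<url>\S+)$"
)
ALIASES = {"d": "depth", "t": "total", "s": "sleep"}

def parse_site_line(line):
    """Parse one cl_sites line; returns None for blank and comment lines."""
    line = line.strip()
    if not line or line.startswith(";") or line.startswith("//"):
        return None
    match = ENTRY_RE.match(line)
    if not match:
        raise ValueError(f"unrecognized entry: {line}")
    params = {}
    if match.group("params"):
        for item in match.group("params").split(","):
            key, _, value = item.partition(":")
            params[ALIASES.get(key.strip(), key.strip())] = value.strip()
    return {"id": match.group("id"), "url": match.group("url"), **params}

# parse_site_line("mfmokbel[d:1->3,t:10,s:0] = https://www.mfmokbel.com")
# -> {'id': 'mfmokbel', 'url': 'https://www.mfmokbel.com',
#     'depth': '1->3', 'total': '10', 'sleep': '0'}
```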
The spider functionality is what gives Crawlector the capability to find additional links on the targeted page. The Spider supports the following features:
- URLs must be of Type 3 for the Spider functionality to work.
- Specific URLs can be excluded via the exclude_url config option. For example, *.zip|*.exe|*.rar|*.zip|*.7z|*.pdf|.*bat|*.db
- Crawling can be limited to matching URLs via the include_url config option. For example, */checkout/*|*/products/*
- HTTPS links can be excluded via the exclude_https config option.
- External links can be added to the crawl via add_ext_links. This feature honours the exclude_url and include_url config options.
- The crawl can be restricted to external links only via ext_links_only. This feature honours the exclude_url and include_url config options.
site_ranking in the configuration file provides some options to alter how the CSV file is to be read.
The site section provides the capability to expand on a given site by attempting to find all available top-level domains (TLDs) and/or subdomains for the same domain. If found, new TLDs/subdomains are checked like any other domain:
- The API key for this feature is set via rapid_api_key in the configuration file.
- With find_tlds enabled, in addition to the Omnisint Labs API TLD results, the framework attempts to find other active/registered domains by going through every TLD entry in either the tlds_file or the tlds_url.
- If tlds_url is set, it should point to a URL that hosts TLDs, each one on a new line (lines that start with ';', '#' or '//' are ignored).
- tlds_file holds the filename that contains the list of TLDs (same format as for tlds_url; only the TLD is present, excluding the '.', e.g., "com", "org").
- If tlds_file is set, it takes precedence over tlds_url.
- tld_dl_time_out sets the maximum timeout for the dnslookup function when checking whether the domain in question resolves.
- tld_use_connect enables connecting to the domain in question over a list of ports, defined in the option tlds_connect_ports.
- tlds_connect_ports accepts a comma-separated list of ports, or a list of ranges such as 25-40,90-100,80,443,8443 (range start and end are inclusive).
- tld_con_time_out sets the maximum timeout for the connect function.
- tld_con_use_ssl enables/disables the use of SSL when attempting to connect to the domain.
- If save_to_file_subd is set to true, discovered subdomains will be saved to "\expanded\exp_subdomain_<pm|am>.txt".
- If save_to_file_tld is set to true, discovered domains will be saved to "\expanded\exp_tld_<pm|am>.txt".
- If exit_here is set to true, Crawlector bails out after executing this [site] function, irrespective of other enabled options; found sites won't be crawled/spidered.
- Only sites defined in the cl_sites directory are allowed.
Open for pull requests and issues. Comments and suggestions are greatly appreciated.
Mohamad Mokbel (@MFMokbel)
Simple Latest CVE Collector Written in Python
This collector uses a search query on https://www.cvedetails.com to collect information on vulnerabilities with a severity score of 6 or higher.
The minimum severity score can be adjusted via the cvss_min_score variable.
Newly collected entries can be delivered via a webhook.
Run it periodically with crontab or a similar scheduler.
# python3 main.py
*2023-10-10 11:05:33.370262*
1. CVE-2023-44832 / CVSS: 7.5 (HIGH)
- Published: 2023-10-05 16:15:12
- Updated: 2023-10-07 03:15:47
- CWE: CWE-120 Buffer Copy without Checking Size of Input ('Classic Buffer Overflow')
D-Link DIR-823G A1V1.0.2B05 was discovered to contain a buffer overflow via the MacAddress parameter in the SetWanSettings function. Th...
>> https://www.cve.org/CVERecord?id=CVE-2023-44832
- Ref.
(1) https://www.dlink.com/en/security-bulletin/
(2) https://github.com/bugfinder0/public_bug/tree/main/dlink/dir823g/SetWanSettings_MacAddress
2. CVE-2023-44831 / CVSS: 7.5 (HIGH)
- Published: 2023-10-05 16:15:12
- Updated: 2023-10-07 03:16:56
- CWE: CWE-120 Buffer Copy without Checking Size of Input ('Classic Buffer Overflow')
D-Link DIR-823G A1V1.0.2B05 was discovered to contain a buffer overflow via the Type parameter in the SetWLanRadioSettings function. Th...
>> https://www.cve.org/CVERecord?id=CVE-2023-44831
- Ref.
(1) https://www.dlink.com/en/security-bulletin/
(2) https://github.com/bugfinder0/public_bug/tree/main/dlink/dir823g/SetWLanRadioSettings_Type
(delimiter-based file database)
# vim feeds.db
1|2023-10-10 09:24:21.496744|0d239fa87be656389c035db1c3f5ec6ca3ec7448|CVE-2023-45613|2023-10-09 11:15:11|6.8|MEDIUM|CWE-295 Improper Certificate Validation
2|2023-10-10 09:24:27.073851|30ebff007cca946a16e5140adef5a9d5db11eee8|CVE-2023-45612|2023-10-09 11:15:11|8.6|HIGH|CWE-611 Improper Restriction of XML External Entity Reference
3|2023-10-10 09:24:32.650234|815b51259333ed88193fb3beb62c9176e07e4bd8|CVE-2023-45303|2023-10-06 19:15:13|8.4|HIGH|Not found CWE ids for CVE-2023-45303
4|2023-10-10 09:24:38.369632|39f98184087b8998547bba41c0ccf2f3ad61f527|CVE-2023-45248|2023-10-09 12:15:10|6.6|MEDIUM|CWE-427 Uncontrolled Search Path Element
5|2023-10-10 09:24:43.936863|60083d8626b0b1a59ef6fa16caec2b4fd1f7a6d7|CVE-2023-45247|2023-10-09 12:15:10|7.1|HIGH|CWE-862 Missing Authorization
6|2023-10-10 09:24:49.472179|82611add9de44e5807b8f8324bdfb065f6d4177a|CVE-2023-45246|2023-10-06 11:15:11|7.1|HIGH|CWE-287 Improper Authentication
7|2023-10-10 09:24:55.049191|b78014cd7ca54988265b19d51d90ef935d2362cf|CVE-2023-45244|2023-10-06 10:15:18|7.1|HIGH|CWE-862 Missing Authorization
The methods for collecting CVE (Common Vulnerabilities and Exposures) information are divided into different stages, and they primarily fall into two categories:
(1) Method for retrieving CVE information after vulnerability analysis and risk assessment have been completed.
(2) Method for retrieving CVE information at the stage when it is included as a vulnerability.
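As an illustration of the threshold-and-webhook flow described above, here is a rough hypothetical sketch; the webhook URL, payload shape and field names are placeholders rather than the collector's actual code.

```python
import requests

CVSS_MIN_SCORE = 6.0  # analogous to the cvss_min_score variable
WEBHOOK_URL = "https://hooks.example.com/notify"  # placeholder endpoint

def notify_new_cves(cves):
    """Post a short summary of every CVE at or above the severity threshold."""
    for cve in cves:
        if cve["cvss"] < CVSS_MIN_SCORE:
            continue  # below threshold, skip
        summary = (
            f"{cve['id']} / CVSS: {cve['cvss']}\n"
            f"{cve['description'][:120]}...\n"
            f">> https://www.cve.org/CVERecord?id={cve['id']}"
        )
        requests.post(WEBHOOK_URL, json={"content": summary}, timeout=10)

# notify_new_cves([{"id": "CVE-2023-44832", "cvss": 7.5,
#                   "description": "D-Link DIR-823G ... buffer overflow ..."}])
```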
Pinkerton is a Python tool created to crawl JavaScript files and search for secrets.
A quick guide on how to install and use Pinkerton.
1. Clone the repository with: git clone https://github.com/oppsec/pinkerton.git
2. Install the libraries with: pip3 install -r requirements.txt
3. Run Pinkerton with: python3 main.py -u https://example.com
If you want to use Pinkerton in a Docker container, follow these commands:
1. Clone the repository - git clone https://github.com/oppsec/pinkerton.git
2. Build the image - sudo docker build -t pinkerton:latest .
3. Run container - sudo docker run pinkerton:latest
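Conceptually, the crawl-and-search flow Pinkerton performs can be sketched as follows; the secret patterns and helper names are illustrative assumptions, not the project's actual rule set.

```python
import re
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# Two illustrative secret patterns; real scanners ship far larger rule sets.
SECRET_PATTERNS = {
    "AWS access key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "Google API key": re.compile(r"AIza[0-9A-Za-z_\-]{35}"),
}

def collect_script_urls(page_url):
    """Return absolute URLs of every external JavaScript file on the page."""
    soup = BeautifulSoup(requests.get(page_url, timeout=10).text, "html.parser")
    return [urljoin(page_url, tag["src"]) for tag in soup.find_all("script", src=True)]

def scan_for_secrets(page_url):
    """Download each JavaScript file and report pattern matches."""
    findings = []
    for js_url in collect_script_urls(page_url):
        body = requests.get(js_url, timeout=10).text
        for label, pattern in SECRET_PATTERNS.items():
            for match in pattern.findall(body):
                findings.append((js_url, label, match))
    return findings

# scan_for_secrets("https://example.com")
```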
A quick guide on how to contribute to the project.
1. Create a fork from Pinkerton repository
2. Clone the repository with git clone https://github.com/your/pinkerton.git
3. Type cd pinkerton/
4. Create a branch and make your changes
5. Commit and make a git push
6. Open a pull request
xcrawl3r is a command-line interface (CLI) utility to recursively crawl webpages, i.e. systematically browse webpages' URLs and follow links to discover linked webpages' URLs.
Parses URLs from files (.js, .json, .xml, .csv, .txt & .map).
Parses URLs from robots.txt.
Visit the releases page and find the appropriate archive for your operating system and architecture. Download the archive from your browser or copy its URL and retrieve it with wget or curl:
...with wget:
wget https://github.com/hueristiq/xcrawl3r/releases/download/v<version>/xcrawl3r-<version>-linux-amd64.tar.gz
...or, with curl:
curl -OL https://github.com/hueristiq/xcrawl3r/releases/download/v<version>/xcrawl3r-<version>-linux-amd64.tar.gz
...then, extract the binary:
tar xf xcrawl3r-<version>-linux-amd64.tar.gz
TIP: The above steps, download and extract, can be combined into a single step with this one-liner:
curl -sL https://github.com/hueristiq/xcrawl3r/releases/download/v<version>/xcrawl3r-<version>-linux-amd64.tar.gz | tar -xzv
NOTE: On Windows systems, you should be able to double-click the zip archive to extract the xcrawl3r executable.
...move the xcrawl3r binary to somewhere in your PATH. For example, on GNU/Linux and OS X systems:
sudo mv xcrawl3r /usr/local/bin/
NOTE: Windows users can follow How to: Add Tool Locations to the PATH Environment Variable in order to add xcrawl3r to their PATH.
Before you install from source, you need to make sure that Go is installed on your system. You can install Go by following the official instructions for your operating system. For this, we will assume that Go is already installed.
go install ...
go install -v github.com/hueristiq/xcrawl3r/cmd/xcrawl3r@latest
go build ... the development version
Clone the repository
git clone https://github.com/hueristiq/xcrawl3r.git
Build the utility
cd xcrawl3r/cmd/xcrawl3r && \
go build .
Move the xcrawl3r binary to somewhere in your PATH. For example, on GNU/Linux and OS X systems:
sudo mv xcrawl3r /usr/local/bin/
NOTE: Windows users can follow How to: Add Tool Locations to the PATH Environment Variable in order to add xcrawl3r to their PATH.
NOTE: While the development version is a good way to take a peek at xcrawl3r's latest features before they get released, be aware that it may have bugs. Officially released versions will generally be more stable.
To display the help message for xcrawl3r, use the -h flag:
xcrawl3r -h
help message:
_ _____
__ _____ _ __ __ ___ _| |___ / _ __
\ \/ / __| '__/ _` \ \ /\ / / | |_ \| '__|
> < (__| | | (_| |\ V V /| |___) | |
/_/\_\___|_| \__,_| \_/\_/ |_|____/|_| v0.1.0
A CLI utility to recursively crawl webpages.
USAGE:
xcrawl3r [OPTIONS]
INPUT:
-d, --domain string domain to match URLs
--include-subdomains bool match subdomains' URLs
-s, --seeds string seed URLs file (use `-` to get from stdin)
-u, --url string URL to crawl
CONFIGURATION:
--depth int maximum depth to crawl (default 3)
TIP: set it to `0` for infinite recursion
--headless bool If true the browser will be displayed while crawling.
-H, --headers string[] custom header to include in requests
e.g. -H 'Referer: http://example.com/'
TIP: use multiple flag to set multiple headers
--proxy string[] Proxy URL (e.g: http://127.0.0.1:8080)
TIP: use multiple flag to set multiple proxies
--render bool utilize a headless chrome instance to render pages
--timeout int time to wait for request in seconds (default: 10)
--user-agent string User Agent to use (default: web)
TIP: use `web` for a random web user-agent,
`mobile` for a random mobile user-agent,
or you can set your specific user-agent.
RATE LIMIT:
-c, --concurrency int number of concurrent fetchers to use (default 10)
--delay int delay between each request in seconds
--max-random-delay int maximum extra randomized delay added to `--delay` (default: 1s)
-p, --parallelism int number of concurrent URLs to process (default: 10)
OUTPUT:
--debug bool enable debug mode (default: false)
-m, --monochrome bool coloring: no colored output mode
-o, --output string output file to write found URLs
-v, --verbosity string debug, info, warning, error, fatal or silent (default: debug)
Issues and Pull Requests are welcome! Check out the contribution guidelines.
This utility is distributed under the MIT license.
Alternatives - Check out the projects below that may fit in your workflow: