An open-source, prototype implementation of property graphs for JavaScript based on the esprima parser, and the EsTree SpiderMonkey Spec. JAW can be used for analyzing the client-side of web applications and JavaScript-based programs.
This project is licensed under GNU AFFERO GENERAL PUBLIC LICENSE V3.0
. See here for more information.
JAW has a Github pages website available at https://soheilkhodayari.github.io/JAW/.
Release Notes:
JAW-V2
branch.JAW-V1
branch.The architecture of the JAW is shown below.
JAW can be used in two distinct ways:
Arbitrary JavaScript Analysis: Utilize JAW for modeling and analyzing any JavaScript program by specifying the program's file system path
.
Web Application Analysis: Analyze a web application by providing a single seed URL.
Use the collected web resources to create a Hybrid Program Graph (HPG), which will be imported into a Neo4j database.
Optionally, supply the HPG construction module with a mapping of semantic types to custom JavaScript language tokens, facilitating the categorization of JavaScript functions based on their purpose (e.g., HTTP request functions).
Query the constructed Neo4j
graph database for various analyses. JAW offers utility traversals for data flow analysis, control flow analysis, reachability analysis, and pattern matching. These traversals can be used to develop custom security analyses.
JAW also includes built-in traversals for detecting client-side CSRF, DOM Clobbering and request hijacking vulnerabilities.
The outputs will be stored in the same folder as that of input.
The installation script relies on the following prerequisites: - Latest version of npm package manager
(node js) - Any stable version of python 3.x
- Python pip
package manager
Afterwards, install the necessary dependencies via:
$ ./install.sh
For detailed
installation instructions, please see here.
You can run an instance of the pipeline in a background screen via:
$ python3 -m run_pipeline --conf=config.yaml
The CLI provides the following options:
$ python3 -m run_pipeline -h
usage: run_pipeline.py [-h] [--conf FILE] [--site SITE] [--list LIST] [--from FROM] [--to TO]
This script runs the tool pipeline.
optional arguments:
-h, --help show this help message and exit
--conf FILE, -C FILE pipeline configuration file. (default: config.yaml)
--site SITE, -S SITE website to test; overrides config file (default: None)
--list LIST, -L LIST site list to test; overrides config file (default: None)
--from FROM, -F FROM the first entry to consider when a site list is provided; overrides config file (default: -1)
--to TO, -T TO the last entry to consider when a site list is provided; overrides config file (default: -1)
Input Config: JAW expects a .yaml
config file as input. See config.yaml for an example.
Hint. The config file specifies different passes (e.g., crawling, static analysis, etc) which can be enabled or disabled for each vulnerability class. This allows running the tool building blocks individually, or in a different order (e.g., crawl all webapps first, then conduct security analysis).
For running a quick example demonstrating how to build a property graph and run Cypher queries over it, do:
$ python3 -m analyses.example.example_analysis --input=$(pwd)/data/test_program/test.js
This module collects the data (i.e., JavaScript code and state values of web pages) needed for testing. If you want to test a specific JavaScipt file that you already have on your file system, you can skip this step.
JAW has crawlers based on Selenium (JAW-v1), Puppeteer (JAW-v2, v3) and Playwright (JAW-v3). For most up-to-date features, it is recommended to use the Puppeteer- or Playwright-based versions.
This web crawler employs foxhound, an instrumented version of Firefox, to perform dynamic taint tracking as it navigates through webpages. To start the crawler, do:
$ cd crawler
$ node crawler-taint.js --seedurl=https://google.com --maxurls=100 --headless=true --foxhoundpath=<optional-foxhound-executable-path>
The foxhoundpath
is by default set to the following directory: crawler/foxhound/firefox
which contains a binary named firefox
.
Note: you need a build of foxhound to use this version. An ubuntu build is included in the JAW-v3 release.
To start the crawler, do:
$ cd crawler
$ node crawler.js --seedurl=https://google.com --maxurls=100 --browser=chrome --headless=true
See here for more information.
To start the crawler, do:
$ cd crawler/hpg_crawler
$ vim docker-compose.yaml # set the websites you want to crawl here and save
$ docker-compose build
$ docker-compose up -d
Please refer to the documentation of the hpg_crawler
here for more information.
To generate an HPG for a given (set of) JavaScript file(s), do:
$ node engine/cli.js --lang=js --graphid=graph1 --input=/in/file1.js --input=/in/file2.js --output=$(pwd)/data/out/ --mode=csv
optional arguments:
--lang: language of the input program
--graphid: an identifier for the generated HPG
--input: path of the input program(s)
--output: path of the output HPG, must be i
--mode: determines the output format (csv or graphML)
To import an HPG inside a neo4j graph database (docker instance), do:
$ python3 -m hpg_neo4j.hpg_import --rpath=<path-to-the-folder-of-the-csv-files> --id=<xyz> --nodes=<nodes.csv> --edges=<rels.csv>
$ python3 -m hpg_neo4j.hpg_import -h
usage: hpg_import.py [-h] [--rpath P] [--id I] [--nodes N] [--edges E]
This script imports a CSV of a property graph into a neo4j docker database.
optional arguments:
-h, --help show this help message and exit
--rpath P relative path to the folder containing the graph CSV files inside the `data` directory
--id I an identifier for the graph or docker container
--nodes N the name of the nodes csv file (default: nodes.csv)
--edges E the name of the relations csv file (default: rels.csv)
In order to create a hybrid property graph for the output of the hpg_crawler
and import it inside a local neo4j instance, you can also do:
$ python3 -m engine.api <path> --js=<program.js> --import=<bool> --hybrid=<bool> --reqs=<requests.out> --evts=<events.out> --cookies=<cookies.pkl> --html=<html_snapshot.html>
Specification of Parameters:
<path>
: absolute path to the folder containing the program files for analysis (must be under the engine/outputs
folder).--js=<program.js>
: name of the JavaScript program for analysis (default: js_program.js
).--import=<bool>
: whether the constructed property graph should be imported to an active neo4j database (default: true).--hybrid=bool
: whether the hybrid mode is enabled (default: false
). This implies that the tester wants to enrich the property graph by inputing files for any of the HTML snapshot, fired events, HTTP requests and cookies, as collected by the JAW crawler.--reqs=<requests.out>
: for hybrid mode only, name of the file containing the sequence of obsevered network requests, pass the string false
to exclude (default: request_logs_short.out
).--evts=<events.out>
: for hybrid mode only, name of the file containing the sequence of fired events, pass the string false
to exclude (default: events.out
).--cookies=<cookies.pkl>
: for hybrid mode only, name of the file containing the cookies, pass the string false
to exclude (default: cookies.pkl
).--html=<html_snapshot.html>
: for hybrid mode only, name of the file containing the DOM tree snapshot, pass the string false
to exclude (default: html_rendered.html
).For more information, you can use the help CLI provided with the graph construction API:
$ python3 -m engine.api -h
The constructed HPG can then be queried using Cypher or the NeoModel ORM.
You should place and run your queries in analyses/<ANALYSIS_NAME>
.
You can use the NeoModel ORM to query the HPG. To write a query:
example_query_orm.py
in the analyses/example
folder.$ python3 -m analyses.example.example_query_orm
For more information, please see here.
You can use Cypher to write custom queries. For this:
example_query_cypher.py
in the analyses/example
folder.$ python3 -m analyses.example.example_query_cypher
For more information, please see here.
This section describes how to configure and use JAW for vulnerability detection, and how to interpret the output. JAW contains, among others, self-contained queries for detecting client-side CSRF and DOM Clobbering
Step 1. enable the analysis component for the vulnerability class in the input config.yaml file:
request_hijacking:
enabled: true
# [...]
#
domclobbering:
enabled: false
# [...]
cs_csrf:
enabled: false
# [...]
Step 2. Run an instance of the pipeline with:
$ python3 -m run_pipeline --conf=config.yaml
Hint. You can run multiple instances of the pipeline under different screen
s:
$ screen -dmS s1 bash -c 'python3 -m run_pipeline --conf=conf1.yaml; exec sh'
$ screen -dmS s2 bash -c 'python3 -m run_pipeline --conf=conf2.yaml; exec sh'
$ # [...]
To generate parallel configuration files automatically, you may use the generate_config.py
script.
The outputs will be stored in a file called sink.flows.out
in the same folder as that of the input. For Client-side CSRF, for example, for each HTTP request detected, JAW outputs an entry marking the set of semantic types (a.k.a, semantic tags or labels) associated with the elements constructing the request (i.e., the program slices). For example, an HTTP request marked with the semantic type ['WIN.LOC']
is forgeable through the window.location
injection point. However, a request marked with ['NON-REACH']
is not forgeable.
An example output entry is shown below:
[*] Tags: ['WIN.LOC']
[*] NodeId: {'TopExpression': '86', 'CallExpression': '87', 'Argument': '94'}
[*] Location: 29
[*] Function: ajax
[*] Template: ajaxloc + "/bearer1234/"
[*] Top Expression: $.ajax({ xhrFields: { withCredentials: "true" }, url: ajaxloc + "/bearer1234/" })
1:['WIN.LOC'] variable=ajaxloc
0 (loc:6)- var ajaxloc = window.location.href
This entry shows that on line 29, there is a $.ajax
call expression, and this call expression triggers an ajax
request with the url template value of ajaxloc + "/bearer1234/
, where the parameter ajaxloc
is a program slice reading its value at line 6 from window.location.href
, thus forgeable through ['WIN.LOC']
.
In order to streamline the testing process for JAW and ensure that your setup is accurate, we provide a simple node.js
web application which you can test JAW with.
First, install the dependencies via:
$ cd tests/test-webapp
$ npm install
Then, run the application in a new screen:
$ screen -dmS jawwebapp bash -c 'PORT=6789 npm run devstart; exec sh'
For more information, visit our wiki page here. Below is a table of contents for quick access.
Pull requests are always welcomed. This project is intended to be a safe, welcoming space, and contributors are expected to adhere to the contributor code of conduct.
If you use the JAW for academic research, we encourage you to cite the following paper:
@inproceedings{JAW,
title = {JAW: Studying Client-side CSRF with Hybrid Property Graphs and Declarative Traversals},
author= {Soheil Khodayari and Giancarlo Pellegrino},
booktitle = {30th {USENIX} Security Symposium ({USENIX} Security 21)},
year = {2021},
address = {Vancouver, B.C.},
publisher = {{USENIX} Association},
}
JAW has come a long way and we want to give our contributors a well-deserved shoutout here!
@tmbrbr, @c01gide, @jndre, and Sepehr Mirzaei.
New bug bounty(vulnerabilities) collector
# python3 main.py
*2024-02-20 16:14:47.836189*
1. Arbitrary File Reading due to Lack of Input Filepath Validation
- Feb 6th 2024 / High (CVE-2024-0964)
- gradio-app/gradio
- https://huntr.com/bounties/25e25501-5918-429c-8541-88832dfd3741/
2. View Barcode Image leads to Remote Code Execution
- Jan 31st 2024 / Critical (CVE: Not yet)
- dolibarr/dolibarr
- https://huntr.com/bounties/f0ffd01e-8054-4e43-96f7-a0d2e652ac7e/
(delimiter-based file database)
# vim feeds.db
1|2024-02-20 16:17:40.393240|7fe14fd58ca2582d66539b2fe178eeaed3524342|CVE-2024-0964|https://huntr.com/bounties/25e25501-5918-429c-8541-88832dfd3741/
2|2024-02-20 16:17:40.393987|c6b84ac808e7f229a4c8f9fbd073b4c0727e07e1|CVE: Not yet|https://huntr.com/bounties/f0ffd01e-8054-4e43-96f7-a0d2e652ac7e/
3|2024-02-20 16:17:40.394582|7fead9658843919219a3b30b8249700d968d0cc9|CVE: Not yet|https://huntr.com/bounties/d6cb06dc-5d10-4197-8f89-847c3203d953/
4|2024-02-20 16:17:40.395094|81fecdd74318ce7da9bc29e81198e62f3225bd44|CVE: Not yet|https://huntr.com/bounties/d875d1a2-7205-4b2b-93cf-439fa4c4f961/
5|2024-02-20 16:17:40.395613|111045c8f1a7926174243db403614d4a58dc72ed|CVE: Not yet|https://huntr.com/bounties/10e423cd-7051-43fd-b736-4e18650d0172/
PhantomCrawler allows users to simulate website interactions through different proxy IP addresses. It leverages Python, requests, and BeautifulSoup to offer a simple and effective way to test website behaviour under varied proxy configurations.
Features:
Usage:
proxies.txt
in this format 50.168.163.176:80
How to Use:
git clone https://github.com/spyboy-productions/PhantomCrawler.git
pip3 install -r requirements.txt
python3 PhantomCrawler.py
Disclaimer: PhantomCrawler is intended for educational and testing purposes only. Users are cautioned against any misuse, including potential DDoS activities. Always ensure compliance with the terms of service of websites being tested and adhere to ethical standards.
Crawlector (the name Crawlector is a combination of Crawler & Detector) is a threat hunting framework designed for scanning websites for malicious objects.
Note-1: The framework was first presented at the No Hat conference in Bergamo, Italy on October 22nd, 2022 (Slides, YouTube Recording). Also, it was presented for the second time at the AVAR conference, in Singapore, on December 2nd, 2022.
Note-2: The accompanying tool EKFiddle2Yara (is a tool that takes EKFiddle rules and converts them into Yara rules) mentioned in the talk, was also released at both conferences.
This is for checking for malicious urls against every page being scanned. The framework could either query the list of malicious URLs from URLHaus server (configuration: url_list_web), or from a file on disk (configuration: url_list_file), and if the latter is specified, then, it takes precedence over the former.
It works by searching the content of every page against all URL entries in url_list_web or url_list_file, checking for all occurrences. Additionally, upon a match, and if the configuration option check_url_api is set to true, Crawlector will send a POST request to the API URL set in the url_api configuration option, which returns a JSON object with extra information about a matching URL. Such information includes urlh_status (ex., online, offline, unknown), urlh_threat (ex., malware_download), urlh_tags (ex., elf, Mozi), and urlh_reference (ex., https://urlhaus.abuse.ch/url/1116455/). This information will be included in the log file cl_mlog_<current_date><current_time><(pm|am)>.csv (check below), only if check_url_api is set to true. Otherwise, the log file will include the columns urlh_url (list o f matching malicious URLs) and urlh_hit (number of occurrences for every matching malicious URL), conditional on whether check_url is set to true.
URLHaus feature could be disabled in its entirety by setting the configuration option check_url to false.
It is important to note that this feature could slow scanning considering the huge number of malicious urls (~ 130 million entries at the time of this writing) that need to be checked, and the time it takes to get extra information from the URLHaus server (if the option check_url_api is set to true).
It is very important that you familiarize yourself with the configuration file cl_config.ini before running any session. All of the sections and parameters are documented in the configuration file itself.
The Yara offline scanning feature is a standalone option, meaning, if enabled, Crawlector will execute this feature only irrespective of other enabled features. And, the same is true for the crawling for domains/sites digital certificate feature. Either way, it is recommended that you disable all non-used features in the configuration file.
log_to_file
or log_to_cons
), if a Yara rule references only a module's attributes (ex., PE, ELF, Hash, etc...), then Crawlector will display only the rule's name upon a match, excluding offset and length data.To visit/scan a website, the list of URLs must be stored in text files, in the directory โcl_sitesโ.
Crawlector accepts three types of URLs:
[a-zA-Z0-9_-]{1,128} = <url>
<id>[
depth:<0|1>-><\d+>,
total:<\d+>,
sleep:<\d+>] = <url>
For example,
mfmokbel[depth:1->3,total:10,sleep:0] = https://www.mfmokbel.com
which is equivalent to: mfmokbel[d:1->3,t:10,s:0] = https://www.mfmokbel.com
where, <id> := [a-zA-Z0-9_-]{1,128}
depth, total and sleep, can also be replaced with their shortened versions d, t and s, respectively.
40 (10 + (10*3))
URLs.Note 1: Type 3 URL could be turned into type 1 URL by setting the configuration parameter live_crawler to false, in the configuration file, in the spider section.
Note 2: Empty lines and lines that start with โ;โ or โ//โ are ignored.
The spider functionality is what gives Crawlector the capability to find additional links on the targeted page. The Spider supports the following featuers:
Type 3
, for the Spider functionality to workexclude_url
config. option. For example, *.zip|*.exe|*.rar|*.zip|*.7z|*.pdf|.*bat|*.db
include_url
config. option. For example, */checkout/*|*/products/*
exclude_https
add_ext_links
. This feature honours the exclude_url
and include_url
config. option.ext_links_only
. This feature honours the exclude_url
and include_url
config. option.site_ranking
in the configuration file provides some options to alter how the CSV file is to be readsite
section provides the capability to expand on a given site, by attempting to find all available top-level domains (TLDs) and/or subdomains for the same domain. If found, new tlds/subdomains will be checked like any other domainrapid_api_key
in the configuration filefind_tlds
enabled, in addition to Omnisint Labs API tlds results, the framework attempts to find other active/registered domains by going through every tld entry, either, in the tlds_file
or tlds_url
tlds_url
is set, it should point to a url that hosts tlds, each one on a new line (lines that start with either of the characters ';', '#' or '//' are ignored)tlds_file
, holds the filename that contains the list of tlds (same as for tlds_url
; only the tld is present, excluding the '.', for ex., "com", "org")tlds_file
is set, it takes precedence over tlds_url
tld_dl_time_out
, this is for setting the maximum timeout for the dnslookup function when attempting to check if the domain in question resolves or nottld_use_connect
, this option enables the functionality to connect to the domain in question over a list of ports, defined in the option tlds_connect_ports
tlds_connect_ports
accepts a list of ports, comma separated, or a list of ranges, such as 25-40,90-100,80,443,8443 (range start and end are inclusive) tld_con_time_out
, this is for setting the maximum timeout for the connect functiontld_con_use_ssl
, enable/disable the use of ssl when attempting to connect to the domainsave_to_file_subd
is set to true, discovered subdomains will be saved to "\expanded\exp_subdomain_<pm|am>.txt"save_to_file_tld
is set to true, discovered domains will be saved to "\expanded\exp_tld_<pm|am>.txt"exit_here
is set to true, then Crawlector bails out after executing this [site] function, irrespective of other enabled options. It means found sites won't be crawled/spideredcl_sites
are allowed.Open for pull requests and issues. Comments and suggestions are greatly appreciated.
Mohamad Mokbel (@MFMokbel)
Simple Latest CVE Collector Written in Python
This collector uses a search query on https://www.cvedetails.com to collect information on vulnerabilities with a severity score of 6 or higher.
cvss_min_score
variable.webhook
.crontab or a similar scheduler
.# python3 main.py
*2023-10-10 11:05:33.370262*
1. CVE-2023-44832 / CVSS: 7.5 (HIGH)
- Published: 2023-10-05 16:15:12
- Updated: 2023-10-07 03:15:47
- CWE: CWE-120 Buffer Copy without Checking Size of Input ('Classic Buffer Overflow')
D-Link DIR-823G A1V1.0.2B05 was discovered to contain a buffer overflow via the MacAddress parameter in the SetWanSettings function. Th...
>> https://www.cve.org/CVERecord?id=CVE-2023-44832
- Ref.
(1) https://www.dlink.com/en/security-bulletin/
(2) https://github.com/bugfinder0/public_bug/tree/main/dlink/dir823g/SetWanSettings_MacAddress
2. CVE-2023-44831 / CVSS: 7.5 (HIGH)
- Published: 2023-10-05 16:15:12
- Updated: 2023-10-07 03:16:56
- CWE: CWE-120 Buffer Copy without Checking Size of Input ('Classic Buffer Overflow')
D-Lin k DIR-823G A1V1.0.2B05 was discovered to contain a buffer overflow via the Type parameter in the SetWLanRadioSettings function. Th...
>> https://www.cve.org/CVERecord?id=CVE-2023-44831
- Ref.
(1) https://www.dlink.com/en/security-bulletin/
(2) https://github.com/bugfinder0/public_bug/tree/main/dlink/dir823g/SetWLanRadioSettings_Type
(delimiter-based file database)
# vim feeds.db
1|2023-10-10 09:24:21.496744|0d239fa87be656389c035db1c3f5ec6ca3ec7448|CVE-2023-45613|2023-10-09 11:15:11|6.8|MEDIUM|CWE-295 Improper Certificate Validation
2|2023-10-10 09:24:27.073851|30ebff007cca946a16e5140adef5a9d5db11eee8|CVE-2023-45612|2023-10-09 11:15:11|8.6|HIGH|CWE-611 Improper Restriction of XML External Entity Reference
3|2023-10-10 09:24:32.650234|815b51259333ed88193fb3beb62c9176e07e4bd8|CVE-2023-45303|2023-10-06 19:15:13|8.4|HIGH|Not found CWE ids for CVE-2023-45303
4|2023-10-10 09:24:38.369632|39f98184087b8998547bba41c0ccf2f3ad61f527|CVE-2023-45248|2023-10-09 12:15:10|6.6|MEDIUM|CWE-427 Uncontrolled Search Path Element
5|2023-10-10 09:24:43.936863|60083d8626b0b1a59ef6fa16caec2b4fd1f7a6d7|CVE-2023-45247|2023-10-09 12:15:10|7.1|HIGH|CWE-862 Missing Authorization
6|2023-10-10 09:24:49.472179|82611add9de44e5807b8f8324bdfb065f6d4177a|CVE-2023-45246|2023-10-06 11:15:11|7.1|HIGH|CWE-287 Improper Authentication
7|20 23-10-10 09:24:55.049191|b78014cd7ca54988265b19d51d90ef935d2362cf|CVE-2023-45244|2023-10-06 10:15:18|7.1|HIGH|CWE-862 Missing Authorization
The methods for collecting CVE (Common Vulnerabilities and Exposures) information are divided into different stages. They are primarily categorized into two
(1) Method for retrieving CVE information after vulnerability analysis and risk assessment have been completed.
(2) Method for retrieving CVE information at the stage when it is included as a vulnerability.
๏ธ๏ต๏ธ Pinkerton is a Python tool created to crawl JavaScript files and search for secrets
A quick guide of how to install and use Pinkerton.
1. Clone the repository with: git clone https://github.com/oppsec/pinkerton.git
2. Install the libraries with: pip3 install -r requirements.txt
3. Run Pinkerton with: python3 main.py -u https://example.com
If you want to use pinkerton in a Docker container, follow this commands:
1. Clone the repository - git clone https://github.com/oppsec/pinkerton.git
2. Build the image - sudo docker build -t pinkerton:latest .
3. Run container - sudo docker run pinkerton:latest
pip3 install -r requirements.txt
A quick guide of how to contribute with the project.
1. Create a fork from Pinkerton repository
2. Clone the repository with git clone https://github.com/your/pinkerton.git
3. Type cd pinkerton/
4. Create a branch and make your changes
5. Commit and make a git push
6. Open a pull request
xcrawl3r
is a command-line interface (CLI) utility to recursively crawl webpages i.e systematically browse webpages' URLs and follow links to discover linked webpages' URLs.
.js
, .json
, .xml
, .csv
, .txt
& .map
).robots.txt
.Visit the releases page and find the appropriate archive for your operating system and architecture. Download the archive from your browser or copy its URL and retrieve it with wget
or curl
:
...with wget
:
wget https://github.com/hueristiq/xcrawl3r/releases/download/v<version>/xcrawl3r-<version>-linux-amd64.tar.gz
...or, with curl
:
curl -OL https://github.com/hueristiq/xcrawl3r/releases/download/v<version>/xcrawl3r-<version>-linux-amd64.tar.gz
...then, extract the binary:
tar xf xcrawl3r-<version>-linux-amd64.tar.gz
TIP: The above steps, download and extract, can be combined into a single step with this onliner
curl -sL https://github.com/hueristiq/xcrawl3r/releases/download/v<version>/xcrawl3r-<version>-linux-amd64.tar.gz | tar -xzv
NOTE: On Windows systems, you should be able to double-click the zip archive to extract the xcrawl3r
executable.
...move the xcrawl3r
binary to somewhere in your PATH
. For example, on GNU/Linux and OS X systems:
sudo mv xcrawl3r /usr/local/bin/
NOTE: Windows users can follow How to: Add Tool Locations to the PATH Environment Variable in order to add xcrawl3r
to their PATH
.
Before you install from source, you need to make sure that Go is installed on your system. You can install Go by following the official instructions for your operating system. For this, we will assume that Go is already installed.
go install ...
go install -v github.com/hueristiq/xcrawl3r/cmd/xcrawl3r@latest
go build ...
the development VersionClone the repository
git clone https://github.com/hueristiq/xcrawl3r.git
Build the utility
cd xcrawl3r/cmd/xcrawl3r && \
go build .
Move the xcrawl3r
binary to somewhere in your PATH
. For example, on GNU/Linux and OS X systems:
sudo mv xcrawl3r /usr/local/bin/
NOTE: Windows users can follow How to: Add Tool Locations to the PATH Environment Variable in order to add xcrawl3r
to their PATH
.
NOTE: While the development version is a good way to take a peek at xcrawl3r
's latest features before they get released, be aware that it may have bugs. Officially released versions will generally be more stable.
To display help message for xcrawl3r
use the -h
flag:
xcrawl3r -h
help message:
_ _____
__ _____ _ __ __ ___ _| |___ / _ __
\ \/ / __| '__/ _` \ \ /\ / / | |_ \| '__|
> < (__| | | (_| |\ V V /| |___) | |
/_/\_\___|_| \__,_| \_/\_/ |_|____/|_| v0.1.0
A CLI utility to recursively crawl webpages.
USAGE:
xcrawl3r [OPTIONS]
INPUT:
-d, --domain string domain to match URLs
--include-subdomains bool match subdomains' URLs
-s, --seeds string seed URLs file (use `-` to get from stdin)
-u, --url string URL to crawl
CONFIGURATION:
--depth int maximum depth to crawl (default 3)
TIP: set it to `0` for infinite recursion
--headless bool If true the browser will be displayed while crawling.
-H, --headers string[] custom header to include in requests
e.g. -H 'Referer: http://example.com/'
TIP: use multiple flag to set multiple headers
--proxy string[] Proxy URL (e.g: http://127.0.0.1:8080)
TIP: use multiple flag to set multiple proxies
--render bool utilize a headless chrome instance to render pages
--timeout int time to wait for request in seconds (default: 10)
--user-agent string User Agent to use (default: web)
TIP: use `web` for a random web user-agent,
`mobile` for a random mobile user-agent,
or you can set your specific user-agent.
RATE LIMIT:
-c, --concurrency int number of concurrent fetchers to use (default 10)
--delay int delay between each request in seconds
--max-random-delay int maximux extra randomized delay added to `--dalay` (default: 1s)
-p, --parallelism int number of concurrent URLs to process (default: 10)
OUTPUT:
--debug bool enable debug mode (default: false)
-m, --monochrome bool coloring: no colored output mode
-o, --output string output file to write found URLs
-v, --verbosity string debug, info, warning, error, fatal or silent (default: debug)
Issues and Pull Requests are welcome! Check out the contribution guidelines.
This utility is distributed under the MIT license.
Alternatives - Check out projects below, that may fit in your workflow:
An advance cross-platform and multi-feature GUI web spider/crawler for cyber security proffesionals. Spider Suite can be used for attack surface mapping and analysis. For more information visit SpiderSuite's website.
Spider Suite is designed for easy installation and usage even for first timers.
First, download the package of your choice.
Then install the downloaded SpiderSuite package.
See First time crawling with SpiderSuite article for tutorial on how to get started.
For complete documentation of Spider Suite see wiki.
Can you translate?
Visit SpiderSuite's translation project to make translations to your native language.
Not a developer?
You can help by reporting bugs, requesting new features, improving the documentation, sponsoring the project & writing articles.
For More information see contribution guide.
Contributers
This product includes software developed by the following open source projects:
Features โข Installation โข Usage โข Scope โข Config โข Filters โข Join Discord
katana requires Go 1.18 to install successfully. To install, just run the below command or download pre-compiled binary from release page.
go install github.com/projectdiscovery/katana/cmd/katana@latest
katana -h
This will display help for the tool. Here are all the switches it supports.
Usage:
./katana [flags]
Flags:
INPUT:
-u, -list string[] target url / list to crawl
CONFIGURATION:
-d, -depth int maximum depth to crawl (default 2)
-jc, -js-crawl enable endpoint parsing / crawling in javascript file
-ct, -crawl-duration int maximum duration to crawl the target for
-kf, -known-files string enable crawling of known files (all,robotstxt,sitemapxml)
-mrs, -max-response-size int maximum response size to read (default 2097152)
-timeout int time to wait for request in seconds (default 10)
-aff, -automatic-form-fill enable optional automatic form filling (experimental)
-retry int number of times to retry the request (default 1)
-proxy string http/socks5 proxy to use
-H, -headers string[] custom hea der/cookie to include in request
-config string path to the katana configuration file
-fc, -form-config string path to custom form configuration file
DEBUG:
-health-check, -hc run diagnostic check up
-elog, -error-log string file to write sent requests error log
HEADLESS:
-hl, -headless enable headless hybrid crawling (experimental)
-sc, -system-chrome use local installed chrome browser instead of katana installed
-sb, -show-browser show the browser on the screen with headless mode
-ho, -headless-options string[] start headless chrome with additional options
-nos, -no-sandbox start headless chrome in --no-sandbox mode
-scp, -system-chrome-path string use specified chrome binary path for headless crawling
-noi, -no-incognito start headless chrome without incognito mode
SCOPE:
-cs, -crawl-scope string[] in scope url regex to be followed by crawler
-cos, -crawl-out-scope string[] out of scope url regex to be excluded by crawler
-fs, -field-scope string pre-defined scope field (dn,rdn,fqdn) (default "rdn")
-ns, -no-scope disables host based default scope
-do, -display-out-scope display external endpoint from scoped crawling
FILTER:
-f, -field string field to display in output (url,path,fqdn,rdn,rurl,qurl,qpath,file,key,value,kv,dir,udir)
-sf, -store-field string field to store in per-host output (url,path,fqdn,rdn,rurl,qurl,qpath,file,key,value,kv,dir,udir)
-em, -extension-match string[] match output for given extension (eg, -em php,html,js)
-ef, -extension-filter string[] filter output for given extension (eg, -ef png,css)
RATE-LIMIT:
-c, -concurrency int number of concurrent fetchers to use (defaul t 10)
-p, -parallelism int number of concurrent inputs to process (default 10)
-rd, -delay int request delay between each request in seconds
-rl, -rate-limit int maximum requests to send per second (default 150)
-rlm, -rate-limit-minute int maximum number of requests to send per minute
OUTPUT:
-o, -output string file to write output to
-j, -json write output in JSONL(ines) format
-nc, -no-color disable output content coloring (ANSI escape codes)
-silent display output only
-v, -verbose display verbose output
-version display project version
katana requires url or endpoint to crawl and accepts single or multiple inputs.
Input URL can be provided using -u
option, and multiple values can be provided using comma-separated input, similarly file input is supported using -list
option and additionally piped input (stdin) is also supported.
katana -u https://tesla.com
katana -u https://tesla.com,https://google.com
$ cat url_list.txt
https://tesla.com
https://google.com
katana -list url_list.txt
echo https://tesla.com | katana
cat domains | httpx | katana
Example running katana -
katana -u https://youtube.com
__ __
/ /_____ _/ /____ ____ ___ _
/ '_/ _ / __/ _ / _ \/ _ /
/_/\_\\_,_/\__/\_,_/_//_/\_,_/ v0.0.1
projectdiscovery.io
[WRN] Use with caution. You are responsible for your actions.
[WRN] Developers assume no liability and are not responsible for any misuse or damage.
https://www.youtube.com/
https://www.youtube.com/about/
https://www.youtube.com/about/press/
https://www.youtube.com/about/copyright/
https://www.youtube.com/t/contact_us/
https://www.youtube.com/creators/
https://www.youtube.com/ads/
https://www.youtube.com/t/terms
https://www.youtube.com/t/privacy
https://www.youtube.com/about/policies/
https://www.youtube.com/howyoutubeworks?utm_campaign=ytgen&utm_source=ythp&utm_medium=LeftNav&utm_content=txt&u=https%3A%2F%2Fwww.youtube.com %2Fhowyoutubeworks%3Futm_source%3Dythp%26utm_medium%3DLeftNav%26utm_campaign%3Dytgen
https://www.youtube.com/new
https://m.youtube.com/
https://www.youtube.com/s/desktop/4965577f/jsbin/desktop_polymer.vflset/desktop_polymer.js
https://www.youtube.com/s/desktop/4965577f/cssbin/www-main-desktop-home-page-skeleton.css
https://www.youtube.com/s/desktop/4965577f/cssbin/www-onepick.css
https://www.youtube.com/s/_/ytmainappweb/_/ss/k=ytmainappweb.kevlar_base.0Zo5FUcPkCg.L.B1.O/am=gAE/d=0/rs=AGKMywG5nh5Qp-BGPbOaI1evhF5BVGRZGA
https://www.youtube.com/opensearch?locale=en_GB
https://www.youtube.com/manifest.webmanifest
https://www.youtube.com/s/desktop/4965577f/cssbin/www-main-desktop-watch-page-skeleton.css
https://www.youtube.com/s/desktop/4965577f/jsbin/web-animations-next-lite.min.vflset/web-animations-next-lite.min.js
https://www.youtube.com/s/desktop/4965577f/jsbin/custom-elements-es5-adapter.vflset/custom-elements-es5-adapter.js
https://w ww.youtube.com/s/desktop/4965577f/jsbin/webcomponents-sd.vflset/webcomponents-sd.js
https://www.youtube.com/s/desktop/4965577f/jsbin/intersection-observer.min.vflset/intersection-observer.min.js
https://www.youtube.com/s/desktop/4965577f/jsbin/scheduler.vflset/scheduler.js
https://www.youtube.com/s/desktop/4965577f/jsbin/www-i18n-constants-en_GB.vflset/www-i18n-constants.js
https://www.youtube.com/s/desktop/4965577f/jsbin/www-tampering.vflset/www-tampering.js
https://www.youtube.com/s/desktop/4965577f/jsbin/spf.vflset/spf.js
https://www.youtube.com/s/desktop/4965577f/jsbin/network.vflset/network.js
https://www.youtube.com/howyoutubeworks/
https://www.youtube.com/trends/
https://www.youtube.com/jobs/
https://www.youtube.com/kids/
Standard crawling modality uses the standard go http library under the hood to handle HTTP requests/responses. This modality is much faster as it doesn't have the browser overhead. Still, it analyzes HTTP responses body as is, without any javascript or DOM rendering, potentially missing post-dom-rendered endpoints or asynchronous endpoint calls that might happen in complex web applications depending, for example, on browser-specific events.
Headless mode hooks internal headless calls to handle HTTP requests/responses directly within the browser context. This offers two advantages:
Headless crawling is optional and can be enabled using -headless
option.
Here are other headless CLI options -
katana -h headless
Flags:
HEADLESS:
-hl, -headless enable experimental headless hybrid crawling
-sc, -system-chrome use local installed chrome browser instead of katana installed
-sb, -show-browser show the browser on the screen with headless mode
-ho, -headless-options string[] start headless chrome with additional options
-nos, -no-sandbox start headless chrome in --no-sandbox mode
-noi, -no-incognito start headless chrome without incognito mode
-no-sandbox
Runs headless chrome browser with no-sandbox option, useful when running as root user.
katana -u https://tesla.com -headless -no-sandbox
-no-incognito
Runs headless chrome browser without incognito mode, useful when using the local browser.
katana -u https://tesla.com -headless -no-incognito
-headless-options
When crawling in headless mode, additional chrome options can be specified using -headless-options
, for example -
katana -u https://tesla.com -headless -system-chrome -headless-options --disable-gpu,proxy-server=http://127.0.0.1:8080
Crawling can be endless if not scoped, as such katana comes with multiple support to define the crawl scope.
-field-scope
Most handy option to define scope with predefined field name, rdn
being default option for field scope.
rdn
- crawling scoped to root domain name and all subdomains (e.g. *example.com
) (default)fqdn
- crawling scoped to given sub(domain) (e.g. www.example.com
or api.example.com
)dn
- crawling scoped to domain name keyword (e.g. example
)katana -u https://tesla.com -fs dn
-crawl-scope
For advanced scope control, -cs
option can be used that comes with regex support.
katana -u https://tesla.com -cs login
For multiple in scope rules, file input with multiline string / regex can be passed.
$ cat in_scope.txt
login/
admin/
app/
wordpress/
katana -u https://tesla.com -cs in_scope.txt
-crawl-out-scope
For defining what not to crawl, -cos
option can be used and also support regex input.
katana -u https://tesla.com -cos logout
For multiple out of scope rules, file input with multiline string / regex can be passed.
$ cat out_of_scope.txt
/logout
/log_out
katana -u https://tesla.com -cos out_of_scope.txt
-no-scope
Katana is default to scope *.domain
, to disable this -ns
option can be used and also to crawl the internet.
katana -u https://tesla.com -ns
-display-out-scope
As default, when scope option is used, it also applies for the links to display as output, as such external URLs are default to exclude and to overwrite this behavior, -do
option can be used to display all the external URLs that exist in targets scoped URL / Endpoint.
katana -u https://tesla.com -do
Here is all the CLI options for the scope control -
katana -h scope
Flags:
SCOPE:
-cs, -crawl-scope string[] in scope url regex to be followed by crawler
-cos, -crawl-out-scope string[] out of scope url regex to be excluded by crawler
-fs, -field-scope string pre-defined scope field (dn,rdn,fqdn) (default "rdn")
-ns, -no-scope disables host based default scope
-do, -display-out-scope display external endpoint from scoped crawling
Katana comes with multiple options to configure and control the crawl as the way we want.
-depth
Option to define the depth
to follow the urls for crawling, the more depth the more number of endpoint being crawled + time for crawl.
katana -u https://tesla.com -d 5
-js-crawl
Option to enable JavaScript file parsing + crawling the endpoints discovered in JavaScript files, disabled as default.
katana -u https://tesla.com -jc
-crawl-duration
Option to predefined crawl duration, disabled as default.
katana -u https://tesla.com -ct 2
-known-files
Option to enable crawling robots.txt
and sitemap.xml
file, disabled as default.
katana -u https://tesla.com -kf robotstxt,sitemapxml
-automatic-form-fill
Option to enable automatic form filling for known / unknown fields, known field values can be customized as needed by updating form config file at $HOME/.config/katana/form-config.yaml
.
Automatic form filling is experimental feature.
-aff, -automatic-form-fill enable optional automatic form filling (experimental)
There are more options to configure when needed, here is all the config related CLI options -
katana -h config
Flags:
CONFIGURATION:
-d, -depth int maximum depth to crawl (default 2)
-jc, -js-crawl enable endpoint parsing / crawling in javascript file
-ct, -crawl-duration int maximum duration to crawl the target for
-kf, -known-files string enable crawling of known files (all,robotstxt,sitemapxml)
-mrs, -max-response-size int maximum response size to read (default 2097152)
-timeout int time to wait for request in seconds (default 10)
-retry int number of times to retry the request (default 1)
-proxy string http/socks5 proxy to use
-H, -headers string[] custom header/cookie to include in request
-config string path to the katana configuration file
-fc, -form-config string path to custom form configuration file
-field
Katana comes with built in fields that can be used to filter the output for the desired information, -f
option can be used to specify any of the available fields.
-f, -field string field to display in output (url,path,fqdn,rdn,rurl,qurl,qpath,file,key,value,kv,dir,udir)
Here is a table with examples of each field and expected output when used -
FIELD | DESCRIPTION | EXAMPLE |
---|---|---|
url | URL Endpoint | https://admin.projectdiscovery.io/admin/login?user=admin&password=admin |
qurl | URL including query param | https://admin.projectdiscovery.io/admin/login.php?user=admin&password=admin |
qpath | Path including query param | /login?user=admin&password=admin |
path | URL Path | https://admin.projectdiscovery.io/admin/login |
fqdn | Fully Qualified Domain name | admin.projectdiscovery.io |
rdn | Root Domain name | projectdiscovery.io |
rurl | Root URL | https://admin.projectdiscovery.io |
file | Filename in URL | login.php |
key | Parameter keys in URL | user,password |
value | Parameter values in URL | admin,admin |
kv | Keys=Values in URL | user=admin&password=admin |
dir | URL Directory name | /admin/ |
udir | URL with Directory | https://admin.projectdiscovery.io/admin/ |
Here is an example of using field option to only display all the urls with query parameter in it -
katana -u https://tesla.com -f qurl -silent
https://shop.tesla.com/en_au?redirect=no
https://shop.tesla.com/en_nz?redirect=no
https://shop.tesla.com/product/men_s-raven-lightweight-zip-up-bomber-jacket?sku=1740250-00-A
https://shop.tesla.com/product/tesla-shop-gift-card?sku=1767247-00-A
https://shop.tesla.com/product/men_s-chill-crew-neck-sweatshirt?sku=1740176-00-A
https://www.tesla.com/about?redirect=no
https://www.tesla.com/about/legal?redirect=no
https://www.tesla.com/findus/list?redirect=no
You can create custom fields to extract and store specific information from page responses using regex rules. These custom fields are defined using a YAML config file and are loaded from the default location at $HOME/.config/katana/field-config.yaml
. Alternatively, you can use the -flc
option to load a custom field config file from a different location. Here is example custom field.
- name: email
type: regex
regex:
- '([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)'
- '([a-zA-Z0-9+._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)'
- name: phone
type: regex
regex:
- '\d{3}-\d{8}|\d{4}-\d{7}'
When defining custom fields, following attributes are supported:
The value of name attribute is used as the
-field
cli option value.
The type of custom attribute, currenly supported option -
regex
The part of the response to extract the information from. The default value is
response
, which includes both the header and body. Other possible values areheader
andbody
.
You can use this attribute to select a specific matched group in regex, for example:
group: 1
katana -u https://tesla.com -f email,phone
-store-field
To compliment field
option which is useful to filter output at run time, there is -sf, -store-fields
option which works exactly like field option except instead of filtering, it stores all the information on the disk under katana_field
directory sorted by target url.
katana -u https://tesla.com -sf key,fqdn,qurl -silent
$ ls katana_field/
https_www.tesla.com_fqdn.txt
https_www.tesla.com_key.txt
https_www.tesla.com_qurl.txt
The -store-field
option can be useful for collecting information to build a targeted wordlist for various purposes, including but not limited to:
-extension-match
Crawl output can be easily matched for specific extension using -em
option to ensure to display only output containing given extension.
katana -u https://tesla.com -silent -em js,jsp,json
-extension-filter
Crawl output can be easily filtered for specific extension using -ef
option which ensure to remove all the urls containing given extension.
katana -u https://tesla.com -silent -ef css,txt,md
Here are additional filter options -
-f, -field string field to display in output (url,path,fqdn,rdn,rurl,qurl,file,key,value,kv,dir,udir)
-sf, -store-field string field to store in per-host output (url,path,fqdn,rdn,rurl,qurl,file,key,value,kv,dir,udir)
-em, -extension-match string[] match output for given extension (eg, -em php,html,js)
-ef, -extension-filter string[] filter output for given extension (eg, -ef png,css)
It's easy to get blocked / banned while crawling if not following target websites limits, katana comes with multiple option to tune the crawl to go as fast / slow we want.
-delay
option to introduce a delay in seconds between each new request katana makes while crawling, disabled as default.
katana -u https://tesla.com -delay 20
-concurrency
option to control the number of urls per target to fetch at the same time.
katana -u https://tesla.com -c 20
-parallelism
option to define number of target to process at same time from list input.
katana -u https://tesla.com -p 20
-rate-limit
option to use to define max number of request can go out per second.
katana -u https://tesla.com -rl 100
-rate-limit-minute
option to use to define max number of request can go out per minute.
katana -u https://tesla.com -rlm 500
Here is all long / short CLI options for rate limit control -
katana -h rate-limit
Flags:
RATE-LIMIT:
-c, -concurrency int number of concurrent fetchers to use (default 10)
-p, -parallelism int number of concurrent inputs to process (default 10)
-rd, -delay int request delay between each request in seconds
-rl, -rate-limit int maximum requests to send per second (default 150)
-rlm, -rate-limit-minute int maximum number of requests to send per minute
Katana support both file output in plain text format as well as JSON which includes additional information like, source
, tag
, and attribute
name to co-related the discovered endpoint.
-output
By default, katana outputs the crawled endpoints in plain text format. The results can be written to a file by using the -output option.
katana -u https://example.com -no-scope -output example_endpoints.txt
-json
katana -u https://example.com -json -do | jq .
{
"timestamp": "2022-11-05T22:33:27.745815+05:30",
"endpoint": "https://www.iana.org/domains/example",
"source": "https://example.com",
"tag": "a",
"attribute": "href"
}
-store-response
The -store-response
option allows for writing all crawled endpoint requests and responses to a text file. When this option is used, text files including the request and response will be written to the katana_response directory. If you would like to specify a custom directory, you can use the -store-response-dir
option.
katana -u https://example.com -no-scope -store-response
$ cat katana_response/index.txt
katana_response/example.com/327c3fda87ce286848a574982ddd0b7c7487f816.txt https://example.com (200 OK)
katana_response/www.iana.org/bfc096e6dd93b993ca8918bf4c08fdc707a70723.txt http://www.iana.org/domains/reserved (200 OK)
Note:
-store-response
option is not supported in -headless
mode.
Here are additional CLI options related to output -
katana -h output
OUTPUT:
-o, -output string file to write output to
-sr, -store-response store http requests/responses
-srd, -store-response-dir string store http requests/responses to custom directory
-j, -json write output in JSONL(ines) format
-nc, -no-color disable output content coloring (ANSI escape codes)
-silent display output only
-v, -verbose display verbose output
-version display project version