Python 3 script to dump company employees from LinkedIn API๏ฌ
LinkedInDumper is a Python 3 script that dumps employee data from the LinkedIn social networking platform.
The results contain firstname, lastname, position (title), location and a user's profile link. Only 2 API calls are required to retrieve all employees if the company does not have more than 10 employees. Otherwise, we have to paginate through the API results. With the --email-format
CLI flag one can define a Python string format to auto generate email addresses based on the retrieved first and last name.
LinkedInDumper talks with the unofficial LinkedIn Voyager API, which requires authentication. Therefore, you must have a valid LinkedIn user account. To keep it simple, LinkedInDumper just expects a cookie value provided by you. Doing it this way, even 2FA protected accounts are supported. Furthermore, you are tasked to provide a LinkedIn company URL to dump employees from.
li_at
session cookie value e.g. via developer toolsli_at
or temporarily during runtime via the CLI flag --cookie
usage: linkedindumper.py [-h] --url <linkedin-url> [--cookie <cookie>] [--quiet] [--include-private-profiles] [--email-format EMAIL_FORMAT]
options:
-h, --help show this help message and exit
--url <linkedin-url> A LinkedIn company url - https://www.linkedin.com/company/<company>
--cookie <cookie> LinkedIn 'li_at' session cookie
--quiet Show employee results only
--include-private-profiles
Show private accounts too
--email-format Python string format for emails; for example:
[1] john.doe@example.com > '{0}.{1}@example.com'
[2] j.doe@example.com > '{0[0]}.{1}@example.com'
[3] jdoe@example.com > '{0[0]}{1}@example.com'
[4] doe@example.com > '{1}@example.com'
[5] john@example.com > '{0}@example.com'
[6] jd@example.com > '{0[0]}{1[0]}@example.com'
docker run --rm l4rm4nd/linkedindumper:latest --url 'https://www.linkedin.com/company/apple' --cookie <cookie> --email-format '{0}.{1}@apple.de'
# install dependencies
pip install -r requirements.txt
python3 linkedindumper.py --url 'https://www.linkedin.com/company/apple' --cookie <cookie> --email-format '{0}.{1}@apple.de'
The script will return employee data as semi-colon separated values (like CSV):
โโโ โโโ โโโโ โ โโ โโโโโโโโโ โโโโโโโ โโโ โโโโ โ โโโโโโโ โ โโ โโโโ โโโโโ โโโโโโ โโโโโโ โโโโโโ
โโโโ โโโโ โโ โโ โ โโโโโ โโ โ โโโโ โโโโโโโ โโ โโ โ โโโโ โโโ โโ โโโโโโโโโโ& #9600; โโโโโโโ โโโโโ โ โโโ โ โโโ
โโโโ โโโโโโโ โโ โโโโโโโโโ โโโโ โโโ โโโโโโโโโ โโ โโโโโโ โโโโโ โโโโโโโ โโโโโโโโ โโโโโโโโ โโโ โโโ โ
โโโโ โโโโโโโโ โโโโโโโโ โโ โโโ โ โโโโ โ&# 9617;โโโโโโโ โโโโโโโโโ โโโโ โโโโโโโ โโโ โโโโโโโ โโโโ โ โโโโโโโ
โโโโโโโโโโโโโโโโ โโโโโโโโ โโโโโโโโโโโโโโโโ โโโโโโโโ โโโโโโโโโโโ โโโโโโโโ โโโโ โโโโโโโโ โ โโโโโโโ& #9618;โโโโ โโโโ
โ โโโ โโโ โ โโ โ โ โ โโ โโโโ โโ โ โโโ โ โโ โ โโ โ โ โโโ โ โโโโ โ โ โ โโ โ โโโโโ โ โโโ โโ โโ โโ โโโโ
โ โ โ โ โ โโ โโ โ โโโ โโ โโ โ โ โ โ โ โ โ โโ โโ โ โโ โ โ โ โโโโ โ โ โ โ โโโ โ โ โ โ โโ โ โโ
โ โ โ โ โ โ โ โ โโ โ โ โ โ โ โ โ โ โ โ โ โ โ โโโ โ โ โ โ โโ โ โโ โ
โ โ โ โ โ โ โ โ โ โ โ โ โ โ โ โ โ
โ โ โ by LRVT
[i] Company Name: apple
[i] Company X-ID: 162479
[i] LN Employees: 1000 employees found
[i] Dumping Date: 17/10/2022 13:55:06
[i] Email Format: {0}.{1}@apple.de
Firstname;Lastname;Email;Position;Gender;Location;Profile
Katrin;Honauer;katrin.honauer@apple.com;Software Engineer at Apple;N/A;Heidelberg;https://www.linkedin.com/in/katrin-honauer
Raymond;Chen;raymond.chen@apple.com;Recruiting at Apple;N/A;Austin, Texas Metropolitan Area;https://www.linkedin.com/in/raytherecruiter
[i] Successfully crawled 2 unique apple employee(s). Hurray ^_-
LinkedIn will allow only the first 1,000 search results to be returned when harvesting contact information. You may also need a LinkedIn premium account when you reached the maximum allowed queries for visiting profiles with your freemium LinkedIn account.
Furthermore, not all employee profiles are public. The results vary depending on your used LinkedIn account and whether you are befriended with some employees of the company to crawl or not. Therefore, it is sometimes not possible to retrieve the firstname, lastname and profile url of some employee accounts. The script will not display such profiles, as they contain default values such as "LinkedIn" as firstname and "Member" in the lastname. If you want to include such private profiles, please use the CLI flag --include-private-profiles
. Although some accounts may be private, we can obtain the position (title) as well as the location of such accounts. Only firstname, lastname and profile URL are hidden for private LinkedIn accounts.
Finally, LinkedIn users are free to name their profile. An account name can therefore consist of various things such as saluations, abbreviations, emojis, middle names etc. I tried my best to remove some nonsense. However, this is not a complete solution to the general problem. Note that we are not using the official LinkedIn API. This script gathers information from the "unofficial" Voyager API.
An advance cross-platform and multi-feature GUI web spider/crawler for cyber security proffesionals. Spider Suite can be used for attack surface mapping and analysis. For more information visit SpiderSuite's website.
Spider Suite is designed for easy installation and usage even for first timers.
First, download the package of your choice.
Then install the downloaded SpiderSuite package.
See First time crawling with SpiderSuite article for tutorial on how to get started.
For complete documentation of Spider Suite see wiki.
Can you translate?
Visit SpiderSuite's translation project to make translations to your native language.
Not a developer?
You can help by reporting bugs, requesting new features, improving the documentation, sponsoring the project & writing articles.
For More information see contribution guide.
Contributers
This product includes software developed by the following open source projects:
Features โข Installation โข Usage โข Scope โข Config โข Filters โข Join Discord
katana requires Go 1.18 to install successfully. To install, just run the below command or download pre-compiled binary from release page.
go install github.com/projectdiscovery/katana/cmd/katana@latest
katana -h
This will display help for the tool. Here are all the switches it supports.
Usage:
./katana [flags]
Flags:
INPUT:
-u, -list string[] target url / list to crawl
CONFIGURATION:
-d, -depth int maximum depth to crawl (default 2)
-jc, -js-crawl enable endpoint parsing / crawling in javascript file
-ct, -crawl-duration int maximum duration to crawl the target for
-kf, -known-files string enable crawling of known files (all,robotstxt,sitemapxml)
-mrs, -max-response-size int maximum response size to read (default 2097152)
-timeout int time to wait for request in seconds (default 10)
-aff, -automatic-form-fill enable optional automatic form filling (experimental)
-retry int number of times to retry the request (default 1)
-proxy string http/socks5 proxy to use
-H, -headers string[] custom hea der/cookie to include in request
-config string path to the katana configuration file
-fc, -form-config string path to custom form configuration file
DEBUG:
-health-check, -hc run diagnostic check up
-elog, -error-log string file to write sent requests error log
HEADLESS:
-hl, -headless enable headless hybrid crawling (experimental)
-sc, -system-chrome use local installed chrome browser instead of katana installed
-sb, -show-browser show the browser on the screen with headless mode
-ho, -headless-options string[] start headless chrome with additional options
-nos, -no-sandbox start headless chrome in --no-sandbox mode
-scp, -system-chrome-path string use specified chrome binary path for headless crawling
-noi, -no-incognito start headless chrome without incognito mode
SCOPE:
-cs, -crawl-scope string[] in scope url regex to be followed by crawler
-cos, -crawl-out-scope string[] out of scope url regex to be excluded by crawler
-fs, -field-scope string pre-defined scope field (dn,rdn,fqdn) (default "rdn")
-ns, -no-scope disables host based default scope
-do, -display-out-scope display external endpoint from scoped crawling
FILTER:
-f, -field string field to display in output (url,path,fqdn,rdn,rurl,qurl,qpath,file,key,value,kv,dir,udir)
-sf, -store-field string field to store in per-host output (url,path,fqdn,rdn,rurl,qurl,qpath,file,key,value,kv,dir,udir)
-em, -extension-match string[] match output for given extension (eg, -em php,html,js)
-ef, -extension-filter string[] filter output for given extension (eg, -ef png,css)
RATE-LIMIT:
-c, -concurrency int number of concurrent fetchers to use (defaul t 10)
-p, -parallelism int number of concurrent inputs to process (default 10)
-rd, -delay int request delay between each request in seconds
-rl, -rate-limit int maximum requests to send per second (default 150)
-rlm, -rate-limit-minute int maximum number of requests to send per minute
OUTPUT:
-o, -output string file to write output to
-j, -json write output in JSONL(ines) format
-nc, -no-color disable output content coloring (ANSI escape codes)
-silent display output only
-v, -verbose display verbose output
-version display project version
katana requires url or endpoint to crawl and accepts single or multiple inputs.
Input URL can be provided using -u
option, and multiple values can be provided using comma-separated input, similarly file input is supported using -list
option and additionally piped input (stdin) is also supported.
katana -u https://tesla.com
katana -u https://tesla.com,https://google.com
$ cat url_list.txt
https://tesla.com
https://google.com
katana -list url_list.txt
echo https://tesla.com | katana
cat domains | httpx | katana
Example running katana -
katana -u https://youtube.com
__ __
/ /_____ _/ /____ ____ ___ _
/ '_/ _ / __/ _ / _ \/ _ /
/_/\_\\_,_/\__/\_,_/_//_/\_,_/ v0.0.1
projectdiscovery.io
[WRN] Use with caution. You are responsible for your actions.
[WRN] Developers assume no liability and are not responsible for any misuse or damage.
https://www.youtube.com/
https://www.youtube.com/about/
https://www.youtube.com/about/press/
https://www.youtube.com/about/copyright/
https://www.youtube.com/t/contact_us/
https://www.youtube.com/creators/
https://www.youtube.com/ads/
https://www.youtube.com/t/terms
https://www.youtube.com/t/privacy
https://www.youtube.com/about/policies/
https://www.youtube.com/howyoutubeworks?utm_campaign=ytgen&utm_source=ythp&utm_medium=LeftNav&utm_content=txt&u=https%3A%2F%2Fwww.youtube.com %2Fhowyoutubeworks%3Futm_source%3Dythp%26utm_medium%3DLeftNav%26utm_campaign%3Dytgen
https://www.youtube.com/new
https://m.youtube.com/
https://www.youtube.com/s/desktop/4965577f/jsbin/desktop_polymer.vflset/desktop_polymer.js
https://www.youtube.com/s/desktop/4965577f/cssbin/www-main-desktop-home-page-skeleton.css
https://www.youtube.com/s/desktop/4965577f/cssbin/www-onepick.css
https://www.youtube.com/s/_/ytmainappweb/_/ss/k=ytmainappweb.kevlar_base.0Zo5FUcPkCg.L.B1.O/am=gAE/d=0/rs=AGKMywG5nh5Qp-BGPbOaI1evhF5BVGRZGA
https://www.youtube.com/opensearch?locale=en_GB
https://www.youtube.com/manifest.webmanifest
https://www.youtube.com/s/desktop/4965577f/cssbin/www-main-desktop-watch-page-skeleton.css
https://www.youtube.com/s/desktop/4965577f/jsbin/web-animations-next-lite.min.vflset/web-animations-next-lite.min.js
https://www.youtube.com/s/desktop/4965577f/jsbin/custom-elements-es5-adapter.vflset/custom-elements-es5-adapter.js
https://w ww.youtube.com/s/desktop/4965577f/jsbin/webcomponents-sd.vflset/webcomponents-sd.js
https://www.youtube.com/s/desktop/4965577f/jsbin/intersection-observer.min.vflset/intersection-observer.min.js
https://www.youtube.com/s/desktop/4965577f/jsbin/scheduler.vflset/scheduler.js
https://www.youtube.com/s/desktop/4965577f/jsbin/www-i18n-constants-en_GB.vflset/www-i18n-constants.js
https://www.youtube.com/s/desktop/4965577f/jsbin/www-tampering.vflset/www-tampering.js
https://www.youtube.com/s/desktop/4965577f/jsbin/spf.vflset/spf.js
https://www.youtube.com/s/desktop/4965577f/jsbin/network.vflset/network.js
https://www.youtube.com/howyoutubeworks/
https://www.youtube.com/trends/
https://www.youtube.com/jobs/
https://www.youtube.com/kids/
Standard crawling modality uses the standard go http library under the hood to handle HTTP requests/responses. This modality is much faster as it doesn't have the browser overhead. Still, it analyzes HTTP responses body as is, without any javascript or DOM rendering, potentially missing post-dom-rendered endpoints or asynchronous endpoint calls that might happen in complex web applications depending, for example, on browser-specific events.
Headless mode hooks internal headless calls to handle HTTP requests/responses directly within the browser context. This offers two advantages:
Headless crawling is optional and can be enabled using -headless
option.
Here are other headless CLI options -
katana -h headless
Flags:
HEADLESS:
-hl, -headless enable experimental headless hybrid crawling
-sc, -system-chrome use local installed chrome browser instead of katana installed
-sb, -show-browser show the browser on the screen with headless mode
-ho, -headless-options string[] start headless chrome with additional options
-nos, -no-sandbox start headless chrome in --no-sandbox mode
-noi, -no-incognito start headless chrome without incognito mode
-no-sandbox
Runs headless chrome browser with no-sandbox option, useful when running as root user.
katana -u https://tesla.com -headless -no-sandbox
-no-incognito
Runs headless chrome browser without incognito mode, useful when using the local browser.
katana -u https://tesla.com -headless -no-incognito
-headless-options
When crawling in headless mode, additional chrome options can be specified using -headless-options
, for example -
katana -u https://tesla.com -headless -system-chrome -headless-options --disable-gpu,proxy-server=http://127.0.0.1:8080
Crawling can be endless if not scoped, as such katana comes with multiple support to define the crawl scope.
-field-scope
Most handy option to define scope with predefined field name, rdn
being default option for field scope.
rdn
- crawling scoped to root domain name and all subdomains (e.g. *example.com
) (default)fqdn
- crawling scoped to given sub(domain) (e.g. www.example.com
or api.example.com
)dn
- crawling scoped to domain name keyword (e.g. example
)katana -u https://tesla.com -fs dn
-crawl-scope
For advanced scope control, -cs
option can be used that comes with regex support.
katana -u https://tesla.com -cs login
For multiple in scope rules, file input with multiline string / regex can be passed.
$ cat in_scope.txt
login/
admin/
app/
wordpress/
katana -u https://tesla.com -cs in_scope.txt
-crawl-out-scope
For defining what not to crawl, -cos
option can be used and also support regex input.
katana -u https://tesla.com -cos logout
For multiple out of scope rules, file input with multiline string / regex can be passed.
$ cat out_of_scope.txt
/logout
/log_out
katana -u https://tesla.com -cos out_of_scope.txt
-no-scope
Katana is default to scope *.domain
, to disable this -ns
option can be used and also to crawl the internet.
katana -u https://tesla.com -ns
-display-out-scope
As default, when scope option is used, it also applies for the links to display as output, as such external URLs are default to exclude and to overwrite this behavior, -do
option can be used to display all the external URLs that exist in targets scoped URL / Endpoint.
katana -u https://tesla.com -do
Here is all the CLI options for the scope control -
katana -h scope
Flags:
SCOPE:
-cs, -crawl-scope string[] in scope url regex to be followed by crawler
-cos, -crawl-out-scope string[] out of scope url regex to be excluded by crawler
-fs, -field-scope string pre-defined scope field (dn,rdn,fqdn) (default "rdn")
-ns, -no-scope disables host based default scope
-do, -display-out-scope display external endpoint from scoped crawling
Katana comes with multiple options to configure and control the crawl as the way we want.
-depth
Option to define the depth
to follow the urls for crawling, the more depth the more number of endpoint being crawled + time for crawl.
katana -u https://tesla.com -d 5
-js-crawl
Option to enable JavaScript file parsing + crawling the endpoints discovered in JavaScript files, disabled as default.
katana -u https://tesla.com -jc
-crawl-duration
Option to predefined crawl duration, disabled as default.
katana -u https://tesla.com -ct 2
-known-files
Option to enable crawling robots.txt
and sitemap.xml
file, disabled as default.
katana -u https://tesla.com -kf robotstxt,sitemapxml
-automatic-form-fill
Option to enable automatic form filling for known / unknown fields, known field values can be customized as needed by updating form config file at $HOME/.config/katana/form-config.yaml
.
Automatic form filling is experimental feature.
-aff, -automatic-form-fill enable optional automatic form filling (experimental)
There are more options to configure when needed, here is all the config related CLI options -
katana -h config
Flags:
CONFIGURATION:
-d, -depth int maximum depth to crawl (default 2)
-jc, -js-crawl enable endpoint parsing / crawling in javascript file
-ct, -crawl-duration int maximum duration to crawl the target for
-kf, -known-files string enable crawling of known files (all,robotstxt,sitemapxml)
-mrs, -max-response-size int maximum response size to read (default 2097152)
-timeout int time to wait for request in seconds (default 10)
-retry int number of times to retry the request (default 1)
-proxy string http/socks5 proxy to use
-H, -headers string[] custom header/cookie to include in request
-config string path to the katana configuration file
-fc, -form-config string path to custom form configuration file
-field
Katana comes with built in fields that can be used to filter the output for the desired information, -f
option can be used to specify any of the available fields.
-f, -field string field to display in output (url,path,fqdn,rdn,rurl,qurl,qpath,file,key,value,kv,dir,udir)
Here is a table with examples of each field and expected output when used -
FIELD | DESCRIPTION | EXAMPLE |
---|---|---|
url | URL Endpoint | https://admin.projectdiscovery.io/admin/login?user=admin&password=admin |
qurl | URL including query param | https://admin.projectdiscovery.io/admin/login.php?user=admin&password=admin |
qpath | Path including query param | /login?user=admin&password=admin |
path | URL Path | https://admin.projectdiscovery.io/admin/login |
fqdn | Fully Qualified Domain name | admin.projectdiscovery.io |
rdn | Root Domain name | projectdiscovery.io |
rurl | Root URL | https://admin.projectdiscovery.io |
file | Filename in URL | login.php |
key | Parameter keys in URL | user,password |
value | Parameter values in URL | admin,admin |
kv | Keys=Values in URL | user=admin&password=admin |
dir | URL Directory name | /admin/ |
udir | URL with Directory | https://admin.projectdiscovery.io/admin/ |
Here is an example of using field option to only display all the urls with query parameter in it -
katana -u https://tesla.com -f qurl -silent
https://shop.tesla.com/en_au?redirect=no
https://shop.tesla.com/en_nz?redirect=no
https://shop.tesla.com/product/men_s-raven-lightweight-zip-up-bomber-jacket?sku=1740250-00-A
https://shop.tesla.com/product/tesla-shop-gift-card?sku=1767247-00-A
https://shop.tesla.com/product/men_s-chill-crew-neck-sweatshirt?sku=1740176-00-A
https://www.tesla.com/about?redirect=no
https://www.tesla.com/about/legal?redirect=no
https://www.tesla.com/findus/list?redirect=no
You can create custom fields to extract and store specific information from page responses using regex rules. These custom fields are defined using a YAML config file and are loaded from the default location at $HOME/.config/katana/field-config.yaml
. Alternatively, you can use the -flc
option to load a custom field config file from a different location. Here is example custom field.
- name: email
type: regex
regex:
- '([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)'
- '([a-zA-Z0-9+._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)'
- name: phone
type: regex
regex:
- '\d{3}-\d{8}|\d{4}-\d{7}'
When defining custom fields, following attributes are supported:
The value of name attribute is used as the
-field
cli option value.
The type of custom attribute, currenly supported option -
regex
The part of the response to extract the information from. The default value is
response
, which includes both the header and body. Other possible values areheader
andbody
.
You can use this attribute to select a specific matched group in regex, for example:
group: 1
katana -u https://tesla.com -f email,phone
-store-field
To compliment field
option which is useful to filter output at run time, there is -sf, -store-fields
option which works exactly like field option except instead of filtering, it stores all the information on the disk under katana_field
directory sorted by target url.
katana -u https://tesla.com -sf key,fqdn,qurl -silent
$ ls katana_field/
https_www.tesla.com_fqdn.txt
https_www.tesla.com_key.txt
https_www.tesla.com_qurl.txt
The -store-field
option can be useful for collecting information to build a targeted wordlist for various purposes, including but not limited to:
-extension-match
Crawl output can be easily matched for specific extension using -em
option to ensure to display only output containing given extension.
katana -u https://tesla.com -silent -em js,jsp,json
-extension-filter
Crawl output can be easily filtered for specific extension using -ef
option which ensure to remove all the urls containing given extension.
katana -u https://tesla.com -silent -ef css,txt,md
Here are additional filter options -
-f, -field string field to display in output (url,path,fqdn,rdn,rurl,qurl,file,key,value,kv,dir,udir)
-sf, -store-field string field to store in per-host output (url,path,fqdn,rdn,rurl,qurl,file,key,value,kv,dir,udir)
-em, -extension-match string[] match output for given extension (eg, -em php,html,js)
-ef, -extension-filter string[] filter output for given extension (eg, -ef png,css)
It's easy to get blocked / banned while crawling if not following target websites limits, katana comes with multiple option to tune the crawl to go as fast / slow we want.
-delay
option to introduce a delay in seconds between each new request katana makes while crawling, disabled as default.
katana -u https://tesla.com -delay 20
-concurrency
option to control the number of urls per target to fetch at the same time.
katana -u https://tesla.com -c 20
-parallelism
option to define number of target to process at same time from list input.
katana -u https://tesla.com -p 20
-rate-limit
option to use to define max number of request can go out per second.
katana -u https://tesla.com -rl 100
-rate-limit-minute
option to use to define max number of request can go out per minute.
katana -u https://tesla.com -rlm 500
Here is all long / short CLI options for rate limit control -
katana -h rate-limit
Flags:
RATE-LIMIT:
-c, -concurrency int number of concurrent fetchers to use (default 10)
-p, -parallelism int number of concurrent inputs to process (default 10)
-rd, -delay int request delay between each request in seconds
-rl, -rate-limit int maximum requests to send per second (default 150)
-rlm, -rate-limit-minute int maximum number of requests to send per minute
Katana support both file output in plain text format as well as JSON which includes additional information like, source
, tag
, and attribute
name to co-related the discovered endpoint.
-output
By default, katana outputs the crawled endpoints in plain text format. The results can be written to a file by using the -output option.
katana -u https://example.com -no-scope -output example_endpoints.txt
-json
katana -u https://example.com -json -do | jq .
{
"timestamp": "2022-11-05T22:33:27.745815+05:30",
"endpoint": "https://www.iana.org/domains/example",
"source": "https://example.com",
"tag": "a",
"attribute": "href"
}
-store-response
The -store-response
option allows for writing all crawled endpoint requests and responses to a text file. When this option is used, text files including the request and response will be written to the katana_response directory. If you would like to specify a custom directory, you can use the -store-response-dir
option.
katana -u https://example.com -no-scope -store-response
$ cat katana_response/index.txt
katana_response/example.com/327c3fda87ce286848a574982ddd0b7c7487f816.txt https://example.com (200 OK)
katana_response/www.iana.org/bfc096e6dd93b993ca8918bf4c08fdc707a70723.txt http://www.iana.org/domains/reserved (200 OK)
Note:
-store-response
option is not supported in -headless
mode.
Here are additional CLI options related to output -
katana -h output
OUTPUT:
-o, -output string file to write output to
-sr, -store-response store http requests/responses
-srd, -store-response-dir string store http requests/responses to custom directory
-j, -json write output in JSONL(ines) format
-nc, -no-color disable output content coloring (ANSI escape codes)
-silent display output only
-v, -verbose display verbose output
-version display project version