Introducing Katana: The CLI web crawler from ProjectDiscovery
Table of Contents
- What is Katana?
- Tool integrations
- What is web crawling?
- Installation
- Binary
- Docker
- Options
- Configuration
- Headless
- Scope
- Filter
- Rate-limit
- Output
- Different inputs
- Crawling modes
- Standard mode
- Headless mode
- Controlling your scope
- Field-scope
- Crawl-scope
- Crawl-out-scope
- No-scope
- Making Katana a crawler for you with configuration
- Depth
- Crawling JavaScript
- Crawl duration
- Known files
- Automatic form fill
- Handling your output
- Field
- Store-field
- Extension-match & extension-filter
- JSON
- Rate limiting and delays
- Delay
- Concurrency
- Parallelism
- Rate-limit
- Rate-limit-minute
- Chaining Katana with other ProjectDiscovery tools
- Conclusion
What is Katana?
Katana is a command-line interface (CLI) web crawling tool written in Golang. It is designed to crawl websites to gather information and endpoints. One of the defining features of Katana is its ability to use headless browsing to crawl applications. This means that it can crawl single-page applications (SPAs) built using technologies such as JavaScript, Angular, or React. These types of applications are becoming increasingly common, but can be difficult to crawl using traditional tools. By using headless browsing, Katana is able to access and gather information from these types of applications more effectively.
Katana is designed to be CLI-friendly, fast, and efficient, with a simple output format. This makes it an attractive option for those looking to use the tool as part of an automation pipeline. Furthermore, regular updates and maintenance ensure that this tool remains a valuable and indispensable part of your hacker arsenal for years to come.
Tool integrations
Katana is an excellent tool for several reasons, one of which is its simple input/output formats. These formats are easy to understand and use, allowing users to quickly and easily integrate Katana into their workflow. Katana is designed to be easily integrated with other tools in the ProjectDiscovery suite, as well as other widely used CLI-based recon tools.
What is web crawling?
Any search engine you use today is populated using web crawlers. A web crawler indexes web applications by automating the “click every button” approach to discovering paths, scripts and other resources. Web application indexing is an important step in uncovering an application’s attack surface.
Installation
Katana offers a few different installation methods: downloading the pre-compiled binary, compiling the binary using Go, or using Docker.
Binary
There are two ways to install the binary directly onto your system:
- Download the pre-compiled binary from the release page.
- Run go install:
cli
go install github.com/projectdiscovery/katana/cmd/katana@latest
Docker
1. Install/update the docker image to the latest tag:
cli
docker pull projectdiscovery/katana:latest
2. Running Katana
a. Normal mode:
cli
docker run projectdiscovery/katana:latest -u https://tesla.com
b. Headless mode:
cli
docker run projectdiscovery/katana:latest -u https://tesla.com -system-chrome -headless
Options
Here are the raw options for your perusal – we'll take a closer look at each below!
Configuration
-d, -depth
Defines maximum crawl depth, ex: -d 2
-jc, -js-crawl
Enables endpoint parsing/crawling from JS files
-ct, -crawl-duration
Maximum time to crawl the target for, ex: -ct 100
-kf, -known-files
Enable crawling for known files (all, robotstxt, sitemapxml, etc.), ex: -kf robotstxt,sitemapxml
-mrs, -max-response-size
Maximum response size to read, ex: -mrs 200000
-timeout
Time to wait for request in seconds, ex: -timeout 5
-aff, -automatic-form-fill
Enable optional automatic form filling. This is still experimental
-retry
Number of times to retry the request, ex: -retry 2
-proxy
HTTP/socks5 proxy to use, ex: -proxy http://127.0.0.1:8080
-H, -headers
Include custom headers/cookies with your request
-config
Path to the katana configuration file, ex: -config /home/g0lden/katana-config.yaml
-fc, -form-config
Path to the form configuration file, ex: -fc /home/g0lden/form-config.yaml
Headless
-hl, -headless
Enable headless hybrid crawling. This is experimental
-sc, -system-chrome
Use a locally installed Chrome browser instead of katana's
-sb, -show-browser
Show the browser on screen when in headless mode
-ho, -headless-options
Start headless chrome with additional options
-nos, -no-sandbox
Start headless chrome in --no-sandbox mode
Scope
-cs, -crawl-scope
In-scope URL regex to be followed by crawler, ex: -cs login
Out-of-scope URL regex to be excluded by crawler, ex: -cos logout
-fs, -field-scope
Pre-defined scope field (dn, rdn, fqdn), ex: -fs dn
-ns, -no-scope
Disables host-based default scope, allowing for internet scanning
-do, -display-out-scope
Display external endpoints found from crawling
Filter
-f, -field
Field to display in output (url,path,fqdn,rdn,rurl,qurl,qpath,file,key,value,kv,dir,udir), ex: -f qurl
-sf, -store-field
Field to store in selected output option (url,path,fqdn,rdn,rurl,qurl,qpath,file,key,value,kv,dir,udir), ex: -sf qurl
-em, -extension-match
Match output for given extension, ex: -em php,html,js
-ef, -extension-filter
Filter output for given extension, ex: -ef png,css
Rate-limit
-c, -concurrency
Number of concurrent fetchers to use, ex: -c 50
-p, -parallelism
Number of concurrent inputs to process, ex: -p 50
-rd, -delay
Request delay between each request in seconds, ex: -rd 3
-rl, -rate-limit
Maximum requests to send per second, ex: -rl 150
-rlm, -rate-limit-minute
Maximum number of requests to send per minute, ex: -rlm 1000
Output
-o, -output
File to write output to, ex: -o findings.txt
-j, -json
Write output in JSONL(ines) format
-nc, -no-color
Disable output content coloring (ANSI escape codes)
-silent
Display output only
-v, -verbose
Display verbose output
-version
Display project version
Different inputs
There are four different ways to give katana input:
1. URL input
cli
katana -u https://tesla.com
2. Multiple URL input
cli
katana -u https://tesla.com,https://google.com
3. List input
cli
katana -list url_list.txt
4. STDIN input (piped)
cli
echo "https://tesla.com" | katana
Crawling modes
Standard mode
Standard mode uses the standard Golang HTTP library to make requests. The upside of this mode is that there is no browser overhead, so it’s much faster than headless mode. The downside is that the HTTP library in Go analyzes the HTTP response as is and any dynamic JavaScript or DOM (Document Object Model) manipulations won’t load, causing you to miss post-rendered endpoints or asynchronous endpoint calls.
If you are confident the application you are crawling does not use complex DOM rendering or has asynchronous events, then this mode is the one to use as it is faster. Standard mode is the default:
cli
katana -u https://tesla.com
Headless mode
Headless mode uses internal headless calls to handle HTTP requests/responses within a browser context. This solves two major issues:
- The HTTP fingerprint from a headless browser will be identified and accepted as a real browser – including TLS and user agent.
- Better coverage by analyzing raw HTTP responses as well as the browser-rendered response with JavaScript.
If you are crawling a modern, complex application that utilizes DOM manipulation and/or asynchronous events, consider using headless mode with the -headless option:
cli
katana -u https://tesla.com -headless
Controlling your scope
Controlling your scope is important for returning valuable results. Katana has four main ways to control the scope of your crawl:
- Field-scope
- Crawl-scope
- Crawl-out-scope
- No-scope
Field-scope
When setting the field scope, you have three options:
- rdn - crawling scoped to root domain name and all subdomains (default)
a. Running katana -u https://tesla.com -fs rdn returns anything that matches *.tesla.com
- fqdn - crawling scoped to the given (sub)domain
a. Running katana -u https://tesla.com -fs fqdn returns nothing, because no URLs containing only “tesla.com” are found
b. Running katana -u https://www.tesla.com -fs fqdn only returns URLs that are on the “www.tesla.com” domain
- dn - crawling scoped to the domain name keyword
a. Running katana -u https://tesla.com -fs dn returns anything that contains the domain name itself. In this example, that is “tesla”. Notice how the results returned a totally new domain, suppliers.teslamotors.com
Crawl-scope
The crawl-scope (-cs) flag works as a regex filter, only returning matching URLs. Look at what happens when filtering for “shop” on tesla.com: only results with the word “shop” are returned.
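A minimal sketch of such a run, using the same example target:
cli
katana -u https://tesla.com -cs shop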
Crawl-out-scope
Similarly, the crawl-out-scope (-cos) flag works as a filter that removes any URLs matching the regex given after the flag. Filtering for “shop” removes all URLs that contain the string “shop” from the output.
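For instance, the out-of-scope equivalent of the run above would be:
cli
katana -u https://tesla.com -cos shop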
No-scope
Setting the no-scope flag allows the crawler to start at the target and crawl out across the internet. Running katana -u https://tesla.com -ns will pick up domains beyond the starting target “tesla.com”, since the crawler follows any links it finds.
Making Katana a crawler for you with configuration
Depth
Define the depth of your crawl. The higher the depth, the more recursive crawls you will get. Be aware this can lead to long crawl times against large web applications.
cli
katana -u https://tesla.com -d 5
Crawling JavaScript
For web applications with many JavaScript files, turn on JavaScript parsing/crawling. This is turned off by default, but enabling it allows the crawler to crawl and parse JavaScript files, which can hide all kinds of useful endpoints.
cli
katana -u https://tesla.com -jc
Crawl duration
Set a predefined crawl duration and the crawler will return all URLs it finds in the specified time.
cli
katana -u https://tesla.com -ct 2
Known files
Find and crawl any robots.txt or sitemap.xml files that are present. This functionality is turned off by default.
cli
katana -u https://tesla.com -kf robotstxt,sitemapxml
Automatic form fill
Enables automatic form-filling for known and unknown fields. Known field values can be customized in the form config file (default location: $HOME/.config/katana/form-config.yaml).
cli
katana -u https://tesla.com -aff
Handling your output
Field
The field flag is used to filter the output for the specific information you are looking for. ProjectDiscovery's documentation includes a very detailed table of all the fields with examples. Look what happens when filtering the output of the crawl to only return URLs with query parameters in them.
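A sketch of that filter, again using the example target:
cli
katana -u https://tesla.com -f qurl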
Store-field
The store-field flag does the same thing as the field flag we just went over, except that it filters the output that is being stored in the file of your choice. It is awesome that they are split up. Between the store-field flag and the field flag above, you can make the data you see and the data you store different if needed.
cli
katana -u https://tesla.com -sf key,fqdn,qurl
Extension-match & extension-filter
You can use the extension-match flag to only return URLs that end with your chosen extensions:
cli
katana -u https://tesla.com -silent -em js,jsp,json
If you would rather filter out the file extensions you DON'T want in the output, use the extension-filter flag:
cli
katana -u https://tesla.com -silent -ef css,txt,md
JSON
Katana has a JSON flag that outputs results in JSONL format, including the source, tag, and attribute name related to each discovered endpoint.
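For example, combining the JSON flag with the output flag covered above (the output file name here is purely illustrative):
cli
katana -u https://tesla.com -j -o results.json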
Rate limiting and delays
Delay
The delay flag allows you to set a delay (in seconds) between requests while crawling. This feature is turned off by default.
cli
katana -u https://tesla.com -delay 20
Concurrency
The concurrency flag is used to set the number of URLs per target to fetch at a time. Notice that this flag is used along with the parallelism flag to create the total concurrency model.
cli
katana -u https://tesla.com -c 20
Parallelism
The parallelism flag is used to set the number of targets to be processed at one time. If you only have one target, then there is no need to set this flag.
cli
katana -u https://tesla.com -p 20
Rate-limit
This flag allows you to set the maximum number of requests the crawler sends per second.
cli
katana -u https://tesla.com -rl 100
Rate-limit-minute
A rate-limiting flag similar to the one above, but used to set a maximum number of requests per minute.
cli
katana -u https://tesla.com -rlm 500
Chaining Katana with other ProjectDiscovery tools
Since katana can take input from STDIN, it is straightforward to chain katana with the other tools that ProjectDiscovery has released. A good example of this is:
cli
subfinder -d tesla.com -silent | httpx -silent | katana
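As a further sketch combining options covered earlier (the file names here are illustrative), a chained run can crawl a whole list of live hosts and save the discovered endpoints:
cli
cat domains.txt | httpx -silent | katana -silent -d 3 -jc -o endpoints.txt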
Conclusion
Hopefully, this has excited you to go out and crawl the planet. With all the options available, you should have no problem fitting this tool into your workflows. ProjectDiscovery has made this wonderful web crawler to cover many sore spots created by crawlers of the past. Katana makes crawling look like running!
Author – Gunnar Andrews, @g0lden1