Bot Detection¶
Implementations used for bot detection.
- searx.botdetection.get_network(real_ip: IPv4Address | IPv6Address, cfg: Config) IPv4Network | IPv6Network [source]¶
Returns the (client) network that real_ip is part of.
- searx.botdetection.get_real_ip(request: Request) str [source]¶
Returns the real IP of the request. Since not all proxies set all the HTTP headers, and incoming headers can be faked, the IP cannot always be determined correctly.
This function tries to get the remote IP in the order listed below. Additionally, some tests are done, and any inconsistencies or errors that are detected are logged.
The remote IP of the request is taken from (first match):
- searx.botdetection.too_many_requests(network: IPv4Network | IPv6Network, log_msg: str) Response | None [source]¶
Returns an HTTP 429 response object and writes an ERROR message to the ‘botdetection’ logger. This function is used in part by the filter methods to return the default Too Many Requests response.
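As a rough illustration of what get_network computes, the sketch below derives the network of an address with the standard ipaddress module. The helper name network_of and the prefix lengths (32 for IPv4, 48 for IPv6) are assumptions made for this example; the real function presumably reads its prefix lengths from the cfg argument.

```python
import ipaddress


def network_of(real_ip, ipv4_prefix=32, ipv6_prefix=48):
    """Return the network that contains real_ip (hypothetical helper).

    The prefix lengths are assumed defaults for this sketch only.
    """
    prefix = ipv4_prefix if real_ip.version == 4 else ipv6_prefix
    # strict=False masks the host bits instead of raising ValueError
    return ipaddress.ip_network(f"{real_ip}/{prefix}", strict=False)


if __name__ == "__main__":
    print(network_of(ipaddress.ip_address("192.168.1.77"), ipv4_prefix=24))
    print(network_of(ipaddress.ip_address("2001:db8:1:2::5")))
```

Grouping clients by network rather than by single address lets a limiter treat all addresses of one IPv6 allocation as the same client.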
IP lists¶
Method ip_lists¶
The ip_lists method implements IP block- and pass-lists.
[botdetection.ip_lists]
pass_ip = [
  '167.235.158.251', # IPv4 of check.searx.space
  '192.168.0.0/16',  # IPv4 private network
  'fe80::/10'        # IPv6 linklocal
]
block_ip = [
  '93.184.216.34',   # IPv4 of example.org
  '257.1.1.1',       # invalid IP --> will be ignored, logged in ERROR class
]
- searx.botdetection.ip_lists.block_ip(real_ip: IPv4Address | IPv6Address, cfg: Config) Tuple[bool, str] [source]¶
Checks whether real_ip falls within one of the members (addresses or networks) of the botdetection.ip_lists.block_ip list.
- searx.botdetection.ip_lists.pass_ip(real_ip: IPv4Address | IPv6Address, cfg: Config) Tuple[bool, str] [source]¶
Checks whether real_ip falls within one of the members (addresses or networks) of the botdetection.ip_lists.pass_ip list.
- searx.botdetection.ip_lists.SEARXNG_ORG = ['167.235.158.251', '2a01:04f8:1c1c:8fc2::/64']¶
Passlist of IPs from the SearXNG organization, e.g. check.searx.space.
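The list check described above can be sketched with the standard ipaddress module. This is a minimal illustration, not the module's actual code: the helper name ip_in_list, the logger name and the message strings are assumptions. Invalid entries such as '257.1.1.1' are logged as errors and skipped, as the configuration comment above describes.

```python
import ipaddress
import logging

logger = logging.getLogger("botdetection_sketch")  # hypothetical logger name


def ip_in_list(real_ip, entries):
    """Return (matched, message) for real_ip against a list of IPs/networks."""
    for entry in entries:
        try:
            # A plain address such as '93.184.216.34' becomes a /32 (or /128)
            net = ipaddress.ip_network(entry, strict=False)
        except ValueError:
            logger.error("invalid IP %s in list is ignored", entry)
            continue
        if real_ip.version == net.version and real_ip in net:
            return True, f"IP matches {net.compressed}"
    return False, f"IP {real_ip.compressed} is not in the list"


if __name__ == "__main__":
    block_ip = ["93.184.216.34", "257.1.1.1"]
    print(ip_in_list(ipaddress.ip_address("93.184.216.34"), block_ip))
```

Returning a (bool, str) pair mirrors the Tuple[bool, str] return type in the signatures above, so the caller can log the reason for a match.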
Rate limit¶
Method ip_limit¶
The ip_limit method counts requests from an IP in sliding windows. If
there are too many requests in a sliding window, the request is evaluated as a
bot request. This method requires a redis DB and needs an HTTP X-Forwarded-For
header. To protect privacy, only the hash value of an IP is stored in the redis
DB, and for a maximum of 10 minutes.
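The counting scheme can be sketched without redis. The example below is an in-memory stand-in, assuming a salted SHA-256 hash for the stored key; the names SECRET, hashed_key, SlidingWindowCounter and is_bot_request are hypothetical, and the real method keeps its counters in the redis DB.

```python
import hashlib
from collections import defaultdict, deque

SECRET = b"hypothetical-server-secret"  # assumption: a per-instance salt


def hashed_key(ip):
    """Only this hash is stored, never the clear-text IP."""
    return hashlib.sha256(SECRET + ip.encode()).hexdigest()


class SlidingWindowCounter:
    """In-memory stand-in for the redis-backed counters."""

    def __init__(self):
        self._hits = defaultdict(deque)

    def incr(self, key, window, now):
        """Record one hit at time `now` and return the hits in the window."""
        hits = self._hits[key]
        # Drop hits that have fallen out of the window
        while hits and hits[0] <= now - window:
            hits.popleft()
        hits.append(now)
        return len(hits)


def is_bot_request(counter, ip, now, max_requests=15, window=20):
    """True if the IP exceeded max_requests within the last `window` sec."""
    return counter.incr(hashed_key(ip), window, now) > max_requests


if __name__ == "__main__":
    counter = SlidingWindowCounter()
    flags = [is_bot_request(counter, "203.0.113.7", now=i * 0.1) for i in range(16)]
    print(flags.count(True))
```

The default values 15 and 20 are the BURST_MAX and BURST_WINDOW constants documented in this section.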
The link_token method can be used to investigate whether a request is
suspicious. To activate the link_token method in the ip_limit method, add the
following configuration:
[botdetection.ip_limit]
link_token = true
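If the link_token method is activated and a request is suspicious, the request
rates are reduced to the limits for suspicious requests (BURST_MAX_SUSPICIOUS
in the BURST_WINDOW and LONG_MAX_SUSPICIOUS in the LONG_WINDOW).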
To intercept bots that get their IPs from a range of IPs, there is a
SUSPICIOUS_IP_WINDOW. In this window the suspicious IPs are stored for a longer
time. IPs stored in this sliding window have a maximum of SUSPICIOUS_IP_MAX
accesses before they are blocked. As soon as the IP makes a request that is not
suspicious, the sliding window for this IP is dropped.
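The bookkeeping for suspicious IPs can be pictured as follows. This is a simplified sketch under stated assumptions: the class name SuspiciousIPTracker is hypothetical, the counter lives in a dict instead of the redis DB, and the expiry after SUSPICIOUS_IP_WINDOW seconds is left out for brevity.

```python
SUSPICIOUS_IP_MAX = 3  # value documented in this section


class SuspiciousIPTracker:
    """Counts suspicious requests per key; window expiry is omitted here."""

    def __init__(self, max_hits=SUSPICIOUS_IP_MAX):
        self.max_hits = max_hits
        self._hits = {}

    def update(self, key, suspicious):
        """Return True if the request should be blocked."""
        if not suspicious:
            # A non-suspicious request drops the window for this key
            self._hits.pop(key, None)
            return False
        self._hits[key] = self._hits.get(key, 0) + 1
        return self._hits[key] > self.max_hits


if __name__ == "__main__":
    tracker = SuspiciousIPTracker()
    print([tracker.update("net-a", True) for _ in range(4)])
```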
- searx.botdetection.ip_limit.API_MAX = 4¶
Maximum requests from one IP in the API_WINDOW
- searx.botdetection.ip_limit.API_WINDOW = 3600¶
Time (sec) before sliding window for API requests (format != html) expires.
- searx.botdetection.ip_limit.BURST_MAX = 15¶
Maximum requests from one IP in the BURST_WINDOW
- searx.botdetection.ip_limit.BURST_MAX_SUSPICIOUS = 2¶
Maximum of suspicious requests from one IP in the BURST_WINDOW
- searx.botdetection.ip_limit.BURST_WINDOW = 20¶
Time (sec) before sliding window for burst requests expires.
- searx.botdetection.ip_limit.LONG_MAX = 150¶
Maximum requests from one IP in the LONG_WINDOW
- searx.botdetection.ip_limit.LONG_MAX_SUSPICIOUS = 10¶
Maximum suspicious requests from one IP in the LONG_WINDOW
- searx.botdetection.ip_limit.LONG_WINDOW = 600¶
Time (sec) before the longer sliding window expires.
- searx.botdetection.ip_limit.SUSPICIOUS_IP_MAX = 3¶
Maximum requests from one suspicious IP in the SUSPICIOUS_IP_WINDOW.
- searx.botdetection.ip_limit.SUSPICIOUS_IP_WINDOW = 2592000¶
Time (sec) before sliding window for one suspicious IP expires.
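How the constants above fit together can be shown with a small table-driven check. The function names limits_for and exceeded and the shape of the counts mapping are assumptions made for this sketch; only the constant values are taken from the documentation above.

```python
API_MAX, API_WINDOW = 4, 3600
BURST_MAX, BURST_MAX_SUSPICIOUS, BURST_WINDOW = 15, 2, 20
LONG_MAX, LONG_MAX_SUSPICIOUS, LONG_WINDOW = 150, 10, 600


def limits_for(suspicious, is_api):
    """Return the (window, max_requests) pairs that apply to a request."""
    limits = []
    if is_api:  # API requests are those with format != html
        limits.append((API_WINDOW, API_MAX))
    limits.append((BURST_WINDOW, BURST_MAX_SUSPICIOUS if suspicious else BURST_MAX))
    limits.append((LONG_WINDOW, LONG_MAX_SUSPICIOUS if suspicious else LONG_MAX))
    return limits


def exceeded(counts, limits):
    """counts maps a window length (sec) to the requests seen in that window."""
    return any(counts.get(window, 0) > maximum for window, maximum in limits)


if __name__ == "__main__":
    print(limits_for(suspicious=True, is_api=False))
    print(exceeded({BURST_WINDOW: 3}, limits_for(suspicious=True, is_api=False)))
```

With these values, three requests within 20 seconds are acceptable for a normal client but already exceed the burst limit for a suspicious one.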
Method link_token¶
The link_token method evaluates a request as suspicious if the URL
/client<token>.css is not requested by the client. By adding a random component
(the token) to the URL, a bot cannot send a ping by requesting a static URL.
Note
This method requires a redis DB and needs an HTTP X-Forwarded-For header.
To use this method, a flask URL route needs to be added:
@app.route('/client<token>.css', methods=['GET', 'POST'])
def client_token(token=None):
    link_token.ping(request, token)
    return Response('', mimetype='text/css')
And in the HTML template from flask a stylesheet link is needed (the value of
link_token comes from get_token):
<link rel="stylesheet"
      href="{{ url_for('client_token', token=link_token) }}"
      type="text/css" >
- searx.botdetection.link_token.get_ping_key(network: IPv4Network | IPv6Network, request: Request) str [source]¶
Generates a hashed key that corresponds (more or less) to a web browser session in a network.
- searx.botdetection.link_token.get_token() str [source]¶
Returns current token. If there is no currently active token a new token is generated randomly and stored in the redis DB.
- searx.botdetection.link_token.is_suspicious(network: IPv4Network | IPv6Network, request: Request, renew: bool = False)[source]¶
Checks whether a valid ping exists for this (client) network; if not, this request is rated as suspicious. If a valid ping exists and the argument renew is True, the expire time of this ping is reset to PING_LIVE_TIME.
- searx.botdetection.link_token.ping(request: Request, token: str)[source]¶
This function is called by a request to URL /client<token>.css. If token is valid, a PING_KEY for the client is stored in the DB. The expire time of this ping-key is PING_LIVE_TIME.
- searx.botdetection.link_token.PING_KEY = 'SearXNG_limiter.ping'¶
Prefix of all ping-keys generated by get_ping_key
- searx.botdetection.link_token.PING_LIVE_TIME = 3600¶
Lifetime (sec) of the ping-key from a client (request)
- searx.botdetection.link_token.TOKEN_KEY = 'SearXNG_limiter.token'¶
Key under which the current token is stored in the DB
- searx.botdetection.link_token.TOKEN_LIVE_TIME = 600¶
Lifetime (sec) of limiter’s CSS token.
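The interplay of get_token, ping and is_suspicious can be sketched with an in-memory store that expires keys, standing in for the redis DB. Everything except the two lifetime constants is an assumption for illustration: the TokenStore class, the key names, the token length and the client_key argument (which takes the place of the hashed key from get_ping_key).

```python
import secrets
import string
import time

TOKEN_LIVE_TIME = 600   # values documented above
PING_LIVE_TIME = 3600


class TokenStore:
    """Minimal key/value store with expiry (stand-in for the redis DB)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (value, expires_at)

    def set(self, key, value, ttl):
        self._data[key] = (value, self._clock() + ttl)

    def get(self, key):
        item = self._data.get(key)
        if item is None or item[1] <= self._clock():
            self._data.pop(key, None)
            return None
        return item[0]


def get_token(store):
    """Return the current token, creating a random one if none is active."""
    token = store.get("token")
    if token is None:
        alphabet = string.ascii_letters + string.digits
        token = "".join(secrets.choice(alphabet) for _ in range(16))
        store.set("token", token, TOKEN_LIVE_TIME)
    return token


def ping(store, client_key, token):
    """Store a ping for the client if the token from the URL is valid."""
    if token == get_token(store):
        store.set("ping:" + client_key, 1, PING_LIVE_TIME)


def is_suspicious(store, client_key, renew=False):
    """A client without a valid ping is rated as suspicious."""
    if store.get("ping:" + client_key) is None:
        return True
    if renew:
        store.set("ping:" + client_key, 1, PING_LIVE_TIME)
    return False
```

A client that renders the HTML page fetches the stylesheet and thereby registers its ping; a bot that only requests the result page never does and stays suspicious.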
Probe HTTP headers¶
Method http_accept¶
The http_accept method evaluates a request as the request of a bot if the
Accept header does not contain text/html.
Method http_accept_encoding¶
The http_accept_encoding method evaluates a request as the request of a
bot if the Accept-Encoding header ..
- does not contain gzip AND deflate (if both values are missing)
- does not contain text/html
Method http_accept_language¶
The http_accept_language method evaluates a request as the request of a bot
if the Accept-Language header is unset.
Method http_connection¶
The http_connection method evaluates a request as the request of a bot if
the Connection header is set to close.
Method http_user_agent¶
The http_user_agent method evaluates a request as the request of a bot if
the User-Agent header is unset or matches the regular expression USER_AGENT.
- searx.botdetection.http_user_agent.USER_AGENT = '(unknown|[Cc][Uu][Rr][Ll]|[wW]get|Scrapy|splash|JavaFX|FeedFetcher|python-requests|Go-http-client|Java|Jakarta|okhttp|HttpClient|Jersey|Python|libwww-perl|Ruby|SynHttpClient|UniversalFeedParser|Googlebot|GoogleImageProxy|bingbot|Baiduspider|yacybot|YandexMobileBot|YandexBot|Yahoo! Slurp|MJ12bot|AhrefsBot|archive.org_bot|msnbot|MJ12bot|SeznamBot|linkdexbot|Netvibes|SMTBot|zgrab|James BOT|Sogou|Abonti|Pixray|Spinn3r|SemrushBot|Exabot|ZmEu|BLEXBot|bitlybot|HeadlessChrome|Mozilla/5\\.0\\ \\(compatible;\\ Farside/0\\.1\\.0;\\ \\+https://farside\\.link\\)|.*PetalBot.*)'¶
Regular expression that matches the User-Agent of known bots
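The header probes in this section can be combined into one illustrative function that works on a plain dict of headers. This is a sketch, not the module's code: probe_headers and USER_AGENT_SKETCH are hypothetical names, the regular expression is a shortened subset of USER_AGENT, and treating a missing User-Agent as 'unknown' and matching from the start of the string are assumptions.

```python
import re

# Shortened subset of the documented USER_AGENT expression (assumption)
USER_AGENT_SKETCH = re.compile(
    r"(unknown|[Cc][Uu][Rr][Ll]|[wW]get|python-requests|Googlebot"
    r"|HeadlessChrome|.*PetalBot.*)"
)


def probe_headers(headers):
    """Return the names of the probes that rate the request as a bot."""
    h = {name.lower(): value for name, value in headers.items()}
    failed = []
    if "text/html" not in h.get("accept", ""):
        failed.append("http_accept")
    encodings = [e.strip() for e in h.get("accept-encoding", "").split(",")]
    if "gzip" not in encodings and "deflate" not in encodings:
        failed.append("http_accept_encoding")
    if not h.get("accept-language", "").strip():
        failed.append("http_accept_language")
    if h.get("connection", "").strip() == "close":
        failed.append("http_connection")
    if USER_AGENT_SKETCH.match(h.get("user-agent", "unknown")):
        failed.append("http_user_agent")
    return failed


if __name__ == "__main__":
    print(probe_headers({"User-Agent": "curl/8.0", "Accept": "*/*"}))
```

Returning the list of failed probes instead of a single boolean makes it easy to log which header gave the client away.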
Config¶
Configuration class Config with deep-update, schema validation and deprecated
names.
The Config class implements a configuration that is based on structured
dictionaries. The configuration schema is defined in a dictionary structure and
the configuration data is given in a dictionary structure.
- exception searx.botdetection.config.SchemaIssue(level: Literal['warn', 'invalid'], msg: str)[source]¶
Exception to store and/or raise a message from a schema issue.
- class searx.botdetection.config.Config(cfg_schema: Dict, deprecated: Dict[str, str])[source]¶
Base class used for configuration
- get(name: str, default: ~typing.Any = <UNSET>, replace: bool = True) Any [source]¶
Returns the value to which name points in the configuration.
If there is no such name in the config and the default is UNSET, a KeyError is raised.
- path(name: str, default=<UNSET>)[source]¶
Get a pathlib.Path object from a config string.
- pyobj(name, default=<UNSET>)[source]¶
Get the Python object referred to by the fully qualified name (FQN) in the config string.
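The two ideas behind the Config class, deep-update of nested dictionaries and lookup with an UNSET sentinel, can be sketched in a few lines. The function names deep_update and get_value and the dotted-name lookup are assumptions for this example; schema validation and deprecated names are not covered.

```python
UNSET = object()  # sentinel: distinguishes "no default given" from None


def deep_update(base, upd):
    """Merge upd into base recursively and return base."""
    for key, value in upd.items():
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            deep_update(base[key], value)
        else:
            base[key] = value
    return base


def get_value(cfg, name, default=UNSET):
    """Look up a dotted name such as 'botdetection.ip_limit.link_token'."""
    node = cfg
    for part in name.split("."):
        if not isinstance(node, dict) or part not in node:
            if default is UNSET:
                raise KeyError(name)
            return default
        node = node[part]
    return node


if __name__ == "__main__":
    cfg = {"botdetection": {"ip_limit": {"link_token": False, "filter_link_local": False}}}
    deep_update(cfg, {"botdetection": {"ip_limit": {"link_token": True}}})
    print(get_value(cfg, "botdetection.ip_limit.link_token"))
```

Because the update is deep, a user setting that overrides one key leaves sibling keys from the schema defaults untouched.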