chuyskywalker
Admin
+2,439|6818|"Frisco"

This evening I was targeted by a data scraping program: basically a bot that relentlessly pulls data from my server at breakneck speed. This pushed my server far past its normal 10-20 MySQL queries per second, peaking at 150 MySQL queries per second.

While a part of me is very proud that the server and site handled 10 times the traffic without batting an eye, I am extremely upset by the behaviour of this person. From here on out I will be actively monitoring the server for people running data scraping programs and I will ban IP addresses if needed.
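For anyone curious what that kind of monitoring could look like, here is a minimal sketch. It assumes a hypothetical list of (timestamp, IP) request records, for example parsed from an access log; the function name, the 10-second window and the 100-request threshold are made-up illustration values, not what the site actually uses.

```python
from collections import defaultdict, deque


def find_heavy_hitters(requests, window_seconds=10, max_requests=100):
    """Return the set of IPs that exceed max_requests inside a sliding window.

    `requests` is an iterable of (timestamp_seconds, ip) tuples sorted by time.
    The window size and threshold are illustrative assumptions.
    """
    recent = defaultdict(deque)  # ip -> timestamps still inside the window
    flagged = set()
    for ts, ip in requests:
        window = recent[ip]
        window.append(ts)
        # Drop timestamps that have fallen out of the window.
        while window and ts - window[0] >= window_seconds:
            window.popleft()
        if len(window) > max_requests:
            flagged.add(ip)
    return flagged
```

A flagged address is only a candidate for a ban; a legitimate crawler can trip the same threshold, so it is worth checking who owns the address first.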

I put a lot of time and effort into running this site and pay for it out of pocket (with the help of ads) and I do not appreciate, and will not allow, people to ruthlessly steal my content.
TriggerHappy998
just nothing
+387|6818|-
Busted!
Reign_Of_Chaos
Member
+0|6797
did a trace on that IP and found this

OrgName:    Google Inc.
OrgID:      GOGL
Address:    1600 Amphitheatre Parkway
City:       Mountain View
StateProv:  CA
PostalCode: 94043
Country:    US

NetRange:   66.249.64.0 - 66.249.95.255
CIDR:       66.249.64.0/19
NetName:    GOOGLE
NetHandle:  NET-66-249-64-0-1
Parent:     NET-66-0-0-0-0
NetType:    Direct Allocation
NameServer: NS1.GOOGLE.COM
NameServer: NS2.GOOGLE.COM
Comment:   
RegDate:    2004-03-05
Updated:    2004-11-10

OrgTechHandle: ZG39-ARIN
OrgTechName:   Google Inc.
OrgTechPhone:  +1-650-318-0200
OrgTechEmail:  [email protected]

  ARIN WHOIS database, last updated 2005-08-18 19:10
  Enter ? for additional hints on searching ARIN's WHOIS database.

rofl
NM156
The H4xor Mod
+161|6817|North Texas
LOL.... It's googlebot.com! Disallow robots...

Code:

C:\>ping -a 66.249.65.67

Pinging crawl-66-249-65-67.googlebot.com [66.249.65.67] with 32 bytes of data:

Reply from 66.249.65.67: bytes=32 time=45ms TTL=237
Reply from 66.249.65.67: bytes=32 time=43ms TTL=237
Reply from 66.249.65.67: bytes=32 time=49ms TTL=237
Reply from 66.249.65.67: bytes=32 time=52ms TTL=237

Ping statistics for 66.249.65.67:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 43ms, Maximum = 52ms, Average = 47ms
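The same check can be scripted. The sketch below assumes two things taken from this thread: the 66.249.64.0/19 range from the WHOIS output and the `.googlebot.com` hostname suffix from the ping. The function names are hypothetical, and the lookup functions are injectable so the logic can be tried without touching the network. It reverse-resolves the IP, checks the suffix, then resolves the hostname forward again to confirm it maps back to the same IP.

```python
import ipaddress
import socket

# Range copied from the WHOIS output posted in this thread.
GOOGLE_NET = ipaddress.ip_network("66.249.64.0/19")


def in_whois_range(ip):
    """True if ip falls inside the range listed in the WHOIS record."""
    return ipaddress.ip_address(ip) in GOOGLE_NET


def looks_like_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Reverse-resolve ip, check the hostname suffix, then forward-confirm."""
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]
    try:
        host = reverse_lookup(ip)
        # Assumption: only the suffix seen in this thread is accepted.
        if not host.endswith(".googlebot.com"):
            return False
        # The hostname must resolve back to the IP we started from.
        return ip in forward_lookup(host)
    except OSError:
        # Covers failed DNS lookups (socket.herror / socket.gaierror).
        return False
```

The forward confirmation matters because anyone who controls reverse DNS for their own address block can make it claim any hostname.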
chuyskywalker
Admin
+2,439|6818|"Frisco"

Holy crap, google beat the shit outta my site. Turns out it wasn't the only one. After banning that bot for a minute, another IP showed up. Banned that one too and unbanned google -- seems to have tamed the gbot down. Crazy.
Kushiel
Human Shield
+1|6800|Boynton Beach, FL

chuyskywalker wrote:

Holy crap, google beat the shit outta my site. Turns out it wasn't the only one. After banning that bot for a minute, another IP showed up. Banned that one too and unbanned google -- seems to have tamed the gbot down. Crazy.
What can you say Chuy... you're popular.. lol Damn.. now if only in high school j/k
Krauser98
Extra Green Please!
+53|6801|USA! USA! USA!
I think Google is going to try to take over the world, but in a useful and helpful way.
priznat
Member
+0|6818
I seem to recall you can contact google and tell them to NOT do that to your site, they are pretty good about that kind of thing, apparently..

At least you are well indexed now though
lowyaukee
Member
+2|6805
Frankly, I don't understand the stuff above.  I just want to say that I love your work very much: this site, the stats, the forum.

Wish you all the best!
midgetspy
Member
+3|6799
To prevent robots from spidering certain parts (or all) of your site, include a robots.txt in the root:

http://www.robotstxt.org/wc/norobots.html
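As a rough illustration of how such a file is interpreted, the sketch below feeds a sample robots.txt to Python's standard `urllib.robotparser` and asks whether a crawler may fetch a URL. The `/private/` path and the `allowed` helper are hypothetical; nothing here reflects the rules this site actually serves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: every crawler is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(url, agent="Googlebot", robots_txt=ROBOTS_TXT):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


print(allowed("http://www.bf2s.com/"))              # root is not disallowed
print(allowed("http://www.bf2s.com/private/page"))  # falls under /private/
```

Keep in mind that robots.txt is a request, not an enforcement mechanism: well-behaved crawlers honor it, but a scraper that ignores it still has to be blocked some other way.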

Nic
B.Schuss
I'm back, baby... ( sort of )
+664|6812|Cologne, Germany

I am not experienced in this field of tech stuff. so maybe someone could explain to me why google would want to do that to other sites ?

Are they interested in your user database ?

Last edited by B.Schuss (2005-08-19 00:07:47)

priznat
Member
+0|6818
Google bots "crawl" over the internet, indexing sites and taking a cached version, like a "snapshot"..

This is part of the way search engines like google can access search results so quick, it has local versions of a lot of sites filed away on its own server.. Just updates every once in a while with a new freshly scraped version..

As is my understanding, anyway..
B.Schuss
I'm back, baby... ( sort of )
+664|6812|Cologne, Germany

I see. thx for that.

btw, is such behaviour legal ?
chuyskywalker
Admin
+2,439|6818|"Frisco"

No no, sorry. Google contributed, but there was another culprit really at fault (who I have identified and contacted regarding this incident).

Also, google has to spider my pages in order to target the ads to the content, so it makes sense.
xbw_shane
Member
+-1|6818|Australia
yer i have Google's bots crawl through my own forum from time to time. i would love to see how much bandwidth they pull from my own server or how it affects my statistics. i dont have google ads but submitted my URL to it a while back.

of course i could stop them though if i really wanted as someone else mentioned.
CaffeineJunkie
Member
+0|6808|CT, USA
@Skywalker, just add a robots.txt that tells google not to index your sites past /

Or do they have to?  Google might even index all 100,000 people.. Wow.. that would hurt.
NiMhurchu
Suicide Operations
+13|6818|Germany
I think Jeff's way of coding the URL might add to the problem, too.
A searchbot like Google tries to catch every page in every sub-sub-subfolder -- and Jeff's URL is made up like that: http://www.bf2s.com/player/Name1/vs/Name2  for example.
No offense, though, maybe there is a way with the robots.txt. Maybe it can be coded dynamically, so that if a subfolder is loaded, a robots.txt is handed out to the bot dynamically.
chuyskywalker
Admin
+2,439|6818|"Frisco"

NiMhurchu wrote:

I think Jeff's way of coding the URL might add to the problem, too.
A searchbot like Google tries to catch every page in every sub-sub-subfolder -- and Jeff's URL is made up like that: http://www.bf2s.com/player/Name1/vs/Name2  for example.
Unless linked directly to that page, google wouldn't find it, seeing as the only way to get to the comparison pages is via a POST form submission. However, player pages do get spidered. Not too big a deal, it just has to be that way so the ads can work.
CaffeineJunkie
Member
+0|6808|CT, USA
http://www.google.com/search?hl=en& … s.com+bf2s

Think about it, google takes the first page, finds all these top players.. moves on to their pages.. each one of those pages links to more players (most killed, etc) and so on and so forth... Google indexed a lot of pages.
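To make the link-following idea concrete, here is a toy breadth-first crawl over an in-memory link graph. The `SITE` dictionary is a made-up stand-in for fetching and parsing real pages, and the player names are hypothetical; a real crawler is far more involved.

```python
from collections import deque


def crawl(start, links):
    """Breadth-first walk over a link graph, returning pages in visit order.

    `links` maps a page to the pages it links to; it stands in for
    downloading a page and extracting its anchors.
    """
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:  # never queue the same page twice
                seen.add(target)
                queue.append(target)
    return order


# Hypothetical site: a front page listing top players, each linking to others.
SITE = {
    "/": ["/player/A", "/player/B"],
    "/player/A": ["/player/B", "/player/C"],
    "/player/B": ["/player/D"],
    "/player/C": ["/player/A"],
}

print(crawl("/", SITE))
```

Starting from two links on the front page, the walk reaches every player page that is connected to them, which is why a stats site with many cross-linked profiles gives a crawler so much to fetch.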
haligan
Member
+-1|6797

CaffeineJunkie wrote:

http://www.google.com/search?hl=en&hs=87n&lr=&safe=off&client=firefox-a&rls=org.mozilla:en-US:official&q=+site:bf2s.com+bf2s

Think about it, google takes the first page, finds all these top players.. moves on to their pages.. each one of those pages links to more players (most killed, etc) and so on and so forth... Google indexed a lot of pages.
You are correct, depending on how or if caching is done on this site, that would add to the performance hit on the MySQL db.
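A minimal sketch of the kind of caching meant here, assuming a hypothetical `TTLCache` class sitting in front of whatever function runs the database query. Whether and how this site caches is not known from the thread; the 60-second lifetime in the usage note is an arbitrary example.

```python
import time


class TTLCache:
    """Keep computed values for ttl_seconds before recomputing them."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without waiting
        self.store = {}     # key -> (expires_at, value)

    def get(self, key, compute):
        now = self.clock()
        hit = self.store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh: no database work needed
        value = compute()  # e.g. the expensive query for a player page
        self.store[key] = (now + self.ttl, value)
        return value
```

With something like `cache = TTLCache(60)`, a bot requesting the same player page many times within a minute would trigger one query instead of one per request. It does not help when the bot walks through thousands of different pages, since each one is a cache miss.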



Since this is my first post here I'd like to say thanks for the stats, but I wouldn't call what you have queried your content. At most it is EA's, but don't take that as being ungrateful for what you have set up.
Sarum
The Angry Geek
+11|6818
What you've all missed is that Chuy has said that, while Googlebot contributed, it wasn't the prime cause. That was someone else, who he's contacted but sensibly not identified to us, since we're like a pack of rabid dogs. So less talk about Google; it wasn't the cause.
blue60007
Member
+0|6815
Heh, well this happened to a server I admin (well sorta co-admin), we noticed a spike in server usage, and found the culprit to be google bots indexing our forums, went away after a day or two.
Preacher
In Hoc Signo Vinces
+1|6806|Netherlands, the
I put a lot of time and effort into running this site and pay for it out of pocket (with the help of ads) ....

Time for a Paypal button??? If it gets people a better update.. I bet there will be some $ coming your way
chuyskywalker
Admin
+2,439|6818|"Frisco"

Paypal or not, I can't update any faster than IGN will withstand.
B.Schuss
I'm back, baby... ( sort of )
+664|6812|Cologne, Germany

well, then it's probably time for IGN to get a paypal button..

Privacy Policy - © 2024 Jeff Minard