Hacker News | new | past | comments | ask | show | jobs | submit | westside1506's comments

Our service, 80legs, will let you easily do this. We let you specify seed links, how deep you want to crawl, and control many other aspects of the crawl. By default, we control the hard bits, like redirects and spider traps, but if you want to override our default functionality you can easily insert your own code to do it.

Our default functionality will let you identify mp3 files by regex or keyword, but if you need something more sophisticated you can override that too. I'm pretty sure, based on what you've said, that you could simply put in a few parameters and start running some jobs within a few minutes of getting started with 80legs that will do exactly what you want. If not, adding custom code to 80legs is pretty simple too.
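For the curious, the regex part is just URL matching. A minimal sketch in Java of the kind of filter I mean (the class and pattern here are my own illustration, not 80legs' actual API):

```java
import java.util.List;
import java.util.regex.Pattern;

public class Mp3Filter {
    // Match URLs ending in .mp3, case-insensitively, ignoring query strings.
    private static final Pattern MP3_URL =
        Pattern.compile("(?i)\\.mp3(\\?.*)?$");

    static boolean isMp3(String url) {
        return MP3_URL.matcher(url).find();
    }

    public static void main(String[] args) {
        List<String> links = List.of(
            "http://example.com/song.mp3",
            "http://example.com/track.MP3?dl=1",
            "http://example.com/page.html");
        long hits = links.stream().filter(Mp3Filter::isMp3).count();
        System.out.println(hits); // prints 2
    }
}
```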

Just send us your contact info on our website (http://www.80legs.com) and mention HN and I'll make sure you get a beta invite. BTW - we're still in private beta and the service is still free for right now.


We actually have a surprising number of customers come to us at 80legs wanting to crawl google search results. I don't think most of them are trying to reverse engineer google or anything like that. Most probably just want a fast way to find relevant topics to crawl.

They are disappointed when they learn we obey robots.txt, so we have them manually do searches to pull out seed lists for their 80legs crawls. It's a pain, but there's not really a way around it within the rules.


Plura affiliates actually accept responsibility for getting the permission of their users. Plura encourages disclosure and has found that it is actually very well received by users once it is explained. It always works out better for Plura affiliates when they disclose. To that end, Plura has actually changed its TOS with affiliates so that they directly take responsibility for getting user acceptance.

Most Plura apps/websites give users optin/optout capabilities. Rather than anything ill-intentioned, the actual model is really that Plura gives application developers a means of offering their application at a discount (or free) to users that don't mind trading their excess computer resources for the app. For those that don't want Plura+free, the application developer can give them other options (pay, ads, whatever).

Once the users really understand it, they are almost always happy that the developer has a new means of monetization so that the developer will continue to improve the software they are using.

BTW, this all runs in a secure java sandbox where nothing can actually see the users data, disk, what programs are running, or anything else about the computer. Plura has gone to great lengths to try to sanitize the entire process and be good guys.


Interesting way to earn money. However, why can't a regular user run a Plura client too, so that they can earn cash themselves?


There's certainly no problem with individual users doing it. Just contact Plura through the web form at http://pluraprocessing.com/contact.php.

Alternatively, you can use one of our affiliates to raise money for charity (not related to our company) http://donatebot.com


Hi guys. We were actually about to do an "Ask HN: Review our startup" post, but I guess someone beat us to it.

So, please review our startup. :)

We are launching the beta today to a handful of users and will be letting in more and more users over time.

One other note: We don't just offer crawling. Our model is actually to allow you to analyze the web content that you discover. Using your own custom code that you push into 80legs, you can do sophisticated text processing, image processing, look inside PDFs, etc.


Sorry for hijacking your plan to post here first, but I found the idea incredibly cool and useful, and didn't know you guys were around here on HN. 80legs can potentially save a lot of effort for people/companies who need to crawl the web for data and analyze it.

Hope this really works out well for all of you.

p.s. Just in case you are curious, I got a reference note about your application from someone who was following the Web 2.0 Expo.


Thanks for the nice comments luckystrike. We had a great time at the Web 2.0 Expo and we've been overwhelmed by all the interest in our service. We're pretty hopeful that we've built something that people want. :)


It looks pretty cool and even something I might be interested in using.

However, my initial feedback relates to clarity. On the About page you say 'crawl 2bn pages' and then 'pay $2 per million pages crawled', but don't actually say that (as I understand it) customers set up custom searches matching a regexp and only pay for the hits (crawls) that match.

My immediate reaction on seeing '2bn'/'$2 per million' was to firstly think 'wtf, $4k per day then?' and secondly 'I hope that's an American not a British billion'. (Though we seem to have adopted yours, now.)

It's really just a wording/clarity thing though, and I might be alone in this.


Thanks, we will work on the wording. Just for clarity, we actually do custom crawls based on your needs. If you need to access one million pages, you tell us how to get to them and pay us $2 plus any time you spend processing those pages. You can do a generic crawl from http://dir.yahoo.com or you can give us a very customized seed list and just read those pages or crawl only a few levels deep from there. Your choice.

You certainly don't need to crawl two billion (2,000,000,000) pages per day. In fact, that's our total estimated capacity right now.


What about traffic?

2,000,000 pages * 40 KB (average, compressed) / (1024 * 1024) * $0.1/GB ≈ $7.6. That alone is the cost to transfer it to your datacenter. I can't reliably access the data on remote clients.

I guess the $2 price tag is just marketing blah blah.


Our service actually allows you to push your code into the system rather than trying to pull back all of the page contents. So, you end up running your semantic analysis, image analysis, or whatever you want to do on our grid. Very specifically, you implement a processPage() function of the following form:

  byte[] processPage(String url, byte[] pageContents, Object userData);

We run your function on the contents of the pages/images/objects you want to analyze and give you back your results from the millions or billions of pages you want to analyze.

The results from the processPage() function are completely free form. You serialize your results into a byte array and that's what you get back (except you get it back for all of your urls).
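To make that concrete, here's a toy example of what a user-supplied function might look like. Everything except the signature (which is given above) is my own hypothetical sketch, e.g. counting a keyword and serializing the count:

```java
import java.nio.charset.StandardCharsets;

public class KeywordCounter {
    // Sketch of a user-supplied processPage(): count how many times a
    // keyword (passed in via userData) appears in the page, then
    // serialize "url <TAB> count" as UTF-8 bytes. Only the signature
    // comes from the thread; the body is illustrative.
    static byte[] processPage(String url, byte[] pageContents, Object userData) {
        String page = new String(pageContents, StandardCharsets.UTF_8);
        String keyword = (String) userData;
        int count = 0, from = 0;
        while ((from = page.indexOf(keyword, from)) != -1) {
            count++;
            from += keyword.length();
        }
        return (url + "\t" + count).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] page = "crawl the web, crawl it all".getBytes(StandardCharsets.UTF_8);
        byte[] result = processPage("http://example.com", page, "crawl");
        System.out.println(new String(result, StandardCharsets.UTF_8));
    }
}
```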

Now, since the processPage() function is free form, you can just turn around and "return pageContents;" from your function. That will give you all of the page contents from your crawl. That's not an ideal case for us, but we can handle it. We might eventually charge a small bandwidth or storage cost for this type of usage, but we do not intend to do so for our normal use case.

The bigger charge to the customer if they try to pull back all of the contents will be their local bandwidth charge. They would need to pull all of these pages' contents to their own servers. That will cost them quite a lot of bandwidth assuming they don't have their own fat pipe.

In summary, $2/million-pages-crawled is our real price and is not just marketing.


That's pretty cool. Thinking aloud then, if I wanted to say pull out all the adjectives from results matching $foo, I'd end up getting that data back and then have to pipe that into storage myself - costing me both bandwidth in and bandwidth out. Thought about cutting out the middleman and letting people write to S3 direct? (Yes, I have no idea how complicated this might be.)


Hey - I work for 80legs as well so thought I'd chime in and answer this question (westside is grabbing some food). We have thought about offering easy integration with AWS, but we'd probably implement this at a later time if we decided to go that route.


This looks very cool.

How do you (and/or Plura) deal with the problem of running code on other people's machines? How do you know that the data being sent back is valid, or that a competitor can't start a node and reverse-engineer your code? This may be less of an issue than I imagine, but I'm sure it's something you've thought about so I'd be interested in hearing your thoughts.


Great question. We've actually done a lot of work on this to ensure that there isn't a problem with running the code on various people's machines.

First, Plura actually runs the processPage() function in the restricted java sandbox so there is no way to actually see any data on the user's computer or do anything bad to their computer. Also, the code goes through a short verification process before it is deployed.

For the results, we do have a reasonably sophisticated validation process as well. For someone to change results from one node, they would have to do quite a bit of work.


Does this technology work using both images simultaneously? Or could lots of images be pre-processed individually into something that allowed a cheaper comparison?

Just to try to explain the question better, here's an example. Let's say I have 10 images and I want to find the most similar people among any pair of images. Do I need to run every pair of images (45 full comparisons) or can I pre-process the 10 images into something such that the 45 comparisons can be done in a less expensive way?


Recognition is generally broken down into two steps: processing faces into "templates" and then comparing those templates. Generating templates includes all the preprocessing stuff as well: detecting faces in an image, estimating their pose, and finding landmark points. Our site goes into these issues in some depth (with some examples). So yes, we do break the process down: generating the templates can be done individually for every image, which allows you to store that result and use it for future comparisons.

Generating two templates is many times more expensive than comparing those templates. However, as your dataset grows, the cost of generating templates grows as N, while the number of comparisons you need to do grows as N^2. So eventually, comparisons dominate.
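To make the scaling concrete, a quick sketch: N templates require N*(N-1)/2 pairwise comparisons, which matches the 45 comparisons for 10 images mentioned above (the code is just my illustration of the counting argument):

```java
public class ComparisonCount {
    // For N images you generate N templates once, but every pair must
    // be compared: N*(N-1)/2 comparisons. Template generation is
    // linear, matching is quadratic, so matching eventually dominates.
    static long pairs(long n) {
        return n * (n - 1) / 2;
    }

    public static void main(String[] args) {
        System.out.println(pairs(10));   // prints 45
        System.out.println(pairs(1000)); // prints 499500
    }
}
```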


My angel group uses Angelsoft to organize the deals we receive from startups. I believe we pay Angelsoft for the privilege and get to use their tools to discuss and evaluate the deals that come our way. It is useful.

There is also tons of crap listed from other cities and groups. I guess these guys pay Angelsoft to pitch, but that part of the site is a complete waste. There is so much noise in those deals that I never look at them.


Very cool algorithm demo. How well would this compare to a very extreme JPEG compression?

Also, I'm not an image compression expert, but I wonder if you could use this technique as a pre-pass before a more standard image compression algorithm to improve the overall results. In other words, maybe subtract the image created using the polygons from the original image and then compress the result using JPEG (or something else more suitable?).
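To sketch what I mean by the residual pre-pass (purely illustrative Java on a flat pixel array, not anyone's actual codec):

```java
public class ResidualSketch {
    // Sketch of the pre-pass idea: subtract the polygon approximation
    // from the original image, leaving a (hopefully low-energy)
    // residual that a standard codec like JPEG could then compress.
    // Differences are recentered around 128 and clamped to 8 bits.
    static int[] residual(int[] original, int[] approximation) {
        int[] out = new int[original.length];
        for (int i = 0; i < original.length; i++) {
            int diff = original[i] - approximation[i] + 128; // recenter
            out[i] = Math.max(0, Math.min(255, diff));       // clamp
        }
        return out;
    }

    public static void main(String[] args) {
        int[] original      = {100, 150, 200};
        int[] polygonApprox = {90, 160, 200};
        int[] r = residual(original, polygonApprox);
        System.out.println(r[0] + " " + r[1] + " " + r[2]); // 138 118 128
    }
}
```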


Very extreme pre-2000 JPEG compression would probably do considerably worse, as it would degenerate into drawing boxes of averaged color. JPEG 2000 would absolutely destroy this compression technique, as it specifically uses a form of Fourier analysis (wavelet analysis) which is provably optimal for this kind of compression. In fact, AFAIK JPEG 2000 is designed so that as more bits are read from a file, it is possible to get a better and more accurate picture, one of the benefits of using a Fourier-analysis-like technique. Sort of like compression on steroids. Getting a lower-res picture is only slightly more complicated than truncating the file and requires minimal processing.


JPEG2K is a terrible format that, in practice, often comes out worse than JPEG. This is partly because wavelets are terrible from a psychovisual standpoint, but also because the format is just badly designed; Snow, a similar wavelet format, trashes it completely even when using the exact same wavelet transform.

Also, wavelets are not at all "provably optimal." The only "provably optimal" transform is the KLT, and even that isn't really, since in practice overcomplete transforms tend to have better compression efficiency than complete ones, at the cost of turning compression from a simple O(n log n) frequency transform into an NP-complete problem of matching pursuits.


"Provably optimal" always involves unwarranted assumptions. For example, least-squares fitting is provably optimal, assuming all your distributions are independent and Gaussian. Image compression quality is all about the sensitivity of the human visual system to various kinds of errors, so nothing can be proven optimal, ever.


Proving still retains value, because you reduce the burden of choosing a good algorithm to choosing good assumptions.


One common set of assumptions in image denoising is that:

1. The pixels in the output image should be close in value to the corresponding pixels in the input image.

2. Neighboring pixels in the output image should have similar values.

I wonder if anyone has compared compression techniques using these sorts of assumptions.
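Those two assumptions correspond to a standard quadratic energy: a fidelity term penalizing deviation from the input, plus a smoothness term penalizing differences between neighbors. A toy sketch evaluating it on a 1D signal (my own illustration; a real denoiser would minimize this over the output):

```java
public class DenoiseEnergy {
    // The two assumptions above map to a quadratic energy:
    //   E(out) = sum_i (out[i] - in[i])^2              (assumption 1)
    //          + lambda * sum_i (out[i] - out[i+1])^2  (assumption 2)
    // This only evaluates E; a denoiser minimizes it over `out`.
    static double energy(double[] in, double[] out, double lambda) {
        double e = 0;
        for (int i = 0; i < in.length; i++) {
            double d = out[i] - in[i];
            e += d * d;                      // fidelity term
        }
        for (int i = 0; i + 1 < out.length; i++) {
            double d = out[i] - out[i + 1];
            e += lambda * d * d;             // smoothness term
        }
        return e;
    }

    public static void main(String[] args) {
        double[] noisy    = {1.0, 3.0, 1.0};
        double[] smoothed = {1.5, 2.0, 1.5};
        System.out.println(energy(noisy, smoothed, 0.5)); // prints 1.75
    }
}
```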


Interesting concept, but this is not designed to find where your content has been copied onto other sites/blogs. That would be a cool service.


That service is called Google.


And if you're looking for a slightly more scalable solution than "Find a sentence from Page A which is juicy, Google it, record results, repeat for all 10,000 pages on my website", you may be interested in CopyScape.

For example, if you feed them www.tynt.com, they'd tell you that feedmyapp.com has borrowed a few sentences from them. (It makes sense, as the borrowed phrases are part of a listing for tynt.com)


That service is called Attributor.com


Wow, I just saw this. I was an early lender on Prosper to the tune of $15k. I "invested" it over a couple of months in about 100 different loans. I was considering it a test before investing considerably more.

Well, the test didn't go well and I've been steadily withdrawing my money over the past couple of years. At the end of the day, I expect (pending the outcome of this new development) that I'll lose about $3k total.


No, it does not have to be a game. We're just focusing our initial affiliate marketing on games because of the higher engagement.

The type of sites you mention will work well too. Send us a note through the Contact Us page on the website and we can help you figure out how much you could earn.

