Company Blog

Downloading the internet, or how we got our first 1M articles

Sometime over this weekend the number of articles in our database passed a cool 1,000,000; it’s actually 1,035,511 right now.

Because it’s been a while since I last wrote a blog post for Preona, I thought this was the perfect occasion to talk a bit of tech and about what we’ve been up to for the past month or two.

LazyReadr was released to the App Store in December. We got the happy news smack bang in the middle of our Christmas party. This wasn’t a full-out launch, just a soft release into the wild to see what would happen.

We were pretty drunk and quite ecstatic.

This feeling lasted for about a day.

Then everything promptly went to shit. We ran out of allocated budget on the AppEngine, our scraping architecture melted and just about everything that can possibly go wrong server-side did go wrong. And I can’t blame the code that I wrote for crashing either. After all, before the release our Superfeedr account was subscribed to a measly 600-ish RSS feeds and we were processing a comfortable 200 notifications per hour.

Since the release we are subscribed to 4500 RSS feeds and are processing almost 3000 notifications per hour.

And let me tell you, downloading the internet seemed like a much better idea before we actually started doing it :)

Why the hell are you downloading the internet?

Yes, the internet is rife with jokes about how it is undownloadable, cannot be stored and is generally pretty damn big. All of that is true; the internet really is pretty fucking big.

There are two philosophical reasons and one technical reason why we wanted to download all of the internets.

First of all, we want our users to be able to get their content as fast as their network speed will allow. This means we should (must?) have all of the content a user might want nicely stored for them and then send them the whole batch of fresh updates they’ve yet to read.

We also don’t like it if users are forced to look at weird advertising, incomplete content or, heaven forbid, the login form of a paywall. That’s why we use Readability to scrape all the content and send it in a much nicer-to-read form.

Which leads us to the technological problem – our scraper just isn’t all that amazingly fast. It takes around two seconds to extract the meat of the article out of a decently sized and reasonably complex website. Obviously we can’t do that on the fly when users are downloading several hundred articles at a time and can barely be kept waiting 30 seconds.

Also there’s the added awesomeness of all these articles later being available to provide very awesome “related content” features, overviews of how stories develop through time and just a bunch of neat data mining opportunities. It’s a goldmine!

Hmm … I still think you’re idiots

Luckily we aren’t downloading the internet just willy-nilly. There is method behind the madness: we only download sources our users are actually subscribed to, and our bet is that we can sell every downloaded article more than once. Even with only about 200 users this has already started to prove true.

On average, every RSS feed we handle has 1.3 users subscribed to it, and the most popular feed actually has 16 subscribers! This may not sound like a lot right now, but we bet it will get really significant when we reach a scale of a few thousand users.

You must also take into account that the really problematic feeds, the ones that spam content every few minutes, are the more popular feeds with a lot of subscribers, while the fringe feeds with very few subscribers (generally personal blogs etc.) have a lot fewer updates and so aren’t that big a problem for us.

The technical stuff

Because I wanted this post to be useful, here’s some technical stuff about how we are actually downloading the internet, in the hope that someone finds it interesting or that it piques their geeky side.

First some stats:

6,883,389 entities in our database
which is 25 GBytes of data
1,035,511 articles
1,279,391 images
67 GBytes of binary data

And last I checked we were adding about 10 gigs of data per week; might be a bit less right now because we improved some redundancies.
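For a rough sense of scale, the stats above can be turned into per-item averages. This is only a back-of-the-envelope sketch: it assumes 1 GByte means 10^9 bytes and treats the 67 GBytes of binary data as if it were all images, neither of which is stated above, and the function name is made up for illustration.

```python
# Back-of-the-envelope averages from the stats listed above.
# Assumptions (not measured): 1 GByte = 10**9 bytes, and the binary
# data is treated as if it consisted only of the stored images.
GB = 10**9

ENTITIES = 6_883_389
ENTITY_BYTES = 25 * GB
IMAGES = 1_279_391
BINARY_BYTES = 67 * GB


def average_bytes(total_bytes: int, count: int) -> float:
    """Mean size of one stored item, in bytes."""
    return total_bytes / count


if __name__ == "__main__":
    print(f"avg entity: {average_bytes(ENTITY_BYTES, ENTITIES) / 1000:.1f} kB")
    print(f"avg image:  {average_bytes(BINARY_BYTES, IMAGES) / 1000:.1f} kB")
```

Under those assumptions an entity averages roughly 3.6 kB and an image about 52 kB, which suggests images and not article text dominate the storage bill.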

So far we really aren’t using anything too special to handle this kind of load, which you could argue isn’t that big a load anyway. For the most part we’re simply relying on the awesome scalability of Google AppEngine, where we are constantly burning about 40 instances (whatever exactly that means to Google), and it’s only costing us a few dollars per day. So far still cheaper and much, much simpler than doing it ourselves.

It also turns out that handling an average of 20 requests per second (due to task queues and every article triggering many requests etc.) with Django is very very easy when you’re using the AppEngine.

Downside is it can get a bit pricey quite quickly.

The more technically interesting stuff like scraping websites for content and linking similar articles together is not done on Google because it turns out burning that much CPU on a cloud platform is like shooting yourself in the face. So far all we need to handle an average of 1 aps (article per second), where processing each article takes about two to three seconds, is a very beefy VPS and a run-of-the-mill PC. By our calculations we still have a buffer of about 600 articles per hour, whereas with just a VPS we were lagging behind by something like 500 to 1000 articles per hour.
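The capacity reasoning here can be sketched as simple arithmetic: the number of scrapers that must run concurrently is roughly the arrival rate times the time each article takes. The snippet below is an illustration, not our actual monitoring code; the function names and the worker counts in the examples are hypothetical.

```python
import math


def workers_needed(articles_per_second: float, seconds_per_article: float) -> int:
    """Concurrent scrapers needed to keep up (arrival rate x service time)."""
    return math.ceil(articles_per_second * seconds_per_article)


def hourly_headroom(workers: int, articles_per_second: float,
                    seconds_per_article: float) -> float:
    """Spare capacity in articles per hour; negative means falling behind."""
    capacity = workers * 3600 / seconds_per_article
    demand = articles_per_second * 3600
    return capacity - demand


if __name__ == "__main__":
    print(workers_needed(1.0, 2.0))        # 2
    print(workers_needed(1.0, 3.0))        # 3
    print(hourly_headroom(2, 1.0, 2.5))    # -720.0
    print(hourly_headroom(3, 1.0, 2.5))    # 720.0
```

At 1 aps and 2.5 seconds per article, going from two to three concurrent scrapers flips a backlog of 720 articles per hour into 720 articles per hour of headroom, which is the same kind of swing described above when the PC joined the VPS.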

In essence, when load grows we just have to add cheap commodity PCs to the architecture and everything will be alright. This makes me pretty happy to be honest.

Final summation

Downloading the internet is fun, horrible and terrifying.

HackerNews full feed for your reading pleasure

tl;dr -> http://feeds.feedburner.com/HackerNewsFullFeed

Last night I was hanging around the #startups IRC channel and I just glimpsed the final half of a debate about how much the HackerNews RSS feed sucks.

As anyone who reads HackerNews via RSS knows, the feed only contains a title, a link to the content and a link to the discussion on HN. I’m not sure exactly why that is, but it would seem a lot of people would prefer a full feed containing at least some if not all of the content.

This got me thinking. Here I am, all the tools necessary to make a full feed at my disposal and it couldn’t hurt to tell some more people about this.

So I went and added a few functions to our LazyReadr architecture. Two test functions and one implementation function to be precise.

Then I went on a hunt through our database to find the id of the HackerNews feed, got the link and plopped everything on FeedBurner because … well, because implementing the analytics code to track the subscribers seemed like a huge pain.

And voila, the Hacker News Full Feed at your disposal. Have fun.

We’re very open to feature requests and/or suggestions. One thing that pops to mind is including the comments, but I’m not sure how to implement that in a useful manner right now.

If you’re wondering what we’re using to scrape the articles: our Readability API.

Want any other feeds in their full-content glory? Ping us, we might be able to help :)

Ever wanted Arc90’s Readability as an API?

Over at Preona we have been wanting something just like that for a while now.

So we built it!

Some time ago, while developing LazyReadr, we were faced with the fact that RSS feeds simply aren’t all that lovely anymore. Many of them don’t contain the full content, a lot of them are littered with strange widgets designed to show up in your Google Reader and, what’s worse, some inject ads and other annoying crap in there too.

We didn’t want any of that. And because we still want all the lovely images, text formatting, links and other stuff the author went through a lot of trouble to put in their text, we couldn’t use any of the standard scraping solutions like Java’s boilerpipe.

Arc90’s Readability seemed like the best option. It’s already established itself as the best web scraping algorithm out there … I mean, Safari uses it and that’s argument enough for me. If you need more of an argument, Flipboard uses it too.

Also a few test prods of Readability showed it was decent enough and whatever it doesn’t catch we can still put into a layer above it that will catch some weird cases and such. Still to be developed :)

Anyway, what I’m trying to say is that we have Readability running as an API and we are unleashing it to the unsuspecting public!

Yay!

I don’t care about your rambling, just give me the API

tl;dr we are running Readability on a server.

Because it turns out running complex JavaScript on a server can sometimes be a bit slow, our API is designed to be asynchronous. This isn’t because we hate you; it’s to make sure every task gets executed at least eventually if not right away. But we seem to be mostly on top of things, so turnaround times are within roughly 10 seconds.

The API itself only has a single call. This is what you do:

In the body of a POST request to http://plateboiler.lazyreadr.com/ you send this sort of JSON:

{"callback": "url-where-you-can-be-reached",
 "queue": "scraping",
 "parameters": {"url": "http://example.com/some/article"}
}
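As a rough illustration, posting such a task from Python could look like the sketch below. It uses only the standard library; the function names and the Content-Type header are assumptions made for the example (nothing here says which headers the API expects), and the callback must be a URL the server can actually reach.

```python
import json
import urllib.request

API_URL = "http://plateboiler.lazyreadr.com/"  # endpoint named in this post


def build_task(article_url: str, callback_url: str) -> bytes:
    """Serialize a scraping task in the shape shown above."""
    task = {
        "callback": callback_url,
        "queue": "scraping",
        "parameters": {"url": article_url},
    }
    return json.dumps(task).encode("utf-8")


def submit_task(article_url: str, callback_url: str) -> int:
    """POST the task and return the HTTP status code.

    The Content-Type header is an assumption, not documented behaviour.
    """
    request = urllib.request.Request(
        API_URL,
        data=build_task(article_url, callback_url),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Building the body with json.dumps instead of by hand guarantees double-quoted, properly escaped JSON, whatever characters the article URL contains.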

And when the processing is complete this sort of JSON will fly back in the body of a POST request to the provided url:

{"callback": "url-where-you-can-be-reached",
 "queue": "scraping",
 "result": "the-html",
 "parameters": {"url": "http://example.com/some/article"}
}

Update: because some languages have issues with escaped quotes in strings (they remove the escaping making the JSON invalid) the result is now encoded as Base64 to fix that issue.

As you may have guessed, it’s an exact clone of the task you posted, but with an added result property. All the parameters you passed are preserved and so on. This is to make it easier for you to identify the result that’s come back.
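On the receiving end, whatever handles the callback POST needs to parse the body, decode the result and match it to the original task. A minimal sketch, assuming the result is standard Base64 of UTF-8 encoded HTML (the exact encoding details are an assumption) and using a hypothetical helper name:

```python
import base64
import json


def parse_callback(body: bytes) -> tuple[str, str]:
    """Return (article_url, html) from the body of a callback POST.

    Assumes 'result' holds standard Base64 of UTF-8 encoded HTML.
    """
    payload = json.loads(body)
    html = base64.b64decode(payload["result"]).decode("utf-8")
    return payload["parameters"]["url"], html


if __name__ == "__main__":
    # A fake callback body, built locally just to exercise the parser.
    sample = json.dumps({
        "callback": "url-where-you-can-be-reached",
        "queue": "scraping",
        "result": base64.b64encode(b"<p>hi</p>").decode("ascii"),
        "parameters": {"url": "http://example.com/some/article"},
    }).encode("utf-8")
    print(parse_callback(sample))
```

Because the parameters come back untouched, you can also stash your own identifier in there when posting the task and read it back here instead of matching on the URL.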

That’s honestly it. All you need to know to scrape any website.

Some examples:

Turn a Guardian article into a lovely scrape of a Guardian article.

Or a post about iPads into a shiny scraped post.

Update at midnight: the view-scraped-page part of the architecture was meant as a helper function. Being frontpaged on HackerNews obviously crashed it. The two examples have been moved to static files served by nginx. Sorry for any inconvenience.

For your convenience there are also some utility urls: worker logs, task dispatcher logs, list of tasks in queue and last but not least, you can view any scrape by going to http://plateboiler.lazyreadr.com/boiled/?url=your-url

Some technical details

Making this API public is experimental. We honestly don’t know what will happen when we publish this post. Some people we talked to suggestively wagged their eyebrows that they might want to possibly use an API like this.

So we’re releasing it as an experiment. Hopefully a lot of people will write to us (@swizec or @preona) saying how awesome we are and that they love us. Also hopefully everything won’t crash and burn and die if too many people use it. We also hope somebody does use it. That would make us very happy indeed.

Right now the scraper is eating through RSS feeds of LazyReadr’s test users. That means we are processing around 1400 articles every hour. We are reasonably certain we can process more.

The whole stack this is running on is pretty much open source plus a lot of elbow grease.

We’re using node.js and a custom fork of tmpvar’s jsdom (thanks man, that thing is great); we also pushed a lot of our fixes upstream so maybe they help you too. The worker queue is an in-house open source project by me, rapid.queue; perhaps someone can use that. Plus we used a bunch of other node.js libraries found on GitHub; some of them we patched up, some we just sort of wrapped with our glue layer.

We are not releasing the glue layer as open source just yet :)

But maybe you can come up with your own and then we’ll be using your API instead of ours, if you do a better job.

Anyway, if you have any questions or suggestions, or just want to say hi, poke us in the comments or on Twitter via @swizec or @preona.

Don’t crash our server!

Our USA adventure was just like the movies

It is now two weeks since we came back to Slovenia from the U.S. It’s cold, it’s raining almost every day and we are back in the basement coding and developing further while thinking of the sun and warmth we remember from California. Also the scent of snow in the air makes it easier for some of us.

That said, the U.S. trip was awesome, super great and then some. Maybe I picked up the habit of calling everything super, but it really is. The sun is always nice, the people you talk to are amazing, and it’s so much easier to bump into someone doing a cool summarizer service, or maybe another co-working tool, or even to talk to a VC firm that gives away billions of US dollars. Oh, and this is all from experience: all you need is two weeks over there and you will get enough inspiration, energy and contacts to last you, I’d say, a couple of months.

OK, so more on what we actually spent our time on over there. Pitching, presenting, drinking and pitching some more; did I say we pitched? The first pitch was at the Oppino event, where we had our test run. After some regrouping and a bit of advice from a great person from Stanford, we adjusted our pitch and demo to better fit the culture in the States and went to the Rainmakers event.

Imagine a really nice room with around 60 or so people, 40 of them VCs and around 20 startups, all of them in business suits. And then there was us. We looked kind of like Zuckerberg in The Social Network, except we didn’t have hoodies on, but you could still see we didn’t exactly fit in with the crowd. The outcome of the event? We got a meeting with the founder of the GigaOm Network, so let’s just say that owning a suit is not everything; having a good personality can be better.

We also spent some time around San Francisco, walking through the Mission District and getting soaked at Alcatraz, because the perfect weather for being a tourist is of course pouring rain.

There were many other big things we learned, like that just being there in the US means actually having a presence. Events happen every day and if you need funding, a VC is happy to meet you. A second meeting is also possible; we had one, so we are talking some more about some investments. But that is all in due time; first the product should be out in the App Store. By our current calculations, LazyReadr will be out by the end of this month (November).

Want to hear more about our experiences or get hyped by US stories and startup culture? Poke us for some coffee and we will tell you more. I didn’t want this blog post to turn into a 5 page explanation, but rather to be some inspiration for more Slovenian startups to go and see Silicon Valley.

Startup School 2010 notes and speakers’ main points

Today’s Startup School 2010, organized by Y Combinator, was great. So many fascinating talks by good entrepreneurs, and so much passion for startups of any kind. For an entrepreneur doing Preona as a startup, it was the place to be.

We pitched and showed our LazyReadr, the personalized newspaper iPad app, to many people and got some good feedback. It seems like the right thing to do, pitching and showing it to the best crowd of tech and like-minded people while we are waiting to be reviewed by Apple.

As we are also applying to the winter YC, it was great to see what the Silicon Valley startup scene looks like and what people are doing. Yesterday’s YC dinner was also nice, pitching and mingling with everyone. Talking to Robert Scoble and others was an awesome experience too, and we hope to come to SF again in a few months.

Now on to some notes and topics that each speaker pointed out. One topic running through all the talks was that the team is the most important and best asset of a startup. The vision and just doing it come first, and having a team that can make it happen is the other key thing. Execution and product are the next step towards a big company, followed by further development, talking to users, growing and hopefully going public or being bought in the end.

Do leave comments with your thoughts.

Now here are the talks:

Andy Bechtolsheim
Founder Arista Networks; Founder, Sun Microsystems

Innovation is the never-ending search for better solutions. The best thing is to learn from the big players. As a startup we have the advantage of spending loads of time on ideas and early innovation and of being agile, while the execution is just following your vision and coding/building the thing you are working on. You don’t need a lot of money to start a web business, but if it takes off… Do what you want to do; external input only goes so far (pretty useless).

Paul Graham
Partner, Y Combinator; Founder, Viaweb

Super angels and VCs are fighting, which means higher valuations for startups, and that is good for founders. There will be lots of new investment in startups, and it is looking like a big new startup bubble. Rounds that close fast at high valuations are good news for founders.

Andrew Mason
Founder, Groupon

Take one feature/use case and make a new business out of it when you fail. The “build it and they will come” approach to getting users is a fail. You have to target people, buy ads, do sales. Push emails to users instead of just sending Groupons in the case of Groupon.

Tom Preston-Werner
Founder, GitHub

You have to read Drive by Dan Pink. You always have a choice: do the stuff at your own pace if you can. Be happy rather than get the money. Get users engaged through personal meetups and meetings. An engaged user is a lot better than one you just buy or get from sales.

If you have T-shirts or other promotional material like cups – Don’t give away free stuff, sell it. It doesn’t need to be for money, it can be for engaging or happy users, a win-win for both.

Greg McAdoo
Partner, Sequoia Capital

Building great startups requires a long period of time. Startups need a few years to grow, develop and monetize to get to a point. You need to solve your own problems, the ones you are having every day; by solving your own problem you will also help other people who have the same one. To get introduced to Sequoia Capital, a referral from another founder or someone they funded is a much better way in. It gives you credibility, and people know you are cool if you get recommended.

Reid Hoffman
Partner, Greylock; Founder, LinkedIn

Before you can go to the market, you need to get into distribution and engagement of the users. You can have a great idea that is going to be great once the users are there. But getting to the beach, getting those users, is going to be hard. Make a plan. Communities, engagement. Talk to anyone that wants to talk to you. More intelligence and data gives you more pivots and more ways of thinking about how to go further. How fast are you going? Observe similar companies to determine the pace. How do you get 10000 users, 10000000 users… Through distribution.

Ron Conway
Partner, SV Angel

Really, anybody can do it. Everybody started small and mostly as a 2 or 3 person team. If you are 1, you will need a team sooner or later, why not start making one from day 1 already.

Adam D’Angelo
Founder, Quora

Having a product is good, but experience with other startups is also a good reference for angels. The other option is having lots of users to get on their radar. The design of an application can be important if you want it to be, and can be the advantage or differentiator of the product.

Dalton Caldwell
Founder, Picplz; Founder, Imeem

Don’t make a music startup. Period. Too much fuss and issues. Similar to software patents.

Mark Zuckerberg
Founder, Facebook

He started by moving to the Valley for just a short time. Didn’t really think it would be permanent. You should think about making a good architecture already in the beginning, otherwise it can happen that you need to fix it for years at a later time when you are already big.

Brian Chesky
Founder, Airbnb

They had loads of launches and have now been at it for 1000 days. The business didn’t really take off. They applied to YC after a long time and sold breakfast cereals as a side thing for a while. They pivoted about 5 times after feedback from users and new ideas. They have been growing for the last few months and have really taken off.

To quote Douglas Adams on deadlines

I love deadlines. I like the whooshing sound they make as they fly by.

It was early August when we made the decision to launch LazyReadr in September. It is now early October and some of you may have noticed that we neither launched nor kept our blog updated; luckily there’s twitter so at least we were showing signs of life.

So what happened?

Well … stuff … stuff happened.

When September rolled around we were holding in our hands a product. It was a finished product as per our anticipated specs. It was also the third generation of this product that we got to. The first being a super quick two week mockup, the second being a 1.5 month mockup with a fairly decent backend. And the third being what we had after roughly two months of work at the beginning of September.

We decided not to launch. And I am still convinced it was a good idea.

There is a lot of talk in the lean startup movement about a so called Minimum Viable Product. The product you first launch will suck, but that’s OK because you’ll make it better.

Well the thing is, there is a Minimum Product and then there is a Minimum Viable Product. We had the former, but not the latter. Sure, it was somewhat shiny. Sure it kind of sort of worked if the stars were aligned properly and you knew you had to stand on one hand while praying to the god called Zerb.

In other words, it was completely useless. The only feature that could set it apart from the competition was how utterly the whole thing sucked. A small rule we like to go by is “Make products you are proud of!” and there was no pride to be gained from launching that on time.

So we went to work.

And since September also happens to be a month of exams … well it delayed the work a bit, but at least now everyone from the team is happily enrolled into the next year of their respective college courses. That counts for something too.

LazyReadr has now reached a stage that we can be totally proud of. We can take the iPad to a pub and show off what we’ve made without bowing our heads. Hell, the whole thing is plain sexy!

Just this last week we ironed out all of the major bugs, we made a website scraper that’s got heads turning and people offering us money for an API (more on that some other time), we implemented a very shiny design and made a few little compromises.

We now have an MVP that we can be proud of and that’s all that really matters in a startup.

We reached the 1000 commits milestone!

Deciding to launch LazyReadr this September has made for one hectic, fun-filled and not very relaxing summer.

As a result Hamax made the 1000th commit to our svn repository yesterday. And I owe him a batch of home made muffins for it even though at least 50% of all the commits were mine, but hey, that’s how bounties work right?

The repo was created a year ago when the then-team switched from working on Twitulater to working on something completely new and applying to Seedcamp. The product and the idea sucked big time and we didn’t get much further than the Seedcamp shortlistings. But it was the beginning of something much greater. It was the beginning of an era.

Our mission: understand the internet!

The first attempt looked a bit like this:

But that didn’t work out very well and we scrapped the idea right after we got back from London. We then tried to make a tool for sharing stuff online. Surely that way we’ll be able to understand what goes on in this crazy world right?

Here’s what it looked like at first when it was a Firefox add-on:

Yep, not very appetizing at all. But it got a lot better with our second attempt:

This one got us into the mini Seedcamp finals at Prague. That was lots of fun.

But then summer rolled in and we realized that, hey, it’s time to go after the big one. Let’s make that LazyReadr we’ve been talking about for ages. And let’s make it an iPad app too! Yeah!

Wooo.

So we’re making it. We totally are. And it’s going right to the App Store this September. That’s in like two weeks or something! And then we’re taking it to Silicon Valley in October … but that’s a different story.

And last but not least, here’s a lovely visualization of everything that went on throughout all of these fun times. You can see team members coming and going, you can see when we’re really working hard, when university takes its toll …. oh yeah, a lot can be deduced from these strange floating names [developers] and blobs [files-theyre-working-on] and everything. Quite a lot.

First 1000 revisions of Preona from preona on Vimeo.

Our very awesome countdown board is quite awesome

Last night, or quite possibly two nights ago, I was very bored. So much so that I watched Street Customs on Discovery Channel. For those who don’t know, it’s a show about a very cool entrepreneur who got a $3000 loan from his gramps and transformed it into a huge worldwide car customising business in a few short years.

Brilliant guy; the show though isn’t quite up my alley.

Anyhow, watching that I noticed they’re using a very cool motivation device in their shop. Up just below the ceiling is a plastic timebomb-like device with a giant LCD display counting down days, hours, minutes and seconds until the deadline.

It’s a great way to put some pressure on the team. Everyone can see how little time is left. The countdown timer is a whole lot different from some insane boss/founder/CEO person who keeps urging you along and saying how little time there’s left. Hell, sometimes even that guy might have trouble feeling the pressure.

Introduce the timer!

Therefore the next natural step was taking one of our whiteboards at Preona and turning it into a huge Countdown Board. Huger even than the one Ryan Friedlinghaus uses at West Coast Customs. We weren’t using it much anyway because the sofa gets in the way of comfortable writing.

But to make things even more interesting … our Countdown Board includes a list of the bigger milestones we have to pass before the time runs out :)

Preona is giving away free iPads!

Carving out a niche for yourself in the modern online world is extremely difficult. We reckon it might be a dash easier to start in an upcoming market that’s not quite well formed yet and then work on from there.

This is why we are bringing our awesomeness to the iPad. (and I also happen to be writing this from one)

Determined as hell to come out of the woodwork with LazyReadr by the end of the summer, we are still missing one crucial ingredient in the mix.

Alpha/beta testers!

But good testers are difficult to find! So here’s the deal: if you volunteer to help us with testing and discovering what users like, we are giving you an iPad.

For at least a week you get to live with our app and our iPad, then tell us what it was like. We might even buy you a beer to help you talk.

Then, at the end of the summer, our favourite tester gets to keep the iPad forever.

Naturally we also accept testers who already own an iPad. You just get a beer-two-three though :)

We’re officially starting testing in two weeks, so drop us a line in the comments, via twitter or by email if you’re interested in becoming our awesome test bunny.

London startup expedition

Just shy of a fortnight ago we participated in a Slovenian startup expedition to London. It was a day of great historical events and magnificent happenings.

Slovenia got pushed out of the world cup of footballs.

But other than that the expedition was a whole lot of fun. We got to meet a lot of people we talked to at the Prague Seedcamp, awesomer still, some of them remembered us from as far back as the Seedcamp shortlistings in London last September. That was very awesome indeed, shows we’re doing something right.

And we love doing stuff right.

Other than that we got a great opportunity to find out we’re actually getting competent at pitching what we’re doing … at least a little bit … well enough that people commented they really liked our presentation. But not enough for people to start taking their checkbooks out and giving us money.

Oh well.

But there was one very important thing we got out of the whole experience. It was the fact that we need to stop being silly about what we’re doing. We need to board ourselves up in the basement for the next two months and make something. Just being great looking and awesome blokes all around will not be enough for the Silicon Valley trip this autumn. Not by far.

That’s why we’re buying an iPad or two in the upcoming weeks.

But more about that some other time :)
