Sometime over this weekend the number of articles in our database passed a cool 1,000,000 – it’s actually 1,035,511 right now.

Because it’s been a while since I last wrote a blog post for Preona, I thought this was the perfect occasion to talk a bit of tech and about what we’ve been up to for the past month or two.
LazyReadr was released to the App Store in December. We got the happy news smack bang in the middle of our Christmas party. This wasn’t a full-on launch, just a soft release into the wild to see what would happen.
We were pretty drunk and quite ecstatic.
This feeling lasted for about a day.
Then everything promptly went to shit. We ran out of allocated budget on the AppEngine, our scraping architecture melted, and just about everything that can possibly go wrong server-side did go wrong. And I can’t even blame the code I wrote for crashing. After all, before the release our Superfeedr account was subscribed to a measly 600-ish RSS feeds and we were processing a comfortable 200 notifications per hour.
After the release we are subscribed to 4500 RSS feeds and are processing almost 3000 notifications per hour.
And let me tell you, downloading the internet seemed like a much better idea before we actually started doing it.
Why the hell are you downloading the internet?
Yes, the internet is rife with jokes about how it is undownloadable, cannot be stored and is generally pretty damn big. All of that is true: the internet really is pretty fucking big.
There are two philosophical reasons and one technical reason why we wanted to download all of the internets.
First of all, we want our users to be able to get their content as fast as their network speed will allow. This means we should (must?) have all of the content a user might want nicely stored for them and then send them the whole batch of fresh updates they’ve yet to read.
We also don’t like it if users are forced to look at weird advertising, incomplete content or, heaven forbid, a login form or a paywall. That’s why we use Readability to scrape all the content and send it in a much nicer-to-read form.
Which leads us to the technological problem – our scraper just isn’t all that amazingly fast. It takes around two seconds to extract the meat of the article out of a decently sized and reasonably complex website. Obviously we can’t do that on the fly when users are downloading several hundred articles at a time and can barely be kept waiting 30 seconds.
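To put rough numbers on that, here is a back-of-the-envelope sketch in Python. The 2 seconds per article and the 30 second wait are the figures from above; the batch of 300 articles is only an assumed stand-in for “several hundred”, the function names are made up for this example, and it assumes articles would be scraped one after another with no parallelism.

```python
SCRAPE_SECONDS_PER_ARTICLE = 2.0  # figure quoted in the post
MAX_WAIT_SECONDS = 30.0           # how long a user can be kept waiting


def on_the_fly_wait_seconds(article_count,
                            seconds_per_article=SCRAPE_SECONDS_PER_ARTICLE):
    """Time a user would wait if every article were scraped on request."""
    return article_count * seconds_per_article


def max_articles_within_wait(max_wait=MAX_WAIT_SECONDS,
                             seconds_per_article=SCRAPE_SECONDS_PER_ARTICLE):
    """How many articles could be scraped before the user gives up."""
    return int(max_wait // seconds_per_article)


# 300 is an assumed batch size, not a measured one.
print(on_the_fly_wait_seconds(300) / 60, "minutes for 300 articles")
print(max_articles_within_wait(), "articles fit in the wait budget")
```

Under those assumptions a 300-article batch would take about 10 minutes, and only about 15 articles fit into the 30 second window, which is why the scraping has to happen ahead of time.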
Also there’s the added awesomeness of all these articles later being available to power “related content” features, overviews of how stories develop through time and just a bunch of neat data mining opportunities. It’s a goldmine!
Hmm … I still think you’re idiots
Luckily we aren’t downloading the internet just willy-nilly. There is method behind the madness: we only download sources our users are actually subscribed to, and our bet is that we can sell every downloaded article more than once. Even with only about 200 users this has already started to prove true.
On average, every RSS feed we handle has 1.3 users subscribed to it, and the most popular feed actually has 16 subscribers! This may not sound like a lot right now, but we bet it will become really significant when we reach a scale of a few thousand users.
You must also take into account that the really problematic feeds, the ones that spam content every few minutes, tend to be the popular ones with a lot of subscribers, while the fringe feeds with very few subscribers (generally personal blogs etc.) have a lot fewer updates and so aren’t that big a problem for us.
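For the curious, this is roughly how such a reuse figure can be computed. It is only a sketch: the `reuse_stats` function, the (user, feed) pair layout and the toy data are invented for this example and are not our actual schema. The “saved” number assumes every feed updates equally often, which is a simplification.

```python
from collections import Counter


def reuse_stats(subscriptions):
    """Summarise how often one downloaded feed is shared between users.

    `subscriptions` is an iterable of (user, feed_url) pairs.
    Returns (average subscribers per feed, subscribers of the most
    popular feed, fraction of feed fetches saved by sharing).
    """
    unique_pairs = set(subscriptions)  # ignore duplicate rows
    per_feed = Counter(feed for _user, feed in unique_pairs)
    if not per_feed:
        return 0.0, 0, 0.0
    total_subscriptions = sum(per_feed.values())
    feed_count = len(per_feed)
    average = total_subscriptions / feed_count
    # Fetching once per feed instead of once per subscription.
    saved = 1 - feed_count / total_subscriptions
    return average, max(per_feed.values()), saved


# Toy data, invented for this example.
example = [
    ("ann", "feed-a"), ("bob", "feed-a"), ("cat", "feed-a"),
    ("ann", "feed-b"),
]
average, most_popular, saved = reuse_stats(example)
print(average, most_popular, saved)
```

By the same formula, an average of 1.3 subscribers per feed means downloading once per feed saves about 23% of the fetches compared to downloading once per subscription. Since the busiest feeds are also the most popular ones, the saving per article is probably higher than that.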
The technical stuff
Because I wanted this post to be useful, here’s some technical stuff about how we are actually downloading the internet, in the hope that someone finds it interesting or that it piques their geeky side.
First some stats:
- 6,883,389 entities in our database, which is 25 GBytes of data
- 1,035,511 articles
- 1,279,391 images
- 67 GBytes of binary data
And last I checked we were adding about 10 gigs of data per week; it might be a bit less right now because we cleaned up some redundancies.
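As a rough idea of where that leads, here is a small Python projection. It assumes the 25 GBytes of entities and the 67 GBytes of binary data are separate and add up to 92 GBytes, and that growth stays at a flat 10 GBytes per week; both are guesses made for the sake of the example, and the function names are made up.

```python
import math

DATASTORE_GB = 25          # entity data, from the stats above
BINARY_GB = 67             # binary data, from the stats above
GROWTH_GB_PER_WEEK = 10    # "about 10 gigs per week", assumed constant


def projected_storage_gb(weeks,
                         current_gb=DATASTORE_GB + BINARY_GB,
                         growth_gb_per_week=GROWTH_GB_PER_WEEK):
    """Total storage after `weeks` of linear growth."""
    return current_gb + growth_gb_per_week * weeks


def weeks_until(target_gb,
                current_gb=DATASTORE_GB + BINARY_GB,
                growth_gb_per_week=GROWTH_GB_PER_WEEK):
    """Whole weeks until storage reaches `target_gb` (0 if already there)."""
    remaining = target_gb - current_gb
    return max(0, math.ceil(remaining / growth_gb_per_week))


print(projected_storage_gb(52), "GB after a year")
print(weeks_until(1000), "weeks until the first terabyte")
```

With those assumptions we would be sitting on roughly 600 GBytes a year from now and hit a terabyte in a bit under two years.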
So far we really aren’t using anything too special to handle this kind of load, which you could argue isn’t that big a load anyway. For the most part we’re simply relying on the awesome scalability of Google AppEngine, where we are constantly burning about 40 instances (whatever exactly that means to Google), and it’s only costing us a few dollars per day. So far that is still cheaper and much, much simpler than doing it ourselves.
It also turns out that handling an average of 20 requests per second (due to task queues and every article triggering many requests etc.) with Django is very, very easy when you’re using the AppEngine.
The downside is that it can get a bit pricey quite quickly.
The more technically interesting stuff like scraping websites for content and linking similar articles together is not done on Google, because it turns out burning that much CPU on a cloud platform is like shooting yourself in the face. So far, all we need to handle an average of 1 aps (article per second), where processing each article takes about two to three seconds, is a very beefy VPS and a run-of-the-mill PC. By our calculations we still have a buffer of about 600 articles per hour, whereas with just the VPS we were lagging behind by something like 500 to 1000 articles per hour.
In essence, when load grows we just have to add cheap commodity PCs to the architecture and everything will be alright. This makes me pretty happy to be honest.
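The capacity math behind this is simple enough to sketch in Python. A “worker” here is a hypothetical unit, one scraping process handling one article at a time; how many of those a VPS or a PC actually runs is not something this example knows. It also assumes articles arrive at a steady rate, which real feeds don’t do.

```python
import math

SECONDS_PER_HOUR = 3600


def workers_needed(articles_per_second, seconds_per_article):
    """Minimum number of parallel workers that keeps up with the load."""
    return math.ceil(articles_per_second * seconds_per_article)


def hourly_headroom(workers, articles_per_second, seconds_per_article):
    """Articles per hour of spare capacity (negative means falling behind)."""
    capacity = workers * SECONDS_PER_HOUR / seconds_per_article
    return capacity - articles_per_second * SECONDS_PER_HOUR


# 1 article per second at 2 to 3 seconds each, as quoted in the post.
print(workers_needed(1, 2), "to", workers_needed(1, 3), "workers needed")
print(hourly_headroom(3, 1, 2.5), "articles per hour to spare with 3 workers")
```

So at one article per second and two to three seconds per article, two to three workers must be busy at all times just to break even, and every extra worker adds somewhere between 1200 and 1800 articles per hour of capacity. That is why adding one more cheap box is enough to go from lagging behind to having a buffer.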
Final summation
Downloading the internet is fun, horrible and terrifying.