Hello World
Hi, I saw your post on HN about your craigslist mashup. I would love to know how you went about making this. I don't know if you blog, but a blog entry explaining how you went about this project would be great. Also, you've done an awesome job at this. Thanks
Oh why not, a blog can't hurt.
The mashup in question is Flippity, and it's hardly finished. Today it's just an alpha prototype, released primarily to get user and community feedback. You see, Dan and I have been working on the project more or less full time for about 2 months, and we've gotten kind of bored without any users. Yeah, we have ADD. More importantly, I think 2 months without any users is a long, long time for a consumer internet site. I expect future projects to have prototypes ready in under a month, but this is the first real project Dan and I have done together, so we also had a process to establish. Oh, and neither one of us was a web developer.
The idea was pretty simple -- plot Craigslist listings on a map. Add features like radius search, email notifications, and so forth. But the core idea was to let you see what's being sold around your home. Why? Well, with this you can do nifty things. You can run a proximity search so you don't have to drive too far (Craigslist only lets you search by region). You can isolate nearby items and visit those sellers in one trip. And best of all, without searching for anything in particular, you can see what all your 'neighbors' are selling. It's like a garage sale, but with all of Craigslist as a data set.
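To make the radius search idea concrete, here is a minimal sketch in Python. It is not Flippity's actual code: the function names, the `lat`/`lon` keys, and the choice of the haversine formula are all assumptions made for illustration.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/lon points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def listings_within(listings, home_lat, home_lon, radius_miles):
    """Return the listings inside the radius, nearest first.

    Each listing is assumed to be a dict with "lat" and "lon" keys
    (a hypothetical shape, chosen for this example).
    """
    hits = []
    for listing in listings:
        dist = haversine_miles(home_lat, home_lon,
                               listing["lat"], listing["lon"])
        if dist <= radius_miles:
            hits.append((dist, listing))
    hits.sort(key=lambda pair: pair[0])
    return [listing for _, listing in hits]
```

Scanning every listing like this is fine for a prototype. With a lot of listings you would probably want the database to do the filtering instead.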
So version .001 scraped Craigslist -- it issued GET requests over HTTP and parsed the resulting pages. This worked, but it was ugly, and since every request came from one central place, it faced the immediate threat of being IP-blocked by Craigslist at any point. Not long after, I realized Craigslist has RSS feeds. Did you know this? If so, congratulations, you're one of a total of seven people who do. The icon is neatly tucked in the bottom right corner of any given results page.
What good are RSS feeds? Well, they take the parsing out of the equation, since RSS is based on XML, which, aside from being bloated, doesn't require much parsing. Much more importantly, RSS feeds are read and cached by RSS readers. Some of those readers, like Google's and Yahoo's, have an API. What does this mean in plain English? It means that we can now get our Craigslist data through a 3rd party, without ever touching Craigslist directly. Not only that, but if anyone else using the readers pulled the same data in the last 15-30 minutes, then it comes directly from the 3rd party, and Craigslist is only hit once. Even better, there's a nifty tool called Yahoo Pipes that will let you obtain data from anywhere on the internet (Craigslist included), parse it in a bunch of ways, and spit it out via JSON/XML/RSS.
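Here's roughly what "not much parsing" looks like, as a sketch using Python's standard XML library. The sample feed is made up for this example and is not a real Craigslist feed. Real feeds may use namespaced tags, in which case the tag names below would need adjusting.

```python
import xml.etree.ElementTree as ET

# A made-up, minimal RSS 2.0 document standing in for a real feed.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>for sale</title>
    <item>
      <title>Road bike (Mission District) $150</title>
      <link>http://example.org/bik/1.html</link>
    </item>
    <item>
      <title>Oak desk (Oakland) $40</title>
      <link>http://example.org/fuo/2.html</link>
    </item>
  </channel>
</rss>
"""


def parse_items(feed_xml):
    """Return a list of {"title", "link"} dicts, one per <item> in the feed."""
    root = ET.fromstring(feed_xml.encode("utf-8"))
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        })
    return items


for entry in parse_items(SAMPLE_FEED):
    print(entry["title"], "->", entry["link"])
```

Compare that to scraping HTML, where every layout change on the site can break your parser.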
This led us to our next design. We'll have a cron job that runs every 15 minutes (the Craigslist refresh rate), pulling data from our private Yahoo pipe, which reads a bunch of Craigslist RSS feeds (there are 36 categories in "for sale") and spits them out as gzipped JSON. We only need to read the first page for most categories, as the Craigslist posting rate is usually less than 100 items per 15 minutes (cars and trucks is one exception). Our Python code then parses this data and inserts it into a Postgres database. Another process runs continuously and uses Google's Geocoding API to geocode the locations of these new entries, if possible, into latitudes and longitudes. Since the location field in Craigslist is optional, it often contains junk, but it turns out there's enough utility nonetheless. Finally, if the geocoding fails, we'll extract the general region from the URL or posting and approximate the location, attempting to salvage it.
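Two pieces of that pipeline can be sketched in a few lines of Python: unpacking the gzipped JSON, and falling back to a region guess when geocoding fails. Everything specific here is an assumption for illustration, not our real code: the `{"items": [...]}` payload shape, the `location` and `link` keys, the `REGION_CENTROIDS` table, and the `geocode` callable, which stands in for a call to a geocoding service and returns either `(lat, lon)` or `None`.

```python
import gzip
import json
from urllib.parse import urlparse

# Hypothetical lookup table: region subdomain -> rough (lat, lon) centroid.
REGION_CENTROIDS = {
    "sfbay": (37.7749, -122.4194),
    "seattle": (47.6062, -122.3321),
}


def decode_items(payload):
    """Decompress a gzipped JSON payload and return its list of items.

    Assumes the JSON looks like {"items": [...]}; the real output of a
    pipe may be nested differently.
    """
    data = json.loads(gzip.decompress(payload).decode("utf-8"))
    return data.get("items", [])


def region_from_url(url):
    """Return the first label of the hostname, e.g. "sfbay", or None."""
    host = urlparse(url).hostname or ""
    return host.split(".")[0] if host else None


def locate(item, geocode):
    """Return (lat, lon, precision) for an item, or None if nothing works.

    Tries the free-text location first, then falls back to the region
    taken from the posting URL.
    """
    text = (item.get("location") or "").strip()
    if text:
        coords = geocode(text)
        if coords is not None:
            return coords[0], coords[1], "geocoded"
    region = region_from_url(item.get("link", ""))
    if region in REGION_CENTROIDS:
        lat, lon = REGION_CENTROIDS[region]
        return lat, lon, "region"
    return None
```

Keeping a precision tag next to the coordinates lets the map treat an exact hit differently from a region-level guess.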
We wouldn't have to jump through all these hoops if Craigslist had an API. But even without an API, if they managed to get higher quality locations, Flippity would be much more useful. Exact addresses are ideal, but people may be unwilling to post those, so why not cross streets? I emailed Craigslist's founder, Craig Newmark, to get his opinion. I'll post some of that exchange in the next blog entry.