Google App Engine and Yahoo Pipes: a fetch-page web service
The Fetch Page module built into Yahoo Pipes has a restriction: it can only fetch pages smaller than 200 KB.
The pipe looks like this:
But there is a Web Service module that lets you bypass this restriction.
All we need to do is write a web service that fetches the pages and attaches them to the feed items.
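The contract such a service has to honour can be sketched without any App Engine code. This is only an illustration, assuming (as the handlers in main.py do) that the feed arrives as a JSON string with an "items" list and that every item has a "link". The names append_field and fake_fetch are made up for this sketch, and it uses the standard json module rather than simplejson so it runs anywhere.

```python
import json


def append_field(data, fetch, field="html"):
    """Decode the posted JSON, add fetch(link) to every item, re-encode it."""
    obj = json.loads(data)
    for item in obj["items"]:
        item[field] = fetch(item["link"])
    return json.dumps(obj)


# Hypothetical stand-in for the real page fetch, so the sketch needs no network.
def fake_fetch(url):
    return "<html>page at %s</html>" % url


sample = json.dumps({"items": [{"link": "http://example.com/a"}]})
result = json.loads(append_field(sample, fake_fetch))
```

Passing the fetcher in as an argument keeps the JSON handling testable on its own; the real service would pass a function that downloads the page.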
Here is a sample:
app.yaml
application: yahoo-pipes-fetch-page
version: 1
runtime: python
api_version: 1
handlers:
- url: /
  static_files: index.html
  upload: index.html
- url: .*
  script: main.py
main.py
#!/usr/bin/env python
# Python 2 code for the App Engine "python" runtime declared in app.yaml.
from google.appengine.ext import webapp
from google.appengine.ext.webapp import util
import urllib2
import re
import simplejson


class MainHandler(webapp.RequestHandler):
    def get(self):
        self.response.out.write('get: Hello world!')


class AppendHtmlHandler(webapp.RequestHandler):
    """Adds the full HTML of item['link'] to every item as item['html']."""

    def post(self):
        # The handler expects the feed as JSON in the "data" parameter.
        data = self.request.get("data")
        obj = simplejson.loads(data)
        items = obj["items"]
        for item in items:
            req = urllib2.Request(item['link'], None, {'User-agent': 'Mozilla/5.0'})
            html = urllib2.urlopen(req).read()
            # Attach the whole page; html[0] would be only its first character.
            item['html'] = html
        # webapp's Response takes the content type through its headers.
        self.response.headers['Content-Type'] = 'application/json'
        simplejson.dump(obj, self.response.out)


class AppendBodyHandler(webapp.RequestHandler):
    """Adds the <body> content, without scripts and styles, as item['body']."""

    def post(self):
        data = self.request.get("data")
        obj = simplejson.loads(data)
        items = obj["items"]
        for item in items:
            req = urllib2.Request(item['link'], None, {'User-agent': 'Mozilla/5.0'})
            html = urllib2.urlopen(req).read()
            body = re.findall(r'<body[^>]*>(.*?)</body>', html, re.DOTALL|re.MULTILINE)
            # Fall back to an empty string when the page has no <body> element.
            body = body[0] if body else ''
            body = re.compile(r'<script.*?</script>', re.DOTALL|re.MULTILINE).sub('', body)
            body = re.compile(r'<noscript.*?</noscript>', re.DOTALL|re.MULTILINE).sub('', body)
            body = re.compile(r'<style.*?</style>', re.DOTALL|re.MULTILINE).sub('', body)
            item['body'] = body
        self.response.headers['Content-Type'] = 'application/json'
        simplejson.dump(obj, self.response.out)


def main():
    application = webapp.WSGIApplication([('/', MainHandler),
                                          ('/appendhtml', AppendHtmlHandler),
                                          ('/appendbody', AppendBodyHandler)],
                                         debug=True)
    util.run_wsgi_app(application)


if __name__ == '__main__':
    main()
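The body clean-up done in AppendBodyHandler can also be tried out on its own, away from App Engine. The sketch below is not the code above: extract_body is a hypothetical helper name, it merges the three tag-stripping regexes into one, and it matches tags case-insensitively, which the original does not. It returns an empty string when the page has no body element.

```python
import re

# Everything between <body ...> and </body>.
_BODY = re.compile(r'<body[^>]*>(.*?)</body>', re.DOTALL | re.IGNORECASE)
# A script, noscript or style element together with its content.
_NOISE = re.compile(r'<(script|noscript|style)\b.*?</\1>', re.DOTALL | re.IGNORECASE)


def extract_body(html):
    """Return the body content with script/noscript/style elements removed."""
    match = _BODY.search(html)
    if match is None:
        return ''
    return _NOISE.sub('', match.group(1))


page = ('<html><body class="x"><p>Hi</p>'
        '<script>var a = 1;</script><style>p{}</style></body></html>')
cleaned = extract_body(page)
```

Regexes are a rough tool for HTML, so this is only good enough for shrinking a page before handing it back to the pipe.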
Now you can make pipes like this:
But here is the epic fail:
Web service failure:
An Error Occurred
408 User-agent timeout (select)
So ...
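One thing that might be worth trying, assuming the 408 means the service took too long to answer (the error message alone does not prove that): give the loop a time budget and return whatever was fetched so far. This is an untested idea, not a confirmed fix. The function name append_with_budget, the budget value and the injected clock are all made up for this sketch.

```python
import time


def append_with_budget(items, fetch, budget_seconds, field="html", clock=time.time):
    """Fetch pages for items until the time budget runs out.

    Items that are not reached keep no `field` key; items whose fetch
    raises get an empty string. Returns the number of successful fetches.
    """
    deadline = clock() + budget_seconds
    fetched = 0
    for item in items:
        if clock() >= deadline:
            break  # out of time: answer with what we already have
        try:
            item[field] = fetch(item["link"])
            fetched += 1
        except Exception:
            item[field] = ""
    return fetched


# Demo with a fake clock: the second item is reached after the deadline.
ticks = iter([0.0, 1.0, 6.0])
items = [{"link": "a"}, {"link": "b"}]
count = append_with_budget(items, lambda url: "<html>", 5.0,
                           clock=lambda: next(ticks))
```

The budget is only checked between fetches, so a single slow page can still overrun it; the fetch itself would also need its own timeout.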