A guy I play frisbee with turns out to have a cool developer's website, EvanJones.ca. Here are a couple of the cool links I found there:
Python code coverage [via EvanJones.ca]
Yahoo Zookeeper (distributed systems infrastructure) [via EvanJones.ca]A guy I play frisbee with turns out to have a cool developer's website, EvanJones.ca. Here are a couple of the cool links I found there:
Python code coverage [via EvanJones.ca]
Yahoo Zookeeper (distributed systems infrastructure) [via EvanJones.ca]I have a lot of content in notepad.yahoo.com -- links, articles, etc. I'd like to download it, format it programmatically, and post it to my blog. I'd like to do this in python, and I've already tried HTTPBasicAuthHandler and urllib2 with no success -- I get HTML that asks for login, rather than HTML that shows my notepad entries.
Here's what I have tried:
import urllib2, base64
username, password = 'hughdbrown@yahoo.com', '?'
passman = urllib2.HTTPPasswordMgrWithDefaultRealm()
passman.add_password(None, url, username, password)
auth_handler = urllib2.HTTPBasicAuthHandler(passman)
opener = urllib2.build_opener(auth_handler)
urllib2.install_opener(opener)
req = urllib2.Request('http://notepad.yahoo.com/')
base64string = base64.encodestring('%s:%s' % (username, password))[:-1]
req.add_header("Authorization", "Basic %s" % base64string)
handle = urllib2.urlopen(req)
print handle.read()
Does anyone have an idea of how to get into notepad.yahoo.com to download notes?
[2008-05-08]I got email from Alex Angelopoulos who had this recommendation:
Actually, I used WSH because I'm most used to it, even though it was Python that initially got me excited about scripting. I also _write_ about admin scripting, and the usual name of the game is "what can you do with the bits already in the box?"
That said, what I'm doing could easily be translated into Python using its COM interop. All I do is automate the browser. This is of course dramatically more resource-intensive than curl, but it means that any ugly bits of the page are handled exactly the way the browser handles them, and the job changes from a lot of parsing and submission-handling to selecting elements by name or id and then setting values or invoking actions.
I've attached a demo of it so far. It logs in, then takes the first page - presumably listing all of the notes - and returns the URLs for each and every note. It's possible to push this farther, of course, actually walking through the notes and grabbing the content of each one.
Notes on the demo:
(1) It's VBScript. I know, don't say it. : / However, I correctly cased object properties and methods so translation should be straightforward if you want to Pythonize. The only terminally ugly bits should be the setAttribute methods; those are both arguments and should be parenthesized in more rational languages. To run the demo as-is, you need to modify the username/password, and then I advise running from the console WSH host cscript so you don't get lots of popups for the URLs.
(2) It uses hardcoded username/password which I've set topside (and will likely make into commandline arguments in a better version).
(3) If you haven't done IE automation before, note the little readyState wait loops. You need to wait for IE to finish each navigation event, and in the case of complex documents, it's also necessary to wait for the document to render.
At this point, I'm not certain what you _really_ want to do from here - or if the browser automation approach isn't feasible for some reason. What I particularly like about this approach, though, is that you can work with objects instead of text, and it's always guaranteed to work with pages designed to the "Well, it worked in Internet Explorer..." spec.
Alex
And he provided WSH code:
username = "yyyyyyyy@yahoo.com"
passwd = "xxxxxxx"
Dim ie: Set ie = CreateObject("InternetExplorer.Application")
' Initial navigation needed to stabilize IE
ie.Navigate("about:blank")
Do Until ie.readyState = 4 : WScript.Sleep 100: Loop
' Show IE for debugging; can be hidden once it works.
ie.Visible = true
ie.Navigate "http://notepad.yahoo.com"
Do Until ie.readyState = 4 : WScript.Sleep 100: Loop
Dim doc: Set doc = ie.Document
Set usernameNode = doc.getElementById("username")
usernameNode.setAttribute "value", username
Set passwordNode = doc.getElementById("passwd")
passwordNode.setAttribute "value", passwd
' There's only one form, but it doesn't have an id -
' so we get by tagname and treat it like a collection...
'Dim body: Set body = doc.Body
Set forms = doc.Body.getElementsByTagName("form")
For each form in forms
if form.getAttribute("name") = "login_form" then form.submit()
next
' we're navigating in to the folders.
Do Until ie.readyState = 4 : WScript.Sleep 100: Loop
Set doc = ie.Document
' Apparently the document builds after the browser is technically
' finished navigating. So we wait until the document readyState
' "complete" - yes, it's a text state, not a flag...
Do Until doc.readyState = "complete"
WScript.Sleep 100
Loop
Set table = ie.Document.getElementById("datatable")
Set links = table.getElementsByTagName("a")
for each link in links
WScript.Echo link.href
next
[2008-05-09]And that reminded me that I have used WatiN, a .NET library ostensibly for testing GUIs but really ideal for screen-scraping HTML pages -- particularly because it has Intellisense. So I wrote a program using WatiN that does everything:
using System;
using System.Globalization;
using System.Collections.Generic;
using WatiN.Core;
namespace YahooNotepad
{
class YahooLink
{
private string url, folder, title;
DateTime dt;
public YahooLink(string url, string folder,
string title, DateTime dt)
{
this.dt = dt;
this.title = title;
this.url = url;
this.folder = folder;
}
public string Url { get { return this.url; } }
}
class Program
{
static DateTime hackParse(string dateStr)
{
string[] ds = dateStr.Split(new char[] { ' ' });
string[] datePart = ds[0].Split(new char[] { '/' });
string[] timePart = ds[1].Split(new char[] { ':' });
int year = 2000 + int.Parse(datePart[2]);
int month = int.Parse(datePart[0]);
int day = int.Parse(datePart[1]);
int hour = int.Parse(timePart[0]);
if (ds[2] == "pm" && hour > 12)
hour += 12;
int minute = int.Parse(timePart[1]);
DateTime dt = new DateTime(year, month, day,
hour, minute, 0);
return dt;
}
static bool processPages(IE ie, Queue<YahooLink> notes)
{
Table table = ie.Table("datatable");
// Skip the <HT> row
for (int i = 1; i < table.TableRows.Length; i++)
{
TableRow row = (TableRow) table.TableRows[i];
TableCell href = (TableCell) row.TableCells[1];
Link link = (Link) href.Links[0];
TableCell folder = (TableCell) row.TableCells[2];
TableCell date = (TableCell)row.TableCells[3];
DateTime dt = hackParse(date.Text);
Console.WriteLine("{0} (in {1} on {2}): {3}",
link.OuterText, folder.Text, dt, link.Url);
YahooLink yl = new YahooLink(link.Url, folder.Text,
link.OuterText, dt);
notes.Enqueue(yl);
}
try
{
ie.Link(Find.ByText("Next")).Click();
return true;
}
catch (Exception)
{
return false;
}
}
[STAThread]
static void Main(string[] args)
{
string userName = "hughdbrown@yahoo.com";
string password = "?";
string website = "http://notepad.yahoo.com";
using (IE ie = new IE(website))
{
try
{
ie.TextField(Find.ById("username"))
.TypeText(userName);
ie.TextField(Find.ById("passwd"))
.TypeText(password);
ie.Button(Find.ByValue("Sign In"))
.Click();
}
catch (Exception)
{
// Carry on -- probably already logged in
}
Queue<YahooLink> notes = new Queue<YahooLink>();
while (processPages(ie, notes))
;
foreach (YahooLink yl in notes)
{
ie.GoTo(yl.Url);
TextField tf = ie.TextField(Find
.ById("charCountTA"));
Console.WriteLine(tf.Text);
}
ie.Link(Find.ById("Sign Out")).Click();
}
}
}
}
There are a couple of hacks:
from __future__ import with_statement
def fixFile(fileName) :
with open(fileName, "r") as f :
str = f.read().replace("\r\n", "\n").replace("\r", "\n")
with open(fileName, "w") as f :
f.write(str)
Two ways to load a dictionary with keys and values:
# Using builtin zip() d = dict(zip(the_keys, the_values)) # Using itertools.izip() import itertools d = dict(itertools.izip(the_keys, the_values)
The Netflix Prize seeks to substantially improve the accuracy of predictions about how much someone is going to love a movie based on their movie preferences. Improve it enough and you win one (or more) Prizes. Winning the Netflix Prize improves our ability to connect people to the movies they love.Pyflix is a small package written in Python that provides an easy entry point for getting up and running in the Netflix Prize competition. It combines an efficient storage scheme with an intuitive high-level API that allows contestants to focus on the real problem, the recommendation system algorithm
From Thomas Guest's The Third Rule of Program Optimisation
$ python -m cProfile -o prof.dump program.py
You then load the prof.dump output file and analyse it using the pstats module. For example, to find which 10 functions took the most time:
>>> import pstats
>>> prof = pstats.Stats("prof.dump")
>>> prof.sort_stats("time").print_stats(10)