I have a lot of content in notepad.yahoo.com -- links, articles, etc. I'd like to download it, format it programmatically, and post it to my blog. I'd like to do this in python, and I've already tried HTTPBasicAuthHandler and urllib2 with no success -- I get HTML that asks for login, rather than HTML that shows my notepad entries.
Here's what I have tried:
import urllib2, base64
username, password = 'hughdbrown@yahoo.com', '?'
passman = urllib2.HTTPPasswordMgrWithDefaultRealm()
passman.add_password(None, url, username, password)
auth_handler = urllib2.HTTPBasicAuthHandler(passman)
opener = urllib2.build_opener(auth_handler)
urllib2.install_opener(opener)
req = urllib2.Request('http://notepad.yahoo.com/')
base64string = base64.encodestring('%s:%s' % (username, password))[:-1]
req.add_header("Authorization", "Basic %s" % base64string)
handle = urllib2.urlopen(req)
print handle.read()
Does anyone have an idea of how to get into notepad.yahoo.com to download notes?
[2008-05-08]
I got email from Alex Angelopoulos who had this recommendation:
Actually, I used WSH because I'm most used to it, even though it was Python that initially got me excited about scripting. I also _write_ about admin scripting, and the usual name of the game is "what can you do with the bits already in the box?"
That said, what I'm doing could easily be translated into Python using its COM interop. All I do is automate the browser. This is of course dramatically more resource-intensive than curl, but it means that any ugly bits of the page are handled exactly the way the browser handles them, and the job changes from a lot of parsing and submission-handling to selecting elements by name or id and then setting values or invoking actions.
I've attached a demo of it so far. It logs in, then takes the first page - presumably listing all of the notes - and returns the URLs for each and every note. It's possible to push this farther, of course, actually walking through the notes and grabbing the content of each one.
Notes on the demo:
(1) It's VBScript. I know, don't say it. : / However, I correctly cased object properties and methods so translation should be straightforward if you want to Pythonize. The only terminally ugly bits should be the setAttribute methods; those are both arguments and should be parenthesized in more rational languages. To run the demo as-is, you need to modify the username/password, and then I advise running from the console WSH host cscript so you don't get lots of popups for the URLs.
(2) It uses hardcoded username/password which I've set topside (and will likely make into commandline arguments in a better version).
(3) If you haven't done IE automation before, note the little readyState wait loops. You need to wait for IE to finish each navigation event, and in the case of complex documents, it's also necessary to wait for the document to render.
At this point, I'm not certain what you _really_ want to do from here - or if the browser automation approach isn't feasible for some reason. What I particularly like about this approach, though, is that you can work with objects instead of text, and it's always guaranteed to work with pages designed to the "Well, it worked in Internet Explorer..." spec.
Alex
And he provided WSH code:
username = "yyyyyyyy@yahoo.com"
passwd = "xxxxxxx"
Dim ie: Set ie = CreateObject("InternetExplorer.Application")
' Initial navigation needed to stabilize IE
ie.Navigate("about:blank")
Do Until ie.readyState = 4 : WScript.Sleep 100: Loop
' Show IE for debugging; can be hidden once it works.
ie.Visible = true
ie.Navigate "http://notepad.yahoo.com"
Do Until ie.readyState = 4 : WScript.Sleep 100: Loop
Dim doc: Set doc = ie.Document
Set usernameNode = doc.getElementById("username")
usernameNode.setAttribute "value", username
Set passwordNode = doc.getElementById("passwd")
passwordNode.setAttribute "value", passwd
' There's only one form, but it doesn't have an id -
' so we get by tagname and treat it like a collection...
'Dim body: Set body = doc.Body
Set forms = doc.Body.getElementsByTagName("form")
For each form in forms
if form.getAttribute("name") = "login_form" then form.submit()
next
' we're navigating in to the folders.
Do Until ie.readyState = 4 : WScript.Sleep 100: Loop
Set doc = ie.Document
' Apparently the document builds after the browser is technically
' finished navigating. So we wait until the document readyState
' "complete" - yes, it's a text state, not a flag...
Do Until doc.readyState = "complete"
WScript.Sleep 100
Loop
Set table = ie.Document.getElementById("datatable")
Set links = table.getElementsByTagName("a")
for each link in links
WScript.Echo link.href
next
[2008-05-09]
And that reminded me that I have used WatiN, a .NET library ostensibly for testing GUIs but really ideal for screen-scraping HTML pages -- particularly because it has Intellisense. So I wrote a program using WatiN that does everything:
using System;
using System.Globalization;
using System.Collections.Generic;
using WatiN.Core;
namespace YahooNotepad
{
class YahooLink
{
private string url, folder, title;
DateTime dt;
public YahooLink(string url, string folder,
string title, DateTime dt)
{
this.dt = dt;
this.title = title;
this.url = url;
this.folder = folder;
}
public string Url { get { return this.url; } }
}
class Program
{
static DateTime hackParse(string dateStr)
{
string[] ds = dateStr.Split(new char[] { ' ' });
string[] datePart = ds[0].Split(new char[] { '/' });
string[] timePart = ds[1].Split(new char[] { ':' });
int year = 2000 + int.Parse(datePart[2]);
int month = int.Parse(datePart[0]);
int day = int.Parse(datePart[1]);
int hour = int.Parse(timePart[0]);
if (ds[2] == "pm" && hour > 12)
hour += 12;
int minute = int.Parse(timePart[1]);
DateTime dt = new DateTime(year, month, day,
hour, minute, 0);
return dt;
}
static bool processPages(IE ie, Queue<YahooLink> notes)
{
Table table = ie.Table("datatable");
// Skip the <HT> row
for (int i = 1; i < table.TableRows.Length; i++)
{
TableRow row = (TableRow) table.TableRows[i];
TableCell href = (TableCell) row.TableCells[1];
Link link = (Link) href.Links[0];
TableCell folder = (TableCell) row.TableCells[2];
TableCell date = (TableCell)row.TableCells[3];
DateTime dt = hackParse(date.Text);
Console.WriteLine("{0} (in {1} on {2}): {3}",
link.OuterText, folder.Text, dt, link.Url);
YahooLink yl = new YahooLink(link.Url, folder.Text,
link.OuterText, dt);
notes.Enqueue(yl);
}
try
{
ie.Link(Find.ByText("Next")).Click();
return true;
}
catch (Exception)
{
return false;
}
}
[STAThread]
static void Main(string[] args)
{
string userName = "hughdbrown@yahoo.com";
string password = "?";
string website = "http://notepad.yahoo.com";
using (IE ie = new IE(website))
{
try
{
ie.TextField(Find.ById("username"))
.TypeText(userName);
ie.TextField(Find.ById("passwd"))
.TypeText(password);
ie.Button(Find.ByValue("Sign In"))
.Click();
}
catch (Exception)
{
// Carry on -- probably already logged in
}
Queue<YahooLink> notes = new Queue<YahooLink>();
while (processPages(ie, notes))
;
foreach (YahooLink yl in notes)
{
ie.GoTo(yl.Url);
TextField tf = ie.TextField(Find
.ById("charCountTA"));
Console.WriteLine(tf.Text);
}
ie.Link(Find.ById("Sign Out")).Click();
}
}
}
}
There are a couple of hacks:
- The DateTime parsing failed on 4/30/08 for no apparent reason, so I just hacked up a quick-and-dirty date parser for American-style dates.
- If you are already logged in to notepad.yahoo.com, it skips the extra log in by catching a throw exception.
- Yahoo seems to find it peculiar that I am looking at 2000 notes and pulling them all down. It raises a security warning sometime after I have pulled down a few hundred. Maybe I'll have to add a call to sleep().