Software/Spider
From ThorstensHome
Revision as of 09:09, 18 October 2008
Spider is a program that follows all links in a given HTML file.
Perl program
The following code lists all links in an html file.
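    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::LinkExtor;

    # Callback: called once for every tag that carries link attributes.
    sub parse {
        my ($tag, %links) = @_;
        # A tag can carry more than one link attribute, so print them all.
        print "$_\n" for values %links;
    }

    my $p = HTML::LinkExtor->new(\&parse);

    if (@ARGV) {
        $p->parse_file($ARGV[0]);
    } else {
        print "Usage: spider.pl htmlfile.htm\n";
    }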
KDE programs
→ Download version for KDE 4
→ Download version for KDE 3
spider for KDE 4
Problem:
<body>
is recognized correctly, but not
<body lang=DE link=blue vlink=purple bgcolor=#eeeeff>
This is because XML requires attribute values to be quoted:
<body lang="DE" link="blue" vlink="purple" bgcolor="#eeeeff">
Solution:
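Run tidy to convert your HTML file to XHTML. Or use QXmlQuery:
    #include <QXmlQuery>
    #include <QStringList>

    QXmlQuery query;
    // The input document is named inside the query with doc();
    // string() turns every h1 node into a string for the QStringList.
    query.setQuery("doc('index.html')/html/body/h1/string()");
    QStringList headings;
    query.evaluateTo(&headings);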
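The two body tags in the problem description differ only in the quoting of their attribute values. As a rough illustration of that single repair (tidy fixes much more than this), the following C++ sketch wraps the unquoted attribute values of one start tag in double quotes. The function name quoteAttributes is made up for this example and belongs neither to spider nor to Qt. It assumes the input is a single start tag and makes no attempt to fix other HTML irregularities.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not part of spider or Qt): returns a copy of one HTML
// start tag in which every unquoted attribute value is wrapped in double
// quotes. Values that are already quoted are copied unchanged.
std::string quoteAttributes(const std::string& tag)
{
    std::string out;
    std::size_t i = 0;
    while (i < tag.size()) {
        const char c = tag[i];
        out += c;
        ++i;
        if (c != '=' || i >= tag.size())
            continue;                       // not at the start of a value
        if (tag[i] == '"' || tag[i] == '\'') {
            // Already quoted: copy everything through the closing quote.
            std::size_t end = tag.find(tag[i], i + 1);
            if (end == std::string::npos)
                end = tag.size() - 1;       // unterminated quote: copy the rest
            out.append(tag, i, end - i + 1);
            i = end + 1;
        } else {
            // Unquoted: the value runs up to the next whitespace or '>'.
            std::size_t end = i;
            while (end < tag.size() && tag[end] != '>' &&
                   !std::isspace(static_cast<unsigned char>(tag[end])))
                ++end;
            out += '"';
            out.append(tag, i, end - i);
            out += '"';
            i = end;
        }
    }
    return out;
}
```

Called with the problematic tag from above, quoteAttributes("<body lang=DE link=blue vlink=purple bgcolor=#eeeeff>") returns the quoted form <body lang="DE" link="blue" vlink="purple" bgcolor="#eeeeff">. A tag without attributes, such as <body>, comes back unchanged.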
Spider.pl
Spider.pl follows all links in an html file:
    #!/usr/bin/perl
    # spider.pl (c) 2008 by Thorsten Staerk
    # This program extracts links in a web page and follows them.
    use strict;
    use warnings;
    use HTML::LinkExtor;
    use LWP::UserAgent;

    my $agent = LWP::UserAgent->new;
    my %seen;    # URLs fetched so far, so that circular links terminate

    get("http://www.heise.de");

    # Download $url, save it to a file and follow every link found in it.
    sub get {
        my ($url) = @_;
        print "Entering get $url\n";

        # Return if this is no http:// URL or if it was fetched before.
        if ($url !~ m{^http://} or $seen{$url}++) {
            print "returning\n";
            return;
        }

        my $request = HTTP::Request->new(GET => $url);
        $request->header('Accept' => 'text/html');
        my $res = $agent->request($request);
        if (!$res->is_success) {
            print "Error: " . $res->status_line . "\n";
            return;
        }
        print "successfully got response\n";

        # File name: the URL without "http://", with "/" replaced by "-".
        (my $filename = $url) =~ s{^http://}{};
        $filename =~ s{/}{-}g;
        open(my $fh, '>', $filename) or die "Cannot write $filename: $!";
        print $fh $res->content;
        close($fh);
        print "printing file $filename\n";

        # Collect the links while parsing and follow them afterwards,
        # so that the parser is never re-entered from its own callback.
        my @links;
        my $p = HTML::LinkExtor->new(sub {
            my ($tag, %attr) = @_;
            push @links, values %attr;
        });
        $p->parse_file($filename);
        get($_) for @links;
    }
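The URL handling in spider.pl comes down to two string operations: a URL is only followed if it contains http://, and the local file name is the URL without that prefix, with every / replaced by -. The sketch below restates this rule in C++, the language of the QXmlQuery snippet, so that it can be tried without network access. The name urlToFilename is hypothetical. The sketch assumes the prefix must stand at the very start of the URL, and it returns an empty string for URLs that spider.pl would skip.

```cpp
#include <string>

// Hypothetical helper restating spider.pl's naming rule: returns the local
// file name for an http:// URL, or an empty string for any other URL.
std::string urlToFilename(const std::string& url)
{
    const std::string scheme = "http://";
    // Only URLs that start with http:// are followed.
    if (url.compare(0, scheme.size(), scheme) != 0)
        return std::string();
    // Drop the scheme, then turn path separators into dashes so that the
    // result is a single file name in the current directory.
    std::string filename = url.substr(scheme.size());
    for (char& c : filename) {
        if (c == '/')
            c = '-';
    }
    return filename;
}
```

For example, urlToFilename("http://www.heise.de/newsticker/index.html") yields www.heise.de-newsticker-index.html, while a mailto: link yields an empty string and is skipped.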