Software/Spider

From ThorstensHome
Revision as of 22:21, 5 October 2008 by WikiSysop (Talk)

Spider is a program that follows all links in a given HTML file.

Perl program

The following code lists all links in an HTML file.

#!/usr/bin/perl
use strict;
use warnings;
use HTML::LinkExtor;

# Called once for every tag that carries links.
# %links maps the attribute name (href, src, ...) to its URL.
sub parse
{
  my ($tag, %links) = @_;
  print "$_\n" for values %links;
}

my $p = HTML::LinkExtor->new(\&parse);
if ($ARGV[0])
{
  $p->parse_file($ARGV[0]);
}
else
{
  print "Usage: spider.pl htmlfile.htm\n";
}

KDE programs

Download version for KDE 4
Download version for KDE 3

spider for KDE 4

Problem:

<body> 

is recognized correctly, but not

<body lang=DE link=blue vlink=purple bgcolor=#eeeeff>

This is because XML requires attribute values to be quoted:

<body lang="DE" link="blue" vlink="purple" bgcolor="#eeeeff">

Solution:

Use tidy to convert your HTML file to XHTML, or use QXmlQuery:

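// QXmlQuery is part of the QtXmlPatterns module.
// setQuery() takes the query text; the document is named inside it with doc().
QXmlQuery query;
query.setQuery("doc('index.html')/html/body/h1/string()");
QStringList headings;
query.evaluateTo(&headings);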

Spider.pl

Spider.pl follows all links in an HTML file:

#!/usr/bin/perl
# spider.pl (c) 2008 by Thorsten Staerk
# This program extracts links in a web page and follows them.
use strict;
use warnings;
use HTML::LinkExtor;
use LWP::UserAgent;

my $maxlevel = 2;   # recursion depth limit (assumed value, adjust as needed)
my $agent = LWP::UserAgent->new;

# Download $url, save it to a file and follow the links found in it.
sub get
{
  my ($url, $level) = @_;
  print "Entering get $url\n";
  # return if no http:// url
  if ($url !~ m{^http://}) { print "returning\n"; return; }
  return if $level > $maxlevel;
  my $request = HTTP::Request->new(GET => $url);
  $request->header('Accept' => 'text/html');
  my $res = $agent->request($request);
  if ($res->is_success)
  {
    print "successfully got response\n";
    # file name: URL without http://, slashes replaced by dashes
    (my $filename = $url) =~ s{^http://}{};
    $filename =~ s{/}{-}g;
    open(my $file, ">", $filename) or die "Cannot write $filename: $!";
    print $file $res->content;
    close($file);
    print "wrote file $filename\n";
    # use a fresh parser per file, because get() calls itself recursively
    my $p = HTML::LinkExtor->new;
    $p->parse_file($filename);
    foreach my $link ($p->links)
    {
      my ($tag, %links) = @$link;
      get($_, $level + 1) for values %links;
    }
  }
  else
  {
    print "Error: " . $res->status_line . "\n";
  }
}

#int main
get("http://www.heise.de", 0);
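
Spider.pl fetches a page again every time another page links to it. One possible way to avoid this is a hash of URLs that were already visited. The following is only a sketch: the helper should_fetch and the hash %seen are hypothetical names that are not part of spider.pl, and treating "page.htm#top" and "page.htm" as the same page is an assumption.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %seen;   # URLs that were already offered to the spider

# Hypothetical helper, not part of spider.pl: returns 1 the first time
# an http:// URL is offered and 0 for repeats or other schemes.
sub should_fetch
{
  my ($url) = @_;
  return 0 unless defined $url;
  $url =~ s/#.*$//;                     # drop the fragment, it names the same page
  return 0 unless $url =~ m{^http://};  # same restriction as in spider.pl
  return 0 if $seen{$url}++;            # already visited
  return 1;
}

for my $url ('http://www.heise.de', 'http://www.heise.de#top', 'mailto:someone@example.org')
{
  my $verdict = should_fetch($url) ? "fetch" : "skip";
  print "$verdict $url\n";
}
```

In spider.pl, get() could call such a helper at its start and return immediately when the result is 0.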