user:mhoram:scripts:wiki_orphans

Introduction

This is a Perl script which walks through the Crossfire DokuWiki index, fetches the backlink listing for every page, and reports any orphaned pages – those that are never linked to from other DokuWiki pages. (In DokuWiki parlance, they have no backlinks.) If these pages aren't under heavy construction, they should probably be linked to from somewhere so they can be found, or deleted if no longer needed.

This was a fairly quick hack, so it doesn't have all the error-checking that a completed script should, but I'm out of time for today, and it does appear to work. Thanks to Rednaxela for the idea.
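
To make the idea of an orphan concrete, here is a minimal offline sketch. It does not touch the wiki: the link map `%links_from` and the helper `find_orphans` are made up for illustration and are not part of the script below. It simply counts incoming links per page and lists the pages with none.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical link map: each page ID maps to the pages it links to.
my %links_from = (
    'start'        => [ 'people', 'servers' ],
    'people'       => [ 'start' ],
    'servers'      => [ 'start', 'people' ],
    'servers:cat2' => [ 'servers' ],
);

# Return the sorted list of pages that no other page links to.
sub find_orphans {
    my ($links) = @_;
    my %backlinks = map { $_ => 0 } keys %$links;
    for my $source (keys %$links) {
        for my $target (@{ $links->{$source} }) {
            next if $target eq $source;    # a self-link is not a backlink
            $backlinks{$target}++;
        }
    }
    return grep { !$backlinks{$_} } sort keys %backlinks;
}

my @orphans = find_orphans(\%links_from);
print "$_\n" for @orphans;
```

In this made-up map only servers:cat2 has no incoming links, so it is the only page printed.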

Requirements

  • Perl
  • LWP::Simple (a perl module)
  • A reasonably fast internet connection. It takes a few minutes across my wireless, so would probably take hours over dialup.
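
A quick way to check the second requirement before running the script is to try loading the module at run time. This is only a convenience sketch; the variable name `$have_lwp` is my own.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Load LWP::Simple at run time rather than compile time, so a missing
# module produces a readable message instead of a compile error.
my $have_lwp = eval { require LWP::Simple; 1 } ? 1 : 0;

if ($have_lwp) {
    print "LWP::Simple is available\n";
} else {
    print "LWP::Simple is missing; it is part of the libwww-perl distribution on CPAN\n";
}
```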

Results

Here are the results of a run on 2006-12-27, with comments after the dash.

  • client_side_scripting:scripts:c
  • monsters:a – Under heavy construction; will be linked from a monster index.
  • playground – Makes sense.
  • servers:cat2
  • start
  • unlinked_test_page – Page for testing this script; could be deleted at any time.
  • user:kshinji – Should be added to people.
  • user_cavesomething_todo_spelldescriptions

Code

#!/usr/bin/perl
use strict;
use warnings;
 
# This script walks the index tree of the Crossfire dokuwiki
# and counts the number of links to each page.  Those with zero
# links are reported as orphans.
 
use LWP::Simple;
 
my $base_url  = 'http://wiki.metalforge.net';
my $base_path = '/doku.php';
my %page;  # stores page paths, whether they've been followed, and a count
my %did_index;  # stores index directories that have been expanded
my $DEBUG = 0;
 
check_link($base_url.$base_path."/start?do=index");
 
for my $link (sort keys %page){
  next if $link =~ /^\Q$base_path\E/;  # \Q..\E: match the "." in doku.php literally
  print "$link\n" unless $page{$link}->{count};
}
 
sub check_link {
  my $path = shift;
  debug("Checking link $path");
  my $index_text = get( $path );
  # get() returns undef when the fetch fails
  unless( defined $index_text ){
    warn "Could not fetch $path\n";
    return;
  }
  # discard everything before the start of the page content
  $index_text =~ s{^.+?wikipage start}{}s;
  my(@index_links) = $index_text =~ m{"\Q$base_path\E/([^"]+?)"}g;
  for my $index_link (@index_links){
    if( $index_link =~ m{\?idx=(.+)$} ){
      # this is an index directory, recurse through it once
      unless( $did_index{$index_link} ){
        $did_index{$index_link} = 1;
        check_link("$base_url$base_path/$index_link");
      }
    } else {
      # this is a page, fetch its backlink listing if it hasn't been done yet
      unless( $page{$index_link}->{done} ){
        debug("Getting $base_url$base_path/$index_link?do=backlink");
        my $text = get("$base_url$base_path/$index_link?do=backlink");
        unless( defined $text ){
          warn "Could not fetch backlinks for $index_link; reporting it as an orphan\n";
          $text = '';
        }
        $text =~ s{^.+?wikipage start}{}s;
        # "() =" forces list context, so this stores the number of matches
        # rather than just true/false
        $page{$index_link}->{count} = () = $text =~ m{wikilink1}g;
        $page{$index_link}->{done} = 1;
      }
    }
  }
}
 
 
sub debug {
  print STDERR @_, "\n" if $DEBUG;
}
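
The way check_link sorts the links on an index page into directories and pages can be tried offline. The sketch below runs the same two patterns against a small HTML fragment; the fragment is made up for illustration, and the markup the wiki actually serves may differ.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $base_path = '/doku.php';

# Hypothetical fragment of an index page; the real markup may differ.
my $index_html = <<'HTML';
<!-- wikipage start -->
<a href="/doku.php/start?idx=servers" class="idx_dir">servers</a>
<a href="/doku.php/playground" class="wikilink1">playground</a>
<a href="/doku.php/servers:cat2" class="wikilink1">cat2</a>
<a href="http://example.org/elsewhere">external</a>
HTML

# Capture whatever follows "/doku.php/" inside a quoted attribute.
my @targets = $index_html =~ m{"\Q$base_path\E/([^"]+?)"}g;

my (@dirs, @pages);
for my $target (@targets) {
    if ($target =~ m{\?idx=(.+)$}) {
        push @dirs, $1;          # index directory: the script recurses into these
    } else {
        push @pages, $target;    # ordinary page: the script checks its backlinks
    }
}

print "directories: @dirs\n";
print "pages: @pages\n";
```

The external link is skipped because it does not start with the wiki's base path.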

Notes & Comments

  • I see you have it set up to manually read each page for links. It would probably be better/easier/faster for it to parse the “backlink” page for each page instead. To get the backlink page, just add the “do=backlink” parameter to the page URL, for example http://wiki.metalforge.net/doku.php/user:mhoram:scripts:wiki_orphans?do=backlink. This is normally accessed in dokuwiki by clicking the page title in the top left. – Alex Schultz 2006/12/20 17:57
    • But that would be so easy, and I've already written so much lovely code to do it the hard way. ;-) It probably would be much lighter on bandwidth, though, since the “backlink” page will often be considerably smaller than the page itself, so I'll change it to work that way. – Aaron Baugher 2006/12/21 15:25
      • This idea has now been incorporated. – Aaron Baugher 2006/12/27 14:46
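
As a rough sketch of the backlink approach discussed above: the helpers `backlink_url` and `count_backlinks` are names I made up for this example, and the sample HTML is invented, so treat the markup as an assumption about what a backlink page looks like. Nothing is fetched from the network.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $base_url  = 'http://wiki.metalforge.net';
my $base_path = '/doku.php';

# Build the backlink URL for a page ID, following the example URL above.
sub backlink_url {
    my ($page_id) = @_;
    return "$base_url$base_path/$page_id?do=backlink";
}

# Count links to existing pages (class "wikilink1") that appear after
# the "wikipage start" marker; anything before it is navigation.
sub count_backlinks {
    my ($html) = @_;
    return 0 unless defined $html;
    $html =~ s{^.+?wikipage start}{}s;
    my $count = () = $html =~ /wikilink1/g;   # list context yields the match count
    return $count;
}

# Hypothetical backlink page body.
my $sample = <<'HTML';
<a class="wikilink1" href="/doku.php/start">navigation link</a>
<!-- wikipage start -->
<a href="/doku.php/people" class="wikilink1">people</a>
<a href="/doku.php/start" class="wikilink1">start</a>
HTML

my $url   = backlink_url('user:kshinji');
my $count = count_backlinks($sample);
print "$url\n";
print "backlinks found: $count\n";
```

The navigation link before the marker is not counted, so the sample yields two backlinks. A page whose count is zero would be reported as an orphan.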

user/mhoram/scripts/wiki_orphans.txt · Last modified: 2010/11/12 11:38 (external edit)