LinkScan

LinkScan for Unix. Reference Manual.

Section 29


LinkScan Application Notes

  1. LinkScan to Email Interface
  2. Testing Wireless Servers with LinkScan
  3. Testing Secure Servers with LinkScan
  4. Testing Japanese Language Sites with LinkScan
  5. Google Sitemaps
  6. XML Documents

29.1 LinkScan to Email Interface

LinkScan incorporates several functions that relate to electronic mail. These include:

Some or all of the following parameters must be configured in order to use these functions:

Windows Systems -- linkscan.sys

Sendmailpath = perl utils/sendmail.pl
Smtphost = smtp.example.com
Hostname = www.example.com
Mailfrom = LinkScan@example.com
Nameservers =
[...]
Mailto = 1

Unix Systems -- linkscan.sys

Sendmailpath = /usr/lib/sendmail -t
Smtphost = 
Hostname = www.example.com
Mailfrom = LinkScan@example.com
Nameservers =
[...]
Mailto = 1

linkscan.cfg

For completeness, we address two related settings in the linkscan.cfg file:

Mailhost = example.com
Checkmailto = 0

29.2 Testing Wireless Servers with LinkScan

LinkScan includes support for the Wireless Application Protocol (WAP) and Wireless Markup Language (WML). This allows LinkScan to validate wireless sites via an HTTP gateway. Typically, you will need to add the following configuration commands to linkscan.cfg:


Extraheader User-Agent: Nokia7110/1.0 (04.80)
Mimetypes text/vnd.wap.wml H

This will cause LinkScan to send an appropriate User-Agent header with each request and to parse/follow documents with a MIME/Content-Type of text/vnd.wap.wml.
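
The effect of these two directives can be illustrated outside LinkScan. The following is a minimal Python sketch, not LinkScan's implementation: it builds (without sending) an HTTP request that carries the same User-Agent header, and checks whether a response's Content-Type names the WML MIME type. The function names and the host wap.example.com are hypothetical.

```python
from urllib.request import Request

WAP_USER_AGENT = "Nokia7110/1.0 (04.80)"  # value from the Extraheader line
WML_MIME_TYPE = "text/vnd.wap.wml"        # value from the Mimetypes line


def build_wap_request(url):
    """Build (but do not send) a request carrying the WAP User-Agent header."""
    return Request(url, headers={"User-Agent": WAP_USER_AGENT})


def is_wml(content_type_header):
    """Return True if a Content-Type header value names the WML MIME type.

    Parameters such as "; charset=UTF-8" are ignored and the comparison
    is case-insensitive.
    """
    mime_type = content_type_header.split(";", 1)[0].strip().lower()
    return mime_type == WML_MIME_TYPE


if __name__ == "__main__":
    # wap.example.com is a hypothetical host used only for illustration.
    request = build_wap_request("http://wap.example.com/index.wml")
    print(request.get_header("User-agent"))
    print(is_wml("text/vnd.wap.wml; charset=UTF-8"))
```

Passing the request to urllib.request.urlopen would send it; the sketch stops short of that so it can be run without a WAP gateway.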

29.3 Testing Secure Servers with LinkScan

LinkScan may be configured to test websites hosted on secure servers running the Secure Sockets Layer (SSL), i.e. sites with URLs of the form https://www.example.com/.

On the Microsoft Windows platforms, you need only specify the URL of the site to be scanned. LinkScan includes native support for the Secure Sockets Layer.

On Unix systems, you will need to install additional software to handle the SSL encryption. The required packages are OpenSSL and the Net::SSLeay Perl module.

At the time of writing, LinkScan has been tested with OpenSSL version 0.9.6 and Net::SSLeay version 1.05.

Installation of both packages is very straightforward if you have root access:



cd $HOME/openssl-0.9.6
./config
make
make test
make install   # See Note 1

cd $HOME/Net_SSLeay.pm-1.05
perl Makefile.PL
make
make test      # See Note 2
make install   # See Note 1

Note 1: The make install steps may fail if you do not have root access. In that case, you may install and run these packages from a user directory by using something like this:


cd $HOME/openssl-0.9.6
./config --openssldir=$HOME/myopenssl
make
make test
make install

cd $HOME/Net_SSLeay.pm-1.05
perl Makefile.PL $HOME/myopenssl
make
make test
mv ./blib/lib/Net/ /usr/www/linkscan/
mv ./blib/lib/auto/ /usr/www/linkscan/

Note 2: The make test on Net::SSLeay will produce a number of errors. In general, you can safely ignore them.

Once the module Net::SSLeay has been successfully installed, LinkScan will be able to scan https://... sites without any additional configuration changes.

Disclaimer

Each of the above referenced programs (with the exception of LinkScan) is maintained by parties other than Electronic Software Publishing Corporation. You are solely responsible for your use of those products and your compliance with any applicable software license agreements. Several of the referenced products contain encryption algorithms, the distribution and use of which may be subject to various laws and regulations. You are solely responsible for compliance.

29.4 Testing Japanese Language Sites with LinkScan

When scanning sites that contain (in whole or in part) Japanese pages, include the following directives in the Project configuration file (on Windows systems, via the Advanced Tab of the Project Planning Property Sheet):


Jisencode = 1
Displaylang = EUC-JP

Pages containing JIS, Shift-JIS and/or EUC-JP encoded Japanese characters will be normalized to EUC-JP. This means, for example, that the TITLE tags extracted from different documents may be combined in a single summary document (e.g. the LinkScan SiteMap) even though the original pages were constructed with different encodings.

The encoding type of each document is stored in the LinkScan database together with the MIME type (Content-Type). The Search Documents Report may be used to search/display this data and help enforce consistent encoding standards across mixed language sites.
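
The normalization described above can be sketched in Python. This is an illustration only and makes no claim about how LinkScan detects encodings: it assumes a simple trial-decoding heuristic (JIS first, then EUC-JP, then Shift-JIS), and the function name is hypothetical. Some byte sequences are valid in more than one encoding, so the first codec that decodes cleanly wins; plain ASCII input, for example, is reported as JIS.

```python
from typing import Tuple

# Trial order is an assumption of this sketch. JIS (ISO-2022-JP) is tried
# first because it is a 7-bit encoding and rejects the high bytes used by
# EUC-JP and Shift-JIS.
CANDIDATE_ENCODINGS = ("iso2022_jp", "euc_jp", "shift_jis")


def normalize_to_euc_jp(raw: bytes) -> Tuple[str, bytes]:
    """Return (detected_encoding, bytes re-encoded as EUC-JP)."""
    for name in CANDIDATE_ENCODINGS:
        try:
            text = raw.decode(name)
        except UnicodeDecodeError:
            continue  # not valid in this encoding; try the next one
        return name, text.encode("euc_jp")
    raise ValueError("input is not valid JIS, EUC-JP, or Shift-JIS")


if __name__ == "__main__":
    title = "日本語"
    for encoding in CANDIDATE_ENCODINGS:
        detected, normalized = normalize_to_euc_jp(title.encode(encoding))
        print(detected, normalized == title.encode("euc_jp"))
```

Returning the detected encoding alongside the normalized bytes mirrors the idea of recording each document's original encoding while presenting titles in one common encoding.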

29.5 Google Sitemaps

LinkScan automatically creates an XML Sitemap file in a format suitable for submission to Google Sitemaps. For more background, see Google Webmaster Help Center.

The file is named sitemap.xml and resides in the Project subdirectory of the LinkScan installation directory.

The file is formatted in compliance with the Google Sitemaps Protocol. However, Google recommends that the file be compressed using gzip. The gzip utility is standard on most UNIX systems. Windows users may download a free command line implementation of gzip from http://www.gzip.org/.
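
Where the gzip utility is not available, the same compression can be done with a short script. The following Python sketch uses only the standard library; the function name and the path passed to it are hypothetical, and the original sitemap.xml is left in place.

```python
import gzip
import shutil
from pathlib import Path


def gzip_sitemap(xml_path):
    """Write a gzip-compressed copy of xml_path next to it, adding '.gz'."""
    xml_path = Path(xml_path)
    gz_path = xml_path.with_name(xml_path.name + ".gz")
    with open(xml_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # stream the file; no need to load it all
    return gz_path


if __name__ == "__main__":
    # Assumes sitemap.xml is in the current directory; adjust the path to
    # the Project subdirectory of your own installation.
    print(gzip_sitemap("sitemap.xml"))
```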

LinkScan produces the sitemap.xml file with the following Google-defined fields for each web page listed:

In addition, LinkScan will optionally limit the scope of the Google Sitemap to the first "N" levels (as defined by the LinkScan Link Order SiteMap). This may be defined by adding a Gsmlevels command to the Project linkscan.cfg file [Windows users: add this command via the Advanced Tab of the Project Planning Property Sheet].

29.6 XML Documents

As of version 11.6, LinkScan can parse and extract links from the following document types:

The following paragraphs describe how to use LinkScan to scan XML (or other similarly formatted) documents. Activating and configuring the XML parser involves two basic steps.

  1. First, LinkScan must be told to route documents of the appropriate type to the XML parser for analysis. On UNIX systems this may be done with the Mimetypes and Filetypes directives in the linkscan.cfg file.

    Mimetypes text/xml X
    
    Filetypes xml X
    

    On Windows systems, these options may be set via the Mimes and Files Tabs of the Project Planning Property Sheet.

    The former is used with HTTP Scanning and routes all documents served with a Content-Type: text/xml header to the XML parser. The latter is used with File System Scanning and routes all files with a .xml file extension to the XML parser.

  2. Second, LinkScan must be told how to extract links from the XML document. This is done via Regular Expressions and is best illustrated by example. Suppose we have an XML document organized like this:

    <?xml version="1.0" encoding="ISO-8859-15"?>
    <link>
      <linkUrl>http://www.elsop.com/</linkUrl>
      <linkText>LinkScan</linkText>
      <linkTarget>_blank</linkTarget>
      <linkRef>000012345678</linkRef>
    </link>
    

    We construct an Xmlmatch directive and add it to the linkscan.cfg file:

    Xmlmatch  = <linkUrl>([^<]+)</linkUrl>.*?<linkText>([^<]+)</linkText> $1 $2
    

    LinkScan will now extract the link (http://www.elsop.com/) and the associated caption (LinkScan) from that XML file.
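
The pattern in the Xmlmatch directive above can be tried out before it is added to linkscan.cfg. The following Python sketch applies the same regular expression to the sample document. Python's re module stands in for LinkScan's own matching here, and the re.DOTALL flag is an assumption of this sketch so that .*? can span the line break between the two elements; how LinkScan itself treats newlines is not stated in this section. The function name is hypothetical.

```python
import re

XML_DOC = """<?xml version="1.0" encoding="ISO-8859-15"?>
<link>
  <linkUrl>http://www.elsop.com/</linkUrl>
  <linkText>LinkScan</linkText>
  <linkTarget>_blank</linkTarget>
  <linkRef>000012345678</linkRef>
</link>
"""

# Same pattern as the Xmlmatch directive; group 1 is the URL ($1) and
# group 2 is the caption ($2).
XMLMATCH = re.compile(
    r"<linkUrl>([^<]+)</linkUrl>.*?<linkText>([^<]+)</linkText>",
    re.DOTALL,  # assumption: lets .*? cross the newline between elements
)


def extract_links(xml_text):
    """Return a list of (url, caption) pairs found in xml_text."""
    return [(m.group(1), m.group(2)) for m in XMLMATCH.finditer(xml_text)]


if __name__ == "__main__":
    print(extract_links(XML_DOC))
    # prints [('http://www.elsop.com/', 'LinkScan')]
```

Because the character class [^<]+ stops at the next tag, each capture holds only the element's text, and the non-greedy .*? keeps each URL paired with the nearest following caption when a file contains several link records.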

The new parser means that LinkScan can now be used to quickly and accurately extract links from XML and similarly formatted data files.

LinkScan Version 12.1
© Copyright 1997-2010 Electronic Software Publishing Corporation (Elsop)
LinkScan™ and Elsop™ are Trademarks of Electronic Software Publishing Corporation
