LinkScan

LinkScan for Unix. Reference Manual.

Section 28

LinkScan File Formats


The following notes describe the format of many of
the LinkScan database files stored in:

...LinkScan/ProjectName/data/
...LinkScan/ProjectName/hist/

Each file is written in (mainly) ASCII format, with one
Record per Line. Each Record contains a number of Fields,
delimited by <Control-G> characters (the ASCII BEL character,
octal 007), except where a file is noted as TAB delimited.
The Fields for each Record type are listed below; Field
numbers start at 0.
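
A minimal sketch of reading such a file in Python, assuming only
what is stated above (one Record per Line, Fields separated by
Control-G). The names parse_records and read_records are
hypothetical and are not part of LinkScan. Decoding as latin-1 is
an assumption, chosen because the files are only "mainly" ASCII
and latin-1 never fails on a stray byte.

```python
FIELD_SEP = "\x07"  # Control-G (octal 007)


def parse_records(lines):
    """Yield each non-empty line as a list of Fields split on Control-G."""
    for line in lines:
        line = line.rstrip("\r\n")
        if line:
            yield line.split(FIELD_SEP)


def read_records(path):
    """Read every Record from one data file (hypothetical helper)."""
    with open(path, encoding="latin-1") as handle:
        return list(parse_records(handle))
```

For example, read_records("idx.dat") would return a list of
[idx, URL, Document Title] lists if the file follows the idx.dat
layout below.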

idx.dat
=======
Establishes the mapping between an "idx" number and each
unique Document/Link/URL examined by LinkScan.

 0 = idx
 1 = URL
 2 = Document Title


doc.dat
=======
Contains the attributes and characteristics for each unique
Document/Link/URL examined by LinkScan.

 0 = idx (see idx.dat)
 1 = URL
 2 = Owner Code (see linkscan.own)
 3 = Clicks
 4 = Link Type (see below)
 5 = Content-Type (MIME)
 6 = Link Status Code (see codes.txt)
 7 = Extended Status (normally blank)
 8 = Location for Redirect (see idx.dat)
 9 = Original Status Code (pre-redirect)
10 = Content-Length (size in bytes)
11 = Last-Modified (date/time)
12 = Reserved
13 = File System Pathname
14 = Document Title
15 = In-line bytes (page weight)
16 = Number of Errors in this document
17 = Number of Warnings in this document


orp.dat
=======
Contains information concerning all Orphaned Files.

 0 = URL
 1 = File System Pathname
 2 = Symlink (0=No; 1=Followed symlink; 2=Is symlink)
 3 = File Size
 4 = Date/Time last modified
 5 = Owner Code (see linkscan.own)
 6 = Link Type (see below)
 7 = Link Status Code (see codes.txt)


mad.dat and map.dat
===================
Contain the LinkScan SiteMap data. Both files use the same
Record layout and differ only in the order of the Records:
mad.dat -- directory order
map.dat -- link order

 0 = Level in Map
 1 = Dot-Decimal Notation
 2 = Document URL
 3 = Document Title
 4 = Owner Code (see linkscan.own)
 5 = Content-Length (size in bytes)
 6 = Last-Modified (date/time)
 7 = Total # of child documents for this node


lnk.dat
=======
Contains the attributes of every link considered by LinkScan.

 0 = Owner Code (see linkscan.own)
 1 = From URL (see idx.dat)
 2 = Line Number (times 10)
 3 = To URL (see idx.dat)
 4 = Link Type Code (see below)
 5 = Link Status Code (see codes.txt)
 6 = Extended Status (normally blank)
 7 = cnt
 8 = Link Caption/Description
 9 = File Size (in-line images only)
10 = Redirect location (see idx.dat)


err.dat
=======
A subset of the lnk.dat file, using the same Record layout.
Records relating to good links are excluded.


linkscan.own
============
Establishes the mapping between the Owner Code and Owner Name.

 0 = Owner Name
 1 = Owner Code
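
Other files store the Owner Code rather than the Owner Name, so a
lookup keyed by code is usually what is needed. Note that the Name
comes first in each Record. The sketch assumes this file is
Control-G delimited, since it is not marked as TAB delimited; the
function name load_owners is hypothetical.

```python
def load_owners(lines):
    """Build a {Owner Code: Owner Name} dict from linkscan.own lines."""
    owners = {}
    for line in lines:
        fields = line.rstrip("\r\n").split("\x07")
        if len(fields) >= 2:
            name, code = fields[0], fields[1]  # Field 0 = Name, 1 = Code
            owners[code] = name
    return owners
```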


linkscan.sum
============
Summary Statistics Data (Note this file is TAB delimited)

 0 = Version
 1 = Date and time of scan
 2 = Total Documents
 3 = Missing Documents
 4 = Documents Containing Errors
 5 = Total Other Files
 6 = Missing Other Files
 7 = Total Anchors
 8 = Missing Anchors
 9 = Total External Links
10 = External Links Tested This Scan
11 = External Links with Errors
12 = External Links with Possible Errors
13 = External Links with Warnings
14 = Total Orphans


linkscan.tim
============
HTTP Transaction Times (Note this file is TAB delimited)

 0 = URL fetched
 1 = HTTP status code (200, 404, etc.)
 2 = Document size (bytes)
 3 = Document Body flag (0=not available; 1=available but not fetched;
     2=available and fetched)
 4 = Transaction time (milliseconds)
 5 = Redirect location

Notes:
* Transaction time includes the time taken to follow any
  redirects.
* Transaction time includes the time to fetch the document
  body for HTML and similar MIME types only.
* For other file types (images, for example) the transaction
  time does NOT include the body download. It does measure
  the network/server latency for the exchange of the full
  request and response headers. The additional download time
  can be estimated from the file size and the available
  connection bandwidth. That estimate is likely to be quite
  accurate, because the HTTP server only has to push the data
  from a file it has already located down a socket that is
  already open. Since most image file formats incorporate
  compression, further savings are unlikely even if the
  connection supports a compression scheme of its own.
* Timing is affected by the number of processes used for
  the scan and, to some extent, by the relative performance
  of the target server and the LinkScan machine.
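
The estimate described in the notes above can be sketched as
follows: download time is file size in bits divided by bandwidth.
The function names and the bandwidth parameter are hypothetical;
LinkScan does not record the connection bandwidth, so the caller
must supply it. The sketch also assumes that Fields 2 and 4 hold
plain numbers.

```python
def estimate_body_ms(size_bytes, bandwidth_bits_per_sec):
    """Estimated time in milliseconds to download size_bytes."""
    if bandwidth_bits_per_sec <= 0:
        raise ValueError("bandwidth must be positive")
    return size_bytes * 8 / bandwidth_bits_per_sec * 1000.0


def estimated_total_ms(tim_line, bandwidth_bits_per_sec):
    """Transaction time plus estimated body time for one linkscan.tim line."""
    fields = tim_line.rstrip("\r\n").split("\t")
    size = int(fields[2])             # Field 2: Document size (bytes)
    body_flag = fields[3]             # Field 3: Document Body flag
    transaction_ms = float(fields[4])  # Field 4: Transaction time
    if body_flag == "1":
        # Body was available but not fetched, so add the estimate.
        return transaction_ms + estimate_body_ms(size, bandwidth_bits_per_sec)
    return transaction_ms
```

The body estimate is added only when the Document Body flag is 1,
because a flag of 2 means the measured time already includes the
body download.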



hist/xxxxxx/dat
===============

History Data -- New File Created for Each Scan

 0 = Document URL
 1 = Owner Name
 2 = Document Type Code (see below)
 3 = Clicks
 4 = Content-Type (MIME)
 5 = Document Status Code (see codes.txt)
 6 = Content-Length (size in bytes)
 7 = Last-Modified (date/time)
 8 = Document Title


Document Type Codes
===================

 H = HTML Document
 D = PDF Document
 J = JavaScript Document
 M = Image Map
 S = Flash Document
 T = Text Document
 Y = Reserved
 Z = Import Document

 F = Other File Type
 I = In-line image
 N = Document with Nofollow rule
 O = Orphaned Document
 P = Orphaned File

 A = Anchor
 R = Redirection (internal)

 U = External link
 V = Redirection (external)
 X = Reserved (typically mailto: or invalid characters)
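
For reporting, the codes above can be held in a simple lookup
table. The dictionary below copies the descriptions from the list;
the name describe_type and the fallback text for unlisted codes are
choices made for this sketch only.

```python
DOCUMENT_TYPE_CODES = {
    "H": "HTML Document",
    "D": "PDF Document",
    "J": "JavaScript Document",
    "M": "Image Map",
    "S": "Flash Document",
    "T": "Text Document",
    "Y": "Reserved",
    "Z": "Import Document",
    "F": "Other File Type",
    "I": "In-line image",
    "N": "Document with Nofollow rule",
    "O": "Orphaned Document",
    "P": "Orphaned File",
    "A": "Anchor",
    "R": "Redirection (internal)",
    "U": "External link",
    "V": "Redirection (external)",
    "X": "Reserved (typically mailto: or invalid characters)",
}


def describe_type(code):
    """Return the description for a type code, or a fallback string."""
    return DOCUMENT_TYPE_CODES.get(code, "Unknown code: " + code)
```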

LinkScan Version 12.1
© Copyright 1997-2010 Electronic Software Publishing Corporation (Elsop)
LinkScan™ and Elsop™ are Trademarks of Electronic Software Publishing Corporation
