Updated: 2002/06/10

NAME

aspseek-sql - the structure of SQL database tables used by ASPseek

SQL TABLES

wordurl

This table keeps information about each word in the main and real-time databases, one record per word.
Field             Description

word              Word itself.
word_id           Numeric ID of the word.
urls              Information about the sites and URLs in which the word is encountered. Empty if the size of the info is greater than 1000 bytes; in this case the info is stored in a separate file.
urlcount          Number of URLs in which the word is encountered.
totalcount        Total count of this word in all URLs.

The last 3 fields are used only if CompactStorage is set to no, and are updated after crawling finishes or when index(1) is run with the -D option.

wordurl1

This table keeps all information about each word in the real-time database, one record per word.
Field             Description

word              Word itself.
word_id           Numeric ID of the word, refers to wordurl.word.
urls              Information about the sites and URLs in which the word is encountered. Never empty, regardless of size.
urlcount          Number of URLs in which the word is encountered.
totalcount        Total count of this word in all URLs.

The last 3 fields are updated immediately after index(1) downloads a URL when it is run with the -T option.

urlword

This table keeps information about all encountered URLs, both indexed and not yet indexed, that match the conditions specified in the configuration files.
Field             Description

url_id            ID of URL.
site_id           ID of site, refers to sites.site_id.
deleted           Set to 1 if the server returned a 404 error and DeleteBad is set to yes, or if robots.txt or configuration rules disallow indexing this URL.
url               URL itself.
next_index_time   Time of next indexing in seconds from UNIX epoch.
status            HTTP status returned by the server, or 0 if the document has not been indexed yet.
crc               MD5 checksum of document.
last_modified     "Last-Modified" HTTP header returned by HTTP server.
etag              "ETag" header returned by HTTP server.
last_index_time   Time of last indexing in seconds from UNIX epoch.
referrer          ID of the URL which first referred to this URL.
tag               Arbitrary tag.
hops              Depth of URL in hyperlink tree.
redir             ID of the URL to which the current URL is redirected, or 0 if this URL is not redirected.
origin            URL ID of the document that is the origin of this cloned document, or 0 if this is not a clone.

urlwordsNN (where NN is 2-digit number from 00-15)

These tables contain additional info about existing indexed URLs. The number NN in the table name is URL_ID mod 16.
Field             Description

deleted           Set to 1 if the server returned a 404 error and DeleteBad is set to yes, or if robots.txt or configuration rules disallow indexing this URL.
wordcount         Count of unique words in the indexed part of URL.
totalcount        Total count of words in the indexed part of URL.
content_type      Content-Type HTTP header returned by server.
charset           Document charset taken from Content-Type HTTP header or META.
title             First 128 characters of the page's title.
txt               First 255 characters of the page body, stripped of HTML tags.
docsize           Total size of URL.
keywords          First 255 characters of the page keywords.
description       First 100 characters of the page description.
lang              Not used now.
words             Zipped content of URL.
hrefs             Sorted array of outgoing href IDs from this URL.
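
The table-naming rule above can be sketched in a few lines. This is only an illustration: the function name urlwords_table is hypothetical and is not part of ASPseek.

```python
def urlwords_table(url_id: int) -> str:
    """Return the name of the urlwordsNN table for the given URL ID."""
    # NN is URL_ID mod 16, written as a 2-digit decimal number (00-15).
    return f"urlwords{url_id % 16:02d}"


print(urlwords_table(5))   # urlwords05
print(urlwords_table(31))  # urlwords15
```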

robots

This table contains information parsed from the robots.txt file for each site.
Field             Description

hostinfo          Host name.
path              Path to exclude from indexing.

sites

This table contains IDs for all indexed sites.
Field             Description

site_id           ID of site.
site              Site name with protocol, like http://www.my.com/.

stat

This table contains query statistics, one record per completed query.
Field             Description

addr              IP address of the computer from which the query was requested.
proxy             IP address of the proxy server through which the query was requested.
query             Query string.
ul                URL limit used to restrict the query.
sp                Web spaces used to restrict the query.
site              Site ID used to restrict the query.
np                Results page number requested.
ps                Results per page.
sites             Number of found sites matching query.
urls              Number of found URLs matching query.
start             Query processing start in seconds from UNIX epoch.
finish            Query processing finish in seconds from UNIX epoch.
referer           URL of the web page from which the query was requested.

subsets

This table describes all subsets that can be used to restrict the search. It is populated manually with URL masks. A subset is the set of URLs from a particular directory of a site. Masks describing a whole site are not necessary.
Field             Description

subset_id         ID of subset.
mask              URL mask. Example: http://www.my.com/dir/%. Examples of wrong use: http://www.aspstreet.com/%, http://www.aspstreet/%.
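
To illustrate how such a mask selects URLs, here is a minimal sketch. It assumes that % matches any run of characters, as in an SQL LIKE pattern, and that every other character is literal. This page does not say how ASPseek itself evaluates masks, and the names mask_to_regex and in_subset are hypothetical.

```python
import re


def mask_to_regex(mask: str) -> re.Pattern:
    """Translate a URL mask into a compiled regular expression."""
    # Assumption: '%' is the only wildcard and matches any run of characters.
    literal_parts = (re.escape(part) for part in mask.split("%"))
    return re.compile(".*".join(literal_parts), re.DOTALL)


def in_subset(url: str, mask: str) -> bool:
    """Return True if the whole URL matches the mask."""
    return mask_to_regex(mask).fullmatch(url) is not None


mask = "http://www.my.com/dir/%"
print(in_subset("http://www.my.com/dir/page.html", mask))    # True
print(in_subset("http://www.my.com/other/page.html", mask))  # False
```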

spaces

This table describes web spaces. A web space is a set of sites. Each site belonging to a particular space must be put into a separate record. The table is populated manually or using the -A option of index. If it is populated manually, run index -B after changing this table.
Field             Description

space_id          ID of web space.
site_id           ID of site belonging to the space, refers to sites.site_id.

tmpurl

This table describes the URLs indexed since the start of the last indexing. It is used for debugging.
Field             Description

url_id            URL ID.
thread            Ordinal number of the thread that indexed the URL.

wordsite

Auxiliary table used when the search is restricted to a site pattern. It is built at the end of indexing from the sites table.
Field             Description

word              Word used in site name between dots.
sites             Array of IDs of the sites in which this word is encountered.

citation

This table contains the reverse index of hyperlinks. It is used only if IncrementalCitations is set to no.
Field             Description

url_id            URL ID.
referrers         Array of IDs of the URLs which have a hyperlink to this URL.

BLOBS

wordurl.urls, wordurl1.urls

Sites information, ordered by site_id.
Offset            Length            Description

0                 4                 Offset of URL info for 1st site.
4                 4                 ID of 1st site where word is encountered.
8                 4                 Offset of URL info for 2nd site.
12                4                 ID of 2nd site where word is encountered.
...
(N-1)*8           4                 Offset of URL info for Nth site, where N is the total number of sites in which word is encountered.
(N-1)*8+4         4                 ID of Nth site where word is encountered.
(N-1)*8+8         4                 Offset of the end of URL info for Nth site. Must point to the end of blob or file.

URLs information. Follows sites information immediately. Offsets are counted from 0.
Offset            Length            Description

0                 4                 URL ID of the first URL of the first site in sites info section.
4                 2                 Word count in this URL.
6                 2                 First position.
8                 2                 Second position.
...
6+(N-1)*2         2                 Nth position, where N is the total word count in the URL.
Repeated with info for URLs from the same site, with ID greater than previous.
...
Repeated with info for URLs for next sites from sites info section.
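
The layout above can be walked with a short parser. The following is a sketch under stated assumptions, not ASPseek's own code: integers are taken to be unsigned and little-endian, the number of sites is passed in by the caller because this page does not say how a reader determines it, and the demo treats URL offsets as relative to the start of the URLs section, which is only one reading of "Offsets are counted from 0". The names parse_sites and parse_urls are hypothetical.

```python
import struct


def parse_sites(blob: bytes, n_sites: int):
    """Read n_sites (offset, site_id) pairs plus the trailing end offset.

    Returns a list of (site_id, start_offset, end_offset) tuples.
    """
    # Assumption: little-endian unsigned 32-bit integers.
    fields = struct.unpack_from(f"<{2 * n_sites + 1}I", blob, 0)
    offsets = fields[0::2]   # n_sites start offsets, then the end offset
    site_ids = fields[1::2]  # n_sites site IDs
    return [(site_ids[i], offsets[i], offsets[i + 1]) for i in range(n_sites)]


def parse_urls(section: bytes, start: int, end: int):
    """Yield (url_id, positions) for each URL of one site."""
    pos = start
    while pos < end:
        # 4-byte URL ID followed by a 2-byte word count.
        url_id, count = struct.unpack_from("<IH", section, pos)
        # 'count' 2-byte word positions follow the 6-byte header.
        positions = struct.unpack_from(f"<{count}H", section, pos + 6)
        yield url_id, list(positions)
        pos += 6 + 2 * count


# Demo data: site 7 holds URL 3 (positions 1 and 9), site 12 holds URL 20 (position 4).
urls = struct.pack("<IH2H", 3, 2, 1, 9) + struct.pack("<IH1H", 20, 1, 4)
blob = struct.pack("<5I", 0, 7, 10, 12, len(urls)) + urls

n_sites = 2
urls_section = blob[n_sites * 8 + 4:]
for site_id, start, end in parse_sites(blob, n_sites):
    for url_id, positions in parse_urls(urls_section, start, end):
        print(site_id, url_id, positions)
```

If the offsets turn out to be counted from the start of the whole blob instead, pass blob rather than urls_section to parse_urls.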

urlwordsNN.words

This field contains gzipped content of URL.
Offset            Length            Description

0                 4                 Size of URL content before zipping, or 0xFFFFFFFF if content is not zipped.
4                 Zipped size       Zipped or original URL content.
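
A reader for this field might look like the sketch below. It assumes the 4-byte size is an unsigned little-endian integer and that the compressed payload is a zlib or gzip stream (the wbits setting used here accepts either). Neither detail is stated on this page, and the name decode_words is hypothetical.

```python
import struct
import zlib

NOT_ZIPPED = 0xFFFFFFFF  # marker from the table above: content stored as-is


def decode_words(blob: bytes) -> bytes:
    """Return the original URL content stored in a 'words' field."""
    # Assumption: little-endian unsigned 32-bit size prefix.
    (size,) = struct.unpack_from("<I", blob, 0)
    payload = blob[4:]
    if size == NOT_ZIPPED:
        return payload
    # MAX_WBITS | 32 lets zlib auto-detect a zlib or gzip header.
    content = zlib.decompress(payload, wbits=zlib.MAX_WBITS | 32)
    if len(content) != size:
        raise ValueError("unzipped size does not match the stored size")
    return content


raw = b"<html>example page</html>" * 4
zipped = struct.pack("<I", len(raw)) + zlib.compress(raw)
plain = struct.pack("<I", NOT_ZIPPED) + raw
print(decode_words(zipped) == raw, decode_words(plain) == raw)
```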

wordsite.sites

This field contains an array of sites/positions for the word, sorted by site ID.

Structure of array element:

Bits              Description

24-31             Bitmap of positions; the highest bit is set to 1 if the word is a first-level domain.
0-23              Site ID.
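
The bit layout of one array element can be unpacked as follows. This sketch assumes each element has already been read as a 32-bit unsigned integer; the name split_element is hypothetical.

```python
def split_element(element: int):
    """Split one wordsite.sites element into (site_id, bitmap, first_level)."""
    site_id = element & 0xFFFFFF      # bits 0-23
    bitmap = (element >> 24) & 0xFF   # bits 24-31
    first_level = bool(bitmap & 0x80) # highest bit of the element
    return site_id, bitmap, first_level


element = (0x81 << 24) | 1234
print(split_element(element))  # (1234, 129, True)
```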

FILES

/usr/local/aspseek/etc/DBType/tables.sql

SEE ALSO

aspseek(7), index(1), searchd(1).

AUTHORS

Copyright (C) 2000, 2001, 2002 by SWsoft.
Man page by Kir Kolyshkin <kir@asplinux.ru> and Alexander F. Avdonkin <al@asplinux.ru>.

