aspseek-sql(5) ASPseek user's guide aspseek-sql(5) NAME aspseek-sql - the structure of SQL database tables used by ASPseek SQL TABLES wordurl This table keeps information about each word in main and real-time database, one record per word. +-----------+---------------------------------------------+ |Field |Description | +-----------+---------------------------------------------+ +-----------+---------------------------------------------+ |word |Word itself. | +-----------+---------------------------------------------+ |word_id |Numeric ID of word. | +-----------+---------------------------------------------+ |urls |Information about sites and urls, in which | | |word is encountered. Empty if size of info | | |is greater than 1000 bytes, in this case | | |info is stored in separate file. | +-----------+---------------------------------------------+ |urlcount |Number of URLs in which word is encountered. | +-----------+---------------------------------------------+ |totalcount |Total count of this word in all URLs. | +-----------+---------------------------------------------+ Last 3 fields are used only if CompactStorage is set to no, and updated after finishing of crawling, or then index(1) is run with -D option. wordurl1 This table keeps all information about each word in real- time database, one record per word. +-----------+---------------------------------------------+ |Field |Description | +-----------+---------------------------------------------+ +-----------+---------------------------------------------+ |word |Word itself. | +-----------+---------------------------------------------+ |word_id |Numeric ID of word, refers to wordurl.word. | +-----------+---------------------------------------------+ |urls |Information about sites and urls in which | | |word is encountered. Always not empty | | |regardless of size. | +-----------+---------------------------------------------+ |urlcount |Number of URLs in which word is encountered. | +-----------+---------------------------------------------+ |totalcount |Total count of this word in all URLs. | +-----------+---------------------------------------------+ Last 3 fields are updated immediately after downloading of the URL by index(1) when it is run with -T option. urlword This table keeps information about all encountered URLs, both indexed and not indexed yet which match specified conditions in configuration files. +---------------+-----------------------------------------+ |Field |Description | +---------------+-----------------------------------------+ +---------------+-----------------------------------------+ |url_id |ID of URL. | +---------------+-----------------------------------------+ |site_id |ID of site, refers to sites.site_id. | +---------------+-----------------------------------------+ |deleted |Set to 1 if server returned 404 error | | |and DeleteBad is set to yes, or if | | |robots.txt or configuration rules disal- | | |low to index this URL. | +---------------+-----------------------------------------+ |url |URL itself. | +---------------+-----------------------------------------+ |next_index_time|Time of next indexing in seconds from | | |UNIX epoch. | +---------------+-----------------------------------------+ |status |HTTP status returned by server or 0 if | | |document has not been indexed yet. | +---------------+-----------------------------------------+ |crc |MD5 checksum of document. | +---------------+-----------------------------------------+ |last_modified |"Last-Modified" HTTP header returned by | | |HTTP server. | +---------------+-----------------------------------------+ |etag |"ETag" header returned by HTTP server. | +---------------+-----------------------------------------+ |last_index_time|Time of last indexing in seconds from | | |UNIX epoch. | +---------------+-----------------------------------------+ |referrer |ID of URL which first referred this URL. | +---------------+-----------------------------------------+ |tag |Arbitrary tag. | +---------------+-----------------------------------------+ |hops |Depth of URL in hyperlink tree. | +---------------+-----------------------------------------+ |redir |URL ID, where current URL is redirected | | |or 0 if this URL is not redirected. | +---------------+-----------------------------------------+ |origin |URL ID of document which is origin of | | |this cloned document, or zero if this is | | |not clone. | +---------------+-----------------------------------------+ urlwordsNN (where NN is 2-digit number from 00-15) These tables contain additional info about existing indexed URLs. Number NN in table name is URL_ID mod 16. +--------------+------------------------------------------+ |Field | Description | +--------------+------------------------------------------+ +--------------+------------------------------------------+ |deleted | Set to 1 if server returned 404 error | | | and DeleteBad is set to yes, or if | | | robots.txt or configuration rules disal- | | | low to index this URL. | +--------------+------------------------------------------+ |wordcount | Count of unique words in the indexed | | | part of URL. | +--------------+------------------------------------------+ |totalcount | Total count of words in the indexed part | | | of URL. | +--------------+------------------------------------------+ |content_type | Content-Type HTTP header returned by | | | server. | +--------------+------------------------------------------+ |charset | Document charset taken from Content-Type | | | HTTP header or META. | +--------------+------------------------------------------+ |title | First 128 characters from pages title. | +--------------+------------------------------------------+ |txt | First 255 characters from page body, | | | stripped from HTML tags. | +--------------+------------------------------------------+ |docsize | Total size of URL. | +--------------+------------------------------------------+ |keywords | First 255 characters from page keywords. | +--------------+------------------------------------------+ |description | First 100 characters from page descrip- | | | tion. | +--------------+------------------------------------------+ |lang | Not used now. | +--------------+------------------------------------------+ |words | Zipped content of URL. | +--------------+------------------------------------------+ |hrefs | Sorted array of outgoing href IDs from | | | this URL. | +--------------+------------------------------------------+ robots This table contains information parsed from robots.txt file for each site. +-----------------+---------------------------------------+ | Field | Description | +-----------------+---------------------------------------+ +-----------------+---------------------------------------+ | hostinfo | Host name. | +-----------------+---------------------------------------+ | path | Path to exclude from indexing. | +-----------------+---------------------------------------+ sites This table contains IDs for all indexed sites. +--------------------+------------------------------------+ | Field | Description | +--------------------+------------------------------------+ +--------------------+------------------------------------+ | site_id | ID of site. | +--------------------+------------------------------------+ | site | Site name with proto- | | | col, like | | | http://www.my.com/. | +--------------------+------------------------------------+ stat This table contains information about query statistics for each completed query. +------------+--------------------------------------------+ | Field | Description | +------------+--------------------------------------------+ +------------+--------------------------------------------+ | addr | IP address of computer, from which | | | query was requested. | +------------+--------------------------------------------+ | proxy | IP address of proxy server, through | | | which query was requested. | +------------+--------------------------------------------+ | query | Query string. | +------------+--------------------------------------------+ | ul | URL limit used to restrict the query. | +------------+--------------------------------------------+ | sp | Web spaces used to restrict the query. | +------------+--------------------------------------------+ | site | Site ID used to restrict the query. | +------------+--------------------------------------------+ | np | Results page number requested. | +------------+--------------------------------------------+ | ps | Results per page. | +------------+--------------------------------------------+ | sites | Number of found sites matching query. | +------------+--------------------------------------------+ | urls | Number of found URLs matching query. | +------------+--------------------------------------------+ | start | Query processing start in seconds from | | | UNIX epoch. | +------------+--------------------------------------------+ | finish | Query processing finish in seconds | | | from UNIX epoch. | +------------+--------------------------------------------+ | referer | URL of web page from which query was | | | requested. | +------------+--------------------------------------------+ subsets Table describing all subsets, which can be used to restrict the search. Populated manually with URL masks. Subset is the set of URLs from the particular directory of site. Putting masks describing whole site is not neces- sary. +-------------------+-------------------------------------+ | Field | Description | +-------------------+-------------------------------------+ +-------------------+-------------------------------------+ | subset_id | ID of subset. | +-------------------+-------------------------------------+ | mask | URL mask. Example: | | | http://www.my.com/dir/%. | | | Examples of wrong use: | | | http://www.aspstreet.com/%, | | | http://www.aspstreet/%. | +-------------------+-------------------------------------+ spaces Table describing web spaces. Web space is the set of sites. Each site belonging to particular space must be put into separate record. Populated manually or using -A option of index. If populated manually, run index -B after changing this table. +---------------------+-----------------------------------+ | Field | Description | +---------------------+-----------------------------------+ +---------------------+-----------------------------------+ | space_id | ID of web space. | +---------------------+-----------------------------------+ | site_id | ID of site belonging | | | to the space, refers | | | to sites.site_id. | +---------------------+-----------------------------------+ tmpurl Table describing URLs indexed since start of last index- ing. Used for debugging. +----------+----------------------------------------------+ | Field | Description | +----------+----------------------------------------------+ +----------+----------------------------------------------+ | url_id | URL ID. | +----------+----------------------------------------------+ | thread | Ordinal thread number, which indexed URL. | +----------+----------------------------------------------+ wordsite Auxiliary table used when search is restricted to site pattern. Built at the end of indexing from sites table. +------------+--------------------------------------------+ | Field | Description | +------------+--------------------------------------------+ +------------+--------------------------------------------+ | word | Word used in site name between dots. | +------------+--------------------------------------------+ | sites | Array of site IDs, where this word | | | is encountered. | +------------+--------------------------------------------+ citation This table contains reverse index of hyperlinks. It is used only if IncrementalCitations is set to no. +---------------------+-----------------------------------+ | Field | Description | +---------------------+-----------------------------------+ +---------------------+-----------------------------------+ | url_id | URL ID. | +---------------------+-----------------------------------+ | referrers | Array of URL IDs, | | | which have hyperlink | | | to this URL. | +---------------------+-----------------------------------+ BLOBS wordurl.urls, wordurl1.urls +---------------------------------------------------------+ |Sites information, ordered by site_id. | +------------+--------+-----------------------------------+ |Offset Length |Description | +------------+--------+-----------------------------------+ +------------+--------+-----------------------------------+ | 0 | 4 |Offset of URL info for 1st site. | +------------+--------+-----------------------------------+ | 4 | 4 |ID of 1st site where word is en- | | | |countered. | +------------+--------+-----------------------------------+ | 8 | 4 |Offset of URL info for 2nd site. | +------------+--------+-----------------------------------+ | 12 | 4 |ID of 2nd site where word is en- | | | |countered. | +------------+--------+-----------------------------------+ | ... | +------------+--------+-----------------------------------+ | (N-1)*8 | 4 |Offset of URL info for Nth site, | | | |where N is the total number of | | | |sites in which word is encoun- | | | |tered. | +------------+--------+-----------------------------------+ | (N-1)*8+4 | 4 |Offset of URL info for Nth site. | +------------+--------+-----------------------------------+ | (N-1)*8+8 | 4 |Offset of URL info end for Nth | | | |site. Must point to the end of | | | |blob or file. | +------------+--------+-----------------------------------+ |URLs information. Follows sites information immedi- | |ately. Offsets are counted from 0. | +------------+--------+-----------------------------------+ |Offset Length |Description | +------------+--------+-----------------------------------+ +------------+--------+-----------------------------------+ | 0 | 4 |URL ID of first site in sites | | | |info section. | +------------+--------+-----------------------------------+ | 4 | 2 |Word count in this URL. | +------------+--------+-----------------------------------+ | 6 | 2 |First position. | +------------+--------+-----------------------------------+ | 8 | 2 |Second position. | +------------+--------+-----------------------------------+ | ... | +------------+--------+-----------------------------------+ | 6+(N-1)*2 | 2 |Nth position, where N is the to- | | | |tal word count in the URL. | +------------+--------+-----------------------------------+ | Repeated with info for URLs from the same site, with | | ID greater than previous. | +---------------------------------------------------------+ | ... | +---------------------------------------------------------+ | Repeated with info for URLs for next sites from sites | | info section. | +---------------------------------------------------------+ urlwordsNN.words This field contains gzipped content of URL. +--------+--------------+---------------------------------+ |Offset | Length | Description | +--------+--------------+---------------------------------+ +--------+--------------+---------------------------------+ | 0 | 4 | Size of URL content before zip- | | | | ping or 0xFFFFFFFF if content | | | | is not zipped. | +--------+--------------+---------------------------------+ | 4 | Zipped size | Zipped or original URL content. | +--------+--------------+---------------------------------+ wordsite.sites This field contains array of sites/positions for word. Sorted by site IDs. Structure of array element: +-------------------+-------------------------------------+ | Bits | Description | +-------------------+-------------------------------------+ +-------------------+-------------------------------------+ | 24-31 | Bitmap of positions, | | | highest bit is set to | | | 1 is word is first- | | | level domain. | +-------------------+-------------------------------------+ | 0-23 | Site ID. | +-------------------+-------------------------------------+ FILES /usr/local/aspseek/etc/DBType/tables.sql SEE ALSO aspseek(7), index(1), searchd(1). AUTHORS Copyright (C) 2000, 2001, 2002 by SWsoft. Man page by Kir Kolyshkin and Alexander F. Avdonkin . ASPseek v.1.2.10 2002/06/10 aspseek-sql(5)