Updated: 2002/06/10
Index
NAME
aspseek-sql - the structure of SQL database tables used by ASPseek
SQL TABLES
wordurl
This table keeps information about each word in main and real-time database,
one record per word.
| Field | Description
|
|
| word | Word itself.
|
| word_id | Numeric ID of word.
|
| urls |
Information about sites and urls, in which word is encountered.
Empty if size of info is greater than 1000 bytes, in this case info is stored in separate file.
|
| urlcount | Number of URLs in which word is encountered.
|
| totalcount | Total count of this word in all URLs.
|
Last 3 fields are used only if CompactStorage is set to no, and
updated after finishing of crawling, or then index(1)
is run with -D option.
wordurl1
This table keeps all information about each word in real-time database,
one record per word.
| Field | Description
|
|
| word | Word itself.
|
| word_id | Numeric ID of word, refers to wordurl.word.
|
| urls |
Information about sites and urls in which word is encountered.
Always not empty regardless of size.
|
| urlcount | Number of URLs in which word is encountered.
|
| totalcount | Total count of this word in all URLs.
|
Last 3 fields are updated immediately after downloading of the URL by
index(1) when it is run with -T option.
urlword
This table keeps information about all encountered URLs, both indexed
and not indexed yet which match specified conditions in configuration files.
| Field | Description
|
|
| url_id | ID of URL.
|
| site_id | ID of site, refers to sites.site_id.
|
| deleted |
Set to 1 if server returned 404 error and DeleteBad is set
to yes, or if robots.txt or configuration rules disallow
to index this URL.
|
| url | URL itself.
|
| next_index_time |
Time of next indexing in seconds from UNIX epoch.
|
| status |
HTTP status returned by server or 0 if document has not been indexed yet.
|
| crc | MD5 checksum of document.
|
| last_modified |
"Last-Modified" HTTP header returned by HTTP server.
|
| etag | "ETag" header returned by HTTP server.
|
| last_index_time |
Time of last indexing in seconds from UNIX epoch.
|
| referrer | ID of URL which first referred this URL.
|
| tag | Arbitrary tag.
|
| hops | Depth of URL in hyperlink tree.
|
| redir |
URL ID, where current URL is redirected or 0 if this URL is not redirected.
|
| origin |
URL ID of document which is origin of this cloned document, or zero if
this is not clone.
|
urlwordsNN (where NN is 2-digit number from 00-15)
These tables contain additional info about existing indexed URLs.
Number
NN in table name is URL_ID mod 16.
| Field | Description
|
|
| deleted |
Set to 1 if server returned 404 error and DeleteBad is set
to yes, or if robots.txt or configuration rules disallow
to index this URL.
|
| wordcount |
Count of unique words in the indexed part of URL.
|
| totalcount |
Total count of words in the indexed part of URL.
|
| content_type |
Content-Type HTTP header returned by server.
|
| charset |
Document charset taken from Content-Type HTTP header or META.
|
| title | First 128 characters from pages title.
|
| txt |
First 255 characters from page body, stripped from HTML tags.
|
| docsize | Total size of URL.
|
| keywords | First 255 characters from page keywords.
|
| description |
First 100 characters from page description.
|
| lang | Not used now.
|
| words | Zipped content of URL.
|
| hrefs |
Sorted array of outgoing href IDs from this URL.
|
robots
This table contains information parsed from robots.txt file for each site.
| Field | Description
|
|
| hostinfo | Host name.
|
| path | Path to exclude from indexing.
|
sites
This table contains IDs for all indexed sites.
| Field | Description
|
|
| site_id | ID of site.
|
| site |
Site name with protocol, like http://www.my.com/.
|
stat
This table contains information about query statistics for each completed query.
| Field | Description
|
|
| addr |
IP address of computer, from which query was requested.
|
| proxy |
IP address of proxy server, through which query was requested.
|
| query | Query string.
|
| ul | URL limit used to restrict the query.
|
| sp | Web spaces used to restrict the query.
|
| site | Site ID used to restrict the query.
|
| np | Results page number requested.
|
| ps | Results per page.
|
| sites | Number of found sites matching query.
|
| urls | Number of found URLs matching query.
|
| start |
Query processing start in seconds from UNIX epoch.
|
| finish |
Query processing finish in seconds from UNIX epoch.
|
| referer |
URL of web page from which query was requested.
|
subsets
Table describing all subsets, which can be used to restrict the search.
Populated manually with URL masks. Subset is the set of URLs from
the particular directory of site. Putting masks describing whole site
is not necessary.
spaces
Table describing web spaces. Web space is the set of sites.
Each site belonging to particular space must be put into separate record.
Populated manually or using
-A option of
index. If populated
manually, run
index -B after changing this table.
| Field | Description
|
|
| space_id | ID of web space.
|
| site_id |
ID of site belonging to the space, refers to sites.site_id.
|
tmpurl
Table describing URLs indexed since start of last indexing. Used for debugging.
| Field | Description
|
|
| url_id | URL ID.
|
| thread | Ordinal thread number, which indexed URL.
|
wordsite
Auxiliary table used when search is restricted to site pattern.
Built at the end of indexing from
sites table.
| Field | Description
|
|
| word | Word used in site name between dots.
|
| sites |
Array of site IDs, where this word is encountered.
|
citation
This table contains reverse index of hyperlinks. It is used only if
IncrementalCitations is set to
no.
| Field | Description
|
|
| url_id | URL ID.
|
| referrers |
Array of URL IDs, which have hyperlink to this URL.
|
BLOBS
wordurl.urls, wordurl1.urls
Sites information, ordered by site_id.
|
| Offset | Length | Description
|
|
| 0 | 4 | Offset of URL info for 1st site.
|
| 4 | 4 |
ID of 1st site where word is encountered.
|
| 8 | 4 | Offset of URL info for 2nd site.
|
| 12 | 4 |
ID of 2nd site where word is encountered.
|
...
|
| (N-1)*8 | 4 |
Offset of URL info for Nth site, where N is the total number of sites in which
word is encountered.
|
| (N-1)*8+4 | 4 | Offset of URL info for Nth site.
|
| (N-1)*8+8 | 4 |
Offset of URL info end for Nth site. Must point to the end of blob or file.
|
URLs information. Follows sites information immediately. Offsets are counted from 0.
|
| Offset | Length | Description
|
|
| 0 | 4 |
URL ID of first site in sites info section.
|
| 4 | 2 | Word count in this URL.
|
| 6 | 2 | First position.
|
| 8 | 2 | Second position.
|
...
|
| 6+(N-1)*2 | 2 |
Nth position, where N is the total word count in the URL.
|
Repeated with info for URLs from the same site,
with ID greater than previous.
|
...
|
Repeated with info for URLs for next sites from sites info section.
|
urlwordsNN.words
This field contains gzipped content of URL.
| Offset | Length | Description
|
|
| 0 | 4 |
Size of URL content before zipping or 0xFFFFFFFF if content is not zipped.
|
| 4 | Zipped size | Zipped or original URL content.
|
wordsite.sites
This field contains array of sites/positions for word. Sorted by site IDs.
Structure of array element:
| Bits | Description
|
|
| 24-31 |
Bitmap of positions, highest bit is set to 1 is word is first-level domain.
|
| 0-23 | Site ID.
|
FILES
/usr/local/aspseek/etc/DBType/tables.sql
SEE ALSO
aspseek(7),
index(1),
searchd(1).
AUTHORS
Copyright (C) 2000, 2001, 2002 by SWsoft.
Man page by Kir Kolyshkin <
kir@asplinux.ru>
and Alexander F. Avdonkin <
al@asplinux.ru>.
Index
- NAME
-
- SQL TABLES
-
- wordurl
-
- wordurl1
-
- urlword
-
- urlwordsNN (where NN is 2-digit number from 00-15)
-
- robots
-
- sites
-
- stat
-
- subsets
-
- spaces
-
- tmpurl
-
- wordsite
-
- citation
-
- BLOBS
-
- wordurl.urls, wordurl1.urls
-
- urlwordsNN.words
-
- wordsite.sites
-
- FILES
-
- SEE ALSO
-
- AUTHORS
-
This document was created by
man2html
using the manual pages.
Time: 13:43:46 GMT, December 25, 2002