From Corpora to Matching

Making effective use of the Internet is increasingly about creating better and more intelligent applications and search engines. Here is a brief introduction to how search engines work:

01) Define the corpus, search space/data;
02) Separate the corpus into documents;
03) Generate features for each document;
04) Generate a representation of each document;
05) Study the feature/vector space;
06) Cluster documents;
07) Reduce dimensionality;
08) Accept input queries;
09) Find the cosine of the angle between each document vector and the query vector;
10) Find the sought document vector (matrix column);
11) Output results to user in some way.

Each document in a corpus (database) is described by a set of keywords called index terms. We assign weights to index terms according to their relevance (frequency of occurrence, for instance). This is how we create the index that we can then search.

Corpus preparation:
Web pages of interest are analysed and cleaned by removing hypertext tags or any other markup. Pages are then broken down into documents, and each document is scanned for words/terms of interest: those which make a document unique, not standard words.
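As a minimal sketch of the cleaning step, the Python below strips tags with the standard library's html.parser and splits the remaining text into lower-case words. The names TextExtractor, strip_tags and tokenize, and the sample page, are assumptions made for illustration only.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text found outside a tag.
        self.parts.append(data)


def strip_tags(page):
    """Return the text content of an HTML page (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(page)
    parser.close()
    return " ".join(parser.parts)


def tokenize(text):
    """Lower-case the text and keep runs of the letters a-z as words."""
    return re.findall(r"[a-z]+", text.lower())


# Illustrative page, not taken from a real site.
page = "<html><body><h1>River Wandle</h1><p>Mills on the <b>river</b>.</p></body></html>"
tokens = tokenize(strip_tags(page))
print(tokens)
```

A real crawler would also need to skip script and style content and to handle accented characters; this sketch does neither.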

Extract terms of interest:
Bear in mind that terms of interest must be invariant, that is, characteristic of a document, not generic and easy to find in any corpus/document. The idea is to find a signature per document.
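One possible way to pick such terms, sketched below, is to drop very common words and any term that appears in too many documents. The stop-word list, the 50% threshold and the function name terms_of_interest are all assumptions chosen for the example, not a prescribed method.

```python
# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "of", "and", "on", "in", "is", "to"}


def terms_of_interest(docs, max_doc_fraction=0.5):
    """Keep terms that are not stop words and that occur in at most
    max_doc_fraction of the documents (docs is a list of token lists)."""
    doc_freq = {}
    for tokens in docs:
        for term in set(tokens):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    limit = max_doc_fraction * len(docs)
    return sorted(
        term for term, df in doc_freq.items()
        if term not in STOP_WORDS and df <= limit
    )


# Illustrative documents, already cleaned and split into words.
docs = [
    ["the", "river", "mill", "wheel"],
    ["the", "river", "museum", "history"],
    ["the", "river", "print", "works"],
    ["the", "mill", "history"],
]
selected = terms_of_interest(docs)
print(selected)
```

Here "river" is dropped because it occurs in three of the four documents, so it says little about any one of them.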

Build term-by-document matrix:
The search space is defined by N dimensions, one per chosen term/feature, and each document is a point in this N-term space. This allows conceptual/semantic searches.

Each document becomes a column vector and each row represents a term. Each row identifies the frequency of a term across the analysed corpus. At first we simply build the matrix by counting the terms for each document.
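The counting step can be sketched in a few lines of Python. The function name term_document_matrix and the sample terms and documents are assumptions made for illustration.

```python
def term_document_matrix(docs, terms):
    """Rows are terms, columns are documents, entries are raw counts."""
    return [[tokens.count(term) for tokens in docs] for term in terms]


# Illustrative terms and tokenised documents.
terms = ["history", "mill", "river"]
docs = [
    ["river", "mill", "mill"],
    ["river", "history"],
    ["history", "history", "river"],
]
A = term_document_matrix(docs, terms)
for term, row in zip(terms, A):
    print(term, row)
```

Reading down a column gives one document's vector; reading along a row gives one term's counts across the corpus.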

Compress the matrix:
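There are two basic techniques/methods: Compressed Row Storage (scans the matrix row by row) and Compressed Column Storage (scans the matrix column by column). Both use three arrays: the non-zero values, the column (or row) index of each value, and pointers marking where each row (or column) starts.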
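A minimal sketch of row-wise compressed storage follows, assuming the usual three-array layout of non-zero values, their column indices, and row start pointers. The names to_crs and crs_row and the sample matrix are hypothetical.

```python
def to_crs(matrix):
    """Scan a dense matrix row by row and return the three arrays
    (values, col_index, row_ptr)."""
    values, col_index, row_ptr = [], [], [0]
    for row in matrix:
        for j, x in enumerate(row):
            if x != 0:
                values.append(x)     # the non-zero entry itself
                col_index.append(j)  # the column it came from
        row_ptr.append(len(values))  # where the next row starts
    return values, col_index, row_ptr


def crs_row(values, col_index, row_ptr, i, n_cols):
    """Rebuild dense row i from the three arrays."""
    row = [0] * n_cols
    for k in range(row_ptr[i], row_ptr[i + 1]):
        row[col_index[k]] = values[k]
    return row


# Illustrative sparse term-by-document matrix.
A = [[0, 1, 2], [2, 0, 0], [0, 0, 1]]
values, col_index, row_ptr = to_crs(A)
print(values, col_index, row_ptr)
```

Column-wise storage is the same idea applied to the transposed matrix, which suits a layout where each document is a column.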

Normalise the matrix:
Normalisation implies transforming column vectors to unit vectors, i.e. vectors of unit length.

Unit document vectors contain the frequency of terms; the normalisation is applied because the semantic content of a document is generally determined by the relative frequency of terms.
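As a sketch, each column can be divided by its Euclidean length. The function name normalise_columns and the sample matrix are assumptions; the choice of the Euclidean (L2) norm is also an assumption, since the text only asks for unit length.

```python
import math


def normalise_columns(matrix):
    """Return a copy of the matrix whose columns have Euclidean length 1.
    All-zero columns are left as zeros."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        norm = math.sqrt(sum(matrix[i][j] ** 2 for i in range(n_rows)))
        if norm > 0:
            for i in range(n_rows):
                result[i][j] = matrix[i][j] / norm
    return result


# Illustrative matrix: two terms (rows) by two documents (columns).
A = [[3, 0], [4, 1]]
U = normalise_columns(A)
print(U)
```

After this step a long document and a short one with the same mix of terms end up with the same vector.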

Singular Value Decomposition:
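For a symmetric matrix, this decomposition yields three matrices. Two contain the same eigenvectors (one is the transpose of the other) and represent the new dimensions. The third is diagonal and holds the eigenvalues, that is, the spread of the corpus along these new dimensions. A term-by-document matrix A is generally not symmetric, so the eigenvectors and eigenvalues are those of the symmetric product of A with its own transpose, and the singular values of A are the square roots of those eigenvalues.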
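The sketch below is not a full Singular Value Decomposition. It uses power iteration, a simpler technique swapped in for illustration, to find only the single most important axis: the eigenvector with the largest eigenvalue of the symmetric matrix obtained by multiplying a term-by-document matrix by its transpose. The names gram and dominant_axis and the sample matrix are assumptions.

```python
import math


def gram(matrix):
    """Return the symmetric term-by-term matrix A times A-transpose."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in matrix]
            for r1 in matrix]


def dominant_axis(sym, iterations=200):
    """Power iteration on a symmetric positive semi-definite matrix.
    Returns (largest eigenvalue, matching unit eigenvector)."""
    n = len(sym)
    v = [1.0 / math.sqrt(n)] * n  # start from an even mix of all axes
    eigenvalue = 0.0
    for _ in range(iterations):
        w = [sum(sym[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
        eigenvalue = norm  # length of sym*v for a unit vector v
    return eigenvalue, v


# Illustrative matrix: two terms (rows) by two documents (columns).
A = [[1, 0], [1, 1]]
S = gram(A)
eigenvalue, axis = dominant_axis(S)
print(eigenvalue, axis, math.sqrt(eigenvalue))  # last value: largest singular value of A
```

A numerical library routine would normally be used to obtain all axes at once; this version only shows what the largest eigenvalue and its eigenvector mean.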

A geometric interpretation:
The corpus is first formatted and stemmed, and is then stored in a compact term-by-document matrix. Each column of this matrix is then normalised to produce the likelihood of a term or, equivalently, the relative frequency of terms in a document.

The term-by-document matrix is then decomposed to calculate eigenvalues and eigenvectors. The eigenvectors represent a new Cartesian coordinate frame spanning the same search space, but they indicate the most important dimensions/axes along which documents mainly lie. The eigenvalues quantify the spread of documents along these new axes/eigenvectors.

Queries must be based on defined features/terms within the term-by-document matrix. Matching in a vector space such as this is implemented by multiplying the query vector against the term-by-document matrix, i.e. matching a query vector q against the documents (columns) of the matrix.
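A minimal sketch of the matching step: the query is written as a vector over the same terms as the matrix rows, and the cosine of the angle between it and each document column is used as the score. The function name cosine_scores, the sample terms, matrix and query are assumptions made for illustration.

```python
import math


def cosine_scores(matrix, query):
    """Cosine of the angle between the query vector and each document
    column of a term-by-document matrix."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    q_norm = math.sqrt(sum(x * x for x in query))
    scores = []
    for j in range(n_cols):
        column = [matrix[i][j] for i in range(n_rows)]
        d_norm = math.sqrt(sum(x * x for x in column))
        dot = sum(q * d for q, d in zip(query, column))
        # An empty query or an empty document scores zero.
        scores.append(dot / (q_norm * d_norm) if q_norm and d_norm else 0.0)
    return scores


# Rows follow this term order; columns are documents 0, 1 and 2.
terms = ["history", "mill", "river"]
A = [[0, 1, 2], [2, 0, 0], [1, 1, 1]]
query = [0, 1, 1]  # a query for "mill river"
scores = cosine_scores(A, query)
best = max(range(len(scores)), key=scores.__getitem__)
print(scores, best)
```

If the columns and the query have already been normalised to unit length, the score reduces to the plain product of the query vector with the matrix.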

© I am the website administrator of the Wandle Industrial Museum, established in 1983 by local people determined to ensure that the history of the valley was no longer neglected, and to enhance awareness of its heritage for the use and benefit of the community.
