From Corpora to Matching

Making effective use of the Internet is increasingly about creating better and more intelligent applications and search engines. Here is a brief introduction to how search engines work:

01) Define the corpus, search space/data;
02) Separate the corpus into documents;
03) Generate features for each document;
04) Generate a representation of each document;
05) Study the feature/vector space;
06) Cluster documents;
07) Reduce dimensionality;
08) Accept input queries;
09) Find the cosine of the angle between the query vector and each document vector;
10) Find the sought (best-matching) document column;
11) Output results to the user in some way.

Each document in a corpus (database) is described by a set of keywords called index terms. We assign weights to index terms according to their relevance (frequency of occurrence, for instance). This is how we create the index that we can then search.
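As a minimal sketch of such an index, the snippet below maps each index term to the documents that contain it, using the raw frequency of occurrence as the weight. The function name build_index, the document ids and the sample texts are illustrative assumptions.

```python
from collections import Counter, defaultdict

def build_index(documents):
    """Map each term to a dict of {document id: frequency of the term}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        # Counter gives the number of occurrences of each word in this document.
        for term, count in Counter(text.lower().split()).items():
            index[term][doc_id] = count
    return index

# Made-up sample documents, for illustration only.
documents = {"d1": "river mill river", "d2": "print mill"}
index = build_index(documents)
print(index["mill"])   # {'d1': 1, 'd2': 1}
print(index["river"])  # {'d1': 2}
```

Looking up a term then returns every document that mentions it, together with its weight in that document.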

Corpus preparation:
Web pages of interest are analysed and cleaned by removing hypertext tags and any other markup. Pages are then broken down into documents, and each document is scanned for words/terms of interest: those which make a document unique, not standard words.
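The cleaning step can be sketched with Python's standard html.parser module. The stop-word list, the class name TextExtractor and the function name clean_page are assumptions made for illustration; this simple version also keeps any text found inside script or style elements, which a fuller cleaner would discard.

```python
from html.parser import HTMLParser
import re

# A tiny, hypothetical list of "standard" words; a real list would be much longer.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to", "in", "on"}

class TextExtractor(HTMLParser):
    """Collect the text found between tags, discarding the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def clean_page(html):
    """Strip markup, lowercase the text, split it into words and drop stop words."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

# Made-up sample page, for illustration only.
page = "<html><body><h1>The Wandle</h1><p>A history of the <b>mills</b>.</p></body></html>"
print(clean_page(page))  # ['wandle', 'history', 'mills']
```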

Extract terms of interest:
Bear in mind that terms of interest must be invariant, that is, characteristic of a document, not generic and easy to find in any corpus/document. The idea is to find a signature for each document.
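One simple way to approximate this, offered here as an assumption and not as the only possible criterion, is to discard terms that appear in too many documents. The function name distinctive_terms and the max_fraction threshold are hypothetical.

```python
def distinctive_terms(documents, max_fraction=0.5):
    """Keep the terms that occur in at most max_fraction of the documents.

    documents: a list of word lists, one list per document.
    """
    doc_freq = {}
    for words in documents:
        for term in set(words):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    limit = max_fraction * len(documents)
    return sorted(term for term, n in doc_freq.items() if n <= limit)

# Made-up sample documents; "page" occurs everywhere, so it is generic.
docs = [
    ["river", "mill", "page"],
    ["print", "works", "page"],
    ["river", "walk", "page"],
    ["museum", "page"],
]
print(distinctive_terms(docs))
# ['mill', 'museum', 'print', 'river', 'walk', 'works']
```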

Build term-by-document matrix:
The search space has N dimensions, one for each chosen term. Each document is a point in this N-term space, positioned by its terms/features; this allows conceptual/semantic searches.

Each document becomes a column vector and each row represents a term, so each row gives the frequency of that term across the analysed corpus. At first we simply build the matrix by counting the terms in each document.
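The counting step can be sketched as follows, with one row per term and one column per document. The function name term_document_matrix and the sample data are illustrative assumptions.

```python
def term_document_matrix(documents, terms):
    """Return a list of rows; row i holds the count of terms[i] in each document."""
    return [[words.count(term) for words in documents] for term in terms]

# Made-up sample data: three terms and three already-cleaned documents.
terms = ["mill", "river", "print"]
docs = [
    ["river", "mill", "river"],
    ["print", "mill"],
    ["river"],
]
matrix = term_document_matrix(docs, terms)
for term, row in zip(terms, matrix):
    print(term, row)
# mill [1, 1, 0]
# river [2, 0, 1]
# print [0, 1, 0]
```

Reading down a column gives the vector for one document; reading along a row shows how one term is spread across the corpus.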

Compress the matrix:
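There are two basic techniques/methods: Compressed Row Storage (which scans the matrix row by row) and Compressed Column Storage (which scans the matrix column by column). Both store only the non-zero entries and use three arrays: the non-zero values, their column (or row) indices, and pointers to where each row (or column) starts.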
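A minimal sketch of Compressed Row Storage is given below; the column version works the same way with rows and columns swapped. The function name to_crs and the array names are assumptions chosen for readability.

```python
def to_crs(matrix):
    """Convert a dense matrix (a list of rows) to Compressed Row Storage.

    Returns three arrays:
      values    - the non-zero entries, scanned row by row
      col_index - the column of each entry in values
      row_ptr   - row_ptr[i] is the position in values where row i starts;
                  the final entry equals the number of non-zero entries
    """
    values, col_index, row_ptr = [], [], [0]
    for row in matrix:
        for j, x in enumerate(row):
            if x != 0:
                values.append(x)
                col_index.append(j)
        row_ptr.append(len(values))
    return values, col_index, row_ptr

# Made-up 3 x 3 term-by-document matrix.
matrix = [[1, 1, 0], [2, 0, 1], [0, 1, 0]]
values, col_index, row_ptr = to_crs(matrix)
print(values)     # [1, 1, 2, 1, 1]
print(col_index)  # [0, 1, 0, 2, 1]
print(row_ptr)    # [0, 2, 4, 5]
```

The saving grows with the number of zeros, which is usually large because most terms do not occur in most documents.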

Normalise the matrix:
Normalisation implies transforming column vectors to unit vectors, i.e. vectors of unit length.

Unit document vectors contain the relative frequencies of terms; the normalisation is applied because the semantic content of a document is generally determined by the relative frequency of its terms.
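As a small sketch, each column can be divided by its Euclidean length. The function name normalise_columns is an assumption, and all-zero columns are left unchanged to avoid dividing by zero.

```python
import math

def normalise_columns(matrix):
    """Return a copy of matrix in which every non-zero column has unit length."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    result = [list(map(float, row)) for row in matrix]
    for j in range(n_cols):
        length = math.sqrt(sum(matrix[i][j] ** 2 for i in range(n_rows)))
        if length > 0:  # leave all-zero columns untouched
            for i in range(n_rows):
                result[i][j] = matrix[i][j] / length
    return result

# Made-up matrix: the first column (3, 4) has length 5, the second (0, 2) has length 2.
matrix = [[3, 0], [4, 2]]
unit = normalise_columns(matrix)
print(unit)  # [[0.6, 0.0], [0.8, 1.0]]
```

After this step a long document and a short one with the same mix of terms point in the same direction.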

Singular Value Decomposition:
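This factors a matrix A into three matrices, A = U S V^T. Two of them, U and V, hold the singular vectors: the new dimensions. They are the eigenvectors of A A^T and A^T A respectively, and they are identical only in the special case where A is symmetric and positive semidefinite. The third, S, is diagonal and holds the singular values (the square roots of the eigenvalues of A^T A), that is, the spread of the corpus along these new dimensions.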
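The decomposition can be tried out with NumPy, which is an assumed tool here and is not named in the original text. The sketch factors a small made-up term-by-document matrix A as U S V^T, checks that the three factors rebuild A, and checks that the squared singular values equal the eigenvalues of A^T A.

```python
import numpy as np

# Made-up matrix: 4 terms (rows) by 3 documents (columns).
A = np.array([[1.0, 1.0, 0.0],
              [2.0, 0.0, 1.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 3.0]])

# U holds the left singular vectors, Vt the transposed right singular vectors,
# and s the singular values, sorted from largest to smallest.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Multiplying the three factors back together recovers A.
A_rebuilt = U @ np.diag(s) @ Vt
print(np.allclose(A, A_rebuilt))  # True

# The squared singular values are the eigenvalues of the symmetric matrix A^T A.
# eigvalsh returns them in ascending order, so reverse to match s.
eigenvalues = np.linalg.eigvalsh(A.T @ A)[::-1]
print(np.allclose(s ** 2, eigenvalues))  # True
```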

A geometric interpretation:
The corpus is first formatted and stemmed, and is then stored in a compact term-by-document matrix. Each column of this matrix is then normalised to give the relative weight, or equivalently the relative frequency, of each term in a document.

The term-by-document matrix is then decomposed to calculate eigenvalues and eigenvectors (strictly speaking, its singular values and singular vectors). The eigenvectors represent a new Cartesian coordinate frame spanning the same search space, but they indicate the most important dimensions/axes along which documents mainly lie. The eigenvalues quantify the spread of documents along these new axes/eigenvectors.
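Keeping only the axes with the largest spread is one way to reduce dimensionality. The sketch below uses NumPy (an assumed tool) to keep the k most important axes and to report the share of the total squared spread they retain. The function name reduce_rank and the sample matrix are hypothetical.

```python
import numpy as np

def reduce_rank(A, k):
    """Rebuild A from its k largest singular values and their vectors.

    Returns the rank-k matrix and the fraction of the total squared
    spread (sum of squared singular values) that the k axes keep.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    kept = float(np.sum(s[:k] ** 2) / np.sum(s ** 2))
    return A_k, kept

# Made-up matrix: 4 terms (rows) by 3 documents (columns).
A = np.array([[1.0, 1.0, 0.0],
              [2.0, 0.0, 1.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 3.0]])

A_2, kept = reduce_rank(A, 2)
print(A_2.shape)       # (4, 3): same shape, but only two independent axes
print(round(kept, 3))  # fraction of the spread kept by the two main axes
```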

Queries:
Queries must be based on the features/terms defined within the term-by-document matrix. Matching in a vector space such as this is implemented by multiplying the query vector by the term-by-document matrix, i.e. matching a query vector q against each document column of the matrix. When the vectors have unit length, each product is the cosine of the angle between the query and a document.
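The matching step can be sketched in a few lines. The function name cosine_scores and the sample data are assumptions; the query is expressed over the same terms as the rows of the matrix, and the lengths are divided out explicitly so the sketch also works on a matrix that has not been normalised.

```python
import math

def cosine_scores(matrix, query):
    """Return the cosine of the angle between query and each matrix column."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    q_len = math.sqrt(sum(x * x for x in query))
    scores = []
    for j in range(n_cols):
        column = [matrix[i][j] for i in range(n_rows)]
        dot = sum(q * d for q, d in zip(query, column))
        d_len = math.sqrt(sum(d * d for d in column))
        # A zero-length query or document cannot match anything.
        scores.append(dot / (q_len * d_len) if q_len and d_len else 0.0)
    return scores

# Made-up data: rows are the terms "mill", "river", "print"; columns are documents.
matrix = [[1, 1, 0], [2, 0, 1], [0, 1, 0]]
query = [0, 1, 0]  # a query for the single term "river"

scores = cosine_scores(matrix, query)
best = max(range(len(scores)), key=scores.__getitem__)
print(best)  # 2: the third document contains only "river"
```

Sorting the documents by score, highest first, gives the ranked list that is returned to the user.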

I am the website administrator of the Wandle industrial museum (http://www.wandle.org). The museum was established in 1983 by local people determined to ensure that the history of the valley was no longer neglected, and to enhance awareness of its heritage for the use and benefit of the community.
