SEO Information

From Corpora to Matching


Making effective use of the Internet is increasingly about creating better and more intelligent applications and search engines. Here is a brief introduction to how search engines work:

01) Define the corpus, search space/data;
02) Separate the corpus into documents;
03) Generate features for each document;
04) Generate a representation of each document;
05) Study the feature/vector space;
06) Cluster documents;
07) Reduce dimensionality;
08) Accept input queries;
09) Find the cosines of the angles between the query vector and the document vectors;
10) Find the sought vector column;
11) Output results to user in some way.

Each document in a corpus (database) is described by a set of keywords called index terms. We assign weights to index terms according to their relevance (frequency of occurrence, for instance). This is how we create the index that we can then search.

Corpus preparation:
Web pages of interest are analysed and cleaned by removing hypertext tags and any other markup. Pages are then broken down into documents, and each document is scanned for words/terms of interest: those which make a document unique, not standard words.
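The cleaning step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names TextExtractor, clean_page and STOP_WORDS are hypothetical, the stop-word list is a tiny stand-in for the "standard words" mentioned above, and no stemming is attempted.

```python
import re
from html.parser import HTMLParser

# Hypothetical stop-word list; a real system would use a much larger one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "for", "on"}


class TextExtractor(HTMLParser):
    """Collects text content and ignores tags, scripts and styles."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def clean_page(html):
    """Return the lower-cased terms of a page, minus markup and stop words."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts).lower()
    words = re.findall(r"[a-z]+", text)
    return [w for w in words if w not in STOP_WORDS]
```

Calling clean_page on a page's HTML gives the list of candidate terms for that page, which the later steps can count.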

Extract terms of interest:
Bear in mind that terms of interest must be invariant, that is, characteristic of a document, not generic and easy to find in any corpus/document. The idea is to find a signature per document.
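One simple way to approximate "characteristic, not generic" is to drop terms that occur in too many documents. The sketch below assumes that rule; the function name select_terms and the max_doc_fraction threshold are illustrative choices, not something the method prescribes.

```python
def select_terms(documents, max_doc_fraction=0.5):
    """Keep terms that occur in at most max_doc_fraction of the documents.

    documents is a list of term lists, one list per document.
    """
    doc_count = {}
    for terms in documents:
        # set() so a term is counted once per document
        for term in set(terms):
            doc_count[term] = doc_count.get(term, 0) + 1
    limit = max_doc_fraction * len(documents)
    return sorted(t for t, n in doc_count.items() if n <= limit)
```

With four documents and the default threshold, a term found in all four is discarded as generic, while a term found in one or two is kept as part of a document signature.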

Build term-by-document matrix:
The search space is defined by N dimensions, one per chosen term. Each document, described by its chosen terms/features, is a point in this N-term space, which allows conceptual/semantic searches.

Each document becomes a column vector and each row represents a term. Each row gives the frequency of a term across the analysed corpus. At first we simply build the matrix by counting the terms for each document.
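As a minimal sketch of the counting step, assuming each document is already a list of words and the chosen terms are given as a list (the function name build_term_document_matrix is hypothetical):

```python
def build_term_document_matrix(documents, terms):
    """Return a matrix with one row per term and one column per document.

    Entry [i][j] is the number of times terms[i] occurs in documents[j].
    """
    row_of = {term: i for i, term in enumerate(terms)}
    matrix = [[0] * len(documents) for _ in terms]
    for j, doc in enumerate(documents):
        for word in doc:
            i = row_of.get(word)
            if i is not None:  # words outside the chosen terms are ignored
                matrix[i][j] += 1
    return matrix
```

For the terms mill, museum and river and the two documents "river mill river" and "museum river", the rows come out as [1, 0], [0, 1] and [2, 1].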

Compress the matrix:
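There are two basic techniques/methods: Compressed Row Storage (scans the matrix row by row) and Compressed Column Storage (scans the matrix column by column). Both use three arrays: one for the non-zero values, one for their column (or row) indices, and one for the position at which each row (or column) starts.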
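A small sketch of Compressed Row Storage follows. The array names values, col_index and row_ptr are common conventions assumed here, and the helper names to_crs and crs_get are hypothetical.

```python
def to_crs(matrix):
    """Convert a dense matrix (list of rows) to Compressed Row Storage.

    Returns three arrays:
      values    - the non-zero entries, scanned row by row
      col_index - the column of each entry in values
      row_ptr   - row i occupies values[row_ptr[i]:row_ptr[i + 1]]
    """
    values, col_index, row_ptr = [], [], [0]
    for row in matrix:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_index.append(j)
        row_ptr.append(len(values))
    return values, col_index, row_ptr


def crs_get(crs, i, j):
    """Look up entry (i, j) in a CRS triple; missing entries are zero."""
    values, col_index, row_ptr = crs
    for k in range(row_ptr[i], row_ptr[i + 1]):
        if col_index[k] == j:
            return values[k]
    return 0
```

Compressed Column Storage is the same idea applied to the transposed matrix, with row indices and column pointers instead.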

Normalise the matrix:
Normalisation implies transforming column vectors to unit vectors, i.e. vectors of unit length.

Unit document vectors contain the frequency of terms; the normalisation is applied because the semantic content of a document is generally determined by the relative frequency of terms.
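A minimal sketch of the normalisation, assuming the usual Euclidean length and a dense matrix stored as a list of rows (the function name normalise_columns is hypothetical):

```python
import math


def normalise_columns(matrix):
    """Scale every column (document vector) to Euclidean length 1."""
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        length = math.sqrt(sum(matrix[i][j] ** 2 for i in range(n_rows)))
        if length == 0:
            continue  # an empty document stays a zero vector
        for i in range(n_rows):
            result[i][j] = matrix[i][j] / length
    return result
```

A document with term counts (3, 4) has length 5 and becomes (0.6, 0.8), so a long document and a short one with the same proportions of terms end up as the same unit vector.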

Singular Value Decomposition:
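This factors a matrix into three matrices. Two hold the singular vectors, which define the new dimensions; for a symmetric positive semi-definite matrix these two are identical and coincide with the eigenvectors. The third is diagonal and holds the singular values (the eigenvalues in the symmetric case), that is the spread of the corpus along these new dimensions.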
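Written out, with notation that is assumed here rather than taken from the text (A is the term-by-document matrix with m terms, n documents and rank r), the compact decomposition and its link to eigenvectors read:

```latex
\begin{align}
  A &= U \Sigma V^{T},
    & U^{T} U &= I_r, \quad V^{T} V = I_r, \\
  \Sigma &= \operatorname{diag}(\sigma_1, \dots, \sigma_r),
    & \sigma_1 &\ge \sigma_2 \ge \dots \ge \sigma_r > 0, \\
  A A^{T} &= U \Sigma V^{T} V \Sigma U^{T} = U \Sigma^{2} U^{T},
    & A^{T} A &= V \Sigma^{2} V^{T}.
\end{align}
```

Here U is m by r and V is n by r. The columns of U are therefore eigenvectors of the symmetric matrix A Aᵀ, the columns of V are eigenvectors of Aᵀ A, and the eigenvalues of both are the squared singular values.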

A geometric interpretation:
The corpus is first formatted and stemmed, and is then stored in a compact term-by-document matrix. Each column of this matrix is then normalised to produce the likelihood of a term across the corpus, or, equivalently, the relative frequency of terms in a document.

The term-by-document matrix is then decomposed to calculate eigenvalues and eigenvectors. The eigenvectors represent a new Cartesian coordinate frame spanning the same search space, but they indicate the most important dimensions/axes along which documents mainly lie. The eigenvalues quantify the spread of documents along these new axes.
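To make the geometric picture concrete, the most important axis can be estimated with power iteration on the symmetric matrix formed by the term-by-document matrix times its transpose. This is a swapped-in, simplified technique that finds only the dominant axis, not a full decomposition; the function name dominant_axis and the iteration count are assumptions.

```python
import math


def dominant_axis(matrix, iterations=100):
    """Power iteration on A * A-transpose for a term-by-document matrix A.

    Returns (eigenvalue, unit eigenvector) for the largest eigenvalue.
    Assumes non-negative entries (term counts), so the all-ones start
    vector is not orthogonal to the dominant eigenvector.
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    start = 1.0 / math.sqrt(n_rows)
    v = [start] * n_rows
    eigenvalue = 0.0
    for _ in range(iterations):
        # w = A-transpose * v (one entry per document)
        w = [sum(matrix[i][j] * v[i] for i in range(n_rows))
             for j in range(n_cols)]
        # u = A * w = (A * A-transpose) * v (one entry per term)
        u = [sum(matrix[i][j] * w[j] for j in range(n_cols))
             for i in range(n_rows)]
        norm = math.sqrt(sum(x * x for x in u))
        if norm == 0:
            return 0.0, v  # zero matrix: no preferred axis
        eigenvalue = norm  # v has unit length, so |u| tends to the eigenvalue
        v = [x / norm for x in u]
    return eigenvalue, v
```

The returned vector is the axis in term space along which the documents spread the most, and the square root of the returned eigenvalue is the corresponding singular value of the matrix.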

Queries:
Queries must be based on defined features/terms within the term-by-document matrix. Matching in a vector space such as this is implemented by multiplying the query vector against the term-by-document matrix, i.e. matching a query vector q against the documents (columns) of the matrix.
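A minimal sketch of query matching, assuming a dense term-by-document matrix and scoring each document by the cosine of the angle between its column and the query vector (the function name rank_documents is hypothetical):

```python
import math


def rank_documents(matrix, terms, query_words):
    """Rank documents by cosine similarity to a query.

    matrix has one row per entry of terms and one column per document.
    Returns (document index, score) pairs, best match first.
    """
    row_of = {term: i for i, term in enumerate(terms)}
    q = [0.0] * len(terms)
    for word in query_words:
        if word in row_of:  # words outside the defined terms are ignored
            q[row_of[word]] += 1.0
    q_len = math.sqrt(sum(x * x for x in q))

    scores = []
    for j in range(len(matrix[0])):
        column = [matrix[i][j] for i in range(len(terms))]
        d_len = math.sqrt(sum(x * x for x in column))
        dot = sum(a * b for a, b in zip(q, column))
        score = dot / (q_len * d_len) if q_len and d_len else 0.0
        scores.append((j, score))
    # Highest score first; ties keep the lower document index first.
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))
```

If the columns and the query have already been normalised to unit length, the division is unnecessary and the scores are just the product of the query vector with the matrix.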

I am the website administrator of the Wandle Industrial Museum (http://www.wandle.org), established in 1983 by local people determined to ensure that the history of the valley was no longer neglected and to enhance awareness of its heritage for the use and benefit of the community.


© Seo-Latest.com 2008