Unimportant Words Impact Relevancy
- Stop words
- Unimportant terms identified during the filtration step of search engine indexing.
The famous phrase to the left has only one term that will survive the filtration process (the word Question). The rest of the terms are assigned no value during the scoring process. To understand this, it’s useful to know that a page goes through five phases to produce a score for the terms on that page. These are:
- Markup and Format Removal
Producing a Term Score
The first two phases (markup and format removal, and tokenization) are known as the Document Linearization process. In the first phase, formatting and page markup are removed from the web page, including all HTML, CSS instructions, scripts, comments and other special formatting codes. During tokenization the page is reduced to a list of terms, and a second level of processing ensures that terms are lower-cased, punctuation is removed and other rules are applied.
The third step, known as filtration, is where all the terms are divided into two groups:
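As a rough illustration of these first two phases, here is a minimal Python sketch. It is not how any particular search engine implements linearization; the names `TextExtractor` and `linearize` are hypothetical, and the tokenization rule (keep runs of letters and digits) is an assumed simplification of the "other rules" a real indexer would apply.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips <script> and <style> content.

    HTML comments are dropped because handle_comment is not overridden.
    """

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def linearize(html):
    """Phase 1: strip markup. Phase 2: tokenize into lower-cased terms."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    # Assumed rule: a term is a run of letters or digits, so punctuation
    # is discarded and "it's" would split into "it" and "s".
    return re.findall(r"[a-z0-9]+", text.lower())


page = (
    "<html><head><style>p {color: red}</style>"
    "<script>var x = 1;</script></head>"
    "<body><!-- note --><p>Energy equals <b>MC2</b>, said Einstein.</p>"
    "</body></html>"
)
print(linearize(page))
```

For the sample page, the style rule, the script and the comment contribute nothing, and the remaining text comes out as a flat list of lower-cased terms ready for filtration.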
- Important Terms
- Unimportant Terms
The task is to identify those terms that can be used in future weighting and relevancy calculations. These terms must meet two criteria:
First, the term must describe what the document is about (e.g., Einstein, theory, relativity, speed, light, mc2, energy, gravitation, black holes, physics, cosmic microwave, time and radiation).
Second, the term must differentiate the document from other documents in the document database. In this second test, the terms speed, light and time appear in far too many contexts, so they can’t be used to differentiate documents.
The flip side of this task is to identify frequently used words that provide no value during weighting. The following list shows the 25 most common terms in the English language, and you can see why they provide no value:
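One common way to approximate the second criterion is to count how many documents each term appears in and reject terms that show up in too many of them. The sketch below assumes that approach; the function names, the tiny sample collection and the 50% cutoff (`max_share`) are all hypothetical choices made for illustration, not values taken from any search engine.

```python
from collections import Counter


def document_frequency(docs):
    """Count how many documents each term appears in."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    return df


def differentiating_terms(docs, max_share=0.5):
    """Keep terms that appear in at most max_share of the documents."""
    df = document_frequency(docs)
    limit = max_share * len(docs)
    return {term for term, count in df.items() if count <= limit}


# Hypothetical four-document collection, already tokenized.
sample_docs = [
    ["einstein", "relativity", "speed", "light", "time"],
    ["camera", "shutter", "speed", "light", "time"],
    ["marathon", "runner", "speed", "time"],
    ["lamp", "bulb", "light", "time"],
]
print(sorted(differentiating_terms(sample_docs)))
```

In this sample, speed, light and time each appear in three or four of the four documents, so they are rejected, while terms such as einstein and relativity appear in only one document and are kept.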
the, of, to, and, a, in, is, it, you, that, he, was, for, on, are, with, as, I, his, they, be, at, one, have and this.
As you work your way further down the list of most common words, you begin to see terms that are commonly found in keyword phrases, company names, product names, taglines, URL strings and ad copy.
You will see the following words used everywhere – they are on the list of the 500 most common words in the English language:
Hot, boat, cause, boy, home, hand, large, big, white, children, music, book, mark, feet, rain, eat, fish, mountain, north, wood, paper, war, river, car, color, friend, horse, watch, love, money, road, map, machine, star, online, web, men, animal, mother, house, father, school, family, black, rock, moon, foot, gold, city, tree, door, king, language and game.
Because they are so common, these words don’t provide much SEO value from a relevancy perspective. They fail to satisfy the second condition for identifying an important term: they won’t differentiate documents from each other, and accordingly will not receive a weighting value during indexing.
A second category of stop words consists of terms that have many meanings and appear in many contexts. Prepositions fall into this category, but so do some adjectives, such as the color red. The words Web, online and Internet also fall into this category, as they appear within the context of every subject and every market segment in the world. The search result counts below illustrate the point:
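Filtering against a stop list, such as the 25 most common terms listed earlier, can be sketched in a few lines of Python. This is a simplified illustration: real stop lists are longer and engine-specific, and the names `STOP_WORDS` and `filter_terms` are hypothetical.

```python
# The 25 most common English terms quoted in this article, lower-cased.
STOP_WORDS = frozenset(
    "the of to and a in is it you that he was for on are with as i his "
    "they be at one have this".split()
)


def filter_terms(tokens):
    """Return only the tokens that are not on the stop list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


tokens = "the theory of relativity is one of his best known works".split()
print(filter_terms(tokens))
```

Six of the eleven tokens in the sample sentence are on the stop list, so only theory, relativity, best, known and works move on to the weighting stage.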
- Music: 9.5 billion search results
- Map: 5.9 billion search results
- Music Map: 1.97 million search results
- MusicMap: 70K search results
These terms are so common that they produce staggering numbers of search results – even with the two-word combination. Coining a new concatenated word (MusicMap), however, reduces the universe to 70K results. The outcome is a term that is potentially more valuable, but you will have to put time and money into branding it.
About Lexington eBusiness Consulting
Mark Sprague has 25 years of product development experience, including expertise in search engines, information products, SEO platforms and social networking applications. He draws on that experience to help you refine products and services, and improve your search engine marketing and your website’s performance by:
- Developing a superior data-driven SEO strategy for your website.
- Understanding your customers’ search behavior and normalizing it to your content strategy.
- Understanding how search engine technology practically impacts SEO and content strategies.
- Understanding how search technology impacts content in a social networking environment.
- Developing a superior user experience based on sound information architecture, usability and coding standards.
Lexington eBusiness Consulting
Mark Sprague, CEO
580 Lowell Street
Lexington, MA 02420