
How to Identify Stolen Content and Take Action!

February 9th, 2016 | Categories: SEO

Imagine that you and your staff have spent countless hours creating engaging content for your website, only to discover that much of it has been stolen and repurposed by others – without your consent.

The appearance of duplicate content could adversely affect your website search rankings, making it more difficult for prospective students, alumni and the community to find you. And as we all know, good content rules. So, why let others break them (the rules, that is)?

At Beacon, we’ve seen what unethical practices such as copy scraping can do. Having personally experienced the theft of our content fairly recently, I thought I’d share the steps I took to alert Google to this offense and protect our company from the negative fallout that can follow.

Here are six easy steps for getting back at the thieves who steal copy.

Step 1 – Verify that your suspicions are correct.

Perform a quick Google search to determine where your copy is showing up across the internet. Randomly select copy from a webpage (copy and paste a few sentences into the Google search box) to run a query. The search results will indicate whether your copy appears on any site other than your own.

For example, here are the results from my search.

Scraped Content

The search results will provide you with a list of webpages where that content appears (including your own, of course). As you can see in this example, there is another website using content I wrote without my consent (see the red arrow above).
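If you'd rather not copy and paste snippets by hand, this spot check is easy to script. Here's a minimal Python sketch (the snippet list is just a placeholder for sentences from your own pages) that builds exact-phrase queries and opens them in your default browser:

```python
# Minimal sketch: build exact-phrase Google queries for a few sentences
# from your own pages so you can spot-check where else they appear.
# The snippets below are placeholders -- swap in sentences from your site.
import webbrowser
from urllib.parse import quote_plus

snippets = [
    "a few sentences copied from one of your own pages",
    "another distinctive sentence from a different page",
]

for snippet in snippets:
    # Wrapping the snippet in quotes asks Google for an exact-phrase match.
    url = "https://www.google.com/search?q=" + quote_plus(f'"{snippet}"')
    print(url)
    webbrowser.open(url)  # opens each query in a browser tab
```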

Step 2 – Investigate the extent of the theft

Stolen Content

When investigating the extent of the plagiarism, check to see whether your content has been copied verbatim. Also check whether this is an isolated event or whether the website in question has copied multiple pieces of content. In our example above, you will notice multiple instances of stolen content. It’s time to take action.
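To speed up this part of the investigation, you can also script a rough verbatim check. The sketch below is just that, a sketch: the URL and sentences are placeholders, and it only catches word-for-word copies, not lightly reworded ones.

```python
# Minimal sketch: check which of your sentences appear verbatim on a
# suspect page. The URL and sentences below are placeholders.
import re
import urllib.request

suspect_url = "http://example.com/suspected-copy"   # hypothetical offending page
my_sentences = [
    "A distinctive sentence from your original article.",
    "Another sentence you suspect was lifted word for word.",
]

with urllib.request.urlopen(suspect_url) as resp:
    html = resp.read().decode("utf-8", errors="ignore")

# Crude tag stripping and whitespace normalization for comparison.
text = re.sub(r"<[^>]+>", " ", html)
text = re.sub(r"\s+", " ", text).lower()

for sentence in my_sentences:
    normalized = re.sub(r"\s+", " ", sentence).lower()
    status = "FOUND" if normalized in text else "not found"
    print(f"{status}: {sentence}")
```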

Step 3 – Reach out to the website’s administrator

Reach out to the webmaster of the website that stole the copy. If the webmaster’s email contact isn’t readily displayed, check the about or policy sections of their website. The webmaster’s address is often hidden within these pages.

Once you’ve found an email address, notify the webmaster that you are aware of the offending activity and request that the stolen content be removed within a defined period of time. A week to ten days is more than enough.

Should the webmaster voluntarily remove the stolen content, your job is done. Have a latte. However, most nefarious webmasters will ignore such warnings and hide behind a perceived veil of anonymity.

Now, the fun begins.

Step 4 – Contact the hosting provider

It’s time to perform a WHOIS lookup. This online tool provides you with the webmaster’s identity and, more importantly, their website hosting provider. Armed with this new information, I reached out to the hosting provider, let them know that a website they host had blatantly infringed on my copyright, and respectfully requested that they take down the website in question.
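A minimal Python sketch of that lookup, assuming the standard whois command-line tool is installed and using a placeholder domain, looks like this:

```python
# Minimal sketch: run a WHOIS lookup from Python and pull out the lines
# that usually identify the registrar and hosting/abuse contacts.
# Assumes the standard `whois` command-line tool is installed (macOS/Linux).
import subprocess

domain = "example.com"  # placeholder for the offending site's domain

result = subprocess.run(["whois", domain], capture_output=True, text=True)

for line in result.stdout.splitlines():
    lowered = line.lower()
    if any(key in lowered for key in ("registrar:", "abuse", "name server")):
        print(line.strip())
```

The registrar and abuse-contact lines are usually what you need to find the hosting provider’s takedown channel.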

Step 5 – File a DMCA request

If the hosting provider fails to respond, then it’s time to file a DMCA takedown request. Only take this step once you have exhausted the other options. Also, keep in mind that you need to have the authority to act on behalf of your organization before filing this request.

You have the option of drafting your own DMCA takedown request or downloading this DMCA Take Down Notice Template to customize and send to the offending website owner. After you have sent the DMCA notice, give the website a week to ten days to respond. If you don’t hear back within the time you designate in your notice, it’s time to elevate the complaint to Google and get some sort of resolution.

Step 6 – Request Google remove the stolen content

Log into Google Search Console: https://www.google.com/webmasters/tools/dmca-notice. This will take you to the copyright removal section within Google (see below). Simply follow the instructions and be sure to describe the nature of the work being copied and include URLs where the copyrighted work can be viewed. Also, include the link to the infringing material.

Scraping Site

The DMCA request tends to work fairly quickly, so keep an eye on how many of the offending site’s pages are currently indexed and compare that number over the next few days or weeks. You can double-check by running another search query containing a snippet of your stolen copy. If your request was successful, you will see that Google has removed the infringing pages from its search results once it completes its investigation.

Monitoring tip: If you would like to check the progress of your request, perform a site: search of the offending site and make a note of the number of pages Google has indexed (see below). Compare this number to future searches and you may find that Google now indexes fewer of the website’s pages than before your request. This is a sign that Google may be taking action.

stolen content before after

You’ll know you’ve reached a final resolution when you run a search query and see the following highlighted message displayed:

stolen content example

Good luck and happy hunting!


RankBrain in 2016

January 18th, 2016 | Categories: SEO

RankBrain

Google has long used word frequency, word distance, and world knowledge based on co-occurrences to connect the many relationships between words and serve up answers to search queries. But now, thanks to recent breakthroughs in language translation and image recognition, Google has turned to powerful neural networks that can connect the best answers to the millions of search queries Google receives daily.

RankBrain is the name of the Google neural network that scans content and is able to understand the relationships between words and phrases on the web.

Why is it better than the previous methods? In a nutshell, RankBrain is a deep-learning, self-improving system. Training itself on pages within the Google index, RankBrain learns the relationships between search queries and the content contained within that index.

How does it do this? Neural networks are very good at reading comprehension based on examples and at detecting patterns in those examples. Google’s vast database of website documents provides the large-scale training sets this requires. During training, Google converts key phrases or words into mathematical entities called vectors, which act as signals. RankBrain then runs an evaluation similar to a cloze test. A cloze test is a reading comprehension exercise in which words are omitted from a passage and then filled back in. With a cloze test there may be many possible answers, but ongoing training on a vast data set allows for a better understanding of the linguistic relationships between these entities. Let’s look at an example:

The movie broke all (entity1) over the weekend.

Hollywood’s biggest stars were out on the red carpet at the (entity2) premiere.

After deciphering the intricate patterns of the vectors, RankBrain can deliver an answer to a query such as “Which movie had the biggest opening at the box office?” by using vector signals from entities that point to the search result receiving the most attention. It does this without any specific coding, rules, or semantic markup. Even for vague queries, the neural network can outperform humans.
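To make the vector idea concrete, here’s a toy sketch (purely illustrative, not Google’s actual model, and the vector numbers are made up): words are represented as vectors, and cosine similarity scores how well each candidate fills the blank.

```python
# Toy sketch of the vector idea: words become vectors, and cosine
# similarity scores how well a candidate fits a context.
# The 3-dimensional vectors here are invented purely for illustration.
import math

vectors = {
    "movie":    [0.9, 0.1, 0.3],
    "records":  [0.8, 0.2, 0.4],
    "carrots":  [0.1, 0.9, 0.1],
    "premiere": [0.85, 0.15, 0.35],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# "The movie broke all (entity1) over the weekend."
context = vectors["movie"]
candidates = ["records", "carrots"]
best = max(candidates, key=lambda w: cosine(context, vectors[w]))
print(best)  # 'records' -- the candidate whose vector sits closest to the context
```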

With RankBrain, meaning is inferred from use. As RankBrain’s training and comprehension improves, it can focus on the best content that it believes will help answer a search query. As a result, RankBrain can understand search queries never seen before. In 2016, be prepared to provide the contextual clues that RankBrain is looking for.


How-To: Robots.txt Disaster Prevention

December 1st, 2015 | Categories: SEO

It’s any SEO‘s worst nightmare. The Production robots file got overwritten with the Test version. Now all your pages have been ‘disallowed’ and are dropping out of the index. And to make it worse, you didn’t notice it immediately, because how would you?

Wouldn’t it be great if you could prevent robots.txt file disasters? Or at least know as soon as something goes awry? Keep reading, you’re about to do just that.

The ‘How’

Our team recently began using Slack. Even if you don’t need a new team communication tool, it is worth having for this purpose. One of Slack’s greatest features is ‘Integrations’. Luckily for us SEOs, there is an RSS integration.

5 Simple Steps for Quick Robots.txt Disaster Remediation:

  1. Take the URL for your robots file and drop it into Page2Rss.
  2. Configure the Slack RSS integration.
  3. Add the Feed URL (from Page2RSS) for your robots file.
  4. Select the channel to which notifications will post. (I suggest having a channel for each website/client; read more on why later.)
  5. Relax and stop worrying about an accidental ‘disallow all’.
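If you’d rather not depend on Page2Rss at all, the same monitoring is easy to roll yourself. Here’s a minimal sketch (the robots URL and Slack webhook are placeholders) that you could run once a day from cron: it hashes the robots file, compares it to the last copy it saw, and posts to a Slack incoming webhook when something changes.

```python
# Minimal DIY alternative to Page2Rss: fetch robots.txt, compare it to the
# last copy you saved, and post to a Slack incoming webhook if it changed.
# The robots URL and webhook URL are placeholders; schedule this daily via cron.
import hashlib
import json
import pathlib
import urllib.request

ROBOTS_URL = "https://www.example.com/robots.txt"               # your site
SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder webhook
STATE_FILE = pathlib.Path("robots_last_hash.txt")

with urllib.request.urlopen(ROBOTS_URL) as resp:
    body = resp.read()

current = hashlib.sha256(body).hexdigest()
previous = STATE_FILE.read_text().strip() if STATE_FILE.exists() else ""

if current != previous:
    STATE_FILE.write_text(current)
    if previous:  # skip the alert on the very first run
        payload = json.dumps({"text": f"robots.txt changed at {ROBOTS_URL}"}).encode()
        req = urllib.request.Request(
            SLACK_WEBHOOK, data=payload, headers={"Content-Type": "application/json"}
        )
        urllib.request.urlopen(req)
```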

The Benefits of this Method

4 Benefits of Using Page2Rss & Slack to Watch Your Robots File:

  1. You can add your teammates to channels, so key people can know when changes occur! One person sets up the feed once, many people get notified.
  2. Page2Rss will check the page at least once a day. This means you’ll never go more than 24 hours with a defunct robots file.
  3. No one on your team has to check clients’ robots files for errors.
  4. You’ll know what day your dev team updated the file. This enables you to add accurate annotations in Google Analytics.

robots damage prevention

The ‘Why’

OK, cool, but why is this necessary? Because you care about your sites’ reputation with search engines, that’s why!

Mistakes happen with websites all the time, and search engines know that. They’re not in the business of destroying websites. But they are in the business of providing the best results to their customers. So if you neglect to fix things like this with a quickness, you’re risking demotion.

I’ve seen sites go weeks with a bad robots file, and it is not pretty. Once search engines have removed your site from the index, it is quite difficult to get back in. It can sometimes take weeks to regain the indexation you had before. Don’t undo the hard work you’ve put into search engine optimization because a file was overwritten. Do yourself a favor and set up this automated monitoring.

I’ve armed you with this info, now there is no excuse for getting de-indexed due to robots.txt issues. If this has happened to someone you know, please share this post with them!


Google and Its Book Scanning Initiative – Trick or Treat?

October 29th, 2015 | Categories: SEO

This Halloween, Google has toilet-papered your entire yard and the US Second Circuit Court of Appeals just rang the doorbell, left a flaming bag of you-know-what on your doorstep, and ran like a bat outta’ Hell. Who are you?

You’re an author with a career worth of product, mostly published offline through traditional literary mediums. You have every right to feel that you’ve been wronged. I know I do.

While I don’t advocate for the trampling of anyone’s rights in favor of another (one of my pet peeves), the 2nd Circuit Court decision has some upside. Think Frankenstein and fire.  Let me explain…

A Quick Overview

As you probably know, the objective of Google’s book scanning initiative is to scan every book available and make the contents available online for educational purposes. The initiative (as I understand it) does not make copyrighted materials available for free to people who wouldn’t otherwise have access to them. The project is meant to help libraries copy their current catalogs for use by patrons who already have access to the paid-for, hard-copy versions.

The Authors Guild took great exception to the book scanning project, as one might expect. Citing existing copyright law (17 U.S.C. § 107), the Authors Guild argued that Google’s book scanning initiative deprives writers of revenue from their work.

This court battle started way back in 2005.

The 10-year ordeal appears to be over. The US Second Circuit Court of Appeals sided with Google and its “fair use” defense. The “fair use” defense (greatly simplified here in the interest of expediency) argues that since the content is being used for educational purposes, it serves a greater good. Additionally, it does not “excessively damage the market” for the current copyright holder.

If you’re not a creator (or even if you are), you’re probably wondering what this means for your website or agency. Will the fire Google started be used for good or evil? Will users see a benefit or will my SEO efforts become a horror show?

The answer is yes, yes, yes and maybe. Let’s talk about the bad first.

More Panda Updates

There is no doubt that while Google may be providing a service through this massive book scanning effort, they’ll get their sweat equity when they use this data to fine-tune their algorithm in their pursuit to rid the internet of duplicate content. While this means a better user experience for most (yeah!), it could mean sleepless nights for SEOs and website operators who have used nefarious means to add “new” content to their blogs or websites.

Imagine your agency gets a new client. That’s a good thing. What you don’t know is that this client has in the past employed an SEO firm that had resorted to using re-purposed content from rare books. The next Panda update comes and your client gets slammed. Guess who gets blamed? FIRE, BAD.

However, there’s a great deal of good news too. Consider this:

More Books for the Disabled

In a related ruling, the appeals court decided in favor of the HathiTrust Digital Library and its application of “fair use”. A non-profit project, the HathiTrust Digital Library consists of a consortium of university libraries with a mission to provide digital books for the disabled. FRIEND, GOOD.

Better Experience for The End User

Less fluff and more real content will result from future algorithm changes. That’s great news for users and all of us who do things the right way. FRIEND, GOOD.

More Work for Content Creators

This is a big maybe but, in theory, this could work to a writer’s advantage. As the algorithm detects new re-purposed copy, something of value has to replace the fluff copy that had previously been used.  FRIEND, GOOD.

In Conclusion

Like the blind man in the original Frankenstein movie, I probably won’t convince any traditional writer that his or her rights are not being subjugated in favor of commerce and the rights of another. And in the end, you can’t ignore the fact that the “monster” enjoyed a big, fat cigar with a friend. It ain’t all bad.

And on a side note, I hear the new Panda algorithm can scrape poop from shoes.


How to Properly Handle Pagination for SEO [Free Tool Included]

October 18th, 2015 | Categories: SEO

Let’s start out by defining what I mean by ‘pagination’. This mostly applies to ecommerce sites where there are multiple pages of products in a given category, but it can occasionally be seen on lead-gen sites as well. Here’s an example of what this might look like:

  • http://www.freakerusa.com/collections/all
  • http://www.freakerusa.com/collections/all?page=1
  • http://www.freakerusa.com/collections/all?page=2
  • http://www.freakerusa.com/collections/all?page=3
  • http://www.freakerusa.com/collections/all?page=4

(pages 3 & 4 don’t actually exist on this site, but it helps illustrate my example a little bit more)

In this case, you’ve got 4 pages all with the same meta data. It’s likely that search engines will index all of the pages listed above and count the pages with parameters as duplicates of the original first page. You’ve also got a duplicate hazard with /collections/all and /collections/all?page=1. If you’re concerned with search engine optimization and your organic visibility, you’re going to want to keep reading.

Proper Pagination for SEO

So, how do you go about solving this problem? Fortunately, all the major search engines recognize and obey rel= tags: rel=canonical, rel=prev, and rel=next. The canonical tag says “hey, we know this page has the same stuff as this other page, so index our preferred version”. The ‘prev’ and ‘next’ tags say “we know these pages are paginated and have duplicate meta elements, so here’s the page that comes next, and here’s the one that precedes it”. There are HTML tags for each of these that your dev team will need to add to the <head> section of the pages. Rather than show you what these tags are and how to generate them for each page, I’ve built an Excel spreadsheet that will generate all the necessary tags (for paginated categories up to 20 pages in depth); all you need to do is add your base URL at the top and hit enter. By ‘base URL’ I mean this: “http://www.freakerusa.com/collections/all?page=”, basically the paginated URL without the actual page number.

Tag Builder CTA
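If you’d rather generate the tags in code than in a spreadsheet, here’s a rough Python sketch of one common pattern: ?page=1 is canonicalized to the clean base URL, deeper pages get self-referencing canonicals, and every page gets its rel=prev/next neighbors. Treat it as a starting point for your dev team, not gospel.

```python
# A quick Python equivalent of the tag-builder idea: print the <link>
# tags for each page in a paginated series. The base URL below is the
# example from this post; adjust total_pages to match your category.
base = "http://www.freakerusa.com/collections/all"
param = "?page="
total_pages = 4

def page_url(n):
    # Page 1 is canonicalized to the clean base URL without the parameter.
    return base if n == 1 else f"{base}{param}{n}"

for n in range(1, total_pages + 1):
    print(f"<!-- tags for {base}{param}{n} -->")
    print(f'<link rel="canonical" href="{page_url(n)}" />')
    if n > 1:
        print(f'<link rel="prev" href="{page_url(n - 1)}" />')
    if n < total_pages:
        print(f'<link rel="next" href="{page_url(n + 1)}" />')
    print()
```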


Avoiding Duplicate Content – Get the Most From Your SEO Efforts

October 14th, 2015 | Categories: SEO

The Panda algorithm is just another example of Google’s effort to identify “thin” content and enhance the user experience. To clarify, the actual quality of the copy is secondary. The objective is to add content that is of value to the user. Quality of copy and value of content can mean two very different things. So for example, the word count of any page theoretically isn’t that important as it does not correlate to value or thinness.

What specifically constitutes duplicate content, then?

Yes, thin content includes republished material and very similar pages. But that just scrapes (pun intended) the surface. In general terms, anything that muddies which page should rank or makes it difficult for Google to determine which page to index may be construed as a duplicate content issue. These could include (but are not limited to):

duplicate content elvis

  • Printer friendly versions of pages
  • A separate mobile site serving the same content at different URLs
  • www. and non-www. pages (no canonical tags)
  • Identical product descriptions for similar products
  • Guest posts

How can I solve my duplicate content issues?

Canonical tags can help solve many duplicate content issues. Proper use of rel=canonical tags can ensure that Google passes any link or content authority to the preferred URL. Your preferred URL will show up in the Google search results.

There is a clear, preferred method for eliminating mobile URL issues: move to a responsive site. While budget constraints may make this a less desirable option, responsive design enhances the user experience, which is what the Google algorithm is all about. The SEO benefits of responsive design make this an investment that will pay off immediately and well into the future.

Expanding your product descriptions can be a laborious task, particularly when you consider the sheer volume of products any one website may offer. You can bolster product description content in any number of ways: in addition to expanding the description verbiage itself, you can add specifications or details, include “related purchases”, or add testimonials from previous users. For items that require assembly, how-to videos are a great alternative.

If your site accepts guest posts, search online before posting any new guest content to ensure that the content does not reside elsewhere.

Creating New Content: Does Size Matter?

I’ve heard it said that Google determines the quality of its search results using the “time to long click” method. In other words, a significant factor in determining the value of a search result is the amount of time a user spends on a website after leaving the Google search page, with additional emphasis placed on the user’s next move. If the user does not go back to Google to perform another search, the presumption is that the question was answered adequately; it doesn’t matter how long the user spent on the page that was served up as the result. If this is accurate, the length of the copy is not important. If the content was lengthy but did not meet the user’s expectations, they would presumably return to re-search the topic. If the resulting article was short but to the point and adequately answered the user’s query, they would not likely return to perform another search. Assuming the time to long click method is used, size does not matter so much as the actual value of the material to the user.

That being said, sometimes less isn’t more. In my personal experience, longer articles seem to rank better. This may simply be because a longer article shares more information, increasing the likelihood that the user finds what they’re looking for. While that isn’t consistent with Google’s stated position, why not err on the side of caution and include not only valuable information but as much of it as possible?


Should I Use Canonicals or 301 Redirects?

October 8th, 2015 | Categories: SEO

Should you 301 redirect that page to another, or should you use a rel=canonical tag? There are tons of reasons why you might have some redundancy on your site. If it’s an eCommerce site, you’re probably displaying product listing pages a few different ways (sorted by price, color, rating, etc.), or you might have navigation pages that are similar to your SEO landing pages. Whatever the case may be, chances are pretty good you have some form of duplication on your site that needs addressing. This topic has been debated for years, but the real answer lies in one simple question:

Should people be able to access both pages in question?

Should I use canonicals or 301 redirects?

If the answer to this question is Yes, you want to use rel=canonical. Doing so will point search engines toward your preferred page, but it won’t prevent people from accessing, reading, and interacting with both pages. Here are some times you might see the rel=canonical tag in action:

  • www & non-www versions of URLs
  • parameters that change how a product listing page is sorted
  • navigation pages that point to an equivalent SEO landing page (it doesn’t always make sense to put content on a nav page)

If the answer to your question is No, you should remove that page and 301 redirect it. This situation is most common on eCommerce sites where products are discontinued but you can’t simply delete the page (what if someone is linking to it?!?). Occasionally, you’ll see cases where this needs to be done for SEO landing pages. In large SEO projects, where there are hundreds or thousands of keywords, content can get duplicated easily. Keeping a perfect account of every single SEO landing page that’s been written is basically impossible, so you might end up with two different pages with URLs like /blue-widgets and /widgets-that-are-blue. Obviously, even if the content isn’t identical, you can’t keep both of those pages around. Figure out which one has the most authority, links, and traffic; keep that one, and redirect the other one to it.
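To make the “No” case concrete, here’s a minimal sketch (assuming a Flask app, which is not part of the original example) showing the retired URL answering with a permanent 301 to the page you’re keeping:

```python
# Minimal Flask sketch of the "No" case: the duplicate page is removed
# and 301-redirected to the page you're keeping. The two URLs come from
# the example in this post; Flask itself is an assumption for illustration.
from flask import Flask, redirect

app = Flask(__name__)

@app.route("/blue-widgets")
def blue_widgets():
    return "The page we decided to keep."

@app.route("/widgets-that-are-blue")
def widgets_that_are_blue():
    # Permanent redirect so links and authority consolidate on /blue-widgets.
    return redirect("/blue-widgets", code=301)

if __name__ == "__main__":
    app.run()
```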

Next time you come to this fork in the road, remember to ask yourself whether or not there is value in people being able to see both versions.


Matt Cutts Shares His 6 Lessons From The Early Days of Google

January 12th, 2015 | Categories: SEO

From his presentation at the University of North Carolina, January 8th, 2015

It was a long walk from the parking lot, down the brick sidewalk to the mostly glass edifice called the FedEx Center. On the way there, I met two UNC computer science professors, one rather tall and thin and the other short and stocky. Once having learned why I had come, they offered to show me to the auditorium where Matt Cutts would soon speak. That’s when it hit me. Brick sidewalk, tall and short, Matt Cutts….

We were off to see the Wizard.

While walking into the auditorium with said professors, I ran into a colleague from Beacon. The picture was nearly complete. If only she were carrying a miniature dog in one hand.

The event was well attended – so well in fact, that a good number of students (having neglected to register in advance) were being held out until it was determined that there would be enough room to accommodate them.

The Wizard was busy…Go Home.

It was clear from the outset that Cutts was there to speak primarily to the students of his alma mater. There would be no pulling back of the SEO curtain or uncloaking of Google’s algorithm for the SEOs in attendance. Still, Cutts would share some interesting stories from his experiences on the front lines of the war on spam. Best of all, there would be a Q&A session (see details below).

Within the framework of his 6 lessons for students (and presumably those who wish for a long, fruitful career in SEO), Cutts shared a number of fascinating experiences culled from his many years at Google – from his first major controversy involving the Digital Millennium Copyright Act (DMCA) and the Church of Scientology to public policy and how it reshapes the environment under which start-ups operate.

The Wizard’s 6 lessons from the early days of Google were as follows:

1) Find creative solutions to apparent constraints

2) Be proactive – ask for what you want

3) Question your assumptions

4) Weird, bad things will happen

5) Take more pictures and have fun

Speaking for the majority of the 40+ minute session about his career evolution from Google’s ad department to Chief Spam Cop, Cutts covered a wide variety of subjects from data volume and AI to data safeguards and Fred Brooks. He impressed upon the attendees the fact that there are, indeed, faces behind Google – not every reconsideration request is answered via form letter. He shared the fact that every Google employee must spend some period of time on the user support team.

Take more pictures along your journey. That was another point of emphasis. You’ll want to remember the good times. And even if you love what you do, there will be dark days, too.

Click here for appropriate sound effect

While Cutts’ monologue was entertaining, it may have been the Q&A portion that was most interesting. Read on and you’ll find just a few of the questions posed to Matt along with his answers as I can best recall / summarize them. If you were there and feel I didn’t quite get it right, please put the record straight by leaving a comment at the bottom of the page.

Q: Safesearch – How insulated are you from backlash when it occurs?

A: Not as insulated as one would think. Matt has actually fielded parent complaints as part of the user support team. This and other lines of communication were the genesis for his debunking of popular internet myths through his Google Guy posts.

Q: What do you see as the future of search?

A: Voice is important, as well as context. With informed consent, Google can make the user’s life a whole lot easier.

Q: What safeguards are in place to protect emails and other proprietary information on Google servers?

A: 1. The marketplace – people can move to Yahoo if Google does not do its job adequately. 2. Takeout.google.com – you can download all of your information, export it, and take it to another company if you wish. 3. Regulators like the FTC.

Q: Have you given thought to when you’ll return to Google?

A: His answer was somewhat vague. Cutts stated that while he had been a workaholic for some time now, he felt that his family should “get the relaxed version of myself for a little longer”.

Q: When the University of Kentucky plays North Carolina Chapel Hill, who do you root for?

A: Much to the disappointment of the students on hand, Matt stated that he finds himself rooting for Kentucky but offered this nugget to appease the UNC faithful: “We can all agree on one thing…Duke sucks”.

Cutts’ presentation had it all – heart, courage and brains. And when all was said and done, everybody got what they came for, I suspect. Now for the long trek home from Emerald City…

There’s no place like home. There’s no place like home.

Thanks to Andrea Cole for the nifty pics of Matt.


Sub-domains vs. sub-directories – which is better for SEO?

April 17th, 2014 | Categories: SEO

It’s easy to recognize the difference between sub-domains and sub-directories. When creating a sub-domain, the new section name is placed immediately in front of the primary domain name, separated by a dot. So if my site is www.teammascots.com, I can create a sub-domain specific to one type of mascot with the URL dogs.teammascots.com. Dogs is a sub-domain of teammascots.com.

By contrast, if I want to create a sub-directory for this same page, dogs would follow the primary url and look like this: www.teammascots.com/dogs.

They’re both readily accessible from your site’s navigation so there’s really no difference, right? As Lee Corso would say, NOT SO FAST.

(If you have no idea who Lee Corso is, you can learn more here.)

Here’s just a few reasons why sub-domains may be better suited to boost your site’s performance:

Google treats sub-domains similarly to top-level URLs. By contrast, a sub-directory creates another layer and is one level further removed from the main index or home page. Sub-domains enable you to create a more targeted top level and get the most out of your web design.

Creating a sub-domain allows you to use keyword triggers in your URL. Use your keyword toward the front of the URL. It may appear slightly more relevant to search engine crawlers while keeping the URL short.

You can avoid country code restrictions. Some countries require a company to have a presence within their borders before you can use the applicable country extension. Using a sub-domain, you can address an audience or language demographic directly in the URL as opposed to having to use a country code. So instead of www.teammascots.ru, I’d opt for ru.teammascots.com. Problem solved, and I didn’t have to outsource anything to a goat farmer in Kazakhstan.

CTR may be higher. It stands to reason that since folks read left to right, they’ll see the keyword they’re searching for sooner in your URL. In theory, this should result in more clicks.

Sub-domains vs. sub-directories. Which is better for search engine optimization? In my humble opinion, there is a clear winner. Hey…where’d my UGA head go?


Google Artificial Intelligence and Hummingbird

February 18th, 2014 | Categories: SEO

Artificial Intelligence

In a previous post, I stated that Google named its new algorithm after a hummingbird because a hummingbird is quick and precise and has the tremendous ability to recall information about every flower it has ever visited. The name implies a more robust and faster search engine algorithm.

But besides extraordinary powers of recall, I overlooked the important fact that hummingbirds are also very smart. Hummingbirds have been observed learning new behaviors in the wild. Also, a hummingbird’s brain is 4.2% of its body weight. That percentage is the largest proportion in the bird kingdom. Maybe all of the above reasons explain why Google named the new algorithm Hummingbird. But above all, just like a hummingbird, the new search engine algorithm is pretty smart.

Many have pointed out Hummingbird’s new powers of semantic search. But I believe there is another component to Hummingbird. It’s an area of artificial intelligence researchers have been seeking to perfect for some time.

Almost four years ago, Google’s Amit Singhal, the head of Google’s core ranking team, spoke to Engadget about his dreams for search. His number one and most difficult goal was to include information that doesn’t come from text but from images. At that time, Singhal described the “computer vision algorithms” as still in a basic form. Most of the information about an image still came from the text surrounding it.

Peter Norvig alluded to the importance of images in his interview with Marty Wasserman for Future Talk in the fall of 2013 not long after Hummingbird was implemented at the end of August 2013.

Marty Wasserman: So it sounds like replicating vision is one of the most important things? Having a camera look at an object and interpreting what that object is?

Peter Norvig: I think that’s right and I think it’s a useful task and it connects you to the world so we have a broader connection than just typing at a keyboard. Now if a computer can see, it can interact a lot more and be more natural. And it’s also important in terms of learning. Because we have been able to teach our computers a lot by having them read text. There is a lot of text on the internet so you can get a lot out of that. Make a lot of connections and know that this word goes with this other word and other words don’t go together. But they are still just words and you would really like your computer to interact with the world and understand what it is like to live in the world. You can’t quite have that but it seems like video is the closest thing.

Marty Wasserman: I think Google’s worked on this problem a lot. You’re trying to interpret words. Basic search, and you’re an expert on search, doesn’t know what a word means but it can tell how frequently it occurs. But the next level of search would be to have a better understanding of what the word means so it can figure out the nuances of what the person is asking for.

Peter Norvig: That’s right. So we, you know, the first level is just they ask for this word and show me the pages that have that word. The next level is to say what did that word really mean and maybe there’s a page that talked about something but uses a slightly different words that are synonyms or related words. So we’re able to do that – figure out which other related words count and which ones don’t. And then the next level is saying well you asked me a string of words and it’s important what the relationships are between those words. And figuring out that out. So we have to attack understanding language at all levels and understanding the world at all levels what are these words actually refer to in the world.

In the interview, Peter Norvig stresses the importance of deriving meaning not only from text but also from images. This requires a computer to look at the pictures on a website and determine what those images are. Object identification sounds simple and is easy for humans, but it is extremely difficult for computers. However, Google has recently made significant advancements in this area.

In June 2013, the Google research blog announced that by using deep learning, Google had moved a step towards the toddler stage.

http://googleresearch.blogspot.com/2013/06/improving-photo-search-step-across.html

Images no longer have to be tagged and labeled to be identified.

This is powered by computer vision and machine learning technology, which uses the visual content of an image to generate searchable tags for photos combined with other sources like text tags and EXIF metadata to enable search across thousands of concepts like a flower, food, car, jet ski, or turtle… We took cutting edge research straight out of an academic research lab and launched it, in just a little over six months. You can try it out at photos.google.com.– Chuck Rosenberg, Google Image Search Team

What does this mean for search? Images become more important as Google can label them more precisely with or without your help and associate those images with other textual concepts on your website and the knowledge graph. Doing so allows Google to enhance classification of websites and obtain a better understanding of how to match search user intent, the sometimes subtle nuances of search queries, and the best matching webpage.
