Measuring the Quality of Interesting Entities on the Web

Quality is defined in Wikipedia as the standard of something as measured against other things of a similar kind; the degree of excellence of something.

Today’s information and data pools on the Web focus on the quantity of information rather than its quality.

The assessment of information quality is especially important because decisions are often based on information from multiple, sometimes unknown, sources whose reliability and accuracy are questionable.

However, the Web lacks quality-dependent filter mechanisms, automatic identification of misuse patterns, and tools to establish user trust in information and authors.

Hence, there is a need to develop mechanisms for estimating the quality of textual Web documents and to evaluate the effectiveness and efficiency of these mechanisms.


My entities of interest include Websites, news feeds, social media feeds, digital adverts, and other Web articles. Other important entities include air, water, soil, and life.

Large scale machine learning is playing an increasingly important role in improving the quality and monetisation of Internet properties. A small number of techniques, such as regression, have proven to be widely applicable across Internet properties and applications.

Sibyl: A System for Large Scale Machine Learning at Google

Hundreds of millions of times a day, people ask Google questions, and within a fraction of a second Google needs to decide which of the billions of pages on the web to show them, and in what order.

Users want the answer, not trillions of webpages.

Our goal is simple: to give people the most relevant answers to their queries as quickly as possible.

Google Search Quality


What are the properties that influence the overall rank, quality, and importance of a Website?

Every day, millions of useless spam pages are created. Every week, over 10 million users encounter harmful websites that deliver malware and scams. Many of these sites are compromised personal blogs or small business pages that have fallen victim because of a weak password or outdated software. The compromised site remains a problem that needs to be fixed.

Helping webmasters re-secure their sites

Remedying Web Hijacking: Notification Effectiveness and Webmaster Comprehension

Spam sites attempt to game their way to the top of search results through techniques like repeating keywords over and over, buying links that pass PageRank or putting invisible text on the screen.
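As a rough illustration of the first technique, keyword repetition can be spotted by measuring how much of a page's text is taken up by a single word. The sketch below is only an assumption about how such a check might look, not a description of how Google detects spam; the function names and the 0.3 threshold are hypothetical.

```python
import re
from collections import Counter


def keyword_density(text, top_n=3):
    """Return the top_n most frequent words with their share of all words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return []
    counts = Counter(words)
    total = len(words)
    return [(word, count / total) for word, count in counts.most_common(top_n)]


def looks_keyword_stuffed(text, threshold=0.3):
    """Flag text in which one word exceeds `threshold` of all words.

    The threshold is an illustrative assumption, not an established cut-off.
    """
    density = keyword_density(text, top_n=1)
    return bool(density) and density[0][1] > threshold


if __name__ == "__main__":
    print(looks_keyword_stuffed("cheap pills cheap pills buy cheap pills cheap cheap"))
    print(looks_keyword_stuffed("The quick brown fox jumps over the lazy dog"))
```

A practical detector would at least remove stop words and look at phrases rather than single words; this sketch only shows the basic idea of a density check.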

This is bad for search because relevant websites get buried, and it’s bad for legitimate website owners because their sites become harder to find.

The good news is that Google’s algorithms can detect the vast majority of spam and demote it automatically.

Google fights spam through a combination of computer algorithms and manual review.

Google Inside Search

ECML/PKDD 2010 Discovery Challenge

High quality is not simply the opposite of Web Spam (any deliberate action that is meant to trigger an unjustifiably favorable [ranking], considering the page’s true value). There are various other, often subjective, aspects.

The goal of the challenge is to develop automatic site-level classifiers that cover aspects such as trustworthiness, authoritativeness, and neutrality, as well as genre classification (editorial, news, commercial, educational, Web spam, and more).

The dataset is a large collection (23 million pages in 99,000 hosts in the .EU domain) of annotated Web hosts.

The dataset is composed of the following, all in one archive (v2-all_in_one.tgz):

  • Training labels
  • URLs and hyperlinks
  • Content-based and link-based Web spam features
  • Term frequencies
  • Natural Language Processing features



Web articles

How good is Web data (Wikipedia articles, blog articles, etc.)?

Important dimensions of data quality include accuracy, completeness, freshness, and consistency. Users of Wikipedia articles will be most interested in accuracy and completeness, while for news articles freshness is vital in addition to accuracy and completeness.
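One simple way to make this concrete is a weighted average over the four dimensions, with different weights per article type. The sketch below assumes each dimension has already been scored between 0 and 1; the weights and the names `WEIGHTS` and `quality_score` are hypothetical and chosen only to reflect the emphasis described above.

```python
# Illustrative weights only: Wikipedia stresses accuracy and completeness,
# news adds freshness as an equally important dimension.
WEIGHTS = {
    "wikipedia": {"accuracy": 0.4, "completeness": 0.4, "freshness": 0.1, "consistency": 0.1},
    "news": {"accuracy": 0.3, "completeness": 0.3, "freshness": 0.3, "consistency": 0.1},
}


def quality_score(scores, weights):
    """Weighted average of per-dimension scores (each assumed to lie in 0..1)."""
    total_weight = sum(weights.values())
    if total_weight == 0:
        raise ValueError("weights must not all be zero")
    return sum(weights[dim] * scores[dim] for dim in weights) / total_weight


if __name__ == "__main__":
    # A hypothetical article that is accurate and complete but out of date.
    article = {"accuracy": 0.9, "completeness": 0.8, "freshness": 0.2, "consistency": 0.7}
    print(quality_score(article, WEIGHTS["wikipedia"]))
    print(quality_score(article, WEIGHTS["news"]))
```

With these weights, the same stale but accurate article scores higher as a Wikipedia article than as a news article, which matches the intuition in the paragraph above.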

Wikipedia articles

Given the daily increase in the amount of data on the Web, machine-based assessment of Information Quality (IQ) is becoming a topic of enormous interest.

The three main research lines related to IQ in Wikipedia are featured article identification, development of quality measurement metrics, and quality flaw detection.

Quality Flaw Prediction

Blog articles

Blogs serve multiple purposes, resulting in several types of blogs that vary greatly in quality and content. There is a need to build an automatic system that identifies quality blogs, to assist web users and information specialists.

User-centered evaluation of quality of blogs (PhD thesis)



Online Adverts (Ads)

Ad is a short, informal name for an advert: something (a short film, notice, image, etc.) shown to the public to help sell a product or to make an announcement.

Mobile advertising is a rapidly growing industry that supports publishers worldwide, with eMarketer tipping mobile ad spend to exceed $100 billion in 2016.

Popular Ad platforms: Social Media, Newspaper, Billboards, TV, Radio, Mobile Phones, Landlines, Emails, Mail Posts.

Adverts can be annoying: low-quality adverts are capable of hogging data and disrupting the user experience. Research Problem!

Despite our occasional annoyance, advertising has become an integral part of media precisely because we tolerate being shown relevant adverts that cater to our interests (source: Why Publishers Should Care About Ad Quality).

Research on the most annoying TV ads, based on 1,600 votes, was summarised in a visualisation:

[Figure: The most annoying adverts from the past 15 years]

Annoying features: repetitive jingles, gender or nationality stereotyping, and patronising tone.

  • Advertisers assume brand awareness is the key to making consumers purchase. Research, however, suggests that advertising makes a stronger emotional and behavioural impact when consumers are paying less conscious attention to it (Dr. Haiming Hang).
  • If consumers are annoyed because they feel an ad is not representing them or is in poor taste, then with the power of social media they can let the brand and the world know (Dr. Natascha Radclyffe-Thomas).

Existing Solutions

  • Ad-blockers (end user), e.g. Adblock Plus, used to improve the mobile experience (online shopping, research, or communicating on social media). Ad-blockers:
    • protect users by hiding intrusive pop-ups and banners
    • force marketers to turn to higher quality, more efficient ads that can be viewed more positively by users.
  • Upping ad quality (publishers): raising advertising quality is likely to improve the revenue performance and the reputation of mobile advertising.
    • taking inspiration from television by building better-quality ad formats and placing them naturally within apps
    • publishers need to work with advertising partners to operate tighter guidelines aimed at improving advert quality, vetting adverts to ensure they meet design guidelines, and filtering out fraudsters
      • it’s up to publishers to convince powerful mediation platforms to heighten creative transparency and introduce detective measures for weeding out unwanted ads (source: Cleaning Up Ad Quality)
    • publishers need to make sure they are showing relevant adverts to their users without annoying them, e.g. Google Ads options and Microsoft Ads options.


News and Social Media Feeds

Facebook News Feed

Can news feeds on mobile or Web platforms be shown to readers in the order in which they want to read them? According to Facebook, there are on average 1,500 potential stories (from friends and Pages) for people to see every time they visit Facebook News Feed, and most people don’t have enough time to see them all.

With so many stories, there is a good chance that users would miss updates they want to see if News Feed were displayed as a continuous, unranked stream of information. Research Problem!

The goal of News Feed is to deliver the right content to the right people at the right time so they don’t miss the stories that are important to them (Facebook).

Existing Solutions

News Feed stories can be ranked either by how users interact with them or in chronological order.

  • Research by Facebook has shown that the number of stories people read, like, and comment on decreases when News Feed is ranked purely chronologically.

News Feed algorithm responds to the following signals from users:

  • How often a user interacts with the friends, Pages, or public figures (like an actor or journalist) who posted
  • The number of likes, shares, and comments a post receives in total and from the user's friends in particular
  • How often the user interacted with this type of post in the past
  • Whether or not the user and other users are hiding or reporting a given post
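The four signals above can be combined into a single score per story and compared with a purely chronological ordering. The sketch below is an assumption for illustration only: Facebook's actual formula is not given in the sources cited here, and the `Story` fields, the log scaling, and the weights are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Story:
    story_id: str
    posted_at: float        # timestamp; larger means newer
    author_affinity: float  # 0..1, how often the user interacts with the poster
    type_affinity: float    # 0..1, how often the user interacted with this post type
    total_engagement: int   # likes + shares + comments from everyone
    friend_engagement: int  # likes + shares + comments from the user's friends
    hides_and_reports: int  # number of times the post was hidden or reported


def story_score(story):
    """Combine the four signals into one number (illustrative weights)."""
    # Log scaling stops very popular posts from drowning out everything else;
    # engagement from friends counts double.
    engagement = math.log1p(story.total_engagement) + 2.0 * math.log1p(story.friend_engagement)
    # Each hide or report shrinks the score.
    penalty = 1.0 / (1.0 + story.hides_and_reports)
    return (story.author_affinity + story.type_affinity + engagement) * penalty


def rank_by_signals(stories):
    """Highest-scoring stories first."""
    return sorted(stories, key=story_score, reverse=True)


def rank_chronologically(stories):
    """Newest stories first."""
    return sorted(stories, key=lambda s: s.posted_at, reverse=True)


if __name__ == "__main__":
    feed = [
        Story("old_popular", 100.0, 0.9, 0.8, 200, 15, 0),
        Story("new_quiet", 200.0, 0.1, 0.2, 3, 0, 0),
        Story("new_reported", 300.0, 0.9, 0.8, 200, 15, 50),
    ]
    print([s.story_id for s in rank_by_signals(feed)])
    print([s.story_id for s in rank_chronologically(feed)])
```

In this toy feed the older, well-liked story moves to the top under signal-based ranking, while the heavily reported story drops to the bottom even though it is the newest.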

Organic stories that users were not able to scroll down far enough to see can reappear near the top of News Feed if the stories are still getting lots of likes and comments.

A better way to surface older stories


Twitter News Feed


Online Health Information

Inappropriate health information can lead people away from evidence-based healthcare. Low-quality health information on the Web can have serious consequences for public health and healthcare services.

Existing Solutions 

Existing approaches to measuring information quality (IQ) include the Journal of the American Medical Association (JAMA) score and the Health On the Net (HON) criteria. Both measure IQ in terms of the presence of explicit metadata (such as authorship, ownership, and currency) or broad textual criteria such as readability, with the primary aim of assessing reliability and trustworthiness.
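Both kinds of measure are easy to sketch. The example below is not the official JAMA or HON instrument: `metadata_score` simply checks the three metadata fields named above, and the readability part uses the Flesch Reading Ease formula as one well-known choice, with a rough vowel-group heuristic for counting syllables. All function and field names are hypothetical.

```python
import re

# Metadata fields named in the text above; a real instrument may use others.
METADATA_FIELDS = ("authorship", "ownership", "currency")


def metadata_score(page):
    """Fraction of the expected metadata fields that are present and non-empty."""
    present = sum(1 for field in METADATA_FIELDS if page.get(field))
    return present / len(METADATA_FIELDS)


def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels, at least one per word."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text):
    """Flesch Reading Ease: higher values indicate easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / len(sentences)) - 84.6 * (syllables / len(words))


if __name__ == "__main__":
    page = {"authorship": "Dr. A. Example", "ownership": "", "currency": "2016-03-01"}
    print(metadata_score(page))
    print(flesch_reading_ease("The cat sat. The dog ran."))
```

Neither number says anything about whether the health advice itself is correct, which is the gap that content-based analysis of the text is meant to address.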

An open research question in this area is how to identify other useful dimensions of IQ based on a more detailed analysis of the text content of the pages, using techniques of Natural Language Processing (NLP). Such measures might range from relatively superficial analysis of text style or sentiment to deeper ‘understanding’ of the scientific basis of the information provided, particularly with respect to the type of interventions (therapeutic or preventative) presented to the reader (source: Advert).




