Startup Tools

1. Startup Tools
2. Lean LaunchPad Videos
3. Founding/Running Startup Advice
4. Market Research
5. Life Science
6. China Market
…

Source: Startup Tools


Real Time Fraud Detection with Sequence Mining


Real-time fraud detection is a use case in which multiple components of the Big Data ecosystem play a significant role: Hadoop batch processing builds the predictive model, and Storm uses that model to predict fraud from the real-time transaction stream. Redis serves as the glue between the different subsystems.

In this post I will go through the end-to-end solution for real-time fraud detection, using credit card transactions as an example, although the same solution can be used for any kind of sequence-based outlier detection. I will be building a Markov chain model using the Hadoop-based implementation in my open source project avenir. The prediction algorithm implementation

View original post 1,555 more words
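To make the sequence-based idea concrete, here is a minimal Python sketch of a first-order Markov chain over transaction states. It is not the avenir implementation (which runs on Hadoop), and it is not the post's prediction algorithm, whose description is cut off in this excerpt. The state labels ("L", "M", "H" for low, medium and high amounts), the Laplace smoothing and the average negative log-likelihood score are all assumptions made for illustration.

```python
import math


def train_transition_probs(sequences, states, smoothing=1.0):
    """Estimate first-order transition probabilities P(next | current).

    Laplace smoothing (an assumption of this sketch) keeps unseen
    transitions from getting zero probability.
    """
    counts = {s: {t: smoothing for t in states} for s in states}
    for seq in sequences:
        for prev, curr in zip(seq, seq[1:]):
            counts[prev][curr] += 1
    probs = {}
    for s in states:
        total = sum(counts[s].values())
        probs[s] = {t: counts[s][t] / total for t in states}
    return probs


def avg_neg_log_likelihood(seq, probs):
    """Score a sequence; higher means less likely under the model."""
    steps = list(zip(seq, seq[1:]))
    return -sum(math.log(probs[a][b]) for a, b in steps) / len(steps)


# Hypothetical states: L = low, M = medium, H = high transaction amount.
states = ["L", "M", "H"]
history = [
    ["L", "L", "M", "L", "L"],
    ["L", "M", "L", "L", "M"],
    ["M", "L", "L", "L", "L"],
]
model = train_transition_probs(history, states)

typical = avg_neg_log_likelihood(["L", "L", "M", "L"], model)
unusual = avg_neg_log_likelihood(["L", "H", "H", "H"], model)
print(f"typical={typical:.3f} unusual={unusual:.3f}")
```

A sequence whose score exceeds some threshold would be flagged as a possible outlier. Choosing that threshold, and how the states are derived from raw transactions, is outside what this excerpt covers.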

Fraudsters, Outliers and Big Data


Recently, I started working on Hadoop-based solutions for fraud detection. Fraud detection is critical for many industries, including but not limited to finance, insurance and retail. Data mining is a key enabler of effective fraud detection.

In this and some following posts, I will cover commonly used data mining solutions for fraud detection. I also have an accompanying open source project on GitHub called beymani, where the implementations of these algorithms will be available.

View original post 2,128 more words

Quality and Credibility Metrics of Online Entities: Academic Review

Quality Vs Credibility of Online Entities


The quality of being believable or worthy of trust.

Credibility challenges according to Stanford Web Credibility Research:

  • What causes people to believe (or not believe) what they find on the Web?
  • What strategies do users employ in evaluating the credibility of online sources?
  • What contextual and design factors influence these assessments and strategies?
  • How and why are credibility evaluation processes on the Web different from those made in face-to-face human interaction, or in other offline contexts?

Ph.D. Thesis: How Do People Evaluate a Web Site’s Credibility?


The totality of features and characteristics of a product or service that bear on its ability to satisfy stated or implied needs.

  • degree of excellence or fitness for use

Is quality an indicator of credibility or vice versa?

User Generated Content: How Good is It? Slide

  • How can we estimate the quality of UGC?
    • Directly evaluate the quality.
      • What are the elements of social media that can be used to facilitate automated discovery of high-quality content?
      • What is the utility of links between items, quality rating from members of the community, and other non-content information to the task of estimating the quality of UGC?
        • How are these different factors related?
        • Is content alone enough for identifying high-quality items?
        • Can community feedback approximate judgments of specialists?
    • In this work, the authors used a judged question/answer collection where good questions usually have good answers to model a classifier to predict good questions and good answers, obtaining an AUC (area under the curve of the precision-recall graph) of 0.76 and 0.88, respectively.
      • The drawback is that the quality gap is balanced by volume. The larger the volume of the UGC, the less difficult the quality evaluation.
    • Obtaining indirect evidence of the quality.
      • use UGC for a given task and then evaluate the quality of the task results.
      • evaluation of the quality of extraction of semantic relations using the Open Directory Project (ODP). Precision of over 60%.
    • Crossing different UGC sources and infer from there the quality of those sources.
      • using collective knowledge (wisdom of crowds) to extend image tags, and showing that almost 70% of the tags can be semantically classified by using WordNet and Wikipedia.
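The classifier results above are reported as the area under the precision-recall curve. As a rough illustration of what that number measures, the Python sketch below computes average precision, a common step-wise approximation of that area. The labels and scores are made-up toy values, not data from the cited work, and the authors' exact evaluation procedure is not described here.

```python
def average_precision(labels, scores):
    """Step-wise approximation of the area under the precision-recall curve.

    labels: 1 for a high-quality item, 0 otherwise.
    scores: classifier scores, higher meaning "more likely high quality".
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            # Precision at the rank where this positive item is retrieved.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


# Toy example: two good answers, two poor ones (hypothetical values).
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
print(round(average_precision(toy_labels, toy_scores), 4))
```

A classifier that ranks every good item above every poor one scores 1.0; mixing poor items in among the good ones lowers the value.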

The Online Entities Quality Challenge

Entities: social media platforms (Facebook, Twitter, …) or information systems, and information or contents on the internet (articles: posts, comments, …).

The advent and openness of online social media platforms often leave them highly susceptible to abuse by suspicious entities. It therefore becomes increasingly important to automatically identify these suspicious entities and mitigate or eliminate their threats.

Anomaly Detection on Social Data: Ph.D. Thesis

The rapid growth of the Internet and the lack of enforceable standards regarding the information it contains have led to numerous information quality problems.

  • inability of Search Engines to wade through the vast expanse of questionable content and return “quality” results to a user’s query

Developing a Framework for Assessing Information Quality on the World Wide Web

Fundamental Definitions

“Data Quality” is described as data that is “Fit-for-use”:  data considered appropriate for one use may not possess sufficient attributes for another use!

Common Dimensions of Information or Data Quality

  • Accuracy: extent to which data are correct, reliable and certified free of error
  • Consistency: extent to which information is presented in the same format and compatible with previous data
  • Security: extent to which access to information is restricted appropriately to maintain its security
  • Timeliness: extent to which the information is sufficiently up-to-date for the task at hand
  • Completeness: extent to which information is not missing and is of sufficient breadth and depth for the task at hand
  • Concise: extent to which information is compactly represented without being overwhelming (i.e. brief in presentation, yet complete and to the point)
  • Reliability: extent to which information is correct and reliable
  • Accessibility: extent to which information is available, or easily and quickly retrievable
  • Availability: extent to which information is physically accessible
  • Objectivity: extent to which information is unbiased, unprejudiced and impartial
  • Relevancy: extent to which information is applicable and helpful for the task at hand
  • Usability: extent to which information is clear and easily used
  • Understandability: extent to which data are clear without ambiguity and easily comprehended
  • Amount of data: extent to which the quantity or volume of available data is appropriate
  • Believability: extent to which information is regarded as true and credible
  • Navigation: extent to which data are easily found and linked to
  • Reputation: extent to which information is highly regarded in terms of source or content
  • Useful: extent to which information is applicable and helpful for the task at hand
  • Efficiency: extent to which data are able to quickly meet the information needs for the task at hand
  • Value-Added: extent to which information is beneficial, provides advantages from its use

These attributes of data quality can vary depending on the context in which the data is to be used.
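As a hedged illustration of how two of these dimensions might be turned into numbers, the Python sketch below scores completeness and timeliness for a small set of records. The record layout, the field names ("title", "author", "updated") and the age thresholds are hypothetical choices for this example, not definitions taken from the framework cited here.

```python
from datetime import date, timedelta


def completeness(records, required_fields):
    """Fraction of required field slots that are actually filled in."""
    total = len(records) * len(required_fields)
    if total == 0:
        return 0.0
    filled = sum(
        1
        for record in records
        for field in required_fields
        if record.get(field) not in (None, "")
    )
    return filled / total


def timeliness(records, today, max_age_days, date_field="updated"):
    """Fraction of records updated within max_age_days of today."""
    if not records:
        return 0.0
    cutoff = today - timedelta(days=max_age_days)
    fresh = sum(
        1
        for record in records
        if record.get(date_field) is not None and record[date_field] >= cutoff
    )
    return fresh / len(records)


# Hypothetical records for illustration only.
records = [
    {"title": "A", "author": "x", "updated": date(2024, 1, 10)},
    {"title": "B", "author": "", "updated": date(2023, 6, 1)},
]
today = date(2024, 1, 31)

print(completeness(records, ["title", "author"]))
print(timeliness(records, today, max_age_days=30))   # a task needing recent data
print(timeliness(records, today, max_age_days=365))  # a task tolerating older data
```

The same records score differently on timeliness depending on the age limit the task imposes, which is the "fit-for-use" point in miniature.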

Defining what Information Quality means in the context of Search Engines will depend greatly on whether dimensions are being identified for the producers of information, the storage and maintenance systems used for information, or for the searchers and users of information.

  • Consider the information user: the quality dimensions of interest to them include relevancy and usefulness. These dimensions are enormously important but extremely difficult to gauge.

Developing a Framework for Assessing Information Quality on the World Wide Web

Quality Metrics

Metrics for IQ in Information Retrieval

Metrics that can assess IQ and can be deployed in Search engines

Quality Datasets


Thesis and Dissertation Hubs

– ProQuest
– DiVA
– ETDs
– Ebook
– Dart Europe
– OhioLINK
– UM Repository

Credit: Khalid Kyle

Here is the list of websites that I used to get ebooks, theses and dissertations for free while writing my thesis:

– ProQuest
– Queens
– DiVA
– ETDs
– Ebook
– Dart Europe
– OhioLINK
– UM Repository

Codes of the Week

Kaggle: GOP Debate Twitter Sentiment Analysis

The original challenge was to understand what people thought (based on Twitter discussions) of the US Republican Debate that took place in Cleveland. The dataset consists of 20,000 annotated, randomly selected tweets from that night using #GOPDebate and related hashtags. The annotation was carried out by contributors based on the following guidelines:

  • Is this tweet relevant and from a person (not from a news outlet or a brand)?
  • Which candidate was mentioned in the tweet? This includes a “no candidate” option.
  • What subject was mentioned in the tweet? Several options were provided.
  • What was the sentiment of the tweet: positive, negative or neutral?

Questions of interest:

  • What issues resonated with voters?
  • Which candidates were viewed most negatively?
  • Are people really considering voting for a well-monied, sentient toupee?
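A question such as "which candidates were viewed most negatively?" can be approached by tallying the sentiment annotations per candidate. The Python sketch below shows one way to do that on a handful of made-up rows. The field names ("candidate", "sentiment"), the label strings and the candidate names are assumptions for illustration and may not match the actual Kaggle file.

```python
from collections import Counter


def negative_share_by_candidate(rows, skip_label="No candidate mentioned"):
    """Return, per candidate, the fraction of their tweets labelled negative."""
    totals = Counter()
    negatives = Counter()
    for row in rows:
        candidate = row["candidate"]
        if candidate == skip_label:
            continue  # tweets that mention no candidate are left out
        totals[candidate] += 1
        if row["sentiment"] == "Negative":
            negatives[candidate] += 1
    return {c: negatives[c] / totals[c] for c in totals}


# Made-up annotated tweets; real rows would be read from the dataset file.
rows = [
    {"candidate": "Candidate A", "sentiment": "Negative"},
    {"candidate": "Candidate A", "sentiment": "Positive"},
    {"candidate": "Candidate B", "sentiment": "Negative"},
    {"candidate": "Candidate B", "sentiment": "Negative"},
    {"candidate": "No candidate mentioned", "sentiment": "Neutral"},
]

shares = negative_share_by_candidate(rows)
most_negative = max(shares, key=shares.get)
print(shares, most_negative)
```

Using the share of negative tweets rather than the raw count keeps heavily discussed candidates from dominating the ranking simply because they were mentioned more often.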




Guide to Data Science Competitions

Happy Endpoints

Summer is finally here and so are the long-form virtual hackathons. Unlike a traditional hackathon, which focuses on what you can build in one place in one limited time span, virtual hackathons typically give you a month or more to work from wherever you like.

And for those of us who love data, we are not left behind. There are a number of data science competitions to choose from this summer. Whether it’s a new Kaggle challenge (which are posted year round) or the data science component of Challenge Post’s Summer Jam Series, there are plenty of opportunities to spend the summer either sharpening or showing off your skills.

The Landscape: Which Competitions are Which?

  • Kaggle
    Kaggle competitions have corporate sponsors that are looking for specific business questions answered with their sample data. In return, winners are rewarded handsomely, but you have to win first.
  • Summer Jam

View original post 313 more words