A short introduction to NLP in Python with spaCy


Natural Language Processing (NLP) is one of the most interesting sub-fields of data science, and data scientists are increasingly expected to be able to whip up solutions that exploit unstructured text data. Despite this, many applied data scientists (both from STEM and social science backgrounds) lack NLP experience.

In this post I explore some fundamental NLP concepts and show how they can be implemented using the increasingly popular spaCy package in Python. This post is for the absolute NLP beginner, but knowledge of Python is assumed.

spaCy, you say?

spaCy is a relatively new package for “Industrial strength NLP in Python” developed by Matt Honnibal at Explosion AI. It is designed with the applied data scientist in mind, meaning it does not weigh the user down with decisions over which esoteric algorithm to use for common tasks, and it’s fast. Incredibly fast (it’s implemented in Cython). If…

View original post 1,037 more words


How to manage state in Trident Storm topologies

Sumit Chawla's Blog

Code for this example: https://github.com/sumitchawla/storm-examples

The Trident API in Storm topologies is just another abstraction over how a “stream” of data is processed in Storm.

Basic Storm stream processing guarantees “at least once” message processing, whereas the Trident API guarantees “exactly once” message processing. In simple terms, that means basic stream processing makes sure that no message is ever lost. To achieve that, Storm might replay the same message again and again until it is certain that the message has been processed successfully. There is no direct way to figure out whether a message is being played for the first time or is being replayed due to an error or failure. The Trident API solves this problem partially by grouping messages into batches. If the Trident API needs to replay a message, it will come back with the same batch id. The application receiving this message will have to keep track of…
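The bookkeeping described above can be sketched in plain Python (this is an illustration of the idea, not actual Storm/Trident code): state updates are applied per batch, and a replayed batch, arriving with the same batch id, must not be applied twice.

```python
# A toy "transactional state" in the Trident style: because replays of a
# batch reuse the same batch id, remembering the last committed id is
# enough to make each batch's update happen exactly once.
class TransactionalCounter:
    def __init__(self):
        self.count = 0
        self.last_batch_id = None  # id of the last batch we committed

    def apply_batch(self, batch_id, messages):
        # Same id as the last commit means this is a replay: skip it,
        # so every message contributes to the count exactly once.
        if batch_id == self.last_batch_id:
            return
        self.count += len(messages)
        self.last_batch_id = batch_id

state = TransactionalCounter()
state.apply_batch(1, ["a", "b"])
state.apply_batch(1, ["a", "b"])  # replayed batch: ignored
state.apply_batch(2, ["c"])
print(state.count)  # 3
```

This sketch leans on Trident's guarantee that batches are replayed in order with the same id; with out-of-order replays you would need to remember more than just the last id.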

View original post 2,676 more words

Open Source Big Data Tools

A (rearranged) comprehensive list of open source big data tools from this paper https://cambridgeservicealliance….

Data Ingestion

Data Pre-processing


Distributed File System

Data Analysis

Distributed Architecture


Security & Governance

Cluster Management



Deduplication Internals : Part-1


Deduplication is one of the hottest technologies on the market today because of its ability to reduce costs. But it comes in many flavours, and organizations need to understand each of them if they are to choose the one that is best for them. Deduplication can be applied to data in primary storage, backup storage, cloud storage, or data in flight for replication, such as LAN and WAN transfers. Eventually, it offers the benefits below:

This concept is a familiar one that we see daily: a URL is a type of pointer. When someone shares a video on YouTube, they send the URL for the video instead of the video itself. There’s only one copy of the video, but it’s available to everyone. Deduplication uses this concept in a more sophisticated, automated way.
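The pointer idea can be sketched in a few lines of Python. This is a hypothetical toy chunk store, not any real product's design: each unique chunk is stored once under its content hash, and callers keep only the hash as a reference.

```python
import hashlib

# A toy content-addressed store: identical chunks are detected by their
# SHA-256 digest and stored only once, like the single copy of a video
# behind many shared URLs.
class ChunkStore:
    def __init__(self):
        self.chunks = {}  # content hash -> chunk bytes (stored once)

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        # Only store the chunk if this content has never been seen before
        self.chunks.setdefault(digest, data)
        return digest  # the "pointer" kept in place of the data

store = ChunkStore()
refs = [store.put(b"same block"), store.put(b"same block"), store.put(b"other")]
print(len(store.chunks))  # 2 unique chunks despite 3 writes
```

Real deduplication systems add chunk-boundary detection, stronger collision handling, and reference counting for deletes, but the core mechanism is this hash-then-point lookup.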


Data deduplication is a technique to reduce storage needs by eliminating redundant or duplicate data…

View original post 300 more words

My Learning Curve of Spark and Data Mining II


Hi guys,

I am back. I am sorry I haven’t posted anything since September; I’ve been focused on my current job, working as a Django developer from the back end to the front end, even using some D3.js, lol.

Anyway, I am trying to continue studying big data and data mining in my free time, and I will list the resources I have worked through in this half year, especially on Apache Spark.

1. Data Mining

1.1 Web Data Mining (PDF) and Programming Collective Intelligence (PDF)


Although these two books are relatively old, they decently introduce web data mining and machine learning algorithms in Python, respectively, and are worth a quick look.

1.2 Stanford University Class CS246

It very formally presents the machine learning algorithms, with PDF downloads available. But since it concentrates entirely on algorithms and derivations, it might be boring…

View original post 531 more words

Real Time Fraud Detection with Sequence Mining


Real-time fraud detection is one of the use cases where multiple components of the big data ecosystem come into play in a significant way: Hadoop batch processing for building the predictive model, and Storm for predicting fraud from the real-time transaction stream using that model. Additionally, Redis is used as the glue between the different subsystems.

In this post I will go through the end-to-end solution for real-time fraud detection, using credit card transactions as an example, although the same solution can be used for any kind of sequence-based outlier detection. I will be building a Markov chain model using the Hadoop-based implementation in my open source project avenir. The prediction algorithm implementation…
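To make the Markov chain idea concrete, here is a toy Python sketch (not the Hadoop/avenir implementation from the post): learn transition probabilities from sequences of normal transactions, then flag a sequence as anomalous when its average negative log-likelihood under the model is high. The state labels below are made up for illustration.

```python
import math
from collections import defaultdict

def train(sequences):
    # Count observed state-to-state transitions across all sequences
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    # Normalize counts into per-state transition probabilities
    return {a: {b: c / sum(nxt.values()) for b, c in nxt.items()}
            for a, nxt in counts.items()}

def score(model, seq, floor=1e-6):
    # Average negative log-likelihood per transition; unseen transitions
    # get a small floor probability, so rare paths score high (anomalous)
    nll = sum(-math.log(model.get(a, {}).get(b, floor))
              for a, b in zip(seq, seq[1:]))
    return nll / max(len(seq) - 1, 1)

# In a real system each state might encode amount band, time gap, and
# merchant type for one transaction; these sequences are toy data
normal = [["low", "low", "low", "mid"], ["low", "mid", "low", "low"]]
model = train(normal)
print(score(model, ["low", "low", "mid"]))     # familiar transitions
print(score(model, ["high", "high", "high"]))  # unseen transitions
```

Thresholding this score gives a simple sequence-based outlier detector; the batch layer would fit the transition matrix at scale, and the streaming layer would only evaluate `score` per incoming sequence.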

View original post 1,555 more words

Fraudsters, Outliers and Big Data


Recently, I started working on Hadoop-based solutions for fraud detection. Fraud detection is critical for many industries, including but not limited to financial, insurance, and retail. Data mining is a key enabler of effective fraud detection.

In this and some following posts, I will cover commonly used data mining solutions for fraud detection. I also have an accompanying open source project on GitHub called beymani, where implementations of these algorithms will be available.

View original post 2,128 more words