Natural Language Processing (NLP) is one of the most interesting sub-fields of data science, and data scientists are increasingly expected to be able to whip up solutions that involve the exploitation of unstructured text data. Despite this, many applied data scientists (both from STEM and social science backgrounds) lack NLP experience.

In this post I explore some fundamental NLP concepts and show how they can be implemented using the increasingly popular spaCy package in Python. This post is for the absolute NLP beginner, but knowledge of Python is assumed.

spaCy, you say?

spaCy is a relatively new package for “Industrial strength NLP in Python” developed by Matt Honnibal at Explosion AI. It is designed with the applied data scientist in mind, meaning it does not weigh the user down with decisions over what esoteric algorithms to use for common tasks and it’s fast. Incredibly fast (it’s implemented in Cython). If…

How to manage state in Trident Storm topologies

February 3, 2018 cltmadmin

Sumit Chawla's Blog

Code for this example: https://github.com/sumitchawla/storm-examples

The Trident API in Storm topologies is just another abstraction over how a “stream” of data is processed in Storm.

Basic Storm stream processing guarantees “at least once” message processing, whereas the Trident API guarantees “exactly once” message processing. In simple terms, basic stream processing makes sure that no message is ever lost. To achieve that, Storm might replay the same message again and again until it is certain that the message has been processed successfully. There is no direct way to tell whether a message is being played for the first time or replayed because of an error or failure. The Trident API partially solves this problem by grouping messages into batches. If Trident needs to replay a message, it will come back with the same batch ID. The application receiving this message will have to keep track of…
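
To make the batch ID idea concrete, here is a minimal sketch in plain Python. It is not Storm or Trident code: the class `BatchCountState` and its methods are hypothetical names invented for illustration. It assumes that batches arrive in order and that a replayed batch carries the same messages as the original, so storing the last applied batch ID next to each value is enough to recognise a replay and skip it.

```python
from dataclasses import dataclass, field


@dataclass
class BatchCountState:
    """Hypothetical per-key counter that remembers which batch last updated it."""

    # key -> (count, id of the last batch applied to that key)
    store: dict = field(default_factory=dict)

    def apply_batch(self, batch_id: int, keys: list) -> None:
        # Aggregate the batch locally first, so each key is written once per batch.
        partial = {}
        for key in keys:
            partial[key] = partial.get(key, 0) + 1

        for key, delta in partial.items():
            count, last_batch = self.store.get(key, (0, None))
            if last_batch == batch_id:
                # Same batch ID as the stored one: this is a replay, skip it.
                continue
            self.store[key] = (count + delta, batch_id)

    def count(self, key: str) -> int:
        return self.store.get(key, (0, None))[0]


state = BatchCountState()
state.apply_batch(1, ["a", "b", "a"])
state.apply_batch(1, ["a", "b", "a"])  # replay of batch 1, ignored
state.apply_batch(2, ["a"])
print(state.count("a"), state.count("b"))  # prints: 3 1
```

Without the stored batch ID, the replay of batch 1 would have been counted twice, which is the "at least once" behaviour described above.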

Open Source Big Data Tools

January 22, 2018 cltmadmin

A (rearranged) comprehensive list of open source big data tools from this paper https://cambridgeservicealliance….

Data Ingestion

Hydrograph: capitalone/Hydrograph
NSQ: https://github.com/nsqio/nsq
Metamorphosis: https://github.com/killme2008/Me…
Jafka: https://github.com/adyliu/jafka
Disque: https://github.com/antirez/disque
Open Messaging: openmessaging/openmessaging-java
VerneMQ: https://github.com/erlio/vernemq
Cherami-Server-Client: https://github.com/uber/cherami-…
Machinery: https://github.com/RichardKnop/m…
Suro: Netflix/suro
LogStash: elastic/logstash
Apache Chukwa: http://chukwa.apache.org/
Apache Flume: Apache Flume
Apache Gobblin: Apache Gobblin
Apache Kafka: http://kafka.apache.org/
Apache Nifi: Apache NiFi
Apache Pulsar: http://pulsar.incubator.apache.org
Apache RocketMQ: http://rocketmq.incubator.apache…
Apache Sqoop: http://sqoop.apache.org/

Data Pre-processing

OpenRefine: OpenRefine/OpenRefine
Data Cleaner: datacleaner/DataCleaner
Talend Open Studio: Talend/tbd-studio-se
WhereHows: linkedin/WhereHows
StreamSets-Data collector: streamsets/datacollector
CKAN: ckan/ckan
Boom Filters: https://github.com/tylertreat/Bo…
Apache AsterixDB: Apache AsterixDB
Apache Avro: Apache Avro!
Apache CarbonData: CarbonData
Apache Griffin: Apache Griffin

Storage

ClickHouse: yandex/ClickHouse
IndexR: shunfei/indexr
Smart Storage Management: Intel-bigdata/SSM
Grid DB: https://github.com/griddb/griddb…
Druid: druid-io/druid
Redis: antirez/redis
TiDB: pingcap/tidb
Titan: thinkaurelius/titan
OpenTSDB: OpenTSDB/opentsdb
TiKV: pingcap/tikv
Crate: crate/crate
RQLite: rqlite/rqlite
ActorDB: biokoda/actordb
JanusGraph: JanusGraph/janusgraph
AtlasDB: palantir/atlasdb
CurioDB: stephenmcd/curiodb
Ceres: graphite-project/ceres
RethinkDB: rethinkdb/rethinkdb
Tera: baidu/tera
Scylla: scylladb/scylla
DGraph: dgraph-io/dgraph
Bolt: boltdb/bolt
BuntDB: https://github.com/tidwall/buntdb
Voldemort: voldemort/voldemort
SummitDB: https://github.com/tidwall/summitdb
Riak: https://github.com/basho/riak_kv
Hstore: apavlo/h-store
ElephantDB: nathanmarz/elephantdb
Apache Accumulo: Apache Accumulo
Apache Cassandra: Apache Cassandra
Apache CouchDB: http://couchdb.apache.org/
Apache Gora: Apache Gora™
Apache HBase: Apache HBase
Apache ORC: Apache ORC
Apache Parquet: Apache Parquet
Apache Rya: http://rya.incubator.apache.org/
Apache S2Graph: http://s2graph.incubator.apache….

Distributed File System

Ceph: ceph/ceph
Baidu File System: baidu/bfs
SeaweedFS: chrislusf/seaweedfs
GlusterFS: gluster/glusterfs
QFS: quantcast/qfs
XtreemFS: xtreemfs/xtreemfs
Hyperdrive: mafintosh/hyperdrive
Ambry: linkedin/ambry
LizardFS: lizardfs/lizardfs
FastDFS: happyfish100/fastdfs
MooseFS: moosefs/moosefs
Alluxio: Alluxio/alluxio

Data Analysis

Aperture Tiles: unchartedsoftware/aperture-tiles
PrestoDB: prestodb/presto
Simba: InitialDLab/Simba
Geomesa: locationtech/geomesa
FlashX: https://github.com/flashxio/FlashX
MOA: https://github.com/Waikato/moa
Squall: epfldata/squall
RapidMiner: https://github.com/rapidminer/ra…
Esper: espertechinc/esper
Drools: kiegroup/drools
Mondrian: pentaho/mondrian
Godot: nodejitsu/godot
Tensorflow: https://github.com/tensorflow/te…
MLPack: https://github.com/mlpack/mlpack
Conjecture: https://github.com/etsy/Conjecture
Photon-ML: https://github.com/linkedin/phot…
DMLC: https://github.com/dmlc/dmlc-core
H2O: https://github.com/h2oai/h2o-3
DSSTNE: https://github.com/amzn/amazon-d…
Angel: https://github.com/Tencent/angel
Oryx: https://github.com/OryxProject/oryx
Fregata: https://github.com/TalkingData/F…
Zen: https://github.com/cloudml/zen
BenchML: https://github.com/szilard/bench…
Cascalog: nathanmarz/cascalog
Cascading: Cascading/cascading
Scalding: twitter/scalding
Jubatus: https://github.com/jubatus/jubatus
PipelineDB: https://github.com/pipelinedb/pi…
StreamCQL: https://github.com/HuaweiBigData…
Apache Calcite: Apache Calcite
Apache Drill: http://drill.apache.org/
Apache HAWQ: Apache HAWQ®
Apache Horn: HORN Project
Apache Hive: http://hive.apache.org/
Apache Hivemall: http://hivemall.incubator.apache…
Apache Impala: http://impala.incubator.apache.org/
Apache Kudu: http://kudu.apache.org/
Apache Kylin: http://kylin.apache.org/
Apache Lens: http://lens.apache.org/
Apache MADLib: http://madlib.apache.org
Apache Mahout: http://mahout.apache.org/
Apache MetaModel: Apache MetaModel
Apache MRQL: A Query Processing and Optimization System
Apache Trafodion: http://trafodion.incubator.apach…
Apache Phoenix: Apache Phoenix
Apache Pig: http://pig.apache.org/
Apache SAMOA: http://samoa.incubator.apache.org/
Apache SINGA: http://singa.incubator.apache.org/
Apache VXQuery: http://vxquery.apache.org/
Apache SystemML: http://systemml.apache.org/
Apache Tajo: http://tajo.apache.org/

Distributed Architecture

Pentaho: pentaho/big-data-plugin
Thrill: https://github.com/thrill/thrill
HPCC: https://github.com/hpcc-systems/…
JStorm: https://github.com/alibaba/jstorm
Riemann: https://github.com/riemann/riemann
Tigon: https://github.com/caskdata/tigon
Riko: https://github.com/nerevu/riko
SensorBee: sensorbee/sensorbee
Automi: vladimirvivien/automi
Goka: lovoo/goka
SpringCloudDataFlow: spring-cloud/spring-cloud-dataflow
GraphJET: twitter/GraphJet
PigPen: Netflix/PigPen
Disco: discoproject/disco
Infovore: paulhoule/infovore
Gleam: chrislusf/gleam
Glow: chrislusf/glow
Parkour: damballa/parkour
Onyx: onyx-platform/onyx
SummingBird: twitter/summingbird
Hydra: addthis/hydra
Apache Apex: Apache Apex
Apache Beam: Apache Beam
Apache DataFu: http://datafu.incubator.apache.org/
Apache Falcon: http://falcon.apache.org/
Apache Flink: http://flink.apache.org/
Apache Gearpump: Apache Gearpump
Apache Giraph: Apache Giraph!
Apache Hadoop: Apache™ Hadoop®!
Apache Hama: Hama
Apache Heron: Heron
Apache Ignite: http://ignite.apache.org/
Apache Samza: http://samza.apache.org/
Apache Spark: http://spark.apache.org/
Apache Storm: http://storm.apache.org/

Visualization

Lumify: lumifyio/lumify
Plywood: implydata/plywood
Kibana: elastic/kibana
Airpal: https://github.com/airbnb/airpal
Bokeh: https://github.com/bokeh/bokeh
Apache Zeppelin: http://zeppelin.apache.org/

Security & Governance

HiBench: intel-hadoop/HiBench
SpringXD: spring-projects/spring-xd
Redisson: redisson/redisson
Akka: https://github.com/akka/akka
Mist: https://github.com/Hydrosphereda…
Secor: https://github.com/pinterest/secor
Elephant-Bird: https://github.com/twitter/eleph…
Streaming Benchmark: https://github.com/yahoo/streami…
Apache Ambari: Ambari
Apache Atlas: Data Governance and Metadata framework for Hadoop
Apache Bigtop: Bigtop – Apache Bigtop
Apache BookKeeper: Apache BookKeeper™
Apache Curator: http://curator.apache.org/
Apache Eagle: http://eagle.apache.org/
Apache Geode: Apache Geode
Apache HTrace: http://htrace.incubator.apache.org/
Apache Kerby: http://directory.apache.org/kerby/
Apache Milagro: Milagro
Apache Metron: Apache Metron
Apache OODT: Apache OODT
Apache Ranger: http://ranger.apache.org/
Apache Spot: http://spot.incubator.apache.org/
Apache Sentry: http://sentry.apache.org/
Apache Thrift: http://thrift.apache.org
Apache ZooKeeper: http://zookeeper.apache.org/

Cluster Management

Apache Aurora: http://aurora.apache.org
Azkaban: azkaban/azkaban
Genie-Netflix: Genie by Netflix OSS
Chronos: mesos/chronos
Kubernetes: kubernetes/kubernetes
Tron: Yelp/Tron
Vitess: https://github.com/youtube/vitess
Schedoscope: https://github.com/ottogroup/sch…
Luigi: https://github.com/spotify/luigi
Serf: https://github.com/hashicorp/serf
Finagle: https://github.com/twitter/finagle
Fenzo: Netflix/Fenzo
Apache Airavata: Apache Airavata
Apache CloudStack: http://cloudstack.apache.org
Apache Helix: Apache Helix
Apache Mesos: Apache Mesos
Apache Myriad: Apache Myriad
Apache REEF: http://reef.apache.org/
Apache Slider: http://slider.incubator.apache.org/
Apache Tez: http://tez.apache.org/
Apache Twill: http://twill.apache.org/
Apache Oozie: Apache Oozie

Application

ElasticSearch: elastic/elasticsearch
KillrWeather: killrweather/killrweather
Refarch: https://github.com/awslabs/lambd…
Dat-Node: datproject/dat-node
Redash: https://github.com/getredash/redash
Rakam-IO: https://github.com/rakam-io/rakam
Countly: https://github.com/Countly/count…
Kapacitor: https://github.com/influxdata/ka…
Apache Lucene: https://lucene.apache.org/core/
Apache Nutch: Apache Nutch™
Apache Solr: http://lucene.apache.org/solr/

Support

Stream Alert: https://github.com/airbnb/stream…
Finagle: https://github.com/twitter/finagle
Apache Bahir: Home
Apache Crunch: http://crunch.apache.org/
Apache Edgent: http://edgent.incubator.apache.org/
Apache Fluo: Apache Fluo
Apache Knox: http://knox.apache.org/
Apache River: http://river.apache.org/
Apache Tephra: http://tephra.incubator.apache.org/
Apache Omid: Apache Omid
Apache OpenWhisk: Apache OpenWhisk

Deduplication Internals : Part-1

January 2, 2018 cltmadmin

pibytes

Deduplication is one of the hottest technologies in the current market because of its ability to reduce costs. But it comes in many flavours and organizations need to understand each one of them if they are to choose the one that is best for them. Deduplication can be applied to data in primary storage, backup storage, cloud storage or data in flight for replication, such as LAN and WAN transfers. So eventually it offers the below benefits;

This concept is a familiar one that we see daily: a URL is a type of pointer. When someone shares a video on YouTube, they send the URL for the video instead of the video itself. There’s only one copy of the video, but it’s available to everyone. Deduplication uses this concept in a more sophisticated, automated way.

Data deduplication is a technique to reduce storage needs by eliminating redundant or duplicate data…
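
The pointer idea can be illustrated with a small in-memory sketch. Everything here is an assumption made for illustration rather than something taken from the original post: the `DedupStore` class is hypothetical, data is split into fixed-size chunks, and chunks are identified by their SHA-256 digest. Real products may chunk and index data quite differently.

```python
import hashlib


class DedupStore:
    """Hypothetical chunk store: each unique chunk is kept once, files hold pointers."""

    def __init__(self, chunk_size: int = 4):
        self.chunk_size = chunk_size
        self.chunks = {}  # digest -> chunk bytes, stored only once
        self.files = {}   # file name -> ordered list of digests (the "pointers")

    def put(self, name: str, data: bytes) -> None:
        pointers = []
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            # Store the chunk only if this digest has not been seen before.
            self.chunks.setdefault(digest, chunk)
            pointers.append(digest)
        self.files[name] = pointers

    def get(self, name: str) -> bytes:
        # Rebuild the file by following its pointers in order.
        return b"".join(self.chunks[d] for d in self.files[name])

    def stored_bytes(self) -> int:
        return sum(len(c) for c in self.chunks.values())


store = DedupStore(chunk_size=4)
store.put("a.txt", b"ABCDABCDEFGH")
store.put("b.txt", b"ABCDEFGH")
print(store.stored_bytes())  # prints: 8
```

The two files hold 20 bytes of logical data, but only the two distinct chunks (8 bytes) are physically stored; both files can still be read back in full through their pointers.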

My Learning Curve of Spark and Data Mining II

December 21, 2017 cltmadmin

ZephyrRapier

Hi guys,

I am back. I am sorry I haven’t posted anything since September; I have been focusing on my current job as a Django developer, working from the back end to the front end and even using some D3.js, lol.

Anyway, I am trying to continue studying big data and data mining in my free time, and I will list the resources I have gone through in this half year, especially on Apache Spark.

1. Data Mining

1.1 Web Data Mining pdf and Programming Collective Intelligence pdf

Although these two books are relatively old, they give a decent introduction to data mining on the web and to machine learning algorithms in Python, respectively, and are worth a quick look.

1.2 Stanford University Class CS246

The class presents the machine learning algorithms very formally, with PDF downloads available. But because it concentrates entirely on algorithms and derivations, it might be boring…

Startup Tools

June 13, 2016 cltmadmin

1. Startup Tools
2. Lean LaunchPad Videos
3. Founding/Running Startup Advice
4. Market Research
5. Life Science
6. China Market
…

Source: Startup Tools

Real Time Fraud Detection with Sequence Mining

June 9, 2016 cltmadmin

Mawazo

Real-time fraud detection is a use case where multiple components of the big data ecosystem come into play in a significant way: Hadoop batch processing builds the predictive model, and Storm uses that model to predict fraud from the real-time transaction stream. Additionally, Redis is used as the glue between the different subsystems.

In this post I will go through the end-to-end solution for real-time fraud detection, using credit card transactions as an example, although the same solution can be used for any kind of sequence-based outlier detection. I will be building a Markov chain model using the Hadoop-based implementation in my open source project avenir. The prediction algorithm implementation
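
As a rough illustration of sequence-based outlier detection with a Markov chain, the sketch below runs in memory in plain Python. It is not the avenir implementation: the function names, the `low`/`mid`/`high` transaction states, the add-one smoothing, and the average negative log-probability score are all assumptions chosen to keep the example small.

```python
import math
from collections import defaultdict


def train_transitions(sequences, states):
    """Estimate P(next | current) from training sequences, with add-one smoothing."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1
    probs = {}
    for cur in states:
        total = sum(counts[cur].values()) + len(states)
        probs[cur] = {nxt: (counts[cur][nxt] + 1) / total for nxt in states}
    return probs


def anomaly_score(seq, probs):
    """Average negative log-probability of the transitions (higher = more unusual)."""
    pairs = list(zip(seq, seq[1:]))
    if not pairs:
        raise ValueError("need at least two states to score a sequence")
    return sum(-math.log(probs[c][n]) for c, n in pairs) / len(pairs)


# Hypothetical discretised transaction amounts for a few customers.
STATES = ["low", "mid", "high"]
history = [
    ["low", "low", "mid", "low", "low"],
    ["low", "mid", "low", "low", "mid"],
    ["mid", "low", "low", "low", "low"],
]
model = train_transitions(history, STATES)
normal = anomaly_score(["low", "low", "mid", "low"], model)
suspicious = anomaly_score(["low", "high", "high", "low"], model)
print(normal < suspicious)  # prints: True
```

Transitions that were rarely or never seen in the training history get a low probability, so a sequence made up of such transitions scores higher and can be flagged once it crosses a chosen threshold.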

Fraudsters, Outliers and Big Data

June 9, 2016 cltmadmin

Mawazo

Recently, I started working on Hadoop-based solutions for fraud detection. Fraud detection is critical for many industries, including but not limited to finance, insurance and retail. Data mining is a key enabler of effective fraud detection.

In this and some following posts, I will cover commonly used data mining solutions for fraud detection. I also have an accompanying open source project on GitHub called beymani, where the implementations of these algorithms will be available.

Startup Tools

June 8, 2016 cltmadmin

Fighting Patent Trolls
TrollingEffects – learn more about patent trolls
UnifiedPatents – patent micropool to fight patent trolls
ElectronicFrontierFoundation (EFF) – on patents

Source: Startup Tools

Big Data Analytics Hub

Big data: research and practice
