TECH: Spark’ing an Anti-Money Laundering Revolution

Tresata and Databricks announced a real-time, Spark and Hadoop-powered Anti-Money Laundering (AML) solution earlier today. Tresata’s predictive analytics application, TEAK, offers the market’s first at-scale, real-time AML investigation and resolution engine. The performance, speed, predictive power and precision TEAK delivers would not have been possible without its Spark underpinnings. Additionally, by […]

TECH: tresata open sources scalding-on-spark

From our CTO Koert Kuipers, a look at our latest open source project, scalding-on-spark. To see other Tresata open source projects, visit github.com/tresata. spark-scalding by tresata: Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier by adding support for Cascading Taps, Scalding Sources and the Scalding […]

TECH: building a ‘full stack’ analytics applications software company in hadoop

In 2011, when we launched Tresata, it was centered around a massive bet: that Hadoop would become the ‘de rigueur’ data processing and analytics stack in the world. Three and a half years later, that bet may seem prescient to some, but to us it is simply a huge confirmation of our beliefs. Not only have we managed […]

TECH: tresata open sources spark columnar

Spark Columnar is a proof-of-concept project that uses Shapeless to optimize the in-memory data layout for RDDs in Spark. The basic idea is that a user-facing RDD of tuples and/or case classes is backed by another RDD in which there is only one item per partition, which represents all the tuples and/or case […]
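
The layout idea can be illustrated outside Spark. The sketch below is plain Python, not the Spark Columnar code (which is Scala and relies on Shapeless): it transposes one partition of row tuples into a single object that holds one packed array per field. The two-field record and the names `ColumnarPartition`, `from_rows` and `rows` are hypothetical, chosen only for illustration.

```python
from array import array
from typing import Iterator, Sequence, Tuple

Row = Tuple[int, float]  # hypothetical record: (id, score)


class ColumnarPartition:
    """A single item standing in for a whole partition of rows.

    Each field lives in its own packed array instead of one object per row.
    """

    def __init__(self, ids: array, scores: array) -> None:
        self.ids = ids
        self.scores = scores

    @classmethod
    def from_rows(cls, rows: Sequence[Row]) -> "ColumnarPartition":
        # Transpose row-oriented tuples into one array per column.
        ids = array("q", (r[0] for r in rows))      # signed 64-bit ints
        scores = array("d", (r[1] for r in rows))   # 64-bit floats
        return cls(ids, scores)

    def rows(self) -> Iterator[Row]:
        # Rebuild the user-facing view of tuples on demand.
        return zip(self.ids, self.scores)


partition = [(1, 0.5), (2, 1.5), (3, 2.5)]
columnar = ColumnarPartition.from_rows(partition)
restored = list(columnar.rows())
```

The caller still iterates over tuples through `rows()`, while the data itself is stored as two contiguous arrays per partition.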

TECH: CDO Richard Morris at Hadoop Summit 2014

TECH: CEO Abhishek Mehta at Hadoop Summit 2014

TECH: Ganitha – Naive-Bayes Classifiers

by Andy Perez, Tresata Senior Developer. This post discusses the implementation of Naive-Bayes classification in Ganitha, Tresata’s open-source machine-learning library built on Scalding. A Naive-Bayes classifier is a probabilistic classifier used in machine learning that involves the application of Bayes’ theorem. The underlying model is “naive” because of the assumption that the attributes are conditionally independent of […]

TECH: SpaceSaver – Efficient discovery of the most frequent items in Scalding, Spark and other distributed frameworks

by Koert Kuipers, Tresata CTO. Anyone who has used map-reduce in production knows the key to scalability and performance in reduce operations is to push as much of the work to the map-side as possible. Scalding does this elegantly and automatically for you if you can express your operation as a Semigroup, which means you […]
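
To make the map-side point concrete, here is a minimal Python sketch of aggregation through an associative combine function. It is not Scalding code and it is not the SpaceSaver algorithm: it uses exact counts with `collections.Counter` purely to show why associativity lets each partition be summarized locally before the summaries are merged. The names `plus`, `map_side` and `reduce_side` are hypothetical.

```python
from collections import Counter
from functools import reduce
from typing import Iterable, List


def plus(left: Counter, right: Counter) -> Counter:
    """Associative combine: the one operation a semigroup requires."""
    return left + right


def map_side(partition: Iterable[str]) -> Counter:
    """Pre-aggregate one partition locally, before any data is shuffled.

    The empty Counter is used only as a convenient starting value.
    """
    return reduce(plus, (Counter([item]) for item in partition), Counter())


def reduce_side(partials: List[Counter]) -> Counter:
    """Merge the small per-partition summaries into the final result."""
    return reduce(plus, partials)


partitions = [["a", "b", "a"], ["b", "a"], ["c"]]
total = reduce_side([map_side(p) for p in partitions])

# Because plus is associative, how the items are split does not matter.
single = reduce_side([map_side(["a", "b", "a", "b", "a", "c"])])
```

Only one small summary per partition reaches the reduce step, rather than every individual item.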

TECH: Frustrating csv files? Cascading OpenCsv Scheme is here to help.

At Tresata we sometimes need to parse csv files (or text files with other delimiters such as bar, tilde, tab, or semicolon) that cannot be handled by Cascading’s TextDelimited. The main issue seems to be csv files that are quoted and contain escaped quotes. It seems TextDelimited chose not to support this due to performance […]
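
The kind of input in question can be shown with a short Python sketch. It uses Python's standard `csv` module, not Cascading or the OpenCsv scheme, and it assumes quotes are escaped by doubling them (`""`), which is only one of the conventions found in practice. The sample data is made up for illustration.

```python
import csv
import io

# Quoted fields that contain an escaped (doubled) quote and an embedded comma.
raw = 'id,comment\n1,"She said ""hello"", then left"\n2,"a, b"\n'

# A plain split on the delimiter breaks the quoted field apart.
naive = raw.splitlines()[2].split(",")

# A quote-aware parser keeps each quoted field intact.
rows = list(csv.reader(io.StringIO(raw)))

# The same parser handles other delimiters, here a bar.
bar_rows = list(csv.reader(io.StringIO('1|"x|y"|z\n'), delimiter="|"))
```

Here `naive` ends up with three pieces for a two-column record, while `rows` and `bar_rows` preserve the field boundaries.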

TECH: ganitha – K-means|| clustering

This is the third post in the series introducing our open source project Ganitha. Previous posts by Abhi and Koert shared why we are doing this and how we structured it. This post discusses the implementations of K-Means in Ganitha, a Scalding library recently open-sourced by Tresata. We’ll discuss the […]
