TECH: tresata open sources scalding-on-spark

From our CTO Koert Kuipers, a look at our latest open source project, scalding-on-spark. To see other Tresata open source projects, visit spark-scalding by tresata. Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier by adding support for Cascading Taps, Scalding Sources and the Scalding […]


TECH: tresata open sources spark columnar

Spark Columnar is a proof-of-concept project that uses Shapeless to optimize the in-memory data layout for RDDs in Spark. The basic idea is that a user-facing RDD of tuples and/or case classes is backed by another RDD that holds only one item per partition, and that item represents all the tuples and/or case […]
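To make the "one item per partition" idea concrete, here is a minimal sketch in plain Python. It is not Spark, Shapeless, or the Spark Columnar code: the `ColumnarBlock` class, the two-field row shape, and the field names `ids` and `scores` are all hypothetical, chosen only to show rows being packed into one column-oriented object per partition and read back as tuples.

```python
from array import array
from typing import Iterable, Iterator, List, Tuple

Row = Tuple[int, float]  # hypothetical row shape: (id, score)


class ColumnarBlock:
    """A single object standing in for all rows of one partition.

    Each field is stored in its own packed array instead of one
    Python tuple per row.
    """

    def __init__(self, rows: Iterable[Row]) -> None:
        self.ids = array("q")      # packed 64-bit signed integers
        self.scores = array("d")   # packed 64-bit floats
        for row_id, score in rows:
            self.ids.append(row_id)
            self.scores.append(score)

    def __len__(self) -> int:
        return len(self.ids)

    def __iter__(self) -> Iterator[Row]:
        # The user-facing view: rebuild row tuples on demand.
        return zip(self.ids, self.scores)


def to_columnar(partitions: List[List[Row]]) -> List[ColumnarBlock]:
    """Replace each partition's list of rows with one columnar block."""
    return [ColumnarBlock(rows) for rows in partitions]


partitions = [[(1, 0.5), (2, 1.5)], [(3, 2.5)]]
blocks = to_columnar(partitions)
rows_back = [list(block) for block in blocks]
```

In this sketch the schema is hard-coded; the excerpt indicates that the actual project uses Shapeless for the layout, which this example does not attempt to reproduce.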


TECH: SpaceSaver – Efficient discovery of the most frequent items in Scalding, Spark and other distributed frameworks

by Koert Kuipers, Tresata CTO

Anyone who has used map-reduce in production knows that the key to scalability and performance in reduce operations is to push as much of the work to the map side as possible. Scalding does this elegantly and automatically for you if you can express your operation as a Semigroup, which means you […]
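The map-side idea can be sketched in a few lines of plain Python. This is not Scalding or the SpaceSaver algorithm itself: it uses exact counting with `collections.Counter` as a stand-in, and the names `plus` and `map_side` are hypothetical. The point it illustrates is that an associative combine operation (the semigroup operation) lets each partition be reduced locally before anything is merged across partitions.

```python
from collections import Counter
from functools import reduce
from typing import List


def plus(a: Counter, b: Counter) -> Counter:
    """Associative combine: merging two partial counts."""
    return a + b


def map_side(partition: List[str]) -> Counter:
    """Reduce one (non-empty) partition locally, before any shuffle."""
    return reduce(plus, (Counter([item]) for item in partition))


partitions = [["a", "b", "a"], ["b", "a"], ["c"]]

# One small partial result per partition is all that needs to move.
partials = [map_side(p) for p in partitions]

# The final reduce only merges the partial results.
total = reduce(plus, partials)
top = total.most_common(1)

# Because plus is associative, grouping does not change the result.
flat = reduce(plus, (Counter([x]) for p in partitions for x in p))
```

Exact counts grow with the number of distinct items, which is why a bounded-size summary is attractive for finding frequent items at scale; that part is outside this sketch.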


TECH: Frustrating csv files? Cascading OpenCsv Scheme is here to help.

At Tresata we sometimes need to parse csv files (or text files with other delimiters such as bar, tilde, tab, or semicolon) that cannot be handled by Cascading’s TextDelimited. The main issue seems to be csv files that are quoted and contain escaped quotes. It seems TextDelimited chose not to support this due to performance […]
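To show why quoted fields with escaped quotes trip up simple delimiter splitting, here is a small illustration using Python's standard `csv` module. It does not use Cascading or OpenCsv, and it assumes the escaping style is a doubled quote (`""`) inside a quoted field; the sample data is made up.

```python
import csv
import io

# A bar-delimited file whose second field is quoted, contains the
# delimiter, and contains escaped (doubled) quotes.
raw = 'id|comment\n1|"She said ""hi"" | waved"\n'

# A quote-aware parser keeps the quoted field intact and unescapes it.
rows = list(csv.reader(io.StringIO(raw), delimiter="|", quotechar='"'))

# Naive splitting on the delimiter breaks the same line apart.
naive = raw.splitlines()[1].split("|")
```

Here `rows[1]` has two fields while `naive` has three, because the split ignores the quoting around the embedded delimiter.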
