Scaling TB's of data with Apache Spark and Scala DSL at Production

Apache Spark is one of the most widely used big-data processing platforms and has driven the adoption of Scala in many industry and academic settings. Because the Apache Spark framework itself is written in Scala, it is a real pleasure to explore the beauty of the functional Scala DSL through Spark.

This talk intends to present:

  • Usage of the primary data structures (RDD, Dataset, DataFrame) in general-purpose large-scale data processing with HBase (data lake) and Hive (analytical engine).
  • Case study: the importance of physical data partitioning techniques such as coalesce, partition, and repartition, along with other important Spark internals, when scaling to TBs of data (~17 billion records).
  • A look at the crucial and most interesting aspects of parallel and concurrent distributed data processing: tuning memory, cache, and disk I/O, and dealing with memory leaks, internal shuffle, Spark executors, the Spark driver, etc.
Target Audience
Developer
Audience Requirement

Targeted audience:

  1. Those who understand basic functional programming with Scala or have an understanding of Java.

  2. Those who understand concurrent programming or multithreading in Java / Scala.

  3. Those who are interested in distributed data processing and have a keen interest in optimizing data processing at scale.

  4. Those who have worked with Big Data or Fast Data, or have a keen interest in them.
Speaker

Chetankumar Khatri

Chetan Khatri is a Technical Lead at Accion labs with diverse experience in the fields of data science and machine learning. He is an open source contributor to Apache Spark, Apache HBase, the Apache Spark - HBase Connector, Elixir Lang, and many other open source projects. He has authored the curriculum for Artificial Intelligence, Data Science, and Distributed Computing at KSKV Kachchh University, Government of Gujarat, India. He has also reviewed a couple of books for Packt Publishing on Scala machine learning, TensorFlow deep learning, and machine learning for the web. He has delivered talks at PyCon India 2016, PyKutch 2016, and FOSSASIA 2018, including:

  • Distributing Machine learning with Apache Spark - Pycon India 2016
  • Think Machine learning with Scikit-learn - PyKutch 2016

Open Source Contributor:

  • Apache Spark
  • Apache HBase
  • Apache MXNet
  • ParlAI
  • Spark HBase Connector
Country / Region
India
Affiliations
Accion labs Inc.