Scala

Scala (from "scalable language") is a general-purpose programming language that mixes object-oriented methods with functional programming capabilities, supporting a more concise style of programming than other general-purpose languages like Java and reducing the amount of code developers have to write. Another benefit of the combined object-functional approach is that constructs that work well in small programs tend to scale up efficiently when used in larger systems.

First released publicly in 2004, Scala also incorporates some imperative, statement-oriented programming capabilities. In addition, it supports static typing, in which the types of values are checked at compile time rather than at runtime, an approach that can catch errors earlier and enable improved runtime efficiencies. Scala is typically implemented on a Java virtual machine (JVM), which opens up the language for mixed use with Java objects, classes and methods, as well as JVM runtime optimizations.

Scala also includes its own interpreter, which can be used to execute instructions directly, without prior compilation. Another key feature in Scala is a "parallel collections" library designed to help developers address parallel programming problems. Bulk operations over large collections are among the application areas in which such parallel capabilities have proved especially useful.

Scala was originally written by Martin Odersky, a professor at the École Polytechnique Fédérale de Lausanne, in Switzerland. His previous work included creation of the Funnel language, which shared some traits with Scala but didn't employ the JVM as an execution engine. Odersky began work on Scala in 2001 and continues to play a lead role in its development; he also co-founded Scala development tools maker Typesafe Inc. in 2011 and is the San Francisco company's chairman and chief architect.

Updates to the Java language have added functional programming traits somewhat akin to Scala’s. One prominent Scala user, LinkedIn Corp., indicated in early 2015 that it planned to reduce its reliance on the language and focus more on Java 8 and other languages. But Scala continues to be one of the major tools for building software infrastructure at a number of other high-profile companies, including Twitter Inc. and local-search app developer Foursquare Labs Inc.

Apache Spark, an open source data processing engine for batch processing, machine learning, data streaming and other types of analytics applications, is a very significant example of Scala usage. Spark is written in Scala, and the language is central to its support for distributed datasets that are handled as collective software objects to help boost resiliency. However, Spark applications can also be programmed in Java and Python in addition to Scala.

Scala: Lightweight functional programming for Java

Java-based languages often involve verbose syntax, plus separate domain-specific languages for testing, parsing and numerical computation. These can be the bane of developers, because piles of repetitive code force them to spend extra time combing through it to find errors.

As a general-purpose programming language, Scala can help alleviate these issues by combining both object-oriented and functional styles. To mitigate syntax complexities, Scala also fuses imperative programming with functional programming and can advantageously use its access to a huge ecosystem of Java libraries.
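As a minimal sketch of this fusion of styles (the object and method names here are illustrative, not from any particular codebase), the same task can be written imperatively with a mutable accumulator or functionally as a single immutable expression:

```scala
// Contrasting imperative and functional style on the same task
object TwoStyles {
  // Imperative: a mutable accumulator updated in a loop
  def sumImperative(xs: Seq[Int]): Int = {
    var total = 0
    for (x <- xs) total += x
    total
  }

  // Functional: one immutable expression, no mutation
  def sumFunctional(xs: Seq[Int]): Int = xs.foldLeft(0)(_ + _)

  def main(args: Array[String]): Unit = {
    val nums = 1 to 5
    println(sumImperative(nums)) // prints 15
    println(sumFunctional(nums)) // prints 15
  }
}
```

Both styles compile to the same kind of JVM bytecode; which one to use is a readability and maintenance choice rather than a technical constraint.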

This article examines Scala’s Java versatility and interoperability, the Scala tooling and runtime features that help ensure reliable performance, and some of the challenges developers should watch out for when they use this language.

Scala attracted wide attention from developers in 2015 due to its effectiveness with general-purpose cluster computing. Today, it's found in many Java virtual machine (JVM) systems, where developers rely on Scala's type inference to eliminate redundant type information. Because the compiler deduces types, programmers don't have to specify them, and they also don't have to repeat them.
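A small sketch of that type inference (the object name is ours for illustration): none of the values below carries a type annotation, yet each is statically typed by the compiler.

```scala
object InferenceDemo {
  // No annotations needed; the compiler infers each type
  val count = 42                      // inferred as Int
  val langs = List("Scala", "Java")   // inferred as List[String]
  val lengths = langs.map(_.length)   // inferred as List[Int]

  def main(args: Array[String]): Unit = {
    println(lengths) // prints List(5, 4)
  }
}
```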

Scala shares a common runtime platform with Java, so it can execute Java code. Using the JVM and JavaScript runtimes, developers can build high-performance systems with easy access to the rest of the Java library ecosystem. Because the JVM is deeply embedded in enterprise code, Scala offers a concise path into that code while still providing diverse functionality and granular control.
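As a minimal sketch of that interoperability (the object name is illustrative), a Scala program can instantiate and call an ordinary Java class directly, with no wrapper layer:

```scala
// Using a plain Java collection class from Scala
import java.util.{ArrayList => JList}

object InteropDemo {
  val names = new JList[String]() // a java.util.ArrayList
  names.add("Java")
  names.add("Scala")

  def main(args: Array[String]): Unit = {
    println(names.size()) // prints 2
  }
}
```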

Developers can also rely on Scala to express general programming patterns more effectively. By reducing the number of lines of code, programmers can write type-safe code in an immutable style, making it easier to apply concurrency and to synchronize processing.
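A small sketch of that immutable style (names are illustrative): transformations on immutable collections return new collections rather than modifying the original, so values can be shared across threads without synchronization.

```scala
object ImmutableDemo {
  val base = List(1, 2, 3)

  // map returns a NEW list; `base` is never modified,
  // which is what makes it safe to share concurrently
  val doubled = base.map(_ * 2)

  def main(args: Array[String]): Unit = {
    println(base)    // prints List(1, 2, 3)
    println(doubled) // prints List(2, 4, 6)
  }
}
```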

The power of objects

In pure object-oriented programming (OOP) environments, every value is an object. As a result, types and behaviors of objects are described by classes, subclasses and traits to designate inheritance. These concepts enable programmers to eliminate redundant code and extend the use of existing classes.

Scala treats functions as first-class values. Programmers can compose them with strong compile-time type safety. Scala's lightweight syntax is well suited to defining anonymous and nested functions, and functions can be incorporated within class definitions. Scala also provides powerful pattern matching for deconstructing values by their structure.
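These three features can be sketched in a few lines (the names below are illustrative): a function stored in a value, a higher-order function that accepts it as a parameter, and a pattern match over a list's structure.

```scala
object FunctionsDemo {
  // A first-class function: an anonymous function stored in a value
  val twice: Int => Int = n => n * 2

  // A higher-order function: takes another function as a parameter
  def applyAll(f: Int => Int, xs: List[Int]): List[Int] = xs.map(f)

  // Pattern matching on the structure of a list
  def describe(xs: List[Int]): String = xs match {
    case Nil         => "empty"
    case head :: Nil => s"one element: $head"
    case head :: _   => s"starts with $head"
  }

  def main(args: Array[String]): Unit = {
    println(applyAll(twice, List(1, 2, 3))) // prints List(2, 4, 6)
    println(describe(List(7, 8)))           // prints "starts with 7"
  }
}
```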

Java developers can quickly become productive in Scala if they have an existing knowledge of OOP, and they can achieve greater flexibility because they can define data types that have either functional or OOP-based attributes.
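As a sketch of such a hybrid data type (the type names are ours), a sealed trait with case classes gives an OOP-style class hierarchy that is also a functional algebraic data type, consumed via pattern matching:

```scala
// An OOP hierarchy that doubles as a functional algebraic data type
sealed trait Shape
final case class Circle(radius: Double) extends Shape
final case class Rect(w: Double, h: Double) extends Shape

object Shapes {
  // Exhaustive pattern match: the compiler warns if a case is missing
  def area(s: Shape): Double = s match {
    case Circle(r)  => math.Pi * r * r
    case Rect(w, h) => w * h
  }

  def main(args: Array[String]): Unit = {
    println(area(Rect(3, 4))) // prints 12.0
  }
}
```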

Challenges of working with Scala

Some of the difficulties associated with Scala include complex build tools, limited support for advanced language features in integrated development environments (IDEs) and project publishing issues. Other criticisms target Scala's generally limited tooling and the difficulty of working with complex language features in a codebase.

Managing dependency versions can also be a challenge in Scala. It’s not unusual for a language to cause headaches for developers when it comes to dependency management, but that challenge is particularly prevalent in Scala due to the sheer number of Scala versions and upgrades. New Scala releases often mark a significant shift that requires massive developer retraining and codebase migrations.
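As a rough sketch of how those versions are typically managed, an sbt build file pins the Scala version and uses the %% operator so that each library artifact matches the Scala binary version in use (the version numbers below are purely illustrative):

```scala
// build.sbt (sketch; versions are illustrative)
ThisBuild / scalaVersion := "2.13.12"

libraryDependencies ++= Seq(
  // %% appends the Scala binary version (e.g. _2.13) to the artifact
  // name, which is how cross-built libraries are kept in sync with
  // the compiler version declared above
  "org.scalatest" %% "scalatest" % "3.2.17" % Test
)
```

Because artifacts are cross-built per Scala binary version, upgrading scalaVersion can require every %% dependency to publish a matching build, which is a common source of the version headaches described above.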

Developers new to Scala should seek out the support of experienced contributors to help flatten the learning curve. Because Scala still exists in a relatively fragmented, tribal ecosystem, it's hard to say where the language is heading in terms of adoption. With the right support, however, Scala's functional programming can be a major asset.

Python vs Scala

Python is a high-level, interpreted, general-purpose dynamic programming language that focuses on code readability. Python requires less typing, provides a rich library ecosystem and enables fast prototyping.


Scala is a high-level language in which every value is an object. Scala source code compiles to JVM bytecode, so Scala programs can use Java classes directly.

Below are some major differences between Python and Scala:

- Typing: Python is a dynamically typed language, so variable and object types don't need to be specified. Scala is statically typed, so the types of variables and objects must be declared (or inferred by the compiler).
- Learning curve: Python is easy to learn and use. Scala is more difficult to learn than Python, and writing Scala code takes more effort.
- Performance: Python's interpreter does extra work at runtime because data types are decided as the program runs. Scala avoids that overhead and is generally faster, which is why Scala is often preferred for large-scale data processing.
- Community: Python's community is huge. Scala also has good community support, but it is smaller than Python's.
- Concurrency: Python relies on heavyweight process forking and doesn't support proper multithreading. Scala has a rich set of asynchronous libraries, making it a better choice for implementing concurrency.
- Testing: Testing is more complex in Python because it is a dynamic language. Static typing makes testing more tractable in Scala.
- Syntax and use cases: Python is popular for its English-like syntax. Scala plays a much bigger role in scalable, concurrent systems.
- Interfaces and execution: Python has interfaces to many OS system calls and libraries, and it has many interpreters. Scala is fundamentally a compiled language; all source code is compiled before execution.
- Refactoring: Python code is prone to bugs whenever existing code changes, because type errors surface only at runtime. Scala's compiler catches many such problems at compile time.
- Ecosystem: Python has extensive libraries for machine learning, data science and natural language processing (NLP). Scala has far fewer such tools.
- Scale: Python is well suited to small-scale projects. Scala suits large-scale projects and provides better support for scalability.
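The concurrency point above can be sketched in a few lines of Scala using the standard library's Future API (the object name is illustrative): four independent computations run concurrently on a thread pool and their results are combined without explicit locking.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object FutureDemo {
  // Launch four independent tasks concurrently, then combine them
  def squaresSum(): Int = {
    val tasks = (1 to 4).map(n => Future(n * n)) // each runs on the pool
    val all   = Future.sequence(tasks)           // Future of all results
    Await.result(all, 5.seconds).sum             // block only at the end
  }

  def main(args: Array[String]): Unit = {
    println(squaresSum()) // prints 30 (1 + 4 + 9 + 16)
  }
}
```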

Apache Spark with Scala – Resilient Distributed Dataset

Data is growing even faster than processing speeds. Performing computations on such large data is often achieved by using distributed systems. A distributed system consists of clusters (nodes/networked computers) that run processes in parallel and communicate with each other as needed.

Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs. This rich set of functionalities and libraries supports higher-level tools like Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for incremental computation and stream processing. In this article, we will be learning Apache Spark (version 2.x) using Scala.

Some basic concepts:

  1. RDD (Resilient Distributed Dataset) – An immutable, distributed collection of objects. An RDD is divided into logical partitions so it can be processed in parallel across a cluster.
  2. SparkSession – The entry point to programming Spark with the Dataset and DataFrame API.

We will be using the Scala IDE only for demonstration purposes; a dedicated Spark environment is required to actually run the code below.

Let's create our first RDD in Spark.

Scala

// Importing SparkSession
import org.apache.spark.sql.SparkSession

// Creating a SparkSession object
val sparkSession = SparkSession.builder()
                   .appName("My First Spark Application")
                   .master("local")
                   .getOrCreate()

// Loading the SparkContext
val sparkContext = sparkSession.sparkContext

// Creating an RDD
val intArray = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

// The parallelize method creates the partitions; its second
// argument specifies the number of partitions (here, 3).
val intRDD = sparkContext.parallelize(intArray, 3)

// Printing the number of partitions
println(s"Number of partitions in intRDD : ${intRDD.partitions.size}")

// Printing the first element of the RDD
println(s"First element in intRDD : ${intRDD.first}")

// Creating a string from the RDD:
// take(n) fetches n elements from the RDD and returns an Array,
// which mkString then joins into a single String.
val strFromRDD = intRDD.take(intRDD.count.toInt).mkString(", ")
println(s"String from intRDD : ${strFromRDD}")

// Printing the contents of the RDD:
// collect retrieves all the data in the RDD.
println("Printing intRDD: ")
intRDD.collect().foreach(println)

Output:

Number of partitions in intRDD : 3
First element in intRDD : 1
String from intRDD : 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Printing intRDD: 
1
2
3
4
5
6
7
8
9
10


Scala integrates closely with Java because it was originally built to run on the Java Virtual Machine (JVM). The real reason Scala is so useful for data science, however, is that it can be used along with Apache Spark to manage large amounts of data; when it comes to big data, Scala is a go-to language. Many of the data science frameworks created on top of Hadoop use Scala or Java, or are written in those languages. One downside of Scala is that it is difficult to learn, and because it is a more niche language, there are fewer online community support groups.