Python

Python is an interpreted, object-oriented programming language, similar to Perl, that has gained popularity because of its clear syntax and readability. Python is considered relatively easy to learn and portable, meaning its statements can be interpreted on a number of operating systems, including UNIX-based systems, Mac OS, MS-DOS, OS/2, and various versions of Microsoft Windows. Python was created by Guido van Rossum, then a resident of the Netherlands, whose favorite comedy group at the time was Monty Python’s Flying Circus. The source code is freely available and open for modification and reuse, and Python has a significant number of users.

A notable feature of Python is its use of indentation to structure source statements, which makes the code easier to read. Python offers dynamic data types, ready-made classes, and interfaces to many system calls and libraries. It can be extended using the C or C++ language.

Python can be used as the scripting language in Microsoft’s Active Server Page (ASP) technology. The scoreboard system for the Melbourne (Australia) Cricket Ground is written in Python. Zope, the Z Object Publishing Environment and a popular web application server, is also written in Python.

Python is everywhere!

With its widespread use across major industry verticals, Python has become a hot topic of discussion. Python has been acknowledged as the fastest-growing programming language, as per Stack Overflow Trends.

According to the Stack Overflow Developer Survey 2019, Python is the second “most loved” language, with 73% of developers choosing it above the other languages prevailing in the market.

Python is a general-purpose, open-source programming language used by big names such as Reddit, Instagram, and Venmo, according to a press release.

Why choose Python for Big Data?

Python and big data are the new combination invading the market space now. Python is in great demand among big data companies. In this blog, we will discuss the major benefits of using Python and why Python for big data has become a preferred choice among businesses these days.

Simple Coding

Python programming involves fewer lines of code than other available programming languages; it is able to execute programs in the fewest lines of code. Moreover, Python automatically identifies and associates data types.

“Python is a truly wonderful language. When somebody comes up with a good idea it takes about 1 minute and five lines to program something that almost does what you want.” — Jack Jansen
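
In that spirit, here is a minimal sketch of the “few lines” idea; it counts the most common words in a text file (the file name data.txt is a placeholder, not something from this article):

from collections import Counter

# count word frequencies in a text file ('data.txt' is a placeholder)
with open("data.txt") as f:
    counts = Counter(f.read().split())
print(counts.most_common(10))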

Python programming follows an indentation-based nesting structure. The language can process lengthy tasks within a short span of time. As there is no limitation on data processing, you can compute data on commodity machines, laptops, desktops, and in the cloud.

Earlier, Python was considered a slower language than some of its counterparts, like Java and Scala, but the scenario has changed now.

The advent of the Anaconda platform has given the language a great boost in speed. This is why Python for big data has become one of the most popular options in the industry. You can also hire a Python developer who can implement these Python benefits in your business.

Open-Source

Developed under a community-based model, Python is an open-source programming language. Being open source, Python supports multiple platforms and can run in various environments, such as Windows and Linux.

“My favorite language for maintainability is Python. It has simple, clean syntax, object encapsulation, good library support, and optional named parameters,” said Bram Cohen.

Library Support

Python offers multiple libraries, which makes it a famous programming language in fields like scientific computing. As big data involves a lot of data analysis and scientific computing, Python and big data serve as great companions.

Python offers a number of well-tested analytics libraries, covering areas such as the following (a minimal import sketch appears after the list):

  • Numerical computing
  • Data analysis
  • Statistical analysis
  • Visualization
  • Machine learning
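
For each of these areas, one library has become the community default. As a minimal illustration (the specific library choices here are ours, not a claim from this article), a typical analytics stack might be imported like this:

import numpy as np                                  # numerical computing
import pandas as pd                                 # data analysis
from scipy import stats                             # statistical analysis
import matplotlib.pyplot as plt                     # visualization
from sklearn.linear_model import LinearRegression   # machine learning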

Python’s Compatibility with Hadoop

Both Python and Hadoop are open source, which is one reason Python works more smoothly with Hadoop than many other programming languages do. You can incorporate these Python features into your business; to do so, you can hire Python developers from a reputed Python development company.

What are the benefits of using the Pydoop Package?

1. Access to HDFS API

The Pydoop package (Python and Hadoop) provides access to the HDFS API for Hadoop, which allows you to write Hadoop MapReduce programs and applications.
How is the HDFS API beneficial for you? It lets you read and write information easily on files, directories, and global file system properties without facing any hurdles.
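
As a hedged sketch of what this looks like in practice (the paths are invented for illustration, and a running HDFS cluster is assumed; mkdir, open, and ls are part of Pydoop’s documented hdfs module):

import pydoop.hdfs as hdfs

hdfs.mkdir("/user/analytics/input")                # create an HDFS directory (path is a placeholder)
with hdfs.open("/user/analytics/input/data.txt", "wt") as f:
    f.write("first record\n")                      # write a text file to HDFS
print(hdfs.ls("/user/analytics/input"))            # list the directory contents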

2. Offers MapReduce API

Pydoop offers a MapReduce API for solving complex problems with minimal programming effort. This API can be used to implement advanced data science concepts like ‘Counters’ and ‘Record Readers’, which makes Python programming a strong choice for big data.
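
A minimal word-count sketch with this API, modelled on the example in the Pydoop documentation (it assumes a configured Hadoop cluster and is submitted with the pydoop submit command):

import pydoop.mapreduce.api as api
import pydoop.mapreduce.pipes as pipes

class WordCountMapper(api.Mapper):
    def map(self, context):
        # emit (word, 1) for every word in the input line
        for word in context.value.split():
            context.emit(word, 1)

class WordCountReducer(api.Reducer):
    def reduce(self, context):
        # sum the counts collected for each word
        context.emit(context.key, sum(context.values))

def __main__():
    pipes.run_task(pipes.Factory(WordCountMapper, reducer_class=WordCountReducer))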


Speed

Python is considered one of the most popular languages for software development because of its high speed and performance. Since it executes code quickly, Python is an apt choice for big data.

Python supports rapid prototyping of ideas, which helps make the code run fast. Moreover, while doing so, Python sustains transparency between the code and the process.

Python contributes to making code readable and transparent, thus rendering great assistance in the maintenance of the code.

Scope

Python allows users to simplify data operations. As Python is an object-oriented language, it supports advanced data structures, including lists, sets, tuples, dictionaries, and many more.

Besides this, Python supports scientific computing operations such as matrix operations and data frames. These incredible features enhance the scope of the language, enabling it to speed up data operations. This is what makes Python and big data a powerful combination (see the sketch below).
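
As a quick illustrative sketch (the values are made up), here is what those structures and operations look like in plain Python with NumPy and pandas:

import numpy as np
import pandas as pd

scores = [88, 92, 79]                  # list
tags = {"big-data", "python"}          # set
point = (3, 4)                         # tuple
ages = {"alice": 31, "bob": 27}        # dictionary

matrix = np.array([[1, 2], [3, 4]])
print(matrix @ matrix)                 # matrix multiplication
frame = pd.DataFrame({"name": list(ages), "age": list(ages.values())})
print(frame)                           # a small data frame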

Data Processing Support

Python has built-in support for data processing. You can use this feature to process unstructured and unconventional data. This is why big data companies prefer Python, as such support is considered one of the most important requirements in big data. So, hire offshore Python programmers and enjoy the advantages of using Python in your business.

Final words

These were some of the benefits of using Python. By now, you should have a clear idea of why Python for big data is considered the best fit. Python is a simple, open-source language with high speed and robust library support.

“Big data is at the foundation of all the megatrends that are happening.” –Chris Lynch

With the use of big data technology spreading across the globe, meeting the requirements of this industry is surely a daunting task. But with its incredible benefits, Python has become a suitable choice for big data. You can leverage Python in your business to enjoy these advantages too.

Exploratory Data Analysis in Python

EDA (exploratory data analysis) is an approach to data analysis used for gaining a better understanding of data aspects like:

  • main features of data
  • variables and relationships that hold between them
  • identifying which variables are important for our problem

We shall look at various exploratory data analysis methods like:

  • Descriptive statistics, which give a brief overview of the dataset we are dealing with, including some measures and features of the sample
  • Grouping data [basic grouping with group by]
  • ANOVA (Analysis of Variance), a computational method for dividing the variation in a set of observations into different components
  • Correlation and correlation methods

The dataset we’ll be using is the Chile voting dataset, which you can import in Python as:

import pandas as pd

DF = pd.read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/car/Chile.csv")

Descriptive Statistics

Descriptive statistics is a helpful way to understand the characteristics of your data and get a quick summary of it. Pandas provides a handy method, describe(), which applies basic statistical computations to the dataset, such as extreme values, count of data points, standard deviation, etc. Any missing or NaN values are automatically skipped. The describe() function gives a good picture of the distribution of the data.

DF.describe()

Here’s the output you’ll get on running the above code:

Another useful method is value_counts(), which gets the count of each category in a categorical series of values. For instance, suppose you are dealing with a dataset of customers divided into youth, medium, and old categories under a column named age, and your dataframe is “DF”. You could run this statement to see how many people fall into each category. In our example dataset, the education column can be used:

DF["education"].value_counts()

The output of the above code will be:

One more useful tool is the boxplot, which you can use through the matplotlib module. A boxplot is a pictorial representation of the distribution of data that shows the extreme values, median, and quartiles, and it makes outliers easy to spot. Now consider the dataset we’ve been dealing with again, and let’s draw a boxplot on the population attribute:

import pandas as pd
import matplotlib.pyplot as plt

DF = pd.read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/car/Chile.csv")
y = list(DF.population)
plt.boxplot(y)
plt.show()

The output plot would look like this, with the outliers marked:

Grouping data

Group by is an interesting measure available in pandas that can help us figure out the effect of different categorical attributes on other data variables. Let’s see an example on the same dataset, where we want to figure out the effect of people’s education and vote on the other variables in the voting dataset.

DF.groupby(['education', 'vote']).mean()

The output would be somewhat like this:

If the group-by output table is hard to interpret, analysts often go further and use pivot tables and heat maps to visualize it.
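
As a hedged sketch of that step (the statusquo column name is assumed from the Rdatasets version of the Chile data), the same aggregation can be pivoted and rendered as a heat map:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

DF = pd.read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/car/Chile.csv")
pivot = DF.pivot_table(values="statusquo", index="education",
                       columns="vote", aggfunc="mean")
sns.heatmap(pivot, annot=True)   # annotate each cell with the group mean
plt.show()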

ANOVA

ANOVA stands for Analysis of Variance. It is performed to figure out the relation between different groups of categorical data.


Under ANOVA we have two measures as result:
– F-test score: the variation between the group means relative to the variation within the groups
– p-value: the statistical significance of the result

This can be performed using the SciPy method f_oneway().
Syntax:

 
import scipy.stats as st
st.f_oneway(sample1, sample2, ...)

These samples are the sample measurements for each group.
In conclusion, we can say that there is a strong relationship between a categorical variable and the other variables if the ANOVA test gives us a large F-test value and a small p-value.
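
As a hedged worked example on the Chile dataset loaded earlier (the statusquo and education column names are assumed from the Rdatasets version), we can test whether the mean statusquo score differs across education levels:

import pandas as pd
import scipy.stats as st

DF = pd.read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/car/Chile.csv")

# one sample of 'statusquo' scores per education level
groups = [g["statusquo"].dropna() for _, g in DF.groupby("education")]
f_score, p_value = st.f_oneway(*groups)
print("F-test score:", f_score)
print("p-value:", p_value)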

Correlation and Correlation computation

Correlation is a simple relationship between two variables in a context such that one variable affects the other. Correlation is different from causation. One way to calculate the correlation among variables is the Pearson correlation, which yields two parameters: the Pearson coefficient and the p-value. We can say there is a strong correlation between two variables when the Pearson coefficient is close to either 1 or -1 and the p-value is less than 0.0001.
SciPy also provides a method to perform Pearson correlation analysis. Syntax:

import scipy.stats as st
st.pearsonr(sample1, sample2)
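
A hedged worked example on the same Chile dataset (the age and statusquo column names are assumed from the Rdatasets version):

import pandas as pd
import scipy.stats as st

DF = pd.read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/car/Chile.csv")

# drop rows with missing values before computing the correlation
sub = DF[["age", "statusquo"]].dropna()
coef, p_value = st.pearsonr(sub["age"], sub["statusquo"])
print("Pearson coefficient:", coef)
print("p-value:", p_value)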

Loading Libraries:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import trim_mean

Loading Data:

data = pd.read_csv("state.csv")

# Check the type of data
print("Type : ", type(data), "\n\n")

# Printing Top 10 Records
print("Head -- \n", data.head(10))

# Printing last 10 Records
print("\n\n Tail -- \n", data.tail(10))

Output :

Type :  <class 'pandas.core.frame.DataFrame'>
 
 
Head -- 
          State  Population  Murder.Rate Abbreviation
0      Alabama     4779736          5.7           AL
1       Alaska      710231          5.6           AK
2      Arizona     6392017          4.7           AZ
3     Arkansas     2915918          5.6           AR
4   California    37253956          4.4           CA
5     Colorado     5029196          2.8           CO
6  Connecticut     3574097          2.4           CT
7     Delaware      897934          5.8           DE
8      Florida    18801310          5.8           FL
9      Georgia     9687653          5.7           GA
 
 
 Tail -- 
             State  Population  Murder.Rate Abbreviation
40   South Dakota      814180          2.3           SD
41      Tennessee     6346105          5.7           TN
42          Texas    25145561          4.4           TX
43           Utah     2763885          2.3           UT
44        Vermont      625741          1.6           VT
45       Virginia     8001024          4.1           VA
46     Washington     6724540          2.5           WA
47  West Virginia     1852994          4.0           WV
48      Wisconsin     5686986          2.9           WI
49        Wyoming      563626          2.7           WY

Code #1 : Adding Column to the dataframe

# Adding a new column with derived data
data['PopulationInMillions'] = data['Population'] / 1000000

# Changed data
print(data.head(5))

Output :

        State  Population  Murder.Rate Abbreviation  PopulationInMillions
0     Alabama     4779736          5.7           AL              4.779736
1      Alaska      710231          5.6           AK              0.710231
2     Arizona     6392017          4.7           AZ              6.392017
3    Arkansas     2915918          5.6           AR              2.915918
4  California    37253956          4.4           CA             37.253956

Code #2 : Data Description

data.describe()

Output :

Code #3 : Data Info

data.info()

Output :

 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 4 columns):
State           50 non-null object
Population      50 non-null int64
Murder.Rate     50 non-null float64
Abbreviation    50 non-null object
dtypes: float64(1), int64(1), object(2)
memory usage: 1.6+ KB

Code #4 : Renaming a column heading

# Rename the column heading, as the '.' in
# 'Murder.Rate' causes problems when calling
# functions on the column
data.rename(columns={'Murder.Rate': 'MurderRate'}, inplace=True)

# Let's check the column headings
list(data)

Output :

['State', 'Population', 'MurderRate', 'Abbreviation']

Code #5 : Calculating Mean

Population_mean = data.Population.mean()
print("Population Mean : ", Population_mean)

MurderRate_mean = data.MurderRate.mean()
print("\nMurderRate Mean : ", MurderRate_mean)

Output:

Population Mean :  6162876.3
 
MurderRate Mean :  4.066

Code #6 : Trimmed mean

# Mean after discarding the top and
# bottom 10 % of values, eliminating outliers
population_TM = trim_mean(data.Population, 0.1)
print("Population trimmed mean: ", population_TM)

murder_TM = trim_mean(data.MurderRate, 0.1)
print("\nMurderRate trimmed mean: ", murder_TM)

Output :

Population trimmed mean:  4783697.125
 
MurderRate trimmed mean:  3.9450000000000003

Code #7 : Weighted Mean

# here the murder rate is weighted by
# the state population
murderRate_WM = np.average(data.MurderRate, weights=data.Population)
print("Weighted MurderRate Mean: ", murderRate_WM)

Output :

Weighted MurderRate Mean:  4.445833981123393

Code #8 : Median

Population_median = data.Population.median()
print("Population median : ", Population_median)

MurderRate_median = data.MurderRate.median()
print("\nMurderRate median : ", MurderRate_median)

Output :

Population median :  4436369.5
 
MurderRate median :  4.0

We have discussed some basic techniques to analyze the data; now let’s see the visual techniques.

First, here is the complete code so far –

# Loading Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import trim_mean

# Loading Data
data = pd.read_csv("state.csv")

# Check the type of data
print("Type : ", type(data), "\n\n")

# Printing Top 10 Records
print("Head -- \n", data.head(10))

# Printing last 10 Records
print("\n\n Tail -- \n", data.tail(10))

# Adding a new column with derived data
data['PopulationInMillions'] = data['Population'] / 1000000

# Changed data
print(data.head(5))

# Rename the column heading, as the '.' in
# 'Murder.Rate' causes problems when calling
# functions on the column
data.rename(columns={'Murder.Rate': 'MurderRate'}, inplace=True)

# Let's check the column headings
list(data)

Output :

Type :  <class 'pandas.core.frame.DataFrame'>
 
 
Head -- 
          State  Population  Murder.Rate Abbreviation
0      Alabama     4779736          5.7           AL
1       Alaska      710231          5.6           AK
2      Arizona     6392017          4.7           AZ
3     Arkansas     2915918          5.6           AR
4   California    37253956          4.4           CA
5     Colorado     5029196          2.8           CO
6  Connecticut     3574097          2.4           CT
7     Delaware      897934          5.8           DE
8      Florida    18801310          5.8           FL
9      Georgia     9687653          5.7           GA
 
 
 Tail -- 
             State  Population  Murder.Rate Abbreviation
40   South Dakota      814180          2.3           SD
41      Tennessee     6346105          5.7           TN
42          Texas    25145561          4.4           TX
43           Utah     2763885          2.3           UT
44        Vermont      625741          1.6           VT
45       Virginia     8001024          4.1           VA
46     Washington     6724540          2.5           WA
47  West Virginia     1852994          4.0           WV
48      Wisconsin     5686986          2.9           WI
49        Wyoming      563626          2.7           WY
 
 
        State  Population  Murder.Rate Abbreviation  PopulationInMillions
0     Alabama     4779736          5.7           AL              4.779736
1      Alaska      710231          5.6           AK              0.710231
2     Arizona     6392017          4.7           AZ              6.392017
3    Arkansas     2915918          5.6           AR              2.915918
4  California    37253956          4.4           CA             37.253956
 
 
['State', 'Population', 'MurderRate', 'Abbreviation']

Visualizing Population per Million

# Plot Population In Millions
fig, ax1 = plt.subplots()
fig.set_size_inches(15, 9)

ax1 = sns.barplot(x="State", y="PopulationInMillions",
                  data=data.sort_values('MurderRate'),
                  palette="Set2")

ax1.set(xlabel='States', ylabel='Population In Millions')
ax1.set_title('Population in Millions by State', size=20)

plt.xticks(rotation=-90)

Output:

(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
        17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
        34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]),
 a list of 50 Text xticklabel objects)

Visualizing Murder Rate per Lakh

# Plot Murder Rate per 100, 000
fig, ax2 = plt.subplots()
fig.set_size_inches(15, 9)

ax2 = sns.barplot(x="State", y="MurderRate",
                  data=data.sort_values('MurderRate', ascending=1),
                  palette="husl")

ax2.set(xlabel='States', ylabel='Murder Rate per 100000')
ax2.set_title('Murder Rate by State', size=20)

plt.xticks(rotation=-90)

Output :

(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
        17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
        34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]),
 a list of 50 Text xticklabel objects)


Although Louisiana is ranked 17th by population (about 4.53M), it has the highest murder rate, 10.3 per 100,000 people.

Code #1 : Standard Deviation

Population_std = data.Population.std()
print("Population std : ", Population_std)

MurderRate_std = data.MurderRate.std()
print("\nMurderRate std : ", MurderRate_std)

Output :

Population std :  6848235.347401142
 
MurderRate std :  1.915736124302923

Code #2 : Variance

Population_var = data.Population.var()
print("Population var : ", Population_var)

MurderRate_var = data.MurderRate.var()
print("\nMurderRate var : ", MurderRate_var)

Output :

Population var :  46898327373394.445
 
MurderRate var :  3.670044897959184

Code #3 : Inter Quartile Range

# Inter Quartile Range of Population
population_IQR = (data.Population.describe()['75%'] -
                  data.Population.describe()['25%'])

print("Population IQR : ", population_IQR)

# Inter Quartile Range of Murder Rate
MurderRate_IQR = (data.MurderRate.describe()['75%'] -
                  data.MurderRate.describe()['25%'])

print("\nMurderRate IQR : ", MurderRate_IQR)

Output :

Population IQR :  4847308.0
 
MurderRate IQR :  3.124999999999999

Code #4 : Median Absolute Deviation (MAD)

Population_mad = data.Population.mad()
print("Population mad : ", Population_mad)

MurderRate_mad = data.MurderRate.mad()
print("\nMurderRate mad : ", MurderRate_mad)

Output :

Population mad :  4450933.356000001
 
MurderRate mad :  1.5526400000000005

Data analysis and Visualization with Python

Python is a great language for doing data analysis, primarily because of the fantastic ecosystem of data-centric Python packages. Pandas is one of those packages, and it makes importing and analyzing data much easier. In this article, pandas is used to analyze the Country Data.csv file from the UN public datasets hosted on the popular ‘statweb.stanford.edu’ website.


Installation
The easiest way to install pandas is with pip:

pip install pandas

Creating A DataFrame in Pandas

A DataFrame can be created by passing multiple Series objects, built with pd.Series, into the DataFrame class. Here, two Series objects are passed: s1 as the first row and s2 as the second row.


Example:

import pandas as pd

# assigning two series to s1 and s2
s1 = pd.Series([1, 2])
s2 = pd.Series(["Ashish", "Sid"])

# framing series objects into data
df = pd.DataFrame([s1, s2])

# show the data frame
df

# data framing in another way,
# taking index and column values
dframe = pd.DataFrame([[1, 2], ["Ashish", "Sid"]],
                      index=["r1", "r2"],
                      columns=["c1", "c2"])
dframe

# framing in another way:
# dict-like container
dframe = pd.DataFrame({
    "c1": [1, "Ashish"],
    "c2": [2, "Sid"]})
dframe

Output:

Importing Data with Pandas

The first step is to read the data. The data is stored as a comma-separated values (csv) file, where each row is separated by a new line and each column by a comma (,). To work with the data in Python, we need to read the csv file into a pandas DataFrame. A DataFrame is a way to represent and work with tabular data, which has rows and columns, just like this csv file.


Example:

# Import the pandas library, renamed as pd
import pandas as pd

# Read IND_data.csv into a DataFrame, assigned to df
df = pd.read_csv("IND_data.csv")

# Prints the first 5 rows of a DataFrame as default
df.head()

# Prints no. of rows and columns of a DataFrame
df.shape

Output:

(29, 10)

Indexing DataFrames with Pandas

Indexing is possible using the pandas.DataFrame.iloc method. The iloc method allows us to retrieve rows and columns by position.

Examples:

# prints first 5 rows and every column, which replicates df.head()
df.iloc[0:5, :]

# prints all rows and columns
df.iloc[:, :]

# prints rows from index 5 onward and the first 5 columns
df.iloc[5:, :5]

Indexing Using Labels in Pandas

Indexing can also be done with labels using the pandas.DataFrame.loc method, which allows us to index using labels instead of positions.

Examples:

# prints the first six rows (labels 0 through 5, inclusive) and every column of df
df.loc[0:5, :]

# prints rows from label 5 onward and all columns
df.loc[5:, :]

The above doesn’t actually look much different from df.iloc[0:5,:]. This is because, while row labels can take on any values, our row labels match the positions exactly. Column labels, however, can make things much easier when working with data. Example:

# Prints the first rows (labels 0 through 5) of the
# "Time period" column
df.loc[:5, "Time period"]

DataFrame Math with Pandas

Computations on data frames can be done using the statistical functions pandas provides.
Examples:

# computes various summary statistics, excluding NaN values
df.describe()

# for computing correlations
df.corr()

# computes numerical data ranks
df.rank()

Pandas Plotting

Plots in these examples are made using the standard convention for referencing the matplotlib API, on which pandas builds to make creating decent-looking plots easy.
Examples:

# import the required module
import matplotlib.pyplot as plt

# plot a histogram
df['Observation Value'].hist(bins=10)

# shows presence of a lot of outliers/extreme values
df.boxplot(column='Observation Value', by='Time period')

# plotting points as a scatter plot
x = df["Observation Value"]
y = df["Time period"]
plt.scatter(x, y, label="stars", color="m",
            marker="*", s=30)

# x-axis label
plt.xlabel('Observation Value')

# frequency label
plt.ylabel('Time period')

# function to show the plot
plt.show()

Storing DataFrame in CSV Format :

Pandas provides the to_csv('filename', index=False|True) function to write a DataFrame into a CSV file. Here, filename is the name of the CSV file you want to create, and index tells whether the DataFrame’s index should be written to the file. By default, index is True and the index is written; if we set index = False, the index is omitted.

Example :

import pandas as pd

# assigning three series to s1, s2, s3
s1 = pd.Series([0, 4, 8])
s2 = pd.Series([1, 5, 9])
s3 = pd.Series([2, 6, 10])

# taking index and column values
dframe = pd.DataFrame([s1, s2, s3])

# assign column names
dframe.columns = ['Geeks', 'For', 'Geeks']

# write data to csv files
dframe.to_csv('geeksforgeeks.csv', index=False)
dframe.to_csv('geeksforgeeks1.csv', index=True)

Output :

geeksforgeeks.csv

geeksforgeeks1.csv

Handling Missing Data

The data analysis phase also comprises the ability to handle missing data in our dataset, and not so surprisingly, pandas lives up to that expectation as well. This is where the dropna and/or fillna methods come into play. When dealing with missing data, you as a data analyst are either supposed to drop the column containing the NaN values (the dropna method) or fill in the missing data with the mean or mode of the whole column entry (the fillna method). This decision is of great significance and depends upon the data and the effect it would have on our results.

  • Drop the missing Data :
    Consider the DataFrame generated by the code below:

import numpy as np
import pandas as pd

# Create a DataFrame
dframe = pd.DataFrame({'Geeks': [23, 24, 22],
                       'For': [10, 12, np.nan],
                       'geeks': [0, np.nan, np.nan]},
                      columns=['Geeks', 'For', 'geeks'])

# This will remove all the rows with NaN values.
# If axis is not defined, then it is along rows, i.e. axis = 0
dframe.dropna(inplace=True)
print(dframe)

# if axis is equal to 1, columns with NaN values are dropped
dframe.dropna(axis=1, inplace=True)
print(dframe)

Output :

axis=0

axis=1

  • Fill the missing values :
    Now, to replace any NaN value with the mean or mode of the data, fillna is used; it can replace all the NaN values in a particular column, or even in the whole DataFrame, as per the requirement.
import numpy as np
import pandas as pd

# Create a DataFrame
dframe = pd.DataFrame({'Geeks': [23, 24, 22],
                       'For': [10, 12, np.nan],
                       'geeks': [0, np.nan, np.nan]},
                      columns=['Geeks', 'For', 'geeks'])

# Use fillna on the complete DataFrame:
# the value will be applied to every column
dframe.fillna(value=dframe.mean(), inplace=True)
print(dframe)

# filling values of one column only
dframe['For'].fillna(value=dframe['For'].mean(),
                     inplace=True)
print(dframe)

Output :

Groupby Method (Aggregation) :

The groupby method allows us to group the data based on any row or column, so we can then apply aggregate functions to analyze it. Groups are formed using a mapper (a dict or key function, with the given function applied to each group and the results returned as a series) or by a series of columns.

Consider the DataFrame generated by the code below:

import pandas as pd
import numpy as np

# create DataFrame
dframe = pd.DataFrame({'Geeks': [23, 24, 22, 22, 23, 24],
                       'For': [10, 12, 13, 14, 15, 16],
                       'geeks': [122, 142, 112, 122, 114, 112]},
                      columns=['Geeks', 'For', 'geeks'])

# Apply groupby and the aggregate function
# max to find the max value of column
# "For" and column "geeks" for every
# different value of column "Geeks".
print(dframe.groupby(['Geeks']).max())

Output :

Analysis of test data using K-Means Clustering in Python

This is an illustration of K-means clustering on sample random data using the OpenCV library.

Prerequisites: NumPy, OpenCV, Matplotlib
Let’s first visualize the test data with multiple features using Matplotlib.

# importing required tools
import numpy as np
from matplotlib import pyplot as plt

# creating two test data sets
X = np.random.randint(10, 35, (25, 2))
Y = np.random.randint(55, 70, (25, 2))
Z = np.vstack((X, Y))
Z = Z.reshape((50, 2))

# convert to np.float32
Z = np.float32(Z)

plt.xlabel('Test Data')
plt.ylabel('Z samples')

plt.hist(Z, 256, [0, 256])

plt.show()

Here ‘Z’ is a 50×2 array that stacks the two test sets, with values ranging from 10 to 70. Stacking the sets into a single array in this way becomes more useful when more than one feature is present. Finally, the data is converted to the np.float32 type, which is what cv2.kmeans() expects.

Output:

Now, apply the k-means clustering algorithm to the same test data and see its behavior.


Steps involved:
1) First we need to set up the test data.
2) Define the criteria and apply kmeans().
3) Now separate the data.
4) Finally, plot the data.

import numpy as np
import cv2
from matplotlib import pyplot as plt

X = np.random.randint(10, 45, (25, 2))
Y = np.random.randint(55, 70, (25, 2))
Z = np.vstack((X, Y))

# convert to np.float32
Z = np.float32(Z)

# define criteria and apply kmeans()
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
ret, label, center = cv2.kmeans(Z, 2, None, criteria, 10,
                                cv2.KMEANS_RANDOM_CENTERS)

# Now separate the data
A = Z[label.ravel() == 0]
B = Z[label.ravel() == 1]

# Plot the data
plt.scatter(A[:, 0], A[:, 1])
plt.scatter(B[:, 0], B[:, 1], c='r')
plt.scatter(center[:, 0], center[:, 1], s=80, c='y', marker='s')
plt.xlabel('Test Data')
plt.ylabel('Z samples')
plt.show()

Output:

This example illustrates a case where k-means produces intuitively plausible clusters.

Applications:
1) Identifying Cancerous Data.
2) Prediction of Students’ Academic Performance.
3) Drug Activity Prediction.