Tools to Mine Big Data Analytics

    Before it deployed a Hadoop cluster five years ago, retailer Macy’s Inc. had big problems analyzing all of the sales and marketing data its systems were generating. And the problems were only getting bigger as Macy’s pushed aggressively to increase its online business, further ratcheting up the data volumes it was looking to explore.

    The company’s traditional data warehouse architecture had severe processing limitations and couldn’t handle unstructured information, such as text. Historical data was also largely inaccessible, typically having been archived on tapes that were shipped to off-site storage facilities. Data scientists and other analysts “could only run so many queries at particular times of the day,” said Seetha Chakrapany, director of marketing analytics and customer relationship management (CRM) systems at Macy’s. “They were pretty much shackled. They couldn’t do their jobs.”

    The Hadoop system has alleviated the situation, providing a big data analytics architecture that also supports basic business intelligence (BI) and reporting processes. Going forward, the cluster “could truly be an enterprise data analytics platform” for Macy’s, Chakrapany said. Already, along with the analytics teams using it, thousands of business users in marketing, merchandising, product management and other departments are accessing hundreds of BI dashboards that are fed to them by the system.

    But there’s a lot more to the Macy’s big data environment than the Hadoop cluster alone. At the front end, for example, Macy’s has deployed a variety of analytics tools to meet different application needs. For statistical analysis, the Cincinnati-based retailer uses SAS and Microsoft’s R Server, which is based on the R open source statistical programming language.

    Several other tools provide predictive analytics, data mining and machine learning capabilities. That includes H2O, Salford Predictive Modeler, the Apache Mahout open source machine learning platform and KXEN — the latter an analytics technology that SAP bought three years ago and has since folded into its SAP BusinessObjects Predictive Analytics software. Also in the picture at Macy’s are Tableau Software’s data visualization tools and AtScale’s BI on Hadoop technology.

    A better way to analyze big data

    All the different tools are key elements in making effective use of the big data analytics architecture, Chakrapany said in a presentation and follow-up interview at Hadoop Summit 2016 in San Jose, Calif. Automating the advanced analytics process through statistical routines and machine learning is a must, he noted.

    “We’re constantly in a state of experimentation. And because of the volume of data, there’s just no humanly possible way to analyze it manually,” Chakrapany said. “So, we apply all the statistical algorithms to help us see what’s happening with the business.” That includes analysis of customer, order, product and marketing data, plus clickstream activity records captured from the website.

    Similar scenarios are increasingly playing out at other organizations, too. As big data platforms such as Hadoop, NoSQL databases and the Spark processing engine become more widely adopted, the number of companies deploying advanced analytics tools that can help them take advantage of the data flowing into those systems is also on the rise.

    In an ongoing survey on the use of BI and analytics software, 26.7% of some 7,000 respondents said, as of November 2016, that their organizations had installed predictive analytics tools. Looking forward, predictive analytics also topped the list of technologies for planned investments. It was cited by 39.5% of the respondents, putting it above data visualization, self-service BI and enterprise reporting, all of which are more mainstream BI technologies.

    A TDWI survey conducted in the second half of 2015 also found increasing plans to use predictive analytics software to bolster business operations. In that case, 87% of 309 BI, analytics and data management professionals said their organizations were already active users of the technology or expected to implement it within three years. Other forms of advanced analytics, such as what-if simulations and prescriptive analytics, are similarly in line for increased usage, according to a report on the survey, which was published last December (see “Predicting High Growth” chart).

    Predictive analytics use is on the rise.

    Algorithms find meaning in data sets

    Machine learning tools and other types of artificial intelligence technologies, deep learning and cognitive computing among them, are also getting increased attention from technology users and vendors as analytics teams look to automated algorithms to help them make sense of data sets that are getting larger and larger.

    Progressive Casualty Insurance Co. is another company that’s already there. The Mayfield Village, Ohio-based insurer uses a Hadoop cluster partly to power its Snapshot program, which awards policy discounts to safe drivers based on operational data collected from their vehicles through a device that plugs into the on-board diagnostics port.

    The cluster is based on the Hortonworks distribution of Hadoop, as is the one at Macy’s. About 60 compute nodes are dedicated to the Snapshot initiative, and Progressive’s big data analytics architecture includes tools such as SAS, R and H2O, which the company’s data scientists use in analyzing the driving data processed in the Hadoop system.

    The data scientists run predictive algorithms backed up by heavy-duty data visualizations to help score customers participating in the program on their driving safety. They also look for bad driving habits and possible mechanical problems in vehicles, such as alternator issues signaled by abnormal voltage fluctuations captured as part of the incoming data.
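    To make the voltage example concrete, here is a minimal sketch of how abnormal fluctuations might be flagged in a stream of readings. It is not Progressive's method, which the article does not describe; the function name, window size, threshold and sample readings are all hypothetical, and a rolling standard deviation stands in for whatever models the data scientists actually use.

```python
from statistics import pstdev


def flag_voltage_anomalies(readings, window=5, max_stdev=0.5):
    """Return the start index of every window whose spread exceeds max_stdev.

    readings  -- voltage samples in volts, in time order (hypothetical input)
    window    -- number of consecutive samples per window (assumed value)
    max_stdev -- spread in volts treated as abnormal (assumed value)
    """
    flagged = []
    for start in range(len(readings) - window + 1):
        chunk = readings[start:start + window]
        # Population standard deviation measures how much the voltage
        # swings within this window.
        if pstdev(chunk) > max_stdev:
            flagged.append(start)
    return flagged


# Made-up readings: one steady series and one erratic series.
steady = [14.1, 14.0, 14.2, 14.1, 14.0, 14.1]
erratic = [14.1, 12.3, 15.0, 11.9, 14.8, 12.2]

print(flag_voltage_anomalies(steady))
print(flag_voltage_anomalies(erratic))
```

    In this sketch the steady series produces no flagged windows, while both windows of the erratic series are flagged. A production system would more likely learn the threshold from historical data than hard-code it.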

    The predictive analytics and machine learning capabilities are “huge,” said Pawan Divakarla, Progressive’s data and analytics business leader. “You have so much data, and you have fancier and fancier models for analyzing it. You need something to assist you, to see what works.”

    Going deeper on big data analytics

    Yahoo was the first production user of Hadoop in 2006, when the technology’s co-creator, Doug Cutting, was working at the web search and internet services company, and it claims to be the largest Hadoop user today. Yahoo’s big data analytics architecture includes more than 40,000 nodes running 300-plus applications across 40 clusters that mix Hadoop with its companion Apache HBase database, the Apache Storm real-time processing engine and other big data technologies. But the Sunnyvale, Calif., company’s use of those technologies continues to expand into new areas.

    “Even after 10 years, we’re still uncovering benefits,” said Andy Feng, vice president in charge of Yahoo’s big data and machine learning architecture. Feng estimated that, over the past three years, he has spent about 95% of his time at work focusing on machine learning tools and applications. In the past, the automated algorithms that could be built and run with existing machine learning technologies “weren’t capable of leveraging huge data sets on Hadoop clusters,” Feng said. “The accuracy wasn’t that good.”

    “We always did machine learning, but we did it in a constrained fashion, so the results were limited,” added Sumeet Singh, senior director of product development for cloud and big data platforms at Yahoo. However, he and Feng said things have changed for the better in recent years, and in a big way. “We’ve seen an amazing resurgence in artificial intelligence and machine learning, and one of the reasons is all the data,” Singh noted.

    For example, Yahoo is now running a machine learning algorithm that uses a semantic analysis process to better match paid ads on search results pages to the search terms entered by web users; it has led to a 9% increase in revenue per search, according to Feng. Another machine learning application lets users of Yahoo’s Flickr online photo and video service organize images based on their visual content instead of the date on which they were taken. The algorithm can also flag photos as not suitable for viewing at work to help users avoid potentially embarrassing situations in the office, Feng said.

    These new applications were made possible partly through the addition of graphics processing units to Hadoop cluster nodes, Feng said. The GPUs do image processing that conventional CPUs can’t handle. Yahoo also added Spark to the big data analytics architecture to take over some of the processing work.

    In addition, it deployed MLlib, Spark’s built-in library of machine learning algorithms. However, those algorithms turned out to be too basic, Singh said. That prompted the big data team to develop CaffeOnSpark, a library of deep learning algorithms that Yahoo has made available as an open source technology on the GitHub website.

    Data Analytics Programming Languages

    A programming language is a formal language comprising a set of instructions that produce various kinds of output. These languages are used in computer programs to implement algorithms and have many applications, and several of them are well suited to data science. Data scientists should learn and master at least one language, as it is an essential tool for carrying out various data science functions.

    Data science brings together statistics, data analysis and their related methods to understand and analyze real-world phenomena with data. It draws on theories and techniques from many fields within the broad areas of statistics, mathematics, computer science and information science.

    Before becoming an expert in data science, learning a programming language is a crucial requirement. Data scientists should weigh the pros and cons of the different types of programming languages for data science before making a decision.

    Data science is an exciting field to work in, combining advanced statistical and quantitative skills with real-world programming ability. There are various programming languages in which an aspiring data scientist should consider building some expertise. Now, let’s take a look at the top programming languages a data scientist should master.