What is data mining?
Data mining is the process of finding and examining patterns and other relationships within large data sets to discover critical insights.
The findings of data mining operations can help predict potential outcomes, determine strategies, improve decision-making, and ultimately solve business problems of all kinds. As such, data mining is sometimes referred to as "knowledge discovery." Numerous processes fall under the broad umbrella of data mining, including data cleansing, modeling, analysis, testing, classification, clustering, and reporting.
Data mining is so important precisely because it can be applied to business problems and objectives across countless areas of an enterprise, ranging from marketing and customer service to supply chain management and fraud detection.
Mining for data in the modern era
The specifics of data mining projects and processes will vary from organization to organization. That being said, there are a number of established standards informing data mining and analysis in today's enterprises. The best known of these is the Cross-Industry Standard Process for Data Mining (CRISP-DM).
CRISP-DM offers a particularly effective way to represent data mining as a logical and understandable process to those who aren't well-versed in the field. It is broken down into six distinct steps:
6 Phases of CRISP-DM
- Business understanding: Data teams must clearly establish a primary objective for any data mining project. This can often be expressed as the desire to find the answer to a question. E.g., "How is the new customer relationship management (CRM) platform that the sales team just adopted affecting individual reps' productivity and revenue generation?" This step also involves defining success criteria—which, for this example, would include data points detailing positive effects on the sales team's productivity and contribution to the bottom line.
- Data understanding: Determining what data will be necessary to approach and eventually solve the business problem, and then collecting it. Keeping with the CRM example, much of the data that will answer the question comes from the CRM—but other applications, including enterprise resource planning (ERP) and human resources information systems (HRIS), can also provide contextually relevant data.
- Data preparation: After data collection is complete, analysts and data scientists must cleanse the data by removing duplicate records, discarding outliers that would distort the analysis, and handling missing values. The data is then transformed into a format suited to analyzing the business question. In some cases, some of the data set's dimensions may be dropped if keeping them would slow down modeling and computation too much.
- Modeling: One or more data mining algorithms are selected as the basis for models; most mining projects benefit from building and comparing several. Data teams then build the models using programming languages like Python or R.
- Evaluation: Data teams determine which models produce results that best meet the defined success criteria. Evaluation also involves taking a step back to review the entire mining process, checking for and correcting any mistakes that might have been made.
- Deployment: Finally, data teams will run data through their models for analysis, and report the data mining results to all relevant stakeholders and decision-makers.
These days, many elements of the CRISP-DM steps can be automated, so the whole process isn't as drawn out as it might initially appear.
Following CRISP-DM to the letter won't be ideal for every organization. But nothing stops a senior data analyst and their team from adapting the specifics of CRISP-DM, or any other data mining methodology, to better suit their needs. It's common, especially in enterprises, for data teams to use different methods for different data mining projects.
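To make the data preparation phase more concrete, here is a minimal Python sketch of the cleansing steps described above: deduplication, handling missing values, and discarding outliers. The records, field names, and the "five times the median" outlier cutoff are all hypothetical choices made for illustration; a real project would pick rules that fit its own data.

```python
from statistics import median

# Hypothetical CRM export: one record per sales rep per month.
raw_rows = [
    {"rep_id": "r1", "month": "2024-01", "deals_closed": 12},
    {"rep_id": "r1", "month": "2024-01", "deals_closed": 12},    # duplicate record
    {"rep_id": "r2", "month": "2024-01", "deals_closed": None},  # missing value
    {"rep_id": "r3", "month": "2024-01", "deals_closed": 9},
    {"rep_id": "r4", "month": "2024-01", "deals_closed": 11},
    {"rep_id": "r5", "month": "2024-01", "deals_closed": 400},   # likely entry error
]


def deduplicate(rows):
    """Keep the first occurrence of each (rep_id, month) pair."""
    seen = set()
    unique = []
    for row in rows:
        key = (row["rep_id"], row["month"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def drop_missing(rows, field):
    """Discard rows where the given field has no value."""
    return [row for row in rows if row[field] is not None]


def drop_outliers(rows, field, max_ratio=5.0):
    """Discard rows whose value exceeds max_ratio times the median (assumed rule)."""
    mid = median(row[field] for row in rows)
    return [row for row in rows if row[field] <= max_ratio * mid]


def prepare(rows, field="deals_closed"):
    """Run the three cleansing steps in order."""
    return drop_outliers(drop_missing(deduplicate(rows), field), field)


clean_rows = prepare(raw_rows)
print([row["rep_id"] for row in clean_rows])  # ['r1', 'r3', 'r4']
```

Dropping rows with missing values is the simplest option; depending on the business question, a team might instead fill them in from another source system.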
Data mining techniques
Within the framework of methodologies like CRISP-DM, certain data mining techniques form the basis of algorithms and models. The following are some of the most common:
- Association rules: This technique describes relationships between variables in data using "if-then" statements—e.g., "If sales reps use the new CRM for two months or more, their productivity jumps by at least 10%."
- Decision tree: Using a tree-like visualization, this data mining technique depicts the known or projected potential outcomes of a series of decisions.
- Clustering: This technique groups data into clusters, each containing elements that are more similar to one another than to elements in other clusters.
- K-nearest neighbor: Based on the assumption that similar data points lie near one another, the k-nearest neighbor algorithm classifies a data point according to the labels of the k points closest to it.
- Neural networks: Perhaps the most advanced data mining method, neural networks process data through layers of interconnected nodes loosely modeled on the human brain. Networks with many such layers are the basis of deep learning, a sophisticated form of machine learning.
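Two of the techniques above are simple enough to sketch in a few lines of Python. The first function scores an "if-then" association rule by its support (how often both sides occur together) and confidence (how often the "then" holds when the "if" does). The second classifies a point by majority vote among its k nearest neighbors. The data, labels, and function names are hypothetical, and production work would more likely rely on an established library than on hand-written code like this.

```python
from collections import Counter
from math import dist


def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_antecedent = sum(1 for t in transactions if antecedent <= set(t))
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= set(t))
    support = n_both / len(transactions)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


def knn_classify(labeled_points, query, k=3):
    """Label the query point by majority vote among its k nearest neighbors."""
    nearest = sorted(labeled_points, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical observations: one set of attributes per sales rep.
observations = [
    {"uses_crm", "high_productivity"},
    {"uses_crm", "high_productivity"},
    {"uses_crm"},
    {"high_productivity"},
]
support, confidence = rule_metrics(observations, {"uses_crm"}, {"high_productivity"})
print(support, round(confidence, 2))  # 0.5 0.67

# Hypothetical labeled points: (calls per day, deals per month) -> productivity tier.
reps = [
    ((1.0, 1.0), "low"), ((1.5, 2.0), "low"), ((2.0, 1.5), "low"),
    ((8.0, 8.0), "high"), ((9.0, 8.5), "high"), ((8.5, 9.5), "high"),
]
print(knn_classify(reps, (8.2, 8.8)))  # high
```

The rule here holds for two of the three reps who use the CRM, which is why its confidence is about 0.67 even though its support is only 0.5.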
Data mining use cases
The best way to understand the impact of data mining is to examine how two enterprises—both Teradata customers—used it to remarkable effect.
Groupon perfects user recommendations
The success of Groupon is contingent on deals from its client merchants showing up in the feeds of users most likely to take advantage of those deals. Through sophisticated data mining of its cloud-hosted data warehouse, Groupon can query its data inventory in real time and efficiently craft customized recommendations for website and mobile users. Data mining is also important for the company's financial analytics workloads, helping to facilitate effective compliance reporting.
Medibank improves the customer experience
About 15% of Australia's population receives either health insurance or other healthcare-related services from Medibank. As such, it's critical that the needs of its 3.76 million customers are met, and data mining helps the health services organization do so. Medibank runs data mining and analytics operations in the cloud to boost the performance of its customer loyalty, marketing, and member health business units, ensuring customers always have personalized, useful information at their fingertips.
Benefits and challenges of data mining
The key advantage of data mining is that, when properly implemented and executed using the right techniques and tools, it can help address data-related issues across all units of the business.
- Marketing teams, for example, can use insights from data mining to recognize trends that help them deliver a company's message to the most promising leads, with whom sales reps can then negotiate deals more successfully.
- Recommendation engines can be created based on data from those sales to ensure new customers become consistent customers.
- In industrial sectors, data mining can help facilities managers make comprehensive assessments of equipment inventory and know which machines are at greatest risk for failure.
- Amid a startling, recent rise in fraud, the ability to discover anomalies through data mining allows for faster, more efficient fraud detection. Anomaly detection of this kind will only become more important in the years to come.
- Data mining can help enterprises adopt advanced approaches to leveraging their data, such as decision intelligence and predictive analytics.
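As a rough illustration of the anomaly detection mentioned in the fraud example above, the Python sketch below flags values that sit unusually far from the median of a set of transaction amounts. It uses the modified z-score based on the median absolute deviation, which is one common statistical approach and not necessarily what any particular fraud system uses. The amounts and the 3.5 threshold are assumptions chosen for the example.

```python
from statistics import median


def flag_anomalies(amounts, threshold=3.5):
    """Return indices of values whose modified z-score exceeds the threshold."""
    center = median(amounts)
    # Median absolute deviation: a spread measure that outliers barely affect.
    mad = median(abs(x - center) for x in amounts)
    if mad == 0:
        return []  # No spread to measure against, so nothing is flagged.
    return [
        i for i, x in enumerate(amounts)
        if 0.6745 * abs(x - center) / mad > threshold
    ]


# Hypothetical card transactions for one customer.
amounts = [42.0, 38.5, 45.1, 40.0, 39.9, 41.3, 980.0, 43.7]
print(flag_anomalies(amounts))  # [6]
```

Using the median rather than the mean matters here: a single very large transaction would inflate a mean and standard deviation enough to partly hide itself.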
Data mining also comes with potential challenges, and it's important to do everything possible to avoid such pitfalls.
- For example, if a data scientist uses a data mining model that isn't suited to the particular business problem, the results won't provide the insight they should. It's therefore critical to have a data team with a wide range of mining experience.
- As awareness of data's value—and how that value is sometimes exploited—increases among the general public, compliance with data sovereignty and privacy laws is a must.
- The larger a data set, the more complex it is to mine. Data mining algorithms and models should be devised with scalability in mind.
- There will be redundant, unnecessary, and otherwise "noisy" data in almost every set. Much of it can be eliminated early in the mining process, but data teams must keep an eye out from beginning to end for data noise that jeopardizes the project's integrity.
Maximizing mined data with a cloud analytics solution
The complexity and scope of data mining require tools, systems, and best practices suited to those attributes, such as the cloud and its all but limitless resources. A hybrid multi-cloud deployment may be particularly advantageous, allowing data teams to move freely between cloud and on-premises infrastructure during the mining process.
Teradata Vantage is the ideal solution for the analysis and reporting phases of data mining efforts. It allows for seamless data integration from all sources and is compatible with all major cloud providers, including Amazon Web Services, Microsoft Azure, and Google Cloud. To learn more about Vantage, contact us today.