Data Mining in Science and Technology

This article reviews data mining from its historical development to its applications in various fields of science and technology.

Summary: In science and technology, huge amounts of data are collected and stored in computers so that useful information can be extracted later. Often it is not known at the time of collection what information will later be requested, so the database is not designed to yield any particular information and is, to a large extent, unstructured. The science of extracting information from large collections of data is referred to as “Data Mining”, sometimes called “Knowledge Discovery”. This paper reviews data mining from its historical development to its applications in various fields of science and technology.

1. Introduction: Scientists refer to the 21st century as the age of data. Advances in science and technology have enabled us to collect large amounts of data in the form of signals, images, text, spatial data and other complex data [22]. Such data sets arise in diverse fields such as financial markets, meteorology, medical imaging, remote sensing, physics, chemistry, materials science, astronomy and bioinformatics. They can be obtained from simulations, experiments or observations. Data mining is the process of uncovering patterns, associations, anomalies and significant features in large, often unstructured data.

It is a multidisciplinary field that borrows and extends ideas from domains including image processing, signal processing, machine learning, optimization, high-performance computing, information retrieval and computer vision. It holds the promise of helping the scientific and technological community analyze massive, complex data sets, gain fundamental insights, and so make sound decisions and discoveries.

2. Developmental History of Data Mining: Data mining emerged about 60 years ago from the joint work of mathematicians, statisticians, logicians and computer scientists to create artificial intelligence and machine learning.

The term Data Mining came into use during the 1960s, when artificial intelligence practitioners and statisticians developed new algorithms and techniques such as regression analysis, maximum likelihood estimation and neural networks. In this decade the field of information retrieval contributed clustering techniques and similarity measures. At the time these techniques were applied to text documents, but they would later be used for mining data in databases and other large, distributed data sets. By the end of the 1960s, information retrieval and database systems were developing in parallel.

In 1971, Gerard Salton published his work on the SMART information retrieval system. It represented a new approach to information retrieval, built on the algebra-based vector space model (VSM), which proved to be a very important part of the data mining toolkit.
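To make the idea of a vector space model concrete, here is a minimal sketch in Python. It is an illustration only, not the SMART system itself: it assumes raw term counts as vector weights (real systems typically use more refined weighting), and the toy documents, the query and the function names are all hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Represent a document as a sparse vector of raw term counts."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document matches nothing
    return dot / (norm_a * norm_b)


# Hypothetical toy collection
docs = {
    "d1": "data mining finds patterns in data",
    "d2": "information retrieval ranks documents",
    "d3": "mining patterns from large data sets",
}

query = term_vector("data mining patterns")

# Rank documents by similarity to the query, most similar first
ranking = sorted(
    docs,
    key=lambda name: cosine_similarity(query, term_vector(docs[name])),
    reverse=True,
)
print(ranking)
```

Documents and queries live in the same vector space, so ranking reduces to comparing angles between vectors. The same similarity measure can be reused for clustering documents.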

From the 1970s through the 1990s, the confluence of disciplines (artificial intelligence, information retrieval, statistics and database systems) and the availability of fast microcomputers opened up new possibilities for retrieving and analyzing data. In 1997, the journal “Data Mining and Knowledge Discovery” was launched; it focuses on computational methods and techniques to aid data analysis, a need driven by advances in data collection and distribution.

In the early 1990s, the huge volume of available data, much of it held in large databases, made new techniques for handling such quantities of information essential. During this decade data mining changed from being an interesting new technology to becoming part of standard business practice.

In 2001, William S. Cleveland published “Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics”, a plan “to enlarge the major areas of technical work of the field of statistics”. In 2003, the Journal of Data Science was launched: “By ‘Data Science’ we mean almost everything that has something to do with data: Collecting, analyzing, modeling”, and so on.

Nowadays data mining is applied in many industries and sectors, such as retail, medicine, telecommunications, banking, finance, pharmaceuticals and marketing.

3. What is Data Mining and Knowledge Discovery?

With the enormous amount of data stored in files, databases, and other repositories, it is increasingly important, if not necessary, to develop powerful means for analysis and perhaps interpretation of such data and for the extraction of interesting knowledge that could help in decision-making.

Data mining, also known as Knowledge Discovery in Databases (KDD), refers to the nontrivial extraction of implicit, previously unknown and potentially useful information from data in databases. Although data mining and KDD are frequently treated as synonyms, data mining is actually one part of the knowledge discovery process.

Witten and Frank define it as follows: “Data Mining refers to the process of finding the interesting patterns in the data that are not explicitly part of the data”. These patterns can tell us something new and can be used to make predictions. The process of data mining is composed of several steps, including selecting the data to analyze, preparing the data, applying the data mining algorithms, and then interpreting and evaluating the results.

The Knowledge Discovery in Databases process comprises a few steps leading from raw data collections to some form of new knowledge. The iterative process consists of the following steps:

  • Data cleaning: in this phase, noisy and irrelevant data are removed from the collection.
  • Data integration: in this step, multiple data sources, often heterogeneous, may be combined into a common source.
  • Data selection: in this step, the data relevant to the analysis are decided on and retrieved from the data collection.
  • Data transformation: also known as data consolidation; in this step, the selected data are transformed into forms appropriate for the mining procedure.
  • Data mining: the crucial step, in which clever techniques are applied to extract potentially useful patterns.
  • Interpretation/Evaluation: in this step, the truly interesting patterns representing knowledge are identified based on given measures.
  • Knowledge representation: in the final step, the discovered knowledge is presented visually to the user. This essential step uses visualization techniques to help users understand and interpret the data mining results.

The first four steps are different forms of data preprocessing, in which the data are prepared for mining. The data mining step may interact with the user. The interesting patterns are presented to the user and may be stored as new knowledge in the knowledge base. Data mining is only one step in the entire process, albeit an essential one, because it uncovers hidden patterns for evaluation.
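The steps listed above can be sketched as a small Python pipeline. This is an illustration under stated assumptions rather than a reference implementation: the two temperature sources, the field names, the use of z-scores as the transformation and the threshold of 1.5 are all hypothetical choices made for this example, and a plain-text report stands in for a real visualization.

```python
from statistics import mean, pstdev

# Two hypothetical sources of temperature readings
source_a = [
    {"id": 1, "temp_c": 21.5, "site": "north"},
    {"id": 2, "temp_c": None, "site": "north"},  # missing measurement
    {"id": 3, "temp_c": 19.0, "site": "south"},
]
source_b = [
    {"id": 4, "temp_c": 35.0, "site": "south"},
    {"id": 5, "temp_c": 20.5, "site": "north"},
]


def clean(records):
    """Data cleaning: drop records with a missing measurement."""
    return [r for r in records if r["temp_c"] is not None]


def integrate(*sources):
    """Data integration: combine several sources into one collection."""
    return [record for source in sources for record in source]


def select(records, fields):
    """Data selection: keep only the fields relevant to the analysis."""
    return [{f: r[f] for f in fields} for r in records]


def transform(records):
    """Data transformation: standardize temperatures to z-scores.

    Assumes the readings are not all identical (non-zero spread).
    """
    temps = [r["temp_c"] for r in records]
    mu, sigma = mean(temps), pstdev(temps)
    return [{"id": r["id"], "z": (r["temp_c"] - mu) / sigma} for r in records]


def mine(records):
    """Data mining: rank records by how far they deviate from the mean."""
    return sorted(records, key=lambda r: abs(r["z"]), reverse=True)


def evaluate(patterns, min_abs_z=1.5):
    """Interpretation/evaluation: keep only sufficiently large deviations."""
    return [p for p in patterns if abs(p["z"]) >= min_abs_z]


def present(patterns):
    """Knowledge representation: format the findings for the user."""
    return [f"record {p['id']}: z = {p['z']:+.2f}" for p in patterns]


combined = integrate(clean(source_a), clean(source_b))
selected = select(combined, ["id", "temp_c"])
standardized = transform(selected)
candidates = mine(standardized)
interesting = evaluate(candidates)
print("\n".join(present(interesting)))
```

Keeping each step as its own function mirrors the iterative nature of the process: any stage can be revisited or swapped out without touching the others.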

4. Data Mining Methods

There are different types of data mining methods such as summarization, classification, clustering, regression, dependency modeling and change and deviation detection.

  • Classification: Classification builds a model from a set of already classified data items so that the model can be used to classify items that have no class label.
  • Clustering: Clustering partitions the data into a finite set of groups (clusters). The clusters can be mutually exclusive, hierarchical or overlapping. Each member of a cluster should be very similar to the other members of its cluster and dissimilar to members of other clusters.
  • Regression: Regression analysis maps a data item to a real-valued prediction variable. Regression algorithms include regression trees, that is, decision trees with averaged values at the leaves.
  • Summarization: Summarization is a key data mining concept that involves techniques for finding a compact description of a data set. Simple summarization methods, such as tabulating means and standard deviations, are often applied in data analysis, data visualization and automated report generation.
  • Dependency Modeling (Association Rules): Dependency modeling, or association rule mining, searches for interesting relationships between items in a data set. Market basket analysis is a simple example of this model. The goal of association rule mining is to extract interesting correlations, frequent patterns, associations or causal structures among sets of items in transaction databases or other data repositories.
  • Change and Deviation Detection: Change and deviation detection focuses on discovering the most significant changes in the data from previously measured or normative values.
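As an illustration of dependency modeling, the market basket example can be sketched in a few lines of Python. The sketch assumes the two standard interestingness measures for association rules, support and confidence. The transactions, the thresholds and the function names are hypothetical, and the brute-force search over item pairs is only suitable for toy data; practical systems use algorithms such as Apriori.

```python
from itertools import combinations

# Hypothetical shopping baskets
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0  # the antecedent never occurs
    joint = support(set(antecedent) | set(consequent), transactions)
    return joint / base


def pair_rules(transactions, min_support, min_confidence):
    """Find rules lhs -> rhs between single items by brute force."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        if support({a, b}, transactions) < min_support:
            continue  # the pair is too rare to be interesting
        for lhs, rhs in ((a, b), (b, a)):
            conf = confidence({lhs}, {rhs}, transactions)
            if conf >= min_confidence:
                rules.append((lhs, rhs, conf))
    return rules


rules = pair_rules(transactions, min_support=0.4, min_confidence=0.7)
for lhs, rhs, conf in rules:
    print(f"{lhs} -> {rhs} (confidence {conf:.2f})")
```

Note that a rule and its reverse can differ sharply in confidence: in this toy data every basket with butter also contains bread, but only half of the baskets with bread contain butter.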

5. Application of Data Mining in Science and Technology

The field of data mining has been growing rapidly owing to its broad applicability, its achievements and its contribution to scientific progress and understanding. Data mining has been applied in various domains, such as astronomy, remote sensing, medical science, security and surveillance, computer simulation, information retrieval and chemistry.

5.1 Astronomy: Astronomy has a long history of acquiring, systematizing and interpreting large quantities of data. From the earliest sky atlases through the first major photographic sky surveys of the 20th century, this tradition continues today at an ever-increasing rate. Astronomers have classically focused on clustering and classification problems as standard practice in their research.

5.2 Remote Sensing: In recent years, with the development of remote sensing and data storage techniques, a great deal of image data is generated every day. Remote sensing images, whether obtained from satellites or aerial photography, are a very rich source of data for analysis.

Remote sensing systems play an important role in monitoring the Earth (global climate change detection through the identification of deforestation and global warming; yield prediction in agriculture; land use mapping for urban growth; resource exploration for minerals and natural gas; and military surveillance and reconnaissance for tactical assessment), in identifying man-made structures (such as buildings, roads, bridges and airports), in meteorology (analyzing and predicting typhoons from satellite images that capture their cloud patterns), and in analyzing high-resolution satellite imagery from different sensors.

5.3 Medical science: In recent years, data mining has been widely used in areas of medical science such as biomedicine, DNA analysis, genetics and medicine. In genetics, an important goal is to understand the mapping between variation in human DNA sequences and disease susceptibility. Data mining is a very important tool for improving the diagnosis, prevention and treatment of diseases.

Image processing also plays an important role in biomedical data mining, for data such as electrocardiograms (ECG), electroencephalograms (EEG), magnetic resonance images (MRI) and functional magnetic resonance imaging (fMRI). Image processing also helps to present complex gene structures as graphs, trees and chains. This visual representation aids the understanding of complex gene structures, knowledge discovery and data exploration.

5.4 Security and Surveillance: Another broad and emerging area of data mining research is security and surveillance. The field of privacy-preserving data mining (PPDM) has been around for seven years. Applications in this area are diverse and include biometrics with fingerprint, iris, face, signature and voice recognition; automated target recognition in aerial and satellite imagery; video surveillance; and network intrusion detection.

5.5 Computer Simulation: Computer simulations often generate large data sets whose sheer size and complexity make them difficult to analyze. Data mining plays an important role in the analysis of simulation data sets in many ways, such as the detection of coherent structures, dimension reduction, code validation and understanding simulations.

5.6 Information Retrieval: Information retrieval research combines techniques from machine learning and other theoretical models with extensive experimentation to develop more accurate, faster and more advanced retrieval and search techniques. Research topics include retrieval models, new features, optimization and learning, and measurement and effectiveness.

5.7 Chemistry: The analysis of data sets is one of the most important tasks in the investigation of the properties of chemical compounds. Especially in drug design, methods are used that characterize complete sets of chemical compounds instead of describing individual molecules. Data mining, i.e. the exploration of large amounts of data in search of consistent patterns, correlations and other systematic relationships, can be a helpful tool for evaluating "hidden" information in a set of molecules.

Data Mining Service - Chemistry (DMSC) is a project for the development of a centralized service for the exploration of chemical data sets. With this service it will be possible to analyze chemical data sets for molecular patterns and systematic relationships using the following methods:

  • statistical analyses of individual molecules within a data set,
  • self-organizing neural networks for the characterization of complex properties of molecules, e.g., biological activity,
  • genetic algorithms for the optimization of fuzzy results of data analysis,
  • expert systems that are able to provide proposals for the complex information space produced during data analysis.

DMSC opens a new way of chemical information processing using the newest WWW techniques to visualize complex trends, patterns and relationships in chemical datasets in a most effective way.

6. Conclusion: In this paper we briefly reviewed the historical development of data mining and its various applications in science and technology. Although only a few areas are named in this paper, they are ones that are commonly forgotten. This paper offers a researcher's new perspective on applications of data mining in science and technology.
