

Using Data Mining to Enhance Web Search Engines

1. Introduction
The amount of information potentially available from the World Wide Web (WWW), including web pages, page links, accessible documents, and databases, continues to increase. The WWW has added an abundance of data and information, and with it added complexity, for disseminators and users alike. With this complexity has come the problem of finding useful and relevant information. Current search engines are primarily traditional keyword-matching tools.
Data mining, defined as the process of extracting previously unknown knowledge and detecting interesting patterns from a massive set of data, has been a very active research area. Data mining could be used to enhance the work of search engines for mining the web.
This article gives an overview of search engines, data mining, and web mining.

2. Search Engines
With so much data on the Internet, it can be difficult and frustrating, and can even seem impossible, to find the exact information you need. The web is enormous and growing at an incredibly fast pace. It has been said that if you spent only one minute per page, 10 hours a day, it would take four-and-a-half years to explore only 1 million web pages. Thus, a real need exists for some way to search this huge resource. There are many powerful search utilities on the web, such as Yahoo!, AltaVista, Lycos, InfoSeek, Excite, and WebCrawler. In the Internet world, these search utilities are called search engines.
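The browsing estimate quoted above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch; the variable names are illustrative and a 365-day year is assumed.

```python
# Back-of-the-envelope check of the browsing estimate quoted in the text.
MINUTES_PER_PAGE = 1
HOURS_PER_DAY = 10
PAGES = 1_000_000

pages_per_day = HOURS_PER_DAY * 60 // MINUTES_PER_PAGE  # pages read in one day
days_needed = PAGES / pages_per_day                     # days to read every page
years_needed = days_needed / 365                        # assuming 365-day years

print(f"{pages_per_day} pages/day, {days_needed:.0f} days, {years_needed:.1f} years")
```

At 600 pages a day the result comes out at a little over four and a half years, which agrees with the figure in the text.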
Search engines are built on large databases. These databases contain information about web pages that have been registered with a particular search engine, such as Yahoo!. At Yahoo!, registrations are entered by humans, who categorize entries by subject. Through keywords, you can find information on any subject that you need to investigate. A search engine database typically contains, for each page, the title, the URL, a short abstract of the contents, and keywords to help the search engine. Not every web page is registered with every search engine: some web site managers register their site only with particular search engines, neglecting others. Because of this, and because of the vast array of search engines, you may receive completely different results from each search engine.
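As a rough illustration of the database described above, the sketch below models one registered page as a record with a title, URL, abstract, and keywords, and shows why two engines with different registrations return different results. The names `PageRecord`, `find_by_keyword`, `engine_one`, and `engine_two`, and the example URLs, are hypothetical and do not describe any real search engine.

```python
from dataclasses import dataclass, field

@dataclass
class PageRecord:
    """One registered page, holding the fields listed in the text."""
    title: str
    url: str
    abstract: str
    keywords: set = field(default_factory=set)

def find_by_keyword(database, keyword):
    """Return the URLs of all records registered under the given keyword."""
    wanted = keyword.lower()
    return [record.url for record in database if wanted in record.keywords]

# Two hypothetical pages with made-up URLs.
page_a = PageRecord("Data Mining Basics", "http://example.com/dm",
                    "An overview of data mining.", {"data", "mining"})
page_b = PageRecord("Web Mining Notes", "http://example.org/wm",
                    "Notes on mining the web.", {"web", "mining"})

# Two hypothetical engines holding different registrations.
engine_one = [page_a, page_b]
engine_two = [page_b]  # page_a was never registered with this engine

results_one = find_by_keyword(engine_one, "mining")
results_two = find_by_keyword(engine_two, "mining")
print(results_one)
print(results_two)
```

The same query returns two URLs from the first engine and only one from the second, because the second engine has no record of the unregistered page.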
Depending on the particular search engine, a web site can be indexed, scored, and ranked using many different methods. Search engines’ ranking algorithms are often based on the position and frequency of keywords. Web pages that contain more instances of a keyword, or contain it in more prominent positions, tend to receive a higher document ranking. Search engines usually present the user with the top 10 to 20 most relevant hits.
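A minimal sketch of keyword-based ranking, assuming a deliberately simple scoring rule: the number of occurrences of the keyword plus a bonus for an early first occurrence. The formula and the names `keyword_score` and `rank` are illustrative assumptions, not the algorithm of any actual search engine.

```python
import re

def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def keyword_score(text, keyword):
    """Score a page by keyword frequency plus an early-position bonus."""
    words = tokenize(text)
    keyword = keyword.lower()
    positions = [i for i, word in enumerate(words) if word == keyword]
    if not positions:
        return 0.0
    frequency = len(positions)
    # Bonus in (0, 1]: it is 1.0 when the keyword is the very first word.
    position_bonus = 1.0 - positions[0] / len(words)
    return frequency + position_bonus

def rank(pages, keyword, top_n=10):
    """Return up to top_n URLs whose text contains the keyword, best first."""
    scored = [(keyword_score(text, keyword), url) for url, text in pages.items()]
    scored = [item for item in scored if item[0] > 0]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [url for _, url in scored[:top_n]]

# Hypothetical pages keyed by made-up URLs.
pages = {
    "http://example.com/a": "Mining data: data mining basics",
    "http://example.com/b": "An introduction to web mining",
    "http://example.com/c": "Cooking recipes",
}
ranking = rank(pages, "mining")
print(ranking)
```

The first page mentions the keyword twice and opens with it, so it ranks above the second; the third page never mentions it and is dropped.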

3. Data Mining
Data mining derives its name from the similarities between searching for valuable information in a large database and mining rocks for a vein of valuable ore. The broader term Knowledge Discovery in Databases (KDD) describes a more complete process. Data mining is being applied to and studied for many kinds of data stores: relational, object-relational, and object-oriented databases; data warehouses; transactional databases; unstructured and semi-structured repositories such as the World Wide Web; advanced databases such as spatial, multimedia, time-series, and textual databases; and even flat files.
The data mining functionalities and the variety of knowledge they discover are briefly presented in the following list:
Characterization: Data characterization is a summarization of general features of objects in a target class, and produces what is called characteristic rules.
Discrimination: Data discrimination produces what are called discriminant rules and is basically a comparison of the general features of objects between two classes, referred to as the target class and the contrasting class.
Association analysis: Association analysis is the discovery of what are commonly called association rules. It studies the frequency of items occurring together in transactional databases and, based on a threshold called support, identifies the frequent item sets. Another threshold, confidence, which is the conditional probability that an item appears in a transaction given that another item appears, is used to pinpoint association rules.
Classification: Classification analysis is the organization of data in given classes. Also known as supervised classification, the classification uses given class labels to order the objects in the data collection. Classification approaches normally use a training set where all objects are already associated with known class labels. The classification algorithm learns from the training set and builds a model. The model is used to classify new objects.
Clustering: Similar to classification, clustering is the organization of data in classes. However, unlike classification, in clustering, class labels are unknown and it is up to the clustering algorithm to discover acceptable classes.
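The support and confidence thresholds used in association analysis can be illustrated with a small sketch. The transactions below are made-up data, and the function names `support` and `confidence` are chosen here for illustration only.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for transaction in transactions if itemset <= transaction)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate P(consequent | antecedent) as support(both) / support(antecedent)."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0
    both = support(set(antecedent) | set(consequent), transactions)
    return both / base

bread_milk_support = support({"bread", "milk"}, transactions)
bread_to_milk_confidence = confidence({"bread"}, {"milk"}, transactions)
print(bread_milk_support, bread_to_milk_confidence)
```

Here bread and milk occur together in 2 of the 4 transactions, so the support is 0.5. Bread appears in 3 transactions, so the confidence of the rule "bread implies milk" is 2/3. A rule would be reported only if both values exceed the chosen thresholds.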

4. Web Mining
Presently an enormous wealth of information is available on the Web. The objective is to mine interesting nuggets of information, such as which airline has the cheapest flights in December or where to find an old friend. The Internet is arguably the largest multimedia data repository or library that has ever existed.
It is the most disorganized library as well. Hence mining the Web is a challenge. The Web is a huge collection of documents that comprises (i) semi-structured (HTML, XML) information, (ii) hyperlink information, and (iii) access and usage information, and that is (iv) dynamic.
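To make the hyperlink information in semi-structured HTML concrete, here is a minimal sketch that pulls link targets out of a page using Python's standard `html.parser` module. The class name `LinkExtractor`, the helper `extract_links`, and the sample markup are assumptions made for this example.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in the parsed HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lowercase.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html):
    """Return all link targets found in an HTML string, in document order."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links

# Hypothetical page content.
sample = (
    '<html><body><a href="http://example.com/a">A</a>'
    '<p>Some text</p><a href="/b">B</a></body></html>'
)
links = extract_links(sample)
print(links)
```

The extracted links are the raw material for analysing how pages point to one another, while the surrounding text and usage logs would be handled separately.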
The Web has made it cheaper for a wider audience to access various sources of information. The advances in all kinds of digital communication have provided greater access to networks. It has also created free access to a large publishing medium. These factors have allowed people to use the Web and modern digital libraries as a highly interactive medium. However, present-day search engines are plagued by several problems: the abundance problem, as 99% of the information is of no interest to 99% of the people; limited coverage of the Web, as Internet sources are hidden behind search interfaces; a limited query interface, based on keyword-oriented search; and limited customization to individual users.
Web mining refers to the use of data mining techniques to automatically retrieve, extract, and evaluate (generalize or analyze) information for knowledge discovery from Web documents and services. Considering the Web as a huge repository of distributed hypertext, the results from text mining have great influence on Web mining and information retrieval. Web data are typically unlabeled, distributed, heterogeneous, semi-structured, time-varying, and high-dimensional. Hence some sort of human interface is needed to handle context-sensitive and imprecise queries and to provide for summarization, deduction, personalization, and learning. The major components of Web mining are information retrieval, information extraction, generalization, and analysis.
Information retrieval refers to the automatic retrieval of relevant documents, using document indexing and search engines. Information extraction helps identify document fragments that constitute the semantic core of the Web. Generalization relates to aspects from pattern recognition or machine learning, and it utilizes clustering and association rule mining. Analysis corresponds to the extraction, interpretation, validation, and visualization of the knowledge obtained from the Web.
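The document indexing behind information retrieval is commonly done with an inverted index, which maps each word to the documents that contain it. The sketch below is one simple way to build and query such an index; the function names `build_index` and `search` and the sample documents are illustrative assumptions, and real systems add ranking, stemming, and much more.

```python
import re
from collections import defaultdict

def words_of(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in words_of(text):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every word of the query."""
    words = words_of(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical document collection.
documents = {
    "d1": "Web mining uses data mining techniques",
    "d2": "Search engines index web pages",
    "d3": "Data mining finds patterns",
}
index = build_index(documents)
data_mining_hits = search(index, "data mining")
web_hits = search(index, "web")
print(sorted(data_mining_hits), sorted(web_hits))
```

A query is answered by intersecting the document sets of its words, so no document has to be rescanned at search time.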

MSc, PhD degree in Computer Science.

Head of the Management Information Systems, Department Council for the Faculty of Information Technology and Computer Science in Delmon University for Science and Technology. Member of the Information and Communication Technology Knowledge Club (IKC) in Kingdom of Bahrain. Member of the Social Media Club (SMC), Bahrain Chapter.

Member of the Bahrain Internet Society (BIS). Supervisor for many Undergraduate Final Projects. Supervisor for man