Text mining

Text mining, text data mining (TDM) or text analytics is the process of deriving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources."^[1] Written resources may include websites, books, emails, reviews, and articles. High-quality information is typically obtained by identifying patterns and trends through means such as statistical pattern learning. According to Hotho et al. (2005), there are three different perspectives of text mining: information extraction, data mining, and a knowledge discovery in databases (KDD) process.^[2] Text mining usually involves structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluating and interpreting the output. "High quality" in text mining usually refers to some combination of relevance, novelty, and interest. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).

Text analysis involves information retrieval, lexical analysis to study word frequency distributions, pattern recognition, tagging/annotation, information extraction, data mining techniques including link and association analysis, visualization, and predictive analytics. The overarching goal is, essentially, to turn text into data for analysis, via the application of natural language processing (NLP), different types of algorithms and analytical methods. An important phase of this process is the interpretation of the gathered information.
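
As a rough illustration of lexical analysis of word frequency distributions, the following Python sketch counts how often each word occurs in a short text. The function name `word_frequencies`, the regular-expression tokenizer, and the sample sentence are assumptions made for this example; real systems usually use more careful tokenization.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Return a Counter mapping each lowercase word to its number of occurrences."""
    # Assumption: a "word" is a run of ASCII letters or apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

sample = "Text mining turns text into data. Data drives analysis."
freqs = word_frequencies(sample)

# Print the three most frequent words with their counts.
for word, count in freqs.most_common(3):
    print(word, count)
```

A frequency table of this kind is one simple way to turn text into data that later analytical steps can use.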

A typical application is to scan a set of documents written in a natural language and either model the document set for predictive classification purposes or populate a database or search index with the information extracted. The document is the basic element when starting with text mining. Here, we define a document as a unit of textual data, which normally exists in many types of collections.^[3]

Dimensionality reduction is an important technique for pre-processing data. It is used to identify the root word for actual words and reduce the size of the text data.

Information retrieval or identification of a corpus is a preparatory step: collecting or identifying a set of textual materials, on the Web or held in a file system, database, or content corpus manager, for analysis.

Although some text analytics systems apply exclusively advanced statistical methods, many others apply more extensive natural language processing, such as part of speech tagging, syntactic parsing, and other types of linguistic analysis.^[9]

Named entity recognition is the use of gazetteers or statistical techniques to identify named text features: people, organizations, place names, stock ticker symbols, certain abbreviations, and so on.
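
A minimal sketch of the gazetteer approach to named entity recognition might look like the following. The gazetteer entries, the labels, and the function name `find_entities` are hypothetical and chosen only for illustration; statistical recognizers work quite differently.

```python
import re

# Hypothetical gazetteer: surface form -> entity type (illustrative entries only).
GAZETTEER = {
    "University of Manchester": "ORGANIZATION",
    "University of Tokyo": "ORGANIZATION",
    "Berkeley": "PLACE",
}

def find_entities(text, gazetteer=GAZETTEER):
    """Return sorted (start offset, surface form, type) tuples for gazetteer hits."""
    matches = []
    for name, label in gazetteer.items():
        # Match the entry only at word boundaries, exactly as written.
        for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            matches.append((m.start(), name, label))
    return sorted(matches)

text = ("NaCTeM is operated by the University of Manchester "
        "with partners at the University of Tokyo.")
entities = find_entities(text)
for start, name, label in entities:
    print(start, name, label)
```

Exact string lookup of this kind cannot tell apart different entities that share a name, which is why a separate disambiguation step is often needed.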

Disambiguation, the use of contextual clues, may be required to decide whether, for instance, "Ford" refers to a former U.S. president, a vehicle manufacturer, a movie star, a river crossing, or some other entity.^[10]

Recognition of pattern-identified entities: features such as telephone numbers, e-mail addresses, and quantities (with units) can be discerned via regular expressions or other pattern matches.
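
Recognition of pattern-identified entities can be sketched with a few regular expressions. The patterns below are deliberately simplified assumptions (one phone format, a handful of units) and the name `extract_pattern_entities` is hypothetical; production patterns would need to cover far more variation.

```python
import re

# Deliberately simple, illustrative patterns; real-world formats vary far more.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "quantity": re.compile(r"\b\d+(?:\.\d+)?\s?(?:kg|km|cm|mg|ml)\b"),
}

def extract_pattern_entities(text):
    """Return a dict mapping each pattern label to the list of matching strings."""
    return {label: pattern.findall(text) for label, pattern in PATTERNS.items()}

text = "Call 555-123-4567 or write to info@example.org about the 2.5 kg parcel."
found = extract_pattern_entities(text)
print(found)
```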

Document clustering: identification of sets of similar text documents.^[11]
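
To make the idea of grouping similar documents concrete, the sketch below represents each document as a bag of words and groups documents by cosine similarity. The greedy single-pass grouping and the threshold of 0.3 are simplifying assumptions made for this example, not an established clustering algorithm such as k-means; all function names are hypothetical.

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Bag-of-words vector: word -> count."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def cluster(docs, threshold=0.3):
    """Greedy grouping: a document joins the first cluster whose first
    member is similar enough; otherwise it starts a new cluster."""
    vectors = [vectorize(d) for d in docs]
    clusters = []  # each cluster is a list of document indices
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

docs = [
    "stock prices fell as markets closed",
    "the team won the football match",
    "markets rallied and stock prices rose",
    "football fans cheered the winning team",
]
groups = cluster(docs)
print(groups)
```

With these four sample sentences, the two finance sentences end up in one group and the two football sentences in another.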

Coreference: identification of noun phrases and other terms that refer to the same object.

Relationship, fact, and event extraction: identification of associations among entities and other information in texts.

Sentiment analysis involves discerning subjective (as opposed to factual) material and extracting various forms of attitudinal information: sentiment, opinion, mood, and emotion. Text analytics techniques help analyze sentiment at the entity, concept, or topic level and distinguish opinion holders and objects.^[12]
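
One very simple way to approximate sentiment is to count words from a positive and a negative word list. The tiny lexicon and the function name `sentiment_score` below are hypothetical, and the sketch works only at the level of a whole text: it ignores negation, sarcasm, and the entity-level or topic-level distinctions described above.

```python
import re

# Hypothetical, tiny sentiment lexicon for illustration only.
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "poor", "terrible", "hate"}

def sentiment_score(text):
    """Return (#positive words - #negative words); above zero suggests a positive tone."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)

print(sentiment_score("I love it, excellent service"))
print(sentiment_score("The battery is great but the screen is terrible and the price is bad"))
```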

Quantitative text analysis is a set of techniques stemming from the social sciences where either a human judge or a computer extracts semantic or grammatical relationships between words in order to find out the meaning or stylistic patterns of, usually, a casual personal text for the purpose of psychological profiling etc.^[13]

The items above are typical subtasks, that is, components of a larger text-analytics effort.

Pre-processing usually involves tasks such as tokenization, filtering and stemming.
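
The pre-processing tasks of tokenization, filtering and stemming can be sketched as a small pipeline. The stop-word list and the suffix-stripping rules below are crude assumptions made for illustration (this is not the Porter stemmer or any other published algorithm), and the function names are hypothetical.

```python
import re

# Illustrative subset of stop words; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}
# Crude suffix rules, tried in order; not the Porter algorithm.
SUFFIXES = ("ing", "ed", "es", "s")

def tokenize(text):
    """Split text into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def filter_tokens(tokens):
    """Drop stop words."""
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, filter, then stem."""
    return [stem(t) for t in filter_tokens(tokenize(text))]

result = preprocess("The miners are mining texts and mined the documents")
print(result)
```

In this example "mining" and "mined" reduce to the same stem, which shows how stemming shrinks the vocabulary and so reduces the size of the text data.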

The National Centre for Text Mining (NaCTeM) is the first publicly funded text mining centre in the world. NaCTeM is operated by the University of Manchester^[37] in close collaboration with the Tsujii Lab,^[38] University of Tokyo.^[39] NaCTeM provides customised tools and research facilities and offers advice to the academic community. It is funded by the Joint Information Systems Committee (JISC) and two of the UK research councils (EPSRC & BBSRC). With an initial focus on text mining in the biological and biomedical sciences, research has since expanded into the areas of social sciences.

In the United States, the School of Information at the University of California, Berkeley is developing a program called BioText to assist biology researchers in text mining and analysis.

The Text Analysis Portal for Research (TAPoR), currently housed at the University of Alberta, is a scholarly project to catalogue text analysis applications and create a gateway for researchers new to the practice.

Implications

Until recently, websites most often used text-based searches, which only found documents containing specific user-defined words or phrases. Now, through use of a semantic web, text mining can find content based on meaning and context (rather than just by a specific word). Additionally, text mining software can be used to build large dossiers of information about specific people and events. For example, large datasets based on data extracted from news reports can be built to facilitate social network analysis or counter-intelligence. In effect, the text mining software may act in a capacity similar to an intelligence analyst or research librarian, albeit with a more limited scope of analysis. Text mining is also used in some email spam filters as a way of determining the characteristics of messages that are likely to be advertisements or other unwanted material. Text mining also plays an important role in determining financial market sentiment.
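
The spam-filtering use mentioned above could be sketched with a small naive Bayes text classifier. The choice of naive Bayes with add-one smoothing is an assumption made for this example (the text does not say which method such filters use), and the class name, labels, and training messages are all hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of word occurrences
        self.doc_counts = Counter()  # label -> number of training messages
        self.vocab = set()

    def train(self, text, label):
        tokens = tokenize(text)
        self.word_counts.setdefault(label, Counter()).update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # Log prior plus the sum of smoothed log likelihoods per token.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(counts.values())
            for t in tokenize(text):
                score += math.log((counts[t] + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical training messages.
clf = NaiveBayes()
clf.train("win a free prize now", "spam")
clf.train("free money offer click now", "spam")
clf.train("meeting agenda for monday", "ham")
clf.train("please review the project report", "ham")

print(clf.predict("claim your free prize"))
print(clf.predict("project meeting on monday"))
```

The classifier learns which words are characteristic of each class from labelled examples, which is one way of determining the characteristics of unwanted messages.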

Ananiadou, S. and McNaught, J. (Editors) (2006). Text Mining for Biology and Biomedicine. Artech House Books. ISBN 978-1-58053-984-5

Bilisoly, R. (2008). Practical Text Mining with Perl. New York: John Wiley & Sons. ISBN 978-0-470-17643-6

Feldman, R., and Sanger, J. (2006). The Text Mining Handbook. New York: Cambridge University Press. ISBN 978-0-521-83657-9

Hotho, A., Nürnberger, A. and Paaß, G. (2005). "A brief survey of text mining". In LDV Forum, Vol. 20(1), pp. 19-62

Indurkhya, N., and Damerau, F. (2010). Handbook of Natural Language Processing, 2nd Edition. Boca Raton, FL: CRC Press. ISBN 978-1-4200-8592-1

Kao, A., and Poteet, S. (Editors). Natural Language Processing and Text Mining. Springer. ISBN 1-84628-175-X

Konchady, M. Text Mining Application Programming (Programming Series). Charles River Media. ISBN 1-58450-460-9

Manning, C., and Schutze, H. (1999). Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press. ISBN 978-0-262-13360-9

Miner, G., Elder, J., Hill, T., Nisbet, R., Delen, D. and Fast, A. (2012). Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications. Elsevier Academic Press. ISBN 978-0-12-386979-1

McKnight, W. (2005). "Building business intelligence: Text data mining in business intelligence". DM Review, 21-22.

Srivastava, A., and Sahami, M. (2009). Text Mining: Classification, Clustering, and Applications. Boca Raton, FL: CRC Press. ISBN 978-1-4200-5940-3

Zanasi, A. (Editor) (2007). Text Mining and its Applications to Intelligence, CRM and Knowledge Management. WIT Press. ISBN 978-1-84564-131-3

Marti Hearst: What Is Text Mining? (October, 2003)



Automatic Content Extraction, Linguistic Data Consortium

Automatic Content Extraction, NIST