"Content-based" versus "request-based" classification[edit]

Content-based classification is classification in which the weight given to particular subjects in a document determines the class to which the document is assigned. For example, a common rule in library classification is that at least 20% of a book's content should be about the class to which the book is assigned.[1] In automatic classification, the weight could be the number of times given words appear in a document.
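The word-count idea can be illustrated with a minimal Python sketch. The subject vocabularies, the function names and the reuse of the 20% figure as a token-share threshold are all assumptions made for illustration; the library rule concerns a book's content as judged by a cataloguer, not literal token counts.

```python
from collections import Counter

# Hypothetical subject vocabularies; a real system would use curated or learned term lists.
SUBJECT_TERMS = {
    "astronomy": {"star", "planet", "galaxy", "telescope"},
    "cooking": {"recipe", "oven", "flour", "simmer"},
}

def subject_shares(text):
    """Return, per subject, the fraction of tokens that belong to its vocabulary."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        return {subject: 0.0 for subject in SUBJECT_TERMS}
    return {
        subject: sum(counts[term] for term in terms) / total
        for subject, terms in SUBJECT_TERMS.items()
    }

def classify_by_content(text, threshold=0.2):
    """Assign every subject whose share of the text reaches the threshold."""
    shares = subject_shares(text)
    return sorted(subject for subject, share in shares.items() if share >= threshold)
```

Calling `classify_by_content` on a sentence in which four of eleven words are astronomy terms would assign it to "astronomy" only, since the cooking share is zero.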


Request-oriented classification (or request-oriented indexing) is classification in which the anticipated requests from users influence how documents are classified. The classifier asks themselves “Under which descriptors should this entity be found?” and proceeds to “think of all the possible queries and decide for which ones the entity at hand is relevant” (Soergel, 1985, p. 230[2]).


Request-oriented classification may be classification that is targeted towards a particular audience or user group. For example, a library or a database for feminist studies may classify or index documents differently from a historical library. It is probably better, however, to understand request-oriented classification as policy-based classification: the classification is done according to some ideals and reflects the purpose of the library or database doing the classification. In this way it is not necessarily a kind of classification or indexing based on user studies. Only if empirical data about use or users are applied should request-oriented classification be regarded as a user-based approach.

Classification versus indexing

Sometimes a distinction is made between assigning documents to classes ("classification") and assigning subjects to documents ("subject indexing"), but as Frederick Wilfrid Lancaster has argued, this distinction is not fruitful. “These terminological distinctions,” he writes, “are quite meaningless and only serve to cause confusion” (Lancaster, 2003, p. 21[3]). The view that this distinction is purely superficial is also supported by the fact that a classification system may be transformed into a thesaurus and vice versa (cf. Aitchison, 1986,[4] 2004;[5] Broughton, 2008;[6] Riesthuis & Bliedung, 1991[7]). Therefore, labeling a document (say, by assigning it a term from a controlled vocabulary) is at the same time assigning that document to the class of documents indexed by that term: all documents indexed or classified as X belong to the same class of documents.
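The equivalence between labeling and class assignment can be made concrete with a small Python sketch. The document identifiers, the terms and the function name `build_classes` are hypothetical; the point is only that inverting a document-to-terms record yields, for each term, the class of documents indexed under it.

```python
from collections import defaultdict

def build_classes(index_terms):
    """Invert a document -> terms mapping into a term -> documents mapping."""
    classes = defaultdict(set)
    for doc_id, terms in index_terms.items():
        for term in terms:
            classes[term].add(doc_id)
    return dict(classes)

# Hypothetical indexing records: each document with its assigned controlled-vocabulary terms.
index_terms = {
    "doc1": {"feminism", "history"},
    "doc2": {"history"},
    "doc3": {"feminism"},
}

# Each key is an index term; each value is the class of documents labeled with it.
classes = build_classes(index_terms)
```

Both views hold the same information: `index_terms` reads as subject indexing, while `classes` reads as classification.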

Techniques used for automatic document classification include:

- Artificial neural network
- Concept mining
- Decision trees such as ID3 or C4.5
- Expectation maximization (EM)
- Instantaneously trained neural networks
- Latent semantic indexing
- Multiple-instance learning
- Naive Bayes classifier
- Natural language processing approaches
- Rough set-based classifier
- Soft set-based classifier
- Support vector machines (SVM)
- K-nearest neighbour algorithms
- tf–idf
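Two of the techniques listed above, tf–idf weighting and nearest-neighbour classification, can be combined in a short Python sketch. This is one possible formulation under stated assumptions: raw term frequency, idf computed as log(N / document frequency), cosine similarity, and a single nearest neighbour. The training texts, labels and function names are invented for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(token_lists):
    """Weight each term by raw term frequency times idf = log(N / document frequency)."""
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in token_lists
    ]
    return vectors, idf

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def nearest_label(text, vectors, labels, idf):
    """Return the label of the training document most similar to the text (1-nearest neighbour)."""
    counts = Counter(text.lower().split())
    # Terms unseen in training get weight 0 because they have no idf value.
    query = {term: tf * idf.get(term, 0.0) for term, tf in counts.items()}
    scores = [cosine(query, vector) for vector in vectors]
    return labels[max(range(len(scores)), key=scores.__getitem__)]

# Toy training data, invented for illustration.
training = [
    ("goal scored in the football match", "sports"),
    ("team wins football cup", "sports"),
    ("parliament passes new tax law", "politics"),
    ("minister debates tax bill in parliament", "politics"),
]
vectors, idf = tfidf_vectors([text.split() for text, _ in training])
labels = [label for _, label in training]
```

With this data, `nearest_label("football team scored", vectors, labels, idf)` picks a sports document because the politics documents share no terms with the query. A text made up entirely of unseen words scores zero against every document, so the sketch would fall back to the first training label; a practical system would handle that case explicitly.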

Classification techniques have been applied to:

- spam filtering, a process which tries to discern email spam messages from legitimate emails
- email routing, forwarding an email sent to a general address to a specific address or mailbox depending on topic[15]
- language identification, automatically determining the language of a text
- genre classification, automatically determining the genre of a text[16]
- readability assessment, automatically determining the degree of readability of a text, either to find suitable materials for different age groups or reader types or as part of a larger text simplification system
- sentiment analysis, determining the attitude of a speaker or a writer with respect to some topic or the overall contextual polarity of a document
- health-related classification using social media in public health surveillance[17]
- article triage, selecting articles that are relevant for manual literature curation, for example as the first step in generating manually curated annotation databases in biology[18]
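Spam filtering, one of the applications named above, is often introduced with a naive Bayes classifier. The following Python sketch assumes a multinomial model with add-one (Laplace) smoothing and whitespace tokenisation; the class name `NaiveBayesText` and the four training messages are hypothetical, and a production filter would need far more data and preprocessing.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesText:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(documents, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocabulary = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        tokens = text.lower().split()
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocabulary)
            # Log prior plus the sum of smoothed log likelihoods of each token.
            score = math.log(n_docs / total_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy training messages, invented for illustration.
messages = [
    "win free money now",
    "free prize claim now",
    "meeting agenda for monday",
    "lunch with the project team",
]
message_labels = ["spam", "spam", "ham", "ham"]
spam_filter = NaiveBayesText().fit(messages, message_labels)
```

Here `spam_filter.predict("claim your free money")` favours "spam" because three of its four words occur only in the spam messages, while `spam_filter.predict("project meeting on monday")` favours "ham". Working in log space avoids numerical underflow when messages are long.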

Further reading

Fabrizio Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47, 2002.

Stefan Büttcher, Charles L. A. Clarke, and Gordon V. Cormack. Information Retrieval: Implementing and Evaluating Search Engines. MIT Press, 2010.
