Multimedia information retrieval explained

Multimedia information retrieval (MMIR or MIR) is a research discipline of computer science that aims at extracting semantic information from multimedia data sources.^[1] Data sources include directly perceivable media such as audio, image and video; indirectly perceivable sources such as text, semantic descriptions^[2] and biosignals; and non-perceivable sources such as bioinformation and stock prices. The methodology of MMIR can be organized in three groups:

Methods for the summarization of media content (feature extraction). The result of feature extraction is a description.
Methods for the filtering of media descriptions (for example, elimination of redundancy).
Methods for the categorization of media descriptions into classes.

Feature extraction methods

Feature extraction is motivated by the sheer size of multimedia objects as well as their redundancy and, possibly, noisiness.^[1] Generally, two possible goals can be achieved by feature extraction:

Summarization of media content. In the audio domain, summarization methods include, for example, mel-frequency cepstral coefficients, the zero-crossing rate and short-time energy. In the visual domain, color histograms^[3] such as the MPEG-7 Scalable Color Descriptor can be used for summarization.
Detection of patterns by auto-correlation and/or cross-correlation. Patterns are recurring media chunks that can either be detected by comparing chunks over the media dimensions (time, space, etc.) or comparing media chunks to templates (e.g. face templates, phrases). Typical methods include Linear Predictive Coding in the audio/biosignal domain,^[4] texture description in the visual domain and n-grams in text information retrieval.
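
Both goals can be illustrated on a toy audio signal. The following is a minimal sketch, assuming a signal given as a plain list of samples and non-overlapping frames; the function names, the frame length and the test waveform are illustrative choices, not part of any standard. It computes the zero-crossing rate and short-time energy per frame (summarization) and uses auto-correlation to estimate the period of a recurring pattern (pattern detection).

```python
from typing import List, Sequence


def zero_crossing_rate(frame: Sequence[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def short_time_energy(frame: Sequence[float]) -> float:
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame) if frame else 0.0


def summarize(signal: Sequence[float], frame_len: int) -> List[List[float]]:
    """One [zcr, energy] pair per non-overlapping frame."""
    return [
        [zero_crossing_rate(signal[i:i + frame_len]),
         short_time_energy(signal[i:i + frame_len])]
        for i in range(0, len(signal) - frame_len + 1, frame_len)
    ]


def autocorrelation(signal: Sequence[float], lag: int) -> float:
    """Unnormalized auto-correlation of the signal at a given lag."""
    return sum(signal[i] * signal[i + lag] for i in range(len(signal) - lag))


def dominant_period(signal: Sequence[float], min_lag: int, max_lag: int) -> int:
    """Lag in [min_lag, max_lag] with the highest auto-correlation."""
    return max(range(min_lag, max_lag + 1),
               key=lambda lag: autocorrelation(signal, lag))


signal = [1.0, 0.0, -1.0, 0.0] * 8   # toy waveform that repeats every 4 samples
features = summarize(signal, frame_len=8)
period = dominant_period(signal, min_lag=1, max_lag=10)
```

Here `features` is the description of the signal (one short vector per frame), and `period` recovers the repetition length of the toy waveform. Real systems typically use overlapping, windowed frames and normalized correlation.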

Merging and filtering methods

Multimedia Information Retrieval implies that multiple channels are employed for the understanding of media content.^[5] Each of these channels is described by media-specific feature transformations. The resulting descriptions have to be merged into one description per media object. Merging can be performed by simple concatenation if the descriptions are of fixed size. Variable-sized descriptions, as they frequently occur in motion description, have to be normalized to a fixed length first.
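
The merging step can be sketched as follows. This is a minimal illustration, assuming that descriptions are plain lists of numbers and that linear interpolation is an acceptable way to normalize a variable-length description; the function names and the example values are hypothetical.

```python
from typing import List, Sequence


def resample(values: Sequence[float], length: int) -> List[float]:
    """Linearly interpolate a variable-length description to `length` values."""
    if not values:
        return [0.0] * length
    if len(values) == 1 or length == 1:
        return [float(values[0])] * length
    out = []
    for i in range(length):
        pos = i * (len(values) - 1) / (length - 1)  # position in the source
        lo = int(pos)
        hi = min(lo + 1, len(values) - 1)
        frac = pos - lo
        out.append(values[lo] * (1 - frac) + values[hi] * frac)
    return out


def merge_descriptions(fixed: Sequence[Sequence[float]],
                       variable: Sequence[Sequence[float]],
                       length: int) -> List[float]:
    """Concatenate fixed-size parts and length-normalized variable parts."""
    merged: List[float] = []
    for part in fixed:
        merged.extend(part)
    for part in variable:
        merged.extend(resample(part, length))
    return merged


color_histogram = [0.2, 0.5, 0.3]             # fixed-size visual description
audio_summary = [0.57, 0.5]                   # fixed-size audio description
motion_track = [0.0, 1.0, 2.0, 3.0, 4.0]      # variable-size motion description
description = merge_descriptions([color_histogram, audio_summary],
                                 [motion_track], length=3)
```

After this step every media object is represented by a vector of the same length (eight values in the example), which is what the filtering and categorization methods expect.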

Frequently used methods for description filtering include factor analysis (e.g. by PCA), singular value decomposition (e.g. as latent semantic indexing in text retrieval) and the extraction and testing of statistical moments. Advanced concepts such as the Kalman filter are used for merging of descriptions.
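
As one simple instance of filtering by statistical moments, the sketch below computes the first moments of each description component across a collection and drops components whose variance is close to zero, since they cannot help to distinguish media objects. The threshold and the function names are assumptions made for illustration; PCA or singular value decomposition would be the more powerful alternatives named above.

```python
import math
from typing import List, Sequence


def moments(values: Sequence[float]) -> List[float]:
    """Mean, (population) variance and skewness of one component."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / n
    std = math.sqrt(variance)
    skewness = 0.0 if std == 0 else sum(((x - mean) / std) ** 3 for x in values) / n
    return [mean, variance, skewness]


def informative_components(descriptions: Sequence[Sequence[float]],
                           min_variance: float) -> List[int]:
    """Indices of components whose variance across the collection exceeds the threshold."""
    columns = list(zip(*descriptions))  # one tuple per component
    return [i for i, col in enumerate(columns) if moments(col)[1] > min_variance]


def filter_description(description: Sequence[float], keep: Sequence[int]) -> List[float]:
    """Reduce a description to the selected components."""
    return [description[i] for i in keep]


collection = [
    [1.0, 5.0, 0.1],
    [1.0, 7.0, 0.2],
    [1.0, 9.0, 0.3],
]
keep = informative_components(collection, min_variance=1e-3)
filtered = [filter_description(d, keep) for d in collection]
```

In the example the first component is constant across all media objects and is therefore removed as redundant.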

Categorization methods

Generally, all forms of machine learning can be employed for the categorization of multimedia descriptions,^[1] though some methods are more frequently used in one area than another. For example, hidden Markov models are state-of-the-art in speech recognition, while dynamic time warping, a semantically related method, is state-of-the-art in gene sequence alignment. The list of applicable classifiers includes the following:

Metric approaches (Cluster analysis, vector space model, Minkowski distances, dynamic alignment)
Nearest Neighbor methods (K-nearest neighbors algorithm, K-means, self-organizing map)
Risk Minimization (Support vector regression, support vector machine, linear discriminant analysis)
Density-based Methods (Bayes nets, Markov processes, mixture models)
Neural Networks (Perceptron, associative memories, spiking nets)
Heuristics (Decision trees, random forests, etc.)
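
Two of the listed approaches are compact enough to sketch directly: a k-nearest-neighbors classifier on top of a Minkowski distance, and dynamic time warping as an example of dynamic alignment. This is a minimal illustration, assuming descriptions are lists of numbers; the training data and class labels are made up for the example.

```python
from collections import Counter
from typing import Sequence, Tuple


def minkowski(a: Sequence[float], b: Sequence[float], p: float = 2.0) -> float:
    """Minkowski distance; p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def knn_classify(query: Sequence[float],
                 training: Sequence[Tuple[Sequence[float], str]],
                 k: int = 3, p: float = 2.0) -> str:
    """Majority label among the k training descriptions closest to the query."""
    neighbors = sorted(training, key=lambda item: minkowski(query, item[0], p))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Dynamic time warping cost between two sequences of possibly different length."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, match.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[len(a)][len(b)]


training = [
    ([0.0, 0.0], "speech"), ([0.0, 1.0], "speech"),
    ([5.0, 5.0], "music"), ([6.0, 5.0], "music"),
]
label = knn_classify([0.2, 0.3], training, k=3)
alignment_cost = dtw_distance([1, 2, 3], [1, 2, 2, 3])
```

The alignment cost is zero because the second sequence only repeats one element of the first, which dynamic time warping absorbs by stretching the time axis.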

The selection of the best classifier for a given problem (test set with descriptions and class labels, so-called ground truth) can be performed automatically, for example, using the Weka Data Miner.
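
The idea behind automatic classifier selection can be sketched without any toolkit. The example below is not Weka's API; it is a plain-Python stand-in, under the assumption that leave-one-out accuracy on the ground truth is the selection criterion. The two candidate classifiers and the data are illustrative.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]
Classifier = Callable[[Sequence[Sample], Sequence[float]], str]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_neighbor(train: Sequence[Sample], query: Sequence[float]) -> str:
    """Label of the single closest training description."""
    return min(train, key=lambda s: sq_dist(s[0], query))[1]


def nearest_centroid(train: Sequence[Sample], query: Sequence[float]) -> str:
    """Label of the class whose mean description is closest."""
    groups: Dict[str, List[Sequence[float]]] = {}
    for vec, label in train:
        groups.setdefault(label, []).append(vec)
    centroids = {label: [sum(col) / len(vecs) for col in zip(*vecs)]
                 for label, vecs in groups.items()}
    return min(centroids, key=lambda label: sq_dist(centroids[label], query))


def leave_one_out_accuracy(classifier: Classifier, data: Sequence[Sample]) -> float:
    """Classify each sample using all other samples as training data."""
    hits = 0
    for i, (vec, label) in enumerate(data):
        train = list(data[:i]) + list(data[i + 1:])
        hits += classifier(train, vec) == label
    return hits / len(data)


def select_best(candidates: Dict[str, Classifier], data: Sequence[Sample]):
    """Return the name of the best-scoring classifier and all scores."""
    scores = {name: leave_one_out_accuracy(clf, data) for name, clf in candidates.items()}
    return max(scores, key=lambda name: scores[name]), scores


ground_truth = [
    ([0.0, 0.1], "speech"), ([0.2, 0.0], "speech"), ([0.1, 0.3], "speech"),
    ([1.0, 1.1], "music"), ([0.9, 1.2], "music"), ([1.2, 0.9], "music"),
]
best, scores = select_best(
    {"1-nn": nearest_neighbor, "nearest-centroid": nearest_centroid}, ground_truth
)
```

On well-separated toy data both candidates score equally and the first one is returned; on real descriptions the scores usually differ, and cross-validation with several folds is the more common protocol.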

Models of Multimedia Information Retrieval

Spoken Language Audio Retrieval

Spoken Language Audio Retrieval focuses on audio content containing spoken words. It involves the transcription of spoken content into text using Automatic Speech Recognition (ASR) and indexing the transcriptions for text-based search.
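
The indexing half of this model can be sketched with an inverted index. The example assumes that ASR has already produced the transcripts; the recording identifiers and transcript texts are hypothetical, and a query matches a recording only if all of its terms occur in the transcript.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase word tokens; punctuation is dropped."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_index(transcripts: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map every term to the set of recordings whose transcript contains it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for recording_id, text in transcripts.items():
        for term in tokenize(text):
            index[term].add(recording_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> List[str]:
    """Recordings whose transcript contains all query terms."""
    terms = tokenize(query)
    if not terms:
        return []
    hits = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(hits)


# Hypothetical ASR output, keyed by recording identifier.
transcripts = {
    "call_01": "I would like to cancel my subscription",
    "call_02": "Please renew my subscription today",
    "meeting_07": "We cancel the Friday meeting",
}
index = build_index(transcripts)
results = search(index, "cancel subscription")
```

Because the index is built on the ASR output, a misrecognized word is simply missing from the index, which is how transcription errors translate into retrieval errors.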

Key Features:
Techniques: ASR for transcription and text indexing.
Query Types: Text-based queries.
Applications: Searching podcast transcripts; analyzing customer service call logs; finding specific phrases in meeting recordings.
Challenges: Errors in ASR can reduce retrieval accuracy; multilingual and accent variability requires robust systems.

Non-Speech Audio Retrieval

Non-Speech Audio Retrieval handles audio content without spoken words, such as music, environmental sounds, or sound effects. This model relies on extracting audio features like pitch, rhythm, and timbre to identify relevant audio.

Key Features:
Techniques: Acoustic feature extraction (e.g., spectrograms, MFCCs).
Query Types: Audio samples or textual descriptions.
Applications: Music recommendation systems; environmental sound detection (e.g., gunshots, animal calls); sound effect retrieval in media production.
Challenges: Difficulty in bridging the semantic gap between user queries and low-level audio features; efficient indexing of large datasets.

Graph Retrieval

Graph Retrieval retrieves information represented as graphs, which consist of nodes (entities) and edges (relationships). It is widely used in social networks, knowledge graphs, and bioinformatics.
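
A small sketch shows how such a graph can be stored and queried. It assumes labeled, directed edges stored in an adjacency list and restricts the query to a path pattern, that is, a sequence of relation labels; general subgraph matching is considerably more expensive. The entity and relation names are made up for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Edge = Tuple[str, str, str]  # (source node, relation label, target node)


def build_adjacency(edges: Iterable[Edge]) -> Dict[str, List[Tuple[str, str]]]:
    """Adjacency list: node -> list of (relation, target) pairs."""
    adjacency: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for source, relation, target in edges:
        adjacency[source].append((relation, target))
    return adjacency


def match_path(adjacency: Dict[str, List[Tuple[str, str]]],
               relations: List[str]) -> List[List[str]]:
    """All node paths whose consecutive edges carry the given relation labels."""
    paths = [[node] for node in list(adjacency)]
    for wanted in relations:
        paths = [
            path + [target]
            for path in paths
            for relation, target in adjacency.get(path[-1], [])
            if relation == wanted
        ]
    return paths


edges = [
    ("alice", "follows", "bob"),
    ("bob", "follows", "carol"),
    ("carol", "works_at", "acme"),
    ("bob", "works_at", "acme"),
]
adjacency = build_adjacency(edges)
matches = match_path(adjacency, ["follows", "works_at"])
```

The query asks for everyone who follows a person working somewhere, and returns the matching node sequences.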

Key Features:
Techniques: Graph matching, adjacency list/matrix storage, and graph databases (e.g., Neo4j).
Query Types: Subgraphs, patterns, or textual queries.
Applications: Social network analysis; searching knowledge graphs; molecular structure retrieval.
Challenges: Computationally intensive subgraph matching; scalability for large, complex graphs.

Imagery Retrieval

Imagery Retrieval retrieves images based on user input, such as textual descriptions or visual samples. It leverages both low-level features and semantic analysis for search.
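
Query by visual sample can be sketched with a low-level feature. The example below assumes that images are available as lists of RGB pixel tuples and uses a coarse color histogram with histogram intersection as the similarity measure; the image names and pixel values are invented for illustration, and semantic analysis is not covered.

```python
from typing import Dict, List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), each 0..255


def color_histogram(pixels: Sequence[Pixel], bins: int = 2) -> List[float]:
    """Normalized joint RGB histogram with `bins` bins per channel."""
    hist = [0.0] * bins ** 3
    for r, g, b in pixels:
        rb, gb, bb = r * bins // 256, g * bins // 256, b * bins // 256
        hist[rb * bins * bins + gb * bins + bb] += 1
    total = len(pixels) or 1  # avoid division by zero for empty images
    return [count / total for count in hist]


def histogram_intersection(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Similarity in [0, 1]; 1 means identical normalized histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def rank_images(query: Sequence[Pixel],
                collection: Dict[str, Sequence[Pixel]]) -> List[str]:
    """Image names ordered from most to least similar to the query image."""
    query_hist = color_histogram(query)
    scored = [(histogram_intersection(query_hist, color_histogram(pixels)), name)
              for name, pixels in collection.items()]
    return [name for _, name in sorted(scored, reverse=True)]


collection = {
    "sunset": [(250, 80, 20)] * 3 + [(30, 20, 60)],
    "forest": [(20, 200, 30)] * 4,
    "ocean": [(10, 40, 220)] * 4,
}
query_image = [(240, 90, 30)] * 4   # a mostly orange sample image
ranking = rank_images(query_image, collection)
```

The orange query ranks the sunset image first because only that image shares a histogram bin with it.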

Key Features:
Techniques: Content-Based Image Retrieval (CBIR), visual feature extraction, semantic analysis.
Query Types: Text, sketches, or example images.
Applications: Stock image search; e-commerce product matching; medical imaging analysis.
Challenges: Bridging the semantic gap between user queries and image content; efficient indexing of large-scale image datasets.

Video Retrieval

Video Retrieval is the process of finding specific video content based on user queries. It involves analyzing both the visual and temporal features of videos.
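
One common way to reduce the temporal dimension is keyframe extraction. The sketch below assumes frames given as flat lists of grayscale intensities and selects a new keyframe whenever the mean absolute difference to the last keyframe exceeds a threshold; the threshold and the frame data are illustrative assumptions.

```python
from typing import List, Sequence


def frame_difference(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute intensity difference between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def extract_keyframes(frames: Sequence[Sequence[float]], threshold: float) -> List[int]:
    """Indices of frames that differ strongly from the previous keyframe."""
    if not frames:
        return []
    keyframes = [0]  # the first frame always starts a shot
    for i in range(1, len(frames)):
        if frame_difference(frames[keyframes[-1]], frames[i]) > threshold:
            keyframes.append(i)
    return keyframes


frames = [
    [10, 10, 10, 10],
    [11, 10, 10, 10],       # nearly identical to the first frame
    [200, 200, 190, 200],   # abrupt change: new shot
    [199, 200, 190, 201],
    [10, 12, 10, 10],       # abrupt change back
]
keyframes = extract_keyframes(frames, threshold=30.0)
```

The selected keyframes can then be described and indexed with the same methods as still images, which keeps the amount of data to analyze manageable.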

Key Features:
Techniques: Keyframe extraction, motion pattern analysis, temporal indexing.
Query Types: Textual descriptions, sample clips, or temporal queries.
Applications: Streaming service recommendations; surveillance footage analysis; sports analytics.
Challenges: Managing the large file sizes of video content; efficient analysis of temporal sequences and multimodal features.

Comparison of Retrieval Models

| Model | Data Type | Query Types | Applications |
| --- | --- | --- | --- |
| Spoken Language Audio | Speech recordings | Text queries | Podcasts, meeting logs, call centers |
| Non-Speech Audio | Music, sound effects | Audio samples or text | Music apps, environmental sounds |
| Graph Retrieval | Graph structures | Subgraphs, patterns | Knowledge graphs, bioinformatics |
| Imagery Retrieval | Images | Text, sketches, or images | E-commerce, medical imaging |
| Video Retrieval | Videos (visual + temporal) | Text, clips, or time queries | Surveillance, sports analysis |

Conclusion

Multimedia Information Retrieval plays a crucial role in organizing and accessing vast multimedia data repositories. The variety of retrieval models ensures that users can effectively interact with and extract insights from complex multimedia datasets. Future advancements in artificial intelligence and machine learning are expected to improve the accuracy and scalability of MIR systems.

Related areas

MMIR provides an overview of the methods employed in the areas of information retrieval.^[6] ^[7] Methods of one area are adapted and employed on other types of media. Multimedia content is merged before the classification is performed. MMIR methods are, therefore, usually reused from other areas.

The International Journal of Multimedia Information Retrieval^[8] documents the development of MMIR as a research discipline that is independent of these areas. See also the Handbook of Multimedia Information Retrieval^[9] for a complete overview of this research discipline.

Notes and References

H Eidenberger. Fundamental Media Understanding, atpress, 2011, p. 1.
LF Sikos. RDF-powered semantic video annotation tools with concept mapping to Linked Data for next-generation video indexing: a comprehensive review, Multimedia Tools and Applications 76 (12), 2016, pp. 14437–14460. doi:10.1007/s11042-016-3705-7. S2CID 254832794.
A Del Bimbo. Visual Information Retrieval, Morgan Kaufmann, 1999.
HG Kim, N Moreau, T Sikora. MPEG-7 Audio and Beyond, Wiley, 2005.
MS Lew (Ed.). Principles of Visual Information Retrieval, Springer, 2001.
H Eidenberger. Professional Media Understanding, atpress, 2012.
R Raieli. Introducing Multimedia Information Retrieval to libraries, JLIS.it 7 (3), 2016, pp. 9–42. doi:10.4403/jlis.it-11530. S2CID 56652314.
"International Journal of Multimedia Information Retrieval", Springer, 2011, Retrieved 21 October 2011.
H Eidenberger. Handbook of Multimedia Information Retrieval, atpress, 2012.