Team:Tsinghua-A/Feature Extraction

Overview

Image similarity search is widely applied in industry and in daily life, helping us retrieve enormous numbers of images from the Internet in no time. Early image similarity search algorithms used in search engines were usually based on a non-neural-network method known as the Perceptual Hashing Algorithm, which computes a digital fingerprint of each image through steps such as image normalization, color simplification, grayscale comparison, and hash value calculation. This hash-based algorithm measures image similarity by comparing the fingerprints of different pictures.
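As a minimal sketch of this hash-based idea, the "average hash" variant below fingerprints an image by thresholding each pixel against the mean brightness, then compares fingerprints by Hamming distance. It assumes the image has already been normalized to an 8x8 grayscale grid; a real pipeline would resize a full image first (e.g. with PIL).

```python
def average_hash(pixels):
    """pixels: 8x8 nested list of grayscale values (0-255) -> 64-bit fingerprint."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    # Each bit records whether a pixel is brighter than the image's mean.
    return [1 if p > mean else 0 for p in flat]

def hamming(h1, h2):
    """Number of differing bits; smaller means more similar images."""
    return sum(a != b for a, b in zip(h1, h2))

if __name__ == "__main__":
    bright_top = [[200] * 8 for _ in range(4)] + [[50] * 8 for _ in range(4)]
    bright_bottom = [[50] * 8 for _ in range(4)] + [[200] * 8 for _ in range(4)]
    print(hamming(average_hash(bright_top), average_hash(bright_top)))     # 0
    print(hamming(average_hash(bright_top), average_hash(bright_bottom)))  # 64
```

Identical images produce identical fingerprints (distance 0), while an inverted brightness pattern flips every bit (distance 64).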

In our project, there is a certain similarity between retrieval in a DNA storage library and data retrieval on a server: both kinds of retrieval require feature extraction from pictures as a measurement of image similarity. Based on this comparability, two specific problems need to be solved:

  • Image feature extraction:

At present, many existing neural networks can effectively handle image feature extraction. In this project, we choose the pre-trained VGG16 network and Principal Component Analysis (PCA) to perform feature extraction on the Caltech-256 database, validate the feasibility of this method, and determine the length of the final retained feature vector.

  • Similarity criteria

In feature space, there are multiple ways to measure the similarity between feature vectors, such as Euclidean distance, cosine distance, and Hamming distance. Through experiments, we found that cosine distance has the highest accuracy and stability under various experimental conditions (including different feature dimensions and image types), so we finally chose cosine distance as the ultimate measure of similarity.
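The two main candidate criteria can be sketched as follows; this is a pure-Python stand-in, whereas in practice these would be vectorized operations over the PCA-reduced VGG16 features. Note that cosine distance ignores vector magnitude, while Euclidean distance does not.

```python
import math

def euclidean(u, v):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    """1 - cos(angle between u and v); 0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

if __name__ == "__main__":
    q = [1.0, 2.0, 3.0]
    scaled = [2.0, 4.0, 6.0]            # same direction, twice the magnitude
    print(cosine_distance(q, scaled))   # ~0.0: cosine ignores scale
    print(euclidean(q, scaled))         # > 0: Euclidean penalizes the scale change
```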

Algorithm structure & principles

  • VGG16 Network
<img src="https://static.igem.org/mediawiki/2019/f/f7/T--Tsinghua-A--vgg16.png">

VGG16 is a convolutional neural network model proposed by K. Simonyan and A. Zisserman from the University of Oxford in the paper "Very Deep Convolutional Networks for Large-Scale Image Recognition". The model achieves 92.7% top-5 test accuracy on ImageNet, a dataset of over 14 million images belonging to 1000 classes.[1]

In this project, we utilize an 'intermediate product' of VGG16: the output of the ReLU layer, which is a 1*1*4096 vector. We then tested two schemes to further reduce the size of the feature vector: the first adds another neural network layer, similar to the original VGG16 shown above but with an adjustable vector length; the second performs dimension reduction using PCA.
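The PCA scheme can be sketched as below. Random vectors stand in for the real 4096-d VGG16 ReLU outputs, and PCA is implemented directly as an SVD of the mean-centred feature matrix rather than through a library class.

```python
import numpy as np

def fit_pca(features, n_components):
    """features: (n_images, 4096) matrix -> (mean, principal components)."""
    mean = features.mean(axis=0)
    centred = features - mean
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return mean, vt[:n_components]

def reduce_dim(features, mean, components):
    """Project features onto the retained principal directions."""
    return (features - mean) @ components.T

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.standard_normal((100, 4096))   # stand-in for VGG16 outputs
    mean, comps = fit_pca(feats, n_components=10)
    reduced = reduce_dim(feats, mean, comps)
    print(reduced.shape)  # (100, 10)
```

The `(mean, components)` pair is saved so that query images can later be projected with exactly the same parameters.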

  • Dataset preprocessing

In our experiments, we used images from the Caltech-256 dataset. To meet the input requirements of the VGG16 network, we preprocessed and screened the images to 224*224*3 size, selected 25 of the 256 image categories, and divided them into a training set and a test set.
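The resize step can be sketched with nearest-neighbour resampling on a nested-list image; real preprocessing would use a library such as PIL or OpenCV, which also offer higher-quality interpolation.

```python
def resize_nearest(image, out_h, out_w):
    """image: H x W x C nested lists -> out_h x out_w x C, nearest-neighbour."""
    in_h, in_w = len(image), len(image[0])
    return [[image[r * in_h // out_h][c * in_w // out_w]
             for c in range(out_w)]
            for r in range(out_h)]

if __name__ == "__main__":
    img = [[[r, c, 0] for c in range(64)] for r in range(48)]  # 48x64 RGB image
    out = resize_nearest(img, 224, 224)                        # VGG16 input size
    print(len(out), len(out[0]), len(out[0][0]))  # 224 224 3
```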

  • Establishment of Image Feature Library

Feature vectors that are effective for similarity search tend to be high-dimensional: the nearest neighbors in higher-dimensional feature spaces (with respect to Euclidean or cosine distance) appear more similar to the query than the nearest neighbors in lower-dimensional spaces.[2] We therefore run every database image through the VGG16 and PCA pipeline and store the resulting feature vectors as the image feature library.

  • Query image feature vector calculation

For an input query image, we first use the VGG16 network to obtain its high-dimensional feature, then apply the same parameters computed in the PCA step above to reduce the feature's dimension and obtain a vector of the desired length.

  • Similarity calculation and searching

By calculating the cosine distance between the query image's feature vector and the feature vectors of pictures from the database (stored in DNA), and sorting the database images by similarity score, we obtain the search results.

<img src="https://static.igem.org/mediawiki/2019/d/dd/T--Tsinghua-A--cosine_dist.png">
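The ranking step can be sketched as below, where the feature vectors are stand-ins for the outputs of the VGG16 + PCA pipeline. Normalizing both sides first turns the cosine similarity into a single matrix-vector product.

```python
import numpy as np

def search(query, database, top_k=3):
    """query: (d,), database: (n, d). Returns indices, most similar first."""
    db_norm = database / np.linalg.norm(database, axis=1, keepdims=True)
    q_norm = query / np.linalg.norm(query)
    scores = db_norm @ q_norm            # cosine similarity per database image
    return np.argsort(-scores)[:top_k]   # highest similarity first

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    db = rng.standard_normal((100, 10))
    q = db[42] + 0.01 * rng.standard_normal(10)  # near-duplicate of image 42
    print(search(q, db))  # index 42 ranks first
```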

Implementation

As mentioned above, we use the pre-trained VGG16 network to perform feature extraction.

In initial experiments, we used a pre-trained single-hidden-layer network to perform feature dimension reduction, in which the number of nodes in the hidden layer equals the number of dimensions after reduction. At first this method achieved good results, but because its training process is supervised, it takes some 'experience' to organize image labels, and the algorithm also lacks scalability. We therefore switched to PCA for dimension reduction and achieved similar accuracy.
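The supervised scheme can be sketched as a forward pass whose hidden activations serve as the reduced feature. Weights here are random stand-ins; in the original scheme they were trained against image labels, which is what made the method supervised and label-dependent.

```python
import numpy as np

def hidden_features(x, w1, b1):
    """Forward pass to the hidden layer; its width is the reduced dimension."""
    return np.maximum(0.0, x @ w1 + b1)   # ReLU hidden activations

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((1, 4096))           # one VGG16 feature vector
    w1 = rng.standard_normal((4096, 10)) * 0.01  # hidden width = 10 dimensions
    b1 = np.zeros(10)
    print(hidden_features(x, w1, b1).shape)  # (1, 10)
```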

Using the methods above, we successfully established an image feature database; when a picture is queried in the system, all we have to do is run it through the same pipeline.

Experiments

For the image database described above, we carried out a series of experiments to verify the feasibility of this method. Through these experiments we also determined the length of the features to retain and the similarity criterion to use.

  • Fixed number of query images (N=5), using feature vectors of different lengths (L=1 to 100) to carry out the search:
<img src="https://static.igem.org/mediawiki/2019/4/4c/T--Tsinghua-A--feat_len17.png">
<img src="https://static.igem.org/mediawiki/2019/9/91/T--Tsinghua-A--feat_len19.png">
  • Fixed feature vector length (L=10), using Euclidean distance and cosine similarity as criteria to search different numbers of pictures (1 to 30):
<img src="https://static.igem.org/mediawiki/2019/3/30/T--Tsinghua-A--euc_cos.png">
  • A clustering method is used to preprocess the feature database and obtain the central feature of each class; the search is then performed within the clustering results:
<img src="https://static.igem.org/mediawiki/2019/a/a8/T--Tsinghua-A--bear.png">
<img src="https://static.igem.org/mediawiki/2019/7/78/T--Tsinghua-A--cluster.png">
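The clustering preprocessing can be sketched with a few iterations of plain Lloyd's k-means on stand-in features: a query is first matched against the cluster centres, so only the nearest cluster needs to be searched in full. The deterministic initialization (evenly spaced database points) is a simplification for illustration.

```python
import numpy as np

def kmeans(features, k, iters=10):
    """Lloyd's algorithm; returns (cluster centres, per-image labels)."""
    # Deterministic init: evenly spaced points from the database.
    centres = features[:: max(1, len(features) // k)][:k].copy()
    for _ in range(iters):
        # Assign each feature to its nearest centre (Euclidean here).
        dists = np.linalg.norm(features[:, None] - centres[None], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centres[j] = features[labels == j].mean(axis=0)
    return centres, labels

def search_cluster(query, centres, labels):
    """Indices of database images in the query's nearest cluster."""
    nearest = np.linalg.norm(centres - query, axis=1).argmin()
    return np.flatnonzero(labels == nearest)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    feats = np.vstack([rng.normal(0, 0.1, (50, 10)),   # two well-separated
                       rng.normal(5, 0.1, (50, 10))])  # blobs of 50 images
    centres, labels = kmeans(feats, k=2)
    hits = search_cluster(feats[0], centres, labels)
    print(len(hits))  # 50: only the matching cluster is searched
```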


[1] K. Simonyan and A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition.
[2] Kendall Stewart et al., A Content-Addressable DNA Database with Learned Sequence Encodings.