Document Recovery from Bag-of-Word Indices

Authors

Fillmore, Nathanael
Goldberg, Andrew B.
Zhu, Xiaojin

Type

Technical Report

Publisher

University of Wisconsin-Madison Department of Computer Sciences

Abstract

Motivated by computer privacy issues, we present the novel problem of document recovery from an index: given only a document's bag-of-words (BOW) vector or other type of index, reconstruct the original ordered document. We investigate a variety of index types, including count-based BOW vectors, stopwords-removed count BOW vectors, indicator BOW vectors, and bigram count vectors. We formulate the problem as hypothesis rescoring using A* search with the Google Web 1T 5-gram corpus. Our experiments on five domains indicate that if original documents are short, the documents can be recovered with high accuracy.
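
To make the four index types named in the abstract concrete, the following is a minimal sketch of how each could be built from a tokenized document. It is not the report's implementation: the function name `build_indices`, the whitespace tokenizer, and the tiny `STOPWORDS` set are assumptions chosen purely for illustration.

```python
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"the", "a", "of", "on"}

def build_indices(text):
    """Return four order-free indices of `text` (illustrative only)."""
    # Naive tokenization: lowercase and split on whitespace.
    tokens = text.lower().split()
    # Count-based BOW: how many times each word occurs.
    count_bow = Counter(tokens)
    # Stopwords-removed count BOW: same counts, with stopwords dropped.
    stop_removed = Counter(t for t in tokens if t not in STOPWORDS)
    # Indicator BOW: only which words occur, not how often.
    indicator = set(tokens)
    # Bigram count vector: counts of adjacent word pairs.
    bigram = Counter(zip(tokens, tokens[1:]))
    return count_bow, stop_removed, indicator, bigram

count_bow, stop_removed, indicator, bigram = build_indices("the cat sat on the mat")
# A reordered document produces the same count BOW, which is why recovery is non-trivial.
same_bow = build_indices("the mat sat on the cat")[0] == count_bow
```

In this sketch the two example sentences share one count BOW vector, so a recovery method has to rely on outside knowledge, such as the n-gram statistics the abstract mentions, to choose among the candidate orderings.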

Citation

TR1645
