2010
12.09

This note relates to CouchDB 1.0.1.

In CouchDB, a view can map a single document to multiple keys. When querying for multiple keys, it is therefore possible for a document to be returned multiple times. In some circumstances, this might be the desired behaviour. However, when the desired semantics are to retrieve only one copy of each document matching any of the keys, a different approach is required.
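As an illustration of how such duplicates arise, here is a minimal sketch that runs outside CouchDB. The `tags` field, the document identifiers and the `queryKeys` helper are assumptions made for the example, and `emit` is a stand-in for the function CouchDB provides to map functions.

```javascript
// Stand-in for CouchDB's emit(), only so the sketch runs outside CouchDB.
var index = [];
var currentId = null;
function emit(key, value) {
  index.push({ id: currentId, key: key, value: value });
}

// Hypothetical map function: each document is emitted once per tag,
// with the document itself as the value.
function map(doc) {
  (doc.tags || []).forEach(function (tag) {
    emit(tag, doc);
  });
}

// Simulate indexing two documents.
[{ _id: 'doc1', tags: ['red', 'round'] },
 { _id: 'doc2', tags: ['red'] }].forEach(function (doc) {
  currentId = doc._id;
  map(doc);
});

// Simulate a query for several keys at once (hypothetical helper).
function queryKeys(keys) {
  return index.filter(function (row) {
    return keys.indexOf(row.key) !== -1;
  });
}

// doc1 matches both keys, so it shows up in two rows.
var rows = queryKeys(['red', 'round']);
```

Here `doc1` is reported once under "red" and once under "round", which is the duplication this note is concerned with.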

As a note of caution, this article might provide a complicated solution to a problem easily solved another way. I was under the impression that the work covered here could be done using a special flag on a view query. However, I cannot readily find it. I am hoping someone will come around and comment on this article with a simpler approach. Until then, the solution presented here will suffice.

The result of a view query is a JSON object that contains an array of rows, each row reporting a document matching the query. A list function is used to transform the result of a view query into a format desired for output. One advantage of using a list function is that it can inspect each row (or document) before sending it to the output.

In this approach, we use a list function to output a result in the exact same format as a view query, suppressing duplicates of documents that were already sent.

The following list function is generic enough to be used with any view that emits the documents as values:

The input parameter called “head” is used to retrieve the total number of rows and the offset. Then, the list function outputs the “rows” member. Each row is sent as a JSON string, so the list function must take care of inserting the commas at the right place. A map (JavaScript object) called “ids” is used to recall which documents have already been sent. The key used in the map is the identifier of the document. When a document has already been sent, it is skipped.
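A minimal sketch of a list function matching this description is given below. It assumes the `getRow()` and `send()` functions that CouchDB makes available to list functions, that `JSON.stringify` is available in the query server, and that the view is a map view so that “head” carries `total_rows` and `offset`. The function is given the name `noduplicate` here only so that it can be exercised outside CouchDB; in a design document it would be stored as an anonymous function.

```javascript
// Sketch of a de-duplicating list function.
// getRow() and send() are supplied by CouchDB's list-function runtime.
function noduplicate(head, req) {
  var ids = {};      // identifiers of the documents already sent
  var first = true;  // true until the first row has been sent
  var row;

  // Reproduce the envelope of a regular view response.
  send('{"total_rows":' + head.total_rows +
       ',"offset":' + head.offset + ',"rows":[');

  while ((row = getRow())) {
    // Skip rows whose document has already been sent.
    if (Object.prototype.hasOwnProperty.call(ids, row.id)) {
      continue;
    }
    ids[row.id] = true;

    // Every row except the first is preceded by a comma.
    send((first ? '' : ',') + JSON.stringify(row));
    first = false;
  }

  send(']}');
}
```

Note that `total_rows` is passed through unchanged from the view, so it still counts the rows of the view and not the number of distinct documents sent.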

For example, if a query to a view named “testview” yielded duplicates of a document using the following URL:
http://127.0.0.1:5984/db/_design/test/_view/testview
then duplicates would be removed if the above function was named “noduplicate” and the following URL employed:
http://127.0.0.1:5984/db/_design/test/_list/noduplicate/testview
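With this URL form, CouchDB looks up both the list function “noduplicate” and the view “testview” in the design document “test”. A sketch of how such a design document might be laid out follows; the function bodies are abbreviated placeholders, and CouchDB stores each function as a string.

```javascript
// Hypothetical layout of the design document behind
// http://127.0.0.1:5984/db/_design/test
var designDoc = {
  _id: '_design/test',
  language: 'javascript',
  views: {
    // Reached through .../_design/test/_view/testview
    testview: {
      map: 'function(doc) { /* emit(key, doc) for each key */ }'
    }
  },
  lists: {
    // Reached through .../_design/test/_list/noduplicate/testview
    noduplicate: 'function(head, req) { /* de-duplicating list function */ }'
  }
};
```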

In conclusion, the presented function is generic enough to be reused in many situations. However, I suspect that a much easier way to perform this will be designed shortly, if it does not already exist.

3 comments so far

  1. So I know this post is a bit old, but I’ve been trying to figure out this issue for a while, and I feel like you’re on to something, but there are some issues.

    First of all, I rewrote your function to be a bit cleaner, and offload all the string manipulation stuff to the toJSON function:

    Second, there’s a fundamental flaw with this approach: You may set a limit, but that limit is handled before the duplicate removal process. Because of this, if you have any duplicates, then your final result will be less than the limit passed in on the URL. You won’t be able to reliably page through your results because of this discrepancy.

    Again, I like your thinking here, but there has to be a better way to do this.

  2. Apologies for the terrible indentation in my example above. Apparently wordpress doesn’t like tabs in comments 🙂

  3. I fixed the indentation. Thanks for your comment.
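
The limit problem raised in the first comment can be illustrated with a small simulation that runs outside CouchDB. The row data and the helper names `applyLimit` and `removeDuplicates` are assumptions made for the example; the point is only the order of the two steps.

```javascript
// Rows as a view might return them: document "a" matches two keys.
var viewRows = [
  { id: 'a', key: 'k1' },
  { id: 'a', key: 'k2' },
  { id: 'b', key: 'k2' },
  { id: 'c', key: 'k3' }
];

// Step 1: the view applies the limit to its own rows.
function applyLimit(rows, limit) {
  return rows.slice(0, limit);
}

// Step 2: the list function removes duplicates from what is left.
function removeDuplicates(rows) {
  var seen = {};
  return rows.filter(function (row) {
    if (Object.prototype.hasOwnProperty.call(seen, row.id)) {
      return false;
    }
    seen[row.id] = true;
    return true;
  });
}

// A limit of 3 was requested and three distinct documents exist,
// yet only "a" and "b" survive, because the limit was spent on a duplicate.
var page = removeDuplicates(applyLimit(viewRows, 3));
```

A client that expects exactly three rows per page cannot tell from this result whether it has reached the end of the data or merely lost a row to de-duplication.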