{"id":385,"date":"2015-09-18T09:57:52","date_gmt":"2015-09-18T17:57:52","guid":{"rendered":"https:\/\/www.myzips.com\/blog\/constant-database\/"},"modified":"2015-03-31T14:57:11","modified_gmt":"2015-03-31T22:57:11","slug":"constant-database","status":"publish","type":"post","link":"https:\/\/www.myzips.com\/blog\/constant-database\/","title":{"rendered":"Constant Database"},"content":{"rendered":"<p>When coding in Python I&#8217;m often doing text processing, and I end up with some form of inverted index or associative array in memory that I want to persist.<br \/>\nOn and off I&#8217;ve tried using the <a href=\"http:\/\/www.oracle.com\/technology\/products\/berkeley-db\/index.html\">Berkeley Database<\/a> from Oracle. Inevitably I find that it takes forever to write out large data sets. There are some tuning parameters, especially the cache size, but it seems that the software just doesn&#8217;t scale well.<\/p>\n<p>I recently rediscovered <a href=\"http:\/\/pilcrow.madison.wi.us\/\">CDB<\/a> (constant database), written by Dan Bernstein, which has Python bindings. It has the basic functionality I need (it handles large data sets, can write a dict out in a reasonable amount of time, and stores it reasonably compactly) and is amazingly simple. For more details see the <a href=\"http:\/\/www.unixuser.org\/~euske\/doc\/cdbinternals\/\">internals page<\/a>. The only disadvantage is that CDB does not support updates or deletes. Instead, you need to create your data set in one fell swoop, persist it all at once, and thereafter treat it as read-only. For me this works, because in typical, simple information retrieval (IR) tasks you build a data structure once, save it, and use it later. Because of all the performance problems I&#8217;ve had with Berkeley DB (originally from Sleepycat Software), I plan on using it less and CDB more.
<\/p>\n","protected":false},"excerpt":{"rendered":"<p>When coding in Python I&#8217;m often doing text processing, and I end up with some form of inverted index or associative array in memory that I want to persist. On and off I&#8217;ve tried using the Berkeley Database from Oracle. Inevitably I find that it takes forever to write out large data sets. There [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[1],"tags":[],"_links":{"self":[{"href":"https:\/\/www.myzips.com\/blog\/wp-json\/wp\/v2\/posts\/385"}],"collection":[{"href":"https:\/\/www.myzips.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.myzips.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.myzips.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.myzips.com\/blog\/wp-json\/wp\/v2\/comments?post=385"}],"version-history":[{"count":1,"href":"https:\/\/www.myzips.com\/blog\/wp-json\/wp\/v2\/posts\/385\/revisions"}],"predecessor-version":[{"id":939,"href":"https:\/\/www.myzips.com\/blog\/wp-json\/wp\/v2\/posts\/385\/revisions\/939"}],"wp:attachment":[{"href":"https:\/\/www.myzips.com\/blog\/wp-json\/wp\/v2\/media?parent=385"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.myzips.com\/blog\/wp-json\/wp\/v2\/categories?post=385"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.myzips.com\/blog\/wp-json\/wp\/v2\/tags?post=385"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}