[LTER-im] PASTA dataset download tallying tools

Matt Jones jones at nceas.ucsb.edu
Tue Feb 14 14:07:13 PST 2017


Hi John --

That's awesome, and sounds really useful.

If you're having speed issues due to the multiple requests, you might find
the DataONE log aggregation service useful as well.  We collate access logs
from all of the DataONE member repositories, and then index those along
with additional information such as the time and geolocation of each
access.  So it's pretty fast and easy to get summaries of the usage logs
for individual identifiers, for groups of identifiers, for all identifiers
owned by a user or group, etc.  You can also do temporal summaries of that
same data (e.g., downloads by month), and downloads by spatial location.
Here are some example Solr queries, no program needed:

1) Download counts for all SBC LTER pids (identifiers) that have been
registered with DataONE (assuming they follow the 'knb-lter-sbc' naming
convention):
https://cn.dataone.org/cn/v2/query/logsolr/?q=*:*&fq=pid:*knb-lter-sbc*&fq=event:read&facet=true&facet.field=pid&facet.mincount=1&facet.limit=10000&rows=0&wt=xml

If you want it in JSON format, just change the last parameter to 'wt=json'.

Download stats are more meaningful if we exclude web crawlers from search
engines.  We provide a simple filter for that as well, so if you add
`fq=inPartialRobotList:false` then you will exclude most web robots:
https://cn.dataone.org/cn/v2/query/logsolr/?q=*:*&fq=pid:*knb-lter-sbc*&fq=event:read&fq=inPartialRobotList:false&facet=true&facet.field=pid&facet.mincount=1&facet.limit=10000&rows=0&wt=xml

For that query, this reduces the total downloads from 87,709 to 56,284, so
it has a significant impact on interpreting the results.
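
If you'd rather pull these counts from a script than read the XML, here is
a minimal sketch in Python.  The query parameters come straight from the
URLs above; the response parsing assumes the stock Solr JSON facet layout
(facet_counts.facet_fields.pid as a flat [pid, count, pid, count, ...]
list), so treat that part as an assumption:

    # Sketch: per-identifier download counts from the DataONE log index,
    # with the robot filter applied.  Parsing assumes stock Solr JSON output.
    import requests

    params = {
        "q": "*:*",
        "fq": ["pid:*knb-lter-sbc*", "event:read", "inPartialRobotList:false"],
        "facet": "true",
        "facet.field": "pid",
        "facet.mincount": "1",
        "facet.limit": "10000",
        "rows": "0",
        "wt": "json",
    }
    resp = requests.get("https://cn.dataone.org/cn/v2/query/logsolr/",
                        params=params)
    resp.raise_for_status()
    flat = resp.json()["facet_counts"]["facet_fields"]["pid"]
    counts = dict(zip(flat[0::2], flat[1::2]))  # {pid: download count}
    for pid, n in sorted(counts.items(), key=lambda kv: -kv[1])[:20]:
        print(pid, n)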

2) Monthly breakdown of download counts for a particular identifier:
https://cn.dataone.org/cn/v2/query/logsolr/?q=*:*&fq=pid:knb-lter-sbc.5.3*&fq=event:read&facet=true&facet.field=pid&facet.mincount=1&facet.limit=10000&rows=0&facet.range=dateLogged&facet.range.start=2000-01-01T01:01:01Z&facet.range.end=2017-01-31T24:59:59Z&facet.range.gap=%2B1MONTH&wt=xml
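
To post-process the monthly breakdown in a script, the same pattern works.
This sketch assumes the stock Solr JSON layout for range facets
(facet_counts.facet_ranges.dateLogged.counts), and the range bounds are
just illustrative:

    # Sketch: monthly download counts for one identifier from the log index.
    import requests

    params = {
        "q": "*:*",
        "fq": ["pid:knb-lter-sbc.5.3*", "event:read"],
        "rows": "0",
        "facet": "true",
        "facet.range": "dateLogged",
        "facet.range.start": "2000-01-01T00:00:00Z",
        "facet.range.end": "2017-02-01T00:00:00Z",
        "facet.range.gap": "+1MONTH",
        "wt": "json",
    }
    resp = requests.get("https://cn.dataone.org/cn/v2/query/logsolr/",
                        params=params)
    resp.raise_for_status()
    flat = resp.json()["facet_counts"]["facet_ranges"]["dateLogged"]["counts"]
    for month, count in zip(flat[0::2], flat[1::2]):
        if count:
            print(month[:7], count)   # e.g. 2016-11  42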

These are just two examples -- you can easily get many other reports.  This
index is what drives the user interface display of download counts on the
DataONE view of a data set, such as this one for knb-lter-sbc.1002.6,
which shows the individual metadata views and how many times each data
file was downloaded:
https://search.dataone.org/#view/https://pasta.lternet.edu/package/metadata/eml/knb-lter-sbc/1002/6

Of course, what we show is only as accurate as what is reported by member
repositories -- we've found that some repositories either don't report or
under-report their downloads, so you should probably view these as minimum
counts rather than absolute values.  But YMMV from member to member.

More details about the log aggregation service are in our documentation (
https://releases.dataone.org/online/api-documentation-v2.0.1/design/UsageStatistics.html
).

Hope this is helpful.

Matt



On Tue, Feb 14, 2017 at 12:02 PM, Ken Ramsey <kramsey at jornada-vmail.nmsu.edu
> wrote:

> Hi John,
>
> Thanks!
>
> Ken
>
>
> >>> John Porter <jhp7e at eservices.virginia.edu> 2017-02-14 01:58 PM >>>
> During the VTC yesterday, several folks expressed interest in code to
> tally dataset and metadata downloads of data in PASTA.  PASTA keeps
> excellent logs, but it is up to us to do the desired aggregations.
>
> https://github.com/lter/VCR
>
> has several Python programs that may be of help.
>
> PastaUseCountBasic.py (attached) writes to standard output a CSV file
> containing Scope, Identifier, Revision, Title, Entity, DownloadCount,
> StartDate, EndDate for each entity downloaded during a specified time
> period.
>
> Some notes:
>
> The program writes its output to STDOUT, controlled by command-line
> options.  A typical invocation might be:
>
>   python ./PastaUseCountBasic.py --fromdate 2017-01-01 \
>     --todate 2017-02-14 knb-lter-jrn > jrn_2016.csv
>
> The program is NOT particularly fast, due to the large number of web
> service calls required and the latency of PASTA processing.  Shorter
> time periods process faster than longer ones because fewer log entries
> need to be retrieved.
>
> The program uses a number of modules (listed at the top in the import
> statements) that need to be installed prior to running.
>
> It requires an authorized login to access the needed records and will
> prompt you for credentials, or you can set up an "authorization file"
> that eliminates the need to log in manually.  Contact me for
> details....
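>
> If you want to summarize the resulting CSV further, here's a minimal
> sketch using only the standard library.  It assumes the column names
> listed above and that the output includes a header row; adjust if your
> copy of the script writes something different:
>
>     import csv
>     from collections import Counter
>
>     totals = Counter()
>     with open("jrn_2016.csv", newline="") as f:
>         for row in csv.DictReader(f):
>             # Sum downloads per data package (Scope.Identifier.Revision).
>             key = "{Scope}.{Identifier}.{Revision}".format(**row)
>             totals[key] += int(row["DownloadCount"])
>
>     for package, n in totals.most_common(10):
>         print(package, n)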
>
>
> --
> John H. Porter
> Dept. of Environmental Sciences
> University of Virginia
> 291 McCormick Road
> PO Box 400123
> Charlottesville, VA 22904-4123
> ORCID: http://orcid.org/0000-0003-3118-5784
>
>
> _______________________________________________
> Long Term Ecological Research Network
> im mailing list
> im at lternet.edu
>
>

