
How to set up a mirror of PhysioNet

We encourage you to set up a mirror of PhysioNet if you wish. If you do so, please use rsync as described below to minimize the impact on other users.

If you simply wish to retrieve a (possibly large) set of files from PhysioNet without downloading them one at a time, you don't need to set up a mirror of PhysioNet; see the PhysioNet FAQ for instructions on downloading an entire PhysioBank database in one step using rsync, or use GNU wget to download any desired selection of files.
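
For the wget route, a minimal sketch is shown here. The helper name wget_tree_cmd is hypothetical, and the URL is only an example of a PhysioBank database directory; substitute the directory you actually want. The helper prints the command instead of running it, so the options can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) a wget command that fetches one
# directory tree without climbing to parent directories.
wget_tree_cmd() {
    url=$1
    dest=${2:-.}
    # -r   recurse into links below the starting URL
    # -np  never ascend to the parent directory
    # -nH  do not create a top-level directory named after the host
    # -P   directory under which downloaded files are saved
    printf 'wget -r -np -nH -P %s %s\n' "$dest" "$url"
}

# Example URL only; adjust to the files you need.
wget_tree_cmd "http://physionet.org/physiobank/database/mitdb/" ./mitdb
```

Once the printed command looks right, it can be copied to the shell prompt and run as an ordinary user.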

PhysioNet is organized in "volumes" that can be mirrored separately. All PhysioNet mirrors provide the "base volume", which contains all of the software, tutorials, reference manuals, publications, and many of the PhysioBank databases. Additional volumes, which are optional for the mirrors, contain very large data sets that don't fit in the base volume. The base volume, and optional volumes 2, 3, and 4, will not exceed 25 GB each. Volumes 5 through 8 will be allowed to grow up to one TB each. More than one volume can occupy a single sufficiently large disk partition.
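
The volume limits above give a simple upper bound on the disk space a mirror may eventually need. The following is a sketch only: the function names are hypothetical, the base volume is treated as volume 1, and one TB is counted as 1000 GB.

```shell
#!/bin/sh
# Upper bound on disk space (in GB) for a chosen set of PhysioNet volumes.
# Assumptions: volume 1 is the base volume; 1 TB is counted as 1000 GB.

volume_limit_gb() {
    case "$1" in
        1|2|3|4) echo 25 ;;
        5|6|7|8) echo 1000 ;;
        *) echo "unknown volume: $1" >&2; return 1 ;;
    esac
}

total_limit_gb() {
    total=0
    for vol in "$@"; do
        limit=$(volume_limit_gb "$vol") || return 1
        total=$((total + limit))
    done
    echo "$total"
}

# Example: base volume plus optional volumes 2 and 5.
total_limit_gb 1 2 5    # prints 1050
```

These are ceilings, not current sizes, so a new mirror will normally need much less space at first.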

We use and recommend the following configuration for a PhysioNet web server:

All of the software needed is freely available open-source software. The cost of suitable hardware can be under US$500; of course it is possible to spend considerably more. If you cannot or do not wish to run Apache under Linux, many other configurations are possible, but we will not be able to help you troubleshoot your setup. Other versions of Linux or Unix, including Mac OS X, should be usable without difficulty. Although we do not recommend or support MS Windows, versions of all of the necessary software are freely available for MS Windows as optional components of Cygwin. The remainder of these notes assumes that you are using Fedora 7.

Currently, 100 to 1000 GB SATA drives are widely available and are usually the least expensive per gigabyte. IDE (PATA) drives can also be used in many older PCs. SATA drives, and any drives larger than 127 GB, may require a controller card in PCs made before 2006. Most current PC motherboards include integrated SATA controllers, but many no longer support IDE drives.

Preparing your site

Please let us know if you encounter any difficulties with this procedure!

If you are updating a mirror created before October 2007: Please replace your copy of /etc/httpd/conf/httpd.conf with a standard version (i.e., one that has not been customized for use with a PhysioNet mirror), since the previously used customizations will interfere with those that are installed as /etc/httpd/conf.d/pnm.conf by the procedure below. Also, please remove the line containing /usr/local/bin/mirror-physionet from your crontab; it will be replaced by a line that runs mirror-update when you run ./install in step 8 of the procedure below.

You will need to have root (administrative) privileges for some of the steps below. Once step 1 is complete, the remaining numbered steps can be finished in 10 to 15 minutes.

  1. If you have not already done so, begin by installing Linux and the additional software listed above, following the instructions provided with your Linux distribution. If any of the additional software is not included in your Linux distribution, download it (use the links above, or your favorite repositories) and install it.
  2. Choose (if necessary, create) a user account that will "own" the mirrored PhysioNet files. Don't choose an account with special privileges, such as root. The rest of these instructions assume that this user's name is "joe". Important: joe's home directory must not be /home/physionet or any subdirectory of /home/physionet.
  3. Create a directory called /home/physionet, owned and writable by "joe" and readable by anyone. If /home's disk partition does not have at least 25 GB free, create the physionet directory on a disk partition that does have at least 25 GB free, and make /home/physionet a symbolic link that points to that directory.
  4. Log in as joe (or as whoever will "own" the mirrored files). While within joe's home directory, run the command
         rsync -a physionet.org::mirror-setup mirror-setup
    
    to verify that you are able to communicate with the PhysioNet master server, and to download a few short files for setting up your mirror, which will go into a subdirectory called mirror-setup within the current directory (e.g., /home/joe/mirror-setup). If mirror-setup doesn't exist already, rsync will create it.
  5. Enter the mirror-setup directory by typing
         cd mirror-setup
    
    If you wish, read through the various files you have downloaded to see how they work, and then run:
         ./configure
    
    The first time you run it, configure will ask a few questions, but it will remember your answers (in mirror.conf) and will not ask them again.
  6. [Optional] Your daily mirror updates will begin at a time when the load on the master PhysioNet server is expected to be low, to limit its impact on other users; configure chooses this time automatically (and tells you what time it has chosen). If that time is inconvenient, change the variables UPHOUR and UPMIN in mirror.conf.
  7. [Optional] If you wish to mirror any of the optional volumes, create a directory for each of them, and be sure there is sufficient free disk space for the files to be mirrored. Warning: Make sure that these directories do not overlap (don't, for example, create them within /home/physionet, or within each other). These directories should be owned by "joe"; they should be world-readable and searchable, and "joe" needs to be able to write in them (i.e., permissions should be 755).

    When the directories are ready, set the variables P2, P3, ... to the names of these directories by editing mirror.conf, and then run configure again.

  8. As root, run
         ./install
    
    in order to schedule daily mirror and WFDB software updates, and purging of the temporary/cache directory. Once you have done this, these processes will begin automatically within 24 hours.
  9. [Optional] If you wish to begin downloading files for the mirror immediately, you can do so (once again as "joe", not as root) by running
         mirror-update -v
    
    The -v option (which may be omitted) causes the updater to report each file transfer as it happens.
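
The directory requirements in steps 3 and 7 can be checked before any files are transferred. This is a sketch under stated assumptions: the function names are hypothetical, and free space is reported in binary gigabytes derived from df's 1024-byte blocks.

```shell
#!/bin/sh
# Preflight helpers for mirror directories (steps 3 and 7).

# Create a volume directory with permissions 755, as step 7 requires.
make_volume_dir() {
    mkdir -p "$1" && chmod 755 "$1"
}

# Print the free space, in whole gigabytes, of the file system holding $1.
free_gb() {
    # df -P gives one POSIX-format line per file system; -k uses
    # 1024-byte blocks. Column 4 of line 2 is the available space.
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# Succeed only if DIR exists, is writable by the current user, and has
# at least MIN_GB gigabytes free.
check_mirror_dir() {
    dir=$1
    min_gb=$2
    [ -d "$dir" ] || { echo "$dir: not a directory" >&2; return 1; }
    [ -w "$dir" ] || { echo "$dir: not writable" >&2; return 1; }
    avail=$(free_gb "$dir")
    if [ "$avail" -lt "$min_gb" ]; then
        echo "$dir: only ${avail} GB free, need ${min_gb} GB" >&2
        return 1
    fi
    echo "$dir: ok (${avail} GB free)"
}

# Example (run as the mirror owner, e.g. "joe"):
#   check_mirror_dir /home/physionet 25
```

Running the check as "joe" rather than as root matters, because root can write almost anywhere and the writability test would then prove nothing.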

It may take several hours (or even days, if your Internet connection is slow) to retrieve the entire PhysioNet web site the first time. If the process is interrupted at any point, simply run mirror-update again, and it will continue from the point where it was interrupted. Subsequent updates (see below) will be much faster; if your mirror is reasonably up-to-date, the mirror script will typically finish in 1 to 10 minutes.
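
Because an interrupted transfer can simply be restarted, the initial download can be wrapped in a retry loop. The sketch below assumes that mirror-update exits with a non-zero status when it is interrupted, which is not stated above; the retry function itself is hypothetical.

```shell
#!/bin/sh
# Re-run a command until it succeeds, up to MAX attempts.
# Usage: retry MAX DELAY_SECONDS command [args...]
retry() {
    max=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        "$@" && return 0
        [ "$attempt" -ge "$max" ] && return 1
        echo "attempt $attempt failed; retrying in ${delay}s" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
    done
}

# Example (as "joe"): up to 5 attempts, 10 minutes apart.
#   retry 5 600 mirror-update -v
```

A generous delay between attempts keeps repeated failures from adding load to the master server.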

When the update is completed, mirror-update sends a report to physionet.org, where it is digested and incorporated into the PhysioNet Mirrors page. The report lists the URL and geographic location of your mirror, and the time at which it was most recently updated, to help PhysioNet visitors choose a suitable mirror.

In order to give you a chance to test and make any necessary adjustments to your new mirror site, it will not appear on the Mirrors page until it has been running for a few days and has been updated at least once.

Search engines and robots.txt

Once your site is listed on the Mirrors page (or linked to from any other public web page), the web spiders of the major search engines, such as Google, will begin indexing it. Typically these spiders will consume a significant amount of bandwidth when they first visit your site, but this will decrease to a much lower amount once your site is fully indexed. You can avoid almost all of this traffic if you wish (for example, if your mirror shares a network connection, or if your total monthly throughput is limited or metered by your ISP). Unless the network bandwidth consumed by the spiders is a problem, don't do this (it is useful if users can find pages on your mirror using a search engine, after all).

You can exclude most web spider traffic by modifying /home/physionet/html/robots.txt. Before doing this, modify your /usr/bin/mirror-update script by changing the line that reads:

     rsync $RSOPTS physionet.org::physionet /home/physionet

to

     rsync $RSOPTS --exclude robots.txt physionet.org::physionet /home/physionet

(This change will prevent the daily updates from replacing your customized robots.txt with the original one.)
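
The same change can be scripted with sed. This is a sketch, assuming the rsync line in your copy of mirror-update matches the original text exactly; the function name is hypothetical, and a backup is kept with an .orig suffix.

```shell
#!/bin/sh
# Add "--exclude robots.txt" to the rsync line of a mirror-update script.
# Keeps a backup as SCRIPT.orig and does nothing if the option is
# already present.
add_robots_exclude() {
    script=$1
    if grep -q -e '--exclude robots.txt' "$script"; then
        return 0
    fi
    cp -p "$script" "$script.orig" || return 1
    # Writing through a redirect keeps the script's owner and mode.
    sed 's|^\([[:space:]]*rsync \$RSOPTS\) \(physionet\.org::physionet /home/physionet\)[[:space:]]*$|\1 --exclude robots.txt \2|' \
        "$script.orig" > "$script"
}

# Example (as root):
#   add_robots_exclude /usr/bin/mirror-update
```

Afterwards, compare the script with its .orig backup to confirm that only the rsync line changed.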

Next, edit (or replace) /home/physionet/html/robots.txt so that its contents are:

User-agent: *
Disallow: /

Make sure that the edited version has the same ownership and file access permissions as the original (owned by "joe", readable by anyone).
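
One way to keep the ownership and permissions unchanged is to overwrite the existing file in place instead of replacing it with a new one. A minimal sketch follows; the function name is hypothetical.

```shell
#!/bin/sh
# Overwrite an existing robots.txt in place. Redirecting into an
# existing file truncates and rewrites it without creating a new file,
# so its owner and permission bits stay as they were.
write_blocking_robots() {
    robots=$1
    [ -f "$robots" ] || { echo "$robots: no such file" >&2; return 1; }
    printf 'User-agent: *\nDisallow: /\n' > "$robots"
}

# Example (as "joe"):
#   write_blocking_robots /home/physionet/html/robots.txt
#   ls -l /home/physionet/html/robots.txt
```

The ls -l output should still show "joe" as the owner and read permission for everyone.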

The robots.txt protocol is advisory, not mandatory, so making this change may not eliminate all traffic from web spiders, but it should greatly reduce that traffic at the very least.

Maintaining your mirror site

Very little, if any, maintenance is required once your mirror site has been established as described above. Your web server will begin generating access logs, which will be rotated periodically and eventually discarded. If your logs are stored on a file system with little free space, you may need to clear them manually if it fills up. (The names and locations of the logs are usually specified in httpd.conf. Under Fedora Linux, rotation and disposal of old log files is handled by the logrotate utility, run periodically by cron; no special setup is required unless you have renamed the log files in httpd.conf or wish to keep old log files for more than four weeks.)
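
A quick way to see whether the log file system is close to full is to read the usage percentage from df. This is a sketch: the function names are hypothetical, and /var/log/httpd is only an assumed log directory, so check the paths in your httpd.conf.

```shell
#!/bin/sh
# Print the percentage of space in use on the file system holding $1.
used_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Return non-zero, with a warning, when usage reaches THRESHOLD percent
# (default 90).
warn_if_full() {
    dir=$1
    threshold=${2:-90}
    used=$(used_percent "$dir")
    [ -n "$used" ] || return 2
    if [ "$used" -ge "$threshold" ]; then
        echo "warning: $dir is on a file system that is ${used}% full" >&2
        return 1
    fi
    return 0
}

# Example (the log directory is an assumption):
#   warn_if_full /var/log/httpd 90
```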

PhysioNet itself is growing, and you should occasionally check to see that your mirror site has room to grow with PhysioNet. Given that the cost of a gigabyte of disk storage is continually decreasing as density and speed are continually increasing, there is little reason to purchase more storage than you will need in the next six months to one year.

If you wish to begin mirroring an optional PhysioNet volume, and you have not previously mirrored any of the optional volumes, it is best to fetch a fresh copy of the PhysioNet mirror kit (see above). Edit mirror.conf (which will not have been altered by fetching a fresh kit), and run configure and install again as above. If you do this, please avoid changing your mirror's hostname, location, and maintainer as recorded in mirror.conf, so that your e-mail notifications will continue to be properly recognized by the master PhysioNet server.

If you wish to move a mirror to another host, or simply to discontinue mirroring for whatever reason, use crontab -e to edit your crontab and remove the mirror-update command line. Your mirror will be removed from the Mirrors page after a few days of inactivity.
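
As an alternative to editing by hand, the crontab can be filtered non-interactively. The sketch below uses a hypothetical filter function and assumes you run it as the user whose crontab holds the mirror-update line.

```shell
#!/bin/sh
# Filter for a crontab listing: drop every line that mentions
# mirror-update. grep exits with status 1 when no lines remain, which
# is not an error here.
drop_mirror_update() {
    grep -v 'mirror-update' || [ $? -eq 1 ]
}

# Preview the result first, then install it:
#   crontab -l | drop_mirror_update
#   crontab -l | drop_mirror_update | crontab -
```

Previewing first is worthwhile, because the second pipeline replaces the whole crontab with whatever the filter prints.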

Updated Saturday, 29-Dec-2007 11:04:14 EST