{ "cells": [ { "cell_type": "markdown", "id": "97a89d6c-4f8e-4597-a077-692a13d51756", "metadata": {}, "source": [ "# `geochemdb` guide" ] }, { "cell_type": "code", "execution_count": 1, "id": "aec30348-8244-4f12-af3b-653d07575260", "metadata": {}, "outputs": [], "source": [ "import geochemdb" ] }, { "cell_type": "markdown", "id": "f6f2fc84-9195-46ac-88be-5b8c190f2b7e", "metadata": {}, "source": [ "This notebook provides example utilization of the module with some example zircon laser ablation data reduced by Iolite 4. You will also need to have [iolite_tools](https://github.com/sarttiso/iolite_tools) to facilitate processing the example dataset into dataframes that are ready for `geochemdb`.\n", "\n", "This notebook assumes that you are working from the preconfigured example database `geochemdb_example.db`, which is in the *example_data* directory. \n", "\n", "For use with your own data, you will also need to have properly configured a database. The [schema.sql](schema.sql) is available in the repository. For your own database, you can use the schema to create an empty database file with the correct tables and relationships by running \n", "\n", "`sqlite3 your-geochem-database.db < schema.sql` \n", "\n", "at a command line after having installed sqlite3 (which can be done via anaconda). \n", "\n", "Alternatively, you can copy `geochemdb_example.db` and edit it using a program like the free [SQLiteStudio](https://sqlitestudio.pl/)." ] }, { "cell_type": "code", "execution_count": 2, "id": "09a979e3-662f-49e2-8a57-8336a25d3030", "metadata": {}, "outputs": [], "source": [ "import iolite_tools" ] }, { "cell_type": "markdown", "id": "e2c24837-267f-4895-85da-f98d037196af", "metadata": {}, "source": [ "## Create `GeochemDB` Object" ] }, { "cell_type": "markdown", "id": "ad02b992-b5cd-49a2-9a08-fedf4e0788b7", "metadata": {}, "source": [ "The core class for the `geochemdb` module is `GeochemDB`, which generates an object that allows you to interface with a geochemical SQLite database.\n", "\n", "The `example_data` directory can be found at the root of the repository on [Github](https://github.com/sarttiso/geochemdb)." ] }, { "cell_type": "code", "execution_count": 3, "id": "415edb37-3268-4282-a90e-19ca658a08ae", "metadata": {}, "outputs": [], "source": [ "database_path = 'example_data/geochemdb_example.db'\n", "db = geochemdb.GeochemDB(database_path)" ] }, { "cell_type": "markdown", "id": "3c78e24c-334f-4688-979c-39d7dab7dd02", "metadata": {}, "source": [ "This object allows you to \n", "- add\n", "- update \n", "- retrieve\n", "\n", "measurements from the database." ] }, { "cell_type": "markdown", "id": "c94df893-3e8e-46e4-a8bd-ba4d81e1f6ed", "metadata": {}, "source": [ "This guide walks through each of these steps." ] }, { "cell_type": "markdown", "id": "e5df13a2-4a70-4351-bac8-37a07458b10b", "metadata": {}, "source": [ "## Adding Measurements" ] }, { "cell_type": "markdown", "id": "7b789fc8-9bf9-403d-9bb1-886457975214", "metadata": {}, "source": [ "`GeochemDB.measurements_add()` adds new measurements to the *Measurements* table of the database." ] }, { "cell_type": "markdown", "id": "20f0b801-8588-4b70-8ad7-927ddd311a51", "metadata": {}, "source": [ "**Note:** *`GeochemDB.measurements_add()` does not add missing samples to the `Samples` table.* Samples should be added to the database manually with all of their corresponding metadata." ] }, { "cell_type": "markdown", "id": "f41eb2f8-da2a-41af-a9e7-a4614d5b58d9", "metadata": {}, "source": [ "To add measurements to the database, you have to provide three dataframes:\n", "1. **measurements** : This dataframe contains geochemical measurements as rows, alongside analysis names. Measurements are the values of geochemical concentrations or ratios, along with their uncertainties.\n", "2. **analyses** : This dataframe lists analyses and metadata. An analysis refers to measurement of an aliquot by a machine. All analyses referenced in measurements should be present here. No other analyses should be present.\n", "3. **aliquots** : This dataframe contains aliquots to which analyses belong. An aliquot is a portion of a sample. All aliquots referenced in the analsyes dataframe should be present here. No other aliquots should be present. " ] }, { "cell_type": "markdown", "id": "77fa77b8-54cc-425d-9360-9ec4c8dc7375", "metadata": {}, "source": [ "In the example here, measurements are trace element, U, and Pb concentrations in zircons via laser ablation. These measurements were gathered via the split-stream configuration at the UCSB petrochronology lab. This procedure takes the aerosol generated via laser ablation and splits it, delivering portions to different mass spectrometers (one to measure trace elements, another U and Pb). \n", "\n", "In this case, an aliquot refers to the aerosol generated by the laser, which is designated by a spot number (e.g., *S_z48*). Each aliquot has two analyses: one for trace elements and another for U/Pb. Each analysis has as many measurements as the mass spectrometers were configured to collect." ] }, { "cell_type": "markdown", "id": "5aa3892d-8066-4307-804a-3d5320a94fc6", "metadata": {}, "source": [ "The code below prepares the necessaary dataframes using example LASS-ICPMS datasets. I've developed `iolite_tools` to facilitate processing these reduced data from excel spreadsheets." ] }, { "cell_type": "code", "execution_count": 4, "id": "c7e44b54-e57c-432f-bcf2-5b5d7722059e", "metadata": {}, "outputs": [], "source": [ "df = iolite_tools.excel2measurements('example_data/2023-03_run-5_trace.xlsx',\n", " '2023-03', 5, 'trace')\n", "\n", "df_measurements = iolite_tools.measurements2sql(df, refmat='91500')\n", "df_analyses = iolite_tools.analyses2sql(df, date='2023-03-17',\n", " instrument='Nu Plasma 3D',\n", " technique='LASS ICPMS')\n", "df_aliquots = iolite_tools.aliquots2sql(df, material='zircon')" ] }, { "cell_type": "markdown", "id": "b4947814-62a7-44cb-8728-e5f5eca2b562", "metadata": {}, "source": [ "While `iolite_tools` is specifically designed for this particular type of data, the structure of the dataframes above is what is necessary for adding data to the database. As long as you can produce these dataframes from your own reduced datasets, then you will be able to add your own data." ] }, { "cell_type": "markdown", "id": "d44b5d51-85d5-419f-9a5d-a93f433e7ea8", "metadata": {}, "source": [ ":::{note}\n", "You will have to make sure that the auxiliary tables have been appropriately populated with the relevant values before trying to add measurements. These tables include *MeasurementUnits*, *Quantities*, *QuantitiesMeasurementUnits*, *Instruments*, *Techniques*, and *Materials*. See [readme](readme.md) for more information. You will notice that the columns in the dataframes generated above reference these types of information, which is all about facilitating the metadata collection and standardization that is often lacking. The `geochem_example.db` database already has the values added, but you may have to add your own depending on your data collection and reduction practices.\n", ":::" ] }, { "cell_type": "markdown", "id": "c727042f-07cc-4338-9cd6-7a88ab6aa676", "metadata": {}, "source": [ "Now that we have made the *measurements*, *analyses*, and *aliquots* dataframes, we can use `db.measurements_add()` to add them to the database.\n", "\n", "Prior to running this function, however, you can check to see what samples will be matched based on the strings in the dataframes. The code uses fuzzy string matching on sample names, so it can be convenient to run `db.matchrows_strings()` to ensure that samples are being matched appropriately." ] }, { "cell_type": "code", "execution_count": 5, "id": "b33324b4-9d7b-4aa6-960f-99426bf9d4d8", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(array([False, False, True, False, False, False]),\n", " {'AT2310 16 4': 'AT2310 16.4'})" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "db.matchrows_strings('Samples', df_aliquots['sample'].unique(), 'name', score_threshold=95)" ] }, { "cell_type": "markdown", "id": "7ab4daf3-3c53-4061-af9c-adad75401834", "metadata": {}, "source": [ "We can see that only sample AT2310 16.4 is matched since it's the only one in the database. If samples do not match, then they either are not present in the database and need to be added, or the sample names in the dataframes differ too much from the sample names in the database. The degree of string similarity for the fuzzy name matching is controlled by the `score_threshold` parameter, which can be decreased to match more dissimilar strings. In this case, a threshold of 95 is sufficient to match 'AT2310 16 4' to 'AT2310 16.4'." ] }, { "cell_type": "markdown", "id": "b5157687-47de-4b96-bf9a-8e4390691886", "metadata": {}, "source": [ "Once you're satisfied that the dataframes are ready and sample names will match, you can add the measurements to the database." ] }, { "cell_type": "code", "execution_count": 6, "id": "46150a18-e936-4baf-a6cb-08205869a4df", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sample names not matched:\n", "{'91500', 'Ples', 'GJ1', 'AT22 FRAN 17 29', 'NIST612'}\n", "Added:\n", "122 aliquots,\n", "122 analyses,\n", "3416 measurements\n" ] } ], "source": [ "db.measurements_add(df_measurements, df_analyses, df_aliquots)" ] }, { "cell_type": "markdown", "id": "ce66950e-72a3-4007-94fd-7582d0e3198b", "metadata": {}, "source": [ "## Updating Measurements" ] }, { "cell_type": "markdown", "id": "acaa0ac7-56f6-4f14-928a-9acb77779e57", "metadata": {}, "source": [ "You might find that you've improved your data reduction at some point in the future, and you have new values for measurements from existing analyses. If you have already added these measurements to the database, you can update them using `db.measurements_update()`. Note that analysis names must be identical to the existing analyses in the database." ] }, { "cell_type": "code", "execution_count": 7, "id": "810fcfe2-90d5-4457-a9d3-d0c7335a6520", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Updated:\n", "3416 measurements\n" ] } ], "source": [ "db.measurements_update(df_measurements)" ] }, { "cell_type": "markdown", "id": "564dfb46-690a-4cdb-b45a-ac27b14142a1", "metadata": {}, "source": [ "## Get Measurements by Sample" ] }, { "cell_type": "markdown", "id": "87d4fcae-0369-43f5-a729-c4b865764220", "metadata": {}, "source": [ "Retrieving measurements from the database can be done with `db.measurements_by_sample()`." ] }, { "cell_type": "code", "execution_count": 8, "id": "0bc86fcc-b95d-4da4-98d4-2d995fb8a0d0", "metadata": {}, "outputs": [], "source": [ "df_measurements = db.measurements_by_sample('AT2310 16.4')" ] }, { "cell_type": "markdown", "id": "4babc5dc-99db-40fc-aa8b-d2b03f5fe8e0", "metadata": {}, "source": [ "This function returns a dataframe with all measurements as rows, whereas most end-users will likely want measured quantities in columns. The `geochemdb.aliquot_average()` function generates such a dataframe by averaging aliquots with repeat measurements of any quantities." ] }, { "cell_type": "code", "execution_count": 9, "id": "5108f67f-b37f-4852-8455-495281e86be2", "metadata": {}, "outputs": [], "source": [ "df_aliquots = geochemdb.aliquot_average(df_measurements)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.13" } }, "nbformat": 4, "nbformat_minor": 5 }