Data Preparation

So far, out of your assigned tasks for The Guaviare Project, you have only become familiar with the data collected by your fellow researchers. In this section you will learn how to get pamflow to read this data so that it can be standardized and more information can be extracted from it.

Get pamflow to read input data

The first step towards using pamflow is to tell it where the audio_root_directory is located. When you installed the project as explained in the Setup page, you ended up with this folder structure:

kedroPamflow/
├── conf/                # Configuration files (catalog, parameters, etc.)
├── data/                # Data directory (raw, intermediate, processed, etc.)
├── docs/                # Documentation files
├── logs/                # Logs generated during pipeline runs
├── notebooks/           # Jupyter notebooks for exploration and prototyping
├── src/                 # Source code for the project
│   ├── kedroPamflow/    # Main package containing pipelines and utilities
│   └── tests/           # Unit and integration tests
├── .gitignore           # Git ignore file
├── README.md            # Project overview and setup instructions
├── requirements.txt     # Python dependencies
└── setup.py             # Installation script for the project

To hand your input files over to pamflow you only need two of these folders, namely data/ and conf/. Let's focus on conf/ first, which is where you tell pamflow about your audio_root_directory. Inside conf/ you will find this folder structure:

conf/
├── local/               
│   ├── parameters.yml
│   └──   
└── 

Now open the file conf/local/parameters.yml and write the path to the audio_root_directory. The external disk provided to you is called guaviare_project_external_disk, and inside it is the folder you got familiar with in the previous section, called pam_data_guaviare. Thus, once you have changed it, the conf/local/parameters.yml file should look like this:

audio_root_directory: "/media/pamResearcher/guaviare_project_external_disk/pam_data_guaviare"
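Before running anything, it can be worth confirming that the path you wrote actually points at a folder containing recordings. The following is a minimal sketch, not part of pamflow: the helper name `count_wav_files` is hypothetical, and it simply walks the folder and counts files with a .WAV extension in any letter case.

```python
from pathlib import Path


def count_wav_files(audio_root_directory: str) -> int:
    """Count .WAV files (any letter case) anywhere under the given folder.

    Hypothetical helper for a quick sanity check; pamflow does not provide it.
    """
    root = Path(audio_root_directory)
    if not root.is_dir():
        raise FileNotFoundError(f"audio_root_directory not found: {root}")
    return sum(
        1
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() == ".wav"
    )


# Example call, using the path written in parameters.yml:
# count_wav_files(
#     "/media/pamResearcher/guaviare_project_external_disk/pam_data_guaviare"
# )
```

A `FileNotFoundError` here usually means the external disk is not mounted or the path contains a typo. The returned count can later be compared with the number of rows in media@pamDP.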

Now, to provide pamflow with your custom field_deployments_sheet and target_species, go to the data/ folder, which should look like this:

data/
├── input/                       # Folder containing all the input data
│   ├── field_deployments/       # Folder containing field_deployments_sheet 
│   └── target_species/          # Folder containing target_species
└── output/                      # Folder containing all outputs

Following this structure, copy field_deployments_sheet to the path data/input/field_deployments/field_deployments_sheet.xlsx and target_species to the path data/input/target_species/target_species.csv.

⚠️ Warning: Ensure the field_deployments_sheet and target_species files are in the correct format. This means checking that the files are named properly: field_deployments_sheet.xlsx and target_species.csv. Also make sure that the columns are properly named and that the information in field_deployments_sheet.xlsx is stored in a sheet called Plantilla Usuario.
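The file-name and sheet-name checks in the warning can be automated. The sketch below is an illustration under stated assumptions, not a pamflow feature: the function names `sheet_names` and `check_inputs` are hypothetical, and the folder names are taken from the data/ tree shown above. It uses only the standard library, reading the sheet names directly from the workbook part of the .xlsx archive. Column names are not checked here.

```python
import xml.etree.ElementTree as ET
import zipfile
from pathlib import Path

# XML namespace of the workbook part inside an .xlsx file (Office Open XML).
_MAIN_NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def sheet_names(xlsx_path):
    """Return the sheet names stored in an .xlsx workbook."""
    with zipfile.ZipFile(xlsx_path) as archive:
        root = ET.fromstring(archive.read("xl/workbook.xml"))
    return [sheet.get("name") for sheet in root.iter(f"{_MAIN_NS}sheet")]


def check_inputs(data_dir="data"):
    """Return a list of problems found in the input files (empty if none).

    Assumes the folder layout shown in the data/ tree of this page.
    """
    input_dir = Path(data_dir) / "input"
    sheet_file = input_dir / "field_deployments" / "field_deployments_sheet.xlsx"
    species_file = input_dir / "target_species" / "target_species.csv"

    problems = []
    for path in (sheet_file, species_file):
        if not path.is_file():
            problems.append(f"missing file: {path.as_posix()}")

    # Only inspect the workbook if it is actually there.
    if sheet_file.is_file() and "Plantilla Usuario" not in sheet_names(sheet_file):
        problems.append("sheet 'Plantilla Usuario' not found in the workbook")
    return problems
```

Calling `check_inputs()` from the project root returns an empty list when both files are in place and the expected sheet exists.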

Now that your data is properly stored, you can use pamflow to complete your assigned tasks.

Standardized metadata from each audio and each sensor

You already got familiar with the provided data and handed it over to pamflow. Now you are ready to complete your second task: Extract metadata from each audio file and each passive acoustic sensor.

Now that pamflow has access to the audio_root_directory and field_deployments_sheet, we can ask it to generate the media@pamDP and deployments@pamDP formats. The former is a .csv file containing one row per .WAV file in the audio_root_directory, with important information about each audio file. The latter contains information about each deployed sensor. The content, schema and structure of these datasets are further explained in the Data Exchange Formats section. These formats are the baseline for the rest of the processes carried out through pamflow.
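To make the "one row per .WAV file" idea concrete, here is a small illustrative sketch of how a recording's file name can map to some media@pamDP fields. It is not pamflow's actual implementation: the function `parse_recording_name` is hypothetical, and the naming pattern `<deploymentID>_<YYYYMMDD>_<HHMMSS>.WAV` is an assumption inferred from the example rows shown later on this page.

```python
from datetime import datetime
from pathlib import Path


def parse_recording_name(file_name):
    """Split a recording file name into mediaID, deploymentID and timestamp.

    Assumes names like 'MC-013_20240302_070000.WAV' (pattern inferred from
    the example rows on this page, not from pamflow's source code).
    """
    name = Path(file_name).name
    stem = Path(file_name).stem
    # Split from the right so that deployment IDs may contain other characters.
    deployment_id, date_part, time_part = stem.rsplit("_", 2)
    timestamp = datetime.strptime(date_part + time_part, "%Y%m%d%H%M%S")
    return {
        "mediaID": name,
        "deploymentID": deployment_id,
        "timestamp": timestamp.isoformat(),
    }


row = parse_recording_name("MC-013_20240302_070000.WAV")
print(row["deploymentID"], row["timestamp"])  # MC-013 2024-03-02T07:00:00
```

The remaining columns, such as sampleRate and bitDepth, come from the audio file itself rather than from its name, which is why pamflow needs access to the audio_root_directory.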

For generating them, run

kedro run --pipeline data_preparation

The message

INFO     Pipeline execution completed successfully.  

will tell you the process is over and that now you are able to access media@pamDP and deployments@pamDP. They will be stored in

data/
├── input/                        # Folder containing all the input data
└── output/                       # Folder containing all outputs
    └── data_preparation/         # Folder containing outputs of the pipeline data_preparation
        ├── media.csv             # `media@pamDP` file
        └── deployments.csv       # `deployments@pamDP` file

As soon as you open media@pamDP you will find the following information regarding your audio files (along with other columns)

| mediaID | deploymentID | timestamp | filePath | sampleRate | bitDepth | fileLength |
| --- | --- | --- | --- | --- | --- | --- |
| MC-013_20240302_070000.WAV | MC-013 | 2024-03-02T07:00:00 | …/MC-013/MC-013_20240302_070000.WAV | 48000 | 16 | 60.0 |
| MC-013_20240229_063000.WAV | MC-013 | 2024-02-29T06:30:00 | …/MC-013/MC-013_20240229_063000.WAV | 48000 | 16 | 60.0 |
| MC-013_20240304_053000.WAV | MC-013 | 2024-03-04T05:30:00 | …/MC-013/MC-013_20240304_053000.WAV | 48000 | 16 | 60.0 |
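Once the file exists, a quick way to get an overview is to count how many recordings each sensor contributed. The sketch below is only an illustration: the function name `recordings_per_deployment` is hypothetical, and it relies solely on the deploymentID column listed above, read with the standard csv module.

```python
import csv
from collections import Counter


def recordings_per_deployment(media_csv_path):
    """Count the rows of media.csv for each deploymentID."""
    with open(media_csv_path, newline="", encoding="utf-8") as handle:
        return Counter(row["deploymentID"] for row in csv.DictReader(handle))


# Example call from the project root:
# counts = recordings_per_deployment("data/output/data_preparation/media.csv")
# for deployment_id, n_files in sorted(counts.items()):
#     print(deployment_id, n_files)
```

A sensor with far fewer recordings than the others is a good candidate for a closer look.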

As for deployments@pamDP you’ll find a file that looks like this

| deploymentID | locationID | latitude | longitude | deploymentStart | deploymentEnd | recorderModel | habitat |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MC-002 | EL REBALSE | 2.117463 | -72.779575 | 2024-02-15T15:04:45 | 2024-03-06T15:04:45 | AudioMoth v 1.2.0 | Pastos limpios |
| MC-007 | SAN MIGUEL | 2.059644 | -72.920236 | 2024-02-15T15:32:00 | 2024-03-06T15:32:00 | AudioMoth v 1.2.0 | Pastos limpios |
| MC-009 | LA TORTUGA | 2.183335 | -72.987016 | 2024-02-16T20:48:06 | 2024-03-07T20:48:06 | AudioMoth v 1.2.0 | Pastos limpios |
| MC-013 | LA TORTUGA | 2.183335 | -72.987016 | 2024-02-16T20:48:06 | 2024-03-07T20:48:06 | AudioMoth v 1.2.0 | Pastos limpios |
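The deploymentStart and deploymentEnd columns make it easy to derive how long each sensor was in the field. The following is a minimal sketch, assuming the timestamps are ISO 8601 strings as in the rows above; the function name `deployment_durations` is hypothetical and not part of pamflow.

```python
import csv
from datetime import datetime


def deployment_durations(deployments_csv_path):
    """Map each deploymentID to its duration (deploymentEnd - deploymentStart)."""
    durations = {}
    with open(deployments_csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            start = datetime.fromisoformat(row["deploymentStart"])
            end = datetime.fromisoformat(row["deploymentEnd"])
            durations[row["deploymentID"]] = end - start
    return durations


# Example call from the project root:
# durations = deployment_durations("data/output/data_preparation/deployments.csv")
# print(durations["MC-002"].days)
```

For the MC-002 row shown above, the start and end timestamps are exactly 20 days apart.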

In the next section you will learn how to check for sensor behavior and performance.