---
title: "Basics"
slug: "basics"
updated: 2025-11-12T01:04:13Z
published: 2025-11-12T01:04:13Z
---

> ## Documentation Index
> Fetch the complete documentation index at: https://docs.imat.io/llms.txt
> Use this file to discover all available pages before exploring further.

# Basics

Pandas is a part of the SciPy (Scientific Python) technology stack. It allows easy and efficient data processing and visualization. Below are some examples of some uses for Pandas:

- tabular data, consisting of rows and columns similar to SQL tables or spreadsheets
- statistical and other algorithmic analysis
- plotting and visualization
- time series and time-date calculations
- importing and exporting data

There are three primary data structures that Pandas uses:

- series (1 dimensional)
- data frame (2 dimensional)
- panel (3 dimensional)

A series is very much like an array in which there is an index associated with a value. An index of a series can be any data type and not just the position of the array.

Data frames are the primary object Pandas uses, and they are essentially a series of series and are similar to a spreadsheet.

Panels are generally not needed for compound queries but they might be useful in very specific applications.

---

## Get a Data Frame

The primary method of getting a data frame through IMAT is using the `get_hits()` function.

#### Query:

```python
dataframe = get_hits('()UNIVERSE', store='Records', limit=5, fields='extract.patient_id, extract.age, extract.patient_sex')
print(dataframe)
```

#### Response:

|  | extract.patient_id | extract.age | extract.patient_sex |
| --- | --- | --- | --- |
| 0 | 99541 | 2 | male |
| 1 | 99541 | 2 | male |
| 2 | 99541 | 3 | male |
| 3 | 99541 | 2 | male |
| 4 | 99541 | 0 | male |

This data frame pulled out five records from the "Records" store and displayed the requested information about each record. The first column is the index which defaults to the position of each row. The other three columns are the fields requested in the query: patient_id, age and gender.

---

There are other methods to create a data frame, which can be found on the [Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) website.

One example is to create a data frame using a dictionary:

#### Query:

```python
import pandas as pd
dataframe = pd.DataFrame({'col1': [0,1,2,3], 'col2': ['a', 'b', 'c', 'd']})
print(dataframe)
```

#### Response:

|  | col1 | col2 |
| --- | --- | --- |
| 0 | 0 | a |
| 1 | 1 | b |
| 2 | 2 | c |
| 3 | 3 | d |

---

## Access Columns

Some data frames can be quite large and the report writer may wish to view a specific column to check for all values in the column or to use for statistical analysis.

Refer to the first query and response in [Get a Data Frame](/basics#get-a-data-frame); these will be used in the following examples.

Using the previous *dataframe* query, enter the following code:

#### Query:

```python
age=dataframe['extract.age']
print(age)
```

#### Response:

| 0 | 2 |
| --- | --- |
| 1 | 2 |
| 2 | 3 |
| 3 | 2 |
| 4 | 0 |
| Name: extract.age, dtype:int64 |

---

This is returning a series object. To get a Python list object from any series, use `series.tolist()`:

#### Query:

```python
age = dataframe['extract.age'].tolist()
print(age)
```

#### Response:

```python
 [2, 2, 3, 2, 0]
```

---

It is possible to access multiple columns to get a smaller data frame:

#### Query:

```python
dataframe = dataframe[['extract.age', 'extract.patient_sex']]
print(dataframe)
```

#### Response:

|  | extract.age | extract.patient_sex |
| --- | --- | --- |
| 0 | 2 | male |
| 1 | 2 | male |
| 2 | 3 | male |
| 3 | 2 | male |
| 4 | 0 | male |

---

## Iterate Over a Data Frame

Sometimes it is necessary to do a complex operation on each row of the data frame. To do this, use `.iterrows()`, which returns a tuple of the index and the row. Once again, refer to the *dataframe* query and response at the beginning of [Get a Data Frame](/basics#get-a-data-frame).

#### Query:

```python
for index, row in dataframe.iterrows():
    print("INDEX = %s\nROW = %s\n" % (index, row))
```

#### Response:

```python
INDEX = 0
ROW = extract.patient_id  99541
extract.age      2
extract.patient_sex   male
Name: 0, type: object

INDEX = 1
ROW = extract.patient_id  99541
extract.age      2
extract.patient_sex   male
Name: 1, type: object

INDEX = 2
ROW = extract.patient_id  99541
extract.age      3
extract.patient_sex   male
Name: 2, type: object

INDEX = 3
ROW = extract.patient_id  99541
extract.age      2
extract.patient_sex   male
Name: 3, type: object

INDEX = 4
ROW = extract.patient_id  99541
extract.age      0
extract.patient_sex   male
Name: 4, type: object
```

---

## Select Rows

There are many methods of selecting rows from a data frame. The five most common ways are below. Print the function directly, `print(dataframe.iloc[2])`.

- Select by index label:  

```python
dataframe.loc[dataframe['extract.patient_id'].isin([1668])] #will return all rows that have a patient ID of 1668
```
- Select by index location:  

```python
dataframe.iloc[2] #will return the third row (Python starts with 0, not 1)
```
- Select by index label or index location:  

```python
dataframe.ix[] #primarily label-based but will fall back to location based
```
- Select by slices:  

```python
dataframe[0:3] #will return the first three rows(0, 1, 2)
```
- Select by []:  

```python
dataframe[dataframe['extract.patient_id'].isin([1668])] #will work like dataframe.loc["label"]
dataframe[1] #will work like dataframe.iloc[1]
```

When using `.ix[]` or `[]` there can be some confusion between labels and locations. For example, there might be a label called *2* and a location at *2*. In these cases, it is better to be specific using `.loc[]` or `.iloc[]`.

Two useful functions for series (or data frame columns) are `.idxmax()` and `.idxmin()`. They return the index location of the max or min value in the series:

```python
print(dataframe['extract.age'].idxmax())
```

This displays *2* because it is the index of the row where the extract age is 3, which is the largest age value.

---

## Sort

After the original data frame has been established, it is possible to sort the table by any field desired. The following query and response are used for the examples in this section as well as in the [Filter](/basics#filter) and [Remove Duplicates](/basics#remove-duplicates) sections:

#### Query:

```python
dataframe = get_hits('()diabetes', store='Records', limit=5, fields='extract.patient_id, extract.age, extract.patient_sex, extract.patient_full_name')
print(dataframe)
```

#### Response:

|  | extract.patient_id | extract.age | extract.patient_sex | extract.patient_full_name |
| --- | --- | --- | --- | --- |
| 0 | 94451 | 43 | female | Gail Hill |
| 1 | 94451 | 43 | female | Gail Hill |
| 2 | 92830 | 36 | female | Rhoda Sweet |
| 3 | 92721 | 13 | female | Jescie Lyons |
| 4 | 92570 | 70 | male | Lyle Cash |

---

Given the data frame above, it may be desirable to sort by the patient_full_name column. Add this code after the query above has been written:

```python
dataframe = dataframe.sort_values(by=['extract.patient_full_name'])
print(dataframe)
```

#### Response:

|  | extract.patient_id | extract.age | extract.patient_sex | extract.patient_full_name |
| --- | --- | --- | --- | --- |
| 0 | 94451 | 43 | female | Gail Hill |
| 1 | 94451 | 43 | female | Gail Hill |
| 3 | 92721 | 13 | female | Jescie Lyons |
| 4 | 92570 | 70 | male | Lyle Cash |
| 2 | 92830 | 36 | female | Rhoda Sweet |

---

It is also possible to sort in descending order:

```python
dataframe = dataframe.sort_values(by=['extract.patient_full_name'], ascending=False)
print(dataframe)
```

#### Response:

|  | extract.patient_id | extract.age | extract.patient_sex | extract.patient_full_name |
| --- | --- | --- | --- | --- |
| 2 | 92830 | 36 | female | Rhoda Sweet |
| 4 | 92570 | 70 | male | Lyle Cash |
| 3 | 92721 | 13 | female | Jescie Lyons |
| 0 | 94451 | 43 | female | Gail Hill |
| 1 | 94451 | 43 | female | Gail Hill |

---

## Filter

Filtering is both very common and very easy to do with Pandas. Filtering can be done by any field in the data frame. Remember to use proper Python comparators and quotes around any string. Use the same base query and responses from the [Sort](/basics#sort) section for the following examples.

To filter the data frame to print only those that are over 40 years old, use the following code:

#### Query:

```python
dataframe = dataframe[dataframe['extract.age'] > 40]
print(dataframe)
```

#### Response:

|  | extract.patient_id | extract.age | extract.patient_sex | extract.patient_full_name |
| --- | --- | --- | --- | --- |
| 0 | 94451 | 43 | female | Gail Hill |
| 1 | 94451 | 43 | female | Gail Hill |
| 4 | 92570 | 70 | male | Lyle Cash |

To understand how this works, examine the inner bracket expression, `dataframe['extract.age'] &gt; 40`. This is essentially the same as writing the code like this:

```python
above40=dataframe['extract.age'] > 40
print(above40)
```

#### Response:

| 0 | True |
| --- | --- |
| 1 | True |
| 2 | False |
| 3 | False |
| 4 | True |
| Name: extract.age, type: bool |

This is returning a series which shows the index of the row correlating to that index matched the Boolean expression `&nbsp;&gt; 40`.

The expression could be rewritten as `dataframe = dataframe[above40]`.

This displays each index and row where the index of *above40* is True.

---

## Remove Duplicates

Pandas has a `.drop_duplicates()` function which can easily remove duplicate information.

Use the `.drop_duplicates()` function without any parameters to remove only complete duplicates:

```python
dataframe = dataframe.drop_duplicates()
print(dataframe)
```

#### Response:

|  | extract.patient_id | extract.age | extract.patient_sex | extract.patient_full_name |
| --- | --- | --- | --- | --- |
| 0 | 94451 | 43 | female | Gail Hill |
| 2 | 92830 | 36 | female | Rhoda Sweet |
| 3 | 92721 | 13 | female | Jescie Lyons |
| 4 | 92570 | 70 | male | Lyle Cash |

Notice the second occurrence of Gail Hill has been removed.

---

It is also possible to select duplicates based on columns. In this example, all rows with a gender of *female* is considered a duplicate and can be removed. To do so, the `subset` parameter must be used:

```python
dataframe = dataframe.drop_duplicates(subset=['extract.patient_sex'])
print(dataframe)
```

#### Response:

|  | extract.patient_id | extract.age | extract.patient_sex | extract.patient_full_name |
| --- | --- | --- | --- | --- |
| 0 | 94451 | 43 | female | Gail Hill |
| 4 | 92570 | 70 | male | Lyle Cash |

---

## Remove NaN Values

NaN (**N**ot **a** **N**umber) Values are used to specify that a value is missing, frequently NaNs need to be removed or substituted for certain algorithms.

#### Query:

```python
dataframe = get_hits('()diabetes', store='Records', limit=5, fields='extract.patient_id, extract.age, extract.patient_full_name, extract.diagnosis')
print(dataframe)
```

#### Response:

|  | extract.patient_id | extract.age | extract.patient_full_name | extract.diagnosis |
| --- | --- | --- | --- | --- |
| 0 | 94451 | 43 | Gail Hill | NaN |
| 1 | 94451 | 43 | Gail Hill | NaN |
| 2 | 92830 | 36 | Rhoda Sweet | NaN |
| 3 | 92721 | 13 | Jescie Lyons | NaN |
| 4 | 92570 | 70 | Lyle Cash | NaN |

---

Use the `.dropna()` function to remove rows that contain a single NaN value or if every value is NaN:

```python
dataframe = dataframe.dropna(how='any')
print(dataframe)
```

#### Response:

```python
Columns: [extract.patient_id, extract.age, extract.patient_full_name, extract.diagnosis]
Index: []
```

This removed all rows because they all contained NaN.

---

The `.dropna()` function can also be used with columns, but parameters must be used:

```python
dataframe = dataframe.dropna(how='any', axis=1)
print(dataframe)
```

#### Response:

|  | extract.patient_id | extract.age | extract.patient_full_name |
| --- | --- | --- | --- |
| 0 | 94451 | 43 | Gail Hill |
| 1 | 94451 | 43 | Gail Hill |
| 2 | 92830 | 36 | Rhoda Sweet |
| 3 | 92721 | 13 | Jescie Lyons |
| 4 | 92570 | 70 | Lyle Cash |

---

Sometimes it is necessary to replace the NaN values with another value, such as a negative number or 0. Below *Not Found* replaces all NaN values.

```python
dataframe = dataframe.fillna(value="Not Found")
print(dataframe)
```

#### Response:

|  | extract.patient_id | extract.age | extract.patient_full_name | extract.diagnosis |
| --- | --- | --- | --- | --- |
| 0 | 94451 | 43 | Gail Hill | Not Found |
| 1 | 94451 | 43 | Gail Hill | Not Found |
| 2 | 92830 | 36 | Rhoda Sweet | Not Found |
| 3 | 92721 | 13 | Jescie Lyons | Not Found |
| 4 | 92570 | 70 | Lyle Cash | Not Found |

There are also methods, such as forward fill, which will use the last found value.

---

## Export

Pandas makes exporting the data simple. The export functions all start with `to_`. The most common exports used in IMAT are `to_csv`, `to_html`, and `to_json`.

---

## Index

Sometimes a user needs to make one of the columns in a data frame into the index; for example, dates and IDs are common index columns.

The `set_index()` function sets the index:

```python
dataframe = dataframe.set_index('extract.patient_id')
```

| extract.patient_id | extract.age | extract.patient_full_name | extract.weight_standard |
| --- | --- | --- | --- |
| 94451 | 43 | Gail Hill | 170.0 |
| 94451 | 43 | Gail Hill | 170.0 |
| 92830 | 36 | Rhoda Sweet | 220.5 |
| 92721 | 13 | Jescie Lyons | 138.0 |
| 92570 | 70 | Lyle Cash | 257.0 |

---

The `reset_index()` does the opposite of `set_index()` and moves the index column into a new column and uses a sequential index.

```python
dataframe = dataframe.reset_index()
```

| extract.age | extract.patient_full_name | extract.patient_id | extract.weight_standard |
| --- | --- | --- | --- |
| 0 | 43 | Gail Hill | 94451 | 170.0 |
| 1 | 43 | Gail Hill | 94451 | 170.0 |
| 2 | 36 | Rhoda Sweet | 92830 | 220.5 |
| 3 | 13 | Jescie Lyons | 92721 | 138.0 |
| 4 | 70 | Lyle Cash | 92570 | 257.0 |
