- 16 Oct 2021
- 2 Minutes to read
- Print
- DarkLight
Rows and Columns
- Updated on 16 Oct 2021
- 2 Minutes to read
- Print
- DarkLight
At times it becomes necessary to manipulate the data in columns and rows to clean up data or make accessing the data more efficient.
Apply
The .apply()
function applies a function to every row or column in a data frame. There are two main applications for using .apply()
:
- inplace changes
- new column creation
An inplace change generally involves cleaning up the data or converting the data type. For example, there is an extra comma found in the full_name column in the table below. It is possible to remove that comma using an inplace change.
extract.age | extract.patient_full_name | extract.patient_id | |
---|---|---|---|
0 | 43 | Gail Hill, | 94451 |
1 | 43 | Gail Hill, | 94451 |
2 | 36 | Rhoda Sweet, | 92830 |
3 | 13 | Jescie Lyons, | 92721 |
4 | 70 | Lyle Cash, | 92570 |
To remove the comma, we can create a custom function and use .apply()
:
dataframe['extract.patient_full_name'] = dataframe['extract.patient_full_name'].apply(lambda value: value.replace(',', ''))
The code above is applying a lambda function, which removes commas from the extract.patient_full_name column and then overwrites the old extract.patient_full_name column.
extract.age | extract.patient_full_name | extract.patient_id | |
---|---|---|---|
0 | 43 | Gail Hill | 94451 |
1 | 43 | Gail Hill | 94451 |
2 | 36 | Rhoda Sweet | 92830 |
3 | 13 | Jescie Lyons | 92721 |
4 | 70 | Lyle Cash | 92570 |
Another use for .apply
is to convert columns to numeric, date, or object (string) types by including the line import pandas as pd
at the beginning of the report. By converting strings to numeric objects, the values can be used in mathematical functions.
import pandas as pd
...
dataframe = dataframe.apply(pd.to_numeric, errors="ignore")
Splitting Columns
Splitting a column can be useful if a column contains multiple pieces of information or if a complex operation needs an intermediate calculation for simplicity or efficiency.
In the above table, it may be desirable to split the full name into two columns for the first and last names. To do so, use the .apply()
function once again:
dataframe['extract.patient_first_name'] = dataframe['extract.patient_full_name'].apply(lambda value: value.split()[0])
dataframe['extract.patient_last_name'] = dataframe['extract.patient_full_name'].apply(lambda value: value.split()[1])
extract.age | extract.patient_full_name | extract.patient_id | extract.patient_first_name | extract.patient_last_name | |
---|---|---|---|---|---|
0 | 43 | Gail Hill | 94451 | Gail | Hill |
1 | 43 | Gail Hill | 94451 | Gail | Hill |
2 | 36 | Rhoda Sweet | 92830 | Rhoda | Sweet |
3 | 13 | Jescie Lyons | 92721 | Jescie | Lyons |
4 | 70 | Lyle Cash | 92570 | Lyle | Cash |
Combining Columns
It is also possible to combine columns. In the example below, the first name and patient name concatenate to become a single column:
dataframe['nameId'] = dataframe.apply(lambda row: row['extract.patient_first_name'] + str(row['extract.patient_id']), axis=1)
This time the .apply()
function is working across the entire row so that multiple columns can be used.
extract.age | extract.patient_full_name | extract.patient_id | extract.patient_first_name | extract.patient_last_name | nameId | |
---|---|---|---|---|---|---|
0 | 43 | Gail Hill | 94451 | Gail | Hill | Gail94451 |
1 | 43 | Gail Hill | 94451 | Gail | Hill | Gail94451 |
2 | 36 | Rhoda Sweet | 92830 | Rhoda | Sweet | Rhoda92830 |
3 | 13 | Jescie Lyons | 92721 | Jescie | Lyons | Jescie92721 |
4 | 70 | Lyle Cash | 92570 | Lyle | Cash | Lyle92570 |
Useful Column Functions
The list below contains some of the more useful pre-built functions available when using a data frame. Either assign a variable to these functions, dataframe_sum = dataframe['extract.age'].sum()
or print the function directly, print(dataframe['extract.age'].sum())
.
- mean
dataframe['extract.age'].mean()
returns 41.0 - sum
dataframe['extract.age'].sum()
returns 205 - unique
dataframe['extract.age'].unique()
returns [43 36 13 70] - tolist
dataframe['extract.age'].tolist()
returns [43, 43, 36, 13, 70] - describe
dataframe['extract.age'].describe()
returns the following information:count 5.000000 mean 41.000000 std 20.359273 min 13.000000 25% 36.000000 50% 43.000000 75% 43.000000 max 70.000000 Name: extract.age, dtype: float64
dataframe.describe()
returns the following table:extract.age extract.patient_id count 5.000000 5.000000 mean 41.000000 93404.600000 std 20.359273 959.679269 min 13.000000 92570.000000 25% 36.000000 92721.000000 50% 43.000000 93830.000000 75% 43.000000 94451.000000 max 70.000000 94451.000000
Stack and Unstack
The .stack()
function creates a multi-index based on the columns. To print the function, use print(dataframe.stack())
, where dataframe is the variable assigned to the query.
0 | extract.age | 43 |
extract.patient_full_name | Gail Hill | |
extract.patient_id | 94451 | |
extract.patient_first_name | Gail | |
extract.patient_last_name | Hill | |
nameId | Gail94451 | |
1 | extract.age | 43 |
extract.patient_full_name | Gail Hill | |
extract.patient_id | 94451 | |
extract.patient_first_name | Gail | |
extract.patient_last_name | Hill | |
nameId | Gail94451 | |
2 | extract.age | 36 |
extract.patient_full_name | Rhoda Sweet | |
extract.patient_id | 92830 | |
extract.patient_first_name | Rhoda | |
extract.patient_last_name | Sweet | |
nameId | Rhoda92830 | |
3 | extract.age | 13 |
extract.patient_full_name | Jescie Lyons | |
extract.patient_id | 92721 | |
extract.patient_first_name | Jescie | |
extract.patient_last_name | Lyons | |
nameId | Jescie92721 | |
4 | extract.age | 70 |
extract.patient_full_name | Lyle Cash | |
extract.patient_id | 92570 | |
extract.patient_first_name | Lyle | |
extract.patient_last_name | Cash | |
nameId | Lyle92570 | |
dtype: object |
dataframe.loc[0, 'extract.age']
returns 43
The .unstack()
function produces the opposite effect of the .stack()
function.
Transposing Columns to Rows
The transpose()
function switches the rows and columns.
extract.age | extract.patient_full_name | extract.patient_id | extract.patient_first_name | extract.patient_last_name | nameId | |
---|---|---|---|---|---|---|
0 | 43 | Gail Hill | 94451 | Gail | Hill | Gail94451 |
1 | 43 | Gail Hill | 94451 | Gail | Hill | Gail94451 |
2 | 36 | Rhoda Sweet | 92830 | Rhoda | Sweet | Rhoda92830 |
3 | 13 | Jescie Lyons | 92721 | Jescie | Lyons | Jescie92721 |
4 | 70 | Lyle Cash | 92570 | Lyle | Cash | Lyle92570 |
Using the code below transposes the table:
dataframe = dataframe.transpose()
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
extract.age | 43 | 43 | 36 | 13 | 70 |
extract.patient_full_name | Gail Hill | Gail Hill | Rhoda Sweet | Jescie Lyons | Lyle Cash |
extract.patient_id | 94451 | 94451 | 92830 | 92721 | 92570 |
extract.patient_first_name | Gail | Gail | Rhoda | Jescie | Lyle |
extract.patient_last_name | Hill | Hill | Sweet | Lyons | Cash |
nameId | Gail94451 | Gail94451 | Rhoda92830 | Jescie92721 | Lyle92570 |