Reading and Writing Data in Pandas
Pandas provides multiple functions to read files in several formats.
CSV
Pandas easily reads files in CSV (comma separated values) format. The separator does not have to be a comma, but anything else must be specified through the sep
keyword argument.
Suppose we have a file weather.txt
containing weather data over a year for one site. It is in comma-separated form with exactly one line of column headers.
wdata=pd.read_csv('weather.txt')
The read_csv
function stores the column headers as the column names and the rest of the data as the columns in a dataframe object. To see the data we would type
wdata
The read_csv
method has many optional arguments. Some important ones are
sep=
Column separator character. Default is commaheader=
Number of header lines. Default is 1. Specify header=None if no headernames=
List of column names if no header. Should be provided if header=None. If there is a header, can be used to rename the columns, but then header=0 should be given.skiprows=
Number of rows to skip before importing the data.usecols=
List of columns to import, if not all are to be read.
There are many other options; see the documentation.
Excel
The
read_excel
method can read files stored in Excel format (.xls, .xlsx, and similar).
my_data=pd.read_excel('weather.xlsx')
It has fewer options because an Excel file includes more information about formatting. Commonly-used ones include
header=
Number of header lines. Default is 1. Specify header=None if no headernames=
List of column names if no header. Should be provided if header=None. If there is a header, can be used to rename the columns, but then header=0 should be given.usecols=
List of columns to import, if not all are to be readsheet_name=
Can specify a string for a sheet name, an integer for the sheet number, counting from 0. A lists of strings or integers will request multiple sheets. Specify None for all available worksheets. Default is 0 (first sheet only).
Writing CSV and Excel Files
A dataframe can be written to a CSV file with to_csv
outfile='new_data.csv'
my_new_df.to_csv(outfile)
If the Dataframe is to be only one Excel worksheet, a similar method to_excel
can be used.
my_new_df.to_excel("newfile.xlsx")
However, if more than one worksheet is needed, an ExcelWriter object must be created. The following example is from the documentation.
df1 = pd.DataFrame([["AAA", "BBB"]], columns=["Spam", "Egg"])
df2 = pd.DataFrame([["ABC", "XYZ"]], columns=["Foo", "Bar"])
with pd.ExcelWriter("path_to_file.xlsx") as writer:
df1.to_excel(writer, sheet_name="Sheet1")
df2.to_excel(writer, sheet_name="Sheet2")
HDF5 and NetCDF
HDF (Hierarchical Data Format) and NetCDF are self-describing, cross-platform, binary data formats that are widely used in a number of scientific disciplines, particularly earth sciences. Pandas can also directly import HDF5 files, using Pytables.
my_data=pd.read_hdf("weather.hdf")
Pandas dataframes are strictly two-dimensional objects. The
Xarray package provides an extension of the Pandas dataframe to more than two dimensions.
Since Pandas 0.20, Xarray is the recommended package to manage higher-dimensional data, replacing the Pandas Panel
data structure.
Xarray can read NetCDF files directly if the netCDF4-Python is installed. It can also read HDF5 files via h5netcdf.
import xarray as xr
data=xr.open_dataset("mydata.hdf", engine="h5netcdf")
ncdata=xr.open_dataset("myfile.nc")
HDF and NetCDF data is often in multiple files. Xarray can merge them.
alldata=xr.open_mfdataset("climate/*nc",parallel=True)
To write a NetCDF file:
ds.to_netcdf("newdata.nc")
Exercise Download the weather.csv file. Read the data into a dataframe. Summarize the data. Look at the first 20 lines. Print the columns. Change the names of “Data.Temperature.Avg Temp”, “Data.Temperature.Max Temp” and “Data.Temperature.Min Temp” to “Data.AvgTemp”, “Data.MaxTemp”, and “Data.MinTemp” respectively. Print the mean of the average temperature.