Creating, Reading and Writing

$ pip install pandas

$ python3

>>> import pandas as pd

>>> pd.DataFrame({'BOB':['Ist column of BOB','IInd col of BoB'],'SUSE':['Ist col suse','II col suse']})

BOB SUSE

0 Ist column of BOB Ist col suse

1 IInd col of BoB II col suse

>>>pd.DataFrame({..same as above---},index=['Prouct a ',' Product b'])

>>> pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')

2015 Sales 30

2016 Sales 35

2017 Sales 40

Name: Product A, dtype: int64

>>> reviews=pd.read_csv('/mnt/d/wine.csv')

>>> reviews

show the contents of file wine.csv

>>> reviews.country

>>> reviews['country'] to see only country column

>>> reviews.iloc[0] return only one row

>>> review.iloc[5:10,[3,5]]]

>>> review.loc[review.country=='Spain']

reviews.loc[reviews.country == 'Italy']

Out[16]:
countrydescriptiondesignationpointspriceprovinceregion_1region_2taster_nametaster_twitter_handletitlevarietywinery
0ItalyAromas include tropical fruit, broom, brimston...Vulkà Bianco87NaNSicily & SardiniaEtnaNaNKerin O’Keefe@kerinokeefeNicosia 2013 Vulkà Bianco (Etna)White BlendNicosia
6ItalyHere's a bright, informal red that opens with ...Belsito8716.0Sicily & SardiniaVittoriaNaNKerin O’Keefe@kerinokeefeTerre di Giurfo 2013 Belsito Frappato (Vittoria)FrappatoTerre di Giurfo
..........................................
129961Italy

reviews.loc[(reviews.country == 'Italy') & (reviews.points >= 90)]

Note take care of Brackets like [(a==1) & (b==2) ]

reviews.loc[(reviews.country == 'Italy') | (reviews.points >= 90)]

reviews.loc[reviews.country.isin(['Italy', 'France'])]

Out[19]:
countrydescriptiondesignationpointspriceprovinceregion_1region_2taster_nametaster_twitter_handletitlevarietywinery
0ItalyAromas include tropical fruit, broom, brimston...Vulkà Bianco87NaNSicily & SardiniaEtnaNaNKerin O’Keefe@kerinokeefeNicosia 2013 Vulkà Bianco (Etna)White BlendNicosia
6ItalyHere's a bright, informal red that opens with ...Belsito8716.0Sicily & SardiniaVittoriaNaNKerin O’Keefe@kerinokeefeTerre di Giurfo 2013 Belsito Frappato (Vittoria)FrappatoTerre di Giurfo
..........................................
129969FranceA dry style of Pinot Gris, this is crisp with ...NaN9032.0AlsaceAlsaceNaNRoger Voss@vossrogerDomaine Marcel Deiss 2012 Pinot Gris (Alsace)Pinot GrisDomaine Marcel Deiss
129970FranceBig, rich and off-dry, this is powered by inte..

reviews.loc[reviews.price.notnull()]

Assigning data

Going the other way, assigning data to a DataFrame is easy. You can assign either a constant value:

In [21]:
reviews['critic'] = 'everyone'
reviews['critic']
Out[21]:
0         everyone
1         everyone
            ...   
129969    everyone
129970    everyone

Make new column Disc = price*.9

review['disc']=review.price*.9

review[['price','disc']]

After doing modification in review file you may save it to another other file or same with command as under

>>> review.to_csv('my.csv') this will create new file name my.csv

Other overwrite file with same name

>>> review.to_csv('wine.csv')

df = pd.DataFrame({'country':['Italy','Portugal','US','US'],\

'provice':['Sicily & Sardina','Douro','California','New York'],\

'region_1':['Enta','NaN','Napa Valley','Finger Lakes'],\

'region_2':['NaN','NaN','Napa','Finger Lakes']},index=[0,1,10,100])

df =reviews.loc[[0,1,10,100],\

['country','province','region_1','region_2']]

	country	province	region_1	region_2
0	Italy	Sicily & Sardinia	Etna	NaN
1	Portugal	Douro	NaN	NaN
10	US	California	Napa Valley	Napa
100	US	New York	Finger Lakes	Finger Lakes

df = reviews.loc[0:99,['country','variety']]

>>> r.country.value_counts()

Summary Functions and Maps

>>> reviews.points.describe()

Maps

A map is a term, borrowed from mathematics, for a function that takes one set of values and "maps" them to another set of values. In data science we often have a need for creating new representations from existing data, or for transforming data from the format it is in now to the format that we want it to be in later. Maps are what handle this work, making them extremely important for getting your work done!

=================================

def remean_points(row):

row.points = row.points - review_points_mean

return row

reviews.apply(remean_points, axis='columns')

axis is either 0 or 1

1 means 'columns'

axis = 1 is same as axis='columns'

axis = 0 is ame as axis='rows'

row = each row of review

Always first parameter is taken from commad line like

review.points.apply(function)

Here the value of each points is row

row = points (value)

================

def func1(row):

row.country='PAKISTAN ZINDAHBAD'

row.variety='BAKWAS'

row.winery='HARAM FILL ISLAM'

return row

>>> reviews.apply(func1,axis='columns')

>>>reviews.apply(func1, axis=1)

-----------------------Exersice No5--

I'm an economical wine buyer. Which wine is the "best bargain"? Create a variable bargain_wine with the title of the wine with the highest points-to-price ratio in the dataset.

Hints to use idxmax() it will return record of Max value

>>> bargain_wine = reviews.loc[(reviews.points/reviews.price) .idxmax(), 'title']

In other simple way by using two lines

Solution:

bargain_idx = (reviews.points / reviews.price).idxmax()
bargain_wine = reviews.loc[bargain_idx, 'title']

-------------------------Ex 6

There are only so many words you can use when describing a bottle of wine. Is a wine more likely to be "tropical" or "fruity"? Create a Series descriptor_counts counting how many times each of these two words appears in the description column in the dataset. (For simplicity, let's ignore the capitalized versions of these words.)

Hint: Use a map to check each description for the string tropical, then count up the number of times this is True. Repeat this for fruity. Finally, create a Series combining the two values.

Solution:

n_trop = reviews.description.map(lambda desc: "tropical" in desc).sum()
n_fruity = reviews.description.map(lambda desc: "fruity" in desc).sum()
descriptor_counts = pd.Series([n_trop, n_fruity], index=['tropical', 'fruity'])

output

tropical    3607
fruity      9090
dtype: int64

-------------------------------Ex 7---------------------------------------------

We'd like to host these wine reviews on our website, but a rating system ranging from 80 to 100 points is too hard to understand - we'd like to translate them into simple star ratings. A score of 95 or higher counts as 3 stars, a score of at least 85 but less than 95 is 2 stars. Any other score is 1 star.

Also, the Canadian Vintners Association bought a lot of ads on the site, so any wines from Canada should automatically get 3 stars, regardless of points.

Create a series star_ratings with the number of stars corresponding to each review in the dataset.

Hint: Begin by writing a custom function that accepts a row from the DataFrame as input and returns the star rating corresponding to the row. Then, use DataFrame.apply to apply the custom function to every row in the dataset.

def stars(row):

if row.country == 'Canada':

return 3

if row.points >= 95:

return 3

if row.points >= 85:

return 2

return 1

star_ratings = reviews.apply(stars,axis=1)

axis=1 on columns and axis=0 means processed on ROWs

Grouping and Sorting

groupby() operation

reviews.groupby('points').points.count()

points
80     397
81     692
      ... 
99      33
100     19
Name: points, Length: 21, dtype: int64

-------------------------------------------

how we would pick out the best wine by country and province:

reviews.groupby(['country', 'province']).apply(lambda df: df.loc[df.points.idxmax()])

Another groupby() method worth mentioning is agg(), which lets you run a bunch of different functions on your DataFrame simultaneously. For example, we can generate a simple statistical summary of the dataset as follows:

reviews.groupby(['country']).price.agg([len, min, max])

len min max

country

Argentina 3800 4.0 230.0

Armenia 2 14.0 15.0

... ... ... ...

Ukraine 14 6.0 13.0

Uruguay 109 10.0 130.0

43 rows × 3 columns

Effective use of groupby() will allow you to do lots of really powerful things with your dataset.

reviews.groupby(['country', 'province']).description.agg([len])

Multiple Index

countries_reviewed = reviews.groupby(['country', 'province']).description.agg([len])
countries_reviewed

mi = countries_reviewed.index
mi

Renaming and Combining

>>> reviews.rename(columns={'points': 'score'})

country	description	designation	score	price	province	region_1	region_2	taster_name	taster_twitter_handle	title	variety	winery
0	Italy	Aromas include tropical fruit, broom, brimston...	Vulkà Bianco	87	NaN	Sicily & Sardinia	Etna	NaN	Kerin O’Keefe	@kerinokeefe	Nicosia 2013 Vulkà Bianco (Etna)	White Blend	Nicosia
1	Portugal	This is ripe and fruity, a wine that is smooth...	Avidagos	87	15.0	Douro	NaN	NaN	Roger Voss	@vossroger	Quinta dos Avidagos 2011 Avidagos Red (Douro)	Portuguese Red	Quinta dos Avidagos
...	...	...	...	...	...	...	...	...	...	...	...	...	...

>>> r.rename(columns={'points':'scrore','country':'mulk'})

>>> r.rename(index={0:'ONE',2:'THREE'})

reviews.rename_axis("wines", axis='rows').rename_axis("fields", axis='columns')

Out[4]:
fieldscountrydescriptiondesignationpointspriceprovinceregion_1region_2taster_nametaster_twitter_handletitlevarietywinery
wines
0ItalyAromas include tropical fruit, broom, brimston...Vulkà Bianco87NaNSicily & SardiniaEtnaNaN

pd.concat command to concate two files

>>> r=pd.read_csv('eg.csv')

>>> r1=pd.read_csv('eg2.csv')

>>> pd.concat([r,r2]

out put list r listing then with same colulum r1 listing

canadian_youtube = pd.read_csv("../input/youtube-new/CAvideos.csv")
british_youtube = pd.read_csv("../input/youtube-new/GBvideos.csv")

pd.concat([canadian_youtube, british_youtube])

left = canadian_youtube.set_index(['title', 'trending_date'])
right = british_youtube.set_index(['title', 'trending_date'])

left.join(right, lsuffix='_CAN', rsuffix='_UK')

left = canadian_youtube.set_index(['title', 'trending_date'])
right = british_youtube.set_index(['title', 'trending_date'])

left.join(right, lsuffix='_CAN', rsuffix='_UK')

======================================

powerlifting_meets = pd.read_csv("../input/powerlifting-database/meets.csv")

powerlifting_competitors = pd.read_csv("../input/powerlifting-database/openpowerlifting.csv")

Both tables include references to a MeetID, a unique key for each meet (competition) included in the database. Using this, generate a dataset combining the two tables into one.

Solution:

powerlifting_combined = powerlifting_meets.set_index("MeetID").join(powerlifting_competitors.set_index("MeetID"))

Search This Blog

My Python Rough Notes

PANDAS micro course by www.Kaggle.com https://www.kaggle.com/learn/pandas

Creating, Reading and Writing

Assigning data

9.

Summary Functions and Maps

Maps

Grouping and Sorting

Comments

Post a Comment

Popular posts from this blog

Course No 2 Using Python to Interact with the Operating System Rough Notes

Introduction to Git and GitHub https://www.coursera.org/learn/introduction-git-github/

fields	country	description	designation	points	price	province	region_1	region_2	taster_name	taster_twitter_handle	title	variety	winery
wines
0	Italy	Aromas include tropical fruit, broom, brimston...	Vulkà Bianco	87	NaN	Sicily & Sardinia	Etna	NaN