PANDAS micro course by www.Kaggle.com https://www.kaggle.com/learn/pandas

 

Creating, Reading and Writing

$ pip install pandas
$ python3
>>> import pandas as pd
>>> pd.DataFrame({'BOB':['Ist column of BOB','IInd col of BoB'],'SUSE':['Ist col suse','II col suse']})
                 BOB          SUSE
0  Ist column of BOB  Ist col suse
1    IInd col of BoB   II col suse
>>>pd.DataFrame({..same as above---},index=['Prouct a ',' Product b'])

 

>>> pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')

2015 Sales    30
2016 Sales    35
2017 Sales    40
Name: Product A, dtype: int64
>>> reviews=pd.read_csv('/mnt/d/wine.csv')
>>> reviews
show the contents of file wine.csv
>>> reviews.country
>>> reviews['country']   to see only country column
>>> reviews.iloc[0] return only one row

>>> review.iloc[5:10,[3,5]]]

>>> review.loc[review.country=='Spain']


reviews.loc[reviews.country == 'Italy']
Out[16]:
countrydescriptiondesignationpointspriceprovinceregion_1region_2taster_nametaster_twitter_handletitlevarietywinery
0ItalyAromas include tropical fruit, broom, brimston...Vulkà Bianco87NaNSicily & SardiniaEtnaNaNKerin O’Keefe@kerinokeefeNicosia 2013 Vulkà Bianco (Etna)White BlendNicosia
6ItalyHere's a bright, informal red that opens with ...Belsito8716.0Sicily & SardiniaVittoriaNaNKerin O’Keefe@kerinokeefeTerre di Giurfo 2013 Belsito Frappato (Vittoria)FrappatoTerre di Giurfo
..........................................
129961Italy
reviews.loc[(reviews.country == 'Italy') & (reviews.points >= 90)]

Note take care of Brackets like [(a==1) & (b==2) ]

reviews.loc[(reviews.country == 'Italy') | (reviews.points >= 90)]

reviews.loc[reviews.country.isin(['Italy', 'France'])]
Out[19]:
countrydescriptiondesignationpointspriceprovinceregion_1region_2taster_nametaster_twitter_handletitlevarietywinery
0ItalyAromas include tropical fruit, broom, brimston...Vulkà Bianco87NaNSicily & SardiniaEtnaNaNKerin O’Keefe@kerinokeefeNicosia 2013 Vulkà Bianco (Etna)White BlendNicosia
6ItalyHere's a bright, informal red that opens with ...Belsito8716.0Sicily & SardiniaVittoriaNaNKerin O’Keefe@kerinokeefeTerre di Giurfo 2013 Belsito Frappato (Vittoria)FrappatoTerre di Giurfo
..........................................
129969FranceA dry style of Pinot Gris, this is crisp with ...NaN9032.0AlsaceAlsaceNaNRoger Voss@vossrogerDomaine Marcel Deiss 2012 Pinot Gris (Alsace)Pinot GrisDomaine Marcel Deiss
129970FranceBig, rich and off-dry, this is powered by inte..
reviews.loc[reviews.price.notnull()]

Assigning data

Going the other way, assigning data to a DataFrame is easy. You can assign either a constant value:

In [21]:
reviews['critic'] = 'everyone'
reviews['critic']
Out[21]:
0         everyone
1         everyone
            ...   
129969    everyone
129970    everyone
Make new column Disc = price*.9
review['disc']=review.price*.9
review[['price','disc']]


After doing modification in review file you may save it to another other file  or same with command as under
>>> review.to_csv('my.csv')  this will create new file name my.csv
Other overwrite file with same name
>>> review.to_csv('wine.csv')

df = pd.DataFrame({'country':['Italy','Portugal','US','US'],\
'provice':['Sicily & Sardina','Douro','California','New York'],\
'region_1':['Enta','NaN','Napa Valley','Finger Lakes'],\
'region_2':['NaN','NaN','Napa','Finger Lakes']},index=[0,1,10,100])

df =reviews.loc[[0,1,10,100],\
['country','province','region_1','region_2']]

countryprovinceregion_1region_2
0ItalySicily & SardiniaEtnaNaN
1PortugalDouroNaNNaN
10USCaliforniaNapa ValleyNapa
100USNew YorkFinger LakesFinger Lakes
df = reviews.loc[0:99,['country','variety']]

>>> r.country.value_counts()







Summary Functions and Maps

>>> reviews.points.describe()


Maps

map is a term, borrowed from mathematics, for a function that takes one set of values and "maps" them to another set of values. In data science we often have a need for creating new representations from existing data, or for transforming data from the format it is in now to the format that we want it to be in later. Maps are what handle this work, making them extremely important for getting your work done!

=================================
def remean_points(row):
    row.points = row.points - review_points_mean
    return row

reviews.apply(remean_points, axis='columns')

axis is either 0 or 1
1 means 'columns' 
axis = 1 is same as axis='columns'
axis = 0 is ame as axis='rows'


row = each row of review
Always first parameter is taken from commad line like
review.points.apply(function)
Here the value of each points is row
row = points (value)

================
def func1(row):
    row.country='PAKISTAN ZINDAHBAD'
    row.variety='BAKWAS'
    row.winery='HARAM FILL ISLAM'
    return row

>>> reviews.apply(func1,axis='columns')
        OR
>>>reviews.apply(func1, axis=1)



-----------------------Exersice No5--
I'm an economical wine buyer. Which wine is the "best bargain"? Create a variable bargain_wine with the title of the wine with the highest points-to-price ratio in the dataset.

Hints to use idxmax() it will return record of Max value

>>> bargain_wine = reviews.loc[(reviews.points/reviews.price) .idxmax(), 'title']
In other simple way by using two lines

Solution:

bargain_idx = (reviews.points / reviews.price).idxmax()
bargain_wine = reviews.loc[bargain_idx, 'title']

-------------------------Ex 6
There are only so many words you can use when describing a bottle of wine. Is a wine more likely to be "tropical" or "fruity"? Create a Series descriptor_counts counting how many times each of these two words appears in the description column in the dataset. (For simplicity, let's ignore the capitalized versions of these words.)

Hint: Use a map to check each description for the string tropical, then count up the number of times this is True. Repeat this for fruity. Finally, create a Series combining the two values.

Solution:

n_trop = reviews.description.map(lambda desc: "tropical" in desc).sum()
n_fruity = reviews.description.map(lambda desc: "fruity" in desc).sum()
descriptor_counts = pd.Series([n_trop, n_fruity], index=['tropical', 'fruity'])
output 
tropical    3607
fruity      9090
dtype: int64

-------------------------------Ex 7---------------------------------------------

We'd like to host these wine reviews on our website, but a rating system ranging from 80 to 100 points is too hard to understand - we'd like to translate them into simple star ratings. A score of 95 or higher counts as 3 stars, a score of at least 85 but less than 95 is 2 stars. Any other score is 1 star.

Also, the Canadian Vintners Association bought a lot of ads on the site, so any wines from Canada should automatically get 3 stars, regardless of points.

Create a series star_ratings with the number of stars corresponding to each review in the dataset.


Hint: Begin by writing a custom function that accepts a row from the DataFrame as input and returns the star rating corresponding to the row. Then, use DataFrame.apply to apply the custom function to every row in the dataset.

def stars(row):
    if row.country == 'Canada':
        return 3
    if row.points >= 95:
        return 3
    if row.points >= 85:
        return 2
    return 1
star_ratings = reviews.apply(stars,axis=1)
axis=1 on columns and axis=0 means processed on ROWs

Grouping and Sorting

groupby() operation
points
80     397
81     692
      ... 
99      33
100     19
Name: points, Length: 21, dtype: int64
-------------------------------------------
 
 how we would pick out the best wine by country and province:
reviews.groupby(['country', 'province']).apply(lambda df: df.loc[df.points.idxmax()])


Another groupby() method worth mentioning is agg(), which lets you run a bunch of different functions on your DataFrame simultaneously. For example, we can generate a simple statistical summary of the dataset as follows:
reviews.groupby(['country']).price.agg([len, min, max])


                len         min max
country
Argentina 3800 4.0 230.0
Armenia         2         14.0 15.0
... ... ... ...
Ukraine 14 6.0 13.0
Uruguay 109 10.0 130.0
43 rows × 3 columns
Effective use of groupby() will allow you to do lots of really powerful things with your dataset.

reviews.groupby(['country', 'province']).description.agg([len])


Multiple Index
countries_reviewed = reviews.groupby(['country', 'province']).description.agg([len])
countries_reviewed
mi = countries_reviewed.index
mi




Renaming and Combining

>>> reviews.rename(columns={'points': 'score'})

countrydescriptiondesignationscorepriceprovinceregion_1region_2taster_nametaster_twitter_handletitlevarietywinery
0ItalyAromas include tropical fruit, broom, brimston...Vulkà Bianco87NaNSicily & SardiniaEtnaNaNKerin O’Keefe@kerinokeefeNicosia 2013 Vulkà Bianco (Etna)White BlendNicosia
1PortugalThis is ripe and fruity, a wine that is smooth...Avidagos8715.0DouroNaNNaNRoger Voss@vossrogerQuinta dos Avidagos 2011 Avidagos Red (Douro)Portuguese RedQuinta dos Avidagos
..........................................
>>> r.rename(columns={'points':'scrore','country':'mulk'})


>>> r.rename(index={0:'ONE',2:'THREE'})


reviews.rename_axis("wines", axis='rows').rename_axis("fields", axis='columns')
Out[4]:
fieldscountrydescriptiondesignationpointspriceprovinceregion_1region_2taster_nametaster_twitter_handletitlevarietywinery
wines
0ItalyAromas include tropical fruit, broom, brimston...Vulkà Bianco87NaNSicily & SardiniaEtnaNaN
pd.concat command to concate two files
>>> r=pd.read_csv('eg.csv')
>>> r1=pd.read_csv('eg2.csv')
>>> pd.concat([r,r2]
out put list r listing then with same colulum r1 listing



canadian_youtube = pd.read_csv("../input/youtube-new/CAvideos.csv")
british_youtube = pd.read_csv("../input/youtube-new/GBvideos.csv")

pd.concat([canadian_youtube, british_youtube])


left = canadian_youtube.set_index(['title', 'trending_date'])
right = british_youtube.set_index(['title', 'trending_date'])

left.join(right, lsuffix='_CAN', rsuffix='_UK')

left = canadian_youtube.set_index(['title', 'trending_date'])
right = british_youtube.set_index(['title', 'trending_date'])

left.join(right, lsuffix='_CAN', rsuffix='_UK')
======================================

powerlifting_meets = pd.read_csv("../input/powerlifting-database/meets.csv")
powerlifting_competitors = pd.read_csv("../input/powerlifting-database/openpowerlifting.csv")

Both tables include references to a MeetID, a unique key for each meet (competition) included in the database. Using this, generate a dataset combining the two tables into one.

Solution:

powerlifting_combined = powerlifting_meets.set_index("MeetID").join(powerlifting_competitors.set_index("MeetID"))



\














Comments

Popular posts from this blog

Course No 2 Using Python to Interact with the Operating System Rough Notes

Introduction to Git and GitHub https://www.coursera.org/learn/introduction-git-github/