Data Warehousing and Data Mining (Part 2)
Database System Concepts - 6th Edition, ©Silberschatz, Korth and Sudarshan

Chapter 20: Data Analysis
- Decision Support Systems
- Data Warehousing
- Data Mining
- Classification
- Association Rules
- Clustering

Decision Support Systems
- Decision-support systems are used to make business decisions, often based on data collected by on-line transaction-processing systems.
- Examples of business decisions:
  - What items to stock?
  - What insurance premium to charge?
  - To whom to send advertisements?
- Examples of data used for making decisions:
  - Retail sales transaction details
  - Customer profiles (income, age, gender, etc.)

Decision-Support Systems: Overview
- Data analysis tasks are simplified by specialized tools and SQL extensions
  - Example tasks
    - For each product category and each region, what were the total sales in the last quarter and how do they compare with the same quarter last year?
    - As above, for each product category and each customer category
- Statistical analysis packages (e.g., S++) can be interfaced with databases
  - Statistical analysis is a large field, but not covered here
- Data mining seeks to discover knowledge automatically in the form of statistical rules and patterns from large databases.
- A data warehouse archives information gathered from multiple sources, and stores it under a unified schema, at a single site.
  - Important for large businesses that generate data from multiple divisions, possibly at multiple sites
  - Data may also be purchased externally

Data Warehousing
- Data sources often store only current data, not historical data
- Corporate decision making requires a unified view of all organizational data, including historical data
- A data warehouse is a repository (archive) of information gathered from multiple sources,
  stored under a unified schema, at a single site
  - Greatly simplifies querying, permits study of historical trends
  - Shifts decision support query load away from transaction processing systems

[Figure: Data Warehousing]

Design Issues
- When and how to gather data
  - Source driven architecture: data sources transmit new information to warehouse, either continuously or periodically (e.g., at night)
  - Destination driven architecture: warehouse periodically requests new information from data sources
  - Keeping warehouse exactly synchronized with data sources (e.g., using two-phase commit) is too expensive
    - Usually OK to have slightly out-of-date data at warehouse
    - Data/updates are periodically downloaded from online transaction processing (OLTP) systems.
- What schema to use
  - Schema integration

More Warehouse Design Issues
- Data cleansing
  - E.g., correct mistakes in addresses (misspellings, zip code errors)
  - Merge address lists from different sources and purge duplicates
- How to propagate updates
  - Warehouse schema may be a (materialized) view of schema from data sources
- What data to summarize
  - Raw data may be too large to store on-line
  - Aggregate values (totals/subtotals) often suffice
  - Queries on raw data can often be transformed by query optimizer to use aggregate values

Warehouse Schemas
- Dimension values are usually encoded using small integers and mapped to full values via dimension tables
- Resultant schema is called a star schema
  - More complicated schema structures
    - Snowflake schema: multiple levels of dimension tables
    - Constellation: multiple fact tables

[Figure: Data Warehouse Schema]

Data Mining
- Data mining is the process of semi-automatically analyzing large databases to find useful patterns
- Prediction based on past history
  - Predict if a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, ..) and past history
  - Predict if a pattern of phone calling card usage is likely to be fraudulent
- Some examples of prediction mechanisms:
  - Classification
    - Given a new item whose class is unknown,
      predict to which class it belongs
  - Regression formulae
    - Given a set of mappings for an unknown function, predict the function result for a new parameter value

Data Mining (Cont.)
- Descriptive Patterns
  - Associations
    - Find books that are often bought by "similar" customers. If a new such customer buys one such book, suggest the others too.
  - Associations may be used as a first step in detecting causation
    - E.g., association between exposure to chemical X and cancer
  - Clusters
    - E.g., typhoid cases were clustered in an area surrounding a contaminated well
    - Detection of clusters remains important in detecting epidemics

Classification Rules
- Classification rules help assign new objects to classes.
  - E.g., given a new automobile insurance applicant, should he or she be classified as low risk, medium risk or high risk?
- Classification rules for above example could use a variety of data, such as educational level, salary, age, etc.
  - $\forall$ person P, P.degree = masters and P.income > 75,000 $\Rightarrow$ P.credit = excellent
  - $\forall$ person P, P.degree = bachelors and (P.income $\geq$ 25,000 and P.income $\leq$ 75,000) $\Rightarrow$ P.credit = good
- Rules are not necessarily exact: there may be some misclassifications
- Classification rules can be shown compactly as a decision tree.

[Figure: Decision Tree]

Construction of Decision Trees
- Training set: a data sample in which the classification is already known.
- Greedy top down generation of decision trees.
  - Each internal node of the tree partitions the data into groups based on a partitioning attribute, and a partitioning condition for the node
  - Leaf node:
    - all (or most) of the items at the node belong to the same class, or
    - all attributes have been considered, and no further partitioning is possible.

Best Splits
- Pick best attributes and conditions on which to partition
- The purity of a set $S$ of training instances can be measured quantitatively in several ways.
  - Notation: number of classes = $k$, number of instances = $|S|$, fraction of instances in class $i$ = $p_i$.
- The Gini measure of purity is defined as
  $$\mathrm{Gini}(S) = 1 - \sum_{i=1}^{k} p_i^2$$
  - When all instances are in a single class, the Gini value is 0
  - It reaches its maximum (of $1 - 1/k$) if each class has the same number of instances.
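As a minimal sketch of the Gini measure defined above, the following Python function computes Gini(S) from a list of class labels. The function name `gini`, the handling of an empty set, and the sample labels are assumptions made for illustration; they do not come from the slides.

```python
from collections import Counter


def gini(labels):
    """Gini measure: 1 minus the sum of squared class fractions p_i."""
    n = len(labels)
    if n == 0:
        return 0.0  # assumption: treat an empty set as pure
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


pure = ["low"] * 4                        # all instances in a single class
balanced = ["low", "medium", "high"] * 2  # k = 3 equally frequent classes

print(gini(pure))      # 0.0
print(gini(balanced))  # approximately 1 - 1/3
```

The two sample calls correspond to the two boundary cases on the slide: a single-class set gives 0, and equally sized classes give the maximum of 1 - 1/k.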

Best Splits (Cont.)
- Another measure of purity is the entropy measure, which is defined as
  $$\mathrm{entropy}(S) = -\sum_{i=1}^{k} p_i \log_2 p_i$$
- When a set $S$ is split into multiple sets $S_i$, $i = 1, 2, \ldots, r$, we can measure the purity of the resultant set of sets as:
  $$\mathrm{purity}(S_1, S_2, \ldots, S_r) = \sum_{i=1}^{r} \frac{|S_i|}{|S|}\, \mathrm{purity}(S_i)$$
- The information gain due to a particular split of $S$ into $S_i$, $i = 1, 2, \ldots, r$:
  $$\text{Information-gain}(S, \{S_1, S_2, \ldots, S_r\}) = \mathrm{purity}(S) - \mathrm{purity}(S_1, S_2, \ldots, S_r)$$

Best Splits (Cont.)
- Measure of "cost" of a split:
  $$\text{Information-content}(S, \{S_1, S_2, \ldots, S_r\}) = -\sum_{i=1}^{r} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}$$
- Information-gain ratio:
  $$\text{Information-gain ratio} = \frac{\text{Information-gain}(S, \{S_1, S_2, \ldots, S_r\})}{\text{Information-content}(S, \{S_1, S_2, \ldots, S_r\})}$$
- The best split is the one that gives the maximum information gain ratio

Finding Best Splits
- Categorical attributes (with no meaningful order):
  - Multi-way split, one child for each value
  - Binary split: try all possible breakup of values into two sets, and pick the best
- Continuous-valued attributes (can be sorted in a meaningful order)
  - Binary split:
    - Sort values, try each as a split point
      - E.g., if values are 1, 10, 15, 25, split at $\leq$ 1, $\leq$ 10, $\leq$
        15
    - Pick the value that gives best split
  - Multi-way split:
    - A series of binary splits on the same attribute has roughly equivalent effect

Decision-Tree Construction Algorithm

    Procedure GrowTree(S)
        Partition(S);

    Procedure Partition(S)
        if (purity(S) > δ_p or |S| < δ_s) then
            return;
        for each attribute A
            evaluate splits on attribute A;
        Use best split found (across all attributes) to partition S into S_1, S_2, …, S_r;
        for i = 1, 2, …, r
            Partition(S_i);

Other Types of Classifiers
- Neural net classifiers are studied in artificial intelligence and are not covered here
- Bayesian classifiers use Bayes theorem, which says
  $$p(c_j \mid d) = \frac{p(d \mid c_j)\, p(c_j)}{p(d)}$$
  where
  - $p(c_j \mid d)$ = probability of instance $d$ being in class $c_j$,
  - $p(d \mid c_j)$ = probability of generating instance $d$ given class $c_j$,
  - $p(c_j)$ = probability of occurrence of class $c_j$, and
  - $p(d)$ = probability of instance $d$ occurring

Naïve Bayesian Classifiers
- Bayesian classifiers require
  - computation of $p(d \mid c_j)$
  - precomputation of $p(c_j)$
  - $p(d)$ can be ignored since it is the same for all classes
- To simplify the task, naïve Bayesian classifiers assume attributes have independent distributions, and thereby estimate
  $$p(d \mid c_j) = p(d_1 \mid c_j) \cdot p(d_2 \mid c_j) \cdots p(d_n \mid c_j)$$
  - Each of the $p(d_i \mid c_j)$ can be estimated from a histogram on $d_i$ values for each class $c_j$
    - the histogram is computed from the training instances
  - Histograms on multiple attributes are more expensive to compute and store

Regression
- Regression deals with the prediction of a value, rather than a class.
  - Given values for a set of variables, $X_1, X_2, \ldots, X_n$, we wish to predict the value of a variable $Y$.
- One way is to infer coefficients $a_0, a_1, a_2, \ldots, a_n$ such that
  $$Y = a_0 + a_1 X_1 + a_2 X_2 + \cdots + a_n X_n$$
- Finding such a linear polynomial is called linear regression.
  - In general, the process of finding a curve that fits the data is also called curve fitting.
- The fit may only be approximate
  - because of noise in the data, or
  - because the relationship is not exactly a
    polynomial
- Regression aims to find coefficients that give the best possible fit.

Association Rules
- Retail shops are often interested in associations between different items that people buy.
  - Someone who buys bread is quite likely also to buy milk
  - A person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts.
- Associations information can be used in several ways.
  - E.g., when a customer buys a particular book, an online shop may suggest associated books.
- Association rules:
  - bread $\Rightarrow$ milk
  - DB-Concepts, OS-Concepts $\Rightarrow$ Networks
  - Left hand side: antecedent, right hand side: consequent
  - An association rule must have an associated population; the population consists of a set of instances
    - E.g., each transaction (sale) at a shop is an instance, and the set of all transactions is the population

Association Rules (Cont.)
- Rules have an associated support, as well as an associated confidence.
- Support is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule.
  - E.g., the support for the rule milk $\Rightarrow$ screwdrivers is low.
- Confidence is a measure of how often the consequent is true when the antecedent is true.
  - E.g., the rule bread $\Rightarrow$ milk has a confidence of 80 percent if 80 percent of the purchases that include bread also include milk.

Finding Association Rules
- We are generally only interested in association rules with reasonably high support (e.g., support of 2% or greater)
- Naïve algorithm
  - Consider all possible sets of relevant items.
  - For each set find its support (i.e., count how many transactions purchase all items in the set).
    - Large itemsets: sets with sufficiently high support
  - Use large itemsets to generate association rules.
    - From itemset $A$ generate the rule $A - \{b\} \Rightarrow b$ for each $b \in A$.
      - Support of rule = support($A$).
      - Confidence of rule = support($A$) / support($A - \{b\}$)

Finding Support
- Determine support of itemsets via a single pass on set of transactions
  - Large itemsets: sets with a high count at the end of the pass
- If memory is not enough to hold all counts for all itemsets, use multiple passes, considering only some itemsets in each pass.
- Optimization: Once an itemset is eliminated because its count (support) is too small, none
  of its supersets needs to be considered.
- The a priori technique to find large itemsets:
  - Pass 1: count support of all sets with just 1 item. Eliminate those items with low support
  - Pass $i$: candidates: every set of $i$ items such that all its $i-1$ item subsets are large
    - Count support of all candidates
    - Stop if there are no candidates

Other Types of Associations
- Basic association rules have several limitations
- Deviations from the expected probability are more interesting
  - E.g., if many people purchase bread, and many people purchase cereal, quite a few would be expected to purchase both
  - We are interested in positive as well as negative correlations between sets of items
    - Positive correlation: co-occurrence is higher than predicted
    - Negative correlation: co-occurrence is lower than predicted
- Sequence associations / correlations
  - E.g., whenever bonds go up, stock prices go down in 2 days
- Deviations from temporal patterns
  - E.g., deviation from a steady growth
  - E.g., sales of winter wear go down in summer
    - Not surprising, part of a known pattern.
    - Look for deviation from value predicted using past patterns

Clustering
- Clustering: Intuitively, finding clusters of points in the given data such that similar points lie in the same cluster
- Can be formalized using distance metrics in several ways
  - Group points into $k$ sets (for a given $k$) such that the average distance of points from the centroid of their assigned group is minimized
    - Centroid: point defined by taking average of coordinates in each dimension.
  - Another metric: minimize average distance between every pair of points in a cluster
- Has been studied extensively in statistics, but on small data sets
  - Data mining systems aim at clustering techniques that can handle very large data sets
  - E.g., the Birch clustering algorithm (more shortly)

Hierarchical Clustering
- Example from biological classification
  - (the word classification here does not mean a prediction mechanism)
  - chordata
    - mammalia: leopards, humans
    - reptilia: snakes, crocodiles
- Other examples: Internet directory systems (e.g., Yahoo, more on this later)
- Agglomerative clustering algorithms
  - Build small clusters, then cluster small clusters into
    bigger clusters, and so on
- Divisive clustering algorithms
  - Start with all items in a single cluster, repeatedly refine (break) clusters into smaller ones

Clustering Algorithms
- Clustering algorithms have been designed to handle very large datasets
- E.g., the Birch algorithm
  - Main idea: use an in-memory R-tree to store points that are being clustered
  - Insert points one at a time into the R-tree, merging a new point with an existing cluster if it is less than some $\delta$ distance away
  - If there are more leaf nodes than fit in memory, merge existing clusters that are close to each other
  - At the end of first pass we get a large number of clusters at the leaves of the R-tree
    - Merge clusters to reduce the number of clusters

Collaborative Filtering
- Goal: predict what movies/books/… a person may be interested in, on the basis of
  - Past preferences of the person
  - Other people with similar past preferences
  - The preferences of such people for a new movie/book/…
- One approach based on repeated clustering
  - Cluster people on the basis of preferences for movies
  - Then cluster movies on the basis of being liked by the same clusters of people
  - Again cluster people based on their preferences for (the newly created clusters of) movies
  - Repeat above till equilibrium
- Above problem is an instance of collaborative filtering, where users collaborate in the task of filtering information to find information of interest

Other Types of Mining
- Text mining: application of data mining to textual documents
  - cluster Web pages to find related pages
  - cluster pages a user has visited to organize their visit history
  - classify Web pages automatically into a Web directory
- Data visualization systems help users examine large volumes of data and detect patterns visually
  - Can visually encode large amounts of information on a single screen
  - Humans are very good at detecting visual patterns
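
The insertion step described under Clustering Algorithms (merge a new point into an existing cluster when it lies within some distance δ, otherwise start a new cluster) can be sketched roughly as follows. This is a simplified illustration, not the Birch algorithm itself: it keeps clusters in a flat Python list rather than in an in-memory tree, it measures distance from the point to each cluster centroid, and the names `insert_point` and `delta` as well as the sample points are assumptions made for the example.

```python
import math


def insert_point(clusters, point, delta):
    """Add point to the nearest cluster within delta, else start a new one.

    Each cluster is a dict holding running coordinate sums and a count,
    so its centroid (the per-dimension average) is cheap to recompute.
    """
    best, best_dist = None, None
    for cluster in clusters:
        centroid = [s / cluster["n"] for s in cluster["sum"]]
        dist = math.dist(centroid, point)
        if best_dist is None or dist < best_dist:
            best, best_dist = cluster, dist
    if best is not None and best_dist < delta:
        # Merge: fold the new point into the nearest cluster's summary.
        best["sum"] = [s + x for s, x in zip(best["sum"], point)]
        best["n"] += 1
    else:
        # No cluster is close enough: the point starts its own cluster.
        clusters.append({"sum": list(point), "n": 1})


clusters = []
for p in [(0.0, 0.0), (0.5, 0.0), (10.0, 10.0), (10.0, 10.5)]:
    insert_point(clusters, p, delta=2.0)

for c in clusters:
    print([s / c["n"] for s in c["sum"]], c["n"])
```

Storing only sums and counts per cluster, rather than every point, is what lets a single pass over a large data set stay within memory; the tree structure mentioned on the slide would mainly replace the linear scan over clusters used here.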