
AN EFFICIENT APPROACH TO COMMENT SPAM IDENTIFICATION

By Craig Burns, 2014-01-26 10:12

    Vol.26 No.5 JOURNAL OF ELECTRONICS (CHINA) September 2009


    Yang Yuhang  Zhao Tiejun  Zheng Dequan  Yu Hao

    (MOE-MS Key Laboratory of Natural Language Processing and Speech, Harbin Institute of Technology, Harbin 150001, China)

    Abstract  This paper proposes a novel approach to comment spam identification based on content analysis. Three main features, namely the number of links, content repetitiveness, and text similarity, are used for comment spam identification. In practice, content repetitiveness is determined by the length and frequency of the longest common substring. Furthermore, text similarity is calculated using the vector space model. The precisions of preliminary experiments on comment spam identification conducted on Chinese and English are as high as 93% and 82% respectively. The results show the validity and language independence of this approach. Compared with conventional spam filtering approaches, our method requires no training, no rule sets and no link relationships. The proposed approach can also deal with new comments as well as existing comments.

    Keywords  Comment spam; Automatic identification; Content analysis; Blog

    CLC index  TP391.1; TP391.3

    DOI  10.1007/s11767-007-0115-z

    I. Introduction

    As more and more people rely on search engines as starting points to fulfill their need for information, it has become critically important to have one's page rank among the top few results of popular search engines[.]. In order to have their pages rank higher than they deserve, some web designers put up web spam, which is essentially garbage on the web, for the sole purpose of misleading search engines. Web spamming tricks can be divided into term spamming and link spamming. Term spamming tries to enhance the relevance score by sticking or plagiarizing posts and stuffing keywords, while link spamming attempts to artificially inflate page rank by using link dumps that contribute to a link farm. Compared with term spamming, link spamming is more widely used and its influence is more serious. Conventional link spamming tricks include outgoing links such as directory cloning, and incoming links such as creating a honey pot, infiltrating a web directory, link exchange, expired domains and creating one's own spam farm[1].

    Manuscript received date: June 27, 2007; revised date: August 24, 2008.

    Supported by the National Natural Science Foundation of China (No.60736044, 60803094).

    Communication author: Yang Yuhang, born in 1983, male, Ph.D. candidate. Room 611, New Technology Building, Box 321, Harbin Institute of Technology, Harbin 150001, China.

    Email: yhyang@mtlab.hit.edu.cn.

    Recently, a new kind of web spam, comment spam, has been used by more and more spammers and has become a crucial problem. Comment spam is essentially link spam originating from comments and responses added to web pages which support dynamic user editing[2]. Comment spamming is much easier than conventional tricks: instead of setting up complex webs of pages linking to the spam page, the spammer writes a simple agent that visits random blog and wiki pages, posting comments that link back to the spam page. Because it is so much easier to use, comment spamming has become one of the most popular spam tricks, especially in the blogosphere. Commercial engines are seeking new solutions to this problem[.], and the amount of research concerning link spam is increasing[.]. However, existing techniques suffer from problems such as a large amount of manual participation, unstable performance and non-universality.

    In this paper, we put forward a novel approach to comment spam identification based on content analysis to overcome these problems. Compared with previous studies, our main contributions in this paper are: (1) We analyze spammers' motivation and behavior comprehensively and integrate these features into our identification model. (2) Our approach requires no training, no rule set and no global knowledge of the link framework. (3) Our approach is independent of languages and web applications, such as blog, wiki and BBS. (4) Our approach can be used not only for offline comments, but also for online identification.

    This paper is organized as follows. Section II describes the related work. Section III presents the approaches to comment spam identification and filtering. Experimental results are illustrated and analyzed in Section IV. Finally, Section V gives a summary.

    II. Related Work

    With the burst of comment spam and its serious influence, comment spam identification has become a crucial task and has attracted more and more attention. Previous research on comment spam identification can be divided into manual methods and automatic content-based methods.

    1. Manual identification

    Most existing methods for preventing comment spam are based on the release mechanism of the blog host, such as requiring registration before commenting, preventing HTML in comments, and preventing comments on old blog posts[.]. These methods make some contribution to comment spam control, but all of them make it difficult to obtain feedback and affect the way people blog. Besides, these methods can only deal with new comments but cannot deal with the numerous existing comments. Preventing spam based on IP address[8] is a commonly used technique, but it requires constant maintenance, and spammers can conceal their intent by using proxies and spoofed legitimate IP addresses. In 2005, some search engines including Yahoo, MSN Search and Google announced that they had collaborated with blogging software vendors and hosts to prevent comment spam using the "rel=nofollow" attribute added to hypertext links[.]. However, this approach also brought many problems, including disturbing the valid link framework and being abused by webmasters.

    Seungyeop Han, et al. argued that automatic methods were not efficient enough to distinguish spams in the blogosphere. They proposed a collaborative approach whose key idea is to rely on manual identification of spams and to share this information about spams through a network of trust. However, it was difficult to attract enough users to participate. Besides, each user's mistake might have global influence.

    2. Automatic content-based identification

    Various automatic content-based methods for preventing spam have been proposed, mostly focusing on email spam. Some content-based methods for comment spam identification work by analyzing the content of comments, and possibly also the contents of pages linked by the comments (e.g., Ref.[11]). Most of these techniques are currently based on keyword or regular expression detection. These methods require manual participation for constructing and maintaining sets of keywords or expressions, which leads to many problems such as being time consuming, low recall and rule conflict, like other manual approaches.

    Pranam Kolari, et al. proposed a Support Vector Machine (SVM)[12] based model for splog (spam blog) detection[.], and characterized splogs by comparing them against authentic blogs. This study facilitated detecting and weeding out comment spams. However, this approach required much training data and the global link framework, and could not be applied for online detection.

    Kazuyuki Narisawa, et al. proposed a spam detection method based on Zipf's law and the vocabulary size, which was the number of substrings whose frequencies are the same. This method was developed from the substring amplification algorithm[.,17]. However, this method still relied heavily on frequency, and it was hard to distinguish spams from valid comments with high frequency. Besides, the experiments of this study were conducted on artificial data instead of comment spams from the real blogosphere.

    G. Mishne, et al. presented a method for detecting comment spam in blogs by comparing the language models used in the blog post and the comment. This approach required no training and no knowledge of complete web connectivity. However, this method faced a data sparseness problem caused by the many short comments in the real blogosphere, and it could not deal with generated comment spams which were similar to valid ones.


    In a word, previous research mainly suffered from three problems. First of all, manual identification methods, and even most semi-automatic methods, required a large amount of manual participation. The second problem was unstable performance, which was mainly caused by the single feature used in these methods. Thirdly, most methods could only handle either existing comments or new comments. This paper presents a novel approach to overcome these problems. Compared with the previous studies, our approach requires no manual participation and no training, and is independent of language and web application. This approach can also be used for both offline comments and real-time identification.

    III. Methodology

    1. Feature description

    A reasonable assumption is that the identification accuracy increases with the number of useful features used. Based on a comprehensive analysis of spammers' motivation and behavior, some crucial features are identified and integrated in the proposed approach to preventing comment spam.

    (1) Number of outlinks  There might be some outlinks pointing to irrelevant pages in comment spams. The motivation of spammers is to artificially inflate page rank by using link dumps that contribute to a link farm. For this purpose, spammers always put comments which include links pointing to their own pages on blogs or wikis.

    (2) Content repetitiveness  Comment spams may occur repeatedly, either partially or totally. Spammers always copy the same comment spam to different pages to inflate page rank in a short time. Therefore, the occurrences of comments are crucial for spam detection.

    (3) Text similarity  The content similarity between the comment spam and the blog post is low. Valid comments can be seen as conversations related to the topic of the blog post, while comment spams are nothing but an artificial trick for higher ranking scores, so they are not similar in content to the posts.

    As described before, a comment with a large number of outlinks, high content repetitiveness and low similarity with the post is more likely to be a spam. These features are used for comment spam identification, which will be described in the following subsection.
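As a rough illustration of the first feature, the number of outlinks in a comment could be counted as in the sketch below. The paper does not say how links are extracted, so the helper name `count_outlinks`, the two regular expressions, and the choice to count distinct targets rather than every occurrence are all assumptions made here for illustration.

```python
import re

# Assumption: a link is either the href value of an HTML anchor
# or a bare http(s) URL appearing in the comment text.
_HREF = re.compile(r'href\s*=\s*["\']?([^"\'\s>]+)', re.IGNORECASE)
_BARE_URL = re.compile(r'https?://[^\s"\'<>]+', re.IGNORECASE)


def count_outlinks(comment: str) -> int:
    """Return the number of distinct link targets found in a comment.

    Hypothetical helper, not taken from the paper.
    """
    found = _HREF.findall(comment) + _BARE_URL.findall(comment)
    # Strip trailing sentence punctuation so "http://a.example." and
    # "http://a.example" are treated as the same target.
    targets = {url.rstrip(".,;:!?)") for url in found}
    targets.discard("")
    return len(targets)
```

Counting distinct targets is only one possible design choice; counting every occurrence would penalize a comment that repeats the same link several times.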

    2. Comment spam identification

    The approach to comment spam identification is shown in Fig.1. For a given comment, the number of links is summed, the repeat score is calculated, and the similarity between the comment and the post is calculated by using VSM (Vector Space Model) for feature selection and presentation. After each comment has been processed and scored, the individual scores of the different features are standardized and combined to form an aggregate comment score. The comment is considered a spam if its score is higher than an algorithm threshold.
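The similarity and combination steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the tokenizer, the use of raw term frequencies as VSM weights, the default weights, and the way a low similarity is turned into a high score (one minus the standardized similarity) are assumptions of this sketch, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text: str) -> Counter:
    # Assumption: lowercase word tokens weighted by raw term frequency.
    # Unsegmented languages such as Chinese would need a different tokenizer.
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(comment: str, post: str) -> float:
    """Cosine of the angle between the two term-frequency vectors."""
    a, b = term_vector(comment), term_vector(post)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def comment_value(n_link, repeat, sim, max_n_link, max_repeat, max_sim,
                  alpha=0.4, beta=0.3, gamma=0.3):
    """Weighted sum of standardized feature scores (weights are hypothetical)."""
    def scaled(x, maximum):
        return x / maximum if maximum else 0.0

    # Assumption: low similarity should raise the score, so the
    # standardized similarity is subtracted from one.
    return (alpha * scaled(n_link, max_n_link)
            + beta * scaled(repeat, max_repeat)
            + gamma * (1.0 - scaled(sim, max_sim)))
```

A comment would then be flagged when `comment_value(...)` exceeds a threshold chosen experimentally; no particular threshold value is implied here.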

    [Figure 1: flowchart with the steps "Summing the amount of links", "Calculating repeat score", "Calculating text similarity" and "Feature selection and presentation", followed by a threshold test ("higher?", "Yes").]

    Fig.1 System flow of comment spam identification

    As analyzed before, the more links a comment contains, the more times the comment occurs, and the lower the similarity between the comment and the post is, the more likely the comment is to be a spam. The value of the comment $V(C)$ is calculated by Eq.(1), and $C$ is considered a spam if $V(C)$ is higher than an algorithm threshold.

    $$V(C) = \alpha \frac{n_{link}(C)}{max_{nlink}} + \beta \frac{r(C)}{max_{r}} + \gamma \left(1 - \frac{sim(C,P)}{max_{sim}}\right) \qquad (1)$$

    where $n_{link}(C)$ is the number of links in the comment, $r(C)$ is the repeat score, and $sim(C,P)$ is the content similarity between the comment and the post. $max_{nlink}$, $max_{r}$ and $max_{sim}$ are the maximums of the different scores, which are used as standard factors. $\alpha$, $\beta$ and $\gamma$ ($\alpha+\beta+\gamma=1$) are algorithm parameters to be determined experimentally.

    (1) Repeat score calculation

    The repeat score is proportional to the length of the Longest Common Substring (LCS) and its repeat times. The main idea of repeat score calculation is to find the LCS among different comments and sum the number of occurrences. $r(C)$ is calculated by multiplying the length of the LCS by its number of occurrences. The detail is shown as Algorithm 1.

    Algorithm 1 Repeat score calculation

    Begin
        For each C_i ∈ N
            For each C_j ∈ N
                Find the LCS q between C_i and C_j
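The listing above breaks off after the LCS step, so the following is only a sketch of one way to realize the stated idea (length of the LCS multiplied by its number of occurrences). How the longest match is chosen among several comments and how occurrences are counted are assumptions here, and the function names are hypothetical.

```python
from typing import Sequence


def longest_common_substring(a: str, b: str) -> str:
    """Longest contiguous substring shared by a and b (dynamic programming)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at a[i-2], b[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def repeat_score(index: int, comments: Sequence[str]) -> int:
    """Length of the longest substring shared with any other comment,
    multiplied by the number of comments that contain it.

    Assumption: occurrences are counted once per comment, including
    the comment being scored.
    """
    target = comments[index]
    best = ""
    for j, other in enumerate(comments):
        if j == index:
            continue
        q = longest_common_substring(target, other)
        if len(q) > len(best):
            best = q
    if not best:
        return 0
    occurrences = sum(1 for c in comments if best in c)
    return len(best) * occurrences
```

The dynamic-programming LCS runs in time proportional to the product of the two comment lengths; a suffix-based structure would scale better on large collections, but that choice is outside what the text specifies.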
