Saturday, July 24, 2010

The Invisible Web

The Deep Web (also called Deepnet, the invisible Web, dark Web or the hidden Web) refers to World Wide Web content that is not part of the Surface Web, which is indexed by standard search engines.
To discover content on the Web, search engines use web crawlers that follow hyperlinks. This technique is ideal for discovering resources on the surface Web but is often ineffective at finding deep Web resources. For example, these crawlers do not attempt to find dynamic pages that are the result of database queries, because the number of possible queries is effectively infinite. It has been noted that this can be (partially) overcome by providing links to query results, but this could unintentionally inflate the popularity of a member of the deep Web.[1]
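The link-following technique described above can be sketched as a tiny breadth-first crawl over an in-memory "site" (the URLs and HTML below are made up purely for illustration): any page that nothing links to, such as a dynamically generated query result, is never discovered.

```python
# Minimal sketch of crawler link-following (not a production crawler).
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags, the way a crawler discovers pages."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(pages, start):
    """Breadth-first walk over a dict of {url: html}. Pages not linked
    from any reachable page (e.g. results behind a search form) stay invisible."""
    seen, queue = set(), [start]
    while queue:
        url = queue.pop(0)
        if url in seen or url not in pages:
            continue
        seen.add(url)
        parser = LinkExtractor()
        parser.feed(pages[url])
        queue.extend(parser.links)
    return seen

site = {
    "/": '<a href="/about.html">About</a>',
    "/about.html": "<p>No further links</p>",
    "/search?q=deep+web": "<p>Dynamic result page - nothing links here</p>",
}
print(crawl(site, "/"))  # the dynamic query page is never reached
```

The "/search?q=..." page exists on the server, but since no crawled page links to it, it stays in the deep Web from the crawler's point of view.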

Google, the largest search database on the planet, currently has around eight billion web pages indexed. That's a lot of information. But it's nothing compared to what else is out there. Google can only index the visible web, or searchable web. But the invisible web, or deep web, is estimated to be 500 times bigger than the searchable web. The invisible web comprises databases and results of specialty search engines that the popular search engines simply are not able to index.[2]

The visible Web is easy to define. It's made up of HTML Web pages that the search engines have chosen to include in their indices. The Invisible Web is much harder to define and classify for several reasons.
Many Invisible Web sites are made up of straightforward Web pages that search engines could easily crawl and add to their indices, but do not, simply because the engines have decided against including them. Part of the Invisible Web is hidden because search engines have deliberately chosen to exclude some types of Web content. These resources simply cannot be found using general-purpose search engines because they have been effectively locked out.[3]

Invisible web resources can be classified into three broad categories:

1) Non-text files such as PDF files, multimedia, graphics, executable files, CGI scripts, software, and some other document files.

2) Information contained in databases, real-time content and dynamically generated content.

3) Disconnected pages, which are pages that exist on the web but do not have any other pages linking to them.[4]

As more people gain access to information on the Web and more content is continuously added, it is important to know how to evaluate Web sites to determine if the information is reliable.

Educators also rely on educational web sites to work technically in the classroom. Misinformation and technical difficulties can cause a great deal of distress not only for the students but for the teachers as well. For this reason, audience, credibility, accuracy, objectivity, coverage, and currency are the major issues educators should focus on when examining the content of educational web sites. Aesthetic and visual appeal, navigation, and accessibility are the major issues educators should focus on when examining the technical aspects of educational web sites.[5]

[1] Deep Web-From Wikipedia, the free encyclopedia

[2] Research Beyond Google

[3] The Invisible Web: Uncovering Sources Search Engines Can't See

[4] The Invisible Web Explained

[5] Criteria for evaluating Educational Web Sites

Related articles selected by Andreea Loffler

Google is Cracking the “Invisible Web”

Five criteria for evaluating Web pages

Introduction to Copyscape Plagiarism Checker

Protect Against and Check for Internet Plagiarism


Internet plagiarism is on the rise ...  This means that if you have created a great website with loads of useful information, there is a good chance that someone will copy that material, or duplicate the entire website and republish it as their own. This duplication of internet content is a frequent problem that can damage a website's reputation with its visitors (i.e., consumers) as well as with internet search engines. There is no excuse for taking someone else's web content without giving them credit. The plagiarism begins with failing to cite the work of an author; to make matters worse, the offender will then try to disguise the stolen content with cosmetic changes such as adding pictures, paraphrasing, or reformatting it.

Unfortunately, the World Wide Web (W3) is so massive that there would seem to be no defense against your web content being stolen. If you have a website that is popular and has useful information, it's probably smart to register that web content for a federal copyright. Accomplishing this task is the first step in taking a proactive approach to protecting your website content against intentional spam or internet plagiarism. Having a federal copyright will allow legal recourse against plagiarism if things cannot be resolved or if things get really ugly.

How to Get a Federal Copyright

To protect your website content against plagiarism, file for a website copyright, enforceable throughout the United States, with the U.S. Copyright Office. The U.S. Library of Congress manages the U.S. Copyright Office, and it considers websites to be software programs. Once the website has been registered, placing a valid copyright notice on it requires the following:
  • © – the universal copyright symbol, recognized internationally
  • Year – the year the website content was copyrighted
  • Author name – the name of the owner of the website content

Example: © 2010 Jeff Rooney of Thomas Jefferson School of Law

Duplication of Internet Content

The duplication of content is a serious problem on the internet. At one level, you have what are called "clueless newbies" who take someone else's website content but don't realize that they've done anything wrong.[1] At a more sophisticated level, there are computer robots copying website content so that it is duplicated on more than one website. When a search engine such as Google scours the internet and its spiders index the content of websites into its web servers (or databases), it tries to detect whether or not a website is a copy of another website anywhere on the internet. A spider will try to determine which website is the original or true version and which websites are duplicates -- it may or may not be accurate.
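How a spider might flag two pages as duplicates can be illustrated with word "shingles" (overlapping word n-grams) and a Jaccard similarity score. This is only a hypothetical sketch of the idea; real search engines use far more elaborate, proprietary methods, and the sample sentences below are invented for the example.

```python
# Hypothetical duplicate-content check via word shingles (n-grams).
def shingles(text, n=3):
    """Return the set of overlapping n-word tuples in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def similarity(a, b, n=3):
    """Jaccard similarity of two documents' shingle sets: 1.0 = identical."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

original = "the deep web refers to content not indexed by standard search engines"
copy = "the deep web refers to content not indexed by standard search engines"
rewrite = "search engines cannot index every page on the world wide web today"

print(similarity(original, copy))     # 1.0 - exact duplicate
print(similarity(original, rewrite))  # 0.0 - no shared three-word runs
```

A scraped page scores near 1.0 against its source even after small edits, while independently written text scores near 0; a spider applying something like this can filter the copy out of its results.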

Although there is no penalty for duplicate content detected by the spiders, the filtering of duplicate content could hurt a website's ranking, which could mean a loss of business revenue captured from that website -- a major consequence when a business website loses its ranking on a search engine.

Researchers looking for reputable websites face another concern, namely the question: who should get the credit for the website content? A researcher or a website owner needs to determine whether or not the web content they have accessed has been stolen; this is an imperative step. How can this be done?

To do this, use the free or premium services offered by Copyscape, which will report the websites that have been scraped and show the offending URLs in the results report. Most importantly -- from the perspective of a researcher -- it will show where the website content originated; this helps avoid citation errors in research papers quoting content from a specific website.

Definition: Scrapers are people who send a computer robot to your website to copy (or "scrape") the entire site and then republish it as their own; this is a copyright violation and is also theft.[2]

[1] Urban Dictionary
[2] Wikipedia

Monday, July 19, 2010

Analyzing Websites


A researcher using search engines to scour the internet for information requires that it be credible, accurate, and reliable -- in other words, it must be "trustworthy". How do you determine if a web site's information is trustworthy?

In the search engine world, cheating is known as spam. Spam involves deliberately building webpages that try to trick a search engine into offering inappropriate, redundant, or poor-quality search results. It's not only unethical, but can also get a researcher in trouble.[1]

A researcher can evaluate a web search or a web site by asking some general questions that can become progressively more elaborate. There is no standardized "checklist" for determining whether or not a web site is trustworthy, because the checklist needs to be created according to a researcher's technical knowledge (basic to advanced). Therefore, I created a short intermediate checklist that includes the following out-of-the-gate items:

Does the content appear reliable?
  • What information does the web site provide? 
  • Are sources documented or cited?  
  • Are there links to more information?
Are there comments about the web site? 
  • Does the web site have related links?
What's in the URL? 
  • Domain name - the name, mapped to one or more Internet Protocol (IP) addresses, used to identify the location of a website and its pages. For example, a Google URL has the domain name "google".[2]
  • Directory name - the name of a folder that holds a specific category of data for a website. For example, an Amazon URL with the directory name products would hold information relating to Amazon products such as books, articles, etc.[3]
  • Sub-directory name - the name of a folder within a directory, used to break a category down into its parts. For example, a URL could have the sub-directory name books.[4]
  • File - the name of the file that holds the desired information. For example, a URL ending in URL_defined.html has a file name responsible for holding the information in a Hypertext Markup Language (HTML) file.[5]
What is the purpose of the website?
  • Is the information subjective or objective?
  • Is it a blog, RSS feed, or a Wiki web page?
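The URL anatomy in the checklist above can be pulled apart with Python's standard urllib.parse module. The URL here is a made-up example (example.com), reusing the products/books/URL_defined.html parts mentioned earlier:

```python
# Splitting a hypothetical URL into domain, directory, sub-directory, and file.
from urllib.parse import urlparse

url = "http://www.example.com/products/books/URL_defined.html"
parts = urlparse(url)

host = parts.netloc                          # network location, contains the domain name
segments = parts.path.strip("/").split("/")  # path pieces between the slashes

print(host)         # www.example.com
print(segments[0])  # products          (directory)
print(segments[1])  # books             (sub-directory)
print(segments[2])  # URL_defined.html  (file)
```

Reading a URL this way before clicking it is a quick first check: the domain tells you who is publishing, and the path tells you where in their site the page lives.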
Sources for Evaluating a Website - the sources below are commonly used for analyzing a website and determining whether or not its information is trustworthy. This is an important step in validating internet information that will be used for a research project.
  • Berkeley University -- Website Evaluation Checklist

This checklist is a good resource for beginners learning how and what to think about when analyzing websites.
  • Alexa, The Web Information Company (an Amazon company) 

This website offers a free service that analyzes and assesses the demand for a particular website. The Alexa product provides valuable tools and reports on a website's ranking, traffic stats, audience types, contact information, reviews, related links, click-streams, and back-links; I like the Wayback Machine, which shows a time-line of when the website originated and its revision dates.
  • Copyscape Plagiarism Checker

This website offers free and premium services for checking whether other websites have plagiarized your content.

[1] Bruce Clay, "What the Engines Think is Spam," Search Engine Optimization
[2] Webopedia
[3] Id
[4] Id
[5] Webopedia